DOC: Overhaul documentation per Tom Deans' writing style
pan14001 committed May 8, 2019
1 parent f28c6b1 commit 477f870
Showing 1 changed file with 31 additions and 25 deletions.
56 changes: 31 additions & 25 deletions README.md
@@ -1,24 +1,24 @@
# GNU Parallel setup for SLURM
# GNU Parallel with SLURM

## Summary

There is a little bit of setup work to get GNU Parallel to work with
the SLURM scheduler. Namely one has to:
This repository has simple, real-world examples for running your code
in parallel; the approach works with any program or programming
language.

- Create an `*.sshloginfile` containing a list of hostnames
and CPU counts that have been assigned by SLURM.
- Export the environment, including the current directory.
It includes the `parallel_opts.sh` script to set up GNU Parallel with
the SLURM scheduler; namely, the script (see the sketch after the list
below):

The `parallel_opts.sh` takes care of both these job setup steps and
adds sensible default options to GNU parallel.
1. Creates an `*.sshloginfile` containing a list of hostnames and CPU
counts that have been assigned by SLURM.
2. Maintains the environment, including the current directory.
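
For illustration only, here is a minimal sketch of step 1. It is not
the repository's actual script; the file name and the assumption of a
uniform CPU count per node are mine.

``` sh
# Hypothetical sketch, not the repository's parallel_opts.sh.
# GNU Parallel's --sshloginfile accepts lines of the form "count/hostname".
sshloginfile="$SLURM_JOB_ID.sshloginfile"
scontrol show hostnames "$SLURM_JOB_NODELIST" |
    while read -r host; do
        # Assumes every node was allocated the same number of CPUs; a
        # real script would parse SLURM_JOB_CPUS_PER_NODE instead.
        echo "$SLURM_CPUS_ON_NODE/$host"
    done > "$sshloginfile"
```

GNU Parallel would then be pointed at that file with `--sshloginfile`,
and the environment and working directory carried along with options
such as `--env` and `--workdir`.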

## Usage

Clone this Git repository into your home directory:

``` sh
# From the command-line
cd # Go to home directory
cd # Go to your home directory
git clone https://github.uconn.edu/HPC/parallel-slurm.git
```
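
The collapsed portion of this Usage section (see the hunk header below)
invokes `parallel $parallel_opts ... YOUR_PROGRAM ...`. Purely as an
illustration, a submission script might look roughly like the
following; the sourcing line and the way `$parallel_opts` is set are
assumptions, not the repository's documented interface:

``` sh
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8

# Assumption: parallel_opts.sh prints options that we capture into a
# shell variable; check the repository's examples for the real usage.
parallel_opts=$(~/parallel-slurm/parallel_opts.sh)

# Run a placeholder command once per input; replace with YOUR_PROGRAM.
parallel $parallel_opts echo "processing task {}" ::: 1 2 3 4
```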

@@ -33,8 +33,8 @@ parallel $parallel_opts ... YOUR_PROGRAM ...

## Examples

See the `*.slurm` example files. Run each of them using `sbatch` as
explained below:
There are `*.slurm` files in the [examples/](examples/) directory that
are described below:

### Example 01: Hostname

@@ -62,19 +62,21 @@ cn327

### Example 02: Resumable

A typical problem that parallel tasks need to deal with is recovering
from failure. Tasks can fail when they hit the SLURM job time limit.
Or they can fail due to the stochastic nature of a simulation
intermittently not converging; in other words re-running the job can
produce success.
Parallel tasks often need to recover from failure. Tasks can fail
when they start late and are killed by the SLURM job time limit. Or
tasks can fail due to a simulation intermittently not converging. In
both cases, re-running the failed task can produce success.

This example shows how to automatically resume jobs and retry only
failed tasks. This works using the `--joblog` and `--resume` options
to GNU Parallel. Using `--resume` tells GNU Parallel to ignore
completed jobs. The joblog remembers that the work was complete and
does not needlessly re-run completed tasks. If for some reason you
need to re-run the a completed task you would need to delete the
*.joblog file.
This example shows how to automatically retry failed tasks. This
requires the `--joblog` and `--resume-failed` options of GNU Parallel.
The `--joblog` option records completed tasks, both successes and
failures, in a file. Using `--resume-failed` tells GNU Parallel to
ignore successful tasks but to continue running failed, incomplete,
and unattempted tasks.

If for some reason you need to re-run a successfully completed task,
delete its line from the joblog file, or delete the entire joblog file
to re-run everything.
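
As a hypothetical illustration of these two options (the script name
and joblog file name below are made up, not part of this repository):

``` sh
# First run: some tasks may fail; every task is recorded in the joblog.
parallel --joblog example.joblog --resume-failed ./simulate.sh ::: 1 2 3 4

# Re-running the identical command skips tasks already marked
# successful and retries failed, incomplete, and unattempted ones.
parallel --joblog example.joblog --resume-failed ./simulate.sh ::: 1 2 3 4
```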

To run the example:

@@ -87,8 +89,12 @@ touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```

The output shows that some tasks intermittently failing and some
succeeding. But always by the 5th job all of them succeed.
The output below, from our random number generator script, shows some
tasks intermittently failing and some succeeding. The example has been
intentionally set up so that 3 out of 4 tasks will fail. By the 5th
SLURM job submission all the tasks will succeed. This is why the last
job 2339010 had nothing to do and completed immediately in 0 seconds.

```
Started SLURM job 2339006
