DOC: Properly document the 2 examples
pan14001 committed May 7, 2019
1 parent 01744ce commit 46a9ad9
Showing 1 changed file with 67 additions and 11 deletions.
README.md
@@ -31,29 +31,85 @@ module load parallel
parallel $parallel_opts ... YOUR_PROGRAM ...
```

## Example
## Examples

See the `submit.slurm` example file. Run it using:
See the `*.slurm` example files. Run each of them using `sbatch` as
explained below:

### Example 01: Hostname

This minimal example simply outputs the compute node names in
`submit.out`.

``` sh
# From the command-line
cd ~/parallel-slurm
sbatch submit.slurm
cd ~/parallel-slurm/examples
sbatch 01-submit-hostname.slurm
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```

You should see the output of the compute node names in submit.out.
For example:
The last few lines of your output should show on which nodes your 5
CPUs were allocated and the `hostname` command was run; for example:

``` sh
# Inside your submit.out
cn328
cn327
cn327
cn328
cn327
```
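
To check how the tasks were spread across nodes, the hostnames can be tallied. The following is a minimal sketch, assuming the file holds only one hostname per line as in the sample above; the function name `count_tasks_per_node` is hypothetical and not part of this repository. If your `submit.out` contains other lines, filter it first (for example with `tail -n 5`).

```sh
# Hypothetical helper (not part of parallel-slurm): tally tasks per node.
# Assumes the file passed as $1 contains exactly one hostname per line.
count_tasks_per_node() {
    # sort groups identical hostnames, uniq -c counts each group,
    # awk reorders the columns to "hostname count"
    sort "$1" | uniq -c | awk '{ print $2, $1 }'
}

# Demo on a copy of the sample output shown above
sample=$(mktemp)
printf '%s\n' cn328 cn327 cn327 cn328 cn327 > "$sample"
count_tasks_per_node "$sample"
# prints:
# cn327 3
# cn328 2
rm -f "$sample"
```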

Note that if you resubmit the job you will not see any output. This
is because of the `--joblog` and `--resume` options; the job remembers
that the work was complete and does not needlessly re-run the program.
To re-run the program you would need to delete the *.joblog file.
### Example 02: Resumable

A typical problem that parallel tasks need to deal with is recovering
from failure. Tasks can fail when they hit the SLURM job time limit.
Or they can fail due to the stochastic nature of a simulation
intermittently not converging; in other words re-running the job can
produce success.

This example shows how to automatically resume jobs and retry only
failed tasks. This works using the `--joblog` and `--resume` options
to GNU Parallel. Using `--resume` tells GNU Parallel to skip tasks
that are already recorded as complete. The joblog records which tasks
have completed, so GNU Parallel does not needlessly re-run them. If
for some reason you need to re-run a completed task, you would need
to delete the `*.joblog` file.
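
The bookkeeping idea can be illustrated in plain shell. This is only a sketch of the concept, not GNU Parallel's implementation or its joblog format: `run_pending`, `run_task`, and `ATTEMPT` are hypothetical names, and the stand-in task fails deterministically rather than randomly. For real workloads, GNU Parallel's own options do this work; its manual also documents `--resume-failed`, which retries tasks recorded with a non-zero exit value.

```sh
# Stand-in task (hypothetical): odd-numbered tasks fail until ATTEMPT is 2 or more.
run_task() {
    [ $(( $1 % 2 )) -eq 0 ] || [ "${ATTEMPT:-1}" -ge 2 ]
}

# Run every task that is not yet recorded as successful in the log file.
# Usage: run_pending LOGFILE TASK...
run_pending() {
    logfile=$1
    shift
    touch "$logfile"
    for task in "$@"; do
        # Skip tasks already recorded as complete
        if grep -qxF "$task" "$logfile"; then
            continue
        fi
        echo "running task $task"
        # Record the task only when it exits with status 0
        if run_task "$task"; then
            echo "$task" >> "$logfile"
        fi
    done
}

# Demo: the first pass records tasks 2 and 4; the second pass runs only 1, 3 and 5.
demo_log=$(mktemp)
ATTEMPT=1
run_pending "$demo_log" 1 2 3 4 5
ATTEMPT=2
run_pending "$demo_log" 1 2 3 4 5
rm -f "$demo_log"
```

Deleting the log file makes every task eligible to run again, which mirrors why the joblog has to be removed before a completed task can be re-run.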

To run the example:

``` sh
# From the command-line
cd ~/parallel-slurm/examples
rm -f joblog submit.out
for i in {1..5}; do sbatch 02-submit-resumable.slurm; done
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```

The output shows some tasks intermittently failing and others
succeeding. In every case, all of them have succeeded by the 5th job.

```
Started SLURM job 2339006
Task 5 started (seed 2339006, random number 0) ... succeeded!
Task 1 started (seed 2339006, random number 1) ... failed!
Task 2 started (seed 2339006, random number 2) ... failed!
Task 3 started (seed 2339006, random number 3) ... failed!
Task 4 started (seed 2339006, random number 4) ... succeeded!
Completed SLURM job 2339006 in 00:00:05
Started SLURM job 2339007
Task 3 started (seed 2339007, random number 1) ... failed!
Task 1 started (seed 2339007, random number 2) ... failed!
Task 2 started (seed 2339007, random number 4) ... succeeded!
Completed SLURM job 2339007 in 00:00:05
Started SLURM job 2339008
Task 1 started (seed 2339008, random number 3) ... failed!
Task 3 started (seed 2339008, random number 4) ... succeeded!
Completed SLURM job 2339008 in 00:00:05
Started SLURM job 2339009
Task 1 started (seed 2339009, random number 4) ... succeeded!
Completed SLURM job 2339009 in 00:00:04
Started SLURM job 2339010
Completed SLURM job 2339010 in 00:00:00
```
