From 46a9ad9e86b46540a8e49da48889076dee1741d1 Mon Sep 17 00:00:00 2001
From: Pariksheet Nanda
Date: Tue, 7 May 2019 13:36:33 -0400
Subject: [PATCH] DOC: Properly document the 2 examples

---
 README.md | 78 +++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 67 insertions(+), 11 deletions(-)

diff --git a/README.md b/README.md
index 197ba36..668441b 100644
--- a/README.md
+++ b/README.md
@@ -31,21 +31,28 @@ module load parallel
 parallel $parallel_opts ... YOUR_PROGRAM ...
 ```
 
-## Example
+## Examples
 
-See the `submit.slurm` example file. Run it using:
+See the `*.slurm` example files. Run each of them using `sbatch` as
+explained below:
+
+### Example 01: Hostname
+
+This minimal example simply outputs the compute node names in
+`submit.out`.
 
 ``` sh
 # From the command-line
-cd ~/parallel-slurm
-sbatch submit.slurm
+cd ~/parallel-slurm/examples
+sbatch 01-submit-hostname.slurm
+touch submit.out && tail -f submit.out
+# Hit Ctrl+C to exit
 ```
 
-You should see the output of the compute node names in submit.out.
-For example:
+The last few lines of your output should show on which nodes your 5
+CPUs were allocated and the `hostname` command was run; for example:
 
 ``` sh
-# Inside your submit.out
 cn328
 cn327
 cn327
@@ -53,7 +60,56 @@ cn328
 cn327
 ```
 
-Note that if you resubmit the job you will not see any output. This
-is because of the `--joblog` and `--resume` options; the job remembers
-that the work was complete and does not needlessly re-run the program.
-To re-run the program you would need to delete the *.joblog file.
+### Example 02: Resumable
+
+A typical problem that parallel tasks need to deal with is recovering
+from failure. Tasks can fail when they hit the SLURM job time limit.
+Or they can fail due to the stochastic nature of a simulation
+intermittently not converging; in other words re-running the job can
+produce success.
+
+This example shows how to automatically resume jobs and retry only
+failed tasks. This works using the `--joblog` and `--resume` options
+to GNU Parallel. Using `--resume` tells GNU Parallel to ignore
+completed jobs. The joblog remembers that the work was complete and
+does not needlessly re-run completed tasks. If for some reason you
+need to re-run a completed task you would need to delete the
+*.joblog file.
+
+To run the example:
+
+``` sh
+# From the command-line
+cd ~/parallel-slurm/examples
+rm -f joblog submit.out
+for i in {1..5}; do sbatch 02-submit-resumable.slurm; done
+touch submit.out && tail -f submit.out
+# Hit Ctrl+C to exit
+```
+
+The output shows some tasks intermittently failing and some
+succeeding. But always by the 5th job all of them succeed.
+
+```
+Started SLURM job 2339006
+Task 5 started (seed 2339006, random number 0) ... succeeded!
+Task 1 started (seed 2339006, random number 1) ... failed!
+Task 2 started (seed 2339006, random number 2) ... failed!
+Task 3 started (seed 2339006, random number 3) ... failed!
+Task 4 started (seed 2339006, random number 4) ... succeeded!
+Completed SLURM job 2339006 in 00:00:05
+Started SLURM job 2339007
+Task 3 started (seed 2339007, random number 1) ... failed!
+Task 1 started (seed 2339007, random number 2) ... failed!
+Task 2 started (seed 2339007, random number 4) ... succeeded!
+Completed SLURM job 2339007 in 00:00:05
+Started SLURM job 2339008
+Task 1 started (seed 2339008, random number 3) ... failed!
+Task 3 started (seed 2339008, random number 4) ... succeeded!
+Completed SLURM job 2339008 in 00:00:05
+Started SLURM job 2339009
+Task 1 started (seed 2339009, random number 4) ... succeeded!
+Completed SLURM job 2339009 in 00:00:04
+Started SLURM job 2339010
+Completed SLURM job 2339010 in 00:00:00
+```
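
The resume behaviour that Example 02 describes (skip tasks already recorded as complete, retry the rest) can be illustrated in plain shell. This is only a sketch of the idea, not GNU Parallel's implementation or its joblog format: `run_pending`, `run_task`, and the `PASS` variable are hypothetical names made up for this illustration, and the "work" is a stand-in that succeeds once enough passes have been made.

```sh
#!/bin/sh
# Illustration only: a hand-rolled completion log, NOT GNU Parallel's joblog.

# Hypothetical stand-in for real work: task N succeeds once PASS >= N,
# mimicking tasks that fail on early submissions and succeed later.
run_task() {
    [ "$PASS" -ge "$1" ]
}

# run_pending LOGFILE TASK...
# Runs every task that is not yet listed in LOGFILE and records only
# the tasks that succeed, so a later pass retries just the failures.
run_pending() {
    log=$1
    shift
    touch "$log"
    for task in "$@"; do
        # Skip tasks that an earlier pass already completed.
        if grep -qxF -e "$task" "$log"; then
            continue
        fi
        if run_task "$task"; then
            echo "$task" >> "$log"
            echo "pass $PASS: task $task succeeded"
        else
            echo "pass $PASS: task $task failed"
        fi
    done
}

# Demo: three passes over the same task list, sharing one log file.
demo_log=$(mktemp)
for PASS in 1 2 3; do
    run_pending "$demo_log" 1 2 3
done
rm -f "$demo_log"
```

Deleting the log file between passes makes every task run again, which mirrors the README's advice about deleting the joblog to force a re-run.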