From 477f8707069029e953a8e819819b18d755955fa2 Mon Sep 17 00:00:00 2001
From: Pariksheet Nanda
Date: Wed, 8 May 2019 10:57:27 -0400
Subject: [PATCH] DOC: Overhaul documentation per Tom Deans' writing style

---
 README.md | 56 ++++++++++++++++++++++++++++++-------------------------
 1 file changed, 31 insertions(+), 25 deletions(-)

diff --git a/README.md b/README.md
index 668441b..fc07abb 100644
--- a/README.md
+++ b/README.md
@@ -1,16 +1,16 @@
-# GNU Parallel setup for SLURM
+# GNU Parallel with SLURM
 
 ## Summary
 
-There is a little bit of setup work to get GNU Parallel to work with
-the SLURM scheduler. Namely one has to:
+This repository has simple, real-world examples to run your code in
+parallel and works with any program or programming language.
 
-- Create an `*.sshloginfile` containing a list of hostnames
-  and CPU counts that have been assigned by SLURM.
-- Export the environment, including the current directory.
+It includes the `parallel_opts.sh` script to set up GNU Parallel with
+the SLURM scheduler; namely the script:
 
-The `parallel_opts.sh` takes care of both these job setup steps and
-adds sensible default options to GNU parallel.
+1. Creates an `*.sshloginfile` containing a list of hostnames and CPU
+   counts that have been assigned by SLURM.
+2. Maintains the environment, including the current directory.
 
 ## Usage
 
@@ -18,7 +18,7 @@ Clone this Git repository into your home directory:
 
 ``` sh
 # From the command-line
-cd # Go to home directory
+cd # Go to your home directory
 git clone https://github.uconn.edu/HPC/parallel-slurm.git
 ```
 
@@ -33,8 +33,8 @@ parallel $parallel_opts ... YOUR_PROGRAM ...
 
 ## Examples
 
-See the `*.slurm` example files. Run each of them using `sbatch` as
-explained below:
+There are `*.slurm` files in the [examples/](examples/) directory that
+are described below:
 
 ### Example 01: Hostname
 
@@ -62,19 +62,21 @@ cn327
 
 ### Example 02: Resumable
 
-A typical problem that parallel tasks need to deal with is recovering
-from failure. Tasks can fail when they hit the SLURM job time limit.
-Or they can fail due to the stochastic nature of a simulation
-intermittently not converging; in other words re-running the job can
-produce success.
+Parallel tasks often need to recover from failure. Tasks can fail
+when they start late and are killed by the SLURM job time limit. Or
+tasks can fail due to a simulation intermittently not converging. In
+both cases, re-running the failed task can produce success.
 
-This example shows how to automatically resume jobs and retry only
-failed tasks. This works using the `--joblog` and `--resume` options
-to GNU Parallel. Using `--resume` tells GNU Parallel to ignore
-completed jobs. The joblog remembers that the work was complete and
-does not needlessly re-run completed tasks. If for some reason you
-need to re-run the a completed task you would need to delete the
-*.joblog file.
+This example shows how to automatically retry failed tasks. This
+requires the `--joblog` and `--resume-failed` options of GNU Parallel.
+The `--joblog` records completed tasks in a file - both successes and
+failures. Using `--resume-failed` tells GNU Parallel to ignore
+successful tasks but to continue running failed, incomplete, and
+unattempted tasks.
+
+If for some reason you need to re-run any successful completed tasks,
+you would need to delete the relevant line in the joblog file, or delete
+the entire joblog file to re-run everything.
 
 To run the example:
 
@@ -87,8 +89,12 @@ touch submit.out && tail -f submit.out
 # Hit Ctrl+C to exit
 ```
 
-The output shows that some tasks intermittently failing and some
-succeeding. But always by the 5th job all of them succeed.
+The output below shows some tasks intermittently failing and some
+succeeding from our random number generator script. This example has
+been intentionally set up so that 3 out of 4 tasks will fail. By the
+5th SLURM job submission all the tasks will succeed. This is why the
+last job 2339010 had nothing to do and completed immediately in 0
+seconds.
 
 ```
 Started SLURM job 2339006