DOC: Overhaul documentation per Tom Deans' writing style
pan14001 committed May 8, 2019
1 parent f28c6b1 commit 477f870
Showing 1 changed file with 31 additions and 25 deletions.
56 changes: 31 additions & 25 deletions README.md
@@ -1,24 +1,24 @@
# GNU Parallel setup for SLURM
# GNU Parallel with SLURM

## Summary

There is a little bit of setup work to get GNU Parallel to work with
the SLURM scheduler. Namely one has to:
This repository has simple, real-world examples for running your code
in parallel; the approach works with any program or programming
language.

- Create an `*.sshloginfile` containing a list of hostnames
and CPU counts that have been assigned by SLURM.
- Export the environment, including the current directory.
It includes the `parallel_opts.sh` script to set up GNU Parallel with
the SLURM scheduler; namely, the script (see the sketch after the list
below):

The `parallel_opts.sh` takes care of both these job setup steps and
adds sensible default options to GNU parallel.
1. Creates an `*.sshloginfile` containing a list of hostnames and CPU
counts that have been assigned by SLURM.
2. Maintains the environment, including the current directory.
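
For illustration only, here is a minimal sketch of step 1. It is not
the repository's actual script; the file name and the assumption of a
uniform CPU count per node are mine.

``` sh
# Hypothetical sketch, not the repository's parallel_opts.sh.
# GNU Parallel's --sshloginfile accepts lines of the form "count/hostname".
sshloginfile="$SLURM_JOB_ID.sshloginfile"
scontrol show hostnames "$SLURM_JOB_NODELIST" |
    while read -r host; do
        # Assumes every node was allocated the same number of CPUs; a
        # real script would parse SLURM_JOB_CPUS_PER_NODE instead.
        echo "$SLURM_CPUS_ON_NODE/$host"
    done > "$sshloginfile"
```

GNU Parallel would then be pointed at that file with `--sshloginfile`,
and the environment and working directory carried along with options
such as `--env` and `--workdir`.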

## Usage

Clone this Git repository into your home directory:

``` sh
# From the command-line
cd # Go to home directory
cd # Go to your home directory
git clone https://github.uconn.edu/HPC/parallel-slurm.git
```
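
The collapsed portion of this Usage section (see the hunk header below)
invokes `parallel $parallel_opts ... YOUR_PROGRAM ...`. Purely as an
illustration, a submission script might look roughly like the
following; the sourcing line and the way `$parallel_opts` is set are
assumptions, not the repository's documented interface:

``` sh
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=8

# Assumption: parallel_opts.sh prints options that we capture into a
# shell variable; check the repository's examples for the real usage.
parallel_opts=$(~/parallel-slurm/parallel_opts.sh)

# Run a placeholder command once per input; replace with YOUR_PROGRAM.
parallel $parallel_opts echo "processing task {}" ::: 1 2 3 4
```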

@@ -33,8 +33,8 @@ parallel $parallel_opts ... YOUR_PROGRAM ...

## Examples

See the `*.slurm` example files. Run each of them using `sbatch` as
explained below:
There are `*.slurm` files in the [examples/](examples/) directory that
are described below:

### Example 01: Hostname

@@ -62,19 +62,21 @@ cn327

### Example 02: Resumable

A typical problem that parallel tasks need to deal with is recovering
from failure. Tasks can fail when they hit the SLURM job time limit.
Or they can fail due to the stochastic nature of a simulation
intermittently not converging; in other words re-running the job can
produce success.
Parallel tasks often need to recover from failure. Tasks can fail
when they start late and are killed by the SLURM job time limit. Or
tasks can fail due to a simulation intermittently not converging. In
both cases, re-running the failed task can produce success.

This example shows how to automatically resume jobs and retry only
failed tasks. This works using the `--joblog` and `--resume` options
to GNU Parallel. Using `--resume` tells GNU Parallel to ignore
completed jobs. The joblog remembers that the work was complete and
does not needlessly re-run completed tasks. If for some reason you
need to re-run the a completed task you would need to delete the
*.joblog file.
This example shows how to automatically retry failed tasks. This
requires the `--joblog` and `--resume-failed` options of GNU Parallel.
The `--joblog` option records completed tasks, both successes and
failures, in a file. Using `--resume-failed` tells GNU Parallel to
ignore successful tasks but to continue running failed, incomplete,
and unattempted tasks.

If for some reason you need to re-run a successfully completed task,
delete its line from the joblog file, or delete the entire joblog file
to re-run everything.
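
As a hypothetical illustration of these two options (the script name
and joblog file name below are made up, not part of this repository):

``` sh
# First run: some tasks may fail; every task is recorded in the joblog.
parallel --joblog example.joblog --resume-failed ./simulate.sh ::: 1 2 3 4

# Re-running the identical command skips tasks already marked
# successful and retries failed, incomplete, and unattempted ones.
parallel --joblog example.joblog --resume-failed ./simulate.sh ::: 1 2 3 4
```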

To run the example:

@@ -87,8 +89,12 @@ touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```

The output shows that some tasks intermittently failing and some
succeeding. But always by the 5th job all of them succeed.
The output below, from our random number generator script, shows some
tasks intermittently failing and some succeeding. The example has been
intentionally set up so that 3 out of 4 tasks will fail. By the 5th
SLURM job submission all the tasks will succeed. This is why the last
job 2339010 had nothing to do and completed immediately in 0 seconds.

```
Started SLURM job 2339006
