# GNU Parallel with SLURM

## Summary

This repository contains simple, real-world examples of running your
code in parallel.  The approach works with any program or programming
language.

It includes the `parallel_opts.sh` script to set up GNU Parallel with
the SLURM scheduler.  Specifically, the script:

1. Creates an `*.sshloginfile` containing a list of hostnames and CPU
   counts that have been assigned by SLURM.
2. Maintains the environment, including the current directory.
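
To illustrate the first step, here is a minimal sketch of how such a
file could be built.  It is not the code in `parallel_opts.sh`, which
may work differently.  The helper name `count_cpus_per_host` and the
file name `job.sshloginfile` are made up for this illustration, and
the sketch assumes the input has one hostname per allocated CPU.  Each
output line uses the `CPUS/HOSTNAME` form that GNU Parallel accepts in
an `--sshloginfile`.

``` sh
# Hypothetical helper, not part of this repository.
# Reads one hostname per allocated CPU on stdin and prints one
# "CPUS/HOSTNAME" line per distinct host.
count_cpus_per_host() {
    sort | uniq -c | awk '{ print $1 "/" $2 }'
}

# Inside a SLURM job with one task per CPU, the input could come from
# srun (assumption; commented out so the sketch runs anywhere):
#   srun hostname | count_cpus_per_host > job.sshloginfile
#   parallel --sshloginfile job.sshloginfile ... YOUR_PROGRAM ...
```

For instance, feeding it the five hostnames printed in Example 01
below would give `3/cn327` and `2/cn328`.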

## Usage

Clone this Git repository into your home directory:

``` sh
# From the command-line
cd				# Go to your home directory
git clone https://github.uconn.edu/HPC/parallel-slurm.git
```

Add the following 3 lines to your SLURM job submission file:

``` sh
# Inside your SLURM submission file
parallel_opts=$(~/parallel-slurm/parallel_opts.sh)
module load parallel
parallel $parallel_opts ... YOUR_PROGRAM ...
```

## Examples

There are `*.slurm` files in the [examples/](examples/) directory that
are described below:

### Example 01: Hostname

This minimal example simply outputs the compute node names in
`submit.out`.

``` sh
# From the command-line
cd ~/parallel-slurm/examples
sbatch 01-submit-hostname.slurm
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```

The last few lines of your output should show the nodes on which your
5 CPUs were allocated and where the `hostname` command ran; for
example:

``` sh
cn328
cn327
cn327
cn328
cn327
```

### Example 02: Resumable

Parallel tasks often need to recover from failure.  Tasks can fail
when they start late and are killed by the SLURM job time limit, or
when a simulation intermittently fails to converge.  In both cases,
re-running the failed task can succeed.

This example shows how to automatically retry failed tasks.  This
requires the `--joblog` and `--resume-failed` options of GNU Parallel.
The `--joblog` option records completed tasks in a file, both
successes and failures.  The `--resume-failed` option tells GNU
Parallel to skip successful tasks and to run only the failed,
incomplete, and unattempted tasks.
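
As a rough sketch of how the two options combine, the wrapper below
passes both to `parallel`.  The function name `run_resumable` is
hypothetical, and the actual invocation in
`02-submit-resumable.slurm` may differ.  `$parallel_opts` is assumed
to hold the output of `parallel_opts.sh`, as in the Usage section.

``` sh
# Hypothetical wrapper: run_resumable JOBLOG COMMAND [ARGS...]
run_resumable() {
    joblog=$1
    shift
    # ${parallel_opts:-} is left unquoted on purpose so that it splits
    # into separate options; it expands to nothing when unset.
    parallel ${parallel_opts:-} --joblog "$joblog" --resume-failed "$@"
}

# Example call (commented out; needs GNU Parallel and the task script):
#   run_resumable joblog ./script_that_sometimes_fails.sh ::: 1 2 3 4 5
```

Submitting the same job again re-reads the same `joblog`, so only the
tasks that have not yet succeeded are run.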

If for some reason you need to re-run any successfully completed
tasks, delete the relevant lines in the joblog file, or delete the
entire joblog file to re-run everything.
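
Deleting lines by hand works, but it can also be scripted.  The
helper below is one possible sketch, with the made-up name
`forget_task`.  It assumes the joblog is tab-separated with the task
number in the first column, as in the sample joblog shown later in
this example.

``` sh
# Hypothetical helper: forget_task SEQ JOBLOG
# Drops every joblog entry for task SEQ and keeps the header line, so
# that the task is no longer recorded as completed.
forget_task() {
    seq=$1
    joblog=$2
    awk -F '\t' -v seq="$seq" 'NR == 1 || $1 != seq' "$joblog" \
        > "$joblog.tmp" && mv "$joblog.tmp" "$joblog"
}

# Example call (commented out): forget task 5 in the file "joblog"
#   forget_task 5 joblog
```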

To run the example:

``` sh
# From the command-line
cd ~/parallel-slurm/examples
rm -f joblog submit.out
for i in {1..5}; do sbatch 02-submit-resumable.slurm; done
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```

The output below shows some tasks intermittently failing and some
succeeding from our random number generator script.  This example has
been intentionally set up so that 3 out of 4 tasks will fail.  By the
5th SLURM job submission all the tasks have succeeded, which is why
the last job had nothing to do and completed almost immediately, in
1 second.

```
Started SLURM job 2346932
Task 5 started (seed 2346932, random number 0) ... succeeded!
Task 3 started (seed 2346932, random number 1) ... failed!
Task 1 started (seed 2346932, random number 2) ... failed!
Task 4 started (seed 2346932, random number 3) ... failed!
Task 2 started (seed 2346932, random number 4) ... succeeded!
Completed SLURM job 2346932 in   00:00:05 
Started SLURM job 2346933
Task 4 started (seed 2346933, random number 2) ... failed!
Task 1 started (seed 2346933, random number 3) ... failed!
Task 3 started (seed 2346933, random number 4) ... succeeded!
Completed SLURM job 2346933 in   00:00:05 
Started SLURM job 2346934
Task 4 started (seed 2346934, random number 1) ... failed!
Task 1 started (seed 2346934, random number 4) ... succeeded!
Completed SLURM job 2346934 in   00:00:04 
Started SLURM job 2346935
Task 4 started (seed 2346935, random number 0) ... succeeded!
Completed SLURM job 2346935 in   00:00:00 
Started SLURM job 2346936
Completed SLURM job 2346936 in   00:00:01 
```

The `joblog` file shows the failing jobs with "Exitval" of 1:

```console
$ cat joblog
Seq	Host	Starttime	JobRuntime	Send	Receive	Exitval	Signal	Command
5	cn332	1557788313.413	     0.199	0	62	0	0	./script_that_sometimes_fails.sh 5
3	cn332	1557788313.187	     1.162	0	59	1	0	./script_that_sometimes_fails.sh 3
1	cn332	1557788312.971	     2.197	0	59	1	0	./script_that_sometimes_fails.sh 1
4	cn332	1557788313.296	     3.175	0	59	1	0	./script_that_sometimes_fails.sh 4
2	cn332	1557788313.080	     4.209	0	62	0	0	./script_that_sometimes_fails.sh 2
4	cn332	1557788318.093	     2.180	0	59	1	0	./script_that_sometimes_fails.sh 4
1	cn332	1557788317.867	     3.220	0	59	1	0	./script_that_sometimes_fails.sh 1
3	cn332	1557788317.976	     4.207	0	62	0	0	./script_that_sometimes_fails.sh 3
4	cn332	1557788322.804	     1.471	0	59	1	0	./script_that_sometimes_fails.sh 4
1	cn332	1557788322.695	     4.200	0	62	0	0	./script_that_sometimes_fails.sh 1
4	cn332	1557788327.417	     0.162	0	62	0	0	./script_that_sometimes_fails.sh 4
```
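
To see which tasks are still failing without reading the whole file,
the joblog can be summarized with `awk`.  This is only a sketch: the
function name `failed_tasks` is invented here, and it assumes the
column order shown above (task number in column 1, "Exitval" in
column 7, "Signal" in column 8) and that later lines for a task
supersede earlier ones.

``` sh
# Hypothetical helper: failed_tasks JOBLOG
# Prints the task numbers whose most recent joblog entry has a
# non-zero Exitval or Signal, one per line, in numeric order.
failed_tasks() {
    awk -F '\t' '
        NR > 1 { failed[$1] = ($7 != 0 || $8 != 0) }
        END    { for (s in failed) if (failed[s]) print s }
    ' "$1" | sort -n
}

# Example call (commented out):
#   failed_tasks joblog
```

For the complete joblog above this prints nothing, because every task
eventually succeeded.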

### Example 03: Parameter Sweep

A parameter sweep runs the same program over combinations of input
parameters.

This example is nearly the same as the previous example, but instead
of using the task ID as the input, it uses a long list of record IDs.
The program reads the record ID to find the corresponding input
parameters, calculates the result from those parameters, and saves
the result in a directory.

Once all records are completed, you can aggregate the results from
the directory into a single results file.  Aggregating the files does
not really require a SLURM job.
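
One way the aggregation might look is sketched below.  The function
name `aggregate_results` and the output file name `results.tsv` are
hypothetical, and the sketch assumes each `results/*.dat` file holds a
single line with the result; adjust it to the format your program
actually writes.

``` sh
# Hypothetical helper: aggregate_results RESULTS_DIR
# Prints one "RECORD_ID<TAB>RESULT" line per .dat file, assuming each
# file contains a single-line result.
aggregate_results() {
    results_dir=$1
    for f in "$results_dir"/*.dat; do
        [ -e "$f" ] || continue     # no .dat files: print nothing
        printf '%s\t%s\n' "$(basename "$f" .dat)" "$(cat "$f")"
    done
}

# Example call (commented out):
#   aggregate_results results > results.tsv
```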

GNU Parallel automatically feeds record IDs to task workers as the
task workers complete records and become available.

```sh
# From the command-line
module purge
module load python/2.7.6-gcc-unicode
cd ~/parallel-slurm/examples
rm -rf joblog submit.out results/
for i in {1..3}; do sbatch 03-submit-param-sweep.slurm; done
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```

Output:

```
Started SLURM job 2346922
Running 60 of total 60 simulations.
1: Fitting model to parameters: x = 0.1, y = -0.1, z = control ...
1: ... done!  Saved result  4.196 to results/01.dat
2: Fitting model to parameters: x = 0.1, y = -0.1, z = positive ...
2: ... done!  Saved result  1.574 to results/02.dat
3: Fitting model to parameters: x = 0.1, y = -0.1, z = negative ...
3: ... done!  Saved result 12.589 to results/03.dat
...
44: Fitting model to parameters: x = 0.8, y = -0.1, z = positive ...
44: ... done!  Saved result  1.278 to results/44.dat
slurmstepd: *** JOB 2346922 ON cn338 CANCELLED AT 2019-05-13T18:52:16 DUE TO TIME LIMIT ***
Started SLURM job 2346923
Running 15 of total 60 simulations.
45: Fitting model to parameters: x = 0.8, y = -0.1, z = negative ...
45: ... done!  Saved result 10.226 to results/45.dat
...
59: Fitting model to parameters: x = 1.0, y = 0.1, z = positive ...
59: ... done!  Saved result  1.250 to results/59.dat
60: Fitting model to parameters: x = 1.0, y = 0.1, z = negative ...
60: ... done!  Saved result 10.000 to results/60.dat
Completed SLURM job 2346923 in   00:00:32 
Started SLURM job 2346924
Nothing to run; all 60 simulations complete.
Completed SLURM job 2346924 in   00:00:01 
```

```console
$ head -n 4 joblog; echo "..."; tail -n 3 joblog
Seq     Host    Starttime       JobRuntime      Send    Receive Exitval Signal  Command
1       cn236   1557787112.519       5.414      0       119     0       0       python model.py 1
3       cn236   1557787112.750       5.250      0       120     0       0       python model.py 3
2       cn236   1557787117.940       5.276      0       429     0       0       python model.py 2
...
58      cn338   1557787220.703       5.284      0       120     0       0       python model.py 58
59      cn338   1557787220.815       5.267      0       121     0       0       python model.py 59
60      cn338   1557787225.850       5.263      0       121     0       0       python model.py 60
```

## Next Steps

Hopefully these examples have inspired you to use GNU Parallel to
parallelize your code.  Now you can:

1. Read the [main help page](https://www.gnu.org/software/parallel/man.html)
2. Read the [tutorial](https://www.gnu.org/software/parallel/parallel_tutorial.html)

If you have further questions, contact hpc@uconn.edu.