# GNU Parallel with SLURM
## Summary
This repository contains simple, real-world examples of running your code
in parallel; the approach works with any program or programming language.
It includes the `parallel_opts.sh` script, which sets up GNU Parallel with
the SLURM scheduler; specifically, the script:
1. Creates an `*.sshloginfile` containing a list of hostnames and CPU
counts that have been assigned by SLURM.
2. Maintains the environment, including the current directory.
## Usage
Clone this Git repository into your home directory:
``` sh
# From the command-line
cd # Go to your home directory
git clone https://github.uconn.edu/HPC/parallel-slurm.git
```
Add the following 3 lines to your SLURM job submission file:
``` sh
# Inside your SLURM submission file
parallel_opts=$(~/parallel-slurm/parallel_opts.sh)
module load parallel
parallel $parallel_opts ... YOUR_PROGRAM ...
```
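Putting it together, a minimal submission file might look like the sketch
below. The job name and CPU count are placeholders; adjust them for your
cluster, and substitute your own program for `hostname`:

``` sh
#!/bin/bash
#SBATCH --job-name=parallel-demo   # placeholder job name
#SBATCH --ntasks=5                 # CPUs for GNU Parallel to spread tasks over

# Set up GNU Parallel to use the nodes/CPUs SLURM assigned
parallel_opts=$(~/parallel-slurm/parallel_opts.sh)
module load parallel

# Run one task per input argument; "hostname" stands in for your program
parallel $parallel_opts hostname ::: $(seq 1 5)
```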
## Examples
There are `*.slurm` files in the [examples/](examples/) directory that
are described below:
### Example 01: Hostname
This minimal example simply outputs the compute node names in
`submit.out`.
``` sh
# From the command-line
cd ~/parallel-slurm/examples
sbatch 01-submit-hostname.slurm
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```
The last few lines of your output should show which nodes your 5 CPUs
were allocated on and where the `hostname` command ran; for example:
```
cn328
cn327
cn327
cn328
cn327
```
### Example 02: Resumable
Parallel tasks often need to recover from failure. Tasks can fail
when they start late and are killed by the SLURM job time limit. Or
tasks can fail due to a simulation intermittently not converging. In
both cases, re-running the failed task can produce success.
This example shows how to automatically retry failed tasks. This
requires the `--joblog` and `--resume-failed` options of GNU Parallel.
The `--joblog` option records completed tasks in a file, both successes
and failures. Using `--resume-failed` tells GNU Parallel to ignore
successful tasks but to continue running failed, incomplete, and
unattempted tasks.
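To see the bookkeeping these options imply, here is a plain-shell sketch of
the same idea using a hypothetical two-column log of task number and exit
value (GNU Parallel maintains a richer log for you). Tasks fail on the
first pass, are skipped once logged as successful, and are retried
otherwise:

``` sh
joblog=demo_joblog
: > "$joblog"                       # start with an empty log

attempt () {                        # toy task: fails on pass 1, succeeds on pass 2
    [ "$2" -ge 2 ]
}

for pass in 1 2; do
    for task in 1 2 3; do
        # Skip tasks already logged with exit value 0 (like --resume-failed)
        grep -q "^$task 0$" "$joblog" && continue
        if attempt "$task" "$pass"; then status=0; else status=1; fi
        printf '%s %s\n' "$task" "$status" >> "$joblog"   # like --joblog
    done
done
cat "$joblog"
```

After the first pass the log holds three failures; after the second it
also holds three successes, and a third pass would find nothing to do.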
If for some reason you need to re-run any successfully completed tasks,
delete the relevant lines from the joblog file, or delete the entire
joblog file to re-run everything.
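For example, to force one task to re-run you could delete its line(s) from
the joblog with a text tool such as `grep`. The snippet below builds a toy
two-line joblog so it is self-contained; on the cluster you would edit the
real `joblog` file instead (`task.sh` is a hypothetical command name):

``` sh
# Toy joblog standing in for the real one
printf '%s\n' '1 ./task.sh 1' '3 ./task.sh 3' > joblog
# Drop every line for task 3 so it is treated as never run
grep -v 'task.sh 3$' joblog > joblog.tmp && mv joblog.tmp joblog
cat joblog
```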
To run the example:
``` sh
# From the command-line
cd ~/parallel-slurm/examples
rm -f joblog submit.out
for i in {1..5}; do sbatch 02-submit-resumable.slurm; done
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```
The output below shows some tasks from our random number generator script
intermittently failing and some succeeding. This example has been
intentionally set up so that 3 out of 4 tasks fail; by the 5th SLURM job
submission all the tasks have succeeded, which is why the last job had
nothing to do and completed in about 1 second.
```
Started SLURM job 2346932
Task 5 started (seed 2346932, random number 0) ... succeeded!
Task 3 started (seed 2346932, random number 1) ... failed!
Task 1 started (seed 2346932, random number 2) ... failed!
Task 4 started (seed 2346932, random number 3) ... failed!
Task 2 started (seed 2346932, random number 4) ... succeeded!
Completed SLURM job 2346932 in 00:00:05
Started SLURM job 2346933
Task 4 started (seed 2346933, random number 2) ... failed!
Task 1 started (seed 2346933, random number 3) ... failed!
Task 3 started (seed 2346933, random number 4) ... succeeded!
Completed SLURM job 2346933 in 00:00:05
Started SLURM job 2346934
Task 4 started (seed 2346934, random number 1) ... failed!
Task 1 started (seed 2346934, random number 4) ... succeeded!
Completed SLURM job 2346934 in 00:00:04
Started SLURM job 2346935
Task 4 started (seed 2346935, random number 0) ... succeeded!
Completed SLURM job 2346935 in 00:00:00
Started SLURM job 2346936
Completed SLURM job 2346936 in 00:00:01
```
The `joblog` file shows the failed tasks with an "Exitval" of 1:
```console
$ cat joblog
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
5 cn332 1557788313.413 0.199 0 62 0 0 ./script_that_sometimes_fails.sh 5
3 cn332 1557788313.187 1.162 0 59 1 0 ./script_that_sometimes_fails.sh 3
1 cn332 1557788312.971 2.197 0 59 1 0 ./script_that_sometimes_fails.sh 1
4 cn332 1557788313.296 3.175 0 59 1 0 ./script_that_sometimes_fails.sh 4
2 cn332 1557788313.080 4.209 0 62 0 0 ./script_that_sometimes_fails.sh 2
4 cn332 1557788318.093 2.180 0 59 1 0 ./script_that_sometimes_fails.sh 4
1 cn332 1557788317.867 3.220 0 59 1 0 ./script_that_sometimes_fails.sh 1
3 cn332 1557788317.976 4.207 0 62 0 0 ./script_that_sometimes_fails.sh 3
4 cn332 1557788322.804 1.471 0 59 1 0 ./script_that_sometimes_fails.sh 4
1 cn332 1557788322.695 4.200 0 62 0 0 ./script_that_sometimes_fails.sh 1
4 cn332 1557788327.417 0.162 0 62 0 0 ./script_that_sometimes_fails.sh 4
```
### Example 03: Parameter Sweep
A parameter sweep runs the same program over many combinations of
input parameters.
This example is nearly the same as the previous one, but instead of
using the task ID as the input, it uses a long list of record IDs. The
program reads each record ID to find the corresponding input parameters,
calculates the result from those parameters, and saves the result in a
directory.
Once all records are completed, you can aggregate the results from the
directory into a single results file; this aggregation step does not
require a SLURM job.
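The aggregation itself can be a one-liner run on a login node. The snippet
below creates two toy result files in place of the real `results/*.dat`
output so it is self-contained:

``` sh
mkdir -p results
printf '4.196\n' > results/01.dat    # toy result files standing in for real output
printf '1.574\n' > results/02.dat
# Concatenate per-record results into a single file, in record order
cat results/[0-9][0-9].dat > all_results.txt
wc -l < all_results.txt
```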
GNU Parallel automatically feeds record IDs to task workers as the
task workers complete records and become available.
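With GNU Parallel, feeding IDs from a file looks like
`parallel python model.py :::: record_ids.txt`. The same dynamic-dispatch
idea can be sketched with the standard `xargs -P`, which also hands the
next ID to whichever worker finishes first (GNU Parallel adds the
multi-node and joblog features on top); the `echo` worker here is a
hypothetical stand-in for `python model.py`:

``` sh
seq 1 6 > record_ids.txt
# Two workers pull record IDs from the list as they become free
xargs -P 2 -I{} sh -c 'echo "processing record {}"' < record_ids.txt > sweep.out
cat sweep.out
```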
```sh
# From the command-line
module purge
module load python/2.7.6-gcc-unicode
cd ~/parallel-slurm/examples
rm -rf joblog submit.out results/
for i in {1..3}; do sbatch 03-submit-param-sweep.slurm; done
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```
Output:
```
Started SLURM job 2346922
Running 60 of total 60 simulations.
1: Fitting model to parameters: x = 0.1, y = -0.1, z = control ...
1: ... done! Saved result 4.196 to results/01.dat
2: Fitting model to parameters: x = 0.1, y = -0.1, z = positive ...
2: ... done! Saved result 1.574 to results/02.dat
3: Fitting model to parameters: x = 0.1, y = -0.1, z = negative ...
3: ... done! Saved result 12.589 to results/03.dat
...
44: Fitting model to parameters: x = 0.8, y = -0.1, z = positive ...
44: ... done! Saved result 1.278 to results/44.dat
slurmstepd: *** JOB 2346922 ON cn338 CANCELLED AT 2019-05-13T18:52:16 DUE TO TIME LIMIT ***
Started SLURM job 2346923
Running 15 of total 60 simulations.
45: Fitting model to parameters: x = 0.8, y = -0.1, z = negative ...
45: ... done! Saved result 10.226 to results/45.dat
...
59: Fitting model to parameters: x = 1.0, y = 0.1, z = positive ...
59: ... done! Saved result 1.250 to results/59.dat
60: Fitting model to parameters: x = 1.0, y = 0.1, z = negative ...
60: ... done! Saved result 10.000 to results/60.dat
Completed SLURM job 2346923 in 00:00:32
Started SLURM job 2346924
Nothing to run; all 60 simulations complete.
Completed SLURM job 2346924 in 00:00:01
```
```console
$ head -n 4 joblog; echo "..."; tail -n 3 joblog
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 cn236 1557787112.519 5.414 0 119 0 0 python model.py 1
3 cn236 1557787112.750 5.250 0 120 0 0 python model.py 3
2 cn236 1557787117.940 5.276 0 429 0 0 python model.py 2
...
58 cn338 1557787220.703 5.284 0 120 0 0 python model.py 58
59 cn338 1557787220.815 5.267 0 121 0 0 python model.py 59
60 cn338 1557787225.850 5.263 0 121 0 0 python model.py 60
```
## Next Steps
Hopefully these examples have inspired you to use GNU Parallel to
parallelize your code. Now you can:
1. Read the [main help page](https://www.gnu.org/software/parallel/man.html)
2. Read the [tutorial](https://www.gnu.org/software/parallel/parallel_tutorial.html)
If you have further questions, contact hpc@uconn.edu.