This repository contains simple, real-world examples of running your code in parallel; the approach works with any program or programming language.
It includes the parallel_opts.sh script to set up GNU Parallel with the SLURM scheduler. Specifically, the script:
- Creates an *.sshloginfile containing the list of hostnames and CPU counts assigned by SLURM.
- Maintains the environment, including the current directory.
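As a rough illustration of the *.sshloginfile format, the sketch below builds one by hand. This is a hypothetical stand-in for parallel_opts.sh, not its actual contents, and it assumes the SLURM hostname list has already been expanded into plain words (a real script would typically derive it from SLURM_JOB_NODELIST and SLURM_CPUS_ON_NODE):

```shell
# Hypothetical sketch: build an sshloginfile from an already-expanded host list.
make_sshloginfile() {
    local nodelist="$1"          # space-separated hostnames, e.g. "cn327 cn328"
    local cpus_per_node="$2"     # CPUs allocated per node
    local outfile="$3"
    : > "$outfile"
    for host in $nodelist; do
        # GNU Parallel's sshloginfile format is "CPUS/hostname"
        echo "${cpus_per_node}/${host}" >> "$outfile"
    done
}

make_sshloginfile "cn327 cn328" 2 my.sshloginfile
cat my.sshloginfile
```

GNU Parallel reads each "CPUS/hostname" line and runs up to that many simultaneous tasks on that host.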
Clone this Git repository into your home directory:
# From the command-line
cd # Go to your home directory
git clone https://github.uconn.edu/HPC/parallel-slurm.git
Add the following 3 lines to your SLURM job submission file:
# Inside your SLURM submission file
parallel_opts=$(~/parallel-slurm/parallel_opts.sh)
module load parallel
parallel $parallel_opts ... YOUR_PROGRAM ...
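Put together, a minimal submission file might look like the sketch below; the resource request and the echo command are placeholders for your own settings and program:

```shell
#!/bin/bash
#SBATCH --ntasks=5       # placeholder: request 5 CPUs for 5 simultaneous tasks

# Build the GNU Parallel options (sshloginfile, environment) from SLURM's allocation
parallel_opts=$(~/parallel-slurm/parallel_opts.sh)
module load parallel

# {} is replaced by each argument after :::; swap echo for your own program
parallel $parallel_opts echo "running task {}" ::: 1 2 3 4 5
```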
There are *.slurm files in the examples/ directory; they are described below.
This minimal example simply outputs the compute node names in submit.out.
# From the command-line
cd ~/parallel-slurm/examples
sbatch 01-submit-hostname.slurm
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
The last few lines of your output should show on which nodes your 5 CPUs were allocated and the hostname command was run; for example:
cn328
cn327
cn327
cn328
cn327
Parallel tasks often need to recover from failure. Tasks can fail when they start late and are killed by the SLURM job time limit, or when a simulation intermittently fails to converge. In both cases, re-running the failed task can produce success.
This example shows how to automatically retry failed tasks using the --joblog and --resume-failed options of GNU Parallel. The --joblog option records completed tasks, both successes and failures, in a file. The --resume-failed option tells GNU Parallel to skip successful tasks and to continue running failed, incomplete, and unattempted tasks.
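The bookkeeping these two options perform can be mimicked in plain shell; the sketch below is a hypothetical stand-in for GNU Parallel's behavior, with a placeholder task in which even-numbered tasks succeed and odd-numbered ones fail:

```shell
# Hypothetical stand-in for --joblog/--resume-failed bookkeeping:
# re-run a task only if it has no successful (exit value 0) entry in the joblog.
my_task() {
    # Placeholder task: even-numbered tasks succeed, odd-numbered tasks fail
    [ $(( $1 % 2 )) -eq 0 ]
}

run_resumable() {
    local task="$1" exitval
    if grep -q "^${task} 0$" joblog 2>/dev/null; then
        return 0                                  # already succeeded: skip
    fi
    my_task "$task"; exitval=$?
    echo "${task} ${exitval}" >> joblog           # record success or failure
}

rm -f joblog
for t in 1 2 3 4; do run_resumable "$t"; done     # first "submission"
for t in 1 2 3 4; do run_resumable "$t"; done     # retry: only failed tasks re-run
cat joblog
```

With real GNU Parallel, the same behavior comes from adding --joblog joblog --resume-failed to the parallel command in your submission file.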
If for some reason you need to re-run a successfully completed task, delete the relevant line from the joblog file, or delete the entire joblog file to re-run everything.
To run the example:
# From the command-line
cd ~/parallel-slurm/examples
rm -f joblog submit.out
for i in {1..5}; do sbatch 02-submit-resumable.slurm; done
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
The output below shows some tasks intermittently failing and some succeeding, driven by our random number generator script. This example is intentionally set up so that 3 out of 4 tasks fail; by the 5th SLURM job submission all of the tasks have succeeded. That is why the last job had nothing to do and completed in 1 second.
Started SLURM job 2346932
Task 5 started (seed 2346932, random number 0) ... succeeded!
Task 3 started (seed 2346932, random number 1) ... failed!
Task 1 started (seed 2346932, random number 2) ... failed!
Task 4 started (seed 2346932, random number 3) ... failed!
Task 2 started (seed 2346932, random number 4) ... succeeded!
Completed SLURM job 2346932 in 00:00:05
Started SLURM job 2346933
Task 4 started (seed 2346933, random number 2) ... failed!
Task 1 started (seed 2346933, random number 3) ... failed!
Task 3 started (seed 2346933, random number 4) ... succeeded!
Completed SLURM job 2346933 in 00:00:05
Started SLURM job 2346934
Task 4 started (seed 2346934, random number 1) ... failed!
Task 1 started (seed 2346934, random number 4) ... succeeded!
Completed SLURM job 2346934 in 00:00:04
Started SLURM job 2346935
Task 4 started (seed 2346935, random number 0) ... succeeded!
Completed SLURM job 2346935 in 00:00:00
Started SLURM job 2346936
Completed SLURM job 2346936 in 00:00:01
The joblog file shows the failed tasks with an "Exitval" of 1:
$ cat joblog
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
5 cn332 1557788313.413 0.199 0 62 0 0 ./script_that_sometimes_fails.sh 5
3 cn332 1557788313.187 1.162 0 59 1 0 ./script_that_sometimes_fails.sh 3
1 cn332 1557788312.971 2.197 0 59 1 0 ./script_that_sometimes_fails.sh 1
4 cn332 1557788313.296 3.175 0 59 1 0 ./script_that_sometimes_fails.sh 4
2 cn332 1557788313.080 4.209 0 62 0 0 ./script_that_sometimes_fails.sh 2
4 cn332 1557788318.093 2.180 0 59 1 0 ./script_that_sometimes_fails.sh 4
1 cn332 1557788317.867 3.220 0 59 1 0 ./script_that_sometimes_fails.sh 1
3 cn332 1557788317.976 4.207 0 62 0 0 ./script_that_sometimes_fails.sh 3
4 cn332 1557788322.804 1.471 0 59 1 0 ./script_that_sometimes_fails.sh 4
1 cn332 1557788322.695 4.200 0 62 0 0 ./script_that_sometimes_fails.sh 1
4 cn332 1557788327.417 0.162 0 62 0 0 ./script_that_sometimes_fails.sh 4
A parameter sweep runs the same program over many combinations of input parameters.
This example is nearly the same as the previous one, but instead of using the task ID as the input, it uses a long list of record IDs. The program reads each record ID to find the corresponding input parameters, calculates the result from those parameters, and saves the result in a directory.
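As a sketch of that lookup step, a worker might map its record ID to a row of a flattened parameter table, as below. The table values mirror the example output further down, but model.py's actual lookup code is an assumption here:

```shell
# Hypothetical sketch: map a record ID to its "x y z" parameter combination.
# model.py's real lookup is not shown in this README; this mirrors its spirit.
lookup_params() {
    local record_id="$1"
    local table="0.1 -0.1 control
0.1 -0.1 positive
0.1 -0.1 negative"
    # Record ID N selects row N of the flattened parameter table
    sed -n "${record_id}p" <<< "$table"
}

lookup_params 2
```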
Once all records are complete, you can aggregate the results from the directory into a single results file. This aggregation step does not require a SLURM job.
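The aggregation might be as simple as the shell sketch below; the file names and the one-number-per-file layout are assumptions based on the example output:

```shell
# Hypothetical aggregation: combine per-record result files into one table.
# Assumes each results/NN.dat holds a single number, as in the example output.
mkdir -p results
echo 4.196  > results/01.dat     # stand-in result files for the demo
echo 1.574  > results/02.dat
echo 12.589 > results/03.dat

for f in results/*.dat; do
    id=$(basename "$f" .dat)
    echo "$id $(cat "$f")"       # one "record_id result" row per file
done > results.txt
cat results.txt
```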
GNU Parallel automatically feeds record IDs to task workers as the task workers complete records and become available.
# From the command-line
module purge
module load python/2.7.6-gcc-unicode
cd ~/parallel-slurm/examples
rm -rf joblog submit.out results/
for i in {1..3}; do sbatch 03-submit-param-sweep.slurm; done
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
Output:
Started SLURM job 2346922
Running 60 of total 60 simulations.
1: Fitting model to parameters: x = 0.1, y = -0.1, z = control ...
1: ... done! Saved result 4.196 to results/01.dat
2: Fitting model to parameters: x = 0.1, y = -0.1, z = positive ...
2: ... done! Saved result 1.574 to results/02.dat
3: Fitting model to parameters: x = 0.1, y = -0.1, z = negative ...
3: ... done! Saved result 12.589 to results/03.dat
...
44: Fitting model to parameters: x = 0.8, y = -0.1, z = positive ...
44: ... done! Saved result 1.278 to results/44.dat
slurmstepd: *** JOB 2346922 ON cn338 CANCELLED AT 2019-05-13T18:52:16 DUE TO TIME LIMIT ***
Started SLURM job 2346923
Running 15 of total 60 simulations.
45: Fitting model to parameters: x = 0.8, y = -0.1, z = negative ...
45: ... done! Saved result 10.226 to results/45.dat
...
59: Fitting model to parameters: x = 1.0, y = 0.1, z = positive ...
59: ... done! Saved result 1.250 to results/59.dat
60: Fitting model to parameters: x = 1.0, y = 0.1, z = negative ...
60: ... done! Saved result 10.000 to results/60.dat
Completed SLURM job 2346923 in 00:00:32
Started SLURM job 2346924
Nothing to run; all 60 simulations complete.
Completed SLURM job 2346924 in 00:00:01
$ head -n 4 joblog; echo "..."; tail -n 3 joblog
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 cn236 1557787112.519 5.414 0 119 0 0 python model.py 1
3 cn236 1557787112.750 5.250 0 120 0 0 python model.py 3
2 cn236 1557787117.940 5.276 0 429 0 0 python model.py 2
...
58 cn338 1557787220.703 5.284 0 120 0 0 python model.py 58
59 cn338 1557787220.815 5.267 0 121 0 0 python model.py 59
60 cn338 1557787225.850 5.263 0 121 0 0 python model.py 60
Hopefully these examples have inspired you to use GNU Parallel to parallelize your code. Now you can:
- Read the main help page
- Read the tutorial
If you have further questions, contact hpc@uconn.edu.