This repository contains simple, real-world examples of running your code in parallel; the approach works with any program or programming language.
It includes the parallel_opts.sh script to set up GNU Parallel with the SLURM scheduler. Specifically, the script:
- Creates an *.sshloginfile containing the list of hostnames and CPU counts assigned by SLURM.
- Maintains the environment, including the current directory.
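As a rough illustration of the *.sshloginfile format, the sketch below builds one by hand. This is a hypothetical stand-in for parallel_opts.sh, not its actual contents, and it assumes the SLURM hostname list has already been expanded into plain words (a real script would typically derive it from SLURM_JOB_NODELIST and SLURM_CPUS_ON_NODE):

```shell
# Hypothetical sketch: build an sshloginfile from an already-expanded host list.
make_sshloginfile() {
    local nodelist="$1"          # space-separated hostnames, e.g. "cn327 cn328"
    local cpus_per_node="$2"     # CPUs allocated per node
    local outfile="$3"
    : > "$outfile"
    for host in $nodelist; do
        # GNU Parallel's sshloginfile format is "CPUS/hostname"
        echo "${cpus_per_node}/${host}" >> "$outfile"
    done
}

make_sshloginfile "cn327 cn328" 2 my.sshloginfile
cat my.sshloginfile
```

GNU Parallel reads each "CPUS/hostname" line and runs up to that many simultaneous tasks on that host.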
Clone this Git repository into your home directory:
# From the command-line
cd # Go to your home directory
git clone https://github.uconn.edu/HPC/parallel-slurm.git
Add the following 3 lines to your SLURM job submission file:
# Inside your SLURM submission file
parallel_opts=$(~/parallel-slurm/parallel_opts.sh)
module load parallel
parallel $parallel_opts ... YOUR_PROGRAM ...
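Put together, a minimal submission file might look like the sketch below; the resource request and the echo command are placeholders for your own settings and program:

```shell
#!/bin/bash
#SBATCH --ntasks=5       # placeholder: request 5 CPUs for 5 simultaneous tasks

# Build the GNU Parallel options (sshloginfile, environment) from SLURM's allocation
parallel_opts=$(~/parallel-slurm/parallel_opts.sh)
module load parallel

# {} is replaced by each argument after :::; swap echo for your own program
parallel $parallel_opts echo "running task {}" ::: 1 2 3 4 5
```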
There are *.slurm files in the examples/ directory; they are described below.
This minimal example simply outputs the compute node names in submit.out.
# From the command-line
cd ~/parallel-slurm/examples
sbatch 01-submit-hostname.slurm
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
The last few lines of your output should show on which nodes your 5 CPUs were allocated and the hostname command was run; for example:
cn328
cn327
cn327
cn328
cn327
Parallel tasks often need to recover from failure. Tasks can fail when they start late and are killed by the SLURM job time limit, or when a simulation intermittently fails to converge. In both cases, re-running the failed task can produce success.
This example shows how to automatically retry failed tasks using the --joblog and --resume-failed options of GNU Parallel. The --joblog option records completed tasks, both successes and failures, in a file. The --resume-failed option tells GNU Parallel to skip successful tasks and to continue running failed, incomplete, and unattempted tasks.
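The bookkeeping these two options perform can be mimicked in plain shell; the sketch below is a hypothetical stand-in for GNU Parallel's behavior, with a placeholder task in which even-numbered tasks succeed and odd-numbered ones fail:

```shell
# Hypothetical stand-in for --joblog/--resume-failed bookkeeping:
# re-run a task only if it has no successful (exit value 0) entry in the joblog.
my_task() {
    # Placeholder task: even-numbered tasks succeed, odd-numbered tasks fail
    [ $(( $1 % 2 )) -eq 0 ]
}

run_resumable() {
    local task="$1" exitval
    if grep -q "^${task} 0$" joblog 2>/dev/null; then
        return 0                                  # already succeeded: skip
    fi
    my_task "$task"; exitval=$?
    echo "${task} ${exitval}" >> joblog           # record success or failure
}

rm -f joblog
for t in 1 2 3 4; do run_resumable "$t"; done     # first "submission"
for t in 1 2 3 4; do run_resumable "$t"; done     # retry: only failed tasks re-run
cat joblog
```

With real GNU Parallel, the same behavior comes from adding --joblog joblog --resume-failed to the parallel command in your submission file.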
If for some reason you need to re-run a successfully completed task, delete the relevant line from the joblog file, or delete the entire joblog file to re-run everything.
To run the example:
# From the command-line
cd ~/parallel-slurm/examples
rm -f joblog submit.out
for i in {1..5}; do sbatch 02-submit-resumable.slurm; done
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
The output below shows some tasks intermittently failing and some succeeding, driven by our random number generator script. This example is intentionally set up so that 3 out of 4 tasks fail; by the 5th SLURM job submission all of the tasks have succeeded. That is why the last job had nothing to do and completed in 1 second.
Started SLURM job 2346932
Task 5 started (seed 2346932, random number 0) ... succeeded!
Task 3 started (seed 2346932, random number 1) ... failed!
Task 1 started (seed 2346932, random number 2) ... failed!
Task 4 started (seed 2346932, random number 3) ... failed!
Task 2 started (seed 2346932, random number 4) ... succeeded!
Completed SLURM job 2346932 in 00:00:05
Started SLURM job 2346933
Task 4 started (seed 2346933, random number 2) ... failed!
Task 1 started (seed 2346933, random number 3) ... failed!
Task 3 started (seed 2346933, random number 4) ... succeeded!
Completed SLURM job 2346933 in 00:00:05
Started SLURM job 2346934
Task 4 started (seed 2346934, random number 1) ... failed!
Task 1 started (seed 2346934, random number 4) ... succeeded!
Completed SLURM job 2346934 in 00:00:04
Started SLURM job 2346935
Task 4 started (seed 2346935, random number 0) ... succeeded!
Completed SLURM job 2346935 in 00:00:00
Started SLURM job 2346936
Completed SLURM job 2346936 in 00:00:01
The joblog file shows the failed tasks with an "Exitval" of 1:
$ cat joblog
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
5 cn332 1557788313.413 0.199 0 62 0 0 ./script_that_sometimes_fails.sh 5
3 cn332 1557788313.187 1.162 0 59 1 0 ./script_that_sometimes_fails.sh 3
1 cn332 1557788312.971 2.197 0 59 1 0 ./script_that_sometimes_fails.sh 1
4 cn332 1557788313.296 3.175 0 59 1 0 ./script_that_sometimes_fails.sh 4
2 cn332 1557788313.080 4.209 0 62 0 0 ./script_that_sometimes_fails.sh 2
4 cn332 1557788318.093 2.180 0 59 1 0 ./script_that_sometimes_fails.sh 4
1 cn332 1557788317.867 3.220 0 59 1 0 ./script_that_sometimes_fails.sh 1
3 cn332 1557788317.976 4.207 0 62 0 0 ./script_that_sometimes_fails.sh 3
4 cn332 1557788322.804 1.471 0 59 1 0 ./script_that_sometimes_fails.sh 4
1 cn332 1557788322.695 4.200 0 62 0 0 ./script_that_sometimes_fails.sh 1
4 cn332 1557788327.417 0.162 0 62 0 0 ./script_that_sometimes_fails.sh 4
A parameter sweep runs the same program over many combinations of input parameters.
This example is nearly the same as the previous one, but instead of using the task ID as the input, it uses a long list of record IDs. The program reads each record ID to find the corresponding input parameters, calculates the result from those parameters, and saves the result in a directory.
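As a sketch of that lookup step, a worker might map its record ID to a row of a flattened parameter table, as below. The table values mirror the example output further down, but model.py's actual lookup code is an assumption here:

```shell
# Hypothetical sketch: map a record ID to its "x y z" parameter combination.
# model.py's real lookup is not shown in this README; this mirrors its spirit.
lookup_params() {
    local record_id="$1"
    local table="0.1 -0.1 control
0.1 -0.1 positive
0.1 -0.1 negative"
    # Record ID N selects row N of the flattened parameter table
    sed -n "${record_id}p" <<< "$table"
}

lookup_params 2
```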
Once all records are complete, you can aggregate the results from the directory into a single results file. This aggregation step does not require a SLURM job.
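The aggregation might be as simple as the shell sketch below; the file names and the one-number-per-file layout are assumptions based on the example output:

```shell
# Hypothetical aggregation: combine per-record result files into one table.
# Assumes each results/NN.dat holds a single number, as in the example output.
mkdir -p results
echo 4.196  > results/01.dat     # stand-in result files for the demo
echo 1.574  > results/02.dat
echo 12.589 > results/03.dat

for f in results/*.dat; do
    id=$(basename "$f" .dat)
    echo "$id $(cat "$f")"       # one "record_id result" row per file
done > results.txt
cat results.txt
```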
GNU Parallel automatically feeds record IDs to task workers as the task workers complete records and become available.
# From the command-line
module purge
module load python/2.7.6-gcc-unicode
cd ~/parallel-slurm/examples
rm -rf joblog submit.out results/
for i in {1..3}; do sbatch 03-submit-param-sweep.slurm; done
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
Output:
Started SLURM job 2346922
Running 60 of total 60 simulations.
1: Fitting model to parameters: x = 0.1, y = -0.1, z = control ...
1: ... done! Saved result 4.196 to results/01.dat
2: Fitting model to parameters: x = 0.1, y = -0.1, z = positive ...
2: ... done! Saved result 1.574 to results/02.dat
3: Fitting model to parameters: x = 0.1, y = -0.1, z = negative ...
3: ... done! Saved result 12.589 to results/03.dat
...
44: Fitting model to parameters: x = 0.8, y = -0.1, z = positive ...
44: ... done! Saved result 1.278 to results/44.dat
slurmstepd: *** JOB 2346922 ON cn338 CANCELLED AT 2019-05-13T18:52:16 DUE TO TIME LIMIT ***
Started SLURM job 2346923
Running 15 of total 60 simulations.
45: Fitting model to parameters: x = 0.8, y = -0.1, z = negative ...
45: ... done! Saved result 10.226 to results/45.dat
...
59: Fitting model to parameters: x = 1.0, y = 0.1, z = positive ...
59: ... done! Saved result 1.250 to results/59.dat
60: Fitting model to parameters: x = 1.0, y = 0.1, z = negative ...
60: ... done! Saved result 10.000 to results/60.dat
Completed SLURM job 2346923 in 00:00:32
Started SLURM job 2346924
Nothing to run; all 60 simulations complete.
Completed SLURM job 2346924 in 00:00:01
$ head -n 4 joblog; echo "..."; tail -n 3 joblog
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 cn236 1557787112.519 5.414 0 119 0 0 python model.py 1
3 cn236 1557787112.750 5.250 0 120 0 0 python model.py 3
2 cn236 1557787117.940 5.276 0 429 0 0 python model.py 2
...
58 cn338 1557787220.703 5.284 0 120 0 0 python model.py 58
59 cn338 1557787220.815 5.267 0 121 0 0 python model.py 59
60 cn338 1557787225.850 5.263 0 121 0 0 python model.py 60
Hopefully these examples have inspired you to use GNU Parallel to parallelize your code. Now you can:
- Read the main help page
- Read the tutorial
If you have further questions, contact hpc@uconn.edu.