# GNU Parallel with SLURM

## Summary

This repository has simple, real-world examples of running your code in
parallel; the approach works with any program or programming language.

It includes the `parallel_opts.sh` script to set up GNU Parallel with
the SLURM scheduler. The script:

1. Creates a `*.sshloginfile` containing the list of hostnames and CPU
   counts that SLURM has assigned to the job.
2. Maintains the environment, including the current directory.
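
To make step 1 concrete, here is a minimal sketch of how such a file could be generated. It is not the code in `parallel_opts.sh`; the function names `expand_cpus_per_node` and `make_sshlogin_lines` are hypothetical. It assumes that SLURM reports CPU counts in its compact form (for example `2(x3),4`) and relies on GNU Parallel accepting `CPUS/HOST` lines in a file passed to `--sshloginfile`.

```sh
# Hypothetical helper: expand SLURM's compact CPU list, e.g. "2(x3),4",
# into one CPU count per line (2, 2, 2, 4).
expand_cpus_per_node() {
    echo "$1" | tr ',' '\n' | while IFS= read -r entry; do
        case "$entry" in
            *"(x"*)
                count=${entry%%"("*}   # text before "("
                reps=${entry##*x}      # text after "x", e.g. "3)"
                reps=${reps%")"}       # drop the trailing ")"
                i=0
                while [ "$i" -lt "$reps" ]; do
                    echo "$count"
                    i=$((i + 1))
                done
                ;;
            *)
                echo "$entry"
                ;;
        esac
    done
}

# Hypothetical helper: pair each CPU count with a hostname and print
# "CPUS/HOST" lines. First argument is the CPU list; the rest are hostnames.
make_sshlogin_lines() {
    cpus=$1
    shift
    expand_cpus_per_node "$cpus" | while IFS= read -r n; do
        [ "$#" -gt 0 ] || break   # stop if we run out of hostnames
        echo "$n/$1"
        shift
    done
}

# Inside a SLURM job this might be called as (assumed usage):
#   make_sshlogin_lines "$SLURM_JOB_CPUS_PER_NODE" \
#       $(scontrol show hostnames "$SLURM_JOB_NODELIST") > job.sshloginfile
```

Keeping the expansion separate from the pairing makes each piece easy to check by hand on a login node before using it in a job.
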
## Usage

Clone this Git repository into your home directory:

``` sh
# From the command-line
cd  # Go to your home directory
git clone https://github.uconn.edu/HPC/parallel-slurm.git
```

Add the following 3 lines to your SLURM job submission file:

``` sh
# Inside your SLURM submission file
parallel_opts=$(~/parallel-slurm/parallel_opts.sh)
module load parallel
parallel $parallel_opts ... YOUR_PROGRAM ...
```
## Examples

The `*.slurm` files in the [examples/](examples/) directory are
described below.

### Example 01: Hostname

This minimal example simply writes the compute node names to
`submit.out`.

``` sh
# From the command-line
cd ~/parallel-slurm/examples
sbatch 01-submit-hostname.slurm
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```

The last few lines of your output should show the nodes on which your 5
CPUs were allocated and the `hostname` command was run; for example:

``` sh
cn328
cn327
cn327
cn328
cn327
```
### Example 02: Resumable

Parallel tasks often need to recover from failure. Tasks can fail
when they start late and are killed by the SLURM job time limit, or
when a simulation intermittently fails to converge. In
both cases, re-running the failed task can succeed.

This example shows how to automatically retry failed tasks. This
requires the `--joblog` and `--resume-failed` options of GNU Parallel.
The `--joblog` option records completed tasks in a file, both successes
and failures. The `--resume-failed` option tells GNU Parallel to skip
successful tasks but to run failed, incomplete, and
unattempted tasks.

If for some reason you need to re-run a successfully completed task,
delete the relevant line in the joblog file; to re-run everything,
delete the entire joblog file.
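
The two options can be combined as in the sketch below. The wrapper name `run_resumable` is hypothetical, and the SLURM-specific `$parallel_opts` are left out so that the snippet can also be tried on a single machine with GNU Parallel installed; the repository's own `02-submit-resumable.slurm` may be structured differently.

```sh
# Hypothetical wrapper: run one command per task ID and record each outcome
# in a joblog, so that a later call only repeats what did not succeed.
run_resumable() {
    joblog=$1   # path to the joblog file
    cmd=$2      # program that takes a task ID as its only argument
    shift 2     # the remaining arguments are the task IDs
    parallel --joblog "$joblog" --resume-failed "$cmd" ::: "$@"
}

# Assumed usage, repeated until every task has succeeded:
#   run_resumable joblog ./script_that_sometimes_fails.sh 1 2 3 4 5
```

Because GNU Parallel exits with a non-zero status while any task is still failing, the wrapper's exit status can be used to decide whether another submission is needed.
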

To run the example:

``` sh
# From the command-line
cd ~/parallel-slurm/examples
rm -f joblog submit.out
for i in {1..5}; do sbatch 02-submit-resumable.slurm; done
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```

The output below shows some tasks intermittently failing and some
succeeding from our random number generator script. This example has
been intentionally set up so that 3 out of 4 tasks will fail. By the
time the 5th SLURM job runs, all the tasks have succeeded. This is why
the last job had nothing to do and completed in 1 second.

```
Started SLURM job 2346932
Task 5 started (seed 2346932, random number 0) ... succeeded!
Task 3 started (seed 2346932, random number 1) ... failed!
Task 1 started (seed 2346932, random number 2) ... failed!
Task 4 started (seed 2346932, random number 3) ... failed!
Task 2 started (seed 2346932, random number 4) ... succeeded!
Completed SLURM job 2346932 in 00:00:05
Started SLURM job 2346933
Task 4 started (seed 2346933, random number 2) ... failed!
Task 1 started (seed 2346933, random number 3) ... failed!
Task 3 started (seed 2346933, random number 4) ... succeeded!
Completed SLURM job 2346933 in 00:00:05
Started SLURM job 2346934
Task 4 started (seed 2346934, random number 1) ... failed!
Task 1 started (seed 2346934, random number 4) ... succeeded!
Completed SLURM job 2346934 in 00:00:04
Started SLURM job 2346935
Task 4 started (seed 2346935, random number 0) ... succeeded!
Completed SLURM job 2346935 in 00:00:00
Started SLURM job 2346936
Completed SLURM job 2346936 in 00:00:01
```

The `joblog` file shows the failing tasks with an "Exitval" of 1:

```console
$ cat joblog
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
5 cn332 1557788313.413 0.199 0 62 0 0 ./script_that_sometimes_fails.sh 5
3 cn332 1557788313.187 1.162 0 59 1 0 ./script_that_sometimes_fails.sh 3
1 cn332 1557788312.971 2.197 0 59 1 0 ./script_that_sometimes_fails.sh 1
4 cn332 1557788313.296 3.175 0 59 1 0 ./script_that_sometimes_fails.sh 4
2 cn332 1557788313.080 4.209 0 62 0 0 ./script_that_sometimes_fails.sh 2
4 cn332 1557788318.093 2.180 0 59 1 0 ./script_that_sometimes_fails.sh 4
1 cn332 1557788317.867 3.220 0 59 1 0 ./script_that_sometimes_fails.sh 1
3 cn332 1557788317.976 4.207 0 62 0 0 ./script_that_sometimes_fails.sh 3
4 cn332 1557788322.804 1.471 0 59 1 0 ./script_that_sometimes_fails.sh 4
1 cn332 1557788322.695 4.200 0 62 0 0 ./script_that_sometimes_fails.sh 1
4 cn332 1557788327.417 0.162 0 62 0 0 ./script_that_sometimes_fails.sh 4
```
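
To check which tasks still need another attempt, the joblog can be summarized with a short helper. The sketch below is not part of this repository and the name `failed_tasks` is hypothetical. It assumes that lines are appended in completion order, so the last line for a given "Seq" reflects the most recent attempt, and it treats a non-zero "Exitval" or "Signal" as a failure.

```sh
# Hypothetical helper: print the Seq numbers whose most recent joblog entry
# has a non-zero Exitval (column 7) or Signal (column 8).
failed_tasks() {
    awk 'NR > 1 { failed[$1] = ($7 != 0 || $8 != 0) }   # later lines overwrite earlier ones
         END    { for (seq in failed) if (failed[seq]) print seq }' "$1" | sort -n
}

# Assumed usage:
#   failed_tasks joblog
```

Run against the complete joblog shown above, the helper prints nothing, because every task eventually succeeded.
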
### Example 03: Parameter Sweep

A parameter sweep runs the same program with different combinations of
input parameters.

This example is nearly the same as the previous example, but instead
of using the task ID as the input, it uses a long list of record
IDs. The program reads the record ID to find the corresponding
input parameters, calculates the result from those
parameters, and saves the result in a directory.
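
As an illustration of mapping a record ID to its parameters, the sketch below reproduces the combinations visible in the example output further down. The function name `record_params` is hypothetical, and the grid (10 values of `x` from 0.1 to 1.0, 2 values of `y`, 3 values of `z`) is inferred from that output. The real `model.py` may instead look the parameters up in a file.

```sh
# Hypothetical helper: print the parameter combination for a record ID,
# assuming a 10 x 2 x 3 grid in which z varies fastest and x slowest.
record_params() {
    awk -v id="$1" 'BEGIN {
        split("-0.1 0.1", y, " ")
        split("control positive negative", z, " ")
        k  = id - 1                 # zero-based record index
        xi = int(k / 6)             # 6 records per x value
        yi = int((k % 6) / 3)       # 3 records per y value
        zi = k % 3
        printf "x = %.1f, y = %s, z = %s\n", (xi + 1) / 10, y[yi + 1], z[zi + 1]
    }'
}

# Assumed usage:
#   record_params 44
```
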

Once all records are completed, you can aggregate the results from the
directory into a single results file. Aggregating the files
doesn't really require a SLURM job.
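
One possible way to do the aggregation from the command line is sketched below. The function name `aggregate_results` is hypothetical, and the sketch assumes that each `results/NN.dat` file holds a single value, which the example output suggests but does not confirm.

```sh
# Hypothetical helper: print one "ID<TAB>value" line per result file.
aggregate_results() {
    dir=$1
    for f in "$dir"/*.dat; do
        [ -e "$f" ] || continue          # skip the unmatched glob when no files exist
        id=$(basename "$f" .dat)         # e.g. results/01.dat -> 01
        printf '%s\t%s\n' "$id" "$(cat "$f")"
    done
}

# Assumed usage:
#   aggregate_results results > results.tsv
```

Because the file names are zero-padded, the shell's glob ordering already lists the records in numeric order.
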

GNU Parallel automatically feeds the next record ID to each task worker
as soon as that worker finishes a record and becomes available.

```sh
# From the command-line
module purge
module load python/2.7.6-gcc-unicode
cd ~/parallel-slurm/examples
rm -rf joblog submit.out results/
for i in {1..3}; do sbatch 03-submit-param-sweep.slurm; done
touch submit.out && tail -f submit.out
# Hit Ctrl+C to exit
```

Output:

```
Started SLURM job 2346922
Running 60 of total 60 simulations.
1: Fitting model to parameters: x = 0.1, y = -0.1, z = control ...
1: ... done! Saved result 4.196 to results/01.dat
2: Fitting model to parameters: x = 0.1, y = -0.1, z = positive ...
2: ... done! Saved result 1.574 to results/02.dat
3: Fitting model to parameters: x = 0.1, y = -0.1, z = negative ...
3: ... done! Saved result 12.589 to results/03.dat
...
44: Fitting model to parameters: x = 0.8, y = -0.1, z = positive ...
44: ... done! Saved result 1.278 to results/44.dat
slurmstepd: *** JOB 2346922 ON cn338 CANCELLED AT 2019-05-13T18:52:16 DUE TO TIME LIMIT ***
Started SLURM job 2346923
Running 15 of total 60 simulations.
45: Fitting model to parameters: x = 0.8, y = -0.1, z = negative ...
45: ... done! Saved result 10.226 to results/45.dat
...
59: Fitting model to parameters: x = 1.0, y = 0.1, z = positive ...
59: ... done! Saved result 1.250 to results/59.dat
60: Fitting model to parameters: x = 1.0, y = 0.1, z = negative ...
60: ... done! Saved result 10.000 to results/60.dat
Completed SLURM job 2346923 in 00:00:32
Started SLURM job 2346924
Nothing to run; all 60 simulations complete.
Completed SLURM job 2346924 in 00:00:01
```

```console
$ head -n 4 joblog; echo "..."; tail -n 3 joblog
Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
1 cn236 1557787112.519 5.414 0 119 0 0 python model.py 1
3 cn236 1557787112.750 5.250 0 120 0 0 python model.py 3
2 cn236 1557787117.940 5.276 0 429 0 0 python model.py 2
...
58 cn338 1557787220.703 5.284 0 120 0 0 python model.py 58
59 cn338 1557787220.815 5.267 0 121 0 0 python model.py 59
60 cn338 1557787225.850 5.263 0 121 0 0 python model.py 60
```
## Next Steps

Hopefully these examples have inspired you to use GNU Parallel to
parallelize your code. Now you can:

1. Read the [main help page](https://www.gnu.org/software/parallel/man.html)
2. Read the [tutorial](https://www.gnu.org/software/parallel/parallel_tutorial.html)

If you have further questions, contact hpc@uconn.edu.