Python script that runs as a cron job and notifies HPC users about 'non-runnable' jobs, that is, pending jobs that will never run.
The script evaluates the 'reason' code for each pending job.
Currently, it notifies only users whose jobs have a reason of 'PartitionTimeLimit', which means that the time limit the user requested for the job exceeds the maximum allowed for the partition.
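The reason check described above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: it assumes the pending-job listing has already been captured as pipe-delimited lines of job ID, user, and reason (squeue can emit this shape with -t PENDING -h -o "%i|%u|%r"). The names STUCK_REASONS and find_stuck_jobs are hypothetical.

```python
from collections import defaultdict

# Reason codes treated as "will never run". The text names only this one.
STUCK_REASONS = {"PartitionTimeLimit"}


def find_stuck_jobs(listing):
    """Map each user to the IDs of their pending jobs with a stuck reason.

    `listing` is assumed to hold one job per line as "jobid|user|reason".
    """
    stuck = defaultdict(list)
    for line in listing.splitlines():
        fields = line.strip().split("|")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        job_id, user, reason = fields
        if reason in STUCK_REASONS:
            stuck[user].append(job_id)
    return dict(stuck)


sample = """\
1001|alice|PartitionTimeLimit
1002|bob|Priority
1003|alice|PartitionTimeLimit
1004|carol|Resources
"""
print(find_stuck_jobs(sample))  # {'alice': ['1001', '1003']}
```

Keeping the reason codes in a set makes it straightforward to add further codes later without touching the parsing loop.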
Any user with one or more 'non-runnable' jobs is sent a single email that lists all of those jobs, and a record of the email is stored in an SQLite3 database. Even though the cron job runs daily, a user receives at most one email per week.
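A single per-user message of this kind could be assembled with the standard library as sketched below. The sender address, mail domain, subject line, and body wording are all placeholders chosen for illustration, and build_notice is a hypothetical helper. The real values presumably come from the config files.

```python
from email.message import EmailMessage


def build_notice(user, job_ids, sender="hpc-admin@example.org",
                 domain="example.org"):
    """Build one email listing every stuck job owned by `user`.

    `sender` and `domain` are placeholder assumptions, not real settings.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = f"{user}@{domain}"
    msg["Subject"] = f"{len(job_ids)} pending job(s) that will never run"
    lines = [
        "The following pending jobs request more time than their",
        "partition allows (reason: PartitionTimeLimit):",
        "",
    ]
    lines += [f"  {job_id}" for job_id in job_ids]
    lines += ["", "Please cancel them and resubmit with a shorter time limit."]
    msg.set_content("\n".join(lines))
    return msg


notice = build_notice("alice", ["1001", "1003"])
print(notice["To"])
print(notice["Subject"])
```

The resulting EmailMessage can be handed to smtplib.SMTP.send_message for delivery.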
Usage: wait_notify [-Icx] [-n N] [-t ADMIN_EMAIL]
Read Slurm's sinfo output to determine which pending jobs are stuck in a state that will never run, and email each job's owner so they can cancel the job and resubmit it if desired.
Each email is logged in an SQLite3 database, and no email is sent if one has already gone out in the previous week.
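The weekly throttle can be sketched with the sqlite3 module as shown below. The table name, column names, and helper functions are assumptions made for this example and may not match the script's actual schema. Timestamps are stored as ISO 8601 strings so that plain string comparison orders them correctly.

```python
import sqlite3
from datetime import datetime, timedelta


def init_db(conn):
    """Create the (hypothetical) email log table if it does not exist."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS emails ("
        "user TEXT NOT NULL, sent_at TEXT NOT NULL)"
    )
    conn.commit()


def record_email(conn, user, when):
    """Log that `user` was emailed at `when`."""
    conn.execute(
        "INSERT INTO emails (user, sent_at) VALUES (?, ?)",
        (user, when.isoformat(timespec="seconds")),
    )
    conn.commit()


def emailed_recently(conn, user, now, window=timedelta(days=7)):
    """Return True if `user` has a logged email newer than `now - window`."""
    cutoff = (now - window).isoformat(timespec="seconds")
    row = conn.execute(
        "SELECT 1 FROM emails WHERE user = ? AND sent_at > ? LIMIT 1",
        (user, cutoff),
    ).fetchone()
    return row is not None


conn = sqlite3.connect(":memory:")  # in-memory database for the demo
init_db(conn)
now = datetime(2024, 1, 15, 6, 0)
record_email(conn, "alice", now - timedelta(days=3))
record_email(conn, "bob", now - timedelta(days=10))
print(emailed_recently(conn, "alice", now))  # True
print(emailed_recently(conn, "bob", now))    # False
```

With a check like this, a daily cron run skips any user who was emailed within the past seven days, and a force option only needs to bypass emailed_recently.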
OPTIONS
  -I              Run the first time to initialize the SQLite3 database that records emails.
  -x              Send email to users with stuck jobs.

OPTIONS USEFUL FOR TESTING
  -c              Check only. List jobs that are in a cancelled state.
  -t ADMIN_EMAIL  Useful for testing. Use only with -x. Emails will be sent not to users but to ADMIN_EMAIL instead.
  -n N            Only email the first N users.
  -f              Force email, even if one was sent in the past week.
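For illustration, the options above could be parsed with argparse as in the following sketch. The destination names (init_db, send, and so on) are invented for this example, and the real script may parse its arguments differently. The check that -t is used only with -x follows the option description.

```python
import argparse


def build_parser():
    """Build a parser for the documented flags (dest names are hypothetical)."""
    p = argparse.ArgumentParser(
        prog="wait_notify",
        description="Notify users about pending Slurm jobs that will never run.",
    )
    p.add_argument("-I", dest="init_db", action="store_true",
                   help="initialize the SQLite3 database that records emails")
    p.add_argument("-x", dest="send", action="store_true",
                   help="send email to users with stuck jobs")
    p.add_argument("-c", dest="check_only", action="store_true",
                   help="check only, do not send anything")
    p.add_argument("-t", dest="admin_email", metavar="ADMIN_EMAIL",
                   help="send all emails to ADMIN_EMAIL instead of users")
    p.add_argument("-n", dest="max_users", type=int, metavar="N",
                   help="only email the first N users")
    p.add_argument("-f", dest="force", action="store_true",
                   help="force email even if one was sent in the past week")
    return p


def parse_args(argv=None):
    """Parse arguments and enforce that -t is used only with -x."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.admin_email and not args.send:
        parser.error("-t may only be used together with -x")
    return args


args = parse_args(["-x", "-t", "admin@example.org", "-n", "5"])
print(args.send, args.admin_email, args.max_users)  # True admin@example.org 5
```

Because the single-letter flags are plain switches, argparse also accepts them combined, for example -cx.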
See the files user_notify_config.py and wait_notify_config.py.