Synopsis
HyperQueue manages large workflows on HPC systems. It allows you to execute a large number of tasks without having to manually submit individual Slurm jobs.
It is also possible to add tasks to a job or to extend the pool of running workers to increase the capacity.
The software is developed by IT4Innovations.
Introduction
HyperQueue consists of two components:
- Server: a lightweight process running on a login node. The server accepts tasks added to a queue (via CLI).
- Worker: a process running on one or more compute nodes. The worker registers itself at the server on the login node and retrieves tasks to be executed on the node.
Usage on Snellius
Step 1: Load the HyperQueue module
HyperQueue is installed on the software stack and can be loaded as follows:
module load 2023
module load HyperQueue/0.19.0
Step 2: Launch the HyperQueue server
Submitting tasks to the queue or launching a worker requires a running server on one of the login nodes. The server should run persistently in the background as long as there are active workers connected to the system.
nohup hq server start &
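Starting a second server by accident is easy when several login sessions are open. The following is a minimal sketch of a guard around the start command. It assumes that `hq server info` exits with a non-zero status when no server is reachable; the function name and the log file name `hq-server.log` are hypothetical choices for illustration.

```shell
#!/bin/bash
# Sketch: start the HyperQueue server only if none is reachable yet.
# Assumption: `hq server info` fails (non-zero exit) when no server is running.
# The log file name hq-server.log is a hypothetical placeholder.
start_server_if_needed() {
    if hq server info >/dev/null 2>&1; then
        echo "HyperQueue server already running"
    else
        # Same start command as above, with output kept in a log file.
        nohup hq server start > hq-server.log 2>&1 &
        echo "HyperQueue server started in the background"
    fi
}
```

After loading the module, call `start_server_if_needed` on the login node.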
Step 3: Adding jobs to the queue
When the server is running, new tasks can be added to the queue:
hq submit --cwd hello3 --pin taskset --cpus=1 /path/to/program arg1 arg2
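To run the same program over many inputs, HyperQueue can create a whole set of tasks with one submission (a task array). The sketch below assumes the `--array` option of `hq submit` and the `HQ_TASK_ID` environment variable that HyperQueue sets for each task; the script `task.sh` and its output are hypothetical placeholders for a real application.

```shell
#!/bin/bash
# Sketch: run one script for task IDs 1..10 as a single HyperQueue job.
# Assumption: task.sh is a hypothetical stand-in for the real program.

# Write a small task script; each task reads its own ID from HQ_TASK_ID.
cat > task.sh <<'EOF'
#!/bin/bash
echo "processing input ${HQ_TASK_ID}"
EOF
chmod +x task.sh

# Submit only when the hq client is available (module loaded).
if command -v hq >/dev/null 2>&1; then
    hq submit --array 1-10 --cpus=1 ./task.sh
else
    echo "hq not found; load the HyperQueue module first" >&2
fi
```

Each of the ten tasks requests one CPU, so a worker with many cores can run several of them at the same time.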
Step 4: Launch workers on compute nodes
A HyperQueue worker runs on a compute node inside a Slurm job. Once the server has been launched on the login node, workers can register from the compute nodes to signal that they are ready to accept tasks. The following Slurm script requests two nodes and launches a HyperQueue worker on the allocated compute nodes:
#!/bin/bash
#SBATCH --job-name=HyperQueue
#SBATCH --nodes=2
#SBATCH --time=00:04:00
#SBATCH --partition=rome
module load 2023
module load HyperQueue/0.19.0
module load OpenMPI/4.1.5-GCC-12.3.0
# load further modules required by the application
srun --overlap hq worker start --manager slurm --idle-timeout=5min
The job can be submitted as follows:
sbatch hyper_queue.sh
The order of Step 3 and Step 4 does not matter, and both steps can be repeated multiple times:
- New workers can be launched at any time to increase the compute capacity of the HyperQueue server.
- New tasks can be added to the queue at any time.
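While tasks and workers come and go, it helps to check what the server currently sees. The following sketch wraps the `hq worker list` and `hq job list` subcommands in one helper; the function name `show_status` and the section labels are hypothetical.

```shell
#!/bin/bash
# Sketch: print the workers and jobs currently known to the server.
# Assumption: run on a node where the hq client can reach the server.
show_status() {
    echo "== workers =="
    hq worker list
    echo "== jobs =="
    hq job list
}
```

Running `show_status` from the login node shows whether newly launched workers have registered and how far the submitted jobs have progressed.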
Step 5: Stop the server
When HyperQueue is no longer needed, stop the server as follows (from any interactive node):
hq server stop
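Stopping the server while tasks are still queued would leave them unfinished. A possible shutdown sequence is sketched below. It assumes that the job and worker subcommands accept the selector `all` (as in `hq job wait all` and `hq worker stop all`); the function name `shutdown_hq` is hypothetical.

```shell
#!/bin/bash
# Sketch: wait for all jobs, then stop the workers and the server.
# Assumption: the selector "all" is accepted by `hq job wait` and `hq worker stop`.
shutdown_hq() {
    hq job wait all      # block until every submitted job has finished
    hq worker stop all   # ask all connected workers to exit
    hq server stop       # finally stop the server itself
}
```

The steps run one after another without aborting on errors, so the server is still stopped even if some jobs ended in a failed state.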
Further Reading
HyperQueue offers many more features that are not covered in this documentation.
More details on the software such as use cases, command line options, APIs etc. can be found on the HyperQueue software web page.