HyperQueue

Synopsis

HyperQueue manages large workflows on HPC systems. It allows you to execute a large number of tasks without having to manually submit individual Slurm jobs.
It is also possible to add tasks to a job or to extend the pool of running workers to increase the capacity.
The software is developed by IT4Innovations.

Introduction

HyperQueue consists of two components:

Server: a lightweight process running on a login node. The server accepts tasks added to a queue (via CLI).
Worker: a process running on one or more compute nodes. The worker registers itself at the server on the login node and retrieves tasks to be executed on the node.

Usage on Snellius

Step 1: Load the HyperQueue module

HyperQueue is installed on the software stack and can be loaded as follows:

module load 2023
module load HyperQueue/0.19.0

Step 2: Launch the HyperQueue server

Submitting tasks to the queue or launching a worker requires a running server on one of the login nodes. The server should run persistently in the background as long as there are active workers connected to the system.

nohup hq server start &
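
Before starting the server, it can help to check whether one is already reachable, so that a second server is not started by accident. The following is a minimal sketch, not an official procedure. It assumes that hq server info returns a non-zero exit status when no server is reachable. The function name hq_server_status and the log file name hq-server.log are illustrative choices, not part of HyperQueue.

```shell
#!/bin/bash
# Report whether a HyperQueue server is reachable from this shell.
# Assumption: `hq server info` exits non-zero when no server is reachable.
hq_server_status() {
    if ! command -v hq >/dev/null 2>&1; then
        echo "hq-not-loaded"      # the HyperQueue module has not been loaded
    elif hq server info >/dev/null 2>&1; then
        echo "running"
    else
        echo "not-running"
    fi
}

status="$(hq_server_status)"
echo "HyperQueue server status: ${status}"

# Start a server only if none is reachable; keep its output in a log file
# (hq-server.log is a hypothetical name).
if [ "${status}" = "not-running" ]; then
    nohup hq server start > hq-server.log 2>&1 &
fi
```

If the status is hq-not-loaded, load the modules from Step 1 and run the check again.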

Step 3: Add tasks to the queue

When the server is running, new tasks can be added to the queue:

hq submit --cwd hello3 --pin taskset --cpus=1 /path/to/program arg1 arg2
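
When many similar tasks are needed, a single submission can create one task per line of an input file instead of one hq submit call per task. The sketch below uses the --each-line option of hq submit, which passes the current line to each task in the HQ_ENTRY environment variable. The file names inputs.txt and process_entry.sh and the sample entries are hypothetical and only serve as an illustration.

```shell
#!/bin/bash
# Hypothetical input list: one entry per line, one task per entry.
for i in 1 2 3 4 5; do
    echo "sample_${i}.dat"
done > inputs.txt

# Hypothetical per-task script; HyperQueue provides the current line in HQ_ENTRY.
cat > process_entry.sh <<'EOF'
#!/bin/bash
echo "processing ${HQ_ENTRY}"
EOF
chmod +x process_entry.sh

# Submit one task per line, but only if the hq client is available.
if command -v hq >/dev/null 2>&1; then
    hq submit --each-line inputs.txt --cpus=1 ./process_entry.sh
else
    echo "hq not found: load the HyperQueue module first" >&2
fi
```

In a real workflow, process_entry.sh would call the actual program with the entry as its argument.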

Step 4: Launch workers on compute nodes

A HyperQueue worker runs on a compute node inside a Slurm job. Once the server has been launched on the login node, workers can register with it from the compute nodes, indicating that they are ready to accept tasks. The following Slurm script launches HyperQueue workers on the compute nodes of the job:

hyper_queue.sh

#!/bin/bash
#SBATCH --job-name=HyperQueue
#SBATCH --nodes=2
#SBATCH --time=00:04:00
#SBATCH --partition=rome

module load 2023
module load HyperQueue/0.19.0
module load OpenMPI/4.1.5-GCC-12.3.0
# load further modules required by the application

srun --overlap hq worker start --manager slurm --idle-timeout=5min

The job can be submitted as follows:

sbatch hyper_queue.sh
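
After the Slurm job has been submitted, it is worth confirming that the job is running and that the workers have registered with the server. The sketch below shows one way to do this from the login node. squeue is the standard Slurm command, and hq worker list and hq job list are HyperQueue client commands. The helper function run_if_available is a hypothetical name introduced here so that the snippet also runs in a shell where a tool is missing.

```shell
#!/bin/bash
# Run a command only if its executable is available; otherwise print a note.
run_if_available() {
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    else
        echo "skipped: $1 is not available in this shell"
    fi
}

# Slurm view: is the worker job pending or running?
run_if_available squeue -u "${USER:-$(id -un)}"

# HyperQueue view: which workers are connected, and which jobs are queued?
run_if_available hq worker list
run_if_available hq job list
```

Workers appear in the worker list only after the Slurm job has started, so an empty list usually means the job is still pending.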

Step 3 and Step 4 can be performed in either order and repeated as often as needed:

New workers can be launched at any time to increase the compute capacity available to the HyperQueue server.
New tasks can be added to the queue at any time.

Step 5: Stop the server

When HyperQueue is no longer needed, please stop the server as follows (from any interactive node):

hq server stop
