Interactive jobs are mostly useful for testing purposes, as you can experiment with executing your programs and immediately see any resulting error messages.
You can start an interactive job by passing the --pty flag to srun, e.g.

srun -n 1 -c 16 -t 1:00:00 --pty /bin/bash
The interactive job is put in the queue, like any other batch job. Note therefore that, depending on the load of the system, it can take a while before your interactive session starts. Once a node is available, the terminal where you submitted the interactive job will automatically start a bash shell on the node that is allocated to you. You can then interactively execute commands within this bash instance on the node.
As with regular batch jobs, the walltime determines how long the node is reserved. When the walltime expires, you are logged out automatically and your terminal returns to the login node. If you log out before the walltime expires, the interactive job finishes automatically. Hence, if you submitted an interactive job of 1 hour but log out after 5 minutes, your budget will only be charged for 5 minutes.
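As a back-of-the-envelope illustration of why logging out early matters (the exact accounting units are system-specific; this is illustrative arithmetic only):

```shell
# Illustrative only: real accounting units depend on the system.
# A 1-hour, 16-core interactive job that you leave after 5 minutes:
CORES=16
REQUESTED_MINUTES=60
USED_MINUTES=5

echo "requested: $(( CORES * REQUESTED_MINUTES )) core-minutes"
echo "charged:   $(( CORES * USED_MINUTES )) core-minutes"
```

So exiting after 5 minutes consumes 80 core-minutes of budget rather than the full 960 that a 1-hour reservation represents.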
If desired, you can request multiple nodes (assuming the nodes you are requesting have 16 CPU cores), e.g. using
srun -n 32 -t 4:00:00 -W 0 --pty /bin/bash
This may be useful in cases where you are debugging a jobscript or software that runs on multiple nodes. At the start of your interactive session, you are logged in to one of the two nodes allocated to you, but you can inspect the SLURM_JOB_NODELIST environment variable to check which other nodes were allocated. Then, you can use ssh to log in to one of these other nodes.
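On the cluster, `scontrol show hostnames "$SLURM_JOB_NODELIST"` expands the compressed nodelist into one hostname per line. The bash sketch below mimics that expansion for the simple single-range case, using a made-up nodelist value (we are not inside a job here, so SLURM has not set the variable for us):

```shell
# Hypothetical value; inside a real job SLURM sets SLURM_JOB_NODELIST for you,
# e.g. "tcn[21-22]" for a two-node allocation.
SLURM_JOB_NODELIST="tcn[21-22]"

# On the cluster you would simply run:
#   scontrol show hostnames "$SLURM_JOB_NODELIST"
# This function mimics that for a simple prefix[start-end] list.
expand_nodelist() {
    local list=$1
    if [[ $list == *"["* ]]; then
        local prefix=${list%%\[*}        # "tcn"
        local range=${list#*\[}          # "21-22]"
        range=${range%]}                 # "21-22"
        local start=${range%-*}
        local end=${range#*-}
        local i
        for (( i=start; i<=end; i++ )); do
            echo "${prefix}${i}"
        done
    else
        echo "$list"                     # single node, nothing to expand
    fi
}

expand_nodelist "$SLURM_JOB_NODELIST"
```

Each printed hostname is a node you can ssh into from within your interactive session.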
Using salloc
Alternatively, SLURM provides a way to allocate resources with the salloc command, e.g.:

salloc -n 32 -t 4:00:00

You can then check which node(s) have been allocated to you using the squeue command:
$ squeue
   JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
  111111       gpu interact    user1  R      20:30      1 tcn21
and login to the node to work there interactively for the specified time. For example, for the case above:
$ ssh tcn21
You can read more information about the salloc command and the possible flags to use here.
salloc starts a new shell
The salloc command normally takes a user command to execute in the allocation, e.g. salloc -t 2:00:00 -p thin ./mycomputation. If you do not provide such a command, salloc will start your default shell instead:
# Case 1: salloc with a command to execute
# Note the "tcn426" line, the output of "srun hostname" on the allocated node
snellius paulm@int3 12:36 ~$ salloc -t 00:00:05 -p thin srun hostname
salloc: Pending job allocation 1572110
salloc: job 1572110 queued and waiting for resources
salloc: job 1572110 has been allocated resources
salloc: Granted job allocation 1572110
salloc: Waiting for resource configuration
salloc: Nodes tcn426 are ready for job
tcn426
salloc: Relinquishing job allocation 1572110
salloc: Job allocation 1572110 has been revoked.

# salloc is done at this point, and its process is no longer running
snellius paulm@int3 12:39 ~$ ps faux | grep paulm
root     3287368  0.0  0.0 138752  9988 ?        Ss   12:34   0:00  \_ sshd: paulm [priv]
paulm    3289620  0.0  0.0 138752  5524 ?        S    12:34   0:00  |   \_ sshd: paulm@pts/168
paulm    3289621  0.0  0.0  22568  7276 pts/168  Ss   12:34   0:00  |       \_ -bash
paulm    3313407  0.0  0.0  53620  5668 pts/168  R+   12:40   0:00  |           \_ ps faux
paulm    3313408  0.0  0.0  12136  1076 pts/168  S+   12:40   0:00  |           \_ grep --color=auto paulm

# Case 2: salloc without a command to execute
snellius paulm@int3 12:40 ~$ salloc -t 00:00:05 -p thin
salloc: Pending job allocation 1572114
salloc: job 1572114 queued and waiting for resources
salloc: job 1572114 has been allocated resources
salloc: Granted job allocation 1572114
salloc: Waiting for resource configuration
salloc: Nodes tcn370 are ready for job

# We're still on the int3 login node, but the current shell we're working in
# (process 3316940) has been started by salloc
snellius paulm@int3 12:41 ~$ ps faux | grep paulm
root     3287368  0.0  0.0 138752  9988 ?        Ss   12:34   0:00  \_ sshd: paulm [priv]
paulm    3289620  0.0  0.0 138752  5524 ?        S    12:34   0:00  |   \_ sshd: paulm@pts/168
paulm    3289621  0.0  0.0  22568  7276 pts/168  Ss   12:34   0:00  |       \_ -bash
paulm    3314101  0.0  0.0 127380  6728 pts/168  Sl   12:40   0:00  |           \_ salloc -t 00:00:05 -p thin
paulm    3316940  0.3  0.0  22572  7480 pts/168  S    12:41   0:00  |               \_ /bin/bash
paulm    3318849  0.0  0.0  53624  5720 pts/168  R+   12:41   0:00  |                   \_ ps faux
paulm    3318850  0.0  0.0  12140  1148 pts/168  S+   12:41   0:00  |                   \_ grep --color=auto paulm

# Note the output! The hostname command is not executed on int3, but on the allocated node tcn370
snellius paulm@int3 12:41 ~$ srun hostname
tcn370

# As we only indicated a very short wallclock time the allocation quickly ends
snellius paulm@int3 12:41 ~$ salloc: Job 1572114 has exceeded its time limit and its allocation has been revoked.

# Since there's no allocation anymore, the srun command fails
snellius paulm@int3 12:43 ~$ srun hostname
srun: error: Slurm job 1572114 has expired
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 1572114

# However, we're currently still working within the shell (process 3316940) that was started by salloc!
snellius paulm@int3 12:43 ~$ ps faux | grep paulm
root     3287368  0.0  0.0 138752  9988 ?        Ss   12:34   0:00  \_ sshd: paulm [priv]
paulm    3289620  0.0  0.0 138752  5524 ?        S    12:34   0:00  |   \_ sshd: paulm@pts/168
paulm    3289621  0.0  0.0  22568  7276 pts/168  Ss   12:34   0:00  |       \_ -bash
paulm    3314101  0.0  0.0 127380  6728 pts/168  Sl   12:40   0:00  |           \_ salloc -t 00:00:05 -p thin
paulm    3316940  0.0  0.0  22572  7484 pts/168  S    12:41   0:00  |               \_ /bin/bash
paulm    3327419  0.0  0.0  53624  5664 pts/168  R+   12:43   0:00  |                   \_ ps faux
paulm    3327420  0.0  0.0  12140  1080 pts/168  S+   12:43   0:00  |                   \_ grep --color=auto paulm

snellius paulm@int3 12:51 ~$ sleep 10 &
[1] 3367698
snellius paulm@int3 12:51 ~$ ps faux | grep paulm
root     3287368  0.0  0.0 138752  9988 ?        Ss   12:34   0:00  \_ sshd: paulm [priv]
paulm    3289620  0.0  0.0 138752  5524 ?        S    12:34   0:00  |   \_ sshd: paulm@pts/168
paulm    3289621  0.0  0.0  22568  7276 pts/168  Ss   12:34   0:00  |       \_ -bash
paulm    3314101  0.0  0.0 127380  6728 pts/168  Sl   12:40   0:00  |           \_ salloc -t 00:00:05 -p thin
paulm    3316940  0.0  0.0  22572  7484 pts/168  S    12:41   0:00  |               \_ /bin/bash
paulm    3367698  0.0  0.0   7312   904 pts/168  S    12:51   0:00  |                   \_ sleep 10    <-------------------
paulm    3367994  0.0  0.0  53624  5696 pts/168  R+   12:51   0:00  |                   \_ ps faux
paulm    3367995  0.0  0.0  12140  1048 pts/168  S+   12:51   0:00  |                   \_ grep --color=auto paulm
So if you run salloc without a command, you need to type exit after the allocation expires in order to stop the shell launched by salloc. Otherwise, when repeatedly using salloc, you might end up keeping alive a whole hierarchy of salloc-spawned shells.
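One way to keep track of this is to check the SLURM_JOB_ID environment variable, which salloc exports into the shell it spawns. A minimal sketch (note the variable remains set even after the allocation has expired, so treat it as a hint that you are still inside a salloc shell, not as proof of a live allocation):

```shell
# Sketch: if SLURM_JOB_ID is set, this shell was started under an
# allocation (or is left over from one) and should eventually be exited.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    echo "shell belongs to allocation ${SLURM_JOB_ID} - remember to exit"
else
    echo "no salloc allocation in this shell"
fi
```

Running this check (or putting a variant of it in your prompt) makes it obvious when you are still inside a leftover salloc shell.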