Introduction

Snellius provides a GPU node for interactive software development and for compiling codes that use GPUs.
This page explains how to connect to this node and includes an example compilation.
The node is meant for users who want to compile their GPU codes on Snellius and perform small test runs lasting no more than a few minutes.
Like every A100 GPU node on Snellius, this node has 4 GPU cards. Each card is divided into 7 MIG instances, giving a total of 28 MIG instances available to users.

Restrictions for the interactive GPU node

  • A user needs to have access to the gpu partition and at least 1 SBU of GPU budget.
  • This node is not exposed to the external world and is therefore not a login node: you reach it via the login nodes (the 'int' nodes).
  • This node is not meant for production runs; it is meant for sanity checks of your code.
  • This is a shared node and the regular usage policy applies: any process you run here for more than 15 minutes is killed automatically.
  • Once you assign yourself a MIG instance, you need to load the modules your environment requires, including at least the CUDA runtime libraries.
  • MIG instances are slices of a GPU in terms of memory and CUDA cores, so a full GPU is not available to you on this node.
    If you need a full GPU for your testing, allocate a full node using salloc or start an interactive session using srun.
  • A MIG instance is not exclusive, so you may run out of GPU memory when running your software.

Logging into the GPU node

Log into int3 using ssh from a login node:

[satishk@int4 ~]$ ssh int3
Last login: Mon Jun 19 11:03:45 2023 from 172.18.63.192

Troubleshooting

If you cannot ssh into int3, check that you have access to the GPU partition (the 'Partition' column contains 'gpu'):

[satishk@int4 ~]$ sacctmgr show user -s
      User   Def Acct     Admin    Cluster    Account  Partition     Share   Priority MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS 
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- --------- 
   satishk    satishk      None   snellius    satishk        gpu         1                                                                                         gpu           

Also check that you have a positive GPU budget:

[satishk@int4 ~]$ accinfo --product gpu


If you have access to the GPU partition and a positive budget, but still cannot log in to int3, please contact the service desk (https://servicedesk.surf.nl).
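
The partition check above can also be scripted. The following is a minimal sketch, not an official tool: the function name has_gpu_partition is hypothetical, and it assumes that sacctmgr is asked to print one partition name per line (-n suppresses the header, -P selects parsable output, format=partition limits the columns).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the partition list read from stdin
# contains a line that is exactly "gpu".
has_gpu_partition() {
    grep -qx 'gpu'
}

# Intended use on a login node (shown as a comment, not executed here):
#   sacctmgr -nP show associations user="$USER" format=partition \
#       | has_gpu_partition \
#       && echo "gpu partition: OK" \
#       || echo "no access to the gpu partition"
```

Reading from standard input keeps the helper independent of how the partition list is produced, so it can be tried on saved output as well.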

Compilation and testing

In this section we compile a simple CUDA application that performs ping-pong cycles using CUDA-aware MPI. In this case the GPU instances are treated as individual devices and assigned to each MPI rank using a wrapper script.

  • Please note that the modules used in this example are from the 2022 environment; this can vary based on the environment your application uses.
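
The wrapper script mentioned above is not reproduced on this page. As an illustration only, a wrapper of this kind could look like the sketch below. The file name mig_wrapper.sh and the function names are hypothetical, and the sketch assumes that the launcher exports the node-local rank as OMPI_COMM_WORLD_LOCAL_RANK (Open MPI) or SLURM_LOCALID (Slurm); adapt this to your MPI installation.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (mig_wrapper.sh): pin each MPI rank to one MIG instance.

# Print the MIG UUIDs found in `nvidia-smi -L`-style text read from stdin.
list_mig_uuids() {
    sed -nr 's|^.*UUID:\s*(MIG-[^)]+)\)|\1|p'
}

# Pick a UUID for a node-local rank, wrapping around when there are
# more ranks than instances. Usage: pick_mig_for_rank RANK UUID...
pick_mig_for_rank() {
    local rank=$1; shift
    local uuids=("$@")
    (( ${#uuids[@]} > 0 )) || return 1
    echo "${uuids[rank % ${#uuids[@]}]}"
}

# Act as a wrapper only when a command to run was passed as arguments.
if (( $# > 0 )); then
    mapfile -t migs < <(nvidia-smi -L | list_mig_uuids)
    # Assumption: one of these variables holds the node-local rank.
    local_rank=${OMPI_COMM_WORLD_LOCAL_RANK:-${SLURM_LOCALID:-0}}
    CUDA_VISIBLE_DEVICES=$(pick_mig_for_rank "$local_rank" "${migs[@]}") || exit 1
    export CUDA_VISIBLE_DEVICES
    exec "$@"
fi
```

A launch line would then have the form mpirun -np 2 ./mig_wrapper.sh ./pingpong, where ./pingpong stands for your own compiled binary (a placeholder name). Because MIG instances are shared, two ranks mapped to different instances may still compete with other users for memory.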


MIG instances

If we run nvidia-smi we see that the four available GPUs are split into a total of 4 x 7 = 28 MIG devices:

 
[satishk@int3 ~]$ nvidia-smi 
Tue Jun 27 15:47:50 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:31:00.0 Off |                  Off |
| N/A   28C    P0    38W / 400W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:32:00.0 Off |                  Off |
| N/A   28C    P0    38W / 400W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:CA:00.0 Off |                  Off |
| N/A   27C    P0    39W / 400W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:E3:00.0 Off |                  Off |
| N/A   28C    P0    37W / 400W |     45MiB / 40960MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   14   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   11   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   12   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   13   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   14   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   10   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   11   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   12   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   13   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   11   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   12   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   13   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   14   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


Assign yourself a different MIG instance

MIG instances are not exclusive, which means that another user may already be using the MIG instance you are trying to use. In that case you can assign yourself another MIG instance.

First, check which processes are running on a particular MIG instance using the nvidia-smi command.
Then, to assign yourself a different MIG instance, you can use the code snippet below:

[satishk@int3 ~]$ mig=($(nvidia-smi -L | sed -nr "s|^.*UUID:\s*(MIG-[^)]+)\)|\1|p"))
[satishk@int3 ~]$ 
[satishk@int3 ~]$ # Examples of MIG-ids
[satishk@int3 ~]$ echo ${mig[0]}
MIG-bdc1d762-d094-5868-b40a-902670ebb9c9
[satishk@int3 ~]$ echo ${mig[1]}
MIG-b190659d-78f2-514d-a2b7-afdae3427be8
[satishk@int3 ~]$ echo ${mig[2]}
MIG-bb348148-354e-5f5c-8bd8-cc0692b6429a
[satishk@int3 ~]$
[satishk@int3 ~]$ # Setting CUDA_VISIBLE_DEVICES to only see a specific MIG-id
[satishk@int3 ~]$ export CUDA_VISIBLE_DEVICES=${mig[14]}
[satishk@int3 ~]$ echo $CUDA_VISIBLE_DEVICES 
MIG-56588794-51ce-5f03-b567-ec36a8e651a7

As the snippet above shows, you first load the IDs of the MIG instances into a bash array.
You can then assign a specific ID to the environment variable CUDA_VISIBLE_DEVICES; here it is the ID of the 15th MIG instance (index 14, because bash arrays are zero-based).
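
For repeated use, the steps above can be wrapped in a small shell function. This is a sketch under stated assumptions, not part of the Snellius tooling: the name use_mig is hypothetical, and the optional second argument (text in the format printed by nvidia-smi -L) exists only so that the function can be tried without a GPU.

```shell
# Hypothetical helper for an interactive bash shell: export
# CUDA_VISIBLE_DEVICES for the MIG instance at a 0-based index and
# refuse indices that do not exist.
# Usage: use_mig INDEX [LISTING]   (LISTING defaults to `nvidia-smi -L`)
use_mig() {
    local idx=$1
    local listing=${2:-$(nvidia-smi -L)}
    local -a mig
    # Same extraction as in the snippet above: keep only the MIG UUIDs.
    mapfile -t mig < <(sed -nr 's|^.*UUID:\s*(MIG-[^)]+)\)|\1|p' <<<"$listing")
    # Accept only plain non-negative integers below the number of instances.
    if ! [[ $idx =~ ^(0|[1-9][0-9]*)$ ]] || (( idx >= ${#mig[@]} )); then
        echo "use_mig: index '$idx' out of range (found ${#mig[@]} MIG instances)" >&2
        return 1
    fi
    export CUDA_VISIBLE_DEVICES=${mig[idx]}
}
```

After defining the function in your shell, use_mig 14 selects the same instance as the export line in the snippet above, and an index beyond the last instance leaves CUDA_VISIBLE_DEVICES unchanged.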
