Introduction
A GPU node on Snellius is available for interactive software development and compiling codes that utilize GPUs.
This page includes instructions on how to connect to this node and an example compilation.
This node is meant for users who want to compile their GPU codes on Snellius and perform small test runs lasting no more than a few minutes.
Like any other A100 GPU node on Snellius, this node contains 4 GPU cards, each divided into 7 MIG (Multi-Instance GPU) instances, resulting in a total of 28 MIG instances available to users.
Restrictions for the interactive GPU node
- A user needs to have access to the gpu partition and at least 1 SBU of GPU budget.
- This node is not exposed to the external world and is not a login node; it can only be reached via the login nodes (i.e. the 'int' nodes).
- This node is not meant for production runs; it is meant for sanity checking of your code.
- This is a shared node and the regular usage policy applies, meaning your jobs will automatically be killed if they run for more than 15 minutes.
- Once you assign yourself a MIG instance, you will need to load the modules necessary for your environment, including at least the CUDA runtime libraries.
- MIG instances are slices of a GPU in terms of memory and CUDA cores, meaning a full GPU will not be available to you on this node. If you need a full GPU for your testing, allocate a full node using salloc or an interactive session using srun, as sketched in the example after this list.
- A MIG instance is not exclusive, therefore you may run out of GPU memory when running your software.
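A minimal sketch of such an allocation is shown below; the partition name, task count, and time limit are assumptions that you should adapt to your own project settings:

# Example: request one full A100 GPU on the gpu partition for 30 minutes
# (task count and time limit are assumptions; adapt to your project)
salloc -p gpu -n 18 --gpus-per-node=1 -t 00:30:00

# Or start an interactive shell on a GPU node directly with srun
srun -p gpu -n 18 --gpus-per-node=1 -t 00:30:00 --pty bash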
Logging into the GPU nodes
Log into int3 using ssh from a login node:
[satishk@int4 ~]$ ssh int3
Last login: Mon Jun 19 11:03:45 2023 from 172.18.63.192
Troubleshooting
If you cannot ssh into int3, check that you have access to the gpu partition (the 'Partition' column of the output below contains 'gpu'):
[satishk@int4 ~]$ sacctmgr show user -s
      User   Def Acct     Admin    Cluster    Account  Partition     Share   Priority MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
   satishk    satishk      None   snellius    satishk        gpu         1                                                                                          gpu
And check that you have a positive GPU budget:
[satishk@int4 ~]$ accinfo --product gpu
If you have access to the GPU partition and have a positive budget, but still cannot log in to int3, please contact the service desk ( https://servicedesk.surf.nl ).
Compilation and testing
In this section we compile a simple CUDA application that performs ping-pong cycles using CUDA-aware MPI. The MIG instances are treated as individual devices and assigned to the MPI ranks using a wrapper script.
Code (file name: ping_pong_cuda_aware.cu):
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

// Macro for checking errors in CUDA API calls
#define cudaErrorCheck(call)                                                                 \
do{                                                                                          \
    cudaError_t cuErr = call;                                                                \
    if(cudaSuccess != cuErr){                                                                \
        printf("CUDA Error - %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(cuErr)); \
        exit(0);                                                                             \
    }                                                                                        \
}while(0)

int main(int argc, char *argv[])
{
    /* -------------------------------------------------------------------------------------------
        MPI Initialization
    --------------------------------------------------------------------------------------------*/
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status stat[10];

    int runtimeversion;
    cudaRuntimeGetVersion(&runtimeversion);
    printf("running with %d runtime version of CUDA\n", runtimeversion);

    if(size != 2){
        if(rank == 0){
            printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
        }
        MPI_Finalize();
        exit(0);
    }

    int dummy;
    // Get mapped MPI ranks to GPUs
    cudaErrorCheck( cudaGetDevice(&dummy) );
    if(rank == 0) {
        int a = 0;
        cudaGetDeviceCount(&a);
        printf("Done! mapping MPI ranks to GPU instances: %d \n", a);
    }

    /* -------------------------------------------------------------------------------------------
        Loop from 8 B to 1 GB
    --------------------------------------------------------------------------------------------*/
    for(int i=0; i<=27; i++){

        long int N = 1 << i;

        // Allocate memory for A on CPU
        double *A = (double*)malloc(N*sizeof(double));

        // Initialize all elements of A to 0.0
        for(int j=0; j<N; j++){
            A[j] = 0.0;
        }

        double *d_A;
        cudaErrorCheck( cudaMalloc(&d_A, N*sizeof(double)) );
        cudaErrorCheck( cudaMemcpy(d_A, A, N*sizeof(double), cudaMemcpyHostToDevice) );

        int tag1 = 10;
        int tag2 = 20;

        int loop_count = 50;

        MPI_Request request[10];

        // Warm-up loop
        for(int j=1; j<=5; j++){
            if(rank == 0){
                MPI_Isend(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD, &request[j-1]);
                MPI_Irecv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &request[j+4]);
            }
            else if(rank == 1){
                MPI_Irecv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &request[j-1]);
                MPI_Isend(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD, &request[j+4]);
            }
        }
        MPI_Waitall(10, request, &stat[0]);

        // Time ping-pong for loop_count iterations of data transfer size 8*N bytes
        double start_time, stop_time, elapsed_time;
        start_time = MPI_Wtime();

        for(int j=1; j<=loop_count; j++){
            if(rank == 0){
                MPI_Send(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
                MPI_Recv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat[0]);
            }
            else if(rank == 1){
                MPI_Recv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat[0]);
                MPI_Send(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }

        stop_time = MPI_Wtime();
        elapsed_time = stop_time - start_time;

        long int num_B = 8*N;
        long int B_in_GB = 1 << 30;
        double num_GB = (double)num_B / (double)B_in_GB;
        double avg_time_per_transfer = elapsed_time / (2.0*(double)loop_count);

        if(rank == 0) printf("Transfer size (B): %10li, Transfer Time (s): %15.9f, Bandwidth (GB/s): %15.9f\n", num_B, avg_time_per_transfer, num_GB/avg_time_per_transfer );

        cudaErrorCheck( cudaFree(d_A) );
        free(A);
    }

    MPI_Finalize();

    return 0;
}
Wrapper script (file name: mpi_wrapper.sh):
#!/bin/bash
# Wrapper to set CUDA_VISIBLE_DEVICES based on the available MIG instances and the local MPI rank
mig_uid_array=( $(nvidia-smi -L | sed -nr "s|^.*UUID:\s*(MIG-[^)]+)\)|\1|p") )
export CUDA_VISIBLE_DEVICES=${mig_uid_array[${OMPI_COMM_WORLD_LOCAL_RANK}]}
echo "Rank ${OMPI_COMM_WORLD_LOCAL_RANK} set CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES}"
/gpfs/home5/satishk/temp/gpu_test_mpi/pp_cuda_aware "$@"
This wrapper script needs to be made executable using chmod.
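For example, assuming the script is in your current working directory:

[satishk@int3 gpu_test_mpi]$ chmod +x mpi_wrapper.sh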
Makefile:
MPICOMP = mpicxx
CUCOMP = nvcc
CUFLAGS = -arch=sm_80

INCLUDES  = -I$(EBROOTOPENMPI)/include
LIBRARIES = -L$(EBROOTOPENMPI)/lib -lmpi -lcudart

pp_cuda_aware: ping_pong_cuda_aware.o
	$(MPICOMP) $(LIBRARIES) ping_pong_cuda_aware.o -o pp_cuda_aware

ping_pong_cuda_aware.o: ping_pong_cuda_aware.cu
	$(CUCOMP) $(CUFLAGS) $(INCLUDES) -c ping_pong_cuda_aware.cu

.PHONY: clean
clean:
	rm -f pp_cuda_aware *.o
Load required modules:
[satishk@int3 ~]$ module load 2022
[satishk@int3 ~]$ module load UCX-CUDA/1.12.1-GCCcore-11.3.0-CUDA-11.7.0
[satishk@int3 ~]$ module load gompi/2022a
[satishk@int3 ~]$
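Before compiling, you can optionally verify that the loaded Open MPI was built with CUDA support. A minimal check using Open MPI's ompi_info (the parameter name below comes from Open MPI's documentation; the exact output may differ between versions):

[satishk@int3 ~]$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true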
Compilation and execution:
[satishk@int3 gpu_test_mpi]$ make
nvcc -arch=sm_80 -I/sw/arch/RHEL8/EB_production/2022/software/OpenMPI/4.1.4-GCC-11.3.0/include -c ping_pong_cuda_aware.cu
mpicxx -L/sw/arch/RHEL8/EB_production/2022/software/OpenMPI/4.1.4-GCC-11.3.0/lib -lmpi -lcudart ping_pong_cuda_aware.o -o pp_cuda_aware
[satishk@int3 gpu_test_mpi]$ mpirun -np 2 mpi_wrapper.sh
Rank 0 set CUDA_VISIBLE_DEVICES: MIG-bdc1d762-d094-5868-b40a-902670ebb9c9
Rank 1 set CUDA_VISIBLE_DEVICES: MIG-b190659d-78f2-514d-a2b7-afdae3427be8
running with 11070 runtime version of CUDA
running with 11070 runtime version of CUDA
Done! mapping MPI ranks to GPU instances: 1
Transfer size (B):          8, Transfer Time (s):     0.000003813, Bandwidth (GB/s):     0.001954246
Transfer size (B):         16, Transfer Time (s):     0.000001626, Bandwidth (GB/s):     0.009163179
Transfer size (B):         32, Transfer Time (s):     0.000001739, Bandwidth (GB/s):     0.017136931
Transfer size (B):         64, Transfer Time (s):     0.000002078, Bandwidth (GB/s):     0.028682555
Transfer size (B):        128, Transfer Time (s):     0.000002150, Bandwidth (GB/s):     0.055456757
Transfer size (B):        256, Transfer Time (s):     0.000002320, Bandwidth (GB/s):     0.102783021
Transfer size (B):        512, Transfer Time (s):     0.000002568, Bandwidth (GB/s):     0.185666900
Transfer size (B):       1024, Transfer Time (s):     0.000003589, Bandwidth (GB/s):     0.265715536
Transfer size (B):       2048, Transfer Time (s):     0.000005871, Bandwidth (GB/s):     0.324891774
Transfer size (B):       4096, Transfer Time (s):     0.000010126, Bandwidth (GB/s):     0.376709996
Transfer size (B):       8192, Transfer Time (s):     0.000013458, Bandwidth (GB/s):     0.566901088
Transfer size (B):      16384, Transfer Time (s):     0.000012195, Bandwidth (GB/s):     1.251223981
Transfer size (B):      32768, Transfer Time (s):     0.000014791, Bandwidth (GB/s):     2.063313187
Transfer size (B):      65536, Transfer Time (s):     0.000020139, Bandwidth (GB/s):     3.030677932
Transfer size (B):     131072, Transfer Time (s):     0.000030869, Bandwidth (GB/s):     3.954511491
Transfer size (B):     262144, Transfer Time (s):     0.000052286, Bandwidth (GB/s):     4.669319091
Transfer size (B):     524288, Transfer Time (s):     0.000095298, Bandwidth (GB/s):     5.123729761
Transfer size (B):    1048576, Transfer Time (s):     0.000144456, Bandwidth (GB/s):     6.760272742
Transfer size (B):    2097152, Transfer Time (s):     0.000243459, Bandwidth (GB/s):     8.022384182
Transfer size (B):    4194304, Transfer Time (s):     0.000441475, Bandwidth (GB/s):     8.848184409
Transfer size (B):    8388608, Transfer Time (s):     0.000836773, Bandwidth (GB/s):     9.336463486
Transfer size (B):   16777216, Transfer Time (s):     0.001649125, Bandwidth (GB/s):     9.474722707
Transfer size (B):   33554432, Transfer Time (s):     0.003269833, Bandwidth (GB/s):     9.557062684
Transfer size (B):   67108864, Transfer Time (s):     0.006536275, Bandwidth (GB/s):     9.562021694
Transfer size (B):  134217728, Transfer Time (s):     0.013065378, Bandwidth (GB/s):     9.567270430
Transfer size (B):  268435456, Transfer Time (s):     0.026134100, Bandwidth (GB/s):     9.566045857
Transfer size (B):  536870912, Transfer Time (s):     0.052366281, Bandwidth (GB/s):     9.548128945
Transfer size (B): 1073741824, Transfer Time (s):     0.140979870, Bandwidth (GB/s):     7.093211250
- Please note that the modules used in this example are from the 2022 environment; the required modules can vary based on the environment your application uses.
MIG instances
If we run nvidia-smi we see that the four available GPUs are split into a total of 4 x 7 = 28 MIG devices:
[satishk@int3 ~]$ nvidia-smi
Tue Jun 27 15:47:50 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:31:00.0 Off |                  Off |
| N/A   28C    P0    38W / 400W |     45MiB / 40960MiB |      N/A     Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:32:00.0 Off |                  Off |
| N/A   28C    P0    38W / 400W |     45MiB / 40960MiB |      N/A     Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:CA:00.0 Off |                  Off |
| N/A   27C    P0    39W / 400W |     45MiB / 40960MiB |      N/A     Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:E3:00.0 Off |                  Off |
| N/A   28C    P0    37W / 400W |     45MiB / 40960MiB |      N/A     Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   14   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   11   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   12   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   13   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   14   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   10   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   11   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   12   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   13   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   11   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   12   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   13   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   14   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
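To confirm the count without scanning the full table, you can count the MIG entries in the device list. This is a small sketch using standard shell tools; the count shown assumes all 28 instances are configured:

[satishk@int3 ~]$ nvidia-smi -L | grep -c MIG
28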
Assign yourself a different MIG instance
MIG instances are not exclusive, which means that another user may already be utilizing the MIG instance you are trying to use. In that case, you can assign yourself another MIG instance.
First, check which processes are running on a particular MIG instance using the nvidia-smi command.
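As a sketch, nvidia-smi's query interface can also list running compute processes directly; the fields below are standard nvidia-smi query fields, but how processes map to individual MIG instances in this output is an assumption you should verify on the node:

[satishk@int3 ~]$ nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv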
Then, to assign yourself a different MIG instance, you can use the code snippet below:
[satishk@int3 ~]$ mig=($(nvidia-smi -L | sed -nr "s|^.*UUID:\s*(MIG-[^)]+)\)|\1|p"))
[satishk@int3 ~]$
[satishk@int3 ~]$ # Examples of MIG-ids
[satishk@int3 ~]$ echo ${mig[0]}
MIG-bdc1d762-d094-5868-b40a-902670ebb9c9
[satishk@int3 ~]$ echo ${mig[1]}
MIG-b190659d-78f2-514d-a2b7-afdae3427be8
[satishk@int3 ~]$ echo ${mig[2]}
MIG-bb348148-354e-5f5c-8bd8-cc0692b6429a
[satishk@int3 ~]$
[satishk@int3 ~]$ # Setting CUDA_VISIBLE_DEVICES to only see a specific MIG-id
[satishk@int3 ~]$ export CUDA_VISIBLE_DEVICES=${mig[14]}
[satishk@int3 ~]$ echo $CUDA_VISIBLE_DEVICES
MIG-56588794-51ce-5f03-b567-ec36a8e651a7
As shown in the snippet above, you first load the ids of the MIG instances into a bash array. You can then assign a specific id to the environment variable CUDA_VISIBLE_DEVICES; in this case it is the id of the 15th MIG instance (array index 14, since bash arrays are zero-based).
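To double-check that CUDA_VISIBLE_DEVICES points at an existing instance on the node, you can match it against the device list. This is a minimal sketch; the exact line printed depends on the instance's MIG profile:

[satishk@int3 ~]$ nvidia-smi -L | grep "$CUDA_VISIBLE_DEVICES"
  MIG 1g.5gb     Device  0: (UUID: MIG-56588794-51ce-5f03-b567-ec36a8e651a7)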