Introduction
A GPU node on Snellius is available for interactive software development and compiling codes that utilize GPUs.
This page includes instructions on how to connect to this node and an example compilation.
This node is meant for users who want to compile their GPU codes on Snellius and perform small test runs lasting no more than a few minutes.
Like any other A100 GPU node on Snellius, this node contains 4 GPU cards, each divided into 7 MIG (Multi-Instance GPU) instances, resulting in a total of 28 MIG instances available to users.
Restrictions for the interactive GPU node
- A user needs to have access to the gpu partition and at least 1 SBU of GPU budget.
- This node is not exposed to the external world and is not a login node; it can only be reached via the login nodes (i.e. the 'int' nodes).
- This node is not meant for production runs; it is meant for sanity checking of your code.
- This is a shared node and the regular usage policy applies, meaning your jobs will automatically be killed if they run for more than 15 minutes.
- Once you assign yourself a MIG instance, you will need to load the modules necessary for your environment, including at least the CUDA runtime libraries.
- MIG instances are slices of a GPU in terms of memory and CUDA cores, meaning a full GPU will not be available to you on this node. If you need a full GPU for your testing, allocate a full node using salloc or an interactive session using srun, as sketched in the example after this list.
- A MIG instance is not exclusive, therefore you may run out of GPU memory when running your software.
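A minimal sketch of such an allocation is shown below; the partition name, task count, and time limit are assumptions that you should adapt to your own project settings:

# Example: request one full A100 GPU on the gpu partition for 30 minutes
# (task count and time limit are assumptions; adapt to your project)
salloc -p gpu -n 18 --gpus-per-node=1 -t 00:30:00

# Or start an interactive shell on a GPU node directly with srun
srun -p gpu -n 18 --gpus-per-node=1 -t 00:30:00 --pty bash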
Logging into the GPU nodes
Log into int3 using ssh from a login node:
[satishk@int4 ~]$ ssh int3
Last login: Mon Jun 19 11:03:45 2023 from 172.18.63.192
Troubleshooting
If you cannot ssh into int3, check that you have access to the gpu partition (the 'Partition' column of the output below contains 'gpu'):
[satishk@int4 ~]$ sacctmgr show user -s
      User   Def Acct     Admin    Cluster    Account  Partition     Share   Priority MaxJobs MaxNodes  MaxCPUs MaxSubmit     MaxWall  MaxCPUMins                  QOS   Def QOS
---------- ---------- --------- ---------- ---------- ---------- --------- ---------- ------- -------- -------- --------- ----------- ----------- -------------------- ---------
   satishk    satishk      None   snellius    satishk        gpu         1                                                                                          gpu
And check that you have a positive GPU budget:
[satishk@int4 ~]$ accinfo --product gpu
If you have access to the GPU partition and have a positive budget, but still cannot log in to int3, please contact the service desk ( https://servicedesk.surf.nl ).
Compilation and testing
In this section we compile a simple CUDA application that performs ping-pong cycles using CUDA-aware MPI. The MIG instances are treated as individual devices and assigned to the MPI ranks using a wrapper script.
Code (file name: ping_pong_cuda_aware.cu):
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

// Macro for checking errors in CUDA API calls
#define cudaErrorCheck(call)                                                                 \
do{                                                                                          \
    cudaError_t cuErr = call;                                                                \
    if(cudaSuccess != cuErr){                                                                \
        printf("CUDA Error - %s:%d: '%s'\n", __FILE__, __LINE__, cudaGetErrorString(cuErr)); \
        exit(0);                                                                             \
    }                                                                                        \
}while(0)

int main(int argc, char *argv[])
{
    /* -------------------------------------------------------------------------------------------
        MPI Initialization
    --------------------------------------------------------------------------------------------*/
    MPI_Init(&argc, &argv);

    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Status stat[10];

    int runtimeversion;
    cudaRuntimeGetVersion(&runtimeversion);
    printf("running with %d runtime version of CUDA\n", runtimeversion);

    if(size != 2){
        if(rank == 0){
            printf("This program requires exactly 2 MPI ranks, but you are attempting to use %d! Exiting...\n", size);
        }
        MPI_Finalize();
        exit(0);
    }

    int dummy;
    // Get mapped MPI ranks to GPUs
    cudaErrorCheck( cudaGetDevice(&dummy) );
    if(rank == 0) {
        int a = 0;
        cudaGetDeviceCount(&a);
        printf("Done! mapping MPI ranks to GPU instances: %d \n", a);
    }

    /* -------------------------------------------------------------------------------------------
        Loop from 8 B to 1 GB
    --------------------------------------------------------------------------------------------*/
    for(int i=0; i<=27; i++){

        long int N = 1 << i;

        // Allocate memory for A on CPU
        double *A = (double*)malloc(N*sizeof(double));

        // Initialize all elements of A to 0.0
        for(int j=0; j<N; j++){
            A[j] = 0.0;
        }

        double *d_A;
        cudaErrorCheck( cudaMalloc(&d_A, N*sizeof(double)) );
        cudaErrorCheck( cudaMemcpy(d_A, A, N*sizeof(double), cudaMemcpyHostToDevice) );

        int tag1 = 10;
        int tag2 = 20;

        int loop_count = 50;

        MPI_Request request[10];

        // Warm-up loop
        for(int j=1; j<=5; j++){
            if(rank == 0){
                MPI_Isend(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD, &request[j-1]);
                MPI_Irecv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &request[j+4]);
            }
            else if(rank == 1){
                MPI_Irecv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &request[j-1]);
                MPI_Isend(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD, &request[j+4]);
            }
        }
        MPI_Waitall(10, request, &stat[0]);

        // Time ping-pong for loop_count iterations of data transfer size 8*N bytes
        double start_time, stop_time, elapsed_time;
        start_time = MPI_Wtime();

        for(int j=1; j<=loop_count; j++){
            if(rank == 0){
                MPI_Send(d_A, N, MPI_DOUBLE, 1, tag1, MPI_COMM_WORLD);
                MPI_Recv(d_A, N, MPI_DOUBLE, 1, tag2, MPI_COMM_WORLD, &stat[0]);
            }
            else if(rank == 1){
                MPI_Recv(d_A, N, MPI_DOUBLE, 0, tag1, MPI_COMM_WORLD, &stat[0]);
                MPI_Send(d_A, N, MPI_DOUBLE, 0, tag2, MPI_COMM_WORLD);
            }
        }

        stop_time = MPI_Wtime();
        elapsed_time = stop_time - start_time;

        long int num_B = 8*N;
        long int B_in_GB = 1 << 30;
        double num_GB = (double)num_B / (double)B_in_GB;
        double avg_time_per_transfer = elapsed_time / (2.0*(double)loop_count);

        if(rank == 0) printf("Transfer size (B): %10li, Transfer Time (s): %15.9f, Bandwidth (GB/s): %15.9f\n", num_B, avg_time_per_transfer, num_GB/avg_time_per_transfer );

        cudaErrorCheck( cudaFree(d_A) );
        free(A);
    }

    MPI_Finalize();

    return 0;
}
Wrapper script (file name: mpi_wrapper.sh):
#!/bin/bash
# Wrapper to set CUDA_VISIBLE_DEVICES based on the available MIG instances and the local MPI rank
mig_uid_array=( $(nvidia-smi -L | sed -nr "s|^.*UUID:\s*(MIG-[^)]+)\)|\1|p") )
export CUDA_VISIBLE_DEVICES=${mig_uid_array[${OMPI_COMM_WORLD_LOCAL_RANK}]}
echo "Rank ${OMPI_COMM_WORLD_LOCAL_RANK} set CUDA_VISIBLE_DEVICES: ${CUDA_VISIBLE_DEVICES}"
/gpfs/home5/satishk/temp/gpu_test_mpi/pp_cuda_aware "$@"
This wrapper script needs to be made executable using chmod.
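For example, assuming the script is in your current working directory:

[satishk@int3 gpu_test_mpi]$ chmod +x mpi_wrapper.sh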
Makefile:
MPICOMP = mpicxx
CUCOMP = nvcc
CUFLAGS = -arch=sm_80

INCLUDES  = -I$(EBROOTOPENMPI)/include
LIBRARIES = -L$(EBROOTOPENMPI)/lib -lmpi -lcudart

pp_cuda_aware: ping_pong_cuda_aware.o
	$(MPICOMP) $(LIBRARIES) ping_pong_cuda_aware.o -o pp_cuda_aware

ping_pong_cuda_aware.o: ping_pong_cuda_aware.cu
	$(CUCOMP) $(CUFLAGS) $(INCLUDES) -c ping_pong_cuda_aware.cu

.PHONY: clean
clean:
	rm -f pp_cuda_aware *.o
Load required modules:
[satishk@int3 ~]$ module load 2022
[satishk@int3 ~]$ module load UCX-CUDA/1.12.1-GCCcore-11.3.0-CUDA-11.7.0
[satishk@int3 ~]$ module load gompi/2022a
[satishk@int3 ~]$
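Before compiling, you can optionally verify that the loaded Open MPI was built with CUDA support. A minimal check using Open MPI's ompi_info (the parameter name below comes from Open MPI's documentation; the exact output may differ between versions):

[satishk@int3 ~]$ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
mca:mpi:base:param:mpi_built_with_cuda_support:value:true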
Compilation and execution:
[satishk@int3 gpu_test_mpi]$ make
nvcc -arch=sm_80 -I/sw/arch/RHEL8/EB_production/2022/software/OpenMPI/4.1.4-GCC-11.3.0/include -c ping_pong_cuda_aware.cu
mpicxx -L/sw/arch/RHEL8/EB_production/2022/software/OpenMPI/4.1.4-GCC-11.3.0/lib -lmpi -lcudart ping_pong_cuda_aware.o -o pp_cuda_aware
[satishk@int3 gpu_test_mpi]$ mpirun -np 2 mpi_wrapper.sh
Rank 0 set CUDA_VISIBLE_DEVICES: MIG-bdc1d762-d094-5868-b40a-902670ebb9c9
Rank 1 set CUDA_VISIBLE_DEVICES: MIG-b190659d-78f2-514d-a2b7-afdae3427be8
running with 11070 runtime version of CUDA
running with 11070 runtime version of CUDA
Done! mapping MPI ranks to GPU instances: 1
Transfer size (B):          8, Transfer Time (s):     0.000003813, Bandwidth (GB/s):     0.001954246
Transfer size (B):         16, Transfer Time (s):     0.000001626, Bandwidth (GB/s):     0.009163179
Transfer size (B):         32, Transfer Time (s):     0.000001739, Bandwidth (GB/s):     0.017136931
Transfer size (B):         64, Transfer Time (s):     0.000002078, Bandwidth (GB/s):     0.028682555
Transfer size (B):        128, Transfer Time (s):     0.000002150, Bandwidth (GB/s):     0.055456757
Transfer size (B):        256, Transfer Time (s):     0.000002320, Bandwidth (GB/s):     0.102783021
Transfer size (B):        512, Transfer Time (s):     0.000002568, Bandwidth (GB/s):     0.185666900
Transfer size (B):       1024, Transfer Time (s):     0.000003589, Bandwidth (GB/s):     0.265715536
Transfer size (B):       2048, Transfer Time (s):     0.000005871, Bandwidth (GB/s):     0.324891774
Transfer size (B):       4096, Transfer Time (s):     0.000010126, Bandwidth (GB/s):     0.376709996
Transfer size (B):       8192, Transfer Time (s):     0.000013458, Bandwidth (GB/s):     0.566901088
Transfer size (B):      16384, Transfer Time (s):     0.000012195, Bandwidth (GB/s):     1.251223981
Transfer size (B):      32768, Transfer Time (s):     0.000014791, Bandwidth (GB/s):     2.063313187
Transfer size (B):      65536, Transfer Time (s):     0.000020139, Bandwidth (GB/s):     3.030677932
Transfer size (B):     131072, Transfer Time (s):     0.000030869, Bandwidth (GB/s):     3.954511491
Transfer size (B):     262144, Transfer Time (s):     0.000052286, Bandwidth (GB/s):     4.669319091
Transfer size (B):     524288, Transfer Time (s):     0.000095298, Bandwidth (GB/s):     5.123729761
Transfer size (B):    1048576, Transfer Time (s):     0.000144456, Bandwidth (GB/s):     6.760272742
Transfer size (B):    2097152, Transfer Time (s):     0.000243459, Bandwidth (GB/s):     8.022384182
Transfer size (B):    4194304, Transfer Time (s):     0.000441475, Bandwidth (GB/s):     8.848184409
Transfer size (B):    8388608, Transfer Time (s):     0.000836773, Bandwidth (GB/s):     9.336463486
Transfer size (B):   16777216, Transfer Time (s):     0.001649125, Bandwidth (GB/s):     9.474722707
Transfer size (B):   33554432, Transfer Time (s):     0.003269833, Bandwidth (GB/s):     9.557062684
Transfer size (B):   67108864, Transfer Time (s):     0.006536275, Bandwidth (GB/s):     9.562021694
Transfer size (B):  134217728, Transfer Time (s):     0.013065378, Bandwidth (GB/s):     9.567270430
Transfer size (B):  268435456, Transfer Time (s):     0.026134100, Bandwidth (GB/s):     9.566045857
Transfer size (B):  536870912, Transfer Time (s):     0.052366281, Bandwidth (GB/s):     9.548128945
Transfer size (B): 1073741824, Transfer Time (s):     0.140979870, Bandwidth (GB/s):     7.093211250
- Please note that the modules used in this example are from the 2022 environment; the required modules can vary based on the environment your application uses.
MIG instances
If we run nvidia-smi we see that the four available GPUs are split into a total of 4 x 7 = 28 MIG devices:
[satishk@int3 ~]$ nvidia-smi
Tue Jun 27 15:47:50 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 520.61.05    Driver Version: 520.61.05    CUDA Version: 11.8     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-SXM...  On   | 00000000:31:00.0 Off |                  Off |
| N/A   28C    P0    38W / 400W |     45MiB / 40960MiB |      N/A     Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA A100-SXM...  On   | 00000000:32:00.0 Off |                  Off |
| N/A   28C    P0    38W / 400W |     45MiB / 40960MiB |      N/A     Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   2  NVIDIA A100-SXM...  On   | 00000000:CA:00.0 Off |                  Off |
| N/A   27C    P0    39W / 400W |     45MiB / 40960MiB |      N/A     Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+
|   3  NVIDIA A100-SXM...  On   | 00000000:E3:00.0 Off |                  Off |
| N/A   28C    P0    37W / 400W |     45MiB / 40960MiB |      N/A     Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|        Shared         |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   14   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   11   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   12   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   13   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  1   14   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   10   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   11   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   12   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  2   13   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    7   0   0  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    8   0   1  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3    9   0   2  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   11   0   3  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   12   0   4  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   13   0   5  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  3   14   0   6  |      6MiB /  4864MiB | 14    N/A |  1   0    0    0    0 |
|                  |      0MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
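To confirm the count without scanning the full table, you can count the MIG entries in the device list. This is a small sketch using standard shell tools; the count shown assumes all 28 instances are configured:

[satishk@int3 ~]$ nvidia-smi -L | grep -c MIG
28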
Assign yourself a different MIG instance
MIG instances are not exclusive, which means that another user may already be utilizing the MIG instance you are trying to use. In that case, you can assign yourself another MIG instance.
First, check which processes are running on a particular MIG instance using the nvidia-smi command.
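As a sketch, nvidia-smi's query interface can also list running compute processes directly; the fields below are standard nvidia-smi query fields, but how processes map to individual MIG instances in this output is an assumption you should verify on the node:

[satishk@int3 ~]$ nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv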
Then, to assign yourself a different MIG instance, you can use the code snippet below:
[satishk@int3 ~]$ mig=($(nvidia-smi -L | sed -nr "s|^.*UUID:\s*(MIG-[^)]+)\)|\1|p"))
[satishk@int3 ~]$
[satishk@int3 ~]$ # Examples of MIG-ids
[satishk@int3 ~]$ echo ${mig[0]}
MIG-bdc1d762-d094-5868-b40a-902670ebb9c9
[satishk@int3 ~]$ echo ${mig[1]}
MIG-b190659d-78f2-514d-a2b7-afdae3427be8
[satishk@int3 ~]$ echo ${mig[2]}
MIG-bb348148-354e-5f5c-8bd8-cc0692b6429a
[satishk@int3 ~]$
[satishk@int3 ~]$ # Setting CUDA_VISIBLE_DEVICES to only see a specific MIG-id
[satishk@int3 ~]$ export CUDA_VISIBLE_DEVICES=${mig[14]}
[satishk@int3 ~]$ echo $CUDA_VISIBLE_DEVICES
MIG-56588794-51ce-5f03-b567-ec36a8e651a7
As shown in the snippet above, you first load the ids of the MIG instances into a bash array. You can then assign a specific id to the environment variable CUDA_VISIBLE_DEVICES; in this case it is the id of the 15th MIG instance (array index 14, since bash arrays are zero-based).
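To double-check that CUDA_VISIBLE_DEVICES points at an existing instance on the node, you can match it against the device list. This is a minimal sketch; the exact line printed depends on the instance's MIG profile:

[satishk@int3 ~]$ nvidia-smi -L | grep "$CUDA_VISIBLE_DEVICES"
  MIG 1g.5gb     Device  0: (UUID: MIG-56588794-51ce-5f03-b567-ec36a8e651a7)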