Introduction

In recent years, hardware vendors have introduced various datatypes and specialized instructions/cores in order to speed up scientific computation tasks such as deep learning. Many approaches use reduced precision datatypes in the computations to increase throughput, while trying not to affect the convergence behaviour of the overall algorithm.

In this tutorial, we discuss the particular datatypes and tensor operations supported on A100 GPUs, how they can be leveraged to increase throughput on deep learning tasks, as well as their potential impact on convergence.

TL;DR: how to get the best performance out of the A100 GPUs

  1. These tips are largely based on the NVIDIA cuDNN developer guide.
  2. These tips are also provided as-is, i.e. they are intended for advanced users.
    If you're unsure how these changes may affect your model convergence, we suggest some caution before applying them.
  • Use the lowest precision data format possible (through Automatic Mixed Precision); this will usually be:
    • Half-Precision (FP16) (in which case you also need to scale your gradients with e.g. `torch.cuda.amp.GradScaler`)
    • BFloat16.
  • Sizes should be multiples of 8 whenever possible, e.g.:

    • Batch size
    • Channel size
    • Vocabulary size
    • Sequence length
    • In/Output size of linear (fully-connected) layers
  • Convert your model and batched data to 'channels last' i.e. NHWC (2D) or NDHWC (3D) data format.
    • This is the default in TensorFlow
    • This requires a few extra lines of code in PyTorch (note that .to() is in-place for modules, but out-of-place for tensors, so reassign the result for tensors):
      • model.to(memory_format=torch.channels_last)
      • batch = batch.to(memory_format=torch.channels_last)
  • Use one of the following activation functions: {relu, tanh, sigmoid, elu, gelu, softplus, swish}
  • When using convolutions:
    • Make sure the size of your channel dim is a multiple of 8 whenever possible
    • Make sure the size of your channel dim is at least 32 whenever possible
    • When using grouped convolutions:
      • Make sure the size of your input channel dim is equal to the size of your output channel dim
      • Make sure the size of your input channel dim equals one of the following: {1,4,8,16,32,64,128,256}
  • When using linear (fully-connected) layers in PyTorch, enable one of the reduced precision matmul modes (a short PyTorch sketch combining several of these tips follows this list):
    • `torch.set_float32_matmul_precision('high')` or `torch.set_float32_matmul_precision('medium')`
    • `torch.backends.cuda.matmul.allow_tf32 = True` and/or `torch.backends.cuda.matmul.allow_bf16_reduced_precision_reduction = True`
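
For PyTorch, a minimal sketch combining several of the tips above (the convolution layer, tensor shapes and batch are placeholders; adapt them to your own model and data):

import torch

# Allow TF32 for matmuls and cuDNN convolutions (see 'Training with TensorFloat32' below)
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True

# Channel sizes are multiples of 8 (and >= 32); convert the model to channels last.
# For modules .to() works in-place, for tensors it returns a new tensor, so reassign!
model = torch.nn.Conv2d(32, 64, kernel_size=3).cuda()
model = model.to(memory_format=torch.channels_last)

batch = torch.randn(8, 32, 224, 224, device='cuda')
batch = batch.to(memory_format=torch.channels_last)

# Automatic Mixed Precision for the forward pass (when training in FP16 you also need
# gradient scaling, e.g. torch.cuda.amp.GradScaler; see the mixed precision section below)
with torch.cuda.amp.autocast():
    out = model(batch)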

Datatypes

Various datatypes will be discussed in this tutorial, so we introduce them here. Datatypes are split into 'floating point' datatypes and integers:

Full name | Shorthand | Exponent bits | Mantissa ('precision') bits | Minimal (sub/de)normal value | Minimal normal value | Maximal value | Notes
Double-Precision Float | FP64 | 11 | 52 | ≈ 4.9 × 10^-324 | ≈ 2.2 × 10^-308 | ≈ 1.8 × 10^308 | Most used for HPC applications such as weather simulations; default precision in Python
Single-Precision Float | FP32 | 8 | 23 | ≈ 1.4 × 10^-45 | ≈ 1.2 × 10^-38 | ≈ 3.4 × 10^38 | Default precision in most ML frameworks when TF32 is not available
Half-Precision Float | FP16 | 5 | 10 | ≈ 6.0 × 10^-8 | ≈ 6.1 × 10^-5 | ≈ 6.6 × 10^4 | Default precision in most ML frameworks for 'mixed-precision' training
BFloat16 | BF16 | 8 | 7 | ≈ 9.2 × 10^-41 | ≈ 1.2 × 10^-38 | ≈ 3.4 × 10^38 | Same range as FP32, but reduced precision
TensorFloat32 | TF32 | 8 | 10 | ≈ 1.1 × 10^-41 | ≈ 1.2 × 10^-38 | ≈ 3.4 × 10^38 | Same range as FP32 and same precision as FP16; used by default by most ML frameworks on supported hardware
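
If you want to check these limits yourself, PyTorch exposes them via torch.finfo (TF32 is an execution mode rather than a storage datatype, so it has no finfo entry):

import torch

# Print the numerical limits of the floating point datatypes discussed above
for dtype in (torch.float64, torch.float32, torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(f'{str(dtype):16} min normal: {info.tiny:.3e}  max: {info.max:.3e}  eps: {info.eps:.3e}')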

Next we consider all n-bit integers: typically, 4, 8, 16, 32, 64 or 128 bits are used, depending on how large the range is that needs to be represented.
A special case of n-bit integers is the binary or boolean value: a single bit that can only represent 0 or 1.

Subnormal floats

Subnormal floats, also called 'denormal' floats, are floating point values that fall outside the normal range of a floating point datatype: their magnitude lies between zero and the smallest normal representable value.

In 'normal' floats, the mantissa never has leading zeros, since the exponent is adjusted until the number is 'normal' (for example, the number 0.0123 would be written as 1.23 × 10^-2).
When a value falls below the regular 'normal' range of the datatype, and the hardware and software have subnormal floats enabled, the value will become a subnormal number.

Subnormal floats prevent unexpected underflow or divisions by zero, but come with a performance penalty; sometimes up to 20%: https://developer.nvidia.com/blog/cuda-pro-tip-flush-denormals-confidence/.

In PyTorch, subnormals are enabled by default, and values that fall between representable (subnormal) values are rounded to the nearest one:

min_denormal_fp16_value = 0.000000059604645

>>> torch.tensor([min_denormal_fp16_value/1.9999998], device='cuda:0', dtype=torch.float16)
tensor([5.9605e-08], device='cuda:0', dtype=torch.float16)

>>> torch.tensor([min_denormal_fp16_value/1.9999999], device='cuda:0', dtype=torch.float16)
tensor([0.], device='cuda:0', dtype=torch.float16)
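
If subnormals become a performance problem on the CPU side, PyTorch can flush them to zero with torch.set_flush_denormal; note that this only affects CPU operations, not the CUDA example above (example adapted from the PyTorch documentation):

>>> torch.set_flush_denormal(True)  # returns True if the CPU supports flushing denormals
True
>>> torch.tensor([1e-323], dtype=torch.float64)  # a subnormal FP64 value, flushed to zero
tensor([0.], dtype=torch.float64)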


What are Tensor Cores?

Tensor Cores are cores that specialize in General Matrix-Matrix Multiplication (GEMM) operations, i.e.

D = A × B + C

(where A, B, C and D are matrices), which are at the core of neural network training and inference.

NVIDIA Volta Tensor Cores

Tensor Cores were first introduced in the NVIDIA Volta GPUs, where each tensor core could execute 64 FP16 fused multiply-add operations (FMA) with accumulation in FP32 in a single clock cycle. Thus, Tensor Cores in Volta GPUs were able to perform multiplication of two 4×4 matrices (A and B, both in FP16), add a 4×4 matrix (C, in FP32) and store the result in a 4×4 matrix (D, in FP32) in one clock cycle. This combination of different numerical precisions became known as mixed precision, and typically refers to this mix of using FP16 for multiplication and FP32 for accumulation.
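
The following sketch emulates this behaviour in plain PyTorch, purely as an illustration of the mixed precision GEMM (the real FMA happens inside the Tensor Core hardware):

import torch

# Emulate a Volta-style mixed precision GEMM, D = A x B + C:
# A and B are stored in FP16, the accumulation and the result D are FP32.
A = torch.randn(4, 4, dtype=torch.float16)
B = torch.randn(4, 4, dtype=torch.float16)
C = torch.randn(4, 4, dtype=torch.float32)

D = A.float() @ B.float() + C
print(D.dtype)  # torch.float32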

NVIDIA Ampere Tensor Cores

Ampere Tensor Cores differ from Volta Tensor Cores in two fundamental ways:

  1. They can operate on larger matrices. E.g. they can execute 256 FP16 FMA operations in a single clock cycle, and thus perform a GEMM operation where A is 8×8 and B, C and D are 8×4.
  2. They can operate on more datatypes: FP64, TF32, FP16, BF16, INT8, INT4, Binary are all supported as input types (note that FP32 is not supported).

Sparse Matrix Multiply-Accumulate (MMA) operations

The A100 GPUs provide hardware support for MMA operations on matrices that satisfy a very specific sparsity pattern: if, out of every 4 (row-wise) elements, at most 2 are non-zero, the sparse MMA operation can be used to double the maximum throughput. At the time of writing (December 2021), software support is limited to the low-level cuSPARSELt library, which allows you to exploit these instructions; higher-level frameworks like PyTorch and TensorFlow do not (yet) appear to support this. More on sparse MMA can be found in the NVIDIA Ampere whitepaper.
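
As an illustration of the sparsity constraint (not of the cuSPARSELt API), the small helper below checks whether a matrix satisfies this 2-out-of-4 pattern:

import torch

def satisfies_2_of_4_sparsity(matrix: torch.Tensor) -> bool:
    """Return True if every group of 4 consecutive row-wise elements
    contains at most 2 non-zero values."""
    rows, cols = matrix.shape
    assert cols % 4 == 0, 'number of columns must be a multiple of 4'
    groups = matrix.reshape(rows, cols // 4, 4)
    nonzeros_per_group = (groups != 0).sum(dim=-1)
    return bool((nonzeros_per_group <= 2).all())

print(satisfies_2_of_4_sparsity(torch.tensor([[1., 0., 2., 0.],
                                              [0., 3., 0., 0.]])))  # True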

Theoretical performance of A100 GPUs

If you have ever looked at the theoretical performance of A100 GPUs, you might have been confused by how many entries the peak performance table lists, and which performance to expect in which case. The following table (based on Tables 3 and 4 of the NVIDIA Ampere whitepaper) clarifies the expected performance based on the input precision (i.e. the datatype of A and B) and the accumulator precision (typically determined by the datatype of C and D) for MMA operations:

Input | Accumulator | MMA performance (Tensor Core) | Sparse MMA performance (Tensor Core) | Non-MMA performance (CUDA Core)
FP64 | FP64 | 19.5 TFLOPS | - | 9.7 TFLOPS
FP32 | FP32 | 19.5 TFLOPS | - | 19.5 TFLOPS
TF32 | FP32 | 156 TFLOPS | 312 TFLOPS | -
FP16 | FP16/FP32 | 312 TFLOPS | 624 TFLOPS | 78 TFLOPS
BF16 | BF16/FP32 | 312 TFLOPS | 624 TFLOPS | 39 TFLOPS
INT32 | - | - | - | 19.5 TOPS
INT8 | INT32 | 624 TOPS | 1248 TOPS | -
INT4 | INT32 | 1248 TOPS | 2496 TOPS | -
Binary | INT32 | 4992 TOPS | - | -

Table: Theoretical performance of operations on a single A100 GPU (source: NVIDIA Ampere whitepaper). TFLOPS: Tera (10^12) floating point operations per second. TOPS: Tera (non-floating point) operations per second.
Of course, not all operations you want to do are MMA operations. For 'normal' floating point (and integer) arithmetic, the regular CUDA cores are used.

It might be interesting to note that the INT32 and FP32 CUDA cores can execute operations simultaneously, although it is unclear to the authors whether this feature is used by any deep learning framework.

Real-world performance of A100 GPUs

We use the following benchmark script to illustrate the performance difference between pure FP32, using TensorFloat32 and using mixed precision (i.e. FP16 inputs and FP32 accumulators) to train a network from tf.keras.applications on synthetic data:

# -------------
# benchmark.py
# -------------

import argparse
import os
import numpy as np
import timeit

import tensorflow as tf
from tensorflow.keras import applications
from tensorflow.keras import mixed_precision

# Benchmark settings
parser = argparse.ArgumentParser(description='TensorFlow Synthetic Benchmark',
                                 formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument('--mixed-prec', action='store_true', default=False,
                    help='Use mixed precision for training')
parser.add_argument('--disable-tf32', action='store_true', default=False,
                    help='Disable the use of TensorFloat32 for training')

parser.add_argument('--model', type=str, default='ResNet50',
                    help='model to benchmark')
parser.add_argument('--batch-size', type=int, default=128,
                    help='input batch size')

parser.add_argument('--num-warmup-batches', type=int, default=2,
                    help='number of warm-up batches that don\'t count towards benchmark')
parser.add_argument('--num-batches-per-iter', type=int, default=10,
                    help='number of batches per benchmark iteration')
parser.add_argument('--num-iters', type=int, default=10,
                    help='number of benchmark iterations')
args = parser.parse_args()

tf.config.threading.set_inter_op_parallelism_threads(1)

tf.config.threading.set_intra_op_parallelism_threads(int(os.environ['OMP_NUM_THREADS']))

if args.mixed_prec:
    print('Running with mixed_float16 as global policy for the precision')
    mixed_precision.set_global_policy('mixed_float16')

if args.disable_tf32:
    print('Disabling TF32 execution')
    tf.config.experimental.enable_tensor_float_32_execution(False)

# Fix seed so that it runs the same every time
tf.random.set_seed(42)

# Set up standard model.
model = getattr(applications, args.model)(weights=None)
opt = tf.optimizers.SGD(0.01)
if args.mixed_prec:
    print('Running with loss scaling for mixed precision')
    opt = mixed_precision.LossScaleOptimizer(opt)

data = tf.random.uniform([args.batch_size, 224, 224, 3])
target = tf.random.uniform([args.batch_size, 1], minval=0, maxval=999, dtype=tf.int64)

print('Model: %s' % args.model)
print('Batch size: %d' % args.batch_size)

@tf.function
def benchmark_step():
    # Record gradients with GradientTape
    with tf.GradientTape() as tape:
        probs = model(data, training=True)
        loss = tf.losses.sparse_categorical_crossentropy(target, probs)
        if args.mixed_prec:
            scaled_loss = opt.get_scaled_loss(loss)
    if args.mixed_prec:
        scaled_gradients = tape.gradient(scaled_loss, model.trainable_variables)
        gradients = opt.get_unscaled_gradients(scaled_gradients)
    else:
        gradients = tape.gradient(loss, model.trainable_variables)
    opt.apply_gradients(zip(gradients, model.trainable_variables))

    # Return the loss so we can inspect the effect of datatype accuracy
    return tf.math.reduce_mean(loss)

with tf.device('GPU'):
    # Warm-up
    print('Running warmup...')
    loss = benchmark_step()
    print(f"loss: {loss}")

    timeit.timeit(lambda: print(f"loss: {benchmark_step()}"),
                  number=args.num_warmup_batches)

    # Benchmark
    print('Running benchmark...')
    img_secs = []
    for x in range(args.num_iters):
        time = timeit.timeit(lambda: benchmark_step(),
                             number=args.num_batches_per_iter)
        img_sec = args.batch_size * args.num_batches_per_iter / time
        print('Iter #%d: %.1f img/sec' % (x, img_sec))
        img_secs.append(img_sec)

    # Results
    img_sec_mean = np.mean(img_secs)
    img_sec_conf = 1.96 * np.std(img_secs)
    print('Img/sec: %.1f +-%.1f' % (img_sec_mean, img_sec_conf))

Then, we allocate a single A100:

salloc -p gpu -n 1 --ntasks-per-node 1 --gpus 1 --cpus-per-task 18 -t 8:00:00

Next, we use ssh to connect to the allocated node, and run the benchmark script with the following environment:

module load 2021
module load TensorFlow/2.6.0-foss-2021a-CUDA-11.3.1
module list

export OMP_NUM_THREADS=18

python benchmark.py
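
For example, the three precision settings for a given model can be selected as follows (the model name is just an example):

python benchmark.py --model ResNet50 --disable-tf32   # pure FP32
python benchmark.py --model ResNet50                  # TF32 (default)
python benchmark.py --model ResNet50 --mixed-prec     # FP16 inputs + FP32 accumulators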

Using various values for the --model, --mixed-prec and --disable-tf32 arguments, we run with various precisions and models to construct the following table:

Model | Precision | Throughput (img/s) | Speedup (compared to FP32) | Loss (1st iteration)
ResNet50 | FP32 | 455.9 | 1 | 7.457645893096924
ResNet50 | TF32 | 750.6 | 1.65 | 7.456507205963135
ResNet50 | FP16 (input) + FP32 (accumulator) | 1087.6 | 2.38 | 7.45703125
VGG19 | FP32 | 212.7 | 1 | 6.9077839851379395
VGG19 | TF32 | 550.3 | 2.59 | 6.907783508300781
VGG19 | FP16 (input) + FP32 (accumulator) | 1099.8 | 5.17 | 6.90625
DenseNet121 | FP32 | 391.9 | 1 | 6.96142053604126
DenseNet121 | TF32 | 591.5 | 1.51 | 6.961450576782227
DenseNet121 | FP16 (input) + FP32 (accumulator) | 876.2 | 2.24 | 6.9609375

A few key results to note:

  • Speedup of mixed precision or TF32 over traditional FP32 varies per model
  • Speedup is much smaller than the theoretical difference in throughput from the tables in the previous section (but still very substantial!)
  • Loss is affected by the reduced precision. Note that this is not a problem in itself: as long as the convergence behaviour of the training is not affected, the reduced precision is fine. In their published benchmark results, NVIDIA has demonstrated that a large number of well-known models indeed converge properly using mixed precision.

Training with TensorFloat32

Because TensorFloat32 covers the same range as traditional FP32, training in TF32 can easily be done as a drop-in replacement. In fact, NVIDIA has made the use of TF32 the default for any cuDNN call, and both TensorFlow and PyTorch use TF32 by default. TensorFlow will (depending on the verbosity level you set for the logging) also inform you explicitly that it will use TF32, e.g.:

2021-11-30 12:17:30.251449: I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

Disabling the use of TensorFloat32 (debugging only)

In some cases, you might want to disable the use of TensorFloat32. For example, if you are debugging convergence issues, and you want to make sure the datatype is not the problem. Or, if you want to compare your convergence behaviour between two machines in order to validate a run, but only one of these machines supports TensorFloat32.

General environment variable

Setting the environment variable NVIDIA_TF32_OVERRIDE=0 before running your code should in principle disable the use of the TensorFloat32 datatype. All low-level CUDA libraries will respect this variable. For higher-level frameworks that use CUDA libraries as a backend, it may depend on the specific framework.

TensorFlow

TensorFlow does not seem to respect the general NVIDIA_TF32_OVERRIDE variable. To turn off the use of TensorFloat32 by TensorFlow, you'll have to disable it explicitly by calling

tf.config.experimental.enable_tensor_float_32_execution(False)

in your code (see the TensorFlow API documentation for tf.config.experimental.enable_tensor_float_32_execution).

PyTorch

PyTorch does respect the NVIDIA_TF32_OVERRIDE environment variable. However, you can also turn it off explicitly in your code using

# The flag below controls whether to allow TF32 on matmul. This flag defaults to True.
torch.backends.cuda.matmul.allow_tf32 = False

# The flag below controls whether to allow TF32 on cuDNN. This flag defaults to True.
torch.backends.cudnn.allow_tf32 = False

Training with Mixed Precision

Training with mixed precision is less straightforward than using TensorFloat32. The main reason is that the range of values that FP16 can represent is smaller. This can be a problem particularly for small gradients, which may fall below the FP16 representable range (for more information, see the histograms in NVIDIA's documentation on mixed precision training). This can be solved by so-called loss scaling. Essentially, in loss scaling the loss is multiplied by a factor 'S' in between the forward and backward propagation steps. Then, after the backward propagation, the weight gradients are multiplied by 1/S before doing the weight update.
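
A minimal sketch of manual loss scaling in PyTorch (illustrative only: the toy model, data and the fixed scale factor S are placeholders, and in practice you would use the automatic tools described below):

import torch

# Toy model in FP16 (for illustration; real setups typically keep FP32 master copies of the weights)
model = torch.nn.Linear(16, 4).half().cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(8, 16, dtype=torch.float16, device='cuda')
y = torch.randn(8, 4, dtype=torch.float16, device='cuda')

S = 1024.0                                     # loss scale factor 'S'
loss = torch.nn.functional.mse_loss(model(x), y)
(loss * S).backward()                          # backward pass on the scaled loss
for p in model.parameters():
    p.grad.div_(S)                             # multiply gradients by 1/S before the update
opt.step()
opt.zero_grad()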

This procedure can be done manually (as in the sketch above), but many frameworks support some form of automatic loss scaling. Below, we summarize the key parts of the TensorFlow and PyTorch documentation on mixed precision training, but we encourage you to read their respective user manual sections to get the full picture.

TensorFlow

The official documentation contains an extensive section on using mixed precision in TensorFlow.

To enable mixed precision, you have to set the global policy:

from tensorflow.keras import mixed_precision

mixed_precision.set_global_policy('mixed_float16')

If you train using the tf.keras.Model.fit API, that's all you need to do: this API automatically performs loss scaling if the 'mixed_float16' policy is set. If, however, you implement a custom training loop (like in our benchmark example above), you have to wrap the optimizer in the tf.keras.mixed_precision.LossScaleOptimizer class like so:

from tensorflow import keras
from tensorflow.keras import mixed_precision

# Any Keras optimizer works; use RMSprop as an example:
optimizer = keras.optimizers.RMSprop()
# Wrap it in a LossScaleOptimizer to perform automatic loss scaling
optimizer = mixed_precision.LossScaleOptimizer(optimizer)

If you want, you can specify an explicit loss scale, but it is recommended to keep the default loss scaling behavior of this optimizer. Finally, you have to insert the scaling step after calculating the loss, compute the gradients on the scaled loss, and then get the unscaled gradients:

@tf.function
def train_step(x, y):
  with tf.GradientTape() as tape:
    predictions = model(x)
    loss = loss_object(y, predictions)
    # Scale loss:
    scaled_loss = optimizer.get_scaled_loss(loss)
  # Compute gradients on scaled loss:
  scaled_gradients = tape.gradient(scaled_loss, model.trainable_variables)
  # Invert the scaling before applying the gradients
  gradients = optimizer.get_unscaled_gradients(scaled_gradients)
  optimizer.apply_gradients(zip(gradients, model.trainable_variables))
  return loss

PyTorch

PyTorch has an Automatic Mixed Precision package (AMP). The official documentation of the API can be found here, but using mixed precision in PyTorch is explained more extensively in their PyTorch recipe section.

Typically, automatic mixed precision training uses torch.cuda.amp.autocast together with torch.cuda.amp.GradScaler. The first, torch.cuda.amp.autocast, runs each operation in an op-specific dtype chosen by autocast: it selects FP16 for an op's inputs when the use of Tensor Cores is expected to make that op faster. The second, torch.cuda.amp.GradScaler, automatically scales the gradients to prevent underflow. Altogether, your code would typically look like this:

import torch

use_amp = True

# Placeholder sizes, model and synthetic data, so the snippet is self-contained
in_size, out_size, num_layers, batch_size, epochs = 4096, 4096, 3, 512, 1
def make_model(in_size, out_size, num_layers):
    layers = [torch.nn.Linear(in_size, in_size), torch.nn.ReLU()] * (num_layers - 1)
    return torch.nn.Sequential(*layers, torch.nn.Linear(in_size, out_size)).cuda()
data = [torch.randn(batch_size, in_size, device='cuda') for _ in range(10)]
targets = [torch.randn(batch_size, out_size, device='cuda') for _ in range(10)]
loss_fn = torch.nn.MSELoss()

net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for epoch in range(epochs):
    for input, target in zip(data, targets):
        # autocast runs each op in the dtype it selects, so that mixed precision
        # Tensor Core operations are used as much as possible
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        # backward() on the scaled loss produces scaled gradients
        scaler.scale(loss).backward()
        # step() unscales the gradients and then calls opt.step()
        scaler.step(opt)
        # update() adapts the scale factor for the next iteration
        scaler.update()
        opt.zero_grad()  # set_to_none=True here can modestly improve performance

If you want to inspect or modify gradients (e.g. clipping), this requires you to unscale the gradients in between the backward() and the step(...) calls. See the official documentation for details.
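
For example, to clip gradients you first unscale them explicitly; scaler.step(opt) then knows not to unscale a second time. A sketch of the inner loop, continuing the example above (the max_norm value is just a placeholder):

for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.cuda.amp.autocast(enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        # Unscale the gradients of opt's parameters in-place so they can be inspected/clipped
        scaler.unscale_(opt)
        torch.nn.utils.clip_grad_norm_(net.parameters(), max_norm=1.0)
        scaler.step(opt)   # skips the unscaling, since it was already done explicitly
        scaler.update()
        opt.zero_grad()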

Which datatype should I use?

There is little reason not to use TensorFloat32: it is (much) faster than using FP32, and since it covers the same range, it does not require things like loss scaling, so no code changes are needed. The only reason not to use it would be the reduced precision of the fraction, which in theory could affect convergence behaviour. Practical experience so far has shown that convergence behaviour with TensorFloat32 for deep learning is generally not altered (have you ever wondered if FP32 was precise enough to make your training converge?). Only if you experience convergence issues could you try disabling it to see if that helps - but even if it does, there are probably other steps you could take to make your training more stable that have a smaller effect on training speed.

Training in mixed precision is more involved. It requires some code changes (though frameworks have automated a lot for you) and you'll have to think carefully about where you inspect or modify gradients. It does, however, provide a substantially larger speedup than TF32, which can be particularly important for compute-intensive tasks such as hyperparameter tuning. It is therefore worth trying mixed precision. Mixed precision has successfully been used to train a large number of well-known networks to proper convergence. If convergence does prove to be an issue for your particular task, switch back to TF32 and see if that helps. If it does, check your loss scaling code, inspect scaled/unscaled losses, and verify that nothing gets clipped due to underflow.

Sources:

NVIDIA A100 whitepaper

NVIDIA deep learning performance tutorial

NVIDIA Tensor Core performance tutorial

For future reference:

NVIDIA H100 whitepaper