IgANet – Physics-informed isogeometric analysis neural network

The main goal of the SURF-ETP project was to port the IgANets (https://github.com/iganets) source code from CUDA to ROCm (milestones 1 and 2) and perform performance measurements on AMD MI210 graphics cards (milestone 3). All three milestones have been reached. The IgANets code builds upon the C++ API of Torch 2.x and, as such, can utilize LibTorch’s CPU, CUDA, and ROCm backends. Since some of the functionality of IgANets is implemented natively as CUDA kernels, it was necessary to port these parts of the code to HIP kernels, which is a major outcome of the SURF-ETP project. A second major outcome is the detailed performance evaluation of the different compute backends and the different code implementations.

The efficient evaluation of B-spline functions is a major compute kernel of IgANets during both training and inference, and considerable effort has therefore been put into implementing it efficiently. The mathematical task is as follows: given a location $\xi = (\xi_1, \xi_2, \xi_3)$, compute the value of the B-spline function $u(\xi) = \sum_i \sum_j \sum_k c_{ijk} B_i(\xi_1) B_j(\xi_2) B_k(\xi_3)$, where $c_{ijk}$ are the so-called control points and $B_i$, $B_j$, $B_k$ are the univariate B-spline basis functions. The mathematical details and an algorithm for the efficient implementation are given in [1]. IgANets implements two variants: one in which the Kronecker product of the univariate basis functions is assembled first and then applied to the control points (non-MemOpt), and one in which the triple sum is computed step by step without assembling the full vector (MemOpt). In addition, since the subset of indices $(i, j, k)$ that contribute to the sum depends on the value of $\xi$, two alternative versions, with and without precomputing these indices, have been implemented.
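The two evaluation variants and the optional index precomputation can be illustrated with a minimal plain-Python sketch for the bivariate case. It is not the IgANets implementation (which is written in C++ on top of LibTorch and operates on whole batches of samples), and it says nothing about relative performance. All names (find_span, basis_funs, eval_kron, eval_stepwise, evaluate) and the example data are hypothetical, and the use of the standard Cox-de Boor recursion for the basis functions is an assumption made for illustration.

```python
from bisect import bisect_right

def find_span(knots, degree, xi):
    """Return i with knots[i] <= xi < knots[i+1] (last span at the right end)."""
    n = len(knots) - degree - 1  # number of basis functions
    if xi >= knots[n]:
        return n - 1
    return bisect_right(knots, xi) - 1

def basis_funs(knots, degree, span, xi):
    """The degree+1 nonzero basis functions at xi (Cox-de Boor recursion)."""
    N = [1.0] + [0.0] * degree
    left = [0.0] * (degree + 1)
    right = [0.0] * (degree + 1)
    for j in range(1, degree + 1):
        left[j] = xi - knots[span + 1 - j]
        right[j] = knots[span + j] - xi
        saved = 0.0
        for r in range(j):
            tmp = N[r] / (right[r + 1] + left[j - r])
            N[r] = saved + right[r + 1] * tmp
            saved = left[j - r] * tmp
        N[j] = saved
    return N

def eval_kron(ctrl, span_u, span_v, Nu, Nv):
    """Non-MemOpt style: assemble the full Kronecker product, then contract."""
    p, q = len(Nu) - 1, len(Nv) - 1
    kron = [a * b for a in Nu for b in Nv]
    coeffs = [ctrl[span_u - p + i][span_v - q + j]
              for i in range(p + 1) for j in range(q + 1)]
    return sum(k * c for k, c in zip(kron, coeffs))

def eval_stepwise(ctrl, span_u, span_v, Nu, Nv):
    """MemOpt style: contract one direction at a time, no Kronecker vector."""
    p, q = len(Nu) - 1, len(Nv) - 1
    total = 0.0
    for i, a in enumerate(Nu):
        row = ctrl[span_u - p + i]
        total += a * sum(b * row[span_v - q + j] for j, b in enumerate(Nv))
    return total

# Hypothetical example data: bicubic spline on clamped knot vectors.
degree = 3
knots = [0, 0, 0, 0, 0.25, 0.5, 0.75, 1, 1, 1, 1]
n = len(knots) - degree - 1
# Greville abscissae; with c_ij = g_i + 2*g_j the spline reproduces xi + 2*eta.
greville = [sum(knots[i + 1:i + degree + 1]) / degree for i in range(n)]
ctrl = [[gi + 2.0 * gj for gj in greville] for gi in greville]

def evaluate(xi, eta, spans=None, mem_opt=False):
    """Evaluate at (xi, eta); pass precomputed spans to skip the index search."""
    su, sv = spans if spans is not None else (
        find_span(knots, degree, xi), find_span(knots, degree, eta))
    Nu = basis_funs(knots, degree, su, xi)
    Nv = basis_funs(knots, degree, sv, eta)
    kernel = eval_stepwise if mem_opt else eval_kron
    return kernel(ctrl, su, sv, Nu, Nv)

samples = [(0.3, 0.77), (0.0, 1.0), (0.5, 0.25)]
# Precompute the contributing indices once; reuse them for every evaluation.
spans = [(find_span(knots, degree, x), find_span(knots, degree, y))
         for x, y in samples]
values = [evaluate(x, y, spans=s, mem_opt=True)
          for (x, y), s in zip(samples, spans)]
```

Because the sample locations stay fixed while the control points change, the spans list can be computed once and reused across evaluations. The trivariate case adds a third span, a third basis vector, and one more factor in the Kronecker product or one more nested contraction.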

The figure below shows the wallclock time in ns/sample for evaluating bicubic (dim=2, order=3) B-spline basis functions on the AMD MI210 GPU of the SURF-ETP system. The overhead for generating the compute graph that enables the computation of derivatives (ReqGrad) is negligible for sufficiently large sampling sets. What does have a significant impact on performance is the precomputation of indices (Precomp) and the assembly of the full Kronecker product instead of using the memory-optimized but computationally more expensive variant of the algorithm.
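For readers unfamiliar with the ns/sample metric, the following is a minimal sketch of how such a number can be obtained on the CPU. It is not the benchmark harness used for the figures; the helper ns_per_sample and the stand-in workload dummy_kernel are hypothetical. On GPU backends, kernels typically execute asynchronously, so a real measurement would also have to synchronize the device before reading the clock.

```python
import time

def ns_per_sample(fn, samples, repeats=5):
    """Best-of-`repeats` wallclock time of fn(samples), in ns per sample."""
    best = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        fn(samples)
        elapsed = time.perf_counter_ns() - start
        best = elapsed if best is None else min(best, elapsed)
    return best / len(samples)

# Hypothetical workload standing in for a B-spline evaluation kernel.
def dummy_kernel(samples):
    return [x * x for x in samples]

for n in (1_000, 10_000, 100_000):
    data = [i / n for i in range(n)]
    print(f"{n:>7} samples: {ns_per_sample(dummy_kernel, data):8.1f} ns/sample")
```

Dividing by the number of samples makes runs with different sampling-set sizes directly comparable, which is why fixed overheads show up as higher ns/sample values for small sets.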

Similar performance trends are observed for NVIDIA A100, H100, and GH200 systems, the last of which was also made accessible through SURF-ETP; cf. the figure below.

The same performance benchmark has also been executed on two CPU-based systems, a 2-socket AMD EPYC 9654 and a 2-socket Intel Xeon Platinum 8360 system. The performance plots given below indicate that no significant difference in performance exists between the memory-optimized and non-optimized variants for the two CPU-based systems.

The next figure shows a direct performance comparison across all systems with memory optimization turned off and precomputation of indices enabled. It shows that the AMD MI210 is mostly on par with an NVIDIA A100 and outperforms the latter and even an NVIDIA H100 for problem sizes beyond 0.5 million sampling points. It also shows that for sufficiently small problem sizes (about 1000 samples), the two CPU-based systems are equally efficient. The plot also shows that the OpenMP parallelization is more effective and consistent for the Intel CPU than for the AMD one.

Contribution of SURF-ETP

Especially in the beginning, ROCm support was experimental in Torch 2.x and required a lot of debugging. The SURF-ETP project was very helpful in this regard. SURF’s helpdesk staff helped me get the ROCm toolchain working and had some helpful tips on how to adapt the CMake scripts to perform the performance evaluation on the AMD MI210 GPU. Also, access to this GPU type through the SURF-ETP project is highly appreciated. 

Conclusions

As the IgANets code is still being developed and extended continuously, it would be highly appreciated if access to the SURF-ETP system could be extended so that the correct functioning of the code after some major refactoring steps can be verified every now and then. At the moment, the SURF-ETP system is the only one available to the PI with a fully functioning ROCm infrastructure (software + MI210 hardware).

References

The performance results from the SURF-ETP project will be shown in forthcoming talks (e.g., at IGA 2024, October 27-30, 2024; ACUD 2024, December 12, 2024; DTE & AICOMAS 2025, February 17-21, 2025) and a journal publication that the PI is currently writing. Earlier performance results for CPUs and NVIDIA V100 GPUs were shown at various international conferences and published in

  • [1] Ji, Y., Möller, M., Verhelst, H.M. "Design Through Analysis," in: Bodnár, T., Galdi, G.P., Nečasová, Š. (eds) Fluids Under Control. Advances in Mathematical Fluid Mechanics. Birkhäuser, Cham, 2024, 303–368 (10.1007/978-3-031-47355-5_5).