Evaluation of Cerebras SDK and CS-2 platform
Earlier this year, SURF joined forces with Cerebras Systems—a pioneering AI technology company—to conduct a technology assessment of the unique CS-2 wafer-scale engine (WSE). The CS-2 exhibits outstanding capabilities; many of its specifications exceed those of other general-purpose processors and GPUs by orders of magnitude, eliminating common bottlenecks in AI and HPC workloads.
The partnership focused on exploring and understanding the potential of the CS-2 for AI and scientific computing workloads. Cerebras generously provided SURF with access to their simulator, documentation, and physical hardware through the Cirrascale Cloud. This trial offered SURF a unique opportunity to delve deep into the exceptional capabilities of the Cerebras engine. The assessment was led by two SURF teams, with a specific focus on high-performance machine learning (HPML) and computing and visualization (HPCV).
Introduction
Cerebras Systems Inc. is an American artificial intelligence company that builds computer systems for complex artificial intelligence deep learning applications. Their key product, the CS-2 wafer-scale engine (WSE), delivers the wall-clock compute performance of many tens to hundreds of graphics processing units (GPUs) or more. In one system less than one rack in size, the CS-2 delivers answers in minutes or hours that would take days, weeks, or longer on large multi-rack clusters of legacy, general-purpose processors.
To understand such unconventional hardware, SURF partnered with Cerebras to conduct a technology assessment focused mainly on exploring the capabilities of Cerebras' unique WSE. The technical evaluation began in February 2023, when Cerebras shared their simulator and documentation and provided trial access to the physical hardware via the Cirrascale Cloud. The trial granted SURF seven 24-hour allocations over three months, providing an opportunity to explore the capabilities of the Cerebras WSE. Two SURF teams, high-performance machine learning (HPML) and computing and visualization (HPCV), engaged in the assessment, driven by their interest in exploring state-of-the-art technologies.
Technology overview
The Cerebras CS-2 WSE stands out with its unique design and dataflow architecture. It comprises 850,000 independent processing elements (PEs) arranged as a two-dimensional rectangular mesh on a single silicon wafer. Each PE consists of a processor, a router, and local tile memory, collectively forming the fabric. Data flows through the mesh of PEs as 32-bit packets called wavelets, which trigger data transformations. Each PE has 48 KB of dedicated memory and supports reads of up to 16 bytes and writes of up to 8 bytes per cycle. The interconnect fabric between PEs offers an injection bandwidth of 16 bytes per core per cycle.
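As a quick sanity check, the per-PE figures multiply out to the wafer-wide totals quoted elsewhere in this assessment (a back-of-the-envelope sketch):

```python
# Back-of-the-envelope check: 850,000 PEs with 48 KB of local tile memory
# each should add up to the CS-2's roughly 40 GB of on-chip memory.
num_pes = 850_000
mem_per_pe_bytes = 48 * 1024  # 48 KB of tile memory per PE

total_gb = num_pes * mem_per_pe_bytes / 1e9
print(f"Total on-chip memory: {total_gb:.1f} GB")  # ~41.8 GB, i.e. ~40 GB
```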
Developing kernels for the CS-2 involves the Cerebras-specific CSL language. Code written in CSL uses tasks to define the operations that PEs perform on data. Data wavelets are bound to tasks based on their color. Tasks can be assigned to rectangular groups of PEs, called rectangles, for execution. In the background, the task scheduler determines the order of task execution during compilation.
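The color-to-task binding can be pictured with a small Python simulation of a single PE. This is not CSL and the names are illustrative; it only sketches the idea that each incoming wavelet carries a color, and the PE dispatches it to whichever task is bound to that color.

```python
# Toy model of color-based task dispatch on one PE (illustrative, not CSL):
# each incoming wavelet carries a color, and the PE runs the task bound
# to that color on the wavelet's payload.

tasks = {}  # color -> task function

def bind_task(color, task):
    """Bind a task to a wavelet color (akin to CSL's task/color binding)."""
    tasks[color] = task

def on_wavelet(color, payload):
    """Dispatch an incoming wavelet to the task bound to its color."""
    return tasks[color](payload)

state = {"acc": 0}
bind_task(0, lambda x: state.__setitem__("acc", state["acc"] + x))  # accumulate
bind_task(1, lambda x: x * 2)                                       # transform

on_wavelet(0, 5)
on_wavelet(0, 7)
print(state["acc"])       # 12
print(on_wavelet(1, 21))  # 42
```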
The Cerebras CS-2 showcases remarkable performance and speedup across various workloads. However, its power consumption requires high machine utilization to achieve energy efficiency. For instance, we estimate that sustained performance exceeding that of 24 NVIDIA H100 GPUs is needed to justify the power draw1. While Cerebras demonstrates superior gains at scale, maintaining these advantages becomes challenging at smaller scales or with complex workloads.
1A direct comparison pits one CS-2 against 60 H100 GPUs. Operating 60 GPUs requires 8 DGX-H100 servers, consuming a total of 80 kW. In contrast, a single CS-2 consumes 28 kW, or approximately 30 kW with supporting servers, which is equivalent to the power usage of around 3 DGX-H100s or 24 GPUs.
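The footnote's arithmetic, under its stated assumptions (8 GPUs per DGX-H100 server, roughly 10 kW per server, and ~30 kW per CS-2 including supporting servers), works out as follows:

```python
import math

gpus_per_dgx = 8
kw_per_dgx = 10.0   # 80 kW across 8 servers, per the footnote
cs2_kw = 30.0       # one CS-2 plus supporting servers

# Servers and power needed to host 60 GPUs
dgx_for_60 = math.ceil(60 / gpus_per_dgx)
print(dgx_for_60, dgx_for_60 * kw_per_dgx)  # 8 servers, 80.0 kW

# GPU count with the same power budget as one CS-2
equivalent_dgx = cs2_kw / kw_per_dgx
print(equivalent_dgx * gpus_per_dgx)        # 24.0 GPUs
```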
Assessment
Artificial Intelligence
Cerebras makes it easy for AI developers by offering standard PyTorch support and several reference model implementations in its Model Zoo GitHub repo. These models only require parameter adjustments to run successfully. Cerebras supports popular models like GPT-3, Llama 2, Falcon, U-Net, and Diffusion Transformers.
Implementing a model from scratch is more challenging, as not all layer types are fully supported by the Cerebras software stack. Less common layers (e.g., 3D convolutional layers) may not be available. As such, a direct one-to-one translation of some PyTorch models is not always possible.
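The porting obstacle amounts to a compatibility check before attempting a translation. A minimal sketch in plain Python; the supported-layer set below is hypothetical, not Cerebras' actual support matrix:

```python
# Sketch: scan a model's layer types against a supported set before
# attempting a port. The supported set here is hypothetical, not the
# actual Cerebras support matrix.
SUPPORTED_LAYERS = {"Linear", "Conv2d", "LayerNorm", "Embedding", "GELU"}

def unsupported_layers(layer_types):
    """Return layer types that would block a direct 1-to-1 translation."""
    return [t for t in layer_types if t not in SUPPORTED_LAYERS]

model_layers = ["Embedding", "Conv3d", "Linear", "GELU"]  # e.g. a video model
print(unsupported_layers(model_layers))  # ['Conv3d']
```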
The Cerebras runtime is not currently accessible to the public. Although a simulator is provided for kernel implementation and testing, CSL is a low-level language, so adding custom layers requires significant programming effort and, in some cases, assistance from Cerebras developers.
Scientific Computing
The Cerebras platform, originally designed for AI workloads, has garnered interest in the realm of scientific computing [1, 2]. Its unique design mitigates the memory-bound bottlenecks often encountered in scientific computing problems. However, the WSE architecture is not suitable for all scientific applications and presents its own set of limitations and tradeoffs. One key advantage of the CS-2 is that its 40 GB of memory is on-chip, enabling 20 PB/s of memory bandwidth and making it ideal for applications with high IO relative to compute. However, 40 GB is on par with a GPU's external memory and often insufficient for large-scale problems. Cerebras addresses this shortcoming with its weight-streaming configuration, which pairs the hardware with up to 12 TB of external memory; when IO is carefully overlapped with compute, this enables CS-2 systems to exhibit both high memory capacity and high bandwidth.
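The overlap that weight streaming relies on is the classic double-buffering pattern: while one block of weights is being processed, the next block is already being fetched from external memory. A generic Python sketch of the pattern (not the Cerebras API; all names are illustrative):

```python
import threading
import queue

def stream_and_compute(blocks, fetch, compute):
    """Double buffering: prefetch the next weight block while computing
    on the current one, hiding IO latency behind compute."""
    q = queue.Queue(maxsize=2)  # small buffer: at most 2 blocks in flight

    def producer():
        for b in blocks:
            q.put(fetch(b))     # simulated IO: load block from external memory
        q.put(None)             # sentinel: no more blocks

    threading.Thread(target=producer, daemon=True).start()

    results = []
    while (block := q.get()) is not None:
        results.append(compute(block))  # compute overlaps the next fetch
    return results

out = stream_and_compute(range(4), fetch=lambda b: [b] * 3,
                         compute=lambda blk: sum(blk))
print(out)  # [0, 3, 6, 9]
```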
The Cerebras architecture is naturally suited for stencil computations, seismic processing, and other algorithms that map naturally to a grid-based layout. Implementing more complex applications involving dynamic communication patterns across many PEs is also possible, though it requires more engineering effort. The Cerebras SDK includes implementations of hypersparse SpMV and histogram benchmarks. Additionally, Argonne National Laboratory has implemented a macroscopic cross-section lookup kernel for Monte Carlo neutron transport that involves a more complex and dynamic communication pattern across PEs.
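Why stencils map so well: each grid cell can live on its own PE, which updates its value using only its immediate neighbours each step, matching the mesh's nearest-neighbour communication. A minimal 1D illustration in plain Python, with one list element standing in for one PE:

```python
# Minimal 1D stencil (Jacobi relaxation): each interior cell is replaced
# by the average of its two neighbours. On the WSE, each cell would map
# to one PE exchanging values only with adjacent PEs in the mesh.
def jacobi_step(u):
    return [u[0]] + [(u[i - 1] + u[i + 1]) / 2
                     for i in range(1, len(u) - 1)] + [u[-1]]

u = [0.0, 0.0, 4.0, 0.0, 0.0]
for _ in range(2):
    u = jacobi_step(u)
print(u)  # [0.0, 0.0, 2.0, 0.0, 0.0]
```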
Another issue pertains to precision compatibility. The focus on AI means the CS-2 supports FP32 and FP16 operations but not FP64, so it will not be a good fit for certain HPC applications. Nevertheless, this seems an easily addressable design choice rather than an inherent limitation of the architecture.
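The precision gap is easy to demonstrate with Python's standard library, which can round-trip a double-precision value through the IEEE 754 half-precision format (`struct` format code `'e'`):

```python
import struct

def to_fp16(x):
    """Round-trip a Python float (FP64) through IEEE 754 half precision."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

x = 1.0 / 3.0
h = to_fp16(x)
print(x)           # 0.3333333333333333 (FP64: ~16 decimal digits)
print(h)           # 0.333251953125     (FP16: ~3 decimal digits)
print(abs(x - h))  # error ~8e-5, too coarse for many FP64-dependent solvers
```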
Finally, low level programming of the CS-2 is accessed via the Cerebras SDK and its CSL programming language. These tools are geared toward experienced HPC developers and may pose a challenge to generalist programmers.
Conclusion
In summary, the Cerebras system exhibits outstanding potential. Many of the device's specifications, such as peak performance and memory throughput, exceed those of other general-purpose processors by orders of magnitude, eliminating common bottlenecks. Conversely, other features, such as the on-chip memory size, remain comparable, limiting the problem size without the use of external memory.
Regarding usability, Cerebras has deployed a new API and a proprietary SDK. However, the stack remains relatively immature for a production system oriented toward the research community. Nevertheless, if Cerebras releases its software stack in the future, further exploration of the machine's potential can be pursued, allowing a whole new range of scientific applications to be tested.
Based on our assessment, we are excited by the potential of Cerebras CS-2 systems to unlock challenging AI and scientific problems. As next steps, SURF plans to work with its members to identify use cases for future collaboration. Additionally, SURF intends to provide access to the CS-2 through the SURF Research Cloud soon.
References
- [1] Kamil Rocki et al. "Fast Stencil-Code Computation on a Wafer-Scale Processor." In Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC '20), IEEE Press, Article 58, 1–14, 2020 (https://dl.acm.org/doi/10.5555/3433701.3433778).
- [2] Woo, Mino et al. "Disruptive Changes in Field Equation Modeling: A Simple Interface for Wafer Scale Engines." ArXiv, 2209.13768, 2022 (https://doi.org/10.48550/arXiv.2209.13768).