Since its inception, the Graphics Processing Unit (GPU) has shown promise as an accelerator for tasks beyond graphics rendering. While the GPU in a gaming PC and one used as a general-purpose accelerator share a considerable base of features, there are many benefits to optimizing these cards specifically for compute workloads. The ability to execute vast numbers of simple calculations in parallel dramatically improves the performance of any workload built on repetitive, data-parallel operations. That promise is now coming of age as GPUs guide the future of supercomputing – bringing us into the Exascale Era.
Hardware capabilities are the foundation for a great supercomputer
AMD Instinct™ accelerators were designed from the outset for compute-intensive applications. Though AMD CDNA™ architecture evolved from the gaming-focused AMD RDNA™ architecture, it was developed with a particular focus on delivering ground-breaking acceleration to fuel the convergence of HPC, AI, and machine learning. Performance per watt also improved, an important consideration because traditional GPUs can be power hungry, particularly when more than one is installed in a node. Communication between multiple accelerators in a system is often a point of contention too, so AMD Infinity Fabric™ provides an ultra-rapid interconnect between GPUs, enabling them to work together more seamlessly when more than one is installed in a system. The ROCm™ software environment was created to provide an open platform for application development, along with the HIP programming environment, which simplifies porting applications between platforms.
A GPU purpose-built for compute acceleration and multi-device interconnectivity within a node was just the beginning, however. Now that the Exascale barrier has been broken, the challenges of scaling across thousands of nodes have entered a whole new dimension. The Frontier supercomputer at Oak Ridge National Laboratory (ORNL) required several key innovations to make this possible. Frontier is a massive HPE Cray EX system that incorporates 9,408 nodes, each with an optimized 3rd Gen AMD EPYC™ CPU with 64 cores and four AMD Instinct™ MI250X accelerators. Harnessing Frontier’s full potential of more than 1.5 exaflops of double-precision performance entails a unique approach, because no supercomputer has ever had this many GPUs.
Chip-level IO is key for great performance
“It's an unprecedented scale,” says Nicholas Malaya, Principal Member of Technical Staff at AMD. “We designed Frontier from the beginning to scale very well. One of the big hardware innovations is that the network controllers are attached directly to the GPU.” The CDNA 2 architecture of the AMD Instinct MI250X GPU introduces crucial enhancements that specifically address the scale now available with supercomputers like Frontier. Where CDNA employed PCI Express® to connect to the host computer, with three AMD Infinity Fabric links to communicate between GPUs on the same node, CDNA 2 expands the use of Infinity Fabric alongside additional communication capabilities. 3rd Gen Infinity Fabric connects the two Graphics Compute Dies (GCDs) within the GPU itself. It can also be used, as before, to communicate between GPUs within a node. With the AMD Instinct MI250X, however, Infinity Fabric is also employed to talk to the host CPU. Finally, there is a PCI Express interconnect attaching a built-in 200 Gbit/s network interface directly to the GPU.
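Because each MI250X exposes its two GCDs as separate devices, software sees more GPUs than there are physical cards. Below is a minimal sketch of how this looks from the ROCm side, using standard HIP runtime calls to enumerate devices and query peer access; the file name and build line are illustrative, not Frontier-specific.

```cpp
// topology_check.cpp - a minimal sketch (file name and build line are
// illustrative): each GCD of an MI250X enumerates as its own HIP device.
// Build: hipcc topology_check.cpp -o topology_check
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int deviceCount = 0;
    hipGetDeviceCount(&deviceCount);  // a node with four MI250X cards reports eight devices
    printf("visible HIP devices: %d\n", deviceCount);

    // Query which device pairs can access each other's memory directly
    // (over Infinity Fabric on CDNA 2 hardware).
    for (int src = 0; src < deviceCount; ++src) {
        for (int dst = 0; dst < deviceCount; ++dst) {
            if (src == dst) continue;
            int canAccess = 0;
            hipDeviceCanAccessPeer(&canAccess, src, dst);
            printf("device %d -> device %d peer access: %s\n",
                   src, dst, canAccess ? "yes" : "no");
        }
    }
    return 0;
}
```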
Massive inter-node IO at the cluster level
“Locally the GPUs talk through Infinity Fabric,” explains Malaya. “But if they need to communicate with another node, of which there are more than 9,000 in Frontier, then this takes place through the built-in networking via an HPE Slingshot interconnect, which is an Ethernet-based high-speed network. Within a node, everything is coherent. All the GPUs can communicate directly with their shared memory through our Infinity Fabric. Off the node, communication is via industry standard distributed programming models, such as MPI.”
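In practice, the distributed programming model Malaya describes looks like ordinary MPI code, except that device pointers can be handed to MPI directly. A minimal sketch, assuming a GPU-aware (ROCm-enabled) MPI build; buffer sizes and ranks are illustrative.

```cpp
// gpu_aware_mpi.cpp - a minimal sketch of GPU-aware point-to-point messaging,
// assuming a ROCm-enabled MPI build. The device pointer goes straight to MPI;
// no staging copy through host memory is written by the application.
#include <hip/hip_runtime.h>
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;
    double* d_buf = nullptr;
    hipMalloc((void**)&d_buf, n * sizeof(double));  // buffer lives in GPU HBM

    if (rank == 0) {
        std::vector<double> h(n, 1.0);
        hipMemcpy(d_buf, h.data(), n * sizeof(double), hipMemcpyHostToDevice);
        // Send directly from device memory; a GPU-aware MPI routes this over
        // the interconnect without bouncing through the CPU.
        MPI_Send(d_buf, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(d_buf, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    hipFree(d_buf);
    MPI_Finalize();
    return 0;
}
```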
Aside from a direct link to HPE’s Slingshot networking, the AMD Instinct MI250X also has a coherent link to the CPU thanks to its Infinity Fabric connectivity to the processor, whereas other GPUs may only connect with other GPUs on the system. “It's very hard for people to get their work onto the GPU,” says Malaya. “A lot of applications run on the CPU and the coherence of the MI250X enables researchers to get their work quickly onto the GPU. It saves them development time, which can be more expensive than the hardware.”
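The practical effect of coherence is that developers can move work onto the GPU without first writing explicit staging copies. The sketch below illustrates the general idea using HIP managed memory as a stand-in: one allocation is touched by CPU code and then by a GPU kernel, with no hipMemcpy in between. It illustrates the programming model only; it is not MI250X-specific code.

```cpp
// coherent_access.cpp - an illustration of the programming model, using HIP
// managed memory as a stand-in for hardware coherence: one allocation is
// touched by CPU code and then by a GPU kernel, with no explicit hipMemcpy.
#include <hip/hip_runtime.h>
#include <cstdio>

__global__ void scale(double* x, int n, double a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1024;
    double* x = nullptr;
    hipMallocManaged((void**)&x, n * sizeof(double));  // visible to CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = 1.0;            // CPU writes in place

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0);        // GPU updates the same memory
    hipDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                       // CPU reads the result: 2.0
    hipFree(x);
    return 0;
}
```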
Taking advantage of IO capabilities with software support
Harnessing these new features has necessitated additional software support. “We have extended ROCm to enable GPU-aware messaging, meaning that you can now send messages from the GPUs directly onto the network,” says Malaya. “Frontier is the first computer of its kind in history to use that. There's a lot of software enablement required, focusing on two main areas. The first, the Message Passing Interface (MPI), is the standard approach in HPC to send messages around the network. We've extended ROCm to enable this very low latency link between the GPU and the network, ensuring better scaling than ever at large system sizes. But we’re also extending this to our ROCm Collective Communication Library (RCCL). This interfaces with the HPE software, so we must work with our partners very closely to deliver it. But it's a key software library for scaling out machine learning and artificial intelligence (AI/ML) workloads to thousands of compute nodes.”
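RCCL exposes the same collective API as NCCL, so scaling out an AI/ML workload typically comes down to initializing a communicator and calling collectives such as all-reduce. A minimal sketch, assuming one MPI rank per GPU and using MPI only to distribute RCCL's unique id; this is a common bootstrapping pattern, not code taken from Frontier's software stack.

```cpp
// rccl_allreduce.cpp - a minimal sketch of an RCCL all-reduce, assuming one MPI
// rank per GPU and using MPI only to distribute RCCL's unique id.
#include <hip/hip_runtime.h>
#include <rccl/rccl.h>  // RCCL exposes the NCCL-compatible API
#include <mpi.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nranks = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    hipSetDevice(rank);  // assumes a single-node launch with one rank per GCD

    // Rank 0 creates the communicator id; everyone else receives it over MPI.
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);

    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    const size_t n = 1 << 20;
    double *d_in = nullptr, *d_out = nullptr;
    hipMalloc((void**)&d_in, n * sizeof(double));
    hipMalloc((void**)&d_out, n * sizeof(double));
    hipMemset(d_in, 0, n * sizeof(double));  // placeholder data

    // Sum the buffers of all ranks; RCCL picks the fastest available path,
    // e.g. Infinity Fabric between GPUs on the same node.
    ncclAllReduce(d_in, d_out, n, ncclDouble, ncclSum, comm, /*stream=*/0);
    hipStreamSynchronize(0);

    ncclCommDestroy(comm);
    hipFree(d_in);
    hipFree(d_out);
    MPI_Finalize();
    return 0;
}
```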
Scientific applications are increasingly harnessing the potential of AI/ML, which can gain huge performance benefits from GPU acceleration. “Frontier is not only the first Exascale supercomputer and the world's greatest computational instrument, but also the premier platform for machine learning and artificial intelligence training,” says Malaya. “Frontier was also ranked first on the June 2022 HPL-AI benchmark, which measures raw performance in numerical precisions important for AI/ML. If you look at the largest systems previously used by research organizations for machine learning, they typically only have about 1,000 GPUs. Frontier’s 37,000 GPUs will enable a step change in AI/ML model training. For example, Google's LaMDA conversational technology took 51 days to train on 1,000 Tensor Processor Units. With 37,000 GPUs operating in parallel, it’s possible this could be reduced to as little as two days. This would be a disruptive innovation, really opening vast new opportunities.”
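A back-of-the-envelope estimate shows where that figure comes from: assuming comparable per-device throughput and near-linear scaling, 51 days × 1,000 / 37,000 ≈ 1.4 days, or roughly two days once real-world scaling overheads are taken into account.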
The built-in networking interface on the AMD Instinct MI250X is crucial to delivering this new level of performance for AI/ML. “The hardware networking is unique to MI250X,” says Malaya. “We support ROCm and MPI on all our Instinct GPUs, but the MI250X deployed in Frontier is the one that has the network connected directly to the GPU. This reduces latency. There are fewer hops to make between the CPU and the GPU and that enables these messages to be sent all around the network in a more efficient way. It allows you to scale to more GPUs than ever before. AMD is ahead of the curve that way.”
The software layer is also essential to making this work, through RCCL and enabling MPI within the ROCm drivers. “You can now rewrite your software to be even more efficient than ever before,” says Malaya. “We have refactored our own code, of course, but scientific teams are learning how to refactor their codes to take advantage of the system architecture too. This works at two levels. One is on the GPU. They're finding that GPUs are so computationally powerful they can do a lot more work than they've ever done before. But the other big one is thinking about how they move data across these machines, because data movement accounts for a huge amount of the power used on the system. Minimizing that data movement is very important to take full advantage of the system. We're seeing very useful research papers as people learn about the implications.”
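One common refactoring pattern behind the data-movement point is overlapping transfers with computation, so the GPU never sits idle waiting on the interconnect. A minimal sketch using HIP streams; the kernel and chunk sizes are placeholders, not an AMD-published recipe.

```cpp
// overlap_sketch.cpp - a minimal sketch of overlapping data movement with
// computation using HIP streams; the kernel and chunk sizes are placeholders.
#include <hip/hip_runtime.h>

__global__ void work(double* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0 + 1.0;
}

int main() {
    const int chunks = 4;
    const int n = 1 << 20;
    double* h = nullptr;
    double* d = nullptr;
    hipHostMalloc((void**)&h, chunks * n * sizeof(double));  // pinned memory for async copies
    hipMalloc((void**)&d, chunks * n * sizeof(double));
    for (int i = 0; i < chunks * n; ++i) h[i] = 1.0;

    hipStream_t s[2];
    hipStreamCreate(&s[0]);
    hipStreamCreate(&s[1]);

    // Alternate chunks between two streams, so one chunk's transfers can
    // proceed while the previous chunk's kernel is still running.
    for (int c = 0; c < chunks; ++c) {
        hipStream_t st = s[c % 2];
        hipMemcpyAsync(d + c * n, h + c * n, n * sizeof(double), hipMemcpyHostToDevice, st);
        work<<<(n + 255) / 256, 256, 0, st>>>(d + c * n, n);
        hipMemcpyAsync(h + c * n, d + c * n, n * sizeof(double), hipMemcpyDeviceToHost, st);
    }
    hipDeviceSynchronize();

    hipStreamDestroy(s[0]);
    hipStreamDestroy(s[1]);
    hipFree(d);
    hipHostFree(h);
    return 0;
}
```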
Getting ready for groundbreaking new research
This ability to accelerate time to science is already starting to produce results. Two 2022 finalists for the Gordon Bell Prize, awarded for outstanding achievement in high-performance computing (HPC), Ramakrishnan Kannan and Jean-Luc Vay, based their entries partly or wholly on work with Frontier. The Center for Accelerated Application Readiness (CAAR) was also created to ready scientific applications for Frontier, such as software for cosmological research, weather forecasting, and molecular dynamics. Test systems of 100 nodes equipped with AMD Instinct MI250X accelerators have been used to trial refactored code ahead of deployment. A recent Gordon Bell finalist at SC22 showcased the first exaflop demonstration of biomedical knowledge graph analytics, where Frontier is expected to deliver a 7x performance improvement on COAST over Summit at full scale1. Overall, the shift to harnessing the power of the GPU has been the essential factor in taking full advantage of what Frontier offers the scientific community.
“Less than one per cent of Frontier’s floating-point calculations come from the CPUs,” says Malaya. “If you really want to do a lot of compute, you have to use the GPUs. It's not optional. In general, more and more compute and machine learning workloads are moving to the GPU. Anything you can do to lower the barrier to entry of getting people effectively running on the GPUs pays a huge dividend.” One area of particular benefit is the size of the datasets that can be loaded. “There's 128GB of HBM on each GPU on Frontier. The previous supercomputer at ORNL had 16GB of memory per GPU, so Frontier with AMD Instinct has eight times more GPU memory. That allows you to load much larger datasets, and you don't have to wait for them to stream into memory. You can move your whole workload over.
“No matter how much we give developers for HPC and machine learning workloads, they'll fill the GPU memory as fast as they can. That's a big limit for them. In the past, if you ran out of memory capacity, you couldn’t do the calculation.” Frontier negates this issue by delivering a step change in the amount of memory per GPU. Then, with each GPU connected coherently to node memory and the memory of the other GPUs on that node, the accessible space is further expanded. The built-in networking provides distributed access to the memory of the other 9,000+ nodes. The addressable space is enormous – nearly 5 petabytes (9,408 nodes × 4 GPUs × 128GB ≈ 4.8PB). “This enables new science on Frontier that was inaccessible in the past,” explains Malaya. The huge memory space will particularly unleash AI/ML workloads, which benefit so much from the largest possible models. Researchers can increase parameter counts by orders of magnitude, and they are finding that models can now grow to the point of being superhuman in areas like vision and language modeling.
With Frontier only just going online this year, researchers have only begun to tap the potential it makes available. The combination of AMD hardware innovation enabling GPU performance at scale with the software required to support it has shifted the paradigm of research. Not only can more work be completed in less time, but emerging AI/ML applications can be deployed that simply weren’t feasible before. The scaling capabilities of AMD Instinct GPUs are not just an evolution in performance – they enable a qualitative leap in what is possible with computational research.
References: