On November 18, 2024, the Top500 announced that the world has a new fastest supercomputer: El Capitan. The third system to surpass the exascale threshold, El Capitan achieved 1.742 exaflops on the High Performance Linpack benchmark, ahead of the now-second-place Frontier, which clocked in at 1.353 exaflops.
El Capitan is built with AMD Instinct™ MI300A APUs, next-generation accelerated processing units that unify CPU and GPU chiplet technologies in a single package with shared HBM3 memory to achieve breakthroughs in AI and HPC.
Both El Capitan and Frontier sit under the umbrella of the US Department of Energy (DOE). Housed at Lawrence Livermore National Laboratory in California, El Capitan was the result of a joint effort by three different federal laboratories: Livermore, Los Alamos and Sandia. Funding for the system came from the Advanced Simulation and Computing (ASC) project of the National Nuclear Security Administration (NNSA).
“El Capitan is crucial to the National Nuclear Security Administration’s core mission and significantly bolsters our ability to perform large ensembles of high-fidelity 3D simulations that address the intricate scientific challenges facing the mission,” said ASC director Rob Neely. He added, “I can’t wait to demonstrate the capabilities to our sponsors and supporters. We are also committed to investing in AI on El Capitan and will be using the system for large-scale AI training and inference to make our calculations faster, more efficient, and potentially more accurate. Having a capability that is dialed in for both modeling and simulation and AI workloads, all in one system, is very exciting.”
The result of more than five years of effort, El Capitan was preceded by three prototype systems, which, like El Capitan, were named for landmarks at Yosemite National Park: RZVernal, Tioga and Tenaya. All three achieved fast enough results to rank on the Top500 list of fastest supercomputers in the world.
El Capitan also has a sister system, Tuolumne, named for a meadow in Yosemite. While El Capitan will primarily support the NNSA’s national security mission running classified workloads, Tuolumne will be available for a wide variety of non-classified scientific research.
Photo courtesy of Lawrence Livermore National Laboratory
Building the World’s Fastest Computer
El Capitan leverages the AMD Instinct MI300A, which integrates 24 AMD "Zen 4" x86 CPU cores with 228 AMD CDNA™ 3 high-throughput GPU compute units and 128 GB of unified HBM3 memory that presents a single shared address space to the CPU and GPU, all connected by the coherent 4th Gen AMD Infinity Fabric™ architecture. In total, the system contains more than 10 million accelerator compute units, which enable its exascale performance.
“Leveraging the AMD Instinct MI300A APUs, we’ve built a system that was once unimaginable, pushing the absolute boundaries of computational performance while maintaining exceptional energy efficiency,” said Bronis R. de Supinski, LLNL’s chief technology officer for Livermore Computing. “With AI becoming increasingly prevalent in our field, El Capitan allows us to integrate AI with our traditional simulation and modeling workloads, opening new avenues for discovery across various scientific disciplines.”
The system runs on the HPE Cray Supercomputing EX platform. It also makes use of the HPE Slingshot interconnect for fast data transfer, 5.4375 petabytes of memory and custom-built, extremely fast local storage. In addition, it employs a 100% fanless direct liquid-cooling system, which helps make it extremely energy efficient. In fact, El Capitan ranked 18th on the Green500 list after achieving 58.89 gigaflops per watt.
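As a back-of-the-envelope check on those figures, dividing the 1.742-exaflop Linpack result by the 58.89 gigaflops-per-watt Green500 rating implies the power the system drew during the benchmark. This is a rough sketch from the two numbers quoted above, not an official measurement:

```python
# Rough estimate of El Capitan's power draw during the HPL run,
# derived from the two figures quoted above (not an official measurement).
hpl_exaflops = 1.742            # High Performance Linpack result, in exaflops
efficiency_gf_per_watt = 58.89  # Green500 energy efficiency, gigaflops/watt

# Convert to a common unit (gigaflops), then divide to get watts.
hpl_gigaflops = hpl_exaflops * 1e9  # 1 exaflop = 1e9 gigaflops
power_watts = hpl_gigaflops / efficiency_gf_per_watt

print(f"Implied power draw: {power_watts / 1e6:.1f} MW")  # roughly 29.6 MW
```

That works out to approximately 30 megawatts, illustrating why the fanless direct liquid cooling matters at this scale.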
“This tremendous accomplishment, years in the making and the result of tireless efforts by hundreds of dedicated employees in this large collaborative team, is a testament to the Laboratory’s leadership in driving scientific discovery,” said LLNL Lab Director Kim Budil. “It continues a legacy of supercomputing excellence that spans more than 70 years.”
As a result of that legacy, the US government now owns the three fastest supercomputers in the world, according to the Top500.
The Exascale Quest
For the DOE, the push to achieve exascale performance was about far more than bragging rights. The Department had already deployed petascale systems that had achieved substantial scientific, economic, and national security benefits. But other countries were catching up.
The DOE launched the Exascale Computing Project to help maintain that advantage. The project website explains:
“To maintain leadership and to address future challenges in economic impact areas and threats to security, the United States made a strategic move in HPC—a grand convergence of advances in co-design, modeling and simulation, data analytics, machine learning, and artificial intelligence. The Exascale Computing Project (ECP) was formed to drive this effort in support of the world’s first capable exascale ecosystem.”
More specifically, the ECP focused on building a supercomputer that could support five key areas of critical interest to the country:
- National security: The US needed a system capable of running advanced simulations to safeguard its stockpile of nuclear weapons, simulate nuclear weapon performance and respond to hostile threats.
- Scientific discovery: Exascale systems can also probe fundamental questions about the universe, providing models for particle physics, analyzing proteins and molecular structures, modeling fusion plasma and enabling research in chemistry and materials science.
- Economic security: The DOE wanted to apply exascale computing power to additive manufacturing, urban planning, power grid design and seismic hazard risk.
- Energy security: With the system, the department hopes to design modular reactors, improve the efficiency of wind turbines and combustion engines, design new materials for use in fission and fusion reactors, improve carbon capture and waste disposal, and analyze stress-resistant crops.
- Healthcare: The systems will contribute to cancer research, enabling models that can predict how different cells will respond to different drugs.
Frontier and El Capitan were the culmination of this quest and the first step in many other trailblazing projects.
Blazing New Trails
El Capitan’s primary mission centers on managing the safety of the nation’s nuclear weapons stockpile. “El Capitan’s introduction continues the capability advancement needed to sustain our stockpile without returning to explosive nuclear testing,” explained Jill Hruby, DOE undersecretary for nuclear security and NNSA administrator. “This computational capability, backed by decades of data, expertise and code development, is the heart of science-based stockpile stewardship.”
With that goal in mind, NNSA teams have already published a paper about their efforts to adapt a physics modeling tool called MARBL to run in parallel on GPUs.
“The big focus of this paper was supporting multi-physics — specifically multi-group radiation diffusion and thermonuclear burn, which are involved in fusion reactions — and the coupling of all of that with the higher-order finite-element moving mesh for simulating fluid motion,” said principal investigator Rob Rieben. “There is a lot you have to do in terms of programming, optimizing kernels and balancing memory and turning your code into a GPU-parallel code, and we were able to accomplish that.”
El Capitan and Tuolumne will also support advanced AI research. For example, researchers at LLNL are investigating the trustworthiness of large language models (LLMs) like ChatGPT. They published a paper that analyzed 16 different LLMs across eight dimensions: accountability, fairness, machine ethics, privacy, robustness, safety, transparency, and truthfulness. In the end, the researchers determined that none of the models were trustworthy. However, by highlighting where current LLMs fail, the study shows developers ways to improve their models.
The Era of Exascale
El Capitan and the other exascale systems are truly ushering in a new era — an era where we will be able to find more solutions to the world’s most pressing problems, faster and at scale.
“El Capitan is the result of years of engineering and innovation, and this achievement is a testament to our years-long collaboration with NNSA, LLNL and AMD as we push the boundaries of what’s possible in computing,” said HPE’s Senior Vice President and General Manager of HPC & AI Infrastructure Solutions Trish Damkroger. “Built to address tomorrow’s challenges today, we can’t wait to see the incredible discoveries that will come from this machine, propelling our society forward while overcoming almost any obstacle. El Capitan is truly the pinnacle of supercomputing.”
Learn more about AMD and exascale computing in The Journey to Exascale and at AMD.com/HPC.