AMD knows a lot about supercomputing. In fact, AMD-powered supercomputers hold 2 of the top 5 spots on the Top500 list – including the first system to bring the world into the era of exascale (a computer capable of performing a quintillion, or 10^18, floating point operations per second), holding an HPL (High Performance LINPACK) score that is nearly triple that of the second-place listing. Fewer people know, though, how much work it takes to get to the point of flexing a system’s computing muscle on a worldwide stage. A ton of work happens behind the scenes before any of these massive systems can “open its doors” to production workloads. One of the main milestones for any supercomputer is called acceptance. Let’s break down what leads a supercomputer to be accepted and find out what that term really means. Then, learn how AMD contributes to building the most performant systems in the world and some of the science they enable.
Defining the process – ac·cept·ance
In short, acceptance defines the point at which a fully deployed supercomputer is stable, its performance is scalable, and the system is ready to be unleashed to do production research. Defining it so simply is misleading, though, as this process can generally take many months. These systems are built at their full scale for the first time on-site, completely from the ground up; there is no pre-integration of the network, file system, hardware (compute units), or cabinets. A ton of optimization and testing is done during this process – the system needs to prove it can run applications at scale and be performant. For example, the LUMI system has tens of thousands of components – over 10,000 AMD Instinct™ MI250X GPUs (Graphics Processing Units) alone – that must work in harmony not only with each other but also with the rest of the system (the file system, network, and other components mentioned above) to showcase stability and scalability.
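One way to make “its performance is scalable” concrete is a strong-scaling efficiency check: run the same problem on more nodes and compare the measured speedup with the ideal one. The sketch below is purely illustrative. The function names, the sample timings, and the 80% threshold are assumptions made for this example, not the criteria used to accept LUMI, Frontier, or any other system.

```python
def strong_scaling_efficiency(base_nodes, base_seconds, nodes, seconds):
    """Measured speedup divided by ideal speedup for a fixed problem size."""
    if min(base_nodes, nodes) <= 0 or min(base_seconds, seconds) <= 0:
        raise ValueError("node counts and runtimes must be positive")
    speedup = base_seconds / seconds   # how much faster the larger run was
    ideal = nodes / base_nodes         # how much faster it could be at best
    return speedup / ideal


def scales_well(runs, threshold=0.8):
    """Check every run against the first (baseline) run.

    runs: list of (nodes, seconds) pairs; the first entry is the baseline.
    threshold: hypothetical pass mark, chosen only for illustration.
    """
    base_nodes, base_seconds = runs[0]
    return all(
        strong_scaling_efficiency(base_nodes, base_seconds, n, s) >= threshold
        for n, s in runs[1:]
    )


# Made-up timings for one application at three node counts.
sample_runs = [(64, 1000.0), (128, 520.0), (256, 280.0)]

for nodes, seconds in sample_runs[1:]:
    eff = strong_scaling_efficiency(*sample_runs[0], nodes, seconds)
    print(f"{nodes} nodes: {eff:.0%} of ideal speedup")

print("scales well:", scales_well(sample_runs))
```

A real acceptance campaign would cover many applications and also track stability over long runs; this only shows the arithmetic behind a single scalability number.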
As mentioned, it takes a lot of work between teams of people to get these systems ready for production. The work doesn’t stop after the processors and accelerators are shipped and installed – AMD, hardware vendors, site directors, early users, and others collaborate in a joint development effort to get the system stable and its performance optimized. During this very integrated process, Center of Excellence (CoE) teams work tirelessly to accelerate mission-critical applications at multiple facilities simultaneously. These are among the largest systems on the planet (e.g., 37k GPUs in the Frontier system) and cannot simply be simulated in a single lab. On-site triage teams and external support teams focus on extracting the highest performance possible from the system; a true “all hands on deck” type of process. With this many teams involved, it can take some time to get the most performance out of the system. This step is all about collaboration between disparate groups and different companies.
The sheer scale of these massive systems means that they are not without challenges. That makes the ability to respond quickly and work together with key stakeholders especially important at this stage. Teams need the expertise to solve said challenges quickly to allow research to continue. This is an area where AMD shines – our engineering prowess is prominently shown as technical teams iron out bottlenecks, working closely with technical partners and customers alike.
All of the hard work to get the system fully deployed and open to science is done to enable groundbreaking research that helps address some of humankind’s largest challenges. The incredible scale at which supercomputers like Frontier and LUMI deploy applications demonstrates the capability of the AMD Instinct accelerators, EPYC processors, and the AMD ROCm™ software stack for HPC (High Performance Computing) and AI (Artificial Intelligence) workloads. Below are just a few examples of research that has already started on the Frontier and LUMI systems – the world’s fastest and third-fastest supercomputers – powered by AMD Instinct accelerators.
Create a program that can detect and diagnose cancer growth from digital images. Train AI models to analyze and simulate computational pathology, with the goal of improving clinical workflow efficiency and diagnostic quality and creating personalized diagnosis and treatment plans.
Develop an incredibly detailed and high-resolution digital replica of Earth to monitor and predict interactions between natural phenomena and human activities.
Climate DT (Digital Twin)
A digital twin model of Earth that focuses on impacts on the planet’s atmosphere.
Perform simulations to predict loss of energy in fusion plasmas and attempt to optimize plasma performance for the next generation of fusion energy reactors.
Build multiphysics models of stellar explosions to understand how space and time are warped by gravitational waves, how neutrinos and other subatomic particles are formed in these explosions, and how the nuclear elements are synthesized.
Find ways to replace fossil fuel-derived hydrocarbons with ones that can be produced from biomass.
Why are supercomputer sites choosing AMD?
AMD is widely known as a hardware innovator in the semiconductor business. It is no secret that accelerators like the AMD Instinct MI250 GPU look impressive on the spec sheet. However, there is another thing that researchers and scientists find equally – if not more – important: the software stack. AMD ROCm is an open-source software ecosystem built around capturing and sharing improvements with the broader community.
Collaboration with the open-source community is a driving force behind AMD ROCm platform innovations. This industry-differentiating approach to accelerated compute and heterogeneous workload development gives our users incredible flexibility, choice, and platform autonomy. Tools, guidance, and insights are shared freely across the AMD ROCm GitHub community and forums. Investments by AMD in the newly revamped documentation site show our strong commitment to the developer community, where software plays a key role in harnessing the power of AMD Instinct GPUs. This approach to software development has allowed the AMD ROCm stack to mature quickly; there are over 90 applications in the application catalog today.
The hardware engineering prowess of AMD, combined with an open-source software ecosystem that pushes advancements in both HPC and AI, gives supercomputers powered by AMD processors and accelerators the performance to rank among the most powerful systems on the planet.