Leadership HPC Performance with 5th Generation AMD EPYC Processors

Robert_Hormuth · 01-22-2025

In November 2024, AMD showcased its ongoing high-performance computing (HPC) leadership at Supercomputing 2024 by powering the world’s fastest supercomputer for the sixth straight Top500 list. (source)

With over 500 performance world records and 200+ world records in HPC technical apps as of October 10, 2024, including modeling and simulation, intensive computation, floating point performance & HPC energy efficiency, AMD EPYC™ CPUs are well-established as a leader in enabling HPC performance.

With the recently launched 5th Gen AMD EPYC processors, which scale from 8 cores up to 192 cores per CPU, our customers can achieve leadership performance.

First of all, the “Zen 5” architecture used in the AMD EPYC 9005 series provides exceptional IPC uplift for server CPUs. But what is IPC, and why is it important? There are two ways to improve performance: faster clocks (frequency) and doing more work during each clock cycle, which is measured as IPC (instructions per clock). For the EPYC 9005 series, we did both:
1) AMD EPYC 9005 Series has ~17% IPC uplift for enterprise and cloud, based on the geomean of 36 Enterprise & Cloud Server Workloads, and an even higher uplift in HPC & AI ^9xx5-001.

2) We were also able to increase the CPU frequencies across the stack. For example, comparing the 5th Gen AMD EPYC 9355 CPU with the 4th Gen AMD EPYC 9354 CPU, we were able to increase the base frequency to 3.55 GHz from 3.25 GHz and the max boost clock to 4.4 GHz from 3.8 GHz, while the CPU core count and TDP of the processor remained the same at 280W. This would predictably result in more performance at the same power, which increases energy efficiency.
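As a quick check of the frequency figures above, the following minimal sketch computes the percentage uplift in base and boost clocks between the two parts. The helper name `percent_uplift` is purely illustrative, and the script only restates the clock speeds quoted in the text; it says nothing about measured application performance.

```python
def percent_uplift(new: float, old: float) -> float:
    """Return how much larger `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


# Clock speeds in GHz, as quoted in the text for the EPYC 9355 and EPYC 9354.
base_uplift = percent_uplift(3.55, 3.25)   # base clock, 5th Gen vs 4th Gen
boost_uplift = percent_uplift(4.4, 3.8)    # max boost clock, 5th Gen vs 4th Gen

print(f"Base clock uplift:  {base_uplift:.1f}%")
print(f"Boost clock uplift: {boost_uplift:.1f}%")
```

Because both parts share the same core count and TDP, these clock gains come at the same power budget.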

By using 64-core AMD EPYC 9575F CPUs, compared to previous-generation AMD EPYC 9554 CPUs or 5th Gen Intel Xeon 8592+ CPUs, you can achieve up to a 1.6X performance improvement in licensed commercial applications such as Ansys® LS-DYNA®, Altair® Radioss®, Ansys® Fluent®, and Altair® AcuSolve®. ^9xx5-028, -029, -030, -031, -032, -033, -034A, -035A

By using 192-core AMD EPYC 9965 CPUs, compared to the previous-generation 96-core AMD EPYC 9654 CPUs and the top-of-the-stack 5th Gen Intel Xeon 8592+ CPU, you can achieve up to 3.9X performance in GROMACS or up to 3.7X performance in NAMD. ^9xx5-022, -023, -024, -038, -039

All in all, with the introduction of a full 512b data path for the already available AVX-512 instruction set, 5th Gen AMD EPYC provides up to a 37% average core IPC (instructions per clock) improvement for HPC & AI workloads vs. the 4th Generation. This is calculated on the geomean of 24 HPC and AI workloads, including simulation and AI workloads such as NAMD, GROMACS, ResNet-50, and BERT.

What about AVX-512 support, and why is it important?
Two things:

1) 5th Gen AMD EPYC processors implement the full set of AVX-512 instructions. Starting with the “Zen 5” core, we have expanded the data paths to 512 bits, which means that the data can now be read into the CPU in a single clock cycle. Along with this, we have increased the floating-point queue, schedulers, and pipes generationally. If power efficiency is a priority over performance, BIOS settings can direct the processor to execute AVX-512 instructions as two 256-bit vectors in sequential clock cycles.
2) TFLOPS (trillions of floating-point operations per second) is a common unit used to measure the computing power of a processor and system, particularly for scientific and technical workloads. One example is the TOP500 HPC system ranking, which ranks systems by the TFLOPS they actually achieve on the HPL (High-Performance Linpack) benchmark. TFLOPS measures the number of floating-point arithmetic operations a processor can perform per second, with one "floating point operation" defined as a mathematical operation involving floating-point numbers (e.g. addition, subtraction, multiplication, division, etc.).

AMD EPYC 9005 provides dramatically higher performance across processor generations as measured in AMD testing on the HPL benchmark, comparing 4th and 5th Gen AMD EPYC processors^9xx5-080. Check out the 5th Gen AMD EPYC Processor Architecture Whitepaper for more technical details.

Now, how does the theoretical TFLOPS calculation work with AMD EPYC?

To assess theoretical performance at base frequency, you can use the standard calculation for peak floating-point performance, which is:

(Number of cores) x (base clock speed in GHz) x (number of floating-point operations per clock cycle)

With the clock speed expressed in GHz, the product is in GFLOPS; divide by 1,000 to get TFLOPS.
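The formula can be written as a small helper. This is a minimal sketch: the function name `theoretical_tflops` and the sample inputs are illustrative assumptions and do not describe a specific product. Because the clock speed is given in GHz, the raw product is in GFLOPS, so the function divides by 1,000 to report TFLOPS.

```python
def theoretical_tflops(cores: int, base_ghz: float, flops_per_cycle: int) -> float:
    """Peak theoretical TFLOPS at base clock.

    cores * GHz * FLOPs-per-cycle yields GFLOPS; dividing by 1000 converts to TFLOPS.
    """
    gflops = cores * base_ghz * flops_per_cycle
    return gflops / 1000.0


# Hypothetical inputs chosen only to exercise the formula (not a real CPU spec).
example = theoretical_tflops(cores=64, base_ghz=3.0, flops_per_cycle=16)
print(f"{example:.3f} TFLOPS")
```

The result is an upper bound at base frequency; it does not account for boost clocks, memory bandwidth, or software efficiency.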

The first two are easy. For convenience, you can use the Server Processor Specifications page on the AMD.com website to get the CPU specs.

The last one is harder to identify straight away, so let’s go through it step by step.

HPL runs 64-bit FP calculations.
AMD EPYC 9004 CPUs (“Zen 4”) already supported AVX-512 instructions; however, those CPUs take 2 clock cycles to complete an AVX-512 calculation because each core's two FMA* pipes use a 256b data path. So, 256b datapath / 64b (datatype) * (2 pipes * 2 ops/lane) = 16 floating-point operations per clock cycle.
AMD EPYC 9005 CPUs (“Zen 5”) have a full 512b data path, with two native AVX-512 FMA* pipes per core. So, 512b datapath / 64b (datatype) * (2 pipes * 2 ops/lane) = 32 floating-point operations per clock cycle.
In both cases, "2 ops/lane" reflects that each FMA counts as two floating-point operations: a multiply and an add.
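The per-cycle figures above follow from one expression, sketched below. The function name `flops_per_cycle` and its parameter names are illustrative; the default values (64-bit data, two pipes, two operations per FMA) are taken from the calculation in the text.

```python
def flops_per_cycle(datapath_bits: int, dtype_bits: int = 64,
                    pipes: int = 2, ops_per_fma: int = 2) -> int:
    """Floating-point operations per clock cycle per core.

    lanes = how many values of `dtype_bits` fit in one data path;
    each lane on each pipe performs one FMA (multiply + add = 2 ops).
    """
    lanes = datapath_bits // dtype_bits
    return lanes * pipes * ops_per_fma


zen4_fp64 = flops_per_cycle(256)  # 256b data path, as described for "Zen 4"
zen5_fp64 = flops_per_cycle(512)  # 512b data path, as described for "Zen 5"

print(f"Zen 4: {zen4_fp64} FP64 FLOPs/cycle/core")
print(f"Zen 5: {zen5_fp64} FP64 FLOPs/cycle/core")
```

Doubling the data path width doubles the number of 64-bit lanes, which is where the 2x per-core, per-clock difference comes from.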

* FMA stands for fused multiply-add and it is a floating-point multiply-add operation performed in one step (fused operation), with a single rounding. Specifically, it calculates FMA(a,b,c)=(a×b)+c with a single rounding step.
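To illustrate what "a single rounding" means, the sketch below emulates a fused multiply-add in software by computing the exact result with rational arithmetic and rounding once at the end. This is an illustration only: `fma_emulated` is a hypothetical helper, not a hardware FMA instruction, and the input values are chosen specifically so that the two approaches differ.

```python
from fractions import Fraction


def fma_emulated(a: float, b: float, c: float) -> float:
    """Compute a*b + c exactly as a rational number, then round once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

# Separate multiply and add: the exact product 1 - 2**-60 is rounded to 1.0
# before the addition, so the small term is lost.
unfused = a * b + c

# Fused: the exact value -2**-60 is representable, so it survives the single rounding.
fused = fma_emulated(a, b, c)

print("multiply then add:", unfused)
print("fused (emulated): ", fused)
```

The fused result keeps the low-order information that the two-step version discards.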

With this, here is a comparison of the top-of-the-stack 5th Gen and 4th Gen AMD EPYC processors:

5th Gen AMD EPYC 9965 CPU theoretical performance would be:

192 (# of cores) x 2.25 GHz (base clock speed) x 32 (AMD EPYC 9005 floating-point operations per clock cycle multiplier) = 13,824 GFLOPS = 13.824 TFLOPS.

4th Gen AMD EPYC 9754 CPU theoretical performance would be:

128 (# of cores) x 2.25 GHz (base clock speed) x 16 (AMD EPYC 9004 floating-point operations per clock cycle multiplier) = 4,608 GFLOPS = 4.608 TFLOPS.

With the increased core count & microarchitecture changes, the maximum theoretical performance of the top-of-the-stack 5th Gen CPU is 200% higher than that of the 4th Gen, or 3x.

For an apples-to-apples comparison at the same core count, let’s take the new 5th Gen 128-core AMD EPYC 9745 CPU. Its theoretical performance would be:

128 (# of cores) x 2.4 GHz (base clock speed) x 32 (AMD EPYC 9005 floating-point operations per clock cycle multiplier) = 9,830.4 GFLOPS = 9.8304 TFLOPS.

With this, the 5th Gen 9745 at 128 cores provides 113% more theoretical performance than the 4th Gen 9754 CPU with the same core count. Both are based on dense core designs (“Zen 5c” and “Zen 4c”, respectively).

For a comparison of the classic “Zen 5” and “Zen 4” architectures, let’s take a closer look at the 96-core space:

5th Gen AMD EPYC 9655 CPU theoretical performance would be:

96 (# of cores) x 2.6 GHz (base clock speed) x 32 (AMD EPYC 9005 floating-point operations per clock cycle multiplier) = 7,987.2 GFLOPS = 7.9872 TFLOPS.

4th Gen AMD EPYC 9654 CPU theoretical performance would be:

96 (# of cores) x 2.4 GHz (base clock speed) x 16 (AMD EPYC 9004 floating-point operations per clock cycle multiplier) = 3,686.4 GFLOPS = 3.6864 TFLOPS.

In this example, the 5th Gen 9655 provides about 117% more theoretical performance than the 4th Gen 9654 CPU with the same core count. Both are based on classic core designs (“Zen 5” and “Zen 4”, respectively).
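The three comparisons above can be reproduced with a short script. This is a sketch under the assumptions stated in the text: core counts and base clocks are the values quoted above, and the per-cycle multipliers are 32 for EPYC 9005 and 16 for EPYC 9004. The `Cpu` class and the `percent_more` helper are illustrative names, not part of any AMD tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    name: str
    cores: int
    base_ghz: float
    flops_per_cycle: int  # 32 for EPYC 9005, 16 for EPYC 9004 (per the text)

    @property
    def tflops(self) -> float:
        """Peak theoretical FP64 TFLOPS at base clock (GFLOPS / 1000)."""
        return self.cores * self.base_ghz * self.flops_per_cycle / 1000.0


CPUS = {
    "9965": Cpu("EPYC 9965", 192, 2.25, 32),
    "9745": Cpu("EPYC 9745", 128, 2.4, 32),
    "9655": Cpu("EPYC 9655", 96, 2.6, 32),
    "9754": Cpu("EPYC 9754", 128, 2.25, 16),
    "9654": Cpu("EPYC 9654", 96, 2.4, 16),
}

# (5th Gen part, 4th Gen part) pairs compared in the text.
PAIRS = [("9965", "9754"), ("9745", "9754"), ("9655", "9654")]


def percent_more(new: Cpu, old: Cpu) -> float:
    """How much higher the theoretical peak of `new` is than `old`, in percent."""
    return (new.tflops / old.tflops - 1.0) * 100.0


for new_key, old_key in PAIRS:
    new, old = CPUS[new_key], CPUS[old_key]
    print(f"{new.name}: {new.tflops:.4f} TFLOPS vs {old.name}: {old.tflops:.4f} TFLOPS "
          f"-> {percent_more(new, old):.1f}% more")
```

These are theoretical peaks at base clock only; measured HPL results will differ.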

Thus, moving from the “Zen 4” to the “Zen 5” architecture can improve theoretical performance by up to 3x (based on base clock speeds).
Actual performance may vary depending on factors such as software efficiency, temperature, power limitations, and other conditions. You can measure the real performance using the HPL benchmark. Don’t forget to tune the CPU and system for the best results, using the AMD EPYC 9005 High-Performance Computing (HPC) Tuning Guide and other resources available to you.