Leadership HPC Performance with 5th Generation AMD EPYC Processors

Earlier in November AMD showcased its ongoing high-performance computing (HPC) leadership at Supercomputing 2024 by powering the world’s fastest supercomputer for the sixth straight Top500 list. (source)

With over 500 performance world records and 200+ world records in HPC technical apps as of October 10, 2024, including modeling and simulation, intensive computation, floating point performance & HPC energy efficiency, AMD EPYC™ CPUs are well-established as a leader in enabling HPC performance.

With the recently launched 5th Gen AMD EPYC processors, which scale from 8 cores up to 192 cores per CPU, our customers can achieve leadership performance:

  • First of all, the “Zen 5” architecture used in the AMD EPYC 9005 series provides exceptional IPC uplift for server CPUs. But what is IPC, and why is it important? There are two ways to improve performance: run at faster clocks (frequency), or do more work during each clock cycle, which is measured as IPC (instructions per clock). For the EPYC 9005 series, we did both:
    1) AMD EPYC 9005 Series has ~17% IPC uplift for enterprise and cloud, based on the geomean of 36 Enterprise & Cloud Server Workloads, and an even higher uplift in HPC & AI 9xx5-001.
    2) We were also able to increase CPU frequencies across the stack. For example, comparing the 5th Gen AMD EPYC 9355 CPU with the 4th Gen AMD EPYC 9354 CPU, the base frequency rises from 3.25 GHz to 3.55 GHz and the max boost clock from 3.8 GHz to 4.4 GHz, while the core count and the 280 W TDP remain the same. More performance at the same power means better energy efficiency.

  • By using 64-core EPYC 9575F CPUs, compared to previous-generation AMD EPYC 9554 CPUs or 5th Gen Intel Xeon 8592+ CPUs, you can achieve up to 1.6X higher performance in licensed commercial applications such as Ansys® LS-DYNA®, Altair® Radioss®, Ansys® Fluent®, and Altair® AcuSolve®. 9xx5-028, 029, 030, 031, 032, 033, 034A, 035A
  • By using 192-core EPYC 9965 CPUs, compared to the previous-generation 96-core AMD EPYC 9654 CPUs and the top-of-the-stack 5th Gen Intel Xeon 8592+ CPU, you can achieve up to 3.9X the performance in GROMACS or up to 3.7X the performance in NAMD. 9xx5-022, -023, -024, -038, -039

All in all, by adding a full 512-bit data path for the AVX-512 instruction set it already supported, 5th Gen AMD EPYC provides up to 37% average core IPC (instructions per clock) improvement for HPC & AI workloads vs. the 4th generation. This figure is the geomean of 24 HPC and AI workloads, including simulation and AI workloads such as NAMD, GROMACS, ResNet-50, and BERT.
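The two levers described above, IPC and clock frequency, multiply together. The following is a rough first-order sketch in Python, not an AMD measurement: it assumes performance scales with IPC times frequency and ignores memory, power, and software effects. The function name `relative_performance` is made up for illustration, and pairing the ~17% geomean IPC figure with the EPYC 9354 to EPYC 9355 base clocks is an assumption made only to have concrete numbers.

```python
def relative_performance(ipc_uplift: float, old_ghz: float, new_ghz: float) -> float:
    """First-order estimate: performance ~ IPC x clock frequency.

    ipc_uplift is a fraction (0.17 means 17% more instructions per clock).
    Returns the new/old performance ratio under that simplified model.
    """
    return (1.0 + ipc_uplift) * (new_ghz / old_ghz)


# Assumed inputs, taken from the figures quoted in the text:
# ~17% IPC uplift, base clock 3.25 GHz -> 3.55 GHz (EPYC 9354 -> EPYC 9355).
base_ratio = relative_performance(0.17, old_ghz=3.25, new_ghz=3.55)

# Frequency-only contribution at max boost: 3.8 GHz -> 4.4 GHz.
boost_only = relative_performance(0.0, old_ghz=3.8, new_ghz=4.4)

print(f"IPC x base clock estimate: {(base_ratio - 1) * 100:.1f}% uplift")
print(f"boost clock alone:         {(boost_only - 1) * 100:.1f}% uplift")
```

Real workloads will land above or below such an estimate, which is why the measured results cited in this article are the better guide.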

 

What about AVX-512 support, and why is it important?
Two things:

  • 5th Gen AMD EPYC processors implement the full set of AVX-512 instructions. Starting with the ‘Zen 5’ core, we have expanded the data paths to 512 bits, which means a full 512-bit vector of data can now be read into the CPU in a single clock cycle. Along with this, we have increased the floating-point queue, schedulers, and pipes generation over generation. If power efficiency is a priority over performance, BIOS settings can direct the processor to execute AVX-512 instructions as two 256-bit vectors in sequential clock cycles.
  • TFLOPS (trillions of floating-point operations per second) is a common unit for measuring the computing power of a processor or system, particularly for scientific and technical workloads. One example is the TOP500 ranking of HPC systems, which orders systems by the TFLOPS they actually achieve on the HPL (High-Performance Linpack) benchmark. TFLOPS counts the floating-point arithmetic operations a processor can perform per second, where one "floating point operation" is a mathematical operation on floating-point numbers (e.g. addition, subtraction, multiplication, or division).
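To make the data-path point concrete, here is a small Python sketch of how many elements fit in one AVX-512 vector and how many passes an instruction needs on a given data-path width. The helper names `lanes` and `passes_per_instruction` are illustrative assumptions, and the model is deliberately simplified: it only divides widths and says nothing about scheduling or latency.

```python
from math import ceil


def lanes(vector_bits: int, element_bits: int) -> int:
    """Number of elements of a given size packed into one vector register."""
    return vector_bits // element_bits


def passes_per_instruction(vector_bits: int, datapath_bits: int) -> int:
    """Sequential passes needed to push one vector through the data path."""
    return ceil(vector_bits / datapath_bits)


# One AVX-512 register holds eight 64-bit (double-precision) values.
print("FP64 lanes per 512-bit vector:", lanes(512, 64))

# Full 512-bit data path: the whole vector goes through in one pass.
print("passes on a 512-bit path:", passes_per_instruction(512, 512))

# 256-bit data path (the power-efficiency mode described above):
# the same instruction runs as two 256-bit halves in sequential cycles.
print("passes on a 256-bit path:", passes_per_instruction(512, 256))
```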

AMD EPYC 9005 provides dramatically higher performance across processor generations as measured in AMD testing on the HPL benchmark, comparing 4th and 5th Gen AMD EPYC processors 9xx5-080. Check out the 5th Gen AMD EPYC Processor Architecture Whitepaper for more technical details.

 

Now, how does the theoretical TFLOPS calculation work with AMD EPYC?

To assess theoretical performance, based on base frequency, you can use the standard calculation for TFLOPS performance, which is:

(Number of cores) x (base clock speed in GHz) x (number of floating-point operations per clock cycle)

With the clock speed in GHz, this product is in GFLOPS; divide by 1,000 to express it in TFLOPS.

The first two are easy: for convenience, you can get the CPU specs from the Server Processor Specifications page on the AMD.com website.

The last one is harder to identify straight away, so let’s go through it step by step.

  1. HPL runs 64-bit FP calculations.
  2. AMD EPYC 9004 CPUs (“Zen 4”) already supported AVX-512 instructions; however, those CPUs take 2 clock cycles to complete an AVX-512 calculation because each core has dual 256-bit FMA* data paths. So, 256b datapath / 64b (datatype) * (2 pipes * 2 ops/lane) = 16 floating-point operations per clock cycle. The 2 ops/lane come from counting each FMA as a multiply plus an add.
  3. AMD EPYC 9005 CPUs (“Zen 5”) have a full 512b data path, with dual native AVX-512 FMA* pipes per core. So, 512b datapath / 64b (datatype) * (2 pipes * 2 ops/lane) = 32 floating-point operations per clock cycle.
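The three steps above can be written as one small Python function. This is a sketch of the arithmetic only; the function name `flops_per_cycle` and its parameter names are assumptions chosen for illustration, and the defaults (64-bit elements, 2 pipes, 2 operations per FMA) are the values used in the steps above.

```python
def flops_per_cycle(datapath_bits: int,
                    element_bits: int = 64,
                    fma_pipes: int = 2,
                    ops_per_fma: int = 2) -> int:
    """Peak floating-point operations per core per clock cycle.

    lanes       = how many elements fit across the data path
    fma_pipes   = FMA pipes that can issue each cycle
    ops_per_fma = 2, because one FMA is a multiply plus an add
    """
    lanes = datapath_bits // element_bits
    return lanes * fma_pipes * ops_per_fma


zen4 = flops_per_cycle(datapath_bits=256)  # 4 lanes * 2 pipes * 2 ops
zen5 = flops_per_cycle(datapath_bits=512)  # 8 lanes * 2 pipes * 2 ops

print("Zen 4 FP64 FLOPs/cycle/core:", zen4)
print("Zen 5 FP64 FLOPs/cycle/core:", zen5)
```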

* FMA stands for fused multiply-add and it is a floating-point multiply-add operation performed in one step (fused operation), with a single rounding. Specifically, it calculates FMA(a,b,c)=(a×b)+c with a single rounding step.
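The effect of the single rounding step can be seen with a small Python emulation. This is a sketch, not how the hardware works: it uses exact rational arithmetic from the standard library to compute (a×b)+c without intermediate rounding and then rounds once, and compares that with the ordinary two-step calculation. The helper name `fused_multiply_add` and the input values are chosen only for illustration.

```python
from fractions import Fraction


def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Emulate FMA semantics: exact a*b + c, then a single rounding."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the only rounding step


a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0

# Unfused: a*b is rounded to a double first. The exact product is
# 1 - 2**-60, which rounds to 1.0, so adding c gives exactly 0.0.
unfused = a * b + c

# Fused: the tiny -2**-60 term survives because nothing is rounded
# until the very end.
fused = fused_multiply_add(a, b, c)

print("unfused:", unfused)
print("fused:  ", fused)
```

Besides the accuracy benefit, the FMA is what lets each pipe count two floating-point operations per lane per cycle in the calculation above.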

 

With this, here is a comparison of the top-of-the-stack 5th Gen and 4th Gen AMD EPYC processors:

5th Gen AMD EPYC 9965 CPU theoretical TFLOPS performance would be:

192 (# of cores) x 2.25 GHz (base clock speed in GHz) x 32 (AMD EPYC 9005 floating-point operations per clock cycle multiplier) = 13.824 TFLOPS.

4th Gen AMD EPYC 9754 CPU theoretical TFLOPS performance would be:

128 (# of cores) x 2.25 GHz (base clock speed in GHz) x 16 (AMD EPYC 9004 floating-point operations per clock cycle multiplier) = 4.608 TFLOPS.

With the increased core count and the microarchitecture changes, the maximum theoretical performance of the top-of-the-stack 5th Gen CPU is 200% higher than that of the 4th Gen, i.e. 3x.
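The same top-of-the-stack calculation can be reproduced in a few lines of Python. This is a minimal sketch of the formula used in this article; the function name `peak_tflops` is an assumption, and the inputs are the core counts, base clocks, and per-cycle multipliers quoted above.

```python
def peak_tflops(cores: int, base_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak at base clock.

    cores * GHz * FLOPs/cycle gives GFLOPS; dividing by 1000 gives TFLOPS.
    """
    return cores * base_ghz * flops_per_cycle / 1000.0


epyc_9965 = peak_tflops(cores=192, base_ghz=2.25, flops_per_cycle=32)
epyc_9754 = peak_tflops(cores=128, base_ghz=2.25, flops_per_cycle=16)
ratio = epyc_9965 / epyc_9754

print(f"EPYC 9965: {epyc_9965:.3f} TFLOPS")
print(f"EPYC 9754: {epyc_9754:.3f} TFLOPS")
print(f"ratio: {ratio:.2f}x ({(ratio - 1) * 100:.0f}% more)")
```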

 

For an apples-to-apples comparison at the same core count, let’s take the new 5th Gen 128-core AMD EPYC 9745 CPU. Its theoretical TFLOPS performance would be:

128 (# of cores) x 2.4 GHz (base clock speed in GHz) x 32 (AMD EPYC 9005 floating-point operations per clock cycle multiplier) = 9.8304 TFLOPS.

With this, the 128-core 5th Gen 9745 provides 113% more theoretical performance than the 4th Gen 9754 at the same core count. Both are based on the dense core design (“Zen 5c” and “Zen 4c”, respectively).

 

For a comparison of the classic “Zen 5” and “Zen 4” architectures, let’s take a closer look at the 96-core space:

5th Gen AMD EPYC 9655 CPU theoretical performance would be:

96 (# of cores) x 2.6 GHz (base clock speed in GHz) x 32 (AMD EPYC 9005 floating-point operations per clock cycle multiplier) = 7.9872 TFLOPS.

4th Gen AMD EPYC 9654 CPU theoretical performance would be:

96 (# of cores) x 2.4 GHz (base clock speed in GHz) x 16 (AMD EPYC 9004 floating-point operations per clock cycle multiplier) = 3.6864 TFLOPS.

In this example, the 5th Gen 9655 provides about 117% more theoretical performance (7.9872 / 3.6864 ≈ 2.17x) than the 4th Gen 9654 at the same core count. Both are based on the classic core design (“Zen 5” and “Zen 4”, respectively).
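For the two same-core-count comparisons, it can help to split the uplift into its factors: core count, base clock, and floating-point operations per cycle. The Python sketch below does that with the specifications quoted in this article; the `Cpu` class and the helper names are hypothetical, introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cpu:
    name: str
    cores: int
    base_ghz: float
    flops_per_cycle: int  # FP64 FLOPs per core per clock


def uplift_factors(new: Cpu, old: Cpu) -> dict:
    """Per-factor ratios; their product is the theoretical peak ratio."""
    return {
        "cores": new.cores / old.cores,
        "clock": new.base_ghz / old.base_ghz,
        "width": new.flops_per_cycle / old.flops_per_cycle,
    }


def total_uplift(new: Cpu, old: Cpu) -> float:
    f = uplift_factors(new, old)
    return f["cores"] * f["clock"] * f["width"]


epyc_9754 = Cpu("EPYC 9754", 128, 2.25, 16)
epyc_9745 = Cpu("EPYC 9745", 128, 2.4, 32)
epyc_9654 = Cpu("EPYC 9654", 96, 2.4, 16)
epyc_9655 = Cpu("EPYC 9655", 96, 2.6, 32)

for new, old in [(epyc_9745, epyc_9754), (epyc_9655, epyc_9654)]:
    f = uplift_factors(new, old)
    gain = (total_uplift(new, old) - 1.0) * 100.0
    print(f"{new.name} vs {old.name}: cores x{f['cores']:.2f}, "
          f"clock x{f['clock']:.3f}, width x{f['width']:.0f} "
          f"-> {gain:.1f}% more")
```

With the core count held equal, the doubled per-cycle width contributes a flat 2x and the remainder comes from the higher base clock.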

 

Thus, moving from the “Zen 4” to “Zen 5” architecture can improve theoretical performance by up to 3x (based on base clock speeds).
Actual performance may vary depending on factors such as software efficiency, temperature, power limitations, and other conditions. You can measure the real performance using the HPL benchmark. Don’t forget to tune the CPU and system for the best results, using the AMD EPYC 9005 High-Performance Computing (HPC) Tuning Guide and the other resources available to you.
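Once you have an HPL result, one simple way to relate it to the theoretical figure is to divide measured by peak. The Python sketch below shows the arithmetic; the function name `hpl_efficiency` is an assumption, and the measured value is a hypothetical placeholder, not a benchmark result from AMD or anyone else.

```python
def hpl_efficiency(measured_tflops: float, cores: int,
                   base_ghz: float, flops_per_cycle: int) -> float:
    """Measured HPL throughput as a fraction of base-clock theoretical peak."""
    peak_tflops = cores * base_ghz * flops_per_cycle / 1000.0
    return measured_tflops / peak_tflops


# HYPOTHETICAL placeholder: substitute the TFLOPS reported by your own HPL run.
measured = 6.0

# Specs for a single EPYC 9655 as quoted above: 96 cores, 2.6 GHz, 32 FLOPs/cycle.
eff = hpl_efficiency(measured, cores=96, base_ghz=2.6, flops_per_cycle=32)
print(f"efficiency vs base-clock peak: {eff * 100:.1f}%")
```

Because the peak here is computed at base clock, a processor that sustains clocks above base during the run can show a ratio above 100%.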

About the Author
Robert Hormuth is Corporate Vice President, Architecture and Strategy of the Datacenter Solutions Group (DSG) at AMD. Robert has 36 years in the computer industry, joining AMD in 2020 after 13 years with Dell where he was CTO of the Server Business unit, 8 years with Intel and 11 years at National Instruments. At AMD Robert is charged with creating the long-term vision & strategy for DSG by identifying the technical requirements/implications to the DSG portfolio. Robert has a B.S. in Electrical and Computer Engineering from The University of Texas at Austin and currently holds 47 patents.