AMD EPYC™ 9004 Processors with AMD 3D V-Cache™ Technology Further Ignite Technical Computing[1]

raghu_nambiar · ‎07-06-2023

At our Data Center and AI Technology Premiere event, AMD expanded the 4th Gen AMD EPYC™ processor family by introducing two additional processor models. First, AMD debuted the AMD EPYC 97x4 processors, codenamed “Bergamo”, the industry’s first x86 processors purpose-built for cloud-native computing. You can read more about it in my previous blog. This blog discusses AMD EPYC 9xx4 processors with AMD 3D V-Cache™ technology, codenamed “Genoa-X”.

AMD EPYC 9004 processors with AMD 3D V-Cache technology continue the legacy of 3rd Gen AMD EPYC 7003 processors with AMD 3D V-Cache technology by delivering 3x the L3 cache of standard AMD EPYC 9004 processors: up to 1,152MB of L3 cache per CPU. They build on the same design as general purpose 4th Gen AMD EPYC processors and use AMD 3D V-Cache technology to stack additional SRAM directly on top of the compute die, thereby tripling the total L3 cache size. A cache this large can hold a significantly larger working dataset. Placing that data so close to the cores can relieve pressure on memory bandwidth and significantly speed up many technical computing workloads.

AMD EPYC 9004 processors with AMD 3D V-Cache technology include the cutting-edge technologies found in general purpose 4th Gen AMD EPYC processors, including “Zen 4” cores built on 5nm process technology, 12 channels of DDR5 memory with supported memory speeds up to 4800 MT/s, up to 128 (1P) or 160 (2P) lanes of PCIe® Gen5 delivering 2x the transfer rate of PCIe Gen4, 3rd Gen Infinity Fabric delivering 2x the data transfer rate of 2nd Gen Infinity Fabric, and AMD Infinity Guard technology that defends your data while in use. These new processors are socket compatible with existing 4th Gen AMD EPYC platforms.
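
As a rough back-of-the-envelope check of what 12 DDR5 channels mean for memory bandwidth, the sketch below computes a theoretical peak per socket. It assumes DDR5-4800 (4800 mega-transfers per second) and a 64-bit (8-byte) data path per channel, neither of which is spelled out above, and the function name is purely illustrative. Sustained bandwidth in real workloads is lower than this peak.

```python
def peak_memory_bandwidth_gbs(channels: int,
                              mega_transfers_per_s: int,
                              bytes_per_transfer: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (decimal gigabytes).

    bytes_per_transfer=8 assumes a 64-bit data path per channel.
    """
    transfers_per_s = mega_transfers_per_s * 1e6
    return channels * transfers_per_s * bytes_per_transfer / 1e9


# 12 channels from the text; 4800 MT/s assumes DDR5-4800 DIMMs.
per_socket = peak_memory_bandwidth_gbs(channels=12, mega_transfers_per_s=4800)

print(f"Theoretical peak per socket:  {per_socket:.1f} GB/s")
print(f"Theoretical peak per 2P node: {2 * per_socket:.1f} GB/s")
```

Under these assumptions the peak works out to roughly 460.8 GB/s per socket, which is the ceiling that a larger L3 cache helps bandwidth-bound workloads avoid hitting.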

The 300+ world records earned by AMD EPYC 9004 Series processors are a testament to AMD’s relentless pursuit of performance leadership with industry-leading energy efficiency [2] and optimal TCO[3]. The industry has responded to these efforts: A rich and growing ecosystem of full-stack solutions and partnerships leverage the cutting-edge features and technologies offered by AMD EPYC processors to enable faster time to value for customers’ current and future needs.

We are grateful for our broad ecosystem of partners who continue to collaborate with our engineers to deliver a wide range of datacenter solutions, including:

Alibaba Cloud, Altair, AlmaLinux, Amazon Web Services, Anjuna, Ansys, ASRock, Asus, Atos, BEAMR, Broadcom, Cadence, Canonical, Casa Systems, Cisco, Citrix, Cloudera, Couchbase, Dassault Systèmes, Datastax, Dell, Elastic, Equinix, ESI, Excelero, Foxconn, FreeBSD, Gigabyte, Google Cloud, HBC, HPE, IBM Cloud, Inventec, JMA, Juniper, Kioxia, Lenovo, MariaDB, Mavenir, SingleStore, Micron, Microsoft, Mitac, Neural Magic, MongoDB, MSI, MySQL, NetScout, Nokia, Nutanix, Oracle, PGS Software, QCT, Quobyte, Radisys, Red Hat, RedisLabs, Robin, Rocky Linux, Samsung, Shearwater, Siemens Digital Industries Software, SK Hynix, SLB, Splunk, StorMagic, Supermicro, SUSE, Synopsys, Tencent Cloud, TigerGraph, Transwarp, Tyan, Velocix, Vertica, WEKA, VMware, Western Digital, Wiwynn, Wistron and others.

AMD works closely with our partners and seizes every opportunity to explore and tune the performance of many technical computing workloads that can take advantage of the large L3 cache, thus demonstrating the breakthrough performance offered by AMD EPYC 9004 processors with AMD 3D V-Cache technology. Let’s look at some of these performance results.

Computational Fluid Dynamics (CFD)

CFD uses numerical analysis to simulate and analyze fluid flow and how that fluid (liquid or gas) interacts with solids and surfaces, such as the water flow around a boat hull or the aerodynamics of a car body or aircraft fuselage, as well as a wide variety of less obvious uses, including industrial processing and consumer packaged goods. These workloads can be computationally intensive and require substantial resources; however, most CFD workloads are limited by memory bandwidth.

AMD EPYC 9004 processors with AMD 3D V-Cache technology can significantly improve the performance of CFD workloads. With up to 1,152MB of L3 cache, more of the workload’s total working dataset can fit into ultra-fast L3 cache memory situated in close proximity to the compute cores.

AMD EPYC 9004 processors with AMD 3D V-Cache technology can also significantly improve the scalability of CFD simulations. These workloads can be parallelized to distribute the computational load across multiple cores and multiple compute nodes, thereby efficiently scaling out to very large node counts. CFD codes can scale out efficiently by distributing the working dataset across the nodes in the run with minimal shared memory. Each compute node added to the run thus increases the compute power (cores, bandwidth, etc.) without adding much overhead for maintaining shared memory between the nodes. Each processor also adds to the total L3 cache available to the overall workload. Fitting more of the overall workload into the cache can significantly accelerate the job and can create super-linear scaling.[4] More on this below.
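
To make the aggregate-cache argument concrete, the sketch below shows how each node's share of a working set shrinks as nodes are added, and when that share drops below the L3 capacity of a node. It assumes two-socket nodes with 1,152 MB of L3 per CPU and an evenly partitioned working set. The 8,000 MB working-set size is a hypothetical figure chosen for illustration, not a measured value, and real solvers also touch halo regions and other data that this simple model ignores.

```python
L3_PER_CPU_MB = 1152   # from the text: up to 1,152 MB of L3 per CPU
SOCKETS_PER_NODE = 2   # assumption: two-socket nodes, as in the benchmarks
L3_PER_NODE_MB = L3_PER_CPU_MB * SOCKETS_PER_NODE


def per_node_share_mb(working_set_mb: float, nodes: int) -> float:
    """Working set handled by each node, assuming an even partition."""
    return working_set_mb / nodes


def fits_in_l3(working_set_mb: float, nodes: int) -> bool:
    """True when each node's share fits within that node's total L3."""
    return per_node_share_mb(working_set_mb, nodes) <= L3_PER_NODE_MB


WORKING_SET_MB = 8000  # hypothetical working-set size, for illustration only

for nodes in (1, 2, 4, 8):
    share = per_node_share_mb(WORKING_SET_MB, nodes)
    total_l3 = nodes * L3_PER_NODE_MB
    status = "fits in L3" if fits_in_l3(WORKING_SET_MB, nodes) else "spills to DRAM"
    print(f"{nodes} node(s): {share:7.1f} MB per node, "
          f"{total_l3:6d} MB aggregate L3 -> {status}")
```

In this simplified model the per-node share first fits in cache at four nodes, which is the kind of threshold where a sudden, better-than-proportional speedup can appear.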

Altair® AcuSolve®: Altair AcuSolve is a proven asset for companies looking to explore designs by applying a full range of flow, heat transfer, turbulence, and non-Newtonian material analysis capabilities without the difficulties associated with traditional CFD applications. A two-socket AMD EPYC 9684X system outperformed a comparable two-socket Intel Xeon Platinum 8480+ system by ~1.94x running the impinging nozzle test case.[5]

Figure 1: Altair AcuSolve performance (system level)

On a per-core basis, a two-socket 32-core AMD EPYC 9384X system outperformed a comparable two-socket 32-core Intel Xeon Platinum 8462Y+ system by ~1.63x.[5]

Figure 2: Altair AcuSolve performance (32 cores)

Ansys® CFX®: Ansys CFX is a high-performance computational fluid dynamics (CFD) software tool that delivers robust, reliable, and accurate solutions quickly across a wide range of CFD and Multiphysics applications. A two-socket AMD EPYC 9684X system outperformed a comparable two-socket Intel Xeon Platinum 8480+ system by up to ~2.59x on standard CFX benchmarks.[6]

Figure 3: Ansys CFX performance (system level)

On a per-core basis, a two-socket 32-core AMD EPYC 9384X outperformed a two-socket 32-core Intel Xeon Platinum 8462Y+ system by up to ~2.03x on the same benchmarks.[6]

Figure 4: Ansys CFX performance (32 cores)

Ansys® Fluent®: Ansys Fluent is a fluid simulation application that offers advanced physics modeling capabilities and industry-leading accuracy. A two-socket AMD EPYC 9684X system outperformed a comparable two-socket Intel Xeon Platinum 8480+ system by ~2.15x on a composite average of 15 standard benchmarks.[7]

Figure 5: Ansys Fluent performance (system level)

On a per-core basis, a two-socket 32-core AMD EPYC 9384X outperformed a two-socket 32-core Intel Xeon Platinum 8462Y+ system by ~1.35x on the same benchmarks.[7]

Figure 6: Ansys Fluent performance (32 cores)

OpenFOAM®: OpenFOAM® is a free, open source CFD software. Its user base includes commercial and academic organizations. A two-socket AMD EPYC 9684X system outperformed a comparable two-socket Intel® Xeon® Platinum 8480+ system by ~2.08x on a composite average of the standard motorbike model at sizes of 130x52x52, 108x46x46 and 100x40x40.[8]

Figure 7: OpenFOAM performance (system level)

On a per-core basis, a two-socket 32-core AMD EPYC 9384X system outperformed a comparable two-socket 32-core Intel Xeon Platinum 8462Y+ system by ~1.77x on the same benchmarks.[8]

Figure 8: OpenFOAM performance (32 cores)

Explicit Finite Element Analysis (FEA)

Explicit Finite Element Analysis (FEA) is a numerical simulation technique used to analyze the behavior of structures and materials subjected to dynamic events, such as impact, explosions, or crash simulations. For example, the automotive industry uses FEA to analyze vehicle designs and predict both a car's behavior in a collision and how that collision might affect the car's occupants. Another example is cell phone manufacturers simulating a drop test of their phones to ensure their durability. Using simulations allows manufacturers to save time and expense by testing virtual designs and reducing the need to experimentally test a full prototype.

These simulations start with a very complex digital model of the device to be tested (e.g., a car or a cell phone) and then simulate the physics of a dynamic event (e.g., an impact) by solving a series of differential equations over a period of time. Each stress or strain on one part of the model can create heat, movement, torque, etc. in other parts of the model, and the simulation looks for areas where the model might deform or fail. These calculations can require high levels of compute and memory bandwidth on a compute node. Further, since an impact on one part of the model can cause changes in a distant part of the model, there can be high communication demands between compute nodes, which have to share information with each other about how their assigned portions of the model affect, or are affected by, one another.

Altair® Radioss™: Altair Radioss is used to perform structural analyses under impact or crash conditions. Its benchmarks provide hardware performance data measured using sets of benchmark problems selected to represent typical usage. A two-socket AMD EPYC 9684X system outperformed a comparable two-socket Intel Xeon Platinum 8480+ system by ~2.10x.[9]

Figure 9: Altair Radioss performance (system level)

And a two-socket 32-core AMD EPYC 9384X system outperformed a two-socket 32-core Intel Xeon Platinum 8462Y+ system by ~1.37x on the same benchmarks.[9]

Figure 10: Altair Radioss performance (32 cores)

Ansys® LS-DYNA®: Ansys® LS-DYNA® is a widely used explicit simulation program. It is capable of simulating complex real-world short-duration events in the automotive, aerospace, construction, military, manufacturing, and bioengineering industries. A two-socket AMD EPYC 9684X system outperformed a two-socket Intel Xeon Platinum 8480+ by up to ~2.86x on the standard 3cars benchmark and showed solid performance uplifts on three other standard benchmarks.[10]

Figure 11: Ansys LS-DYNA performance (system level)

On a per-core basis, a two-socket 32-core AMD EPYC 9384X system outperformed a two-socket 32-core Intel Xeon Platinum 8462Y+ system by up to ~1.89x on the 3cars benchmark and also showed significantly higher performance on three other standard benchmarks.[10]

Figure 12: Ansys LS-DYNA performance (32 cores)

As presented above, the performance impact that AMD EPYC 9004 processors with AMD 3D V-Cache technology can deliver for technical computing workloads is impressive. If you are looking to minimize your time to solution, the highest core-count processors deliver exceptional performance per compute node. All of these workloads are very complex and solve very challenging problems. Further, the software licensing costs can be high, especially because software is often licensed on a per-core basis. Those looking to maximize the value of a per-core software license should consider mid-core count AMD EPYC 9004 processors with AMD 3D V-Cache technology to deliver a balance of exceptionally high per-core and per-node performance.

4th Gen AMD EPYC processors deliver the performance and efficiency needed to tackle today’s most challenging workloads. Adding AMD 3D V-Cache technology to this generation extends the proven benefits of that technology to deliver exceptional performance for many memory-bandwidth-bound workloads.

Super-linear Scaling

The significant single-node performance advantage of AMD EPYC 9004 Series Processors with AMD 3D V-Cache technology becomes even more pronounced when they are deployed against realistic workloads in a multi-node technical computing context. Adding more compute nodes to a technical computing cluster reduces the portion of the dataset being processed by each node. With enough reduction, each portion of the dataset fits entirely within the L3 cache of its compute node, which causes a sudden performance boost called super-linear scaling.[4] This behavior is not unique to these processors, but thanks to their industry-leading 1,152MB of L3 cache (3x the 384MB of the standard EPYC 9004 series L3 cache), AMD EPYC 9004 Series Processors with AMD 3D V-Cache technology show excellent scalability, including super-linear scaling.

For example, AMD testing showed that the OpenFOAM Motorbike model with the 130x52x52 mesh exhibits super-linear scaling of ~2.50x at two nodes. This speedup extends to ~6.40x at four nodes and ~13.55x at eight nodes. The accelerating scalability as more nodes are added demonstrates the super-linear scaling effect.[8]
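
The speedups quoted above can be checked against the definition of linear and super-linear scaling given in reference [4]. The sketch below divides each speedup by its node count to get a parallel efficiency, then classifies the result using the +/- 2% margin from that definition. The function name and the exact way the margin is applied are assumptions made for illustration, not AMD's published method.

```python
MARGIN = 0.02  # +/- 2% margin of error, per the definition in reference [4]


def classify_scaling(nodes: int, speedup: float, margin: float = MARGIN) -> str:
    """Classify a speedup relative to single-node performance.

    Efficiency of 1.0 means perfectly linear scaling (N nodes -> Nx speedup).
    Applying the margin symmetrically around 1.0 is an assumption.
    """
    efficiency = speedup / nodes
    if efficiency > 1 + margin:
        return "super-linear"
    if efficiency >= 1 - margin:
        return "linear"
    return "sub-linear"


# Speedups quoted in the text for the OpenFOAM Motorbike 130x52x52 mesh.
reported_speedups = {2: 2.50, 4: 6.40, 8: 13.55}

for nodes, speedup in reported_speedups.items():
    efficiency = speedup / nodes
    print(f"{nodes} nodes: {speedup:5.2f}x speedup, "
          f"efficiency {efficiency:.2f} -> {classify_scaling(nodes, speedup)}")
```

With the quoted figures, the efficiency rises from 1.25 at two nodes to 1.60 at four and about 1.69 at eight, so each added step is further above linear than the last.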

Figure 13: OpenFOAM super-linear scaling

Conclusion

AMD is steadfastly committed to our partners. We understand the need to address the evolution of the various market segments and verticals that our partners serve. We continue innovating products that deliver exceptional performance and efficiency. The introduction of AMD EPYC 9004 processors with AMD 3D V-Cache technology is yet another milestone on our ongoing quest to continue delivering the world’s preeminent datacenter processors.

AMD offers guidance around the best CPU tuning practices to achieve optimal performance on these key workloads when deploying 4th Gen AMD EPYC processors for your environment. Please visit AMD EPYC™ Server Processors to learn more.

The launch of 4th Gen AMD EPYC processors in November of 2022 marked the debut of the world’s highest-performance server processor that delivers optimal TCO across workloads, industry leadership x86 energy efficiency [2][3] to help support sustainability goals, and Confidential Computing across a rich ecosystem of solutions. The advent of AMD EPYC 97x4 processors and AMD EPYC 9004 processors with AMD 3D V-Cache™ technology expands the line of 4th Gen AMD EPYC processors with new processor models optimized for cloud infrastructure and memory-bound workloads, respectively.

Other key AMD technologies include:

AMD Instinct™ accelerators are designed to power discoveries at exascale to enable scientists to tackle our most pressing challenges.
AMD Pensando™ solutions deliver highly programmable software-defined cloud, compute, networking, storage and security features wherever data is located, helping to offer improvements in productivity, performance and scale compared to current architectures with no risk of lock-in.
AMD FPGAs and Adaptive SoCs offer highly flexible and adaptive FPGAs, hardware adaptive SoCs, and the Adaptive Compute Acceleration Platform (ACAP) processing platforms that enable rapid innovation across a variety of technologies from the endpoint to the edge to the cloud.

Raghu Nambiar is a Corporate Vice President of Data Center Ecosystems and Solutions for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

References

[1] “Technical Computing” or “Technical Computing Workloads” as defined by AMD can include: electronic design automation, computational fluid dynamics, finite element analysis, seismic tomography, weather forecasting, quantum mechanics, climate research, molecular modeling, or similar workloads. GD-204
[2] SP5-072: As of 1/11/2023, a 4th Gen EPYC 9654 powered server has highest overall scores in key industry-recognized energy efficiency benchmarks SPECpower_ssj®2008, SPECrate®2017_int_energy_base and SPECrate®2017_fp_energy_base. See details at https://www.amd.com/en/claims/epyc4#SP5-072
[3] SPCTCO-002A: A 2P AMD EPYC 96 core 9654 CPU powered server, to deliver 10,000 units of integer performance takes an estimated: 59% fewer servers (7 AMD servers vs 17 Intel servers), 46% less power, and a 48% lower 3-yr TCO than a 2P server based on the 40 core Intel Xeon Platinum 8380 CPUs. The 2P EPYC 96 core CPU solution also provides estimated Greenhouse Gas Emission savings emissions avoided equivalent to 145,443 pounds of coal not burned in the USA over 3 years and carbon sequestration equivalent of 53 acres of forest annually in the USA.
[4] AMD defines “linear scaling” as an equal and proportionate application performance uplift relative to single node performance; that is, when scaling out to 2 nodes results in 2x the performance of a single node, scaling out to 4 nodes results in 4x the performance of a single node, and so forth. “Super-linear” scaling is when the performance uplift achieved by adding one or more node(s) is greater than linear. AMD allows a +/- of 2% margin of error when claiming linear or super linear scaling. GD-2055.
[5] See https://www.amd.com/system/files/documents/amd-epyc-9004x-pb-altair-acusolve.pdf.
[6] See https://www.amd.com/system/files/documents/amd-epyc-9004x-pb-ansys-cfx.pdf.
[7] See https://www.amd.com/system/files/documents/amd-epyc-9004x-pb-ansys-fluent.pdf.
[8] See https://www.amd.com/system/files/documents/amd-epyc-9004x-pb-openfoam.pdf.
[9] See https://www.amd.com/system/files/documents/amd-epyc-9004x-pb-altair-radioss.pdf.
[10] See https://www.amd.com/system/files/documents/amd-epyc-9004x-pb-ansys-ls-dyna.pdf.