At our Data Center and AI Technology Premiere event, AMD expanded the 4th Gen AMD EPYC™ processor family by introducing two additional processor models. First, AMD debuted the AMD EPYC 97x4 processors, codenamed “Bergamo”, the industry’s first x86 processors purpose-built for cloud-native computing; you can read more about them in my previous blog. This blog discusses AMD EPYC 9xx4 processors with AMD 3D V-Cache™ technology, codenamed “Genoa-X”.
AMD EPYC 9004 processors with AMD 3D V-Cache technology continue the legacy of 3rd Gen AMD EPYC 7003 processors with AMD 3D V-Cache technology by delivering 3x the L3 cache of standard AMD EPYC 9004 processors – up to 1,152MB of L3 cache per CPU. AMD EPYC 9004 processors with AMD 3D V-Cache technology leverage the same great design as general purpose 4th Gen AMD EPYC processors and add AMD 3D V-Cache technology to stack additional SRAM directly on top of the compute die, thereby tripling the total L3 cache size. A cache this large can store a significantly larger working dataset. Placing that data so close to the cores can relieve pressure on memory bandwidth and significantly speed up many technical computing workloads.
AMD EPYC 9004 processors with AMD 3D V-Cache technology include the cutting-edge technologies found in general purpose 4th Gen AMD EPYC processors, including “Zen 4” cores built on 5nm process technology, 12 channels of DDR5 memory with supported memory speeds up to 4800 MT/s, up to 128 (1P) or 160 (2P) lanes of PCIe® Gen5 delivering 2x the transfer rate of PCIe Gen4, 3rd Gen Infinity Fabric delivering 2x the data transfer rate of 2nd Gen Infinity Fabric, and AMD Infinity Guard technology that defends your data while in use. These new processors are socket compatible with existing 4th Gen AMD EPYC platforms.
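To put those memory numbers in perspective, here is a quick back-of-the-envelope sketch (assuming standard 64-bit DDR5 channels at 4800 MT/s; sustained bandwidth in practice is lower than this theoretical ceiling) of the peak memory bandwidth available per socket:

```python
# Back-of-the-envelope peak DDR5 bandwidth for one 4th Gen AMD EPYC socket.
# Assumes 12 independent 64-bit (8-byte) DDR5 channels at 4800 MT/s; real
# sustained bandwidth is lower than this theoretical ceiling.

channels = 12            # DDR5 memory channels per socket
transfer_rate = 4800e6   # transfers per second (4800 MT/s)
bytes_per_transfer = 8   # 64-bit channel width

peak_bw_gbs = channels * transfer_rate * bytes_per_transfer / 1e9
print(f"Theoretical peak memory bandwidth: {peak_bw_gbs:.1f} GB/s per socket")
# -> Theoretical peak memory bandwidth: 460.8 GB/s per socket
```

That roughly 460 GB/s ceiling is shared by every core in the socket, which is why keeping more of the working set in cache matters so much for bandwidth-hungry workloads.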
The 300+ world records earned by AMD EPYC 9004 Series processors are a testament to AMD’s relentless pursuit of performance leadership with industry-leading energy efficiency[2] and optimal TCO[3]. The industry has responded to these efforts: a rich and growing ecosystem of full-stack solutions and partnerships leverages the cutting-edge features and technologies offered by AMD EPYC processors to enable faster time to value for customers’ current and future needs.
We are grateful for our broad ecosystem of partners who continue to collaborate with our engineers to deliver a wide range of datacenter solutions, including:
Alibaba Cloud, Altair, AlmaLinux, Amazon Web Services, Anjuna, Ansys, ASRock, Asus, Atos, BEAMR, Broadcom, Cadence, Canonical, Casa Systems, Cisco, Citrix, Cloudera, Couchbase, Dassault Systèmes, Datastax, Dell, Elastic, Equinix, ESI, Excelero, Foxconn, FreeBSD, Gigabyte, Google Cloud, HBC, HPE, IBM Cloud, Inventec, JMA, Juniper, Kioxia, Lenovo, MariaDB, Mavenir, SingleStore, Micron, Microsoft, Mitac, Neural Magic, MongoDB, MSI, MySQL, NetScout, Nokia, Nutanix, Oracle, PGS Software, QCT, Quobyte, Radisys, Red Hat, RedisLabs, Robin, Rocky Linux, Samsung, Shearwater, Siemens Digital Industries Software, SK Hynix, SLB, Splunk, StorMagic, Supermicro, SUSE, Synopsys, Tencent Cloud, TigerGraph, Transwarp, Tyan, Velocix, Vertica, WEKA, VMware, Western Digital, Wiwynn, Wistron and others.
AMD works closely with our partners to explore and tune the performance of technical computing workloads that can take advantage of the large L3 cache, demonstrating the breakthrough performance offered by AMD EPYC 9004 processors with AMD 3D V-Cache technology. Let’s look at some of these performance results.
Computational fluid dynamics (CFD) uses numerical analysis to simulate and analyze fluid flow and how that fluid (liquid or gas) interacts with solids and surfaces, such as the water flow around a boat hull or the aerodynamics of a car body or aircraft fuselage, as well as a wide variety of less obvious applications, including industrial processing and consumer packaged goods. These workloads can be computationally intensive and require substantial resources; however, most CFD workloads are limited by memory bandwidth.
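To see why, consider a rough roofline-style estimate. The sketch below uses hypothetical per-cell FLOP and byte counts and an assumed compute peak (not measurements of any particular solver or SKU) to show how a low-arithmetic-intensity CFD kernel runs far below a socket’s compute ceiling when it must stream its data from DRAM:

```python
# Illustrative roofline-style estimate of why stencil-heavy CFD kernels tend
# to be memory-bandwidth bound. FLOP and byte counts per cell and the compute
# peak are hypothetical round numbers, not measurements of any specific solver.

flops_per_cell = 60          # assumed floating-point ops per cell update
bytes_per_cell = 300         # assumed bytes moved per cell update (fields + neighbors)
arithmetic_intensity = flops_per_cell / bytes_per_cell   # FLOP per byte

peak_flops = 5.0e12          # assumed FP64 compute peak of one socket, FLOP/s
peak_bw = 460.8e9            # theoretical DRAM bandwidth from the sketch above, B/s

# Attainable performance under the roofline model: limited by whichever
# ceiling (compute throughput or memory traffic) the kernel hits first.
attainable = min(peak_flops, arithmetic_intensity * peak_bw)
print(f"Arithmetic intensity: {arithmetic_intensity:.2f} FLOP/byte")
print(f"Attainable: {attainable/1e12:.2f} TFLOP/s of {peak_flops/1e12:.1f} TFLOP/s peak")
```

With only a fraction of a FLOP performed per byte moved, performance is set almost entirely by how quickly data can be delivered to the cores, which is exactly the bottleneck a much larger L3 cache helps relieve.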
AMD EPYC 9004 processors with AMD 3D V-Cache technology can significantly improve the performance of CFD workloads. With up to 1,152MB of L3 cache, more of the workload’s total working dataset can fit into ultra-fast L3 cache memory situated in close proximity to the compute cores.
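As a rough illustration (the mesh size and bytes-per-cell figure below are assumptions for the example, not benchmark data), a simple check of whether a job’s working set could fit in the combined L3 cache of a two-socket node might look like this:

```python
# Rough check of whether a CFD working set could fit in L3 cache.
# The mesh size and bytes-per-cell figure are assumed for illustration only.

cells = 5_000_000            # assumed mesh cell count
bytes_per_cell = 200         # assumed storage per cell (fields, fluxes, metadata)
l3_per_socket_mb = 1152      # EPYC 9004 with AMD 3D V-Cache, per the text above

working_set_mb = cells * bytes_per_cell / 1e6
fits = working_set_mb <= 2 * l3_per_socket_mb   # two-socket node
print(f"Working set ~{working_set_mb:.0f} MB; "
      f"fits in 2P L3 ({2 * l3_per_socket_mb} MB): {fits}")
```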
AMD EPYC 9004 processors with AMD 3D V-Cache technology can also significantly improve the scalability of CFD simulations. These workloads can be parallelized to distribute the computational load across multiple cores and multiple compute nodes, efficiently scaling out to very large node counts. CFD codes scale out efficiently by distributing the working dataset across the nodes in the run with minimal shared state, so each additional compute node adds compute power (cores, bandwidth, etc.) without adding much coordination overhead between nodes. Each processor also adds to the total L3 cache available to the overall workload, and fitting more of the workload into cache can significantly accelerate the job and even create super-linear scaling.[4] More on this below.
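A toy model helps illustrate the effect. In the sketch below, the cache capacity comes from the text above, but the dataset size and the DRAM-versus-cache bandwidth ratio are assumed round numbers; the point is only the shape of the curve, where throughput jumps once each node’s share of the data drops below its L3 capacity:

```python
# Illustrative model of how strong scaling can turn super-linear: as nodes are
# added, each node's share of the dataset shrinks until it fits in L3 cache,
# at which point effective bandwidth (and per-node throughput) jumps.
# Dataset size and bandwidth ratio are assumed round numbers, not measurements.

L3_PER_NODE_MB = 2 * 1152          # two EPYC 9004 X-series sockets per node
DATASET_MB = 4000                  # assumed total working set of the job
DRAM_BW, CACHE_BW = 1.0, 2.0       # relative effective bandwidth (DRAM vs. L3)

def relative_throughput(nodes: int) -> float:
    """Aggregate throughput relative to a single node running from DRAM."""
    per_node_mb = DATASET_MB / nodes
    bw = CACHE_BW if per_node_mb <= L3_PER_NODE_MB else DRAM_BW
    return nodes * bw

base = relative_throughput(1)
for n in (1, 2, 4, 8):
    speedup = relative_throughput(n) / base
    print(f"{n} node(s): ~{speedup:.1f}x speedup (linear scaling would be {n}x)")
```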
Figure 1: Altair AcuSolve performance (system level)
On a per-core basis, a two-socket 32-core AMD EPYC 9384X system outperformed a comparable two-socket 32-core Intel Xeon Platinum 8462Y+ system by ~1.63x.[5]
Figure 2: Altair AcuSolve performance (32 cores)
Figure 3: Ansys CFX performance (system level)
On a per-core basis, a two-socket 32-core AMD EPYC 9384X outperformed a two-socket 32-core Intel Xeon Platinum 8462Y+ system by up to ~2.03x on the same benchmarks.[6]
Figure 4: Ansys CFX performance (32 cores)
Figure 5: Ansys Fluent performance (system level)
On a per-core basis, a two-socket 32-core AMD EPYC 9384X outperformed a two-socket 32-core Intel Xeon Platinum 8462Y+ system by ~1.35x on the same benchmarks.[7]
Figure 6: Ansys Fluent performance (32 cores)
Figure 7: OpenFOAM performance (system level)
On a per-core basis, a two-socket 32-core AMD EPYC 9384X system outperformed a comparable two-socket 32-core Intel Xeon Platinum 8462Y+ system by ~1.77x on the same benchmarks.[8]
Figure 8: OpenFOAM performance (32 cores)
Explicit Finite Element Analysis (FEA) is a numerical simulation technique used to analyze the behavior of structures and materials subjected to dynamic events, such as impact, explosions, or crash simulations. For example, the automotive industry uses FEA to analyze vehicle designs and predict both a car's behavior in a collision and how that collision might affect the car's occupants. Another example is cell phone manufacturers simulating a drop test of their phones to ensure their durability. Using simulations allows manufacturers to save time and expense by testing virtual designs and reducing the need to experimentally test a full prototype.
These simulations start with a very complex digital model of the device to be tested (e.g., a car or a cell phone) and then simulate the physics of a dynamic event (e.g., an impact) by solving a series of differential equations over time. Each stress or strain on one part of the model can create heat, movement, torque, etc. in other parts of the model, and the simulation looks for areas where the model might deform or fail. These calculations can require high levels of compute and memory bandwidth on each compute node. Further, since an impact on one part of the model can cause changes in a distant part of the model, compute nodes must frequently exchange information about how their assigned portions of the model affect one another, creating high communication demands between nodes.
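For a feel of how explicit codes march a model through time, here is a minimal, self-contained sketch: a 1D spring-mass chain with arbitrary material values, integrated with the explicit central-difference scheme that this family of codes relies on. It is not the formulation of any commercial solver, just an illustration of the time-stepping idea.

```python
# Minimal sketch of explicit dynamics: a 1D chain of masses and springs
# integrated with an explicit central-difference (leapfrog-style) scheme,
# the same family of time integration used by explicit FEA codes.
# Material values and the impact load are arbitrary illustrative numbers.
import numpy as np

n = 50                     # number of nodes in the chain
m = 1.0                    # mass per node (kg)
k = 1.0e4                  # spring stiffness (N/m)
dt = 0.5 * np.sqrt(m / k)  # time step below the explicit stability limit

u = np.zeros(n)            # displacements
v = np.zeros(n)            # velocities
v[0] = 5.0                 # initial velocity at one end models an impact

for step in range(2000):
    # Internal forces from the springs between neighboring nodes.
    strain = np.diff(u)                # elongation of each spring
    f = np.zeros(n)
    f[:-1] += k * strain               # spring pulls node toward its neighbor
    f[1:] -= k * strain
    # Explicit update: acceleration -> velocity -> displacement.
    v += (f / m) * dt
    u += v * dt

print(f"Peak displacement after {2000 * dt:.3f} s: {np.abs(u).max():.4f} m")
```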
Figure 9: Altair Radioss performance (system level)
On a per-core basis, a two-socket 32-core AMD EPYC 9384X system outperformed a two-socket 32-core Intel Xeon Platinum 8462Y+ system by ~1.37x on the same benchmarks.[9]
Figure 10: Altair Radioss performance (32 cores)
Figure 11: Ansys LS-DYNA performance (system level)
On a per-core basis, a two-socket 32-core AMD EPYC 9384X system outperformed a two-socket 32-core Intel Xeon Platinum 8462Y+ by up to ~1.89x on the 3car benchmark and also showed significantly higher performance on three other standard benchmarks.[10]
Figure 12: Ansys LS-DYNA performance (32 cores)
As presented above, AMD EPYC 9004 processors with AMD 3D V-Cache technology can deliver impressive performance gains for technical computing workloads. If you are looking to minimize your time to solution, the highest core-count processors deliver exceptional performance per compute node. These workloads are complex and computationally demanding, and software licensing costs can be high, especially because software is often licensed on a per-core basis. Those looking to maximize the value of a per-core software license should consider mid-core-count AMD EPYC 9004 processors with AMD 3D V-Cache technology, which deliver a balance of exceptionally high per-core and per-node performance.
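As a purely hypothetical illustration of that licensing trade-off (the core counts, relative throughput figures, and license cost below are made up for the example; only the per-core value reasoning follows the text), compare throughput per licensed core for a high-core-count node versus a mid-core-count node:

```python
# Hypothetical illustration of the per-core licensing trade-off described above.
# License cost, throughput figures, and core counts are made up for the example;
# only the reasoning (value delivered per licensed core) follows the text.

configs = {
    # name: (cores per node, relative jobs per day per node)
    "high core count": (192, 1.00),
    "mid core count": (64, 0.62),
}
license_cost_per_core = 1.0   # normalized annual ISV license cost per core

for name, (cores, jobs_per_day) in configs.items():
    cost = cores * license_cost_per_core
    value = jobs_per_day / cost          # throughput per unit of license spend
    print(f"{name:16s} cores={cores:3d}  jobs/day={jobs_per_day:.2f}  "
          f"jobs per license dollar={value:.4f}")
```

Even though the high-core-count node finishes more jobs per day in this made-up example, the mid-core-count configuration delivers more work per license dollar when licenses are priced per core.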
4th Gen AMD EPYC processors deliver the performance and efficiency needed to tackle today’s most challenging workloads. The advent of 4th Gen AMD EPYC processors with AMD 3D V-Cache technology brings the proven performance of AMD 3D V-Cache technology to the 4th generation of AMD EPYC processors to deliver exceptional performance for many memory bandwidth bound workloads.
The significant single-node performance advantage of AMD EPYC 9004 Series processors with AMD 3D V-Cache technology becomes even more pronounced when these processors are deployed against realistic workloads in a multi-node technical computing context. Adding more compute nodes to a technical computing cluster reduces the portion of the dataset processed by each node. With enough nodes, each portion of the dataset fits entirely within the L3 cache of its compute node, producing a sudden performance boost known as super-linear scaling.[4] This behavior is not unique to these processors, but the industry-leading 1,152MB L3 cache in AMD EPYC 9004 Series processors with AMD 3D V-Cache technology (3x the 384MB L3 cache of standard AMD EPYC 9004 Series processors) enables excellent scalability, including super-linear scaling.
For example, AMD testing showed that the OpenFOAM Motorbike model with the 130x52x52 mesh exhibits super-linear scaling of ~2.50x at two nodes. This speedup extends to ~6.40x at four nodes and ~13.55x at eight nodes. The accelerating scalability as more nodes are added demonstrates the super-linear scaling effect.[8]
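Using those reported speedups, the implied parallel efficiency (anything above 100% is the super-linear regime) works out as follows:

```python
# Parallel efficiency implied by the OpenFOAM Motorbike (130x52x52) speedups
# reported above; efficiency above 100% is the super-linear regime.

speedups = {2: 2.50, 4: 6.40, 8: 13.55}   # nodes -> reported speedup vs. one node

for nodes, speedup in speedups.items():
    efficiency = speedup / nodes * 100
    print(f"{nodes} nodes: {speedup:.2f}x speedup, {efficiency:.0f}% parallel efficiency")
# -> 2 nodes: 125%, 4 nodes: 160%, 8 nodes: 169% (all above 100%, i.e. super-linear)
```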
Figure 13: OpenFOAM super-linear scaling
AMD is steadfastly committed to our partners. We understand the need to address the evolution of the various market segments and verticals that our partners serve. We continue innovating products that deliver exceptional performance and efficiency. The introduction of AMD EPYC 9004 processors with AMD 3D V-Cache technology is yet another milestone on our ongoing quest to continue delivering the world’s preeminent datacenter processors.
AMD offers guidance around the best CPU tuning practices to achieve optimal performance on these key workloads when deploying 4th Gen AMD EPYC processors for your environment. Please visit AMD EPYC™ Server Processors to learn more.
The launch of 4th Gen AMD EPYC processors in November of 2022 marked the debut of the world’s highest-performance server processor, delivering optimal TCO across workloads, industry-leading x86 energy efficiency[2][3] to help support sustainability goals, and Confidential Computing across a rich ecosystem of solutions. The advent of AMD EPYC 97x4 processors and AMD EPYC 9004 processors with AMD 3D V-Cache™ technology expands the line of 4th Gen AMD EPYC processors with new processor models optimized for cloud infrastructure and memory-bound workloads, respectively.
Raghu Nambiar is a Corporate Vice President of Data Center Ecosystems and Solutions for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.