AMD Ecosystem Poised and Ready for the 4th Gen AMD EPYC™ Processors

raghu_nambiar

Note: Updated April 4th, 2023 with new AI/ML workloads using AMD ZenDNN 4.0.

The debut of the first generation of AMD EPYC processors introduced game-changing features including high core counts and leadership performance. Each generation has built on the previous generation by pushing performance leadership across multiple segments. The new 4th Gen AMD EPYC processors (9004 Series) announced today continue this established legacy by delivering impressive generational performance gains with double the workload performance in many cases. Some of the significant features offered by AMD EPYC 9004 Series Processors include:

  • New “Zen 4” cores: The latest “Zen 4” cores deliver up to a 14% generational uplift in single-thread instructions per clock (IPC) and add support for AVX-512 instructions to boost software performance, especially for key AI/ML workloads.[1]
  • 50% more cores: 4th Gen AMD EPYC processors deliver up to 50% more cores than the prior generation for compute-intensive workloads. The efficiency, core density, and 5nm technology of 4th Gen AMD EPYC processors deliver significant overall power/performance efficiency improvements compared to prior AMD EPYC generations.
  • >2x higher memory bandwidth: 12 memory channels deliver ~50% higher generational bandwidth. The efficiency and performance of DDR5, with supported memory speeds up to 4800 MT/s (DDR5-4800), add a further ~50% memory bandwidth for the most critical workload demands.[2]
  • 2x faster PCIe® performance: PCIe® Gen5 delivers 2x the transfer rate of PCIe Gen4.
  • 2x faster Infinity Fabric™: 3rd Gen Infinity Fabric delivers 2x the data transfer rate between sockets over 2nd Gen Infinity Fabric. AMD Infinity Guard delivers a leading set of modern security features to help protect sensitive data, enabling Confidential Computing with Secure Encrypted Virtualization technology.[3]
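
The ">2x" memory bandwidth figure in the list above follows from compounding the two ~50% gains. Here is a minimal sketch of that arithmetic. It assumes a 64-bit (8-byte) data path per memory channel and 8 channels of DDR4-3200 for the prior generation; neither assumption is stated in this post. The results are theoretical peaks, not measured bandwidth.

```python
def peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth per socket in GB/s (1 GB = 10**9 bytes)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000


# Assumption (not stated in the post): prior generation = 8 channels of DDR4-3200.
prev_gen = peak_bandwidth_gbs(channels=8, mega_transfers_per_s=3200)
# From the post: 12 channels of DDR5 at up to DDR5-4800 speeds.
gen4 = peak_bandwidth_gbs(channels=12, mega_transfers_per_s=4800)

channel_gain = 12 / 8      # 1.5x from the extra memory channels
speed_gain = 4800 / 3200   # 1.5x from the faster memory
combined_gain = channel_gain * speed_gain

print(f"prior gen peak: {prev_gen:.1f} GB/s")
print(f"4th Gen peak:   {gen4:.1f} GB/s")
print(f"combined gain:  {combined_gain:.2f}x")
```

Under these assumptions the two 1.5x factors multiply to 2.25x, which is where a ">2x" claim would come from.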

AMD’s objective is to offer our customers an exceptional out-of-the-box experience when deploying their applications on AMD EPYC processors. AMD understands that both software compatibility and high performance are critical for modern data-driven infrastructures. AMD EPYC processors deliver on this for key workloads such as relational databases, big data analytics, artificial intelligence, technical computing, and others.

AMD has also focused on industry verticals, such as telecom, healthcare, financial services, manufacturing, and others by delivering on-premises systems and both private and public cloud environments. AMD EPYC 9004 Series processors are designed to deliver better performance, faster throughput, and higher productivity gains across the board.

Tuning guides with best practices for these key workloads help you achieve optimal performance when deploying 4th Gen AMD EPYC processors in your environment. Visit www.amd.com/epyc-tuning-guides.

Our Partners Make Us Successful

AMD closely collaborates with our extensive and growing set of ecosystem partners to bring next-generation engineering innovations to market that support AMD EPYC® 9004 Series Processors. Ecosystem improvements across hardware and software products deliver immediate customer value for datacenters, in the cloud, and now at the edge. 

We are grateful for our broad ecosystem of partners who continue to collaborate with our engineers to deliver a wide range of datacenter solutions, including:

Alibaba Cloud, Altair, Amazon Web Services, Anjuna, Ansys, ASRock, Asus, Atos, BEAMR, Broadcom, Cadence, Canonical, Casa Systems, Cisco, Citrix, Cloudera, Couchbase, Dassault Systèmes, Datastax, Dell, Elastic, Equinix, ESI, Excelero, Foxconn, FreeBSD, Gigabyte, Google Cloud, HBC, HPE, IBM Cloud, Inventec, JMA, Juniper, Kioxia, Lenovo, MariaDB, Mavenir, MemSQL, Micron, Microsoft, Mitac, Neural Magic, MongoDB, MSI, MySQL, Netscout, Nokia, Nutanix, Oracle, PGS Software, QCT, Quobyte, Radisys, Red Hat, RedisLabs, Robin, Samsung, ScaleMP, Siemens Digital Industries Software, SK Hynix, Splunk, StorMagic, Supermicro, SUSE, Synopsys, Tencent Cloud, TigerGraph, Transwarp, Tyan, Velocix, Vertica, WEKA, VMware, Western Digital, Wiwynn, Wistron and others.

Operating Systems and Security

AMD has significantly increased our investment in open-source OS, hypervisors, containers, and orchestration since the introduction of AMD EPYC processors. Our developers contribute to key areas of the Linux kernel and virtualization stack to help improve infrastructure reliability, robustness, and performance. We also invest deeply in engagements with the various operating system vendors in the ecosystem. Thanks to our work with these partners, 4th Gen AMD EPYC processors will enjoy support from Microsoft Windows Server, VMware vSphere, Azure Stack HCI Server, Red Hat Enterprise Linux, SUSE Linux Enterprise Server, Canonical Ubuntu, Oracle Unbreakable Enterprise Kernel, Citrix Hypervisor, Nutanix, and FreeBSD. Visit https://www.amd.com/en/processors/epyc-minimum-operating-system for the complete list.

One of the unique differentiators that AMD brings to the industry is hardware-accelerated encryption that enables Confidential Computing. Confidential Computing is a game-changing paradigm shift for computing in private and public clouds as well as on hosted services. It helps address key security concerns many organizations have about hosting their sensitive applications in multi-tenant environments by helping safeguard their most valuable information while it is in use by their applications.

AMD engages with open-source projects, operating system partners, and cloud vendors to help drive the development of Confidential Computing based on AMD’s Secure Encrypted Virtualization (SEV) technology. SEV support is available from Canonical, Nutanix, Oracle, Red Hat, SUSE, and VMware. In addition, Google Cloud, Microsoft Azure and Oracle Cloud Infrastructure have announced their Confidential VM availability plans. Further, the Confidential Containers project achieved its first full release and includes support for AMD SEV.

Leadership Across Key Workloads

I have always been passionate about performance. AMD EPYC 9004 Series Processors deliver performance leadership, as demonstrated with 300+ world records, without compromising energy efficiency and while providing incredible overall TCO.

In the next several sections, I will present performance data from various workloads starting with some foundational workloads, followed by workloads in the Hyperconverged and Virtualized Environments, Data Management Systems, High Performance Computing, Artificial Intelligence and Machine Learning (AI/ML) domains and finally some industry vertical performance data for Financial Services (Black-Scholes) and Media and Entertainment.

Foundational Workload Performance

  • Integer and floating-point performance using SPEC CPU® 2017: SPEC CPU® 2017 is one of the most popular industry-standard benchmarks. It is designed to compare compute-intensive performance across different computer systems by stressing the processor, memory subsystem, and compiler. SPEC CPU® 2017 contains 43 benchmarks organized into four suites, of which two, SPECrate 2017 Integer and SPECrate 2017 Floating Point, are discussed in this blog. As shown in Figure 1, the AMD EPYC 9654 processor provides more than double the performance of the previous generation AMD EPYC processors on both measurements, and ~2.5x (integer[4]) and ~3.0x (floating point[5]) the performance of our closest competition.

Figure 1: SPECrate 2017_int_base and SPECrate 2017_fp_base generational and competitive performance uplifts

  • Matrix multiplication with DGEMM: DGEMM is a popular routine that calculates the double-precision matrix multiplication C ← αAB + βC, where A, B, and C are matrices that contain double-precision floating-point values and α and β are scalars. This open-source benchmark uses the AMD BLIS component of AOCL, which is available here.[6] Figure 2 shows that 4th Gen AMD EPYC 9654 processors deliver a generational uplift of ~1.75x compared to 3rd Gen AMD EPYC 7763 processors.[6]

Figure 2: DGEMM generational performance uplift
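
To make the DGEMM operation concrete, here is a minimal pure-Python sketch of the update C ← αAB + βC on nested lists. It is purely illustrative and is not how AMD BLIS implements the routine; the function names `dgemm` and `dgemm_flops` and the sample matrices are hypothetical. The 2·m·n·k operation count is the conventional estimate for the multiply-add work and ignores the scaling by α and β.

```python
from typing import List

Matrix = List[List[float]]


def dgemm(alpha: float, a: Matrix, b: Matrix, beta: float, c: Matrix) -> Matrix:
    """Return alpha * (A @ B) + beta * C for row-major nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    shapes_ok = (
        all(len(row) == k for row in a)
        and all(len(row) == n for row in b)
        and len(c) == m
        and all(len(row) == n for row in c)
    )
    if not shapes_ok:
        raise ValueError("expected A: m x k, B: k x n, C: m x n")
    result = []
    for i in range(m):
        out_row = []
        for j in range(n):
            dot = sum(a[i][p] * b[p][j] for p in range(k))
            out_row.append(alpha * dot + beta * c[i][j])
        result.append(out_row)
    return result


def dgemm_flops(m: int, n: int, k: int) -> int:
    """Conventional floating-point operation count for an m x k by k x n product."""
    return 2 * m * n * k


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
print(dgemm(2.0, A, B, 3.0, C))  # [[41.0, 47.0], [89.0, 103.0]]
print(dgemm_flops(2, 2, 2))      # 16
```

A benchmark would divide the operation count by the elapsed time to report FLOP/s; the tuned library routine performs the same mathematical update with blocking and vectorization.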

  • High-Performance Linpack (HPL): HPL is a free and portable implementation that solves a random, dense system of linear equations using double-precision (64-bit) floating point arithmetic and can be run on systems ranging from credit card–sized computers to the world’s fastest supercomputers. For example, HPL is widely used to generate data for the Top500 supercomputer list (http://top500.org). 4th Gen AMD EPYC 9654 processors deliver a generational uplift of ~1.77x compared to 3rd Gen AMD EPYC 7763 processors. See Figure 3.[7]

Figure 3: HPL generational performance uplift
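
The kind of problem HPL times can be sketched in a few lines: build a random dense system, solve it in double precision, check the residual, and convert the elapsed time into a FLOP rate. The sketch below is a toy, single-threaded illustration using plain Gaussian elimination with partial pivoting. It is not the HPL code, and the problem size and helper names are hypothetical. The operation count used, 2/3·n³ + 3/2·n², is the figure commonly quoted for HPL and is treated as an assumption here.

```python
import random
import time


def solve_dense(a, b):
    """Solve A x = b by Gaussian elimination with partial pivoting (inputs are copied)."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[pivot][col] == 0.0:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def hpl_flops(n: int) -> float:
    """Assumed operation count for an n x n solve: 2/3 n^3 + 3/2 n^2."""
    return (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2


def max_residual(a, x, b) -> float:
    """Largest absolute entry of A x - b."""
    return max(abs(sum(a[i][j] * x[j] for j in range(len(x))) - b[i])
               for i in range(len(b)))


n = 60  # toy size; real HPL runs use far larger matrices
rng = random.Random(0)
a = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
b = [rng.uniform(-0.5, 0.5) for _ in range(n)]

start = time.perf_counter()
x = solve_dense(a, b)
elapsed = max(time.perf_counter() - start, 1e-9)

print(f"max residual: {max_residual(a, x, b):.2e}")
print(f"rate: {hpl_flops(n) / elapsed / 1e9:.4f} GFLOP/s")
```

The reported rate of an interpreted toy like this is orders of magnitude below what an optimized HPL build achieves; only the structure of the measurement carries over.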

  • Web server performance with NGINX: NGINX is a popular web server that can also be used as a reverse proxy, load balancer, mail proxy, and HTTP cache. AMD tested NGINX throughput in connections per second as a high-performance web server in conjunction with the WRK web (HTTP) client. These tests used a single NGINX server instance on a bare-metal dual-socket server system. Testing retained key NGINX server parameters at their default values, including the number of worker processes and cache manager/loader. This same system ran the WRK client with two hundred threads and fourteen thousand connections for a test duration of ninety (90) seconds (-t 200 -c 14000 -d 90s). The following chart (Figure 4) showcases the requests per second (rps) achieved and demonstrates a very strong generational and competitive performance uplift for the 4th Gen AMD EPYC 9654 processor.[8]

Figure 4: NGINX generational and competitive performance uplifts
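
For readers who want to see what the quoted WRK parameters look like as a full command line, the sketch below assembles one. The flags and values (-t 200 -c 14000 -d 90s) come from the text above. The target URL is a hypothetical placeholder, since the post does not give the endpoint that was tested, and the helper name `wrk_command` is illustrative.

```python
import shlex
from typing import List


def wrk_command(url: str, threads: int = 200, connections: int = 14000,
                duration_s: int = 90) -> List[str]:
    """Build the argument list for a wrk run with the parameters quoted in the post."""
    if connections < threads:
        raise ValueError("wrk needs at least one connection per thread")
    return ["wrk", "-t", str(threads), "-c", str(connections),
            "-d", f"{duration_s}s", url]


# Hypothetical endpoint: the post does not state the URL that was requested.
TARGET_URL = "http://127.0.0.1:80/"

argv = wrk_command(TARGET_URL)
print(shlex.join(argv))
print("connections per thread:", 14000 // 200)
```

The argument list can be passed to `subprocess.run(argv)` on a machine where wrk is installed. With these values each client thread drives 70 open connections.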

  • Enterprise Java with SPECjbb® 2015: This benchmark enables performance measurements of server-side Java® based applications. SPECjbb® simulates a company with an IT infrastructure that handles a mix of point-of-sale requests, online purchases, and data-mining operations. With the rapid adoption of Java across the industry in the last two decades, this benchmark is relevant to all audiences including Java Virtual Machine (JVM) vendors, hardware developers, Java application developers, researchers, and members of the academic community. 4th Gen AMD EPYC 9654 processors more than doubled the performance of the Intel® Xeon® Platinum 8380 processors and showed a significant performance boost over our 3rd Gen AMD EPYC 7763 processors for both the Composite and MultiJVM suites of this benchmark. See Figure 5 and Figure 6.[9]

Figure 5: SPECjbb 2015 Composite generational and competitive performance uplifts

Figure 6: SPECjbb 2015 MultiJVM generational and competitive performance uplifts

  • NVMe® over Fabric: The NVMe over Fabric (NVMe-oF) protocol extends the parallelism and efficiencies of the NVM Express® (NVMe) block protocol over network fabrics such as RDMA (iWARP, RoCE, InfiniBand™), Fibre Channel, and TCP. The Storage Performance Development Kit (SPDK) provides both a user-space NVMe-oF target and initiator that extend the software efficiencies of the rest of the SPDK stack over the network. The SPDK NVMe-oF target uses the SPDK user-space, polled-mode NVMe driver to submit and complete I/O requests to NVMe devices, which reduces the software processing overhead.

    CPU core scaling performance tests with SPDK NVMe-oF (4K block size, queue depth 128), run on 4 cores/8 threads, show that 4th Gen AMD EPYC processors deliver generational performance uplifts of ~1.75x for random reads and ~2.03x for random read/write, respectively, over prior-generation AMD EPYC processors. See Figure 7, below.[10]

Figure 7: NVMe-oF generational performance uplifts

Virtualized Infrastructure

Modern datacenters are highly virtualized and software-defined. Customers are looking for efficiency, scalability, availability, and lower cost of ownership from their virtualized environments, including hyperconverged infrastructure (HCI), converged infrastructure (CI), and/or public cloud deployments. Virtual environments demand high performance and density, and every generation of AMD EPYC processors has given customers the option of more cores, more memory bandwidth, and more IO.

4th Gen AMD EPYC processors raise the bar in the key features important for virtualized environments. The most important “feature” may be balance. 4th Gen AMD EPYC processors deliver both the high density needed by public and private cloud deployments and a balanced solution by significantly increasing performance in the key pipelines that feed data into the cores: memory bandwidth, PCIe performance, and Infinity Fabric (inter-socket communication) performance.

The ever-increasing number of applications and increasing application workload complexity drive the need for performance, but power efficiency is just as critical. The new processor technologies delivered in 4th Gen AMD EPYC processors help drive performance and efficiency to support virtualization and excel in enabling all virtualization environments.

  • Virtualization and consolidation with VMmark: AMD EPYC® 9004 Series Processors deliver outstanding performance on the VMmark virtualization benchmarks. VMmark is a benchmark software suite that measures the performance, power consumption, and scalability of virtualized servers while running under load on a set of physical hardware. It also supports making comparisons between multiple virtualization platforms. As shown in Figure 8, a two-node cluster powered by dual 4th Gen AMD EPYC 9654 processors achieved a score of 44 tiles, compared to 24 tiles and 14 tiles achieved in similar setups powered by 3rd Gen AMD EPYC 7763 and Intel Xeon Platinum 8380 processors, respectively.[11]

Figure 8: VMmark 3.1.1 matched pair generational and competitive performance uplift
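
The tile counts quoted above translate directly into uplift ratios. The short sketch below only restates the numbers from the text as arithmetic; the dictionary layout and the `uplift` helper are illustrative.

```python
# Tile counts quoted in the text for the two-node VMmark clusters.
tiles = {
    "4th Gen AMD EPYC 9654": 44,
    "3rd Gen AMD EPYC 7763": 24,
    "Intel Xeon Platinum 8380": 14,
}


def uplift(new: float, baseline: float) -> float:
    """Ratio of the new result to the baseline result."""
    return new / baseline


generational = uplift(tiles["4th Gen AMD EPYC 9654"], tiles["3rd Gen AMD EPYC 7763"])
competitive = uplift(tiles["4th Gen AMD EPYC 9654"], tiles["Intel Xeon Platinum 8380"])

print(f"generational uplift: {generational:.2f}x")  # 1.83x
print(f"competitive uplift:  {competitive:.2f}x")   # 3.14x
```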

  • Virtual Desktop Infrastructure density: Virtual Desktop Infrastructure (VDI) is a technology that refers to the use of virtual machines to provide and manage virtual desktops. Login VSI is a testing platform used to measure, analyze, and optimize VDI deployments. One of the key measurements is the number of concurrent virtual users that a single server can support while still delivering acceptable performance levels. The results shown in Figure 9 are the numbers of concurrent knowledge workers that a single server can support. The 4th Gen AMD EPYC 9654 processor exceeds the Intel Xeon Platinum 8380 by more than 2.00x.[12]

Figure 9: Virtual Desktop Density competitive performance uplift

Database Management Systems

The use of databases across structured, unstructured, and time-series data types has exploded across the enterprise as application complexity has increased. AMD EPYC 9004 Series Processors with expanded memory channels, PCIe Gen5 storage, and network support enable increased database throughput, performance, and predictability compared to both prior generation AMD EPYC and competing processors. The breadth of tests across proprietary and open-source database offerings highlights the strength and power delivered by 4th Gen AMD EPYC processors. This section discusses performance results for relational, NoSQL, and graph databases.

  • Relational Database Management Systems: Relational Database Management Systems (RDBMS) continue to be the foundation for business-critical applications. AMD benchmarked Online Transaction Processing (OLTP) and Decision Support Systems (DSS) on AMD EPYC® 9004 Series Processors with the popular open-source MySQL relational database developed by Oracle. As shown in Figure 10, AMD EPYC 9004 Series Processors delivered ~2.40x and ~2.70x the performance of Intel Xeon Platinum 8380 processors for OLTP and DSS workloads, respectively.[13,14]

Figure 10: Competitive performance uplift for OLTP and DSS on MySQL

  • ERP performance with SAP SD: SAP Application Performance Standard (SAPS) is the standard SAP benchmark used for measuring the performance of SAP deployments. SAPS uses a hardware-independent unit of measurement that describes the performance of a system operating in the SAP environment. It is derived from the Sales and Distribution (SD) benchmark. As shown in Figure 11, the 4th Gen AMD EPYC 9654 processor displays impressive generational and competitive performance uplifts compared to the 3rd Gen AMD EPYC 7763 and Intel Xeon Platinum 8380 processors, respectively.[15]

Figure 11: ERP Performance with SAP SD generational and competitive uplifts

  • Graph database performance with LDBC Social Network Benchmark: The Linked Data Benchmark Council (LDBC) aims to define standard graph benchmarks to foster a community around graph processing technologies. The LDBC Social Network Benchmark (SNB) Business Intelligence (BI) suite defines graph workloads targeting database management systems. AMD tested this benchmark using the massively parallel, scalable, and distributed TigerGraph graph analytics database platform that stores entities as the nodes in a graph and their relationships as the edges that interconnect the nodes.

    This approach enables modeling the natural relationships between entities without the need to structure them in multiple tables, thereby enabling rapid querying of massive datasets for both interactive queries and batch-processed reports. This model is emerging as a replacement for relational, document-based, and key-value database systems. Business verticals use TigerGraph for purposes such as fraud-detection, supply-chain optimization, and healthcare recommendations. The 96-core 4th Gen AMD EPYC 9654 processor shows generational performance uplifts of about 2.40x (transaction processing) and 2.70x (decision support) compared to the 64-core 3rd Gen AMD EPYC 7763 processors at SF1000 (a scale factor of 1000 GB). See Figure 12.[16]

Figure 12: Graph Database generational performance uplift

  • In-memory database performance with Redis: Redis is an in-memory data structure store used as a distributed, in-memory key–value database, cache, and message broker, with optional durability. Redis supports different kinds of abstract data structures, such as strings, lists, maps, sets, sorted sets, HyperLogLogs, bitmaps, streams, and spatial indices. Redis works with an in-memory dataset to achieve top performance. Depending on the use case, Redis can persist the data either by periodically dumping the dataset to disk or by appending each command to a disk-based log. As shown in Figure 13, AMD EPYC 9004 Series Processors delivered ~3.00x (for Set) and ~3.20x (for Get) the performance of Intel Xeon Platinum 8380 processors.[17]

Figure 13: Redis-bench WRK generational and competitive performance uplifts

High Performance Computing (HPC)

HPC touches nearly every aspect of modern daily life. It helps save lives by predicting major climate and weather events and by helping to design safer cars, planes, buildings, and bridges. It helps make everyday items more affordable by minimizing materials used in products, driving efficiency into designs, and reducing development costs. It also helps accelerate time to market by allowing fast simulation of virtual products, thereby reducing the time and expense traditionally required by physical prototyping and testing. These are just a few of the ways in which HPC helps make the world a better place.

The demand for ever-higher HPC workload performance is only increasing. Higher performance allows faster simulations. Faster simulations can enable shorter product development times, simulating more scenarios, and refining granularity in the models tested to help make better, more efficient products.

4th Gen AMD EPYC processors lead the way for commercial, research, and academic HPC workloads. AMD EPYC processors are at the heart of the first exascale supercomputer, the #1 spot on the TOP500 list, and the #1 spot on the Green500 list.[18]

Let me start with the SPEChpc™ 2021 and SPEC MPI® 2007 benchmarks and then offer some performance highlights from some of our key software partners and other open-source workloads. Throughout this section, you will see that the key new features found in 4th Gen AMD EPYC processors yield significant performance improvements on important HPC performance metrics.

  • SPEChpc™ 2021: Figure 14 (below) shows an incredible ~2.8x performance uplift (13.90/4.94) when comparing the top-of-stack 96-core 4th generation AMD EPYC CPUs against the top-of-stack Intel Xeon Platinum 8380.

    SPEChpc 2021 Benchmark Suites were developed to help benchmark various systems with a focus on compute intensive parallel performance. They provide a comprehensive measure of real-world performance for HPC systems by offering a set of codes that are representative of HPC workloads. These benchmark suites are designed to stress many aspects of the overall system.

    Figure 14 clearly shows that the “Zen 4” cores in the 96-core AMD EPYC 9654 processor are well balanced with the other aspects of the processor, such as memory bandwidth and inter-socket communication. This balance allows the AMD EPYC processor to achieve a significant overall HPC performance uplift.[19]

Figure 14: SPEChpc 2021 generational and competitive performance uplifts

  • SPEC MPI® 2007: SPEC MPI 2007 is another standard HPC benchmark suite that stresses various aspects of the system but focuses on performance of the Message-Passing Interface (MPI) for compute-intensive applications. MPI performance is critical for most HPC workloads. Once again, the results (see Figure 15 below) show an impressive performance uplift of ~1.83x (64.1/35). Just as importantly, they show the performance of a well-balanced processor. The 4th Gen AMD EPYC processor delivers the IO and memory bandwidth needed to feed the additional compute capabilities of 50% more Zen 4 cores.[20]

Figure 15: SPEC MPI 2007 generational performance uplift
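
One way to read the SPEC MPI 2007 result is to separate the gain that comes from the higher core count from the gain per core. The sketch below does that with the two scores quoted above. It assumes the comparison is between a 96-core and a 64-core processor (the "50% more cores" mentioned in the text); the per-core figure is a derived estimate and not a published result.

```python
# Scores quoted in the text for SPEC MPI 2007.
gen4_score = 64.1
gen3_score = 35.0

# Assumption: 96-core 4th Gen part vs. 64-core 3rd Gen part (50% more cores).
gen4_cores = 96
gen3_cores = 64

total_uplift = gen4_score / gen3_score
core_ratio = gen4_cores / gen3_cores
per_core_uplift = total_uplift / core_ratio

print(f"total uplift:    {total_uplift:.2f}x")     # 1.83x
print(f"core ratio:      {core_ratio:.2f}x")       # 1.50x
print(f"per-core uplift: {per_core_uplift:.2f}x")  # 1.22x
```

Under that assumption, throughput grows faster than the core count, which is consistent with the point about memory and IO bandwidth keeping the additional cores fed.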

Let me highlight some of our joint work with HPC ecosystem partners:

  • Ansys®: Ansys offers a wide variety of engineering simulation applications for both on-premises and cloud deployments. Ansys has also made a commitment to high performance on AMD EPYC processors. Through our deep engineering engagement, Ansys has adopted the AMD accelerated math libraries (AOCL) with Ansys Mechanical. They have also adopted the AMD performance compiler (AOCC) with Ansys LS-DYNA. This helps produce great performance today and lays the foundation for continued performance gains in future generations of AMD processors.

    Figure 16 shows that commitment is paying off. 4th Gen AMD EPYC processors show an incredible competitive performance uplift over the top-of-stack Intel Xeon Platinum 8380 and an impressive generational uplift compared to the top-of-stack 3rd Gen AMD EPYC 7763 processors.

    We tested several Ansys applications in our labs, including CFX® (Computational Fluid Dynamics), Fluent® (Computational Fluid Dynamics), LS-DYNA® (Explicit Finite Element Analysis), and Mechanical® (Implicit Finite Element Analysis). This testing used standard sets of benchmarks that Ansys provides for each application to help evaluate performance running their software. These benchmark cases represent typical usage and cover a range of sizes.[21]

Figure 16: Ansys generational and competitive performance uplifts

  • Altair®: Altair provides software and cloud solutions for simulation, high performance computing, data analytics, and AI. AMD worked closely with Altair’s engineering teams to test several applications across a broad spectrum of application areas, including AcuSolve® (Computational Fluid Dynamics), Feko® (Computational Electromagnetics), and Radioss® (Finite Element Analysis). Radioss is now also offered as an open-source project called OpenRadioss, allowing broader collaboration for performance and functionality.

    Each of the workloads tested puts different demands on the system. The performance comparison (Figure 17) shows that the highest core-count 4th Gen AMD EPYC processors (96 cores) provide truly exceptional generational and competitive uplifts across the board.[22]

Figure 17: Altair generational and competitive performance uplifts

  • Dassault Systèmes®: SIMULIA offers applications for realistic engineering simulations. AMD tested both Abaqus/Explicit (Explicit Finite Element Analysis) and PowerFLOW (Computational Fluid Dynamics) in our labs. Both delivered incredible generational uplifts running 4th Gen AMD EPYC processors, as shown in Figure 18.[23]

Figure 18: Dassault Systèmes SIMULIA generational performance uplift

  • Siemens Digital Industries Software: Simcenter STAR-CCM+ ™ is a multiphysics computational fluid dynamics (CFD) application that simulates products operating under real-world conditions. Memory bandwidth tends to heavily influence CFD application performance. The outstanding generational performance uplift shown in Figure 19 takes advantage of the significant memory bandwidth found in 4th Gen AMD EPYC processors.[24]

Figure 19: Simcenter STAR-CCM+ generational performance uplift

Artificial Intelligence and Machine Learning (AI/ML)

Artificial intelligence (AI) and Machine Learning (ML) are infiltrating all aspects of the datacenter, including physical and virtual, bare metal, and cloud deployments. ML models are being deployed across a wide range of business applications such as (but not limited to) image classification, object detection, natural language processing, and speech detection. As shown in Figure 20, 4th Gen AMD EPYC processors demonstrate impressive competitive performance uplifts on the following AI/ML CPU-based inference workloads:

  • ResNet50: Residual Network (ResNet) is a Convolutional Neural Network (CNN) architecture used for computer vision. ResNet-50 is a 50-layer-deep CNN commonly used for image classification; it is trained on an image dataset such as ImageNet before the trained model is used for inference. We ran the pretrained ResNet-50v1.5 DeepSparse INT8 model from Neural Magic on multiple platforms to evaluate its CPU-only inference performance with ImageNet.[25]
  • BERT Large: Bidirectional Encoder Representations from Transformers (BERT) is a deep learning model used for various natural language processing tasks. It has been pre-trained on Wikipedia and BooksCorpus and requires additional tuning for specific tasks. We ran the pretrained BERT-large DeepSparse INT8 model from Neural Magic on multiple platforms to evaluate its CPU-only inference performance on answering questions using the Stanford Question Answering Dataset (SQuAD).[25]
  • Yolo v5: You Only Look Once (YOLO) is a fast, accurate object detection algorithm that divides images into a grid where each grid cell is responsible for detecting objects within itself. We ran the pretrained YOLOv5 DeepSparse INT8 model from Neural Magic on multiple platforms to evaluate its CPU-only inference performance with Common Objects in COntext (COCO).[25]

Figure 20: AI/ML Performance competitive performance uplifts
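
The grid idea behind YOLO described above can be illustrated with a small helper that maps an object's center point to the grid cell responsible for it. This is a simplified sketch of the general concept only. It does not reproduce the YOLOv5 or DeepSparse implementation, and the function name, image size, and grid size are hypothetical.

```python
from typing import Tuple


def responsible_cell(center_x: float, center_y: float,
                     image_w: int, image_h: int, grid_size: int) -> Tuple[int, int]:
    """Return the (row, col) of the grid cell that contains the object's center."""
    if not (0 <= center_x < image_w and 0 <= center_y < image_h):
        raise ValueError("object center must lie inside the image")
    col = int(center_x * grid_size / image_w)
    row = int(center_y * grid_size / image_h)
    return row, col


# Hypothetical example: a 640 x 640 image divided into a 20 x 20 grid.
row, col = responsible_cell(center_x=320, center_y=100,
                            image_w=640, image_h=640, grid_size=20)
print(row, col)  # 3 10
```

Each cell then predicts boxes and class scores only for objects whose centers fall inside it, which is what lets the detector process the whole image in a single pass.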

Several AMD customers using complex AI inference engines are already taking advantage of the high performance offered by 4th Gen AMD EPYC processors, along with targeted software optimizations via the AMD Zen Deep Neural Network (ZenDNN) version 4.0 library, to enjoy a performance uplift in select applications. ZenDNN is a library of APIs that provide software implementations of basic neural network building blocks. These APIs are enabled, tuned, and optimized for inference on AMD EPYC processors. ZenDNN is integrated into popular AI frameworks, such as TensorFlow, ONNX Runtime, and PyTorch, and targets applications including computer vision, natural language processing (NLP), and recommender systems. These applications deliver great performance, as shown in the benchmark results below.

The following bullet points showcase four representative AI benchmark workloads: TPCx-AI, ResNet-50, BERT-Large, and DLRM. TPCx-AI represents a broad end-to-end AI workflow, and the other three workloads represent the most common AI use cases: image classification, natural language processing, and recommendation engines. These use cases showcase the performance uplift from the tight integration of the ZenDNN 4.0 library with 4th Gen AMD EPYC processors.

  • TPCx-AI: The TPCx Benchmark-AI (TPCx-AI) benchmark focuses on emulating the behavior of AI workloads that are relevant in today’s datacenters and cloud environments, making these world record results relevant wherever AI workloads run on AMD EPYC processors in the datacenter. AMD collaborated with Dell Technologies to post five new TPCx-AI world record results with systems powered by 4th Gen AMD EPYC processors at scale factors SF3, SF10, SF30, SF100, and SF300.[26] The results at scale factors SF3, SF30, SF100, and SF300 are the industry’s first-ever results, and the SF10 result was the industry’s best result. These records represent the leading-edge performance that 4th Gen AMD EPYC processors bring to bear for the AI market. Please note that the performance shown in Figure 21 reflects several different system configurations, which are described in AI/ML Performance Highlights.

Figure 21: AMD EPYC TPCx-AI performance and price/performance

  • ResNet-50: AMD ran the resnet50_fp32_pretrained_model.pb (FP32) model on two systems. This model won the 2015 ImageNet competition and is commonly used to classify images. As shown below, the 4th Gen AMD EPYC system processed ~919.52 images per second with a batch size of 640 and ~927.42 images per second with a batch size of 960, a generational performance uplift of ~2.10x and ~2.09x over the 3rd Gen AMD EPYC system, respectively. The results shown in Figure 22 are the average of three runs.[27]

Figure 22: ResNet-50 generational performance uplift using AMD ZenDNN 4.0

  • BERT-Large: AMD engineers ran the wwm_uncased_L-24_H-1024_A-16 (FP32) model on the systems described above to evaluate the relative performance of the 3rd and 4th Gen AMD EPYC systems. As shown below, the 4th Gen AMD EPYC system processed ~28.74 samples per second (sequence length = 256) and ~18.65 samples per second (sequence length = 384), which translates into generational performance uplifts of ~1.83x and ~1.82x over the 3rd Gen AMD EPYC system, respectively. Each of the results shown in Figure 23 is the average of three runs.[27]

Figure 23: BERT-Large generational performance uplift using AMD ZenDNN 4.0

  • DLRM: AMD engineers ran the MLPerf™ DLRM model, tb00_40M.pt (90GB FP32). As shown below, these unofficial and unpublished results show that the 4th Gen AMD EPYC system processed ~2948.38 samples per second at a batch size of 1 and ~3132.42 samples per second at a batch size of 2, a generational performance uplift of ~1.72x and ~1.83x over the 3rd Gen AMD EPYC system, respectively. The results shown in Figure 24 are the average of three runs.[27] Please note: The MLPerf™ trademark is a registered and unregistered trademark and service mark of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. Further, these results have not been verified by the MLCommons Association.

Figure 24: DLRM generational performance uplift using AMD ZenDNN 4.0
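Each uplift quoted above is the ratio of 4th Gen throughput to 3rd Gen throughput. As a minimal sketch, the 3rd Gen throughput implied by each bullet can be backed out by dividing the 4th Gen figure by the uplift. The throughput and uplift values below are copied from the bullets above; the function name and the derived baselines are illustrative only, and because the published uplifts are rounded, the results are approximations rather than measured values (see reference 27 for the actual test data).

```python
# (4th Gen throughput per second, generational uplift) as quoted in the text.
RESULTS = {
    "ResNet-50, batch 640": (919.52, 2.10),
    "ResNet-50, batch 960": (927.42, 2.09),
    "BERT-Large, seq 256": (28.74, 1.83),
    "BERT-Large, seq 384": (18.65, 1.82),
    "DLRM, batch 1": (2948.38, 1.72),
    "DLRM, batch 2": (3132.42, 1.83),
}


def implied_baseline(throughput: float, uplift: float) -> float:
    """Approximate older-system throughput, given uplift = new / old."""
    if uplift <= 0:
        raise ValueError("uplift must be positive")
    return throughput / uplift


if __name__ == "__main__":
    for name, (throughput, uplift) in RESULTS.items():
        baseline = implied_baseline(throughput, uplift)
        print(f"{name}: ~{baseline:.1f} per second implied for 3rd Gen")
```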

Financial Services Vertical  

The financial services industry includes banks, insurance companies, investment advisors, and more. Financial institutions require accurate data and extreme performance, where an advantage of even a few microseconds can be worth millions of dollars. The financial services industry is adopting virtualized infrastructure and running traditional and emerging workloads such as big data analytics and artificial intelligence. Further, the performance of vertical applications like Black-Scholes simulation is critical.

  • Black-Scholes options pricing model: The Black-Scholes model is widely used to determine option pricing based on several variables. AMD conducted several generational and competitive Black-Scholes performance tests across a range of option sizes and iterations and measured the elapsed time taken to run each test on a single system. Figure 25 shows the speedup achieved in elapsed time demonstrating a very strong generational and competitive performance uplift for the 4th Gen AMD EPYC 9554 processor.[28]

Figure 25: Black-Scholes generational and competitive performance uplifts
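For readers unfamiliar with the workload, the calculation at the heart of a Black-Scholes European option pricing run can be sketched as follows. This is the textbook closed-form formula for a non-dividend-paying stock, not the benchmark code AMD ran (which is not shown here). The function and parameter names are illustrative assumptions.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes_european(spot, strike, rate, volatility, years):
    """Return (call, put) prices for a European option.

    spot: current price of the underlying
    strike: option strike price
    rate: continuously compounded risk-free rate (0.05 means 5%)
    volatility: annualized volatility of the underlying (0.2 means 20%)
    years: time to expiry in years
    """
    sqrt_t = math.sqrt(years)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * volatility ** 2) * years) / (volatility * sqrt_t)
    d2 = d1 - volatility * sqrt_t
    discount = math.exp(-rate * years)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


if __name__ == "__main__":
    call, put = black_scholes_european(100.0, 100.0, 0.05, 0.2, 1.0)
    print(f"call={call:.4f} put={put:.4f}")
```

A benchmark of this kind typically evaluates the formula over millions of independent options, which is why it scales well with core count and vector width.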

Media and Entertainment

This segment covers a broad set of workloads that involve rendering still images and videos for a multitude of uses from architectural visualizations to shows, movies, simulations, and much more. The integration of digital graphics into everyday life is occurring during the ongoing migration from physical media such as tapes and discs to streaming services that serve stills and videos on demand to every type of user from handheld phones and tablets to laptop and desktop computers and televisions of many sizes, resolutions, and capabilities.

Today’s users demand the highest possible realism at the highest possible resolution delivered using as little bandwidth as possible. These combined demands drive the need to render, encode, decode, and transcode media (convert from one format to another) quickly and efficiently. 4th Gen AMD EPYC processors deliver the superb performance required to meet these demands. Here are some key examples (see Figure 26 below for consolidated results):

  • Autodesk® Arnold: Arnold is an advanced Monte Carlo ray tracing renderer designed for VFX and animation production. It works with some of the top tools used by digital artists, such as Maya, Houdini, 3ds Max, Cinema 4D, and Katana, via plugins. AMD tested the 4th Gen AMD EPYC 9654 processor against the 3rd Gen AMD EPYC 7763 and found a strong generational performance uplift of ~1.90x rendering the gtc_robot scene.[29]
  • Chaos® V-Ray®: V-Ray 5 is a 3D rendering plugin that works seamlessly with major 3D design and CAD programs, such as 3ds Max, Cinema 4D, Houdini, Maya, Nuke, Revit, Rhino, SketchUp, and Unreal. V-Ray allows artists and designers to create and share projects with real-time ray tracing and the ability to render high-quality 3D visualizations. It is widely used for film and television productions, advertising, and architectural visualizations. The 4th Gen AMD EPYC 9654 processor delivers an impressive ~1.91x generational performance uplift over the 3rd Gen AMD EPYC 7763 processor.[30]
  • Synamedia® Virtual Digital Content Manager (vDCM): Synamedia vDCM provides advanced virtualized, software-based video, audio, and metadata processing for live delivery across many video formats. Broadcasters, content providers, and service providers can offer excellent viewing experiences that include high picture quality and multi-screen transcoding delivered with high bandwidth efficiency. Synamedia testing showed that the 4th Gen AMD EPYC 9654 processor delivers an average of approximately 77.5% (H.264) and 100% (H.265) generational video encoding performance uplifts over the 3rd Gen AMD EPYC 7763 processor using only 50% more processor cores across a wide range of video bitrates, resolutions, framerates and formats. A single dual-processor system powered by 4th Gen AMD EPYC 9654 processors can concurrently transcode two 8K video streams at 60 frames per second.[30]
  • Visionular AV1 Codec: The Visionular AV1 codec is an advanced video coding format that offers high performance and fidelity. The 4th Gen AMD EPYC 9654 processor encodes 8 concurrent video streams at ~1.66x the frame rate of the 3rd Gen AMD EPYC 7763 processor while encoding the crowd_run scene. For the Tears_of_Steel scene, the 4th Gen AMD EPYC 9654 processor encodes 8 concurrent streams at ~1.55x the frame rate of the 3rd Gen AMD EPYC 7763 processor. These results indicate a generational performance uplift of ~1.55x to ~1.66x with only 50% more cores.[31]

Figure 26: Media and Rendering generational performance uplifts at varying resolutions
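The uplift figures above are simple ratios of benchmark scores, and the "50% more cores" remark can be turned into a rough per-core comparison. The sketch below uses the V-Ray scores quoted in reference 30 (the first of the two scores listed for each processor) and the per-socket core counts given in the references (96 for EPYC 9654, 64 for EPYC 7763). The per-core figure is an illustrative derivation, not a result published by AMD.

```python
# 2P V-Ray vsamples scores as quoted in reference 30.
VSAMPLES_EPYC_9654 = 209_102
VSAMPLES_EPYC_7763 = 109_248

# Cores per processor, per the configurations listed in the references.
CORES_EPYC_9654 = 96
CORES_EPYC_7763 = 64


def ratio(new: float, old: float) -> float:
    """Return how many times larger `new` is than `old`."""
    return new / old


generational_uplift = ratio(VSAMPLES_EPYC_9654, VSAMPLES_EPYC_7763)
core_ratio = ratio(CORES_EPYC_9654, CORES_EPYC_7763)

# Uplift that remains after accounting for the higher core count.
per_core_uplift = generational_uplift / core_ratio

if __name__ == "__main__":
    print(f"generational uplift: ~{generational_uplift:.2f}x")
    print(f"core ratio:          {core_ratio:.2f}x")
    print(f"per-core uplift:     ~{per_core_uplift:.2f}x")
```

An uplift larger than the core ratio suggests that part of the gain comes from per-core improvements rather than from core count alone.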

AMD Powers the Modern Datacenter

The launch of 4th Gen AMD EPYC processors introduces the world’s highest-performance server processor, which delivers optimal TCO across workloads, industry-leading x86 energy efficiency[32] to help support sustainability goals, and Confidential Computing across a rich ecosystem of solutions. AMD EPYC processors continue to be the backbone of a growing line of AMD products designed to power the datacenters of today and tomorrow.

  • AMD Instinct™ accelerators are designed to power discoveries at exascale to enable scientists to tackle our most pressing challenges.
  • AMD Pensando solutions deliver highly programmable software-defined cloud, compute, networking, storage and security features wherever data is located, helping to offer improvements in productivity, performance and scale compared to current architectures with no risk of lock-in.
  • AMD Xilinx offers highly flexible and adaptive FPGAs, hardware adaptive SoCs, and the Adaptive Compute Acceleration Platform (ACAP) processing platforms that enable rapid innovation across a variety of technologies from the endpoint to the edge to the cloud.

Raghu Nambiar is a Corporate Vice President of Data Center Ecosystems and Solutions for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.

References

  1. EPYC-038: AMD EPYC 9004 Series delivers up to ~14% geomean IPC single thread uplift generationally on representative server workloads.
  2. EPYC-032: AMD EPYC 9004 CPUs support 12 memory channels. Intel Scalable Ice Lake CPUs support 8 memory channels. 12 ÷ 8 = 1.5x the memory channels or 50% more memory channels per https://ark.intel.com/.
  3. GD-183: AMD Infinity Guard features vary by EPYC™ Processor generations. Infinity Guard security features must be enabled by server OEMs and/or Cloud Service Providers to operate. Check with your OEM or provider to confirm support of these features. Learn more about Infinity Guard at https://www.amd.com/en/technologies/infinity-guard.
  4. SP5-010B: SPECrate®2017_int_base based on published scores from www.spec.org as of 11/10/2022. Configurations: 2P AMD EPYC 9654 (1790 SPECrate®2017_int_base, 192 total cores, www.spec.org/cpu2017/results/res2022q4/cpu2017-20221024-32607.html) is 2.97x the performance of published 2P Intel Xeon Platinum 8380 (602 SPECrate®2017_int_base, 80 total cores,  http://spec.org/cpu2017/results/res2021q2/cpu2017-20210521-26364.html). Published 2P AMD EPYC 7763 (861 SPECrate®2017_int_base, 128 total cores, http://spec.org/cpu2017/results/res2021q4/cpu2017-20211121-30148.html)  is shown at 1.43x for reference. SPEC®, SPEC CPU®, and SPECrate® are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org for more information.
  5. SP5-009C: SPECrate®2017_fp_base based on published scores from www.spec.org as of 11/10/2022. Configurations: 2P AMD EPYC 9654 (1480 SPECrate®2017_fp_base, 192 total cores, www.spec.org/cpu2017/results/res2022q4/cpu2017-20221024-32605.html) is 2.52x the performance of published 2P Intel Xeon Platinum 8380 (587 SPECrate®2017_fp_base, 160 total cores, www.spec.org/cpu2017/results/res2022q4/cpu2017-20221010-32542.html). Published 2P AMD EPYC 7763 (663 SPECrate®2017_fp_base, 128 total cores, http://spec.org/cpu2017/results/res2021q4/cpu2017-20211121-30146.html) is shown at 1.13x for reference. SPEC®, SPEC CPU®, and SPECrate® are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org for more information.
  6. SP5-076: DGEMM comparison based on AMD internal testing as of 11/10/2022 on a “Titanite” reference platform populated by 2P 96-Core EPYC™ 9654 delivers ~1.75x the GFLOPS compared to a “Daytona-X” reference platform populated by 2P 64-core EPYC 7763 processors. Results may vary due to factors such as OS and BIOS versions and settings, use of production servers, and other variables.
  7. SP5-077: HPL comparison based on AMD internal testing as of 11/10/2022 on a “Titanite” reference platform populated by 2P 96-Core EPYC™ 9654 delivers ~1.77x the GFLOPS compared to a “Daytona-X” reference platform populated by 2P 64-core EPYC 7763 processors. Results may vary due to factors such as OS and BIOS versions and settings, use of production servers, and other variables.
  8. SP5-074: NGINX WRK comparison based on AMD measured median scores on 2P 96-core EPYC 9654 compared to 2P 40-core Xeon Platinum 8380 running NGINX WRK workload as of 11/10/2022. Configurations: 2x AMD EPYC 9654 (3076105 rps) vs. 2x Xeon Platinum 8380 (1502721 rps) for ~2.05x the rps performance. 2P AMD EPYC 7763 scores 2490726 rps shown at 1.67x for reference. Results may vary.
  9. The SPECjbb® 2015 results are published at the following locations: 
  10. SPDK Perf comparison based on AMD internal testing as of 11/10/2022 on a “Titanite” reference platform populated by 2P 96-Core EPYC™ 9654 delivers ~1.75x the avg random reads (4K blocks) and ~2.03x the avg random read/write (4K blocks) compared to a  DELL PowerEdge R6525 populated by 1P 32-core EPYC 7543P processor. Results may vary.
  11. SP5-049A: VMmark® 3.1.1 matched pair comparison based on published results as of 11/10/2022. Configurations: 2-node, 2P 96-core EPYC 9654 powered server running VMware ESXi 8 RTM (40.19 @ 44 tiles/836 VMs, https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/vmmark/2022-10-18-HPE-ProLiant-DL3...) versus 2-node, 2P 40-core Xeon Platinum 8380 running VMware ESXi v7 U2  (14.19 @ 14 tiles/266 VMs, https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/vmmark/2021-04-20-Fujitsu-PRIMERGY...) for 2.8x the score and 3.1x the tile (VM) capacity. 2-node, 2P EPYC 7763-powered server (23.33 @ 24 tiles/456 VMs, https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/vmmark/2022-02-08-Fujitsu-RX2450M1...) shown at 1.6x the performance for reference. VMmark is a registered trademark of VMware in the US or other countries.
  12. SP5-054: Login VSI™ Pro v4.1.40.1 comparison based on AMD internal testing as of 10/19/2022 measuring the maximum “knowledge worker” desktop sessions (VSImax) within VSI Baseline +1000ms response time using VMware ESXi 8 GA and VMware Horizon 8 on a server using 2x AMD EPYC 9654 (933 average VSImax sessions) versus a server with 2x Intel Xeon Platinum 8380 (400 average VSImax sessions).  Results may vary.
  13. SP5-070: 2P 96-core EPYC™ 9654 delivers ~2.7x the median queries/hour vs. 2P 40-core Xeon® Platinum 8380 using HammerDB TPROC-H
  14. SP5-071: 2P 96-core EPYC™ 9654 delivers ~2.4x the median transactions/min vs. 2P 40-core Xeon® Platinum 8380 using HammerDB TPROC-C
  15. SP5-056: SAP® SD 2-tier comparison based on published results as of 11/10/2022. Configurations: 2P 96-core EPYC 9654 powered server (148,000 benchmark users, https://www.sap.com/dmc/benchmark/2022/Cert22023.pdf) versus 2P 40-core Xeon Platinum 8380 (48,000, https://www.sap.com/dmc/benchmark/2021/Cert21026.pdf) for 3.08x the number of SAP SD benchmark users. 2P EPYC 7763 powered server (75,000 benchmark users, https://www.sap.com/dmc/benchmark/2021/Cert21021.pdf) shown at 1.79x the performance for reference. For more details see http://www.sap.com/benchmark. SAP and SAP logo are the trademarks or registered trademarks of SAP SE (or an SAP affiliate company) in Germany and in several other countries.
  16. SP5-075: LDBC Social Networking BI SF1000 comparison based on AMD measured median scores on 2P 32-core EPYC 9354 compared to 2P 32-core EPYC 7543 running LDBC Social Networking BI workload on TigerGraph 3.7.0 Enterprise as of 11/10/2022. Configurations: 2x AMD EPYC 9354 (5164.8 seconds/16.7 queries per day throughput) vs. 2x EPYC 7543 (5710.7 seconds/15.1 queries per day throughput) for ~1.11x the throughput performance on 32 queries. Results may vary.
  17. SP5-078: Redis-Benchmark comparison based on AMD measured median scores on 2P 96-core EPYC 9654 compared to 2P 40-core Xeon Platinum 8380 running Redis-Benchmark workload on Redis 6.0 as of 11/10/2022. Configurations: 2P AMD EPYC 9654 (2128736 set rps/2566882 get rps using 24 threads/12 instances) vs. 2P Xeon Platinum 8380 (709235 set rps/795167 get rps using 16 threads/8 instances) for ~3x the set and ~3.2x the get rps performance. 2P EPYC 7763 (1393626 set rps/1728928 get rps using 24 threads/8 instances) shown at ~1.96x the set rps and ~2.17x get rps the performance for reference. Results may vary.
  18. Top 500 and Green500 as of June 2022. https://www.top500.org/
  19. The SPEChpc™ 2021 Tiny result is published at the following locations: 
  20. The SPECmpi™ 2007 result is published at the following locations:
  21. Please see https://www.amd.com/system/files/documents/amd-epyc-9004-pb-ansys-generational.pdf.
  22. Please see https://www.amd.com/system/files/documents/amd-epyc-9004-pb-altair-generational.pdf.
  23. Please see https://www.amd.com/system/files/documents/amd-epyc-9004-pb-simulia-generational.pdf.
  24. Please see https://www.amd.com/system/files/documents/amd-epyc-9004-pb-simcenter-star-ccm-generational.pdf.
  25. SP5-022: Neural Magic measured results on AMD reference systems as of 9/29/2022. Configurations: 2P EPYC 9654 “Titanite” vs. 2P EPYC 7763 “Daytona” running on Ubuntu 22.04 LTS, Python 3.9.13, pip==22.12/deepsparse==1.0.2. BERT-Large Streaming Throughput items/sec (seq=384, batch 1, 48 streams, INT8 + sparse) using SQuAD v1.1 dataset; ResNet50 Batched Throughput items/sec (batch 256, single-stream, INT8 sparse) using ImageNet dataset; YOLOv5s Streaming Throughput ([image 3, 640, 640], batch 1, multi-stream, per-stream latency <=33ms) using COCO dataset. Testing not independently verified by AMD.
  26. The TPCx-AI results are posted at:
  27. Please see https://www.amd.com/system/files/documents/amd-epyc-9004-pb-aiml.pdf for detailed test information.
  28. SP5-031: Black-Scholes European Option Pricing benchmark comparison based on AMD measurements for 100, 200, 400, 800, and 1600M options as of 10/4/2022. Max score is based on 200M options. Configurations: 2x 40-core Intel Xeon Platinum 8380 vs. 2x 64-core EPYC 9554 all systems on Ubuntu 22.04 and compiled with ICC 2022.1.0. Results may vary.
  29. SP5-039: Autodesk® Arnold gtc_robot workload comparison based on internal AMD reference platform measurements as of 09/27/2022. Comparison of 2P AMD EPYC 9654 (99 avg. seconds/872.73 ray-traces/day) is ~2.4x the performance of 2P Intel Xeon Platinum 8380 (235 avg seconds/367.66 ray-traces/day). Results may vary. 2P EPYC 7763 shown for reference (167 avg seconds/517.37 ray-traces/day) at ~1.4x.
  30. SP5-038A: V-Ray based on published scores from https://benchmark.chaos.com/v5/vray as of 11/10/2022. Comparison of 2P AMD EPYC 9654 (209,102 max/206,419 median vsamples, https://benchmark.chaos.com/v5/vray/#####) is 3.32x the performance of published 2P Intel Xeon Platinum 8380 (62,619 max/median vsamples, https://benchmark.chaos.com/v5/vray/29746). 2P EPYC 7763 shown for reference (109,248 max/99,443 median vsamples, https://benchmark.chaos.com/v5/vray/29746). Chaos®, V-Ray® and Phoenix FD® are registered trademarks of Chaos Software EOOD in Bulgaria and/or other countries.
  31. Testing not independently verified by AMD.
  32. SPCTCO-002A: Delivering 10,000 units of integer performance with 2P servers powered by 96-core AMD EPYC 9654 CPUs takes an estimated 59% fewer servers (7 AMD servers vs. 17 Intel servers), 46% less power, and a 48% lower 3-yr TCO than with 2P servers based on 40-core Intel Xeon Platinum 8380 CPUs. The 2P EPYC 96-core CPU solution also provides estimated greenhouse gas emissions avoided equivalent to 145,443 pounds of coal not burned in the USA over 3 years and carbon sequestration equivalent to 53 acres of forest annually in the USA.
About the Author
Raghu Nambiar currently holds the position of Corporate Vice President at AMD, where he leads a global engineering team dedicated to shaping the software and solutions strategy for the company's datacenter business. Before joining AMD, Raghu served as the Chief Technology Officer at Cisco UCS, instrumental in driving its transformation into a leading datacenter compute platform. During his tenure at Hewlett Packard, Raghu made significant contributions as an architect, pioneering several groundbreaking solutions. He is the holder of ten patents, with several more pending approval, and has made extensive academic contributions, including publishing over 75 peer-reviewed papers and 20 books in the LNCS series. Additionally, Raghu has taken on leadership roles in various industry standards committees. Raghu holds dual Master's degrees from the University of Massachusetts and Goa University, complemented by completing an advanced management program at Stanford University.