
Engineering Insights: Unveiling MLPerf® Results on AMD Instinct™ MI300X Accelerators

AMD Instinct MI300X GPUs, powered by one of the latest versions of the open-source ROCm™ software stack, achieved impressive results in the MLPerf Inference v4.1 round, highlighting the strength of the full-stack AMD inference platform. The initial submission focused on the widely recognized LLaMA2-70B model, known for its high performance and versatility, and demonstrated strong Gen AI inference performance against the NVIDIA H100, setting a strong precedent for the capabilities of AMD Instinct MI300X accelerators.

 

Understanding MLPerf and Its Industry Significance

As large language models (LLMs) continue to scale up in size and complexity, the need for efficient, cost-effective performance becomes increasingly critical for inference and training. Achieving high-performance LLMs requires robust parallel computing and a well-optimized software stack. This is where MLPerf, the industry’s leading benchmarking suite, plays a crucial role. Developed by the cross-industry consortium MLCommons®—of which AMD is a founding member—MLPerf offers a set of open-source AI benchmarks, including Gen AI, LLMs, and other models, that provide rigorous, peer-reviewed metrics. These benchmarks enable enterprises to evaluate the effectiveness of AI hardware and software. Excelling in MLPerf Inference v4.1 is a significant milestone for AMD, highlighting our commitment to transparency and delivering standardized data that empowers enterprises to make informed decisions.

 

In-Depth Look at the LLaMA2-70B Benchmark 

AMD's inaugural MLPerf submission used the LLaMA2-70B model, a significant advancement in LLMs that is crucial for real-world applications such as natural-language processing and large-scale inference. The MLPerf benchmarking test included a Q&A scenario with 24,576 samples from the OpenORCA dataset, each with up to 1,024 input and output tokens. The benchmark evaluated inference performance in two scenarios:

  1. Offline Scenario: Batch processing of input questions to maximize throughput in tokens per second.
  2. Server Scenario: Simulates real-time queries with strict latency limits (TTFT* ≤ 2s, TPOT* ≤ 200ms), assessing the hardware’s ability to deliver fast, responsive performance for low-latency tasks.

(*TTFT – Time to First Token, *TPOT – Time per output token)
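 
To make the two latency metrics concrete, here is a minimal Python sketch of how they are commonly derived from per-token timestamps. It is illustrative only and is not the MLPerf LoadGen implementation.

    # Illustrative definitions of TTFT and TPOT from per-token timestamps.
    # Not the MLPerf LoadGen implementation.
    def ttft(issue_time: float, token_times: list[float]) -> float:
        """Time to First Token: delay from query issue to the first output token."""
        return token_times[0] - issue_time

    def tpot(token_times: list[float]) -> float:
        """Time Per Output Token: mean gap between consecutive output tokens."""
        return (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)

    # Server-scenario limits used in the Llama 2 70B benchmark.
    MAX_TTFT_S, MAX_TPOT_S = 2.0, 0.2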

 

AMD Instinct MI300X Performance in MLPerf

The AMD Instinct MI300X delivered impressive performance in its inaugural MLPerf submission using the Supermicro AS-8125GS-TNMR2 system, with four key entries for the LLaMA2-70B model. These results are particularly significant because they offer an apples-to-apples comparison with competing AI accelerators and are validated through peer review, reproducible, and based on industry-relevant use cases.

 

  1. CPU-GPU Performance Combination:
  • Submission ID 4.1-0002: 8x AMD Instinct MI300X accelerators with 2x AMD EPYC 9374F (Genoa) CPUs in the Available category
  • This configuration showcased the powerful synergy between AMD Instinct MI300X GPU accelerators and 4th Gen EPYC CPUs (formerly codenamed “Genoa”) for AI workloads, delivering performance within 2-3% of the NVIDIA DGX H100 with 4th Gen Intel Xeon CPUs in both server and offline scenarios at FP8 precision (see Figure 1 below).

 

 


Figure 1 - Showcasing performance of CPU-GPU combination for AI workload1,2

 

  2. Previewing Performance with Next-Gen CPU:
  • Submission ID 4.1-0070: 8x AMD Instinct MI300X with 2x AMD EPYC Turin CPUs in the Preview category.
  • Demonstrated the performance gains from the forthcoming 5th Gen AMD EPYC™ Turin CPU with AMD Instinct MI300X GPU accelerators, having a slight edge over the NVIDIA DGX H100 with Intel Xeon in the server scenario and maintaining comparable performance in the offline scenario at FP8 precision (see Figure 1 above).

 

  3. Single GPU Efficiency:
  • Submission ID 4.1-0001: 1x AMD Instinct MI300X accelerator with 2x 4th Gen AMD EPYC 9374F CPUs (Genoa) in the Available category.
  • This entry highlighted the vast 192 GB memory of AMD Instinct MI300X, enabling a single GPU to efficiently run the entire LLaMA2-70B model, avoiding the network overhead associated with model splitting across multiple GPUs at FP8 precision (see Figure 2 below).

 


Figure 2 - Single GPU Running the Entire Llama 2 70B Model1

 

The AMD CDNA 3 architecture in the AMD Instinct MI300X features 192 GB of HBM3 memory and delivers a peak memory bandwidth of 5.3 TB/s. This substantial capacity allows the AMD Instinct MI300X to comfortably host and run a full 70 billion parameter model, like LLaMA2-70B, on a single GPU. With the ROCm software stack, the scaling efficiency from 1x AMD Instinct MI300X (TP1) to 8x AMD Instinct MI300X (8x TP1) is nearly linear as seen from the results in Figure 2, demonstrating the ability of AMD Instinct MI300X to handle the largest MLPerf inference model to date.
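 
As a rough back-of-the-envelope check (an illustration, not the submission's exact accounting), the FP8 weight footprint and per-token KV-cache cost can be estimated from Llama 2 70B's published architecture (80 layers, 8 KV heads of dimension 128 under grouped-query attention):

    # Back-of-the-envelope memory estimate for Llama 2 70B at FP8 on one MI300X.
    # Illustrative only; real footprints depend on the runtime and allocator.
    PARAMS = 70e9
    FP8_BYTES = 1                                   # one byte per weight at FP8
    weights_gb = PARAMS * FP8_BYTES / 1e9           # ~70 GB of weights

    LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128         # Llama 2 70B with grouped-query attention
    kv_per_token_kb = 2 * LAYERS * KV_HEADS * HEAD_DIM * FP8_BYTES / 1e3  # K and V, ~164 KB/token
    headroom_gb = 192 - weights_gb                  # >120 GB of HBM3 left for the paged KV cache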

 

  4. Compelling Dell Server Design Results with AMD Instinct MI300X Accelerators:
  • Submission ID 4.1-0022: 8x AMD Instinct MI300X accelerators with 2x Intel(R) Xeon(R) Platinum 8460Y+ CPUs in the Available category

In addition to the AMD submissions, Dell validated platform-level performance of AMD Instinct accelerators by submitting results with LLaMA2-70B on an 8x AMD Instinct MI300X setup using their PowerEdge XE9680 server. This submission highlights our partnership and underscores the strength of our ecosystem, making AMD Instinct MI300X-based platforms an excellent choice for both data center and edge inference deployments. You can find more details on those results here.

 

You can reproduce the results on your own by following the instructions in our ROCm Blog post here: Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission. Full results of all the submissions can be found on the MLCommons website. Code and other artifacts are available in this repository.

 

Performance Highlights – Engineering Insights

The strong competitive performance of the AMD Instinct MI300X accelerators can be attributed to its high compute power, extensive memory capacity with fast bandwidth, and the optimized ROCm software stack, which helps ensure efficient handling of large AI models like LLaMA2-70B. A few key factors played a crucial role:

 

Large GPU Memory Size:

  • Capacity: AMD Instinct MI300X offers the largest GPU memory available, allowing the entire LLaMA2-70B model to fit into memory while still accommodating KV cache. This avoids network overhead by preventing model splitting across GPUs, maximizing inference throughput.
  • Batch Sizes: In the offline scenario, we used a max_num_seqs parameter of 2048 to maximize throughput, while 768 was used for the server scenario to meet latency targets; both are significantly higher than the vLLM default of 256 (a minimal configuration sketch follows this list).
  • KV Cache Management: vLLM’s support for paged attention enables efficient KV cache management, avoiding memory fragmentation and taking full advantage of the large memory of AMD Instinct MI300X accelerators.
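 
The following is a minimal sketch of the max_num_seqs settings described above, using the stock vLLM API; the model identifier and generation settings are placeholders, and the actual MLPerf harness runs a customized ROCm build of vLLM with the quantized model.

    # Minimal sketch of the max_num_seqs settings described above (stock vLLM API).
    # The MLPerf harness itself uses a customized ROCm build and quantized weights.
    from vllm import LLM, SamplingParams

    offline_llm = LLM(
        model="meta-llama/Llama-2-70b-chat-hf",   # placeholder model id
        tensor_parallel_size=1,                   # the whole model fits on one MI300X
        max_num_seqs=2048,                        # offline scenario: maximize throughput
    )
    # For the server scenario, max_num_seqs=768 keeps TTFT/TPOT within the limits.
    sampling = SamplingParams(temperature=0.0, max_tokens=1024)
    outputs = offline_llm.generate(["<OpenORCA Q&A sample>"], sampling)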

 

FP8 Support:

  • AMD Instinct MI300X accelerator hardware supports the FP8 numerical format, and we extended this capability across the entire inference software stack. Using Quark, we quantized LLaMA2-70B model weights to FP8, retaining 99.9% accuracy as required by MLPerf. We also added FP8 support to vLLM, upgraded the hipBLASLt library, and implemented FP8 KV cache, significantly boosting performance.
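 
For illustration, recent stock vLLM releases expose FP8 weight and KV-cache options similar to what is described above. The sketch below is an assumption about that public API and is not the exact Quark-quantized-weight path used in the submission.

    # Illustrative FP8 configuration using stock vLLM options; the submission itself
    # loads Quark-quantized FP8 weights through a customized ROCm software stack.
    from vllm import LLM

    llm = LLM(
        model="meta-llama/Llama-2-70b-chat-hf",   # placeholder model id
        quantization="fp8",                       # run linear layers in FP8
        kv_cache_dtype="fp8",                     # keep the KV cache in FP8 as well
        tensor_parallel_size=1,
    )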

 

Software Optimizations:

  • Kernel Optimization: We performed extensive profiling and optimization, including AMD Composable Kernel (CK) based prefill attention, FP8 decode paged attention, and fused kernels such as residual-add RMSNorm and SwiGLU with FP8 output scaling (an unfused reference sketch follows this list).
  • vLLM Enhancements: Improvements were made to the scheduler for faster decode scheduling and better prefill batching, optimizing both offline and server use cases.
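 
As context for the kernel fusions mentioned above, here is a minimal unfused PyTorch reference of the operations involved; it is a sketch only, and the production CK kernels execute each of these as a single fused GPU kernel with FP8 output scaling.

    # Unfused PyTorch reference for the operations that the CK-based kernels fuse.
    # Illustrative only; the production kernels run each as one fused GPU kernel
    # with FP8 output scaling.
    import torch
    import torch.nn.functional as F

    def residual_add_rmsnorm(x, residual, weight, eps=1e-5):
        h = x + residual                                        # residual add
        rms = torch.rsqrt(h.pow(2).mean(-1, keepdim=True) + eps)
        return h * rms * weight, h                              # normalized output, new residual

    def swiglu(gate, up):
        return F.silu(gate) * up                                # SwiGLU activation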

 

CPU Optimization:

  • Although the bulk of the AI workload processing happens on GPUs, CPU performance is also critical. Lower core count CPUs with high boost frequencies, like the EPYC 9374F with 32 cores and up to 4.3 GHz boost, provided optimal performance, especially for server scenarios. Testing with the upcoming Turin generation of EPYC CPUs revealed performance gains over 4th Gen EPYC, which we submitted in the Preview category.

 

Setting a Precedent for the Largest Open-Source Model

The successful results in MLPerf with LLaMA2-70B validate the performance of the AMD Instinct MI300X GPU accelerators and set a strong precedent for their future effectiveness with even larger models like Llama 3.1. We are proud to power Meta's new Llama 3.1 405B parameter model, launched with Day 0 support on AMD Instinct MI300X accelerators. Thanks to the industry-leading memory capacity of the AMD Instinct MI300X platform (MI300-25), a server powered by eight AMD Instinct MI300X GPU accelerators is the only one able to accommodate the entire 405-billion-parameter Llama 3.1 model on a single server using the FP16 datatype (MI300-7A) (see Figure 3). This helps reduce server count and bring down costs. AMD Instinct MI300X accelerators are the ultimate solution to power the largest open models available today.
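 
As a quick sanity check of that claim, here is a weights-only estimate (an illustration that ignores KV cache, activations, and runtime overhead):

    # Weights-only estimate for Llama 3.1 405B at FP16 on an 8x MI300X server.
    # Illustrative; excludes KV cache, activations, and runtime overhead.
    PARAMS = 405e9
    FP16_BYTES = 2
    weights_gb = PARAMS * FP16_BYTES / 1e9     # ~810 GB of FP16 weights
    server_hbm_gb = 8 * 192                    # 1,536 GB of HBM3 across eight MI300X GPUs
    assert weights_gb < server_hbm_gb          # the full model fits in one 8-GPU server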

 


Figure 3 – LLaMa 3.1 (405B) Estimated Memory Requirements vs Available GPU Memory

(Source – Artificial Analysis)

 

Looking Ahead

We’re excited to continue showcasing the versatility and performance of AMD Instinct accelerators across future benchmarks as we expand our testing and optimization efforts. This is just the beginning of our journey. In the coming months, we plan to launch the next iterations of the AMD Instinct series, featuring, among other advances, additional memory, support for lower precision data types, and increased compute power. Future ROCm releases will bring further software enhancements, including kernel improvements and advanced quantization support. Stay tuned for our next MLPerf submission—we look forward to sharing our progress and insights with you.

 

Co-Authors:

Meena Arunachalam - Fellow Systems Design Engineer

Miro Hodak - SMTS Systems Design Engineer

 

Endnotes:

1 MI300-56 - Official MLPerf™ Inference v4.1 Llama2-70B-99.9 server tokens/s and offline tokens/s results retrieved from https://mlcommons.org/benchmarks/inference-datacenter/ on August 28, 2024, from the following entries: 4.1-0001 (available), 4.1-0002 (available), and 4.1-0043. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.

2 MI300-57 - Official MLPerf™ Inference v4.1 Llama2-70B-99.9 server tokens/s and offline tokens/s results retrieved from https://mlcommons.org/benchmarks/inference-datacenter/ on August 28, 2024, from the following entries: 4.1-0070 (preview) and 4.1-0043. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.