AMD Instinct MI300X GPUs, powered by one of the latest releases of the open-source ROCm™ software stack, achieved impressive results in the MLPerf Inference v4.1 round, highlighting the strength of the full-stack AMD inference platform. The inaugural submission focused on the widely recognized LLaMA2-70B model, known for its high performance and versatility. It demonstrated strong Gen AI inference performance against the NVIDIA H100, setting a strong precedent for the capabilities of AMD Instinct MI300X accelerators.
As large language models (LLMs) continue to scale up in size and complexity, the need for efficient, cost-effective performance becomes increasingly critical for inference and training. Achieving high-performance LLMs requires robust parallel computing and a well-optimized software stack. This is where MLPerf, the industry's leading benchmarking suite, plays a crucial role. Developed by the cross-industry consortium MLCommons®, of which AMD is a founding member, MLPerf offers a set of open-source AI benchmarks, spanning Gen AI, LLMs, and other models, that provide rigorous, peer-reviewed metrics. These benchmarks enable enterprises to evaluate the effectiveness of AI hardware and software. Excelling in MLPerf Inference v4.1 is a significant milestone for AMD, highlighting our commitment to transparency and delivering standardized data that empowers enterprises to make informed decisions.
AMD's inaugural MLPerf submission used the LLaMA2-70B model, a significant advancement in LLMs that is crucial for real-world applications like natural-language processing and large-scale inference. The benchmarking test used a Q&A scenario with 24,576 samples from the OpenORCA dataset, each with up to 1,024 input and output tokens. The benchmark evaluated inference performance in two scenarios:

Offline: queries are delivered in a single batch and processed in bulk to maximize total throughput; no latency constraints apply.
Server: queries arrive individually, simulating live interactive traffic, and responses must meet latency limits of TTFT ≤ 2 s and TPOT ≤ 200 ms.

(*TTFT – Time to First Token, *TPOT – Time Per Output Token)
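To make the latency metrics concrete, the sketch below (a minimal illustration with hypothetical timestamps, not the MLPerf LoadGen harness, which measures these internally) shows how TTFT and TPOT can be derived from per-token timestamps of a streamed response:

```python
# Minimal illustration of the TTFT/TPOT latency metrics used in the
# MLPerf Server scenario. Timestamps are hypothetical.

def latency_metrics(request_time: float, token_times: list[float]):
    """Compute Time to First Token and Time Per Output Token (seconds).

    Assumes at least two output tokens were generated.
    """
    ttft = token_times[0] - request_time  # wait until the first token arrives
    # TPOT averages the gaps between the remaining output tokens.
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot

# Example: a request issued at t=0.0 whose tokens stream back over time.
ttft, tpot = latency_metrics(0.0, [1.4, 1.5, 1.6, 1.7, 1.8])
print(f"TTFT = {ttft:.2f} s, TPOT = {tpot * 1000:.0f} ms")
# Server-scenario limits: TTFT <= 2 s and TPOT <= 200 ms.
```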
The AMD Instinct MI300X delivered impressive performance in its inaugural MLPerf submission using the Supermicro AS-8125GS-TNMR2 system, with four key entries for the LLaMA2-70B model. These results are particularly significant because they offer an apples-to-apples comparison with competing AI accelerators, are validated through peer review, are reproducible, and are based on industry-relevant use cases.
Figure 1 - Performance of the CPU-GPU combination for AI workloads1,2
Figure 2 - Single GPU Running the Entire Llama 2 70B Model1
The AMD CDNA™ 3 architecture in the AMD Instinct MI300X features 192 GB of HBM3 memory and delivers a peak memory bandwidth of 5.3 TB/s. This substantial capacity allows the AMD Instinct MI300X to comfortably host and run a full 70 billion parameter model, like LLaMA2-70B, on a single GPU. With the ROCm software stack, the scaling efficiency from 1x AMD Instinct MI300X (TP1) to 8x AMD Instinct MI300X (8x TP1) is nearly linear as seen from the results in Figure 2, demonstrating the ability of AMD Instinct MI300X to handle the largest MLPerf inference model to date.
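A quick back-of-the-envelope calculation shows why the model fits. This is an illustrative sketch only: the weight sizes follow from the parameter count times bytes per parameter, while KV cache and activation sizes vary with batch size and sequence length and are left as headroom.

```python
# Rough estimate of LLaMA2-70B inference memory vs. MI300X capacity.
# Approximate figures for illustration only.

PARAMS = 70e9                   # LLaMA2-70B parameter count
BYTES_FP16, BYTES_FP8 = 2, 1    # bytes per parameter by datatype
HBM3_GB = 192                   # MI300X HBM3 capacity per GPU

weights_fp16_gb = PARAMS * BYTES_FP16 / 1e9   # ~140 GB
weights_fp8_gb = PARAMS * BYTES_FP8 / 1e9     # ~70 GB

print(f"FP16 weights: {weights_fp16_gb:.0f} GB of {HBM3_GB} GB")
print(f"FP8 weights:  {weights_fp8_gb:.0f} GB of {HBM3_GB} GB")
# Either datatype leaves headroom for the KV cache and activations, so the
# whole model runs on one GPU (TP1). With no tensor parallelism there is no
# inter-GPU communication, which is why 8x TP1 scales nearly linearly.
```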
Submission ID 4.1-0022: 8x AMD Instinct MI300X accelerators with 2x Intel(R) Xeon(R) Platinum 8460Y+ in the Available category
In addition to the AMD submissions, Dell validated platform-level performance of AMD Instinct accelerators by submitting LLaMA2-70B results on an 8x AMD Instinct MI300X setup using its PowerEdge XE9680 server. This submission highlights our partnership and underscores the strength of our ecosystem, making AMD Instinct accelerators an excellent choice for both data center and edge inference deployments. You can find more details on those results here.
You can reproduce the results on your own by following the instructions in our ROCm blog post here: Benchmarking Machine Learning using ROCm and AMD GPUs: Reproducing Our MLPerf Inference Submission. Full results of all submissions can be found on the MLCommons website. Code and other artifacts are available in this repository.
The strong competitive performance of the AMD Instinct MI300X accelerator can be attributed to its high compute power, extensive memory capacity with fast bandwidth, and the optimized ROCm software stack, which together enable efficient handling of large AI models like LLaMA2-70B. A few key factors played a crucial role:
Large GPU Memory Size: With 192 GB of HBM3 per GPU, the entire LLaMA2-70B model and its KV cache fit on a single AMD Instinct MI300X, eliminating the inter-GPU communication overhead that splitting the model across accelerators would incur.
FP8 Support: The AMD CDNA 3 architecture natively supports the FP8 datatype, which halves the memory footprint of model weights relative to FP16 and raises effective compute throughput while still meeting the benchmark's 99.9% accuracy target (see the quantization sketch after this list).
Software Optimizations: The ROCm software stack provides kernel-level improvements, including optimized GEMM and attention kernels, along with efficient scheduling and batching for LLM serving.
CPU Optimization: Inference serving performance also depends on the host CPU; as Figure 1 shows, a well-matched CPU-GPU combination keeps the accelerators fed and sustains high throughput.
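To illustrate the FP8 point above, here is a minimal, self-contained sketch of per-tensor FP8 (E4M3) quantization. It assumes PyTorch 2.1+ (which provides the float8_e4m3fn dtype) and is illustrative only; it is not the quantization flow used in the actual submission.

```python
import torch

# Per-tensor FP8 (E4M3) quantization sketch. Illustrative only.
E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3fn

def quantize_fp8(w: torch.Tensor):
    """Scale a tensor into the E4M3 range and cast it to FP8."""
    scale = w.abs().max() / E4M3_MAX
    w_fp8 = (w / scale).to(torch.float8_e4m3fn)  # 1 byte per element
    return w_fp8, scale

def dequantize_fp8(w_fp8: torch.Tensor, scale: torch.Tensor):
    """Recover an FP16 approximation of the original tensor."""
    return w_fp8.to(torch.float16) * scale

w = torch.randn(4096, 4096, dtype=torch.float16)
w_fp8, scale = quantize_fp8(w)
err = (dequantize_fp8(w_fp8, scale) - w).abs().mean().item()
print(f"FP16 bytes: {w.numel() * 2}, FP8 bytes: {w_fp8.numel()}")
print(f"mean abs quantization error: {err:.4f}")
```

The memory saving is exactly 2x for the quantized weights; the benchmark's accuracy check is what confirms the quantization error stays within the 99.9% target.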
The successful MLPerf results with LLaMA2-70B validate the performance of AMD Instinct MI300X accelerators and set a strong precedent for their effectiveness with even larger models like Llama 3.1. We are proud to power Meta's new Llama 3.1 405B-parameter model, launched with Day-0 support on AMD Instinct MI300X accelerators. Thanks to the industry-leading memory capabilities of the AMD Instinct MI300X platformMI300-25, a server powered by eight AMD Instinct MI300X accelerators is the only one that can accommodate the entire 405-billion-parameter Llama 3.1 model in a single server using the FP16 datatypeMI300-7A (see Figure 3). This reduces the number of servers required and brings down costs. AMD Instinct MI300X accelerators are the ultimate solution to power the largest open models available today.
Figure 3 – Llama 3.1 (405B) Estimated Memory Requirements vs Available GPU Memory
(Source – Artificial Analysis)
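The arithmetic behind Figure 3 is straightforward. The sketch below uses the same rough approximation as before (parameter count times bytes per parameter, ignoring KV cache and activations, which Figure 3 accounts for separately):

```python
# Why Llama 3.1 405B fits in one 8x MI300X server at FP16.
PARAMS = 405e9            # Llama 3.1 parameter count
BYTES_FP16 = 2            # bytes per FP16 parameter
GPUS, HBM3_GB = 8, 192    # MI300X GPUs per server, HBM3 per GPU

weights_gb = PARAMS * BYTES_FP16 / 1e9   # ~810 GB of weights
server_gb = GPUS * HBM3_GB               # 1536 GB of HBM3 per server

print(f"weights: {weights_gb:.0f} GB, server HBM3: {server_gb} GB")
assert weights_gb < server_gb  # the full FP16 model fits in one server
```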
We're excited to continue showcasing the versatility and performance of AMD Instinct accelerators across future benchmarks as we expand our testing and optimization efforts. This is just the beginning of our journey. In the coming months, we plan to launch the next iterations of the AMD Instinct series, featuring, among other advances, additional memory, support for lower-precision datatypes, and increased compute power. Future ROCm releases will bring further software enhancements, including kernel improvements and advanced quantization support. Stay tuned for our next MLPerf submission; we look forward to sharing our progress and insights with you.
Meena Arunachalam - Fellow Systems Design Engineer
Miro Hodak - SMTS Systems Design Engineer
1MI300-56 - Official MLPerf™ Inference v4.1 Llama2-70B-99.9 server tokens/s and offline tokens/s results retrieved from https://mlcommons.org/benchmarks/inference-datacenter/ on August 28, 2024, from the following entries: 4.1-0001 (available), 4.1-0002 (available), and 4.1-0043. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.
2MI300-57 - Official MLPerf™ Inference v4.1 Llama2-70B-99.9 server tokens/s and offline tokens/s results retrieved from https://mlcommons.org/benchmarks/inference-datacenter/ on August 28, 2024, from the following entries: 4.1-0070 (preview) and 4.1-0043. The MLPerf™ name and logo are trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information.