
Llama 3.1: Ready to Run on AMD platforms from data center, edge to AI PCs

Ramine_Roane

Our AI strategy at AMD is focused on enabling the AI ecosystem with a broad portfolio of optimized training and inference compute engines, open and proven software capabilities, and deep-rooted co-innovation with our partners and customers. High performance, new innovations and broad compatibility are foundational vectors driving this strategy as the AI universe evolves. A significant focus of ours is to enable the next generation of AI models for everyone, making the benefits of AI pervasive.

Meta has been a critical contributor to the AI movement, providing the technology behind the widely used Llama large language model (LLM). AMD and Meta both believe that an open approach contributes to better, safer and more cost-effective products, fueling faster innovation and a healthier overall market. The latest Llama 3 models continue to showcase the growth and importance of open-source AI and LLMs.

With Llama 3.1, Meta expands the context length to 128K tokens, adds support for eight languages, and introduces Llama 3.1 405B, which according to Meta is the largest openly available foundation model. Llama 3.1 405B will enable the community to unlock new capabilities, such as synthetic data generation and model distillation.

We are encouraged by the recent release of the Llama 3.1 models from Meta and have them up and running in the labs at AMD, across our broad portfolio of compute engines, with positive results. In the meantime, we want to showcase some of the impressive work our teams have done with Llama 3 and what Llama 3.1 means for AMD AI customers.

 

AMD Instinct™ MI300X GPU Accelerators and Llama 3.1

Every generation of models brings new capabilities and performance to its community of users, and Llama 3.1 is no different. It revolutionizes complex conversations with unparalleled contextual understanding, reasoning, and text generation, and it runs seamlessly on the AMD Instinct MI300X GPU accelerator and platform from day 0.

AMD Instinct MI300X GPUs continue to provide leading memory capacity and bandwidth, enabling users to run a single instance of Llama 3 70B on a single MI300X accelerator and up to eight parallel instances simultaneously on a single server.1,2
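As a rough illustration of why a 70B model fits on one accelerator, the FP16 weights alone can be sized with simple arithmetic (a back-of-the-envelope sketch, not a measured result; real deployments also budget memory for the KV cache and activations):

```python
# Approximate FP16 weight footprint of Llama 3 70B vs. one MI300X's HBM3 capacity.
params = 70e9            # 70B parameters
fp16_bytes = 2           # FP16 stores each parameter in 2 bytes
weights_gb = params * fp16_bytes / 1e9
print(f"weights ~{weights_gb:.0f} GB vs. 192 GB HBM3 per MI300X")  # ~140 GB
```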

But with the new 405B-parameter model, the largest openly available foundation model, the need for memory capacity is greater than ever. We have confirmed that a server powered by eight AMD Instinct MI300X accelerators can fit the entire Llama 3.1 405B parameter model using the FP16 datatype. This means organizations can benefit from significant cost savings, simplified infrastructure management, and enhanced performance efficiency. This is made possible by the industry-leading memory capabilities of the AMD Instinct MI300X platform.3
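The same arithmetic, applied to the 405B model across a full eight-GPU platform, illustrates why a single server suffices (again an estimate under the FP16 assumption, with headroom left for the KV cache at long context lengths):

```python
# Approximate FP16 weight footprint of Llama 3.1 405B vs. an 8x MI300X server.
params = 405e9                    # 405B parameters
weights_gb = params * 2 / 1e9     # 2 bytes/parameter in FP16 -> ~810 GB
server_hbm_gb = 8 * 192           # eight MI300X GPUs with 192 GB HBM3 each
print(weights_gb, server_hbm_gb)  # 810.0 1536 -> the weights fit with room to spare
```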

Finally, Meta used the latest versions of the ROCm™ Open Ecosystem and AMD Instinct MI300X GPUs in parts of the development process of Llama 3.1. This continues our ongoing collaboration with Meta, and we look forward to building on this productive partnership.

 

AMD EPYC™ CPUs and Llama 3.1

Beyond data center GPUs, AMD enables a leading server platform for data center computing, offering high performance, energy efficiency, and x86 compatibility for a variety of data center workloads with our AMD EPYC™ CPUs. AI is an increasingly vital part of many data center applications, boosting creativity, productivity and efficiency across myriad workloads.

As most modern data centers support a variety of workloads, using AMD EPYC CPUs gives customers leadership enterprise workload performance, energy efficiency and the ability to run AI and LLMs for inferencing, small model development, testing, and batch training.

Llama has emerged as a consistent, accessible, and useful benchmark, helping data center customers identify the key characteristics (performance, latency, scale) that guide assessments of which technology and infrastructure best suit their data center server needs.

Llama 3.1 extends this value as a source of critical reference data, with more scale, greater flexibility in data generation and synthesis, and expanded context length and language support that better map to the needs of global businesses.

For those running a CPU-only environment with a smaller model like Llama 3 8B, our leadership 4th Gen AMD EPYC processors provide compelling performance and efficiency without requiring GPU acceleration. Modestly sized LLMs such as this are proving to be foundational elements of enterprise-class AI implementations. The ability to test CPU-only performance using the Llama 3 tools has given numerous customers the insight that there are many classes of workloads they can develop and deploy on readily available compute infrastructure. And as the workloads grow more demanding and the models get larger, that same AMD EPYC server infrastructure is a powerful and efficient host for advanced GPU acceleration solutions such as AMD Instinct or other third-party accelerators.
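As one way to explore this path, below is a minimal CPU-only sketch using the Hugging Face Transformers library (not an AMD-specific recipe; it assumes the transformers and torch packages are installed and that you have been granted access to the gated Llama 3 weights on Hugging Face):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# bfloat16 keeps the weights near 16 GB; the model loads on the CPU by default.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

inputs = tokenizer("Draft a short status update for the team:", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```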

 

AMD AI PCs and Llama 3.1

Not a coder? No problem! Harness the power of Meta’s Llama 3.1 at your fingertips with AMD Ryzen AI™ series of processors.

While developers can use code blocks and repos to get started with Llama 3.1, AMD is committed to the democratization of AI and lowering the barrier to entry for AI – which is why we partnered with LM Studio to bring Meta’s Llama 3.1 model to customers with AMD AI PCs.

To try it out, head over to LM Studio and experience a state-of-the-art, completely local chatbot powered by Llama 3.1 in just a few clicks. You can now use it to write emails, proofread documents, generate code, and much more!
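For those who do want to script against it, LM Studio can also expose the loaded model through a local OpenAI-compatible server (a minimal sketch, assuming the server is started on its default port 1234 with a Llama 3.1 model loaded and the openai Python package installed):

```python
from openai import OpenAI

# The local server does not check the API key; any placeholder string works.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

response = client.chat.completions.create(
    model="local-model",  # LM Studio serves whichever model you have loaded
    messages=[{"role": "user", "content": "Proofread: 'Their going to the meeting.'"}],
)
print(response.choices[0].message.content)
```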

 

AMD Radeon™ GPUs and Llama 3.1

For users looking to drive generative AI locally, AMD Radeon™ GPUs can harness the power of on-device AI processing to unlock new experiences and provide access to personalized, real-time AI performance.

LLMs are no longer the preserve of big businesses with dedicated IT departments running services in the cloud. With the combined power of select AMD Radeon desktop GPUs and AMD ROCm software, new open-source LLMs like Meta's Llama 2 and 3 – including the just-released Llama 3.1 – mean that even small businesses can run their own customized AI tools locally, on standard desktop PCs or workstations, without the need to store sensitive data online.4

AMD AI desktop systems equipped with a Radeon PRO W7900 GPU, running AMD ROCm 6.1 software, and powered by Ryzen™ Threadripper™ PRO processors represent a new client solution for fine-tuning and running inference on LLMs with high precision.
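On such a system, a ROCm build of PyTorch exposes the Radeon GPU through the familiar torch.cuda API, so existing code runs unchanged (a quick sanity-check sketch, assuming ROCm-enabled PyTorch is installed):

```python
import torch

print(torch.version.hip)              # set on ROCm builds (None on CUDA builds)
print(torch.cuda.is_available())      # True when the Radeon GPU is visible
print(torch.cuda.get_device_name(0))  # e.g. the Radeon PRO W7900

# The "cuda" device string maps to the ROCm/HIP device on these builds.
x = torch.randn(1024, 1024, device="cuda")
print((x @ x).sum().item())
```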

In addition, AMD PCs equipped with DirectML-supported AMD GPUs can run Llama 3.1 locally, accelerated via DirectML-based AI frameworks optimized for AMD devices. Read more here: https://community.amd.com/t5/ai/reduce-memory-footprint-and-improve-performance-running-llms-on/ba-p...
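A minimal sketch of that path with ONNX Runtime is shown below (assumptions: the onnxruntime-directml package is installed, and "llama3.1-8b.onnx" is a hypothetical local ONNX export of the model):

```python
import onnxruntime as ort

session = ort.InferenceSession(
    "llama3.1-8b.onnx",          # hypothetical path to a local ONNX export
    providers=[
        "DmlExecutionProvider",  # DirectML acceleration on supported AMD GPUs
        "CPUExecutionProvider",  # CPU fallback
    ],
)
print(session.get_providers())   # confirms which providers are active
```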

 

Conclusion

As we push the boundaries of AI, the collaboration between AMD and Meta plays a crucial role in advancing open-source AI. The compatibility of Llama 3.1 with AMD Instinct MI300X GPUs, AMD EPYC CPUs, AMD Ryzen AI, AMD Radeon GPUs, and AMD ROCm offers users a diverse choice of hardware and software, ensuring unparalleled performance and efficiency. AMD remains committed to providing cutting-edge technology that empowers innovation and growth across all sectors.

 

 

Endnotes
  1. MI300-46: Testing completed on 05/13/2024 by AMD performance lab measuring text generation with Llama 3 70B using a batch size of 1024, an input sequence length of 128, and 32 output tokens on a single server with 8 GPUs (tensor parallelism = 8)

    Configurations:
    2P AMD EPYC 9554 64-core CPU powered reference server with 8x AMD Instinct™ MI300X 192GB 750W GPUs, ROCm® 6.1.0 RC2, Ubuntu® 22.04.4 LTS with Linux® kernel 6.5.0-28-generic.
    Vs.
    An Nvidia DGX H100 with 2x Intel Xeon Platinum 8480CL Processors, 8x Nvidia H100 (80GB, 700W) GPUs, Ubuntu 22.04.2 LTS with Linux kernel 5.15.0-1029-nvidia

    Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

  2. MI300-47: Testing completed on 05/13/2024 by AMD performance lab measuring text generation with Llama 3 8B using a batch size of 1024, an input sequence length of 128, and 32 output tokens on a single GPU (tensor parallelism = 1)

    Configurations:
    2P AMD EPYC 9554 64-core CPU powered reference server with 8x AMD Instinct™ MI300X 192GB 750W GPUs, ROCm® 6.1.0 RC2, Ubuntu® 22.04.4 LTS with Linux® kernel 6.5.0-28-generic.
    Vs.
    An Nvidia DGX H100 with 2x Intel Xeon Platinum 8480CL Processors, 8x Nvidia H100 (80GB, 700W) GPUs, Ubuntu 22.04.2 LTS with Linux kernel 5.15.0-1029-nvidia

    Server manufacturers may vary configurations, yielding different results. Performance may vary based on use of latest drivers and optimizations.

  3. MI300-05A: Calculations conducted by AMD Performance Labs as of November 17, 2023, for the AMD Instinct™ MI300X OAM accelerator 750W (192 GB HBM3), designed with AMD CDNA™ 3 5nm FinFET process technology, resulted in 192 GB HBM3 memory capacity and 5.325 TB/s peak theoretical memory bandwidth performance. The MI300X memory bus interface is 8,192 bits and the memory data rate is 5.2 Gbps, for a total peak memory bandwidth of 5.325 TB/s (8,192 bits memory bus interface * 5.2 Gbps memory data rate / 8).

    The highest published results on the Nvidia Hopper H200 (141GB) SXM GPU accelerator resulted in 141GB HBM3e memory capacity and 4.8 TB/s GPU memory bandwidth performance.
    https://nvdam.widen.net/s/nb5zzzsjdf/hpc-datasheet-sc23-h200-datasheet-3002446

    The highest published results on the Nvidia Hopper H100 (80GB) SXM5 GPU accelerator resulted in 80GB HBM3 memory capacity and 3.35 TB/s GPU memory bandwidth performance.

    https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet

  4. GD-241: For a full list of Radeon parts supported by ROCm™ software as of 5/1/2024, go to https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html