
Unlocking New Horizons in AI and HPC with the Release of AMD ROCm™ 6.3

ROCm 6.3 marks a significant milestone for the AMD open-source platform, introducing advanced tools and optimizations to elevate AI, ML, and HPC workloads on AMD Instinct GPU accelerators. ROCm 6.3 is engineered to empower a wide range of customers—from innovative AI startups to HPC-driven industries—by enhancing developer productivity.

 

This blog delves into the standout features of this release, including seamless SGLang integration for accelerated AI inferencing, a re-engineered FlashAttention-2 for optimized AI training and inference, and the introduction of multi-node Fast Fourier Transform (FFT) support to revolutionize HPC workflows. Explore these updates and more as ROCm 6.3 continues to drive innovation across industries.


1. SGLang in ROCm 6.3: Super-Fast Inferencing of Generative AI (GenAI) Models 

GenAI is transforming industries, but deploying large models often means grappling with latency, throughput, and resource utilization challenges. Enter SGLang, a new runtime supported by ROCm 6.3, purpose-built for optimizing inference of cutting-edge generative models such as LLMs and VLMs on AMD Instinct GPUs.

 

Why It Matters to You:

  • 6X Higher Throughput: Achieve up to 6X higher throughput on LLM inferencing compared to existing systems, as researchers have found¹, enabling your business to serve AI applications at scale.
  • Ease of Use: SGLang is Python™-integrated and pre-configured in the ROCm Docker containers, enabling developers to deploy interactive AI assistants, multimodal workflows, and scalable cloud backends with reduced setup time.

Whether you're building customer-facing AI solutions or scaling AI workloads in the cloud, SGLang delivers the performance and ease of use needed to meet enterprise demands. Discover the powerful features of SGLang and learn how to seamlessly set up and run models on AMD Instinct GPU accelerators: get started now!
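As a quick sketch of what a deployment can look like: once an SGLang server is launched (for example with `python -m sglang.launch_server --model-path <your-model> --port 30000`), it exposes an OpenAI-compatible HTTP endpoint. The snippet below builds a standard chat-completion payload for that endpoint using only the Python standard library; the URL, port, and model name are placeholders for your own setup.

```python
import json

# Endpoint of a running SGLang server, e.g. launched with:
#   python -m sglang.launch_server --model-path <your-model> --port 30000
# The URL and model name below are placeholders for your own deployment.
SGLANG_URL = "http://localhost:30000/v1/chat/completions"

def build_chat_request(prompt: str, model: str = "default",
                       max_tokens: int = 128, temperature: float = 0.7) -> dict:
    """Build an OpenAI-compatible chat-completion payload for SGLang."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

payload = build_chat_request("Summarize ROCm 6.3 in one sentence.")
print(json.dumps(payload, indent=2))
# POST this payload as JSON to SGLANG_URL to receive a completion.
```

Because the endpoint follows the OpenAI schema, existing client libraries and tooling built against that API can point at the SGLang server unchanged.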

 

2. Next-Level Transformer Optimization: Re-Engineered FlashAttention-2 on AMD Instinct™

Transformer models are at the core of modern AI, but their high memory and compute demands have traditionally limited scalability. With FlashAttention-2 optimized for ROCm 6.3, AMD addresses these pain points, enabling faster, more efficient training and inference2.

 

Why Developers Will Love It:

  • 3X Speedups: Achieve up to 3X speedups on the backward pass and a highly efficient forward pass compared to FlashAttention-1², accelerating model training and inference to reduce time-to-market for enterprise AI solutions.
  • Extended Sequence Lengths: Efficient memory utilization and reduced I/O overhead make handling longer sequences on AMD Instinct GPUs seamless.

Optimize your AI pipelines with FlashAttention-2 on AMD Instinct GPU accelerators today, seamlessly integrated into existing workflows through ROCm’s PyTorch container with Composable Kernel (CK) as the backend.
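To see where the memory savings come from, consider the online-softmax tiling at the heart of the FlashAttention algorithm: attention scores are computed tile by tile against K and V while a running row maximum and normalizer are maintained, so the full seq_len × seq_len attention matrix is never materialized. The NumPy sketch below illustrates the algorithmic idea only; it is not the ROCm/Composable Kernel implementation.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference implementation: materializes the full attention matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def tiled_attention(q, k, v, tile=16):
    """FlashAttention-style forward pass: stream K/V in tiles with an
    online softmax (running max m, running denominator l), so memory
    stays O(seq_len * d) instead of O(seq_len^2)."""
    scale = 1.0 / np.sqrt(q.shape[-1])
    n, d = q.shape
    o = np.zeros((n, d))
    m = np.full(n, -np.inf)   # running row maximum
    l = np.zeros(n)           # running softmax denominator
    for j in range(0, k.shape[0], tile):
        s = q @ k[j:j + tile].T * scale        # scores for this tile only
        m_new = np.maximum(m, s.max(axis=-1))
        alpha = np.exp(m - m_new)              # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])
        l = l * alpha + p.sum(axis=-1)
        o = o * alpha[:, None] + p @ v[j:j + tile]
        m = m_new
    return o / l[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((64, 8)) for _ in range(3))
assert np.allclose(tiled_attention(q, k, v), naive_attention(q, k, v))
```

Because each tile's partial results are rescaled as the running maximum is updated, the tiled pass is numerically equivalent to the naive softmax attention while touching only one tile of K and V at a time.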


3. AMD Fortran Compiler: Bridging Legacy Code to GPU Acceleration

Enterprises running legacy Fortran-based HPC applications can now unlock the power of modern GPU acceleration with AMD Instinct™ accelerators, thanks to the new AMD Fortran compiler introduced in ROCm 6.3.

 

Key Benefits:

  • Direct GPU Offloading: Leverage AMD Instinct GPUs with OpenMP offloading, accelerating key scientific applications.
  • Backward Compatibility: Build on existing Fortran code while taking advantage of AMD’s next-gen GPU capabilities.
  • Simplified Integrations: Seamlessly interface with HIP Kernels and ROCm Libraries, eliminating the need for complex code rewrites.

Enterprises in industries such as aerospace, pharmaceuticals, and weather modeling can now future-proof their legacy HPC applications, realizing the power of GPU acceleration without the extensive code overhauls previously required. Get started with the AMD Fortran Compiler on AMD Instinct GPUs through this detailed walkthrough.

 

4. New Multi-Node FFT in rocFFT: A Game Changer for HPC Workflows

 Industries relying on HPC workloads—from oil and gas to climate modeling—require distributed computing solutions that scale efficiently. ROCm 6.3 introduces multi-node FFT support in rocFFT, enabling high-performance distributed FFT computations.

 

Why It Matters for HPC:

  • Built-in Message Passing Interface (MPI) Integration: Simplifies multi-node scaling, helping reduce complexity for developers and accelerating the enablement of distributed applications.
  • Leadership Scalability: Scale seamlessly across massive datasets, optimizing performance for critical workloads like seismic imaging and climate modeling.

Organizations in industries like oil and gas and scientific research can now process larger datasets with greater efficiency, driving faster and more accurate decision-making.
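The distributed pattern behind a multi-node FFT can be sketched in a few lines: each rank computes 1-D FFTs along the axis it holds locally, a global transpose (an MPI all-to-all on a real cluster) redistributes the data, and the remaining axis is then transformed. The single-process NumPy sketch below simulates the ranks to illustrate the slab decomposition; it is a conceptual illustration of the technique, not the rocFFT MPI API.

```python
import numpy as np

def distributed_fft2(x, nranks=4):
    """Conceptual slab-decomposed 2-D FFT, the pattern a multi-node FFT
    distributes over MPI. Each simulated 'rank' owns a slab of rows; the
    global transpose stands in for an MPI all-to-all exchange."""
    slabs = np.split(x, nranks, axis=0)              # scatter rows across ranks
    slabs = [np.fft.fft(s, axis=1) for s in slabs]   # 1-D FFTs along local axis
    y = np.vstack(slabs).T                           # global transpose (all-to-all)
    slabs = np.split(y, nranks, axis=0)
    slabs = [np.fft.fft(s, axis=1) for s in slabs]   # FFTs along remaining axis
    return np.vstack(slabs).T                        # transpose back to input layout

x = np.random.default_rng(0).standard_normal((8, 8))
assert np.allclose(distributed_fft2(x), np.fft.fft2(x))
```

The assertion confirms the decomposed result matches a direct 2-D FFT; on a real cluster the per-slab transforms run concurrently on separate nodes, which is where the scaling comes from.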

 

5. Enhanced Computer Vision Libraries: AV1, rocJPEG, and Beyond 

AI developers working with modern media and datasets require efficient tools for preprocessing and augmentation. ROCm 6.3 introduces enhancements to its computer vision libraries, rocDecode, rocJPEG, and rocAL, empowering enterprises to tackle diverse workloads from video analytics to dataset augmentation.

 

Why It Matters to You:

  • AV1 Codec Support: Cost-effective, royalty-free decoding for modern media processing via rocDecode and rocPyDecode.
  • GPU-Accelerated JPEG Decoding: Seamlessly handle image preprocessing at scale with the built-in fallback mechanisms of the rocJPEG library.
  • Better Audio Augmentation: Improved preprocessing with the rocAL library for robust model training in noisy environments.

From media and entertainment to autonomous systems, these features enable developers to create advanced AI solutions for real-world applications.
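As an illustration of the kind of audio augmentation rocAL accelerates, the sketch below mixes Gaussian noise into a waveform at a chosen signal-to-noise ratio, a common recipe for training models that must stay robust in noisy environments. This is a plain NumPy rendering of the technique, not the rocAL API.

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Mix Gaussian noise into a waveform at a target signal-to-noise
    ratio (in dB). CPU/NumPy illustration of a standard audio
    augmentation; libraries like rocAL run this class of preprocessing
    on the GPU as part of the training data pipeline."""
    if rng is None:
        rng = np.random.default_rng()
    sig_power = np.mean(signal ** 2)
    noise_power = sig_power / (10 ** (snr_db / 10))  # SNR_dB = 10*log10(Ps/Pn)
    noise = rng.standard_normal(signal.shape) * np.sqrt(noise_power)
    return signal + noise

t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 440 * t)          # one second of a 440 Hz tone
noisy = add_noise_at_snr(clean, snr_db=10, rng=np.random.default_rng(0))
```

Sweeping `snr_db` over a range (e.g. 0-20 dB) during training exposes the model to varying noise levels, which is the usual way this augmentation is applied.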

 

Beyond these standout features, it’s worth highlighting that Omnitrace and Omniperf, introduced in ROCm 6.2, have been rebranded as ROCm System Profiler and ROCm Compute Profiler. This rebranding enhances usability and stability and enables seamless integration into the broader ROCm profiling ecosystem.

 

Why ROCm 6.3?

AMD ROCm has been making strides with every release, and version 6.3 is no exception. It delivers cutting-edge tools to simplify development while driving better performance and scalability for AI and HPC workloads. By embracing the open-source ethos and continuously evolving to meet developer needs, ROCm empowers businesses to innovate faster, scale smarter, and stay ahead in competitive industries.

 

Ready to Take the Leap? Explore the full potential of ROCm and see how AMD Instinct accelerators can power your enterprise’s next big breakthrough. The ROCm Documentation Hub and other channels are being updated with the latest ROCm 6.3 content as this blog is published; details will be available very soon, so stay tuned!

 

Stay updated with the latest developments, tips, and insights by visiting AMD ROCm Blogs. Don’t forget to sign up for the RSS feed to receive regular updates directly to your inbox.

 

Key Contributors:

Jayacharan Kolla – Product Manager

Aditya Bhattacharji – Software Development Engineer

Ronnie Chatterjee – Director Product Management

Saad Rahim – SMTS Software Development Engineer

 

¹https://arxiv.org/pdf/2312.07104, p. 8

²Based on informal internal testing conducted for specific customers, FlashAttention-2 has demonstrated a 2-3X performance uplift vs. FlashAttention-1. Please note that performance can vary depending on individual system configurations, workloads, and environmental factors. This information is provided solely for illustrative purposes and should not be interpreted as a guarantee of future performance in all use cases.

 

The information contained herein is for informational purposes only and is subject to change without notice. While every precaution has been taken in the preparation of this document, it may contain technical inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD products are as set forth in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale. GD-18

Unless stated otherwise, AMD has not tested or verified the third-party claims in this document. GD-182.

 

© 2024 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, AMD Instinct, AMD ROCm and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other product names used in this publication are for identification purposes only and may be trademarks of their respective owners. Python is a trademark of the Python Software Foundation. PyTorch, the PyTorch logo and any related marks are trademarks of The Linux Foundation.