ROCm™ 6.1 New Capabilities - Unlocking Performance for AI and HPC with AMD Instinct™ Accelerators

AMD_AI · 06-19-2024

When we built the AMD ROCm™ 6 open-source software platform, we aimed to engineer an environment that lets you make the most of the performance and capabilities of AMD Instinct™ accelerators while honoring our commitment to open-source and device-independent solutions. Think of ROCm 6 as the bridge between your biggest AI ideas and their successful implementation. It offers exceptional compatibility with leading industry frameworks and freedom of movement in today's market, providing developers the flexibility to innovate at their own pace, testing and deploying applications across a wide range of GPU architectures.

Our latest update to the platform, ROCm 6.1, introduces numerous capabilities for developers and researchers alike. Ahead, we'll look at how ROCm 6.1 expands on the core strengths of ROCm 6, supporting the latest AMD Instinct™ and Radeon™ GPUs, increasing optimizations across numerous computational domains, and expanding ecosystem support to keep pace with the rapid advancements in AI frameworks. The new features and fixes provided by ROCm 6.1 are designed to improve the stability and performance of applications, enabling AI and HPC developers to explore the farthest boundaries of what's possible.

Introducing rocDecode for video processing

This new ROCm library enables high-performance video decoding directly on the GPU, leveraging the specialized media engines, known as Video Core Next (VCN), built into AMD GPUs. These hardware-based decoders handle video streams efficiently.

rocDecode allows compressed video to be decoded directly into video memory, minimizing data transfers over the PCIe bus and removing a common bottleneck in video processing. Because the decoded frames stay in video memory, they can be post-processed immediately with the ROCm HIP framework. Typical post-processing steps include video scaling, color conversion, and augmentation, which feed real-time analytics, inferencing, and machine learning training.

rocDecode maximizes the efficiency and scalability of video decoding tasks. By enabling the creation of multiple decoder instances that can operate in parallel, the API takes full advantage of all the available VCNs on a GPU device. This parallel processing capability helps ensure that even high-volume video streams can be decoded and processed simultaneously. In short, rocDecode reinforces the video processing pipeline, enabling performance gains and power efficiency improvements essential for modern AI and HPC applications.

New in MIGraphX: Flash Attention and PyTorch backend support

MIGraphX is the AMD graph inference engine. Designed to accelerate deep learning neural networks, MIGraphX is accessible through interfaces including C++ APIs, Python APIs, and the command-line tool migraphx-driver. This flexibility allows developers to easily integrate advanced model inference capabilities into their applications.

ROCm 6.1 improves performance for transformer-based models with support for Flash Attention, which boosts the memory efficiency of popular models such as BERT, GPT, and Stable Diffusion, helping ensure faster, more power-efficient processing of complex neural networks.

ROCm 6.1 also adds a new Torch-MIGraphX library that integrates MIGraphX capabilities directly into PyTorch workflows. It defines a “migraphx” backend that can be used directly with the torch.compile API. The Torch-MIGraphX library supports a range of data types, including FP32, FP16, and INT8, accommodating diverse computational needs.
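As a rough illustration of the torch.compile integration described above, the sketch below selects the "migraphx" backend when the Torch-MIGraphX package is installed. The package name `torch_migraphx`, the idea that importing it registers the backend, and the helper names `pick_backend` and `compile_model` are assumptions made for this example; check the Torch-MIGraphX documentation for the exact setup on your system.

```python
import importlib.util


def pick_backend() -> str:
    """Return "migraphx" if the torch_migraphx package is importable, else "inductor"."""
    # Assumption: Torch-MIGraphX is installed as a Python package named torch_migraphx.
    if importlib.util.find_spec("torch_migraphx") is not None:
        return "migraphx"
    return "inductor"  # PyTorch's default torch.compile backend


def compile_model(model):
    """Compile a torch.nn.Module with the backend chosen by pick_backend()."""
    import torch  # imported lazily so pick_backend() can run without PyTorch

    backend = pick_backend()
    if backend == "migraphx":
        # Assumption: importing the package registers the "migraphx" backend name.
        import torch_migraphx  # noqa: F401
    return torch.compile(model, backend=backend)
```

Calling `compile_model(model)` on a module that has been moved to the GPU and put in eval mode returns a compiled module that is invoked like the original one. The fallback to "inductor" is only a convenience for machines without Torch-MIGraphX.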

Improved Performance of the MIOpen Library

MIOpen is the AMD open-source, deep-learning primitives library designed specifically for enhancing GPU performance. It features a comprehensive set of tools to optimize memory bandwidth and GPU launch overheads through advanced techniques such as fusion and an auto-tuning infrastructure. This infrastructure effectively handles many problem configurations, tailoring algorithms to optimize convolutions for various filter and input sizes.

The latest updates to MIOpen focus on increasing performance, particularly for inference and convolutions. ROCm 6.1 introduces Find 2.0 fusion plans, designed to improve the library's ability to perform inference tasks more efficiently by optimizing the use of system resources. We've also improved the convolution kernels for the Number of samples, Height, Width, and Channels (NHWC) format. NHWC orders the dimensions as batch, height, width, and then channels, so the channel values of each pixel sit next to each other in memory. The updated heuristics specifically optimize performance for this format, enabling better handling and processing of convolution operations across various applications.
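To make the NHWC layout concrete, here is a minimal sketch of how a logical element (n, c, h, w) maps to a linear memory offset in NCHW versus NHWC. The function names are illustrative and are not part of MIOpen.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Linear offset of element (n, c, h, w) in a row-major NCHW buffer."""
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Linear offset of the same element in a row-major NHWC buffer."""
    return ((n * H + h) * W + w) * C + c


# A tiny tensor: 3 channels, 2 x 2 pixels.
C, H, W = 3, 2, 2

# Offsets of the three channel values of pixel (h=0, w=0) in image n=0.
nchw = [offset_nchw(0, c, 0, 0, C, H, W) for c in range(C)]
nhwc = [offset_nhwc(0, c, 0, 0, C, H, W) for c in range(C)]

print(nchw)  # [0, 4, 8]  channels are H*W elements apart
print(nhwc)  # [0, 1, 2]  channels are adjacent
```

In NHWC the channel index is the fastest-varying one, so a kernel that reads all channels of a pixel touches contiguous memory.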

New Architecture Support in Composable Kernel Library

ROCm 6.1 introduces new architecture support to the Composable Kernel (CK) library, offering highly efficient capabilities across a wider range of AMD GPUs. A significant update in this version is the replacement of FP8 rounding logic with stochastic rounding. Stochastic rounding rounds a value up or down at random, with a probability that depends on how close the value is to each neighboring representable number, so rounding errors average out instead of accumulating in one direction. This behavior helps improve model convergence when working with low-precision data in machine learning models.
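The idea behind stochastic rounding can be sketched in a few lines. The example below rounds to a uniform grid with spacing `step` rather than to real FP8 values, and it is not the Composable Kernel implementation; the function name and parameters are illustrative.

```python
import math
import random


def round_stochastic(x, step, rng):
    """Round x to a multiple of step, choosing the upper neighbor with
    probability equal to x's fractional distance from the lower neighbor."""
    lower = math.floor(x / step) * step
    frac = (x - lower) / step  # in [0, 1)
    return lower + step if rng.random() < frac else lower


rng = random.Random(0)  # fixed seed so the run is repeatable
x, step = 0.3, 1.0

samples = [round_stochastic(x, step, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
nearest = round(x / step) * step  # round-to-nearest always gives 0.0 here

print(f"round-to-nearest: {nearest}, stochastic mean: {mean:.3f}")
```

Round-to-nearest always maps 0.3 to 0.0, which loses the value entirely, while the average of many stochastic roundings stays close to 0.3. That unbiased behavior is why the technique is attractive for accumulating small updates in low-precision formats.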

Expanded hipSPARSELt Sparse Computations

ROCm 6.1 introduces extensions to hipSPARSELt that support structured sparsity matrices for accelerating deep-learning workloads. Notable in this release is support for configurations where ‘B’ represents the sparse matrix and ‘A’ is the dense matrix in Sparse Matrix-Matrix Multiplication (SPMM). This addition broadens the library’s capabilities beyond the previous limitation of only supporting multiplications where the sparse matrix was ‘A’ and the dense matrix was ‘B.’ The support for different matrix configurations increases the flexibility and performance of SPMM operations, further optimizing deep-learning computations.
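The two operand configurations can be illustrated with plain Python lists. This sketch does not use the hipSPARSELt API, and it assumes a 2:4 pattern (at most two non-zeros in every group of four values along the reduction dimension) as the form of structured sparsity, which the text above does not specify. It shows that a product with a sparse 'B' equals the transposed product in which the sparse matrix is the first operand, which is how such a product would have to be rearranged when only a sparse 'A' is supported.

```python
def matmul(X, Y):
    """Dense reference matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def transpose(X):
    return [list(col) for col in zip(*X)]


def is_2_4_sparse_rows(M):
    """True if every group of 4 consecutive values in each row has <= 2 non-zeros."""
    return all(
        sum(v != 0 for v in row[i:i + 4]) <= 2
        for row in M
        for i in range(0, len(row), 4)
    )


A = [[1, 2, 3, 4], [5, 6, 7, 8]]        # dense, 2 x 4
B = [[1, 0], [0, 2], [3, 0], [0, 4]]    # sparse, 4 x 2 (2:4 along each column)

# Sparse matrix as the second operand: C = A * B.
C = matmul(A, B)

# Equivalent form with the sparse matrix first: C = (B^T * A^T)^T.
C_via_transpose = transpose(matmul(transpose(B), transpose(A)))

print(C)                # [[10, 20], [26, 44]]
print(C_via_transpose)  # [[10, 20], [26, 44]]
```

Supporting the sparse 'B' configuration directly means the caller does not have to restructure the problem with explicit transposes.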

Advanced Tensor Operations with hipTensor

hipTensor is the AMD dedicated C++ library to accelerate tensor operations, utilizing the primitives of the Composable Kernel Library. We designed hipTensor to harness general-purpose kernel languages such as HIP C++. hipTensor optimizes the execution of tensor primitives in applications requiring complex tensor computations.

The latest iteration of hipTensor introduces support for 4D tensor permutation and contraction. With ROCm 6.1, users can now efficiently perform permutations on 4D tensors, a crucial operation in many tensor-based computations. The library now supports 4D contractions for F16, BF16, and Complex F32/F64 data types. This new functionality broadens the scope of operations that can be optimized by hipTensor, allowing for more intricate and diverse manipulations of tensor data, which are essential in advanced computational tasks such as neural network training and complex simulations.
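For readers new to these operations, the following reference sketch shows what a 4D permutation and a 4D contraction compute, using flat row-major Python lists. It is not the hipTensor API. The helper names and the contraction index convention C[m,n,u,v] = sum over h,k of A[m,n,h,k] * B[u,v,h,k] are assumptions chosen for illustration.

```python
from itertools import product


def strides(shape):
    """Row-major strides for the given shape."""
    out, acc = [], 1
    for dim in reversed(shape):
        out.append(acc)
        acc *= dim
    return out[::-1]


def offset(index, shape):
    return sum(i * s for i, s in zip(index, strides(shape)))


def permute(data, shape, perm):
    """Reorder the axes of a flat tensor: output axis j is input axis perm[j]."""
    new_shape = [shape[p] for p in perm]
    out = [0] * len(data)
    for idx in product(*(range(d) for d in shape)):
        new_idx = [idx[p] for p in perm]
        out[offset(new_idx, new_shape)] = data[offset(idx, shape)]
    return out, new_shape


def contract(a, a_shape, b, b_shape):
    """C[m,n,u,v] = sum over h,k of A[m,n,h,k] * B[u,v,h,k]."""
    M, N, H, K = a_shape
    U, V, H2, K2 = b_shape
    assert (H, K) == (H2, K2), "contracted dimensions must match"
    c_shape = [M, N, U, V]
    c = [0] * (M * N * U * V)
    for m, n, u, v in product(range(M), range(N), range(U), range(V)):
        c[offset((m, n, u, v), c_shape)] = sum(
            a[offset((m, n, h, k), a_shape)] * b[offset((u, v, h, k), b_shape)]
            for h, k in product(range(H), range(K))
        )
    return c, c_shape


# Permutation: swap the first and last axes of a 2 x 1 x 1 x 3 tensor.
permuted, permuted_shape = permute(list(range(6)), [2, 1, 1, 3], (3, 1, 2, 0))
print(permuted, permuted_shape)  # [0, 3, 1, 4, 2, 5] [3, 1, 1, 2]

# Contraction over the last two axes of A and B.
c, c_shape = contract([1, 2, 3, 4], [1, 1, 2, 2], [1, 0, 0, 1, 1, 1, 1, 1], [1, 2, 2, 2])
print(c, c_shape)  # [5, 10] [1, 1, 1, 2]
```

These loops are written for clarity only. A library such as hipTensor performs the same index arithmetic inside GPU kernels.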

Our goal for the ROCm platform is to give you access to the latest in high-performance computing. Each update in ROCm 6.1 has been designed to improve performance, streamline workflows, and help you achieve your goals more efficiently by providing practical, powerful tools that unlock your innovative potential. Sign up here to keep up with the latest ROCm developments.

Benchmark Graph Systems:

ROCm 6.1: MI300X Supermicro BKC 24.07.06; MI300X-NPS1-SPX-192GB-750W; ROCm 6.1.0 Container

ROCm 6.0: MI300X Banff, BKC X24.05.00; MI300X-None-None-192GB-750W; ROCm 6.0 Container