Graph optimization plays an important role in reducing the time and resources needed for training and inference of AI models. One of its most important functions is to identify opportunities for fusing various combinations of tensor operators, which can improve computational efficiency by reducing memory allocation and traffic. However, delivering high-performance backend kernels for a wide range of fused tensor operators at a rapid pace is a big challenge. Previously, for lack of such a solution, AI applications could only enable a limited set of graph optimizations on real hardware.
The AMD Composable Kernel (CK) library aims to address these challenges for AMD's current and future generations of GPUs [2]. CK uses the general-purpose HIP C++ language and is completely open source. CK was created with the following goals in mind:
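To make the memory-traffic argument concrete, here is a minimal sketch that counts element loads and stores for y = relu(x * w + b), computed first as three separate elementwise operators and then as one fused operator. The `CountingBuffer` class and both functions are hypothetical names invented for this illustration. They stand in for GPU global memory and are not part of CK or any real library.

```python
class CountingBuffer:
    """A buffer standing in for GPU global memory that counts element accesses."""

    def __init__(self, data, stats):
        self.data = list(data)
        self.stats = stats

    def load(self, i):
        self.stats["loads"] += 1
        return self.data[i]

    def store(self, i, value):
        self.stats["stores"] += 1
        self.data[i] = value


def run_unfused(x, w, b):
    """Three separate elementwise ops; each one writes a full intermediate buffer."""
    stats = {"loads": 0, "stores": 0}
    n = len(x)
    xb, wb, bb = (CountingBuffer(v, stats) for v in (x, w, b))
    t1 = CountingBuffer([0.0] * n, stats)  # intermediate: x * w
    t2 = CountingBuffer([0.0] * n, stats)  # intermediate: x * w + b
    y = CountingBuffer([0.0] * n, stats)
    for i in range(n):
        t1.store(i, xb.load(i) * wb.load(i))
    for i in range(n):
        t2.store(i, t1.load(i) + bb.load(i))
    for i in range(n):
        y.store(i, max(t2.load(i), 0.0))
    return y.data, stats


def run_fused(x, w, b):
    """One fused op; intermediates stay in local variables (think registers)."""
    stats = {"loads": 0, "stores": 0}
    n = len(x)
    xb, wb, bb = (CountingBuffer(v, stats) for v in (x, w, b))
    y = CountingBuffer([0.0] * n, stats)
    for i in range(n):
        t = xb.load(i) * wb.load(i) + bb.load(i)
        y.store(i, max(t, 0.0))
    return y.data, stats
```

For n elements, the unfused version performs 5n loads and 3n stores and allocates two intermediate buffers, while the fused version performs 3n loads and n stores with no intermediates. Real kernels are far more involved, but the accounting follows the same pattern.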
CK uses two concepts to increase backend developers' productivity:
The tensor coordinate transformation primitives and tile tensor operator primitives together form CK's reusable primitives.
Fig. 1 Use CK’s tensor coordinate transformation primitives to reinterpret convolution as GEMM
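The idea in Fig. 1 can be sketched in plain Python: instead of copying the input into an explicit matrix, a GEMM coordinate (m, k) is transformed on the fly into an input tensor coordinate (n, h, w, c). This is a simplified illustration of the general "implicit GEMM" mapping, not CK's actual primitives. The NHWC input layout, KYXC weight layout, stride 1, and no padding are assumptions made for this example, and all function names are hypothetical.

```python
def conv_direct(inp, wei, N, C, H, W, K, Y, X):
    """Reference forward convolution on flat lists (NHWC input, KYXC weights)."""
    Ho, Wo = H - Y + 1, W - X + 1
    out = [0.0] * (N * Ho * Wo * K)
    for n in range(N):
        for ho in range(Ho):
            for wo in range(Wo):
                for kk in range(K):
                    acc = 0.0
                    for y in range(Y):
                        for x in range(X):
                            for c in range(C):
                                i = ((n * H + ho + y) * W + wo + x) * C + c
                                j = ((kk * Y + y) * X + x) * C + c
                                acc += inp[i] * wei[j]
                    out[((n * Ho + ho) * Wo + wo) * K + kk] = acc
    return out


def conv_as_gemm(inp, wei, N, C, H, W, K, Y, X):
    """Same convolution expressed as a GEMM of shape (N*Ho*Wo) x K x (Y*X*C)."""
    Ho, Wo = H - Y + 1, W - X + 1
    gemm_m, gemm_n, gemm_k = N * Ho * Wo, K, Y * X * C

    def a(m, k):
        # Coordinate transformation: GEMM (m, k) -> input tensor (n, h, w, c).
        n, rem = divmod(m, Ho * Wo)
        ho, wo = divmod(rem, Wo)
        y, rem = divmod(k, X * C)
        x, c = divmod(rem, C)
        return inp[((n * H + ho + y) * W + wo + x) * C + c]

    def b(k, col):
        # KYXC weights are already a (K x gemm_k) matrix; read it transposed.
        return wei[col * gemm_k + k]

    out = [0.0] * (gemm_m * gemm_n)
    for m in range(gemm_m):
        for col in range(gemm_n):
            acc = 0.0
            for k in range(gemm_k):
                acc += a(m, k) * b(k, col)
            out[m * gemm_n + col] = acc  # row-major M x N equals NHoWoK layout
    return out
```

Both functions return the output in the same flat NHoWoK layout, so their results can be compared directly. The point of the sketch is that `a(m, k)` only computes indices, so no im2col buffer is materialized.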
Fig. 2 CK components (bottom: reusable primitives, top: conventional and fused operators)
The current CK library is structured into four layers from bottom to top: "Templated Tile Operator," "Templated Kernel and Invoker," "Instantiated Kernel and Invoker" and "Client API" [3] for different groups of developers:
Fig. 3 Multiple Layers of CK library
Meta’s AITemplate [7] (AIT) is a unified inference system with separate acceleration backends for AMD and NVIDIA GPUs. AITemplate utilizes CK as the backend on AMD GPUs. It interacts with CK's "Templated Kernel and Invoker" layer.
The AITemplate + CK approach establishes state-of-the-art inference performance for several important models on AMD Instinct MI250. Most of the advanced fused operators that CK supports in these models were driven by the vision of the AITemplate team, and many fused kernels were co-designed by the CK and AITemplate teams.
Here we compare end-to-end model performance on AMD Instinct™ MI250 accelerator and the alternative offerings [8]. All the benchmark models running on AMD Instinct MI250 here use AITemplate [9] + CK [10].
The following benchmark results show ResNet-50 performance of AIT+CK on AMD Instinct MI250 and TensorRT v8.5.0.12 [11] (TRT) on NVIDIA A100-PCIe-40GB and A100-DGX-80GB. The results show that AIT+CK on MI250 can deliver up to 1.08x speedup over TRT on NVIDIA A100-PCIe-40GB.
A back-to-back Batched GEMM + Softmax + GEMM fused operator template for the attention layer, implemented using CK, completely removes the data traffic between the compute units and HBM for the intermediate result. With this template, many attention-layer workloads that used to be bandwidth-bound become compute-bound and can use GPU compute much more efficiently. The algorithm is heavily inspired by FlashAttention [12], with improvements that greatly reduce memory traffic compared with the original FlashAttention.
The benchmark results shown below compare BERT base model (uncased) performance of AIT+CK on the AMD Instinct MI250 GPU and FasterTransformer v5.1.1 bug fix [13] (FT) on NVIDIA A100-PCIe-40GB and A100-DGX-80GB. With a sequence length of 4096, FT runs out of memory on both the A100-PCIe-40GB and the A100-DGX-80GB at batch size 32. Therefore, for sequence length 4096, we only show results for batch size 16, which FT supports on both A100-PCIe-40GB and A100-DGX-80GB. The results show that AIT+CK on the MI250 can provide up to 3.28x speedup over FT on NVIDIA A100-PCIe-40GB and up to 2.91x speedup over FT on NVIDIA A100-DGX-80GB.
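The general technique behind such a fused attention operator can be sketched in plain Python. The version below processes keys and values one tile at a time and keeps a running maximum, running sum, and running output per query row (the "online softmax" used by FlashAttention-style algorithms), so the full score matrix is never stored. It is a generic single-head illustration and not CK's kernel: it does not reproduce the CK-specific improvements mentioned above, it omits the usual 1/sqrt(d) scaling and masking, and the function names and `tile` parameter are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_naive(q, k, v):
    """Reference: materializes a full row of scores before the softmax."""
    out = []
    for qi in q:
        scores = [dot(qi, kj) for kj in k]
        mx = max(scores)
        p = [math.exp(s - mx) for s in scores]
        denom = sum(p)
        out.append([sum(p[j] * v[j][c] for j in range(len(v))) / denom
                    for c in range(len(v[0]))])
    return out


def attention_tiled(q, k, v, tile=2):
    """GEMM + softmax + GEMM in one pass over K/V tiles, no stored score matrix."""
    d_v = len(v[0])
    out = []
    for qi in q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * d_v  # unnormalized output row
        for start in range(0, len(k), tile):
            k_tile = k[start:start + tile]
            v_tile = v[start:start + tile]
            scores = [dot(qi, kj) for kj in k_tile]        # first GEMM (one tile)
            new_max = max(running_max, max(scores))
            rescale = math.exp(running_max - new_max)      # 0.0 on the first tile
            p = [math.exp(s - new_max) for s in scores]    # softmax numerator
            running_sum = running_sum * rescale + sum(p)
            acc = [acc[c] * rescale
                   + sum(p[j] * v_tile[j][c] for j in range(len(v_tile)))
                   for c in range(d_v)]                    # second GEMM (one tile)
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out
```

Rescaling the accumulator whenever the running maximum changes is what allows the softmax to be computed incrementally. On a GPU, the per-tile scores and probabilities would stay in on-chip memory rather than being written to HBM.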
The benchmark results below show Vision Transformer base model (image size 224x224) performance of AIT+CK on the Instinct MI250 and TensorRT v8.5.0.12 (TRT) on NVIDIA A100-PCIe-40GB and A100-DGX-80GB. The results show that AIT+CK on Instinct MI250 can provide up to 1.8x speedup over TRT on NVIDIA A100-PCIe-40GB and up to 1.4x speedup over TRT on NVIDIA A100-DGX-80GB.
The following benchmark results show end-to-end Stable Diffusion performance results of AIT+CK on the AMD Instinct MI250 using batch size 1, 2, 4 and 6. Note that for batch size 1, only a single GCD of MI250 is used, while for batch 2, 4 and 6 both GCDs are used.
| Batch size | Average latency (ms) of AIT + CK on MI250 |
|---|---|
| 1 | 2604 |
| 2 | 2604 |
| 4 | 3951 |
| 6 | 5368 |
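As a reading aid, the latencies in the table above can be converted into throughput. The short sketch below does the arithmetic in Python. The helper name `images_per_second` is hypothetical, and it assumes each batch item corresponds to one generated image, which the source does not state explicitly.

```python
# Average end-to-end latency in milliseconds per batch size, from the table.
latency_ms = {1: 2604, 2: 2604, 4: 3951, 6: 5368}


def images_per_second(batch_size, latency_in_ms):
    """Throughput, assuming one generated image per batch item."""
    return batch_size / (latency_in_ms / 1000.0)


throughput = {b: images_per_second(b, ms) for b, ms in latency_ms.items()}
```

Keep in mind when comparing rows that batch size 1 uses a single GCD, while the larger batch sizes use both GCDs of the MI250.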
As of this writing, there is no publicly available information on how to run Stable Diffusion end-to-end using TensorRT. However, the article "Making stable diffusion 25% faster using TensorRT" [14] shows how to use TensorRT to accelerate the UNet part of Stable Diffusion. Since UNet is the most critical and time-consuming part of Stable Diffusion, UNet performance largely reflects that of Stable Diffusion as a whole.
The benchmark results below show UNet performance of AITemplate + CK on the AMD Instinct MI250 and TensorRT v8.5.0.12 (TRT) on NVIDIA A100-PCIe-40GB and A100-DGX-80GB. The results show that AITemplate + CK on the AMD Instinct MI250 can deliver up to 2.45x speedup over TRT on NVIDIA A100-PCIe-40GB and up to 2.03x speedup over TRT on NVIDIA A100-DGX-80GB.
Two popular approaches for end-to-end performance optimization of AI models exist today:
Vendor-provided proprietary software that handles everything from graph optimizations to the backend as a black box. We think the black-box nature and limited customizability of this approach create an artificial boundary between the AI framework and the backend, and limit what could possibly be optimized.
A compiler-based approach is being developed for application-specific AI chips without a general-purpose language. This approach allows vendors to build a compiler for an AI-specific backend programming language.
A compiler-based approach is also being developed for architectures that already have general-purpose languages. Sometimes the goal is to streamline the entire process, from model-level graph optimizations to last-mile kernel generation, using Intermediate Representations (IRs). Other times the goal is to implement a Domain Specific Language (DSL) that makes programming the hardware simple. The challenge with these endeavors is that the need for architecture- and workload-specific HPC algorithms is not removed. It is only turned into a need for architecture- and workload-specific compiler passes, which makes it hard for most HPC algorithm developers, who play a crucial role in pushing the performance boundary of AI acceleration, to contribute.
AITemplate + CK demonstrates a third option that combines the best of the current approaches: a graph-optimizing framework identifies kernel fusion opportunities and constructs kernels using a flexible and efficient backend library. The performance results show that this approach performs well on GPUs and provides an agility that is difficult for the existing approaches to achieve. Through good library design, this approach serves the needs of AI system developers well, while also inviting HPC algorithm developers to contribute. Additionally, the general-purpose nature of the backend language means this approach is likely to adapt more quickly to new AI problems and architectural changes.
https://github.com/ROCmSoftwarePlatform/composable_kernel
Multiple teams at AMD have contributed to CK development, including Chao Liu, Jing Zhang, Letao Qin, Qianfeng Zhang, Liang Huang, Shaojie Wang, Anthony Chang, Chunyu Lai, Illia Silin, Adam Osewski, Poyen Chen, Rosty Geyyer, Hanwen Chang [15], Tejash Shah [15], Xiaoyan Zhou [15], Jianfeng Yan [15], Dan Yao, Guangzhao Lu, Raman Jana, Jehandad Khan, Wen-Heng (Jack) Chung, Austin Kerbow, Stanislav Mekhanoshin, Vang Thao, Jeff Byrnes, and Chaitanya Sankis.
More Information
ROCm webpage: AMD ROCm™ Open Software Platform | AMD
ROCm Information Portal: AMD Documentation - Portal
AMD Instinct Accelerators: AMD Instinct™ Accelerators | AMD
AMD Infinity Hub: AMD Infinity Hub | AMD
Endnotes: