AMD is excited to announce the release of AMD ROCm™ 5.5. With the AMD ROCm open software platform, built for flexibility and performance, the HPC and AI communities gain access to open compute languages, compilers, libraries and tools designed to accelerate code development and solve the toughest challenges in the world today. The latest version of the AMD ROCm platform adds new functionality while building on your favorite features from previous releases. Here we highlight some of our favorite new and enhanced features, including rocFFT and hipBLASLt. If you are interested in a more in-depth look at ROCm 5.5, we encourage you to check out the release notes.
rocFFT is AMD’s open-source GPU FFT library. The Fast Fourier Transform (FFT) is one of the most important algorithms in computer science, with applications ranging from digital signal processing to molecular dynamics simulations. The FFT works by divide-and-conquer: a transform of length N is computed by recursively computing Fourier transforms whose lengths are the factors of N. This works particularly well when N has many small factors (for example, if N is a power of two). When N isn’t very divisible, for example if N is prime, there are techniques that still compute the FFT efficiently and robustly. Prime factorization lies at the heart of the FFT, which means that small changes in problem size can have big effects on our choice of algorithm! For example, transforms of length 8191 and 8192 are computed in radically different ways: 8192 is a power of two (2^13), while 8191 is a Mersenne prime (2^13 - 1).
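To make the divide-and-conquer idea concrete, here is a minimal Python sketch. It is not rocFFT's implementation: it splits a length-N transform by the smallest prime factor of N, and when N is prime it simply falls back to a direct O(N^2) sum, which is only a stand-in for the prime-length techniques mentioned above. All function names are ours, chosen for illustration.

```python
import cmath


def prime_factors(n):
    """Return the prime factors of n in ascending order, with repetition."""
    factors = []
    f = 2
    while f * f <= n:
        while n % f == 0:
            factors.append(f)
            n //= f
        f += 1
    if n > 1:
        factors.append(n)
    return factors


def naive_dft(x):
    """Direct O(N^2) discrete Fourier transform, used here for prime lengths."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def fft(x):
    """Mixed-radix decimation-in-time FFT (illustrative only)."""
    n = len(x)
    if n <= 1:
        return list(x)
    p = prime_factors(n)[0]  # smallest prime factor of n
    if p == n:
        return naive_dft(x)  # nothing to split: n is prime
    m = n // p
    # Split into p interleaved sub-sequences and transform each at length m.
    subs = [fft(x[r::p]) for r in range(p)]
    # Recombine the sub-transforms using twiddle factors.
    return [
        sum(subs[r][k % m] * cmath.exp(-2j * cmath.pi * r * k / n) for r in range(p))
        for k in range(n)
    ]
```

Here `prime_factors(8192)` yields thirteen factors of 2, so the recursion keeps splitting down to tiny transforms, whereas `prime_factors(8191)` returns `[8191]` and the sketch has nothing to split.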
At AMD, we know that the more we target our code to a specific problem, the better performance we can get. So, to us, it’s worth having a lot of kernels for the GPU in the rocFFT library. But, as with anything in tech, there are limits! If we were to compile all the kernels that we think people might possibly use in the rocFFT library, it would grow to the size of tens of gigabytes, which is cumbersome and costly. As we move to support more devices and target more problem sizes, this will keep getting bigger and bigger. For those of you who love FFTs, perhaps filling your hard drive with FFT kernels doesn’t seem like such a bad thing, but it’s going to be a pain point for most users.
The number of kernels in rocFFT has increased over past AMD ROCm platform releases, as we have added specialized kernels for higher performance and support for new GPU architectures. In order to keep the file size on disk of the library reasonable, rocFFT completes a transition in ROCm 5.5 to build its kernels using hipRTC. This means that rocFFT will have more kernels to do faster transforms, but we don’t have to worry about the library getting bloated.
The bulk of rocFFT's kernels are produced by a code generator. In previous versions, this code generator was only invoked when rocFFT was built. The resulting code was built for a set of GPU architectures chosen at build time. Generating code at build time has some disadvantages:
Specialized Kernels: Any kernels generated at this time are built into rocFFT. Additional specialized kernels can help performance but must be built and distributed to users even if they would never run workloads that benefit from these kernels.
Fixed Architectures: The list of supported GPU architectures is fixed once the library is built. While rocFFT is open-source and can be built by anyone, it's inconvenient for users to have to rebuild rocFFT to work on a different architecture. Adding new architectures to the build-time list also increases library size, even for users that don't use them.
rocFFT in ROCm 5.5 instead builds the code generator into the library. Any time a kernel is needed but not available for an FFT, rocFFT will generate code at runtime and use hipRTC to compile it. As a result:
Specialized kernels that are not commonly used do not need to be distributed to everyone. Instead, they are built as needed.
Users with GPUs that are not on the build-time architecture list can use rocFFT as-is without needing to recompile the whole library.
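The "build as needed" flow described above can be sketched roughly as follows. This is a conceptual Python model, not rocFFT code: `generate_source` and `compile_kernel` are hypothetical stand-ins for the library's code generator and the hipRTC compile step, and the architecture string is only an example value.

```python
def generate_source(length, arch):
    """Hypothetical stand-in for the code generator built into the library."""
    return f"// FFT kernel source for length={length}, arch={arch}"


def compile_kernel(source):
    """Hypothetical stand-in for a hipRTC compile; returns a fake binary."""
    return source.encode()


def kernel_for_plan(length, arch, prebuilt):
    """Return (kernel, was_compiled) for a plan, compiling only on a miss."""
    key = (length, arch)
    if key in prebuilt:
        # A kernel shipped with the library covers this case already.
        return prebuilt[key], False
    # Otherwise generate the source and compile it at plan-creation time.
    return compile_kernel(generate_source(length, arch)), True
```

With `prebuilt = {(8192, "gfx90a"): b"..."}`, a length-8192 plan reuses the shipped kernel, while a length-8191 plan, or any plan for an architecture missing from the table, takes the compile path.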
Compiling a kernel at runtime is done at the plan generation stage, so it doesn’t affect transform execution performance, and runtime-compiled kernels have the same performance as those compiled ahead of time. Compiling a kernel with hipRTC typically takes about one second, and multiple kernels can be compiled in parallel. Applications typically execute only a few different types and lengths of transforms, so this extra plan generation time is rarely significant. The rocFFT library ships with a pre-populated, read-only, system-level kernel cache, so most of the time users won’t end up compiling anyway. We also enable caching of user-compiled kernels in a read/write user-level cache. In ROCm 5.5, the default is to store the user-level cache in memory. This can be changed via the environment variable ROCFFT_RTC_CACHE_PATH.
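As a small illustration of changing the user-level cache location, the Python helper below builds an environment with ROCFFT_RTC_CACHE_PATH set, which can then be handed to a child process. The helper name is ours and the path shown is hypothetical; we assume the variable takes a filesystem path, so please check the rocFFT documentation for the exact form it expects.

```python
import os


def env_with_rocfft_cache(cache_path, base_env=None):
    """Return a copy of the environment with ROCFFT_RTC_CACHE_PATH set."""
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the variable holds a filesystem path for the user-level cache.
    env["ROCFFT_RTC_CACHE_PATH"] = os.path.expanduser(cache_path)
    return env
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching an application that uses rocFFT, so that kernels compiled in one run can be reused in later runs instead of living only in memory.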
Runtime code generation and compilation via hipRTC allows for high-performance, specialized kernels. It’s zero-cost in terms of execution, enables developers to implement new features and support new architectures, and it means that you won’t have to install a separate hard drive: you store only what you end up using. Implementing this in rocFFT is part of the plan for enabling exciting new features that we think you’ll really like. Stay tuned for more!
hipBLASLt is a new library, first released in ROCm 5.5, that provides General Matrix Multiplication (GEMM) operations and extended operations to speed up end-to-end AI workloads. Beyond what a traditional BLAS library offers, this new library adds multiple extended operations fused with GEMM, a run-time kernel searching mechanism, and flexibility in matrix data types and compute types to match the mixed-precision capabilities of the GPU’s Matrix Cores. hipBLASLt kernels are also continuously being optimized, with the aim of providing the best run-time performance for different AI workloads.
hipBLASLt provides high-performance assembly kernels for GEMM and for extended operations fused with GEMM, such as bias, scaling factors and widely used activation functions (e.g., ReLU and GELU). These fused operations improve performance by minimizing memory operations and GPU kernel launch overhead.
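To show what a fused operation computes, here is a plain-Python reference for a GEMM followed by scaling, bias and ReLU. It does not use the hipBLASLt API; the function name is ours, and adding the bias per output row is an assumption made for this example. The point is that the epilogue is applied to each accumulated value before it is written out, rather than in separate passes over the output matrix.

```python
def relu(v):
    """Rectified linear unit on a scalar."""
    return v if v > 0.0 else 0.0


def gemm_bias_relu(a, b, bias, alpha=1.0):
    """Compute relu(alpha * (a @ b) + bias) in a single pass over the output."""
    m, k, n = len(a), len(b), len(b[0])
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Fused epilogue: scale, add bias (per row, by assumption) and
            # activate before the value is stored.
            out[i][j] = relu(alpha * acc + bias[i])
    return out
```

An unfused version would store the GEMM result, then read and rewrite it once for the bias and once more for the activation, which on a GPU means extra memory traffic and extra kernel launches.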
hipBLASLt provides a run-time kernel searching mechanism that lets users choose their preferred solutions. A kernel selection algorithm in the backend lets users obtain the recommended kernel efficiently. The library also gives users the opportunity to explore more kernels with heuristic search.
hipBLASLt kernel performance is continuously being optimized for different GEMM shapes, such as tiny GEMMs. The library aims to provide optimal run time for different AI workloads. Download it at https://github.com/ROCmSoftwarePlatform/hipBLASLt and give it a try!
Try it for Yourself
See for yourself the power of ROCm 5.5. Download the latest version here.
Carson Liao - Senior Manager in AI library group at AMD Taiwan.
Jimmy Chang - SMTS software engineer in AI library group at AMD Taiwan.
Henry Ho - SMTS software engineer in AI library group at AMD Taiwan.
Yangwen Huang - Senior software engineer in AI library group at AMD Taiwan.
Victor Wu - MTS software engineer in AI library group at AMD Taiwan.
Steve Leung - SMTS Software Development Engineer.
Saad Rahim - Senior Member of Technical Staff in the AI Group.
Malcolm Roberts - Technical Lead for rocFFT.
Making the ROCm platform even easier to adopt
For ROCm users and developers, AMD is continually looking for ways to make ROCm easier to use and easier to deploy on systems, and to provide learning tools and technical documents that support those efforts.
The ROCm web pages provide an overview of the platform and what it includes, along with the HPC and AI markets and workloads it supports.
ROCm Information Portal is a portal for users and developers that posts the latest ROCm versions along with API and support documentation. This portal also hosts ROCm learning materials to help introduce the ROCm platform to new users, as well as to provide existing users with curated videos, webinars, labs, and tutorials to help in developing and deploying systems on the platform.
AMD Infinity Hub gives you details on ROCm supported HPC applications and ML frameworks, and how to get the latest versions and install documents. You can also access the ROCm Application Catalog there, which includes an up-to-date listing of ROCm enabled applications.
Sydney Freeman is Sr. Product Marketing Specialist for AMD. Her postings are her own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.