Computational and data science have emerged as powerful modes of scientific inquiry and engineering design. Often referred to as the “third” and “fourth” pillars of the scientific method, they are interdisciplinary fields in which computer models and simulations of physical, biological, or data-driven processes are used to probe, predict, and analyze complex systems of interest. Keeping up with increasing scientific and industrial demands requires ever more computational power and resources. To fully utilize the emerging hardware designed to meet these demands, the development of robust software for high-performance computing (HPC) and machine learning (ML) applications is now more crucial than ever. The challenge grows as hardware trends continue toward massive parallelism through GPU acceleration, which requires the adoption of sophisticated heterogeneous programming environments and carefully tuned application code. That is precisely what this blog series aims to address.
In this AMD lab notes blog series, engineers share lessons learned from tuning a wide range of scientific applications, libraries, and frameworks for AMD GPUs. Tune in each month for a new lab notes blog focusing on topics such as:
- AMD GPU implementations of computational science algorithms such as PDE discretizations, linear algebra, solvers, and more
- AMD GPU programming tutorials showcasing optimizations
- Instructions for leveraging ML frameworks, data science tools, post-processing, and visualization on AMD GPUs
- Best practices for porting and optimizing HPC and ML applications targeting AMD GPUs
- Guidance on using libraries and tools from the ROCm™ software stack
Readers can also look forward to accompanying code examples, so domain experts and computational/data scientists alike can try out and experiment with the code on their own systems. The lab notes primarily focus on AMD Instinct™ GPUs, but we expect users of other AMD graphics cards to benefit from the strategies outlined as well. Come back each month to learn directly from the experts how best to optimize application performance on AMD GPUs, and find inspiration for accelerating your own application code even further. Eager to get started? Keep reading for a preview of the topics in our first set of blog posts in the series, and check back for many more AMD lab notes blogs.
AMD Matrix Cores
Matrix multiplication is a fundamental operation in linear algebra and a ubiquitous computation in HPC applications. Since the introduction of AMD’s CDNA™ architecture, General Matrix Multiplication (GEMM) computations are hardware-accelerated through Matrix Core processing units. Matrix Core accelerated GEMM kernels lie at the heart of BLAS libraries like rocBLAS, but they can also be programmed directly by developers. This blog post showcases the compiler intrinsics used to leverage AMD Matrix Cores.
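As a taste of what that programming model looks like, below is a minimal sketch (not the post's final code) in which a single 64-thread wavefront computes a 16x16 product of a 16x4 tile of A and a 4x16 tile of B with one MFMA intrinsic. The lane-to-element mapping follows AMD's MFMA documentation for CDNA GPUs, and the example assumes hardware with Matrix Cores (e.g., MI100/MI200):

```cpp
#include <hip/hip_runtime.h>

// Four packed floats: the accumulator type the MFMA intrinsics operate on.
typedef float float4_t __attribute__((ext_vector_type(4)));

// One wavefront, launched as a 16x4 thread block, computes
// D(16x16) = A(16x4) * B(4x16) with a single Matrix Core instruction.
// All matrices are row-major.
__global__ void sgemm_16x16x4(const float* A, const float* B, float* D) {
    float4_t d = {0.0f, 0.0f, 0.0f, 0.0f};  // per-lane accumulator

    // Each lane feeds one element of A and one of B:
    //   a = A[m][k] with m = threadIdx.x, k = threadIdx.y
    //   b = B[k][n] with k = threadIdx.y, n = threadIdx.x
    float a = A[threadIdx.x * 4 + threadIdx.y];
    float b = B[threadIdx.y * 16 + threadIdx.x];

    d = __builtin_amdgcn_mfma_f32_16x16x4f32(a, b, d, 0, 0, 0);

    // Lane (x, y) ends up holding D[4*y + i][x] in d[i].
    for (int i = 0; i < 4; ++i)
        D[(4 * threadIdx.y + i) * 16 + threadIdx.x] = d[i];
}
```

Launched as `sgemm_16x16x4<<<1, dim3(16, 4)>>>(dA, dB, dD)`, this performs the full multiply-accumulate in one instruction per wavefront; libraries like rocBLAS build much larger GEMMs out of such tiles.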
AMD ROCm™ Installation
Installing the AMD ROCm™ software stack can be challenging without a clear understanding of the pieces involved and the flow of the installation process. This introductory material shows how to install ROCm on a workstation with an AMD GPU card that supports the AMD GFX9 architecture. The post describes three installation options and points interested readers to several relevant websites and online documentation for further details.
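Whichever option you choose, a quick way to confirm a finished install is a small HIP program that enumerates the visible GPUs. This is just a smoke-test sketch using standard HIP runtime calls (the filename is illustrative), not part of the official instructions:

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>

// Smoke test: compile with `hipcc check_rocm.cpp -o check_rocm` and run.
// A working ROCm install should list each GPU and its architecture.
int main() {
    int count = 0;
    if (hipGetDeviceCount(&count) != hipSuccess || count == 0) {
        std::printf("No AMD GPUs detected -- check the ROCm installation.\n");
        return 1;
    }
    for (int i = 0; i < count; ++i) {
        hipDeviceProp_t prop;
        if (hipGetDeviceProperties(&prop, i) == hipSuccess)
            std::printf("Device %d: %s (%s)\n", i, prop.name, prop.gcnArchName);
    }
    return 0;
}
```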
Finite Difference Method – Laplacian Part 1
The finite difference method is a canonical example of a computational physics stencil discretization commonly used in applications ranging from geophysics (weather and oil & gas) and electromagnetics (semiconductors and astrophysics) to gas dynamics (airflow and plasmas). Stencil codes are characterized by the need to access a local neighborhood of grid points (the stencil) in order to evaluate the value at a single grid point, meaning that the performance of the algorithm is strongly tied to its memory access pattern. In this blog post we develop an initial GPU-accelerated stencil code for the Laplace operator using AMD’s Heterogeneous-computing Interface for Portability (HIP) API, and present a performance target we expect to achieve when we optimize the code in future posts.
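To make the starting point concrete, a naive HIP Laplacian kernel looks roughly like the sketch below: one thread per interior grid point, second-order central differences, shown here in 2D for brevity (the grid sizes and spacings are illustrative parameters, not the post's actual setup):

```cpp
#include <hip/hip_runtime.h>

// Naive 2D Laplacian: one thread per interior grid point, second-order
// central differences. nx, ny count grid points (including boundaries);
// hx, hy are the grid spacings; u is the input field, f the output.
__global__ void laplacian_2d(double* f, const double* u,
                             int nx, int ny, double hx, double hy) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1) return;  // skip boundary

    int pos = j * nx + i;
    // Each output point reads five input points (the stencil), so
    // performance hinges on how these loads hit the caches.
    f[pos] = (u[pos - 1]  - 2.0 * u[pos] + u[pos + 1])  / (hx * hx)
           + (u[pos - nx] - 2.0 * u[pos] + u[pos + nx]) / (hy * hy);
}
```

With five loads and one store per thread, effective memory bandwidth, rather than floating-point throughput, is the natural performance target for a kernel like this.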
Finite Difference Method – Laplacian Part 2
In this second part of the finite difference method series, we focus on two optimizations that reduce data movement and thereby increase the effective memory bandwidth of the previously implemented HIP kernel. The first optimization employs loop tiling to explicitly reduce memory loads, lowering the load-to-store ratio of each GPU thread. The second reorders the memory access pattern to improve the L2 cache hit rate. Together, the two optimizations bring the kernel's performance much closer to the expected target (see the sketch of the tiling idea below); the next part will answer some remaining open questions.
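As a sketch of the loop-tiling idea, applied here to the illustrative 2D kernel from the Part 1 preview (the tile factor m and the register reuse scheme are assumptions for illustration, not the post's final kernel):

```cpp
#include <hip/hip_runtime.h>

// Loop-tiled 2D Laplacian sketch: each thread computes m consecutive
// interior points along y, reusing vertical-neighbor loads in registers.
template <int m>
__global__ void laplacian_2d_tiled(double* f, const double* u,
                                   int nx, int ny, double hx, double hy) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = 1 + m * (blockIdx.y * blockDim.y + threadIdx.y);
    if (i < 1 || i >= nx - 1 || j >= ny - 1) return;

    const double invhx2 = 1.0 / (hx * hx);
    const double invhy2 = 1.0 / (hy * hy);

    double below  = u[(j - 1) * nx + i];  // loaded once, then reused
    double center = u[j * nx + i];
    for (int t = 0; t < m && j + t < ny - 1; ++t) {
        int pos = (j + t) * nx + i;
        double above = u[pos + nx];
        f[pos] = (u[pos - 1] - 2.0 * center + u[pos + 1]) * invhx2
               + (below     - 2.0 * center + above)       * invhy2;
        below  = center;   // slide the stencil: one new y-load per point
        center = above;
    }
}
```

By walking m points in the y direction, the value loaded as one point's "above" neighbor is reused as the next point's center and the one after that's "below" neighbor, cutting redundant global loads per output value.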
All technical content and accompanying code examples can be found at AMD lab notes.
Helpful Resources:
- The ROCm web pages provide an overview of the platform and what it includes, along with the markets and workloads it supports.
- ROCm Information Portal is a new one-stop portal for users and developers that posts the latest versions of ROCm along with API and support documentation. The portal also now hosts the ROCm Learning Center, which helps introduce the ROCm platform to new users and provides existing users with curated videos, webinars, labs, and tutorials to help in developing and deploying systems on the platform. It replaces the former documentation and learning sites.
- AMD Infinity Hub gives you access to HPC applications and ML frameworks packaged as containers and ready to run. You can also access the ROCm Application Catalog, which includes an up-to-date listing of ROCm-enabled applications.
- AMD Accelerator Cloud offers remote access to test code and applications in the cloud, on the latest AMD Instinct™ accelerators and ROCm software. Finally, learn more about our AMD Instinct accelerators, including the latest AMD Instinct MI200 series family of accelerators and supporting partner server solutions, in our AMD Instinct Server Solutions Catalog.
Sydney Freeman is a Sr. Product Marketing Specialist for AMD. Her postings are her own opinions and may not represent AMD’s positions, strategies, or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.