With MI200 series accelerators, initially available in a new Open Accelerator Module (OAM) form factor seen in Figure 1 below, scientists can tackle their most pressing challenges—from climate change to vaccine research—using exascale-class supercomputers like the HPE Cray EX Supercomputer or off-the-shelf servers from our partners like ATOS, Gigabyte, Penguin Computing, Supermicro and others.
Figure 1: AMD Instinct™ MI250 accelerator (OAM Module)
When it comes to performance, MI200 series accelerators give customers the industry's fastest accelerator, the MI250X, delivering up to 47.9 TFLOPS peak theoretical double precision (FP64) HPC performance and up to 383 TFLOPS peak theoretical half-precision (FP16) AI performance, as seen in Graph 1 below [1]. In real-world benchmarks representative of the work the Frontier supercomputer is expected to do, AMD Instinct MI200 series accelerators deliver up to a 3x speedup over competitive data center GPUs today [2]. Visit the AMD Instinct benchmark page to learn more about real-world application performance of MI200 GPUs.
Graph 1: AMD Instinct™ MI250X accelerator delivers performance for HPC [2].
How did we manage this performance feat? First, we engineered our next-generation AMD CDNA™ 2 architecture, moving to a 6-nanometer process technology and expanding our Matrix Core Technology with new FP64 Matrix Cores. Then we combined two dies in one package using the same multi-die technology that has made AMD EPYC™ processors the fastest x86 server processors in the world [3]. This allowed us to increase core density on the MI250X accelerator by 83% over our previous-generation GPUs, providing 220 Compute Units with 14,080 stream cores and 880 Matrix Cores [4]. It also lets us offer customers an industry-leading 128GB of HBM2e memory with up to 3.2 TB/s of peak theoretical memory throughput [5].

Connectivity is the next challenge: how do you move data in and out of the GPUs, and between peers in hives (groups) of four or eight accelerators? The MI200 series accomplishes this with the 3rd Gen AMD Infinity architecture, which interconnects accelerators within a hive through up to eight AMD Infinity Fabric™ links per MI200 accelerator, delivering up to 800 GB/s of peer-to-peer transfer bandwidth per accelerator [6]. Figure 2 below shows what a typical platform with dual AMD EPYC CPUs and eight AMD Instinct MI250X accelerators looks like with the AMD Infinity Architecture. This high-speed 3rd Gen Infinity Fabric can also connect directly to 3rd Gen AMD EPYC™ CPUs, accelerating data movement among devices. The approach not only speeds data transfer but also enables cache coherency between optimized 3rd Gen AMD EPYC CPUs and MI250X accelerators [6].
Figure 2: Typical server platform diagram with dual 3rd Gen AMD EPYC™ CPUs and eight AMD Instinct™ MI250 accelerators.
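For developers, this topology is visible from software through the HIP runtime in ROCm. The snippet below is a minimal, illustrative sketch (not an AMD sample): it enumerates the accelerators in a node, prints a few device properties, and enables peer-to-peer access from device 0 to each peer. It assumes a working ROCm/HIP installation; actual device counts, names, and P2P support depend on the platform.

```cpp
// Minimal HIP sketch: enumerate GPUs in a node and enable peer-to-peer
// access from device 0 to each peer. Illustrative only; results depend on
// the platform topology and ROCm installation.
#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);

    for (int d = 0; d < count; ++d) {
        hipDeviceProp_t prop;
        hipGetDeviceProperties(&prop, d);
        printf("GPU %d: %s, %d compute units, %.1f GB memory\n",
               d, prop.name, prop.multiProcessorCount,
               prop.totalGlobalMem / 1.0e9);
    }

    // Enable direct P2P transfers from device 0 to every peer that reports support.
    hipSetDevice(0);
    for (int peer = 1; peer < count; ++peer) {
        int canAccess = 0;
        hipDeviceCanAccessPeer(&canAccess, 0, peer);
        if (canAccess) {
            hipDeviceEnablePeerAccess(peer, 0);  // flags argument must be 0
            printf("P2P enabled: GPU 0 -> GPU %d\n", peer);
        }
    }
    return 0;
}
```

Once peer access is enabled, calls such as hipMemcpyPeer can move data directly between accelerators, taking advantage of the Infinity Fabric links described above.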
Just as our packaging meets open standards, so does the AMD ROCm™ 5.0 open software platform. ROCm's underlying vision has always been to provide open, portable, and performant software for accelerated GPU computing. With ROCm 5.0, we're adding support and optimizations for MI200 series accelerators, expanding ROCm support to include Radeon™ Pro W6800 workstation GPUs, and improving the developer tools that increase end-user productivity. ROCm continues to give developers choice, making accelerated software portable across a range of accelerators and helping ensure alignment with industry standards through our collection of open-source software and APIs.
Now you can write your software once and run it practically anywhere. And the newly introduced AMD Infinity Hub, a collection of advanced GPU software containers and deployment guides for HPC, AI, and machine learning applications, is available to help speed up your system deployments and your time to science and discovery.
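As a deliberately simple illustration of that single-source portability, here is a hedged sketch of a HIP vector-add program. It is not an AMD sample, and it assumes a working ROCm/HIP installation with hipcc available.

```cpp
// Minimal single-source HIP example: a vector add compiled with hipcc for
// AMD Instinct accelerators under ROCm. Illustrative sketch only.
#include <hip/hip_runtime.h>
#include <vector>
#include <cstdio>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n, 0.0f);

    float *da, *db, *dc;
    hipMalloc((void**)&da, n * sizeof(float));
    hipMalloc((void**)&db, n * sizeof(float));
    hipMalloc((void**)&dc, n * sizeof(float));
    hipMemcpy(da, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(db, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(da, db, dc, n);

    hipMemcpy(hc.data(), dc, n * sizeof(float), hipMemcpyDeviceToHost);
    printf("c[0] = %.1f (expected 3.0)\n", hc[0]);

    hipFree(da); hipFree(db); hipFree(dc);
    return 0;
}
```

Built with hipcc, this source targets AMD Instinct accelerators under ROCm, and HIP's portability layer allows the same code to be compiled for other supported GPU platforms without source changes.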
Today you can power discoveries with the most advanced accelerator available anywhere, combined with the fastest x86 server CPUs and supported by the ROCm 5.0 platform [1,3,7]. If you aren't already using AMD Instinct accelerators and AMD EPYC processors, now is the time to start.
Learn More:
Learn more about the latest AMD Instinct™ MI200 Series Accelerators
Visit the AMD Infinity Hub to learn about our AMD Instinct™ supported containers.
Learn more about the 2nd Gen AMD CDNA™ architecture
Learn more about the AMD ROCm™ open software platform
Learn more about AMD Instinct™ MI200 series performance
Guy Ludden is Sr. Product Marketing Mgr. for AMD. His postings are his own opinions and may not represent AMD’s positions, strategies or opinions. Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied.
Endnotes:
1. World's fastest data center GPU is the AMD Instinct™ MI250X. Calculations conducted by AMD Performance Labs as of Sep 15, 2021, for the AMD Instinct™ MI250X (128GB HBM2e OAM module) accelerator at 1,700 MHz peak boost engine clock resulted in 95.7 TFLOPS peak theoretical double precision matrix (FP64 Matrix), 47.9 TFLOPS peak theoretical double precision (FP64), 95.7 TFLOPS peak theoretical single precision matrix (FP32 Matrix), 47.9 TFLOPS peak theoretical single precision (FP32), 383.0 TFLOPS peak theoretical half precision (FP16), and 383.0 TFLOPS peak theoretical Bfloat16 format precision (BF16) floating-point performance. Calculations conducted by AMD Performance Labs as of Sep 18, 2020, for the AMD Instinct™ MI100 (32GB HBM2 PCIe® card) accelerator at 1,502 MHz peak boost engine clock resulted in 11.54 TFLOPS peak theoretical double precision (FP64), 46.1 TFLOPS peak theoretical single precision matrix (FP32 Matrix), 23.1 TFLOPS peak theoretical single precision (FP32), and 184.6 TFLOPS peak theoretical half precision (FP16) floating-point performance. Published results on the NVIDIA Ampere A100 (80GB) GPU accelerator, boost engine clock of 1,410 MHz, resulted in 19.5 TFLOPS peak double precision Tensor Core (FP64 Tensor Core), 9.7 TFLOPS peak double precision (FP64), 19.5 TFLOPS peak single precision (FP32), 78 TFLOPS peak half precision (FP16), 312 TFLOPS peak half precision Tensor Core (FP16 Tensor Core), 39 TFLOPS peak Bfloat16 (BF16), and 312 TFLOPS peak Bfloat16 format precision Tensor Core (BF16 Tensor Core) theoretical floating-point performance. The TF32 data format is not IEEE compliant and is not included in this comparison. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/nvidia-ampere-architecture-whitepaper..., page 15, Table 1. MI200-01
2. Visit the AMD Instinct™ accelerators benchmark page at https://www.amd.com/en/graphics/server-accelerators-benchmarks.
3. MLN-016B: Results as of 07/06/2021 using SPECrate®2017_int_base. The AMD EPYC 7763 scored 854, http://spec.org/cpu2017/results/res2021q3/cpu2017-20210622-27664.html, which is higher than all other 2P scores published on the SPEC® website. SPEC®, SPECrate® and SPEC CPU® are registered trademarks of the Standard Performance Evaluation Corporation. See www.spec.org for more information.
4. The AMD Instinct™ MI250X accelerator has 220 compute units (CUs) and 14,080 stream cores. The AMD Instinct™ MI100 accelerator has 120 compute units (CUs) and 7,680 stream cores. MI200-27
5. Calculations conducted by AMD Performance Labs as of Sep 21, 2021, for the AMD Instinct™ MI250X and MI250 (128GB HBM2e) OAM accelerators designed with AMD CDNA™ 2 6nm FinFET process technology at 1,600 MHz peak memory clock resulted in 128GB HBM2e memory capacity and 3.2768 TB/s peak theoretical memory bandwidth performance. The MI250/MI250X memory bus interface is 4,096 bits times 2 die and the memory data rate is 3.20 Gbps, for a total memory bandwidth of 3.2768 TB/s ((3.20 Gbps*(4,096 bits*2))/8). The highest published results on the NVIDIA Ampere A100 (80GB) SXM GPU accelerator resulted in 80GB HBM2e memory capacity and 2.039 TB/s GPU memory bandwidth performance. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/a100/pdf/nvidia-a100-datasheet-us-nvi... MI200-07
6. Calculations as of Sep 18, 2021. AMD Instinct™ MI250 accelerators built on AMD CDNA™ 2 technology support AMD Infinity Fabric™ technology providing up to 100 GB/s peak theoretical GPU peer-to-peer (P2P) transport bandwidth per AMD Infinity Fabric link, and include up to eight links, providing up to 800 GB/s peak aggregate theoretical GPU P2P transport bandwidth per GPU OAM card. AMD Instinct™ MI100 accelerators built on AMD CDNA technology support PCIe® Gen4, providing up to 64 GB/s peak theoretical CPU-to-GPU transport bandwidth per card, and include three links providing up to 276 GB/s peak theoretical GPU P2P transport bandwidth per GPU card. Combined with PCIe® Gen4 support, this provides an aggregate GPU card I/O peak bandwidth of up to 340 GB/s. Server manufacturers may vary configuration offerings yielding different results. MI200-13
7. As of October 20, 2021, the AMD Instinct™ MI200 series accelerators are the “Most advanced server accelerators (GPUs) for data center,” defined as the only server accelerators to use advanced 6nm manufacturing technology. AMD uses 6nm for AMD Instinct MI200 series server accelerators; Nvidia uses 7nm for the Nvidia Ampere A100 GPU. https://developer.nvidia.com/blog/nvidia-ampere-architecture-in-depth/ MI200-31