AMD and Hewlett Packard Enterprise (HPE) have had a long-standing, collaborative history since the first-generation AMD EPYC™ processors were announced several years ago. From SMBs to HPC to supercomputers, more and more customers worldwide have built their businesses on the value and industry-leading performance that AMD-based HPE systems deliver.
"HPE is focused on bringing to market the most innovative technology that delivers the best value for our customers so they can accelerate the delivery of new ideas, products and services," said Bill Mannel, VP and GM of High Performance Computing (HPC) at Hewlett Packard Enterprise. "With the HPE Apollo 6500 Gen10 Plus System, we are delivering a flexible, accelerated compute platform based on the AMD EPYC SoC and AMD Instinct MI100 GPUs to optimize performance for HPC applications."
"The AMD EPYC SoC delivers a better balance of cores, memory, and I/O for optimal performance on today's HPC workloads," said John Morris, CVP, Enterprise and HPC Business Group, Advanced Micro Devices. "EPYC enables the new HPE Apollo 6500 Gen10 System to deliver unique capabilities and address the needs of a broad range of accelerated compute applications."
AMD is the new standard for High Performance Computing
AMD has proven leadership in all areas critical for success in the HPC market. Let’s start with performance. AMD EPYC 7002 Series Processors deliver world-record performance with an ~2.3X generational performance increase and outpace the Intel Xeon Platinum 8280L by up to 102%.1 AMD offers a simple SKU stack from 8 cores up to 64 cores, plus high-frequency SKUs, so you can choose the right processor for the job. Does your application require lots of L3 cache? No problem. AMD EPYC processors offer 16 MB per 4 cores. Do you need access to lots of memory? We have that covered. With support for 8 channels and up to 4 TB per socket at speeds up to 3200 MT/s, AMD EPYC processors deliver the highest available memory capacity and bandwidth. Need lots of I/O bandwidth? We offer plenty, with 128 PCIe lanes that can be used however you wish. And you can choose without compromise, since the same feature set is available from the top of the stack to the bottom without a cost increase.
In addition, AMD EPYC 7002 Series Processors offer ground-breaking TCO: two-socket AMD EPYC powered servers can deliver up to 49% lower TCO than 2P competitor-powered servers.2 This matters for HPC applications with high per-core license costs, where every core must be used efficiently.
Next, AMD EPYC processors are “hardened at the core” and boast a set of advanced security features called AMD Infinity Guard. Infinity Guard features a silicon-embedded security processor that helps your organization take control of security and decrease risks to your most important assets.
And finally, the AMD Infinity Architecture delivers the performance, scale, efficiency, and security features for the agility to move at the speed of your business, now and into the future. 2nd Gen AMD EPYC processors have a hybrid, multi-die SoC design and are the first x86 server CPUs built on 7nm process technology and the first to support PCIe Gen4.
What makes the HPE Apollo 6500 Gen10 Plus System with AMD so unique?
The HPE Apollo 6500 Gen10 Plus system is the first accelerated compute system in the HPE portfolio to embrace the power of both 2nd Gen AMD EPYC processors and the new AMD Instinct™ MI100 GPU with the all-new AMD CDNA architecture, which brings Matrix Core Technology and Infinity Fabric™ Link technology for fast P2P data sharing. Designed to deliver performance for AI, machine learning, and deep learning as well as traditional HPC applications, AMD EPYC 7002 Series processors offer tremendous bandwidth and high core counts to continuously feed data-hungry GPUs. Paired with HDR InfiniBand, these high-frequency processors provide up to 200 gigabits per second of network bandwidth for every two GPUs, so even businesses operating at the cluster level can communicate at twice the speed. With the ability to configure dual AMD Instinct MI100 quad-GPU hives, customers gain access to a server with up to 1.1 TB/s of peak peer-to-peer I/O bandwidth.3 And with a refreshed 8-GPU offering that supports two 2nd Gen AMD EPYC processors, enterprises can now harness a system with up to 16 PCIe GPUs, more than twice what HPE has supported in the past.
The AMD Instinct™ MI100 GPUs are expertly engineered for the next wave of HPC and AI, enhancing accelerated computing so that enterprises can propel world-changing discoveries. The AMD Instinct MI100 GPUs are the industry’s first accelerators to deliver over 10 TFLOPS of double-precision (FP64) performance for HPC, and they bring new AMD Matrix Core Technology delivering a nearly 7x boost in FP16 throughput for AI training workloads compared to the previous generation.4,5
And this is only the beginning
At AMD, we’re committed to delivering the proven technology and meaningful innovation that our customers expect from the industry leader. More great things are on the horizon that will enable workloads of any size and scale.
1. ROM-09 - AMD EPYC 7742 has 64 cores vs. the Intel Platinum 8280 with 28 cores: 64 ÷ 28 ≈ 2.3, i.e., 2.3x the cores (230%), or 1.3x (130%) more cores.
ROM-11 - The EPYC™ 7002 series has 8 memory channels supporting 3200 MT/s DIMMs, yielding 204.8 GB/s of bandwidth, vs. the same class of Intel Scalable Gen 2 processors with only 6 memory channels supporting 2933 MT/s DIMMs, yielding 140.8 GB/s of bandwidth. 204.8 ÷ 140.8 ≈ 1.45, so AMD EPYC has 45% more bandwidth. Class based on industry-standard pin-based (LGA) x86 processors.
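The bandwidth figures in this footnote follow from channels × transfer rate × 8 bytes per transfer (a 64-bit DDR4 channel moves 8 bytes per transfer); a minimal sketch of the arithmetic, using only the quoted inputs:

```python
# Peak DDR4 bandwidth = channels x transfer rate (MT/s) x 8 bytes per 64-bit transfer.
def peak_bw_gbs(channels, mts):
    """Peak theoretical memory bandwidth in GB/s."""
    return channels * mts * 8 / 1000  # MT/s x 8 B = MB/s; / 1000 -> GB/s

epyc = peak_bw_gbs(8, 3200)   # 8 channels at 3200 MT/s -> 204.8 GB/s
xeon = peak_bw_gbs(6, 2933)   # 6 channels at 2933 MT/s -> ~140.8 GB/s
print(round(epyc, 1), round(xeon, 1))            # 204.8 140.8
print(f"{epyc / xeon - 1:.0%} more bandwidth")   # 45% more bandwidth
```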
2. ROM-331 - Compares delivering 11,550,120 jOPS as measured by the SPECjbb®2015-MultiJVM Max-jOPS benchmark utilizing 2-socket Intel 8280 servers versus 2-socket AMD EPYC™ 7742 servers. The Intel-based server tested to achieve 194,608 jOPS (http://www.spec.org/jbb2015/results/res2019q2/jbb2015-20190313-00374.html). The AMD EPYC server tested to achieve 355,121 jOPS (http://www.spec.org/jbb2015/results/res2019q3/jbb2015-20190717-00460.html). As a result, an estimated 60 Intel-based servers versus 33 AMD EPYC™ based servers are needed to meet a jOPS performance of 11,550,120.
All calculations are based on AMD's best estimates of what actual costs and other values will be for both AMD and Intel based platforms.
System Configurations: Intel Xeon based servers include blade chassis with blade servers, each with (2) Intel® Xeon® Platinum 8280 @ $10,009 ea., (24) 16GB RDIMM DDR4 2933MT/s @ $87 ea., (1) 1TB SATA HDD @ $387, plus chassis with power supplies and NIC @ $2,500, for a price of $24,606 each and a total hardware acquisition price of $1,476,360. AMD EPYC™ servers include a dual-socket 2U rack-mount chassis with (2) AMD EPYC™ 7742 @ $6,950 ea., (16) 64GB RDIMM DDR4 2933MT/s @ $349 ea., (1) 1TB SATA HDD @ $387, plus chassis with power supplies and NIC @ $2,200, for a price of $21,684 each and a total hardware acquisition price of $715,572. Estimated System Pricing: Estimated pricing for both systems as of 9/16/2019.
Power estimates: AMD 515 watts per server, for a total solution power draw of 12,236.4 kWh per month; Intel 581 watts per server, for a total solution power draw of 25,099.2 kWh per month. 3-year total power cost with a PUE of 2 and a power cost of $0.12 per kWh: AMD - $105,722.50; Intel - $216,857.09.
Data center 3-year real estate cost estimates, based on $20/month/sq ft and 27 sq ft per 40 RU rack: Intel, 1 rack cabinet at $19,440; AMD, 1.65 rack cabinets at $32,076. Server administration cost is calculated with an estimate of $110,500 annually per server administrator (including a 30% burden) and a ratio of one administrator per 30 servers, resulting in 3-year administration costs of $663,000 for Intel (60 servers) and $364,650 for AMD (33 servers).
Total estimated 3 Year TCO as a result is $2,375,657 for Intel-based Systems and $1,218,020 for AMD EPYC-based systems. As a result, AMD EPYC based systems are estimated to deliver up to a 49% lower TCO (excluding software costs).
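The 3-year totals above can be reproduced from the per-component figures quoted in this footnote; the sketch below only re-runs that arithmetic (the 720 hours/month figure is an assumption implied by the quoted monthly kWh numbers):

```python
# Re-runs the 3-year TCO arithmetic quoted in this footnote (all inputs quoted above).
HOURS_PER_MONTH, MONTHS, PUE, KWH_RATE = 720, 36, 2, 0.12  # 720 h/month is assumed

def tco_3yr(servers, server_price, watts, racks):
    hardware = servers * server_price
    kwh_per_month = watts * servers * HOURS_PER_MONTH / 1000
    power = kwh_per_month * MONTHS * PUE * KWH_RATE
    real_estate = racks * 27 * 20 * MONTHS   # 27 sq ft/rack at $20/month/sq ft
    admin = servers / 30 * 110_500 * 3       # 1 admin per 30 servers, $110,500/yr
    return hardware + power + real_estate + admin

intel = tco_3yr(60, 24_606, 581, 1)      # ~$2,375,657
amd = tco_3yr(33, 21_684, 515, 1.65)     # ~$1,218,020
print(f"AMD TCO is {1 - amd / intel:.0%} lower")  # AMD TCO is 49% lower
```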
Disclaimer: This scenario contains many assumptions and estimates and, while based on AMD internal research and best approximations, should be considered an example for informational purposes only and not used as a basis for decision making in place of actual testing.
3. Calculations as of Sep 18, 2020. AMD Instinct™ MI100 accelerators, built on AMD CDNA technology and supporting PCIe® Gen4, provide up to 64 GB/s peak theoretical transport data bandwidth from CPU to GPU per card.
AMD Instinct™ MI100 accelerators include three Infinity Fabric™ links providing up to 276 GB/s peak theoretical GPU-to-GPU or Peer-to-Peer (P2P) transport bandwidth per GPU card. Combined with PCIe Gen4 support, this provides an aggregate GPU card I/O peak bandwidth of up to 340 GB/s.
MI100s have three links: 92 GB/s * 3 links per GPU = 276 GB/s. Four GPU hives provide up to 552 GB/s peak theoretical P2P performance. Dual 4 GPU hives in a server provide up to 1.1 TB/s total peak theoretical direct P2P performance per server.
AMD Infinity Fabric link technology not enabled: Four GPU hives provide up to 256 GB/s peak theoretical P2P performance with PCIe® 4.0. Server manufacturers may vary configuration offerings yielding different results. MI100-07
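The per-card and per-server numbers in footnote 3 compose as follows; treating the four-GPU hive as fully connected (six links total) is an assumption consistent with the quoted 552 GB/s figure:

```python
# Composes the quoted per-link figures into per-card and per-server bandwidth.
LINK_BW = 92        # GB/s per Infinity Fabric link (quoted)
LINKS_PER_GPU = 3
PCIE4_BW = 64       # GB/s peak theoretical CPU-to-GPU per card (quoted)

per_gpu_p2p = LINK_BW * LINKS_PER_GPU    # 276 GB/s P2P per card
card_io = per_gpu_p2p + PCIE4_BW         # 340 GB/s aggregate I/O per card
hive_links = 4 * LINKS_PER_GPU // 2      # 6 links in a fully connected 4-GPU hive
hive_p2p = hive_links * LINK_BW          # 552 GB/s per hive
server_p2p = 2 * hive_p2p                # 1104 GB/s ~= 1.1 TB/s for dual hives
print(per_gpu_p2p, card_io, hive_p2p, server_p2p)  # 276 340 552 1104
```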
4. Calculations conducted by AMD Performance Labs as of Sep 18, 2020 for the AMD Instinct™ MI100 (32GB HBM2 PCIe® card) accelerator at 1,502 MHz peak boost engine clock resulted in 11.54 TFLOPS peak double precision (FP64), 46.1 TFLOPS peak single precision matrix (FP32), 23.1 TFLOPS peak single precision (FP32), and 184.6 TFLOPS peak half precision (FP16) theoretical floating-point performance. Published results on the NVIDIA Ampere A100 (40GB) GPU accelerator resulted in 9.7 TFLOPS peak double precision (FP64), 19.5 TFLOPS peak single precision (FP32), and 78 TFLOPS peak half precision (FP16) theoretical floating-point performance. Server manufacturers may vary configuration offerings yielding different results. MI100-03
5. Calculations performed by AMD Performance Labs as of Sep 18, 2020 for the AMD Instinct™ MI100 accelerator at 1,502 MHz peak boost engine clock resulted in 184.57 TFLOPS peak theoretical half precision (FP16) and 46.14 TFLOPS peak theoretical single precision (FP32 Matrix) floating-point performance. The results calculated for Radeon Instinct™ MI50 GPU at 1,725 MHz peak engine clock resulted in 26.5 TFLOPS peak theoretical half precision (FP16) and 13.25 TFLOPS peak theoretical single precision (FP32 Matrix) floating-point performance. Server manufacturers may vary configuration offerings yielding different results. MI100-04
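The body's "nearly 7x" FP16 claim follows directly from the two figures quoted in this footnote:

```python
# Ratio of the two quoted FP16 figures behind the "nearly 7x" claim.
mi100_fp16 = 184.57  # TFLOPS peak theoretical FP16, MI100 (quoted)
mi50_fp16 = 26.5     # TFLOPS peak theoretical FP16, MI50 (quoted)
print(f"{mi100_fp16 / mi50_fp16:.2f}x")  # 6.96x -> "nearly 7x"
```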