Server Processors

ssawyer · 08-14-2018

Hi,

I am trying to run HPL on a dual-socket AMD EPYC node. I'm getting pretty low numbers, so I wanted to see if anyone has had success getting 80-90% of theoretical peak with HPL, and if so with what compiler, MPI version, OMP settings, HPL config, BIOS settings, etc.

I've tried a variety of configurations in HPL.dat including things like 30000 to 82000 for N, 192/200/212 for NB, and a variety of P's and Q's.

I compiled the latest HPL and BLIS framework using version 6.3.0 of the GNU Compiler Collection (GCC).

Thanks!

linux_monkey · 09-04-2018

ssawyer, sorry for the delay in responding: summer holiday and business travel.

The first thing to keep in mind is the theoretical maximum double-precision FLOPs on EPYC. Although EPYC supports SIMD operations and the corresponding ISA extensions up through AVX2 (256-bit vector width / 4 packed double-precision words), the Zen1 core takes two clock cycles to process a 256-bit vector because its floating-point datapaths are 128 bits wide. Thus the maximum double-precision FLOPs per core per clock cycle on Zen1 is 8. That said, EPYC does not suffer from frequency slow-down when running AVX2-heavy code like HPL.

The practical upper limit for HPL on a 2P EPYC server running the 7601 SKU (top of stack) is 1200 GFLOPs. If your 2P score is at least 1000 GFLOPs on a similar set-up, chances are your HPL build is fine and you should explore other, more minor system tuning options.

Lastly, note that on the 7601, running the memory at 2400 MT/s instead of 2667 MT/s saves about 15 watts per socket; this power headroom allows the cores to boost more when running HPL. With boost enabled this gives about 10% higher scores. If you have disabled boost, it is irrelevant.
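
As a rough cross-check of the peak figures above, theoretical peak is sockets x cores x FLOPs per cycle x clock. Below is a minimal Python sketch, assuming a 2P EPYC 7601 with 32 cores per socket and a 2.2 GHz base clock. The core count and clock are assumptions for illustration, not figures from this post, and the function names are hypothetical:

```python
def peak_gflops(sockets, cores_per_socket, flops_per_cycle, clock_ghz):
    """Theoretical double-precision peak in GFLOPS."""
    return sockets * cores_per_socket * flops_per_cycle * clock_ghz


def hpl_efficiency(measured_gflops, peak):
    """Fraction of the theoretical peak that an HPL run achieved."""
    return measured_gflops / peak


# Assumed system: 2 sockets, 32 cores each, 8 DP FLOPs/cycle (Zen1),
# 2.2 GHz base clock. Replace the clock with the sustained all-core
# frequency you actually observe during the run.
base_peak = peak_gflops(2, 32, 8, 2.2)
print(f"peak at base clock: {base_peak:.1f} GFLOPS")
print(f"1000 GFLOPS is {hpl_efficiency(1000, base_peak):.0%} of that")
```

Under these assumptions, a score above the base-clock peak is only possible when the cores sustain a clock above base, which is why the boost headroom matters.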

Building HPL is straightforward with BLIS - you need only provide a Makefile with the location of the downloaded BLIS library and installed MPI. You can build BLIS from source (use at least GCC 7 for best results), or you can just download the pre-configured binary from the AMD website.

When running HPL, you have two choices; the performance difference between the two is minimal. The examples below are for Open MPI:

1) Single-threaded BLIS (use one MPI rank per core, so ensure that P x Q = the total number of cores)

mpirun -np <# of cores> --map-by core --mca btl self,vader xhpl

2) Multi-threaded BLIS (use one MPI rank per L3 cache, so ensure that P x Q = 16 for a 2P EPYC)

export OMP_PROC_BIND=TRUE

export OMP_PLACES=cores # Only needed if SMT is enabled

export OMP_NUM_THREADS=4 # change to 3 for 24-core EPYC SKUs, change to 2 for 16-core EPYC SKUs

mpirun -np 16 --map-by l3cache --mca btl self,vader xhpl
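
The rank and thread counts for the two options can be derived rather than memorized. The Python sketch below assumes 8 L3 caches per socket (inferred from the 16 ranks on a 2P system above) and picks a near-square process grid with P <= Q, which is a common HPL rule of thumb rather than something stated in this thread. The function names are hypothetical:

```python
import math


def hpl_layout(sockets, cores_per_socket, l3_per_socket=8, multithreaded=True):
    """Return (mpi_ranks, omp_threads_per_rank) for the two options above."""
    total_cores = sockets * cores_per_socket
    if not multithreaded:
        return total_cores, 1            # option 1: one rank per core
    ranks = sockets * l3_per_socket      # option 2: one rank per L3 cache
    return ranks, total_cores // ranks


def process_grid(ranks):
    """Pick P <= Q with P * Q == ranks, as close to square as possible."""
    p = math.isqrt(ranks)
    while ranks % p:
        p -= 1
    return p, ranks // p


for cores in (32, 24, 16):
    ranks, threads = hpl_layout(2, cores)
    print(cores, "cores/socket:", ranks, "ranks,", threads, "threads, grid", process_grid(ranks))
```

The thread count maps to OMP_NUM_THREADS and the rank count to the -np argument of mpirun; P and Q go into HPL.dat.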

The best problem size uses as much of the memory as possible. For example, on a machine with 256 GB of memory I use N = 168960. I use NBs = 232, though I haven't seen much sensitivity to this figure and your value should be fine.
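
To size N for a different amount of memory: the HPL matrix holds N x N double-precision values, so it needs 8 * N^2 bytes. The Python sketch below assumes a target of 80% of memory and rounds N down to a multiple of NB. Both choices are common rules of thumb and are assumptions here, not settings taken from this post, and the function names are hypothetical:

```python
import math


def hpl_problem_size(mem_bytes, nb, mem_fraction=0.8):
    """Largest N that is a multiple of nb with 8 * N * N <= mem_fraction * mem_bytes."""
    n_max = math.isqrt(int(mem_fraction * mem_bytes) // 8)
    return (n_max // nb) * nb


def matrix_gib(n):
    """Memory taken by the N x N double-precision matrix, in GiB."""
    return 8 * n * n / 2**30


# Assumed: "256GB" means 256 GiB of installed memory.
n = hpl_problem_size(256 * 2**30, nb=232)
print(f"suggested N = {n} ({matrix_gib(n):.1f} GiB)")
print(f"N = 168960 uses {matrix_gib(168960):.1f} GiB")
```

Raising mem_fraction moves N toward the value quoted above; leave enough room for the OS and MPI buffers so the run does not start swapping.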

-Monkey

ssawyer · 09-08-2018

Thanks for the help, Monkey. I had actually just gotten it to work a few days before your response. I had a problem in the configuration of the job submission script. In the end, I was able to get a tad over 1000 GFLOPs.

Thank you,

ssawyer

sho1sho1sho1 · 04-20-2023

Hi, I know this is an old topic, but could you let me know what you put in the Makefile for the below sections? Thanks!

MPdir = /usr/lib64/openmpi
MPinc =
MPlib = $(MPdir)/lib/libmpi.so

LAdir = /opt/AMD/amd-blis
LAinc =
LAlib = $(LAdir)/lib/LP64/libblis-mt.a

CC = /usr/lib64/openmpi/bin/mpicc
CCNOOPT = $(HPL_DEFS)
CCFLAGS = $(HPL_DEFS) -fomit-frame-pointer -O3 -funroll-loops -W -Wall
#
LINKER = /usr/bin/gcc
LINKFLAGS = $(CCFLAGS)

shrjoshi · 04-21-2023

Hello @sho1sho1sho1 ,

Thank you for writing to us.
We suggest building HPL with AOCC.
The Makefile options are as follows:
MPdir = /usr/lib64/openmpi
MPinc = -I$(MPdir)/include
MPlib = $(MPdir)/lib/libmpi.so

LAdir = /opt/AMD/amd-blis
LAinc = -I$(LAdir)/include/blis
LAlib = $(LAdir)/lib/LP64/libblis-mt.a

CC           = clang
CCNOOPT      = $(HPL_DEFS)
CCFLAGS      = $(HPL_DEFS) -O3 -ffast-math -funroll-loops -march=znver3 -fopenmp -

LINKER = clang
LINKFLAGS = -fopenmp -O3 -ffast-math -funroll-loops -march=znver3 -lamdlibm -lm

Please note: use the -march flag appropriate for your processor. Here, znver3 targets the Milan (Zen 3) architecture.

For better performance, we suggest using the pre-built HPL from:
https://www.amd.com/en/developer/zen-software-studio/applications/pre-built-applications/zen-hpl-avx...

To make the best use of the AMD architecture, please follow the “Run HPL” section at the following link:
https://www.amd.com/en/developer/spack/hpl-benchmark.html

Please reach out to us in case further clarification is required.

HPL Benchmarking