I am trying to run HPL on a dual-socket AMD EPYC node. I'm getting pretty low numbers, so I wanted to see if anyone has had success getting 80-90% of theoretical peak with HPL, i.e. what compiler, MPI version, OMP settings, HPL config, BIOS settings, etc.
I've tried a variety of configurations in HPL.dat, including 30000 to 82000 for N, 192/200/212 for NB, and a variety of P and Q values.
I compiled the latest HPL and the BLIS framework with GCC 6.3.0.
ssawyer sorry for the delay in responding. Summer holiday and business travel.
First thing to keep in mind is the theoretical maximum double-precision FLOPs on EPYC. Although EPYC supports SIMD operations and the corresponding ISA extensions up through AVX2 (256-bit vector width / 4 packed double-precision words), the Zen1 core executes a 256-bit operation as two 128-bit micro-ops over two clock cycles. Thus the maximum double-precision FLOPs per core per clock cycle on Zen1 is 8. That said, EPYC does not suffer a frequency slow-down when running AVX2-heavy code like HPL.

The practical upper limit for HPL on a 2P EPYC server running the 7601 SKU (top of stack) is about 1200 GFLOPs. If your 2P score is at least 1000 GFLOPs on a similar setup, chances are your HPL build is fine and you should explore other, more minor system tuning options.

Lastly, note that on the 7601, running the memory at 2400 MT/s vs. 2667 MT/s saves about 15 watts per socket - this power headroom lets the cores boost higher when running HPL. If you have disabled boost, this is irrelevant; with boost enabled it yields roughly 10% higher scores.
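To make the peak arithmetic concrete, here is a quick sketch for a 2P 7601 (2 sockets x 32 cores; the 2.2 GHz base clock is my assumption - boost pushes the practical ceiling higher):

```shell
# Peak DP GFLOPs = cores x clock (GHz) x FLOPs/cycle/core
# 2P EPYC 7601: 64 cores, 2.2 GHz base clock (assumed), 8 DP FLOPs/cycle on Zen1
cores=64
ghz=2.2
flops_per_cycle=8
awk -v c="$cores" -v g="$ghz" -v f="$flops_per_cycle" \
    'BEGIN { printf "%.1f GFLOPs peak\n", c * g * f }'
```

That lands at roughly 1.1 TFLOPs at base clock, which is consistent with the ~1200 GFLOPs practical ceiling once boost is factored in.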
Building HPL is straightforward with BLIS - you need only provide a Makefile with the locations of the downloaded BLIS library and the installed MPI. You can build BLIS from source (use at least GCC 7 for best results), or just download the pre-built binary from the AMD website.
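For reference, a minimal sketch of the relevant lines in HPL's Make.&lt;arch&gt; file, assuming BLIS lives under $(HOME)/blis and OpenMPI under /usr - all the paths here are illustrative, so adjust them for your system:

```make
# Illustrative excerpt of an HPL Make.<arch> file -- paths are assumptions
ARCH         = Linux_EPYC
TOPdir       = $(HOME)/hpl-2.3
MPdir        = /usr
MPinc        = -I$(MPdir)/include
MPlib        = -L$(MPdir)/lib -lmpi
LAdir        = $(HOME)/blis
LAinc        = -I$(LAdir)/include/blis
LAlib        = $(LAdir)/lib/libblis.a
CC           = mpicc
CCFLAGS      = $(HPL_DEFS) -O3 -march=znver1 -fomit-frame-pointer
LINKER       = $(CC)
```

Then build with `make arch=Linux_EPYC` from the HPL top directory.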
When running HPL, you have two choices - the difference between the two is minimal. The example below is for OpenMPI:
1) Single-thread BLIS (use one MPI rank per core, so ensure that P x Q = the total number of cores)
mpirun -np <# of cores> --map-by core --mca btl self,vader xhpl
2) Multi-thread BLIS (use one MPI rank per L3 cache, so ensure that P x Q = 16 for a 2P EPYC)
export OMP_PLACES=cores # Only needed if SMT is enabled
export OMP_NUM_THREADS=4 # change to 3 for 24-core EPYC SKUs, change to 2 for 16-core EPYC SKUs
mpirun -np 16 --map-by l3cache --mca btl self,vader xhpl
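Either way, the process grid in HPL.dat has to match the rank count. A sketch of the relevant lines for the 16-rank multi-threaded case (a 4 x 4 grid; the N and NB values shown are just the ones discussed below):

```
1            # of problems sizes (N)
168960       Ns
1            # of NBs
232          NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
4            Ps
4            Qs
```

For the single-threaded case on a 64-core 2P box you would instead use a grid such as P = 8, Q = 8.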
The best problem size uses as much of the memory as possible. For example, on a machine with 256 GB of memory I use N = 168960. I use NB = 232, though I haven't seen a lot of sensitivity to this figure and your value should be fine.
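As a rule of thumb, N ≈ sqrt(memory_bytes x fill / 8), rounded down to a multiple of NB. A sketch for 256 GiB - the 85% fill fraction is my assumption; tune it to what your node can actually spare after the OS and MPI buffers:

```shell
mem_gib=256
fill=0.85   # fraction of RAM given to the HPL matrix (assumed)
nb=232
awk -v m="$mem_gib" -v f="$fill" -v nb="$nb" 'BEGIN {
    n = sqrt(m * 2^30 * f / 8)   # each matrix element is an 8-byte double
    n = int(n / nb) * nb         # round down to a multiple of NB
    print n
}'
```

If the run starts swapping, lower the fill fraction rather than the block size.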
Thanks for the help, monkey. I had actually just gotten it to work a few days before your response. I had a problem in the configuration of the job submission script. In the end, I was able to get a tad over 1000 GFLOPs.