Hello,
My question is about the performance of a 7742 server, in particular the HPCG benchmark. I'm following the "AMD - High Performance Computing: Tuning Guide for AMD EPYC 7002 Series Processors," but HPCG performance scales poorly once the MPI rank count gets high.
To compare my results against someone else's, I found this post (https://www.pugetsystems.com/labs/hpc/HPC-Parallel-Performance-for-3rd-gen-Threadripper-Xeon-3265W-and-EPYC-7742-HPL-HPCG-Numpy-NAMD-1717/), and its sixth figure (HPCG scaling) shows results very similar to mine. In fact, the 7551 there produces higher HPCG results, which was not what I expected.
My background is definitely not in microprocessor design, so I might be missing something, but my only explanation is that the shared L3 cache is limiting memory access (since HPCG solves a sparse matrix, not a dense one). Not knowing how many people in the AMD Server Gurus community are familiar with this benchmark, I was wondering if anyone has comments or suggestions, on system setup or anything else, that could improve the benchmark results.
Thanks in advance.
Hello,
We find that HPCG is a memory-bandwidth-bound benchmark. For better scaling with HPCG, I would recommend trying a hybrid configuration of MPI ranks and OpenMP threads.
For example, on a single 7742 node you can try 32 ranks x 4 threads per rank. That way you have one rank per CCX, with each CCX running 4 cores (or threads).
Multiplying the two gives a total of 128 cores per 7742 server: 32 x 4 = 128.
I would also check the pinning of the MPI ranks and threads to ensure that each rank is pinned to exactly one CCX.
I believe this should help your scalability problem with HPCG.
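In case a concrete starting point helps, here is a minimal sketch of how that 32 ranks x 4 threads layout could be launched. The option syntax assumes Open MPI 4.x, and the xhpcg binary name is just a placeholder; adjust both for your actual MPI stack and build.

    # one MPI rank per L3 cache domain (one CCX on Zen 2), 4 cores reserved per rank
    export OMP_NUM_THREADS=4        # 4 OpenMP threads per MPI rank
    export OMP_PROC_BIND=close      # keep each rank's threads on its own assigned cores
    export OMP_PLACES=cores
    mpirun -np 32 --map-by ppr:1:l3cache:pe=4 --bind-to core ./xhpcg

The key part is the mapping: ppr:1:l3cache places one rank per L3 cache, and pe=4 reserves 4 processing elements (cores) for that rank's OpenMP threads.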
Hello @anre-amd,
Thank you for your message. I tried combining MPI and OpenMP earlier this month, but the results didn't improve. Your comment about pinning each MPI rank to one CCX is very interesting. My suspicion was that this pinning was not happening (and hence no performance improvement), but do you have any suggestions on how to force it without modifying the app itself?
Thanks.
You can use MPI runtime options to ensure the pinning is correct. Which MPI implementation and version are you using? Open MPI? Intel MPI?
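If it turns out to be Open MPI, one suggestion (assuming a 4.x release and using xhpcg as a placeholder binary name): the --report-bindings flag prints each rank's core mask at launch, which makes it easy to confirm that every rank really landed on its own CCX.

    # prints the binding mask of every rank to stderr at startup
    mpirun -np 32 --map-by ppr:1:l3cache:pe=4 --bind-to core --report-bindings ./xhpcg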
Hello @anre,
I apologize for the delay. I'm using Open MPI v4.0.3 and was able to pin the ranks as desired. However, the hybrid results are a bit lower than using MPI only. I have spent quite a bit more time on HPCG than planned, and have moved on to assessing other benchmarks and applications (a task currently underway). Thanks.
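In case it helps anyone else reading this thread, an MPI-only run on the same node could be launched roughly as sketched below. This is a generic example rather than the exact command I used, and xhpcg is again a placeholder name.

    # one rank per physical core, no OpenMP threading
    export OMP_NUM_THREADS=1
    mpirun -np 128 --map-by core --bind-to core ./xhpcg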