We recently invested in a server for the sole purpose of solving engineering simulations in Simulia Abaqus (FEA), as an upgrade to our existing server. For the types of simulations we run, solve time scales roughly inversely with the number of cores. We went from 6 cores on a CPU released in 2014 to 32 cores on a CPU released in 2017 and were expecting roughly a 5x speedup (solving in ~20% of the time). However, we have only seen about a 2x speedup (solving in ~50% of the time), which is much slower than we were expecting.
As a test, we ran the same simulation in the same version of the software on both the new and the old server, using 6 cores on each. The new server took 15.5 hrs to solve, while the old server took just 6 hrs. For this test, we expected the new server to solve in roughly the same ~6 hrs.
We're hoping to get some help to determine why this is the case and if/how we can speed things up.
New System Specs:
Dell PowerEdge R7425
CPUs: 2x EPYC 7351 (16C/32T)
RAM: 192GB DDR4 2666 (12x 16GB)
OS Storage: 480GB SATA SSD
Data Storage: 1.6TB NVMe SSD
OS: CentOS 7.7 (we've tried multiple kernels including 3.10)
PSU: Redundant 1100W (1+1)
Old System Specs:
Dell Unknown
CPU: 1x Xeon E5-2643v3 (6C/12T)
RAM: 128GB DDR4 1600 (8x 16GB)
OS Storage: 250GB SATA SSD
Data Storage: 2TB SAS HDD
OS: Windows Server 2012
PSU: unknown
Did you run any tests on how well your application scales?
I could imagine you might be running into issues described by Amdahl's Law, or more specifically the Universal Scalability Law.
Regards ~Chris
For the program we're using (Abaqus, a widely used FEA package), the number of threads is set manually. We tested both the new server and the old server using 6 threads: the new server took 15.5 hrs to solve, while the old one took only 6 hrs.
For reference, if we increase from 6 to 12 threads on the new server, the run time drops from 15.5 hrs to 8 hrs, so the new machine itself scales as expected. Likewise, going from 12 to 24 threads reduces the run time from 8 hrs to 4 hrs.
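In case it helps, we launch the jobs from the command line with something like the following (job and input names here are just placeholders), changing only the cpus value between runs:
abaqus job=test_job input=test_model.inp cpus=6 interactive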
Abaqus comes with two different solvers: Standard and Explicit. Explicit scales rather well across many cores, even on distributed-memory systems, and it runs pretty well overall on 1st-gen EPYC CPUs. The Standard solver is a different beast entirely. If you had some options to control core binding it would not be as bad, but most of the commands needed are undocumented and do not always work as intended.
From my own testing, I would conclude that 1st-gen EPYC, with its complicated NUMA topology, is just about the worst case for this solver. It doesn't help that your server has a suboptimal memory configuration (12 DIMMs spread across 16 channels). Fixing that might give you better, and more consistent, results, but it won't be a game-changer. 2nd-gen EPYC might be, at least when configured in NPS1 mode. But overall, no CPU available right now or in the foreseeable future would get you the performance increase you hoped for, not with the Abaqus Standard solver.
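If you want to see the topology the Standard solver has to fight with, something along these lines on CentOS will print it (numactl may need to be installed first, e.g. via yum install numactl):
lscpu | grep -i numa
numactl --hardware
On a dual EPYC 7351 you should see 8 NUMA nodes, each with its own slice of the memory.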
Side note: from my own testing, GPU acceleration can work pretty well for Abaqus Standard. That's something to look into if you want to cut your solve times in half or more.
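If I remember correctly, the GPU path is enabled with the gpus option on the Abaqus command line, roughly like this (job name again a placeholder):
abaqus job=test_job cpus=16 gpus=1 interactive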
As flotus1 pointed out there are 2 different solvers for Abaqus.
Which one are you using?
As for the memory configuration, we recommend populating all 8 channels of memory per socket (especially on the Naples generation, since there is no I/O die).
The pinning of the threads/ranks is also critical to achieving optimal performance, especially since you currently have fewer than 1 DIMM per channel.
My recommendation would be to get to a 1-DIMM-per-channel configuration with explicit pinning of the threads, and see whether this improves your situation.
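As a rough sketch of explicit pinning on Linux, assuming the job is launched from a shell and the core numbering reported by lscpu, something like either of these could be used:
numactl --cpunodebind=0 --membind=0 abaqus job=test_job cpus=8 interactive
taskset -c 0-7 abaqus job=test_job cpus=8 interactive
The first binds both the threads and their memory allocations to NUMA node 0; the second pins to an explicit core list. Whether the solver's child processes keep the binding is worth verifying with top or htop.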
I have a similar problem and hope somebody can help me:
We have bought a new workstation for FEA simulations (vehicle crash etc.). Unfortunately we are not getting the expected performance.
Our System:
According to the following benchmark for an older CPU, the simulation time should be under 2000 seconds:
https://www.amd.com/system/files/documents/amd-epyc-with-altair-radioss-powering-hpc.pdf
With our current configuration the same simulation model (Neon1M11) needs 10000 seconds (5 times longer).
For the calculation we are using Intel MPI (-mpi -i -np 64). From Altair we got the following recommendations for the environment variables, which already helped in that the program now actually uses all cores:
KMP_AFFINITY=disabled
I_MPI_PIN_DOMAIN=auto
And we turned off hyperthreading in the BIOS.
But the simulations are still 5 times slower than expected.
Are there any further settings, environment variables, or Windows 10 incompatibilities we might have missed?
Thanks in advance
Hello,
If you have some more memory, it might help to populate 1 DIMM per channel, in other words 8 channels per socket; right now it looks as though you only have 4 channels per socket populated. If that is not possible, maybe you can try a test with just a single socket in a 1-DIMM-per-channel (8 channels per socket) configuration.
Another thing worth checking is the BIOS settings. This document should help with that: https://www.amd.com/system/files/documents/amd-epyc-7002-tg-hpc-56827.pdf
Can you try once more with this env var: I_MPI_DEBUG=5
It will print out the rank pinning, which will help with debugging.
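On Windows it can be set in the same command prompt before launching, or, if you call mpiexec directly, passed on the command line; roughly:
set I_MPI_DEBUG=5
mpiexec -genv I_MPI_DEBUG 5 <rest of your existing RADIOSS command>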
Hi,
as soon as we get the additional DIMMs I will give an update. In the meantime I activated I_MPI_DEBUG, started the run with -mpi -np 32, and got the following information:
[0] MPI startup(): Multi-threaded optimized library
[13] MPI startup(): shm data transfer mode
[0] MPI startup(): shm data transfer mode
[1] MPI startup(): shm data transfer mode
[9] MPI startup(): shm data transfer mode
[10] MPI startup(): shm data transfer mode
[11] MPI startup(): shm data transfer mode
[12] MPI startup(): shm data transfer mode
[14] MPI startup(): shm data transfer mode
[15] MPI startup(): shm data transfer mode
[18] MPI startup(): shm data transfer mode
[2] MPI startup(): shm data transfer mode
[3] MPI startup(): shm data transfer mode
[4] MPI startup(): shm data transfer mode
[5] MPI startup(): shm data transfer mode
[6] MPI startup(): shm data transfer mode
[7] MPI startup(): shm data transfer mode
[8] MPI startup(): shm data transfer mode
[16] MPI startup(): shm data transfer mode
[17] MPI startup(): shm data transfer mode
[19] MPI startup(): shm data transfer mode
[20] MPI startup(): shm data transfer mode
[21] MPI startup(): shm data transfer mode
[22] MPI startup(): shm data transfer mode
[23] MPI startup(): shm data transfer mode
[24] MPI startup(): shm data transfer mode
[25] MPI startup(): shm data transfer mode
[26] MPI startup(): shm data transfer mode
[27] MPI startup(): shm data transfer mode
[28] MPI startup(): shm data transfer mode
[29] MPI startup(): shm data transfer mode
[30] MPI startup(): shm data transfer mode
[31] MPI startup(): shm data transfer mode
[9] MPI startup(): Internal info: pinning initialization was done
[8] MPI startup(): Internal info: pinning initialization was done
[10] MPI startup(): Internal info: pinning initialization was done
[11] MPI startup(): Internal info: pinning initialization was done
[15] MPI startup(): Internal info: pinning initialization was done
[28] MPI startup(): Internal info: pinning initialization was done
[30] MPI startup(): Internal info: pinning initialization was done
[31] MPI startup(): Internal info: pinning initialization was done
[3] MPI startup(): Internal info: pinning initialization was done
[13] MPI startup(): Internal info: pinning initialization was done
[14] MPI startup(): Internal info: pinning initialization was done
[20] MPI startup(): Internal info: pinning initialization was done
[21] MPI startup(): Internal info: pinning initialization was done
[29] MPI startup(): Internal info: pinning initialization was done
[0] MPI startup(): Internal info: pinning initialization was done
[1] MPI startup(): Internal info: pinning initialization was done
[2] MPI startup(): Internal info: pinning initialization was done
[12] MPI startup(): Internal info: pinning initialization was done
[22] MPI startup(): Internal info: pinning initialization was done
[23] MPI startup(): Internal info: pinning initialization was done
[4] MPI startup(): Internal info: pinning initialization was done
[5] MPI startup(): Internal info: pinning initialization was done
[6] MPI startup(): Internal info: pinning initialization was done
[7] MPI startup(): Internal info: pinning initialization was done
[26] MPI startup(): Internal info: pinning initialization was done
[27] MPI startup(): Internal info: pinning initialization was done
[19] MPI startup(): Internal info: pinning initialization was done
[16] MPI startup(): Internal info: pinning initialization was done
[17] MPI startup(): Internal info: pinning initialization was done
[18] MPI startup(): Internal info: pinning initialization was done
[24] MPI startup(): Internal info: pinning initialization was done
[25] MPI startup(): Internal info: pinning initialization was done
[0] MPI startup(): Rank Pid Node name Pin cpu
[0] MPI startup(): 0 12216 CAD-05 {0,1}
[0] MPI startup(): 1 7984 CAD-05 {2,3}
[0] MPI startup(): 2 11908 CAD-05 {4,5}
[0] MPI startup(): 3 12108 CAD-05 {6,7}
[0] MPI startup(): 4 13684 CAD-05 {8,9}
[0] MPI startup(): 5 13880 CAD-05 {10,11}
[0] MPI startup(): 6 14172 CAD-05 {12,13}
[0] MPI startup(): 7 14328 CAD-05 {14,15}
[0] MPI startup(): 8 9140 CAD-05 {16,17}
[0] MPI startup(): 9 14000 CAD-05 {18,19}
[0] MPI startup(): 10 10808 CAD-05 {20,21}
[0] MPI startup(): 11 14264 CAD-05 {22,23}
[0] MPI startup(): 12 10724 CAD-05 {24,25}
[0] MPI startup(): 13 13848 CAD-05 {26,27}
[0] MPI startup(): 14 7100 CAD-05 {28,29}
[0] MPI startup(): 15 10764 CAD-05 {30,31}
[0] MPI startup(): 16 2612 CAD-05 {32,33}
[0] MPI startup(): 17 13968 CAD-05 {34,35}
[0] MPI startup(): 18 8064 CAD-05 {36,37}
[0] MPI startup(): 19 14324 CAD-05 {38,39}
[0] MPI startup(): 20 13680 CAD-05 {40,41}
[0] MPI startup(): 21 12372 CAD-05 {42,43}
[0] MPI startup(): 22 4732 CAD-05 {44,45}
[0] MPI startup(): 23 11572 CAD-05 {46,47}
[0] MPI startup(): 24 8260 CAD-05 {48,49}
[0] MPI startup(): 25 11360 CAD-05 {50,51}
[0] MPI startup(): 26 13788 CAD-05 {52,53}
[0] MPI startup(): 27 9792 CAD-05 {54,55}
[0] MPI startup(): 28 6920 CAD-05 {56,57}
[0] MPI startup(): 29 2008 CAD-05 {58,59}
[0] MPI startup(): 30 3896 CAD-05 {60,61}
[0] MPI startup(): 31 9264 CAD-05 {62,63}
[0] MPI startup(): I_MPI_ADJUST_BCAST=1
[0] MPI startup(): I_MPI_ADJUST_REDUCE=2
[0] MPI startup(): I_MPI_DEBUG=5
[0] MPI startup(): I_MPI_PIN_DOMAIN=auto
[0] MPI startup(): I_MPI_PIN_MAPPING=32:0 0,1 2,2 4,3 6,4 8,5 10,6 12,7 14,8 16,9 18,10 20,11 22,12 24,13 26,14 28,15 30,16 32,17 34,18 36,19 38,20 40,21 42,22 44,23 46,24 48,25 50,26 52,27 54,28 56,29 58,30 60,31 62
So for us the final solution was to fill all the DIMM slots. After that, we achieved the benchmark performance.
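For anyone who wants to double-check their own DIMM population without opening the case, on Windows something like this lists the occupied slots, capacities and speeds:
wmic memorychip get DeviceLocator, Capacity, Speed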
Hi all,
sorry to reopen an old thread.
I'm about to go for the same upgrade with the same aim: from an older-generation 6-core Xeon to a dual EPYC 7343 workstation with all memory channels populated (16x 16GB DDR4 ECC @ 3200 MHz), to run FEA in Abaqus.
My question to @EH is: did you get the desired performance also with Abaqus Standard?
I am actually interested in the implicit solver only.
Best,
w