We recently invested in a server for the sole purpose of solving engineering simulations in Simulia Abaqus (FEA). This server was to be an upgrade to our existing server. For the types of simulations we do, the time required for a simulation scales inversely with the number of cores. We went from 6 cores on a CPU released in 2014 to 32 cores on a CPU released in 2017 and were expecting a ~5x increase in speed (solving in 20% of the time). However, we've only realized about a 2x increase in speed (solving in 50% of the time). This is much slower than we were expecting.
As a test, we ran the same simulation in the same version of the software on both the new and the old server, using 6 cores on each. The new server took 15.5 hrs to solve, while the old server took just 6 hrs. For this test, we expected the new server to solve in ~6 hrs as well.
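A quick sanity check of these numbers, as a rough sketch: it assumes run time scales perfectly inversely with core count (the poster's own expectation) and uses only the core counts and 6-core run times quoted above. The variable names are illustrative.

```python
# Figures quoted in the post
old_cores, new_cores = 6, 32
old_hours_6c = 6.0    # old server, 6 cores
new_hours_6c = 15.5   # new server, 6 cores

# Speedup expected from core count alone (assumes ideal linear scaling)
ideal_speedup = new_cores / old_cores

# How much slower the new server is at an equal core count
per_core_slowdown = new_hours_6c / old_hours_6c

# Net speedup with all 32 cores if scaling on the new server were ideal
net_speedup = ideal_speedup / per_core_slowdown

print(f"ideal speedup from cores: {ideal_speedup:.2f}x")
print(f"slowdown at equal cores:  {per_core_slowdown:.2f}x")
print(f"implied net speedup:      {net_speedup:.2f}x")
```

Under that assumption, the roughly 2.6x slowdown at equal core count accounts for most of the gap between the expected ~5x and the observed ~2x.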
We're hoping to get some help to determine why this is the case and if/how we can speed things up.
New System Specs:
Dell Poweredge R7425
CPUs: 2x Epyc 7351 (16C/32T)
RAM: 192GB DDR4 2666 (12x 16GB)
OS Storage: 480GB SATA SSD
Data Storage: 1.6TB NVMe SSD
OS: CentOS 7.7 (we've tried multiple kernels including 3.10)
PSU: Redundant 1100W (1+1)
Old System Specs:
CPU: 1x Xeon E5-2643v3 (6C/12T)
RAM: 128GB DDR4 1600 (8x 16GB)
OS Storage: 250GB SATA SSD
Data Storage: 2TB SAS HDD
OS: Windows Server 2012
For the program we're using (Abaqus, a widely used FEA package), the number of threads to use is set manually. We tested both the new server and the old server using 6 threads. The new server took 15.5 hrs to solve while the old server took only 6 hrs.
For reference, if we increase from 6 to 12 threads on the new server, the run time drops from 15.5 hrs to 8 hrs, indicating that things are scaling as expected. The same holds going from 12 to 24 threads: the run time drops from 8 hrs to 4 hrs.
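The scaling claim can be checked with a small calculation. This is only a sketch using the three run times quoted above; the function name `scaling_efficiency` is made up for illustration, and efficiency is defined here as measured speedup divided by the increase in thread count, relative to the smallest run.

```python
# Run times on the new server, as quoted: threads -> hours
runs = {6: 15.5, 12: 8.0, 24: 4.0}

def scaling_efficiency(runs):
    """Return parallel efficiency per thread count, relative to the smallest run.

    1.0 means perfect inverse scaling of run time with thread count.
    """
    base_threads = min(runs)
    base_time = runs[base_threads]
    return {
        n: (base_time / t) / (n / base_threads)
        for n, t in sorted(runs.items())
    }

for threads, eff in scaling_efficiency(runs).items():
    print(f"{threads:2d} threads: efficiency {eff:.3f}")
```

Efficiency close to 1.0 at 12 and 24 threads would support the observation that scaling within the new server is fine and that the problem is the per-thread speed.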
Abaqus comes with 2 different solvers: Standard and Explicit. Explicit scales rather well on many cores, even on distributed-memory systems, and it runs pretty well overall on 1st gen Epyc CPUs. The Standard solver is a different beast entirely. If you had some options to control core binding, it would not be as bad. But most of the commands needed are undocumented, and they do not always work as intended.
From my own testing, I would conclude that 1st gen Epyc with its complicated NUMA topology is just about the worst case for this solver. It doesn't help that your server has a suboptimal memory configuration. Fixing that might give you somewhat better, and more consistent, results, but it won't be a game-changer. 2nd gen Epyc might be, at least when configured in NPS1 mode. But overall, no CPU available right now or in the foreseeable future would get you the performance increase you hoped for. Not for the Abaqus Standard solver.
Side-note: from my own testing, GPU acceleration can work pretty well for Abaqus standard. That's something to look into for cutting your solver times in half and more.
As flotus1 pointed out, there are 2 different solvers for Abaqus.
Which one are you using?
As for the memory configuration, we recommend that you populate all 8 memory channels per socket (especially on the Naples generation, since there is no IO die).
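To put a rough number on the memory point, here is a back-of-the-envelope sketch. It assumes the 12 DIMMs listed in the specs are split evenly between the two sockets with one DIMM per populated channel, and it computes theoretical peak bandwidth only (64-bit channels at the DDR4-2666 transfer rate), not what the solver will actually see.

```python
# From the posted specs
sockets = 2
dimms_total = 12
transfer_rate_mt_s = 2666     # DDR4-2666

# Platform facts for 1st gen Epyc (Naples)
channels_per_socket = 8
bytes_per_transfer = 8        # 64-bit wide memory channel

# Assumption: DIMMs are split evenly across sockets, one per channel
populated_per_socket = dimms_total // sockets
populated_fraction = populated_per_socket / channels_per_socket

# Theoretical peak bandwidth in GB/s
per_channel_gbs = transfer_rate_mt_s * 1e6 * bytes_per_transfer / 1e9
current_peak_gbs = populated_per_socket * per_channel_gbs
full_peak_gbs = channels_per_socket * per_channel_gbs

print(f"channels populated per socket: {populated_per_socket}/{channels_per_socket}")
print(f"peak per socket now:  {current_peak_gbs:.1f} GB/s")
print(f"peak per socket full: {full_peak_gbs:.1f} GB/s")
```

On Naples each of the four dies in a socket owns two of the eight channels, so with only six channels populated some dies have less local memory bandwidth than others. Which dies are affected depends on the slot population, which the post does not state.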
The pinning of the threads/ranks is also critical to achieving optimal performance, especially since the memory is populated at less than 1 DIMM per channel.
My recommendation would be to try to get to a 1 DIMM per channel configuration with explicit pinning of the threads to see if this improves your situation.
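One possible way to experiment with pinning from outside the solver is to launch it under `numactl`. The sketch below is only that: it reads the NUMA layout from Linux sysfs and builds a `numactl --cpunodebind/--membind` prefix. The helper names, the node choice `[0, 1]` and the job name `myjob` are hypothetical, and whether the Abaqus solver processes keep the affinity they inherit is an assumption that would need checking (for example with `taskset -p <pid>` while a job runs).

```python
import glob
import os
import re

def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,32-35' into a list of ints."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

def read_numa_topology(sysfs_root="/sys/devices/system/node"):
    """Map NUMA node id -> list of logical CPUs, as reported by the kernel."""
    topology = {}
    for path in glob.glob(os.path.join(sysfs_root, "node[0-9]*")):
        node = int(re.search(r"node(\d+)$", path).group(1))
        with open(os.path.join(path, "cpulist")) as f:
            topology[node] = parse_cpulist(f.read())
    return topology

def pinned_command(nodes, solver_cmd):
    """Prefix a command so its CPUs and memory are restricted to the given nodes."""
    node_arg = ",".join(str(n) for n in nodes)
    return ["numactl", f"--cpunodebind={node_arg}", f"--membind={node_arg}"] + list(solver_cmd)

if __name__ == "__main__":
    # Show the layout of the machine this runs on (empty on non-NUMA or non-Linux hosts)
    for node, cpus in sorted(read_numa_topology().items()):
        print(f"node {node}: cpus {cpus}")
    # Hypothetical 6-thread job restricted to two NUMA nodes
    cmd = pinned_command([0, 1], ["abaqus", "job=myjob", "cpus=6", "interactive"])
    print(" ".join(cmd))
```

The idea is to keep a small job on as few NUMA nodes as its thread count needs, with memory allocated on those same nodes, and then compare run times against the unpinned 6-thread result.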