Subject: I need help understanding how NUMA settings affect my attempt to parallelize a Fortran code, and how to remedy the problem.
I am trying to run a Fortran code in parallel on a dual-socket AMD EPYC 7552 (Rome) system: 96 cores / 192 threads in total, with 192 MB of L3 cache per socket. The machine has 4 x 128 GB of DDR4-3200 (PC4-25600) ECC server memory.
The Fortran code is supposed to be the type that is ideal for parallelization: a large 3D domain swept by two major triply-nested do loops. The first loop updates a set of matrices (call them Xn) using the values of nearby elements of the stored matrices (call them An); when that loop is done, the An matrices' elements are updated using nearby elements of the just-computed Xn. In other words, the updates are fully explicit. In principle, the 3D domain could be partitioned into multiple 3D sub-domains that run independently of each other, except for needing to access some of the same elements along their common faces. (It is a finite-difference time-domain, FDTD, code.)
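To make the structure concrete, here is a stripped-down sketch of the loop pattern. This is not my actual code: the array names Xn and An, the domain size, and the coefficient c1 are placeholders, and my real updates involve more arrays and more terms per element. I have also included a parallel "first touch" initialization loop, which, as far as I understand, determines which NUMA domain each memory page initially lands in:

program fdtd_sketch
   implicit none
   integer, parameter :: nx = 256, ny = 256, nz = 256   ! placeholder domain size
   real(8), parameter :: c1 = 0.1d0                     ! placeholder coefficient
   real(8), allocatable :: Xn(:,:,:), An(:,:,:)
   integer :: i, j, k, step

   allocate(Xn(nx,ny,nz), An(nx,ny,nz))

   ! "First touch": initialize in parallel with the same loop shape as the
   ! update loops, so each thread's pages are allocated in its own NUMA domain.
   !$omp parallel do collapse(2) schedule(static)
   do k = 1, nz
      do j = 1, ny
         do i = 1, nx
            Xn(i,j,k) = 0.0d0
            An(i,j,k) = 1.0d0
         end do
      end do
   end do
   !$omp end parallel do

   do step = 1, 10
      ! Loop 1: update Xn from neighboring elements of An
      !$omp parallel do collapse(2) schedule(static)
      do k = 2, nz-1
         do j = 2, ny-1
            do i = 2, nx-1
               Xn(i,j,k) = Xn(i,j,k) + c1 * ( An(i+1,j,k) + An(i-1,j,k)   &
                                            + An(i,j+1,k) + An(i,j-1,k)   &
                                            + An(i,j,k+1) + An(i,j,k-1)   &
                                            - 6.0d0 * An(i,j,k) )
            end do
         end do
      end do
      !$omp end parallel do

      ! Loop 2: update An from the just-computed Xn (same pattern, roles swapped)
      !$omp parallel do collapse(2) schedule(static)
      do k = 2, nz-1
         do j = 2, ny-1
            do i = 2, nx-1
               An(i,j,k) = An(i,j,k) + c1 * ( Xn(i+1,j,k) + Xn(i-1,j,k)   &
                                            + Xn(i,j+1,k) + Xn(i,j-1,k)   &
                                            + Xn(i,j,k+1) + Xn(i,j,k-1)   &
                                            - 6.0d0 * Xn(i,j,k) )
            end do
         end do
      end do
      !$omp end parallel do
   end do

   print *, 'checksum:', sum(Xn)
end program fdtd_sketch

The real code has the same access pattern: each element update reads only its immediate neighbors from the other set of arrays, so within a time step the two loops have no internal dependencies.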
The problem statement is that I cannot get the parallel program to make effective use of more than about 32 threads, so that with 96 CPUs available it is only about 1.33 times faster than an older 16-CPU (Threadripper) computer. In other words, if I set OMP_NUM_THREADS higher than 32, nothing improves, and performance eventually gets worse, with more and more of the run time piling up in the OpenMP runtime's barrier (a "kmp barrier" problem).
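In case it matters: from what I have read, thread placement is controlled with the standard OpenMP environment variables below, and I would like to confirm this is the right starting point (my_fdtd_code is just a stand-in for my executable):

# Pin one thread per physical core, spread evenly across both sockets
export OMP_NUM_THREADS=96
export OMP_PLACES=cores
export OMP_PROC_BIND=spread
export OMP_DISPLAY_ENV=true    # ask the runtime to print the settings it actually uses
./my_fdtd_code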
My problem is that I am a scientist, not a programmer, and although I am getting help with the proper way of using the OpenMP directives in the program, the biggest issue appears to be that I am running into a memory bandwidth limit. But I am told it may also have to do with losing the bandwidth of the other NUMA domains, e.g. I was told that "If you have NUMA balancing enabled and you run much more iterations, the memory pages might move to other NUMA domains."
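From what I could find, automatic NUMA balancing on Linux is a kernel setting, and numactl can show the domain layout, so I believe these are the relevant knobs to experiment with (again, my_fdtd_code is a stand-in; please correct me if I have this wrong):

# Show the NUMA layout: how many domains, and which cores/memory belong to each
numactl --hardware

# Automatic NUMA balancing: 1 = enabled, 0 = disabled
cat /proc/sys/kernel/numa_balancing
sudo sysctl -w kernel.numa_balancing=0    # disable it for a test run

# Alternatively, interleave the memory pages across all domains for a test
numactl --interleave=all ./my_fdtd_code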
So, is there anything I can do about NUMA balancing (or turning it off)? And is this the kind of question you can help me with in this forum?