Server Processors

rudydiaz
Journeyman III

Running two OMP-parallelized FORTRAN executables on a 96-CPU EPYC Rome

I have a 96-CPU, dual-socket EPYC Rome computer with 512 GB of RAM. I am working on an OMP-parallelized FORTRAN code that I launch from a single command window, using the environment variables OMP_NUM_THREADS=48, OMP_PROC_BIND=close, and OMP_PLACES={0}:48:2.
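In full, the command window setup looks like this (myprog.exe stands in for my actual executable name):

    rem 48 threads, bound close to the listed places.
    rem {0}:48:2 means 48 places starting at logical CPU 0 with stride 2,
    rem i.e. the even-numbered CPUs 0, 2, 4, ..., 94.
    set OMP_NUM_THREADS=48
    set OMP_PROC_BIND=close
    set OMP_PLACES={0}:48:2
    myprog.exe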

By itself it takes 40.5 seconds to run. If I launch it with set OMP_PLACES={1}:48:2 instead, it also takes 40.5 seconds, and according to the Windows Resource Monitor the first case uses only the even CPUs of Node 0 while the second uses only the odd CPUs of Node 0. (Trying other starting positions or strides that get me into Node 1 increases the time by 50%.)

I then tried running the code from two command windows, each in a different folder containing its own copy of the executable. The two copies run at the same time but now take 80 seconds. In other words, even though I have 96 CPUs available, and I am (I think) limiting each run to 48 of them, they are somehow interfering with each other.

I have tried changing the BIOS setting for NUMA nodes per socket and also tried disabling SMT, but nothing works.

Can someone suggest an approach that would let me use all 96 CPUs?

shrjoshi
Staff

Hello @rudydiaz 

Thank you for writing to Serverguru.
We currently do not have the setup required to reproduce this issue, but we will look into it.

In the meantime, can you check the behavior with OMP_NUM_THREADS=96 and report back?
Also, because your application uses OpenMP, its resources are limited to the shared memory of your system. The 80 seconds you see when running the application from two different terminals on NUMA node 0 alone happens because the threads of both processes are contending for the same shared memory, hence the increased time.
I would suggest checking affinity/numactl-style options, if available in Windows, to utilize all available cores.
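For example, cmd.exe's built-in START command accepts /NODE and /AFFINITY switches for pinning a process at launch. A sketch along these lines (the masks and folder names are only illustrative and must be adjusted to your actual topology):

    rem Launch one copy on each NUMA node. When /NODE is given, the hex
    rem affinity mask is interpreted relative to that node; FFFFFFFFFFFF
    rem sets 48 bits, i.e. 48 logical CPUs (illustrative values only).
    start /NODE 0 /AFFINITY FFFFFFFFFFFF copy1\myprog.exe
    start /NODE 1 /AFFINITY FFFFFFFFFFFF copy2\myprog.exe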


rudydiaz
Journeyman III

Thank you. I am posting this reply in case anyone else runs into the same issue.

I did perform a variety of experiments using 96 threads and other combinations, from all in Node 0, to spilling across into Node 1, to all in Node 1. Involving Node 1 was always worse.

It turns out this problem is due to a combination of factors, foremost among them a combination of my ignorance and the vendor's. I wanted 512 GB of RAM in this two-socket server (on a Supermicro motherboard), and among the top choices was 4 x 128 GB, which is what I ordered. After researching this, it turns out that there is no favorable configuration of four sticks of memory on a two-socket EPYC Rome 7552: each socket has eight memory channels, so the indicated way to buy 512 GB of RAM would have been 16 sticks of 32 GB each, populating every channel and giving me the largest possible bandwidth, which is what I need for CFD-like computational work. Instead, the vendor put the four 128 GB sticks all on Socket 0, which explains why Socket 1 was so slow, since that forces it to access all of its memory through Socket 0.

Apart from that, Windows is not a friendly environment for figuring out how the CPUs are allocated among the CCXs and CCDs, which is probably why the combinations I was trying (using OMP_PLACES) resulted in too much memory sharing among the threads.

With a little help, I have now found a way to identify the CPUs on Windows, and I will now start a series of experiments launching two identical executables from two different command windows, each with a limited number of threads, with the goal of keeping the two sets of threads from contending for the same memory.
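For instance, something along these lines, where the exact CPU numbers and folder names are only placeholders until I know how Windows numbers the CCDs:

    rem Window 1: 24 threads on the first half of the even-numbered CPUs
    set OMP_NUM_THREADS=24
    set OMP_PROC_BIND=close
    set OMP_PLACES={0}:24:2
    copy1\myprog.exe

    rem Window 2: 24 threads on the second half, so the place lists do not overlap
    set OMP_NUM_THREADS=24
    set OMP_PROC_BIND=close
    set OMP_PLACES={48}:24:2
    copy2\myprog.exe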

As I have been told, Linux is a much friendlier environment for this kind of work, and I may eventually switch the machine to Linux.

Finally, I am very interested in trying the AMD FORTRAN compiler, but I think it is not available for Windows.


shrjoshi
Staff

Hello @rudydiaz

Thank you for the update on the issue.
Yes, AMD FORTRAN is currently not supported on Windows.
Please refer to the link below for AOCC (the AMD compiler) if you plan to switch over to Linux.
https://www.amd.com/en/developer/aocc.html
