I am using the GAUSS programming language by Aptech on a cluster whose nodes have 32 Opteron cores each (see
http://products.amd.com/pages/opteroncpudetail.aspx?id=648&AspxAutoDetectCookieSupport=1)
and 128 GB of RAM, running 64-bit Linux. I am running 32-thread code that I wrote myself in GAUSS's handy and straightforward syntax.
During execution, I have noticed that about 85% of the CPU time is spent in user mode and 15% in system mode. I suspect this is due to the scheduler moving threads between cores, i.e. thread switching. Because of this, scaling from 16 to 32 threads generally does not improve performance: CPU times level off at the 16-thread figures even when I execute my 32-thread code. Only in about 5% of the experiments did it mysteriously happen that the system CPU percentage dropped to 0% and performance swelled to almost linear scaling.
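As a rough check on the migration hypothesis, core placement can be sampled from the shell. This is only a sketch: it uses the current shell's PID (`$$`) as a stand-in for the actual GAUSS process, whose PID you would substitute; if repeated samples show the PSR column changing, the scheduler is moving the process between cores.

```shell
# Print PID, current core (PSR), and command name for a process.
# $$ (this shell) is a placeholder for the GAUSS process PID.
ps -o pid=,psr=,comm= -p $$
# Sampling in a loop makes migrations visible as changing PSR values.
for i in 1 2 3; do ps -o psr= -p $$; sleep 1; done
```

Per-thread placement can be inspected the same way by adding `-L` to `ps`, which lists each thread's core separately.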
Therefore, it was suggested that I set the environment variable
KMP_AFFINITY=proclist=[0-15],explicit
so as to pin each thread to its own core. Unfortunately, this is Intel-specific syntax, while I am using an Opteron.
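For completeness, the suggested setting would be applied before launching GAUSS roughly as below. This is a sketch under assumptions: `tgauss` is a placeholder for the actual GAUSS launcher on the cluster, and the variable is only honored by Intel's OpenMP runtime.

```shell
# Export the suggested affinity setting so the launched process inherits it.
# (Only Intel's OpenMP runtime reads KMP_AFFINITY; "tgauss" is a
# placeholder for the real GAUSS invocation.)
export KMP_AFFINITY="proclist=[0-15],explicit"
echo "KMP_AFFINITY set to: $KMP_AFFINITY"
# tgauss myprogram.gss   # launch with the affinity setting in effect
```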
My question, therefore, is: what is the equivalent syntax for Opteron-based systems?
Finally, does anyone have an explanation for the mystery described above: several runs of the same code on the same cluster node, of which a few perform much better than the rest, with user CPU time near 100% while system time drops to 0%?
Ideally, that is the kind of execution I am pursuing with a view to higher scalability.