I am using the GAUSS programming language by Aptech on a cluster whose nodes have 32 Opteron cores each (see
http://products.amd.com/pages/opteroncpudetail.aspx?id=648&AspxAutoDetectCookieSupport=1)
and 128 GB of RAM, running 64-bit Linux. I am running 32-thread code that I wrote myself in GAUSS's handy and straightforward syntax.
During execution, I have noticed that about 85% of the CPU time is spent in user mode and 15% in system mode. I suspect this is due to the scheduler moving threads between cores, i.e. thread switching. Because of this, scaling from 16 to 32 threads generally does not improve performance: CPU times level off at the 16-thread figures even when I execute my 32-thread code. Only in about 5% of the experiments did it mysteriously happen that the system CPU percentage dropped to 0% and performance swelled to almost linear scaling.
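As a rough check on the migration hypothesis, core placement can be sampled from the shell. This is only a sketch: it uses the current shell's PID (`$$`) as a stand-in for the actual GAUSS process, whose PID you would substitute; if repeated samples show the PSR column changing, the scheduler is moving the process between cores.

```shell
# Print PID, current core (PSR), and command name for a process.
# $$ (this shell) is a placeholder for the GAUSS process PID.
ps -o pid=,psr=,comm= -p $$
# Sampling in a loop makes migrations visible as changing PSR values.
for i in 1 2 3; do ps -o psr= -p $$; sleep 1; done
```

Per-thread placement can be inspected the same way by adding `-L` to `ps`, which lists each thread's core separately.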
Therefore, it was suggested that I set the environment variable
KMP_AFFINITY=proclist=[0-15],explicit
so as to pin each thread to its own core. Unfortunately, this is Intel-specific syntax, while I am using an Opteron.
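For completeness, the suggested setting would be applied before launching GAUSS roughly as below. This is a sketch under assumptions: `tgauss` is a placeholder for the actual GAUSS launcher on the cluster, and the variable is only honored by Intel's OpenMP runtime.

```shell
# Export the suggested affinity setting so the launched process inherits it.
# (Only Intel's OpenMP runtime reads KMP_AFFINITY; "tgauss" is a
# placeholder for the real GAUSS invocation.)
export KMP_AFFINITY="proclist=[0-15],explicit"
echo "KMP_AFFINITY set to: $KMP_AFFINITY"
# tgauss myprogram.gss   # launch with the affinity setting in effect
```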
My question, therefore, is: what is the equivalent syntax for Opteron-based systems?
Finally, does anyone have an explanation for the mystery described above: several runs of the same code on the same cluster node, of which a few perform much better than the rest, with user CPU time near 100% while system time drops to 0%?
Ideally, that is the kind of execution I am pursuing with a view to higher scalability.