
bartwillems
Journeyman III

numactl and Magny Cours stream benchmark

numactl -m vs -l behavior

Hi All,

I'm running STREAM on a dual-socket, 12-core Magny Cours server. The node inventory is

 

$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8066 MB
node 0 free: 5732 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8080 MB
node 1 free: 7968 MB
node 2 cpus: 18 19 20 21 22 23
node 2 size: 8080 MB
node 2 free: 7651 MB
node 3 cpus: 12 13 14 15 16 17
node 3 size: 8080 MB
node 3 free: 7974 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

I tried three STREAM runs with different numactl options:

 

(1) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 ./stream_c.exe
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       30783.8826       0.0130       0.0125       0.0131
Scale:      30449.8192       0.0129       0.0126       0.0131
Add:        31962.5209       0.0183       0.0180       0.0185
Triad:      34557.0669       0.0168       0.0167       0.0169
-------------------------------------------------------------

(2) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 -l ./stream_c.exe
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       30666.0714       0.0130       0.0125       0.0132
Scale:      30600.2344       0.0129       0.0125       0.0130
Add:        31881.5369       0.0183       0.0181       0.0184
Triad:      34611.5257       0.0168       0.0166       0.0169
-------------------------------------------------------------
(3) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 -m 0,1,3,2 ./stream_c.exe
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       10379.1975       0.0371       0.0370       0.0371
Scale:      10964.9034       0.0351       0.0350       0.0352
Add:         9725.6091       0.0593       0.0592       0.0595
Triad:       9790.2447       0.0590       0.0588       0.0591
-------------------------------------------------------------
I'm puzzled why (3) gives poorer results than (2), since the -m 0,1,3,2 option seemingly forces each core to use local memory just like the -l option. Does anyone have any insight into this?
Thanks,
Bart

 

s1974
Journeyman III

Have you found an answer? What are your compilation options?

Thanks



It is important to be aware that the "numactl -C" option defines the *set* of cores where threads can be run, not a *list* of cores to which the individual threads are bound.  Although Linux schedulers usually leave threads on the core where they are started, there is no guarantee that this will be the case unless each thread is individually bound to a specific core.  I use "sched_setaffinity()" inside an OpenMP parallel loop to provide this binding.  (In most OpenMP implementations the mapping of OpenMP threads to O/S threads is guaranteed to remain the same across multiple parallel sections as long as the number of parallel threads requested is not changed.)
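
For reference, here is a minimal sketch of that binding pattern (the core list 0,6,12,18 simply mirrors the "-C" set above and is an assumption about your topology; error handling is kept to a minimum):

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

/* Assumed core list -- one core per NUMA node, matching "numactl -C 0,6,12,18". */
static const int core_map[] = {0, 6, 12, 18};

/* Pin the calling OS thread to a single core. */
static void bind_to_core(int core)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");
}

int main(void)
{
#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        bind_to_core(core_map[tid % 4]);
        printf("OpenMP thread %d bound to core %d\n", tid, core_map[tid % 4]);
    }
    return 0;
}

Compiled with something like "gcc -O2 -fopenmp", the same bind_to_core() calls can be dropped into stream.c ahead of the initialization loop.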

Similarly, the "numactl -m" option defines the *set* of nodes where memory can be allocated, not a *list* of nodes to which the individual threads allocate their memory.  The standard version of the STREAM benchmark (when compiled with OpenMP support) initializes the arrays in a parallel section using the same loop constructs as the benchmark kernels.  This works very well with a "first touch" data placement policy, which is the default on many systems and which is explicitly forced by "numactl -l".
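
As an illustration of the first-touch behavior, an initialization loop of the general form used in stream.c looks like the sketch below (the array length is arbitrary, and the standard benchmark uses static arrays rather than malloc; heap allocation here just keeps the sketch short):

#include <omp.h>
#include <stdlib.h>

#define N 20000000L   /* illustrative length only, not the actual STREAM array size */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    /* First touch: the thread that writes a page first determines which node
       backs it, so the same static work distribution used by the kernels
       leaves each thread's slice of a[], b[], c[] in its local memory. */
#pragma omp parallel for schedule(static)
    for (long j = 0; j < N; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
    }

    /* ... Copy/Scale/Add/Triad kernels with the same work distribution follow ... */
    free(a); free(b); free(c);
    return 0;
}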

To guarantee processor/data affinity for the whole benchmark run, I usually add explicit code to bind the threads to specific cores *before* the data initialization loop.  If I want to test remote memory access, I add a second set of calls to "sched_setaffinity()" *after* the data is initialized and *before* the benchmark kernels.
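
A rough sketch of that two-phase binding, with hypothetical "home" and "remote" core lists (the remote list simply shifts each thread onto a core in a different node, so the kernels read and write memory that was first-touched elsewhere):

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdlib.h>

#define N 20000000L   /* illustrative array length only */

/* Pin the calling OS thread to a single core. */
static void bind_to_core(int core)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}

int main(void)
{
    /* Hypothetical core lists: "home" cores for the first-touch initialization,
       "remote" cores on other nodes for the timed kernels. */
    static const int home[]   = {0, 6, 12, 18};
    static const int remote[] = {6, 12, 18, 0};
    double *a = malloc(N * sizeof(double));

#pragma omp parallel
    {
        int t = omp_get_thread_num() % 4;

        bind_to_core(home[t]);             /* bind before the data is touched */
#pragma omp for schedule(static)
        for (long j = 0; j < N; j++)
            a[j] = 1.0;                    /* pages land on each thread's local node */

        bind_to_core(remote[t]);           /* re-bind before the benchmark kernels */
        /* ... kernels placed here would now access remote memory ... */
    }

    free(a);
    return 0;
}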

Some compilers support OpenMP thread binding controlled by environment variables -- these work well with the OpenMP implementation in the standard version of STREAM.  The use of "sched_setaffinity()" is uglier, but I have been able to make it work with all C compilers on Linux that support OpenMP.

The results you obtained with your third test case suggest that all of the data was allocated on one node (probably node 0), so only 1/4 of the aggregate system bandwidth was available to the four threads.
