
bartwillems
Journeyman III

numactl and Magny Cours stream benchmark

numactl -m vs -l behavior

Hi All,

I'm running STREAM on a dual-socket, 12-core Magny Cours server. The node inventory is

 

$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8066 MB
node 0 free: 5732 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8080 MB
node 1 free: 7968 MB
node 2 cpus: 18 19 20 21 22 23
node 2 size: 8080 MB
node 2 free: 7651 MB
node 3 cpus: 12 13 14 15 16 17
node 3 size: 8080 MB
node 3 free: 7974 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

I tried three STREAM runs with different numactl options:

 

(1) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 ./stream_c.exe
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       30783.8826       0.0130       0.0125       0.0131
Scale:      30449.8192       0.0129       0.0126       0.0131
Add:        31962.5209       0.0183       0.0180       0.0185
Triad:      34557.0669       0.0168       0.0167       0.0169
-------------------------------------------------------------

(2) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 -l ./stream_c.exe
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       30666.0714       0.0130       0.0125       0.0132
Scale:      30600.2344       0.0129       0.0125       0.0130
Add:        31881.5369       0.0183       0.0181       0.0184
Triad:      34611.5257       0.0168       0.0166       0.0169
-------------------------------------------------------------
(3) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 -m 0,1,3,2 ./stream_c.exe
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       10379.1975       0.0371       0.0370       0.0371
Scale:      10964.9034       0.0351       0.0350       0.0352
Add:         9725.6091       0.0593       0.0592       0.0595
Triad:       9790.2447       0.0590       0.0588       0.0591
-------------------------------------------------------------
I'm puzzled why (3) gives poorer results than (2), since the -m 0,1,3,2 option seemingly forces each core to use local memory just like the -l option. Does anyone have any insight into this?
Thanks,
Bart

 

s1974
Journeyman III

Have you found an answer? What are your compilation options?

Thanks



It is important to be aware that the "numactl -C" option defines the *set* of cores where threads can be run, not a *list* of cores to which the individual threads are bound.  Although Linux schedulers usually leave threads on the core where they are started, there is no guarantee that this will be the case unless each thread is individually bound to a specific core.  I use "sched_setaffinity()" inside an OpenMP parallel loop to provide this binding.  (In most OpenMP implementations the mapping of OpenMP threads to O/S threads is guaranteed to remain the same across multiple parallel sections as long as the number of parallel threads requested is not changed.)
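
For reference, here is a minimal sketch of that binding pattern (the core list 0,6,12,18 simply mirrors the "-C" set above and is an assumption about your topology; error handling is kept to a minimum):

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdio.h>

/* Assumed core list -- one core per NUMA node, matching "numactl -C 0,6,12,18". */
static const int core_map[] = {0, 6, 12, 18};

/* Pin the calling OS thread to a single core. */
static void bind_to_core(int core)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    /* pid 0 means "the calling thread" */
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)
        perror("sched_setaffinity");
}

int main(void)
{
#pragma omp parallel
    {
        int tid = omp_get_thread_num();
        bind_to_core(core_map[tid % 4]);
        printf("OpenMP thread %d bound to core %d\n", tid, core_map[tid % 4]);
    }
    return 0;
}

Compiled with something like "gcc -O2 -fopenmp", the same bind_to_core() calls can be dropped into stream.c ahead of the initialization loop.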

Similarly, the "numactl -m" option defines the *set* of nodes where memory can be allocated, not a *list* of nodes to which the individual threads allocate their memory.  The standard version of the STREAM benchmark (when compiled with OpenMP support) initializes the arrays in a parallel section using the same loop constructs as the benchmark kernels.  This works very well with a "first touch" data placement policy, which is the default on many systems and which is explicitly forced by "numactl -l".
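
As an illustration of the first-touch behavior, an initialization loop of the general form used in stream.c looks like the sketch below (the array length is arbitrary, and the standard benchmark uses static arrays rather than malloc; heap allocation here just keeps the sketch short):

#include <omp.h>
#include <stdlib.h>

#define N 20000000L   /* illustrative length only, not the actual STREAM array size */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));

    /* First touch: the thread that writes a page first determines which node
       backs it, so the same static work distribution used by the kernels
       leaves each thread's slice of a[], b[], c[] in its local memory. */
#pragma omp parallel for schedule(static)
    for (long j = 0; j < N; j++) {
        a[j] = 1.0;
        b[j] = 2.0;
        c[j] = 0.0;
    }

    /* ... Copy/Scale/Add/Triad kernels with the same work distribution follow ... */
    free(a); free(b); free(c);
    return 0;
}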

To guarantee processor/data affinity for the whole benchmark run, I usually add explicit code to bind the threads to specific cores *before* the data initialization loop.  If I want to test remote memory access, I add a second set of calls to "sched_setaffinity()" *after* the data is initialized and *before* the benchmark kernels.
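
A rough sketch of that two-phase binding, with hypothetical "home" and "remote" core lists (the remote list simply shifts each thread onto a core in a different node, so the kernels read and write memory that was first-touched elsewhere):

#define _GNU_SOURCE
#include <sched.h>
#include <omp.h>
#include <stdlib.h>

#define N 20000000L   /* illustrative array length only */

/* Pin the calling OS thread to a single core. */
static void bind_to_core(int core)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);
}

int main(void)
{
    /* Hypothetical core lists: "home" cores for the first-touch initialization,
       "remote" cores on other nodes for the timed kernels. */
    static const int home[]   = {0, 6, 12, 18};
    static const int remote[] = {6, 12, 18, 0};
    double *a = malloc(N * sizeof(double));

#pragma omp parallel
    {
        int t = omp_get_thread_num() % 4;

        bind_to_core(home[t]);             /* bind before the data is touched */
#pragma omp for schedule(static)
        for (long j = 0; j < N; j++)
            a[j] = 1.0;                    /* pages land on each thread's local node */

        bind_to_core(remote[t]);           /* re-bind before the benchmark kernels */
        /* ... kernels placed here would now access remote memory ... */
    }

    free(a);
    return 0;
}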

Some compilers support OpenMP thread binding controlled by environment variables -- these work well with the OpenMP implementation in the standard version of STREAM.  The use of "sched_setaffinity()" is uglier, but I have been able to make it work with all C compilers on Linux that support OpenMP.

The results you obtained with your third test case suggest that all of the data was allocated on one node (probably node 0), so only 1/4 of the aggregate system bandwidth was available to the four threads.
