numactl and Magny Cours stream benchmark

Discussion created by bartwillems on Apr 6, 2010
Latest reply on Feb 7, 2011 by jdmccalpin
numactl -m vs -l behavior

Hi All,

I'm running STREAM on a dual-socket, 12-core Magny-Cours server. The node inventory is:

 

$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8066 MB
node 0 free: 5732 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8080 MB
node 1 free: 7968 MB
node 2 cpus: 18 19 20 21 22 23
node 2 size: 8080 MB
node 2 free: 7651 MB
node 3 cpus: 12 13 14 15 16 17
node 3 size: 8080 MB
node 3 free: 7974 MB
node distances:
node   0   1   2   3
  0:  10  20  20  20
  1:  20  10  20  20
  2:  20  20  10  20
  3:  20  20  20  10

I tried three different stream runs with different numactl options:

 

(1) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 ./stream_c.exe
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       30783.8826       0.0130       0.0125       0.0131
Scale:      30449.8192       0.0129       0.0126       0.0131
Add:        31962.5209       0.0183       0.0180       0.0185
Triad:      34557.0669       0.0168       0.0167       0.0169
-------------------------------------------------------------

(2) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 -l ./stream_c.exe
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       30666.0714       0.0130       0.0125       0.0132
Scale:      30600.2344       0.0129       0.0125       0.0130
Add:        31881.5369       0.0183       0.0181       0.0184
Triad:      34611.5257       0.0168       0.0166       0.0169
-------------------------------------------------------------
(3) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 -m 0,1,3,2 ./stream_c.exe
-------------------------------------------------------------
Function      Rate (MB/s)   Avg time     Min time     Max time
Copy:       10379.1975       0.0371       0.0370       0.0371
Scale:      10964.9034       0.0351       0.0350       0.0352
Add:         9725.6091       0.0593       0.0592       0.0595
Triad:       9790.2447       0.0590       0.0588       0.0591
-------------------------------------------------------------
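
To put a number on the gap, here is a quick sketch that divides the rates from run (2) by those from run (3). The rates are copied from the tables above; the variable names (rates_local, rates_membind, slowdown) are just for illustration.

```python
# Rates in MB/s copied from the STREAM tables for runs (2) and (3).
rates_local = {      # run (2): numactl -C 0,6,12,18 -l
    "Copy": 30666.0714,
    "Scale": 30600.2344,
    "Add": 31881.5369,
    "Triad": 34611.5257,
}
rates_membind = {    # run (3): numactl -C 0,6,12,18 -m 0,1,3,2
    "Copy": 10379.1975,
    "Scale": 10964.9034,
    "Add": 9725.6091,
    "Triad": 9790.2447,
}

# Ratio of the -l rate to the -m rate for each kernel.
slowdown = {k: rates_local[k] / rates_membind[k] for k in rates_local}

for kernel, ratio in slowdown.items():
    print(f"{kernel:6s} {ratio:.2f}x")
```

Every kernel in run (3) comes out somewhere between about 2.8x and 3.5x slower than in run (2).
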
I'm puzzled why (3) gives so much poorer results than (2), since the -m 0,1,3,2 option seemingly forces each core to use local memory, just like the -l option. Does anyone have any insight into this?
Thanks,
Bart

 
