3 Replies Latest reply on Feb 7, 2011 11:52 AM by jdmccalpin

    numactl and Magny Cours stream benchmark

    bartwillems
      numactl -m vs -l behavior

      Hi All,

      I'm running STREAM on a dual-socket server with 12-core Magny Cours processors. The node inventory is:

      $ numactl --hardware
      available: 4 nodes (0-3)
      node 0 cpus: 0 1 2 3 4 5
      node 0 size: 8066 MB
      node 0 free: 5732 MB
      node 1 cpus: 6 7 8 9 10 11
      node 1 size: 8080 MB
      node 1 free: 7968 MB
      node 2 cpus: 18 19 20 21 22 23
      node 2 size: 8080 MB
      node 2 free: 7651 MB
      node 3 cpus: 12 13 14 15 16 17
      node 3 size: 8080 MB
      node 3 free: 7974 MB
      node distances:
      node   0   1   2   3
        0:  10  20  20  20
        1:  20  10  20  20
        2:  20  20  10  20
        3:  20  20  20  10

      I tried three STREAM runs with different numactl options:

      (1) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 ./stream_c.exe
      -------------------------------------------------------------
      Function      Rate (MB/s)   Avg time     Min time     Max time
      Copy:       30783.8826       0.0130       0.0125       0.0131
      Scale:      30449.8192       0.0129       0.0126       0.0131
      Add:        31962.5209       0.0183       0.0180       0.0185
      Triad:      34557.0669       0.0168       0.0167       0.0169
      -------------------------------------------------------------

      (2) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 -l ./stream_c.exe
      -------------------------------------------------------------
      Function      Rate (MB/s)   Avg time     Min time     Max time
      Copy:       30666.0714       0.0130       0.0125       0.0132
      Scale:      30600.2344       0.0129       0.0125       0.0130
      Add:        31881.5369       0.0183       0.0181       0.0184
      Triad:      34611.5257       0.0168       0.0166       0.0169
      -------------------------------------------------------------
      (3) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 -m 0,1,3,2 ./stream_c.exe
      -------------------------------------------------------------
      Function      Rate (MB/s)   Avg time     Min time     Max time
      Copy:       10379.1975       0.0371       0.0370       0.0371
      Scale:      10964.9034       0.0351       0.0350       0.0352
      Add:         9725.6091       0.0593       0.0592       0.0595
      Triad:       9790.2447       0.0590       0.0588       0.0591
      -------------------------------------------------------------
      I'm puzzled why (3) gives poorer results than (2), since the -m 0,1,3,2 option seemingly forces each core to use local memory just like the -l option does. Does anyone have any insight into this?
      Thanks,
      Bart

       

        • numactl and Magny Cours stream benchmark
          s1974

          Have you found an answer? What are your compilation options?

          Thanks

          • numactl and Magny Cours stream benchmark
            jdmccalpin

            It is important to be aware that the "numactl -C" option defines the *set* of cores where threads can be run, not a *list* of cores to which the individual threads are bound.  Although Linux schedulers usually leave threads on the core where they are started, there is no guarantee that this will be the case unless each thread is individually bound to a specific core.  I use "sched_setaffinity()" inside an OpenMP parallel loop to provide this binding.  (In most OpenMP implementations the mapping of OpenMP threads to O/S threads is guaranteed to remain the same across multiple parallel sections as long as the number of parallel threads requested is not changed.)
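
            In case it is useful, here is a minimal sketch of that binding approach (this is not the exact code from my runs; the core list shown is just an assumption mirroring the one-core-per-node set you used above):

            #define _GNU_SOURCE
            #include <sched.h>
            #include <stdio.h>
            #include <omp.h>

            /* Hypothetical helper (not part of stream.c): bind OpenMP thread i to core_list[i].
               Call it from serial code; the parallel region below executes once per thread. */
            static void bind_threads_to_cores(const int *core_list)
            {
                #pragma omp parallel
                {
                    cpu_set_t mask;
                    CPU_ZERO(&mask);
                    CPU_SET(core_list[omp_get_thread_num()], &mask);
                    if (sched_setaffinity(0, sizeof(mask), &mask) != 0)   /* pid 0 = the calling thread */
                        perror("sched_setaffinity");
                }
            }

            With OMP_NUM_THREADS=4 and a core list of {0, 6, 12, 18}, each of the four threads is pinned to a different node, which is the placement your "numactl -C 0,6,12,18" runs were aiming for.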

            Similarly, the "numactl -m" option defines the *set* of nodes where memory can be allocated, not a *list* of nodes to which the individual threads allocate their memory.  The standard version of the STREAM benchmark (when compiled with OpenMP support) initializes the arrays in a parallel section using the same loop constructs as the benchmark kernels.  This works very well with a "first touch" data placement policy, which is the default on many systems and which is explicitly forced by "numactl -l".
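
            Schematically, that initialization looks like the following (a sketch, not stream.c verbatim: the arrays a, b, c match stream.c's names, but the size constant here is only illustrative):

            #define N 2000000                      /* illustrative; stream.c defines its own array size */
            static double a[N], b[N], c[N];

            /* Parallel initialization in the style of stream.c.  Under a first-touch
               policy a page lands on the node of the thread that first writes it, and
               with the default (typically static) schedule each thread touches the same
               index range here as in the timed kernels, so its data stays local. */
            void init_arrays(void)
            {
                long j;
                #pragma omp parallel for
                for (j = 0; j < N; j++) {
                    a[j] = 1.0;
                    b[j] = 2.0;
                    c[j] = 0.0;
                }
            }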

            To guarantee processor/data affinity for the whole benchmark run, I usually add explicit code to bind the threads to specific cores *before* the data initialization loop.  If I want to test remote memory access, I add a second set of calls to "sched_setaffinity()" *after* the data is initialized and *before* the benchmark kernels.
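
            Combining the two sketches above, such an experiment is structured roughly like this (local_cores, remote_cores, and run_kernels() are hypothetical names; run_kernels() stands for the timed Copy/Scale/Add/Triad loops):

            static const int local_cores[]  = { 0,  6, 12, 18 };  /* one core per node   */
            static const int remote_cores[] = { 6, 12, 18,  0 };  /* rotated by one node */

            extern void run_kernels(void);            /* hypothetical: the timed STREAM loops */

            void run_remote_experiment(void)
            {
                bind_threads_to_cores(local_cores);   /* bind before any data is touched   */
                init_arrays();                        /* first touch places pages locally  */
                bind_threads_to_cores(remote_cores);  /* each thread is now one hop away   */
                                                      /* from the arrays it initialized    */
                run_kernels();
            }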

            Some compilers support OpenMP thread binding controlled by environment variables -- these work well with the OpenMP implementation in the standard version of STREAM.  The use of "sched_setaffinity()" is uglier, but I have been able to make it work with all C compilers on Linux that support OpenMP.
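
            For example, with gcc and libgomp something along these lines should give the one-thread-per-node placement without any source changes (GOMP_CPU_AFFINITY is GNU-specific; Intel's compilers use KMP_AFFINITY instead):

            OMP_NUM_THREADS=4 GOMP_CPU_AFFINITY="0 6 12 18" numactl -l ./stream_c.exe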

            The results you obtained with your third test case suggest that all of the data was allocated on one node (probably node 0), so only 1/4 of the aggregate system bandwidth was available to the four threads.