Hi All,
I'm running STREAM on a dual-socket 12-core AMD Magny-Cours server. The node inventory is:
$ numactl --hardware
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5
node 0 size: 8066 MB
node 0 free: 5732 MB
node 1 cpus: 6 7 8 9 10 11
node 1 size: 8080 MB
node 1 free: 7968 MB
node 2 cpus: 18 19 20 21 22 23
node 2 size: 8080 MB
node 2 free: 7651 MB
node 3 cpus: 12 13 14 15 16 17
node 3 size: 8080 MB
node 3 free: 7974 MB
node distances:
node 0 1 2 3
0: 10 20 20 20
1: 20 10 20 20
2: 20 20 10 20
3: 20 20 20 10
(1) OMP_NUM_THREADS=4 numactl -C 0,6,12,18 ./stream_c.exe
-------------------------------------------------------------
Function Rate (MB/s) Avg time Min time Max time
Copy: 30783.8826 0.0130 0.0125 0.0131
Scale: 30449.8192 0.0129 0.0126 0.0131
Add: 31962.5209 0.0183 0.0180 0.0185
Triad: 34557.0669 0.0168 0.0167 0.0169
-------------------------------------------------------------
Have you found your answer? What are your compilation options?
Thanks
Originally posted by: s1974 Have you found your answer? What are your compilation options?
Thanks
It is important to be aware that the "numactl -C" option defines the *set* of cores where threads can be run, not a *list* of cores to which the individual threads are bound. Although Linux schedulers usually leave threads on the core where they are started, there is no guarantee that this will be the case unless each thread is individually bound to a specific core. I use "sched_setaffinity()" inside an OpenMP parallel loop to provide this binding. (In most OpenMP implementations the mapping of OpenMP threads to O/S threads is guaranteed to remain the same across multiple parallel sections as long as the number of parallel threads requested is not changed.)
Similarly, the "numactl -m" option defines the *set* of nodes where memory can be allocated, not a *list* of nodes to which the individual threads allocate their memory. The standard version of the STREAM benchmark (when compiled with OpenMP support) initializes the arrays in a parallel section using the same loop constructs as the benchmark kernels. This works very well with a "first touch" data placement policy, which is the default on many systems and which is explicitly forced by "numactl -l".
To guarantee processor/data affinity for the whole benchmark run, I usually add explicit code to bind the threads to specific cores *before* the data initialization loop. If I want to test remote memory access, I add a second set of calls to "sched_setaffinity()" *after* the data is initialized and *before* the benchmark kernels.
Some compilers support OpenMP thread binding controlled by environment variables -- these work well with the OpenMP implementation in the standard version of STREAM. The use of "sched_setaffinity()" is uglier, but I have been able to make it work with all C compilers on Linux that support OpenMP.
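For example (the variable names depend on the OpenMP runtime; the GCC and Intel forms below are assumptions to check against your compiler's documentation):

```shell
# libgomp (gcc -fopenmp): space-separated core list, one entry per thread
GOMP_CPU_AFFINITY="0 6 12 18" OMP_NUM_THREADS=4 ./stream_c.exe

# Intel OpenMP runtime: explicit per-thread proclist binding
KMP_AFFINITY="granularity=fine,proclist=[0,6,12,18],explicit" \
    OMP_NUM_THREADS=4 ./stream_c.exe
```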
The results you obtained with your third test case suggest that all of the data was allocated on one node (probably node 0), so only 1/4 of the aggregate system bandwidth was available to the four threads.