Hi, all.
I decided to measure the execution time of some atomic instructions in
order to get a deeper understanding of the penalty associated with memory
access to remote NUMA nodes.
Eventually I got results for a set of instructions which contradict the
NUMA theory: in my case access to remote nodes is faster. But let me
describe everything in order.
1. My configuration is the following:
root@server:~# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 4
NUMA node(s): 8
Vendor ID: AuthenticAMD
CPU family: 21
Model: 2
Model name: AMD Opteron(tm) Processor 6386 SE
Stepping: 0
CPU MHz: 2807.199
BogoMIPS: 5616.32
Virtualization: AMD-V
L1d cache: 16K
L1i cache: 64K
L2 cache: 2048K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
NUMA node1 CPU(s): 8-15
NUMA node2 CPU(s): 16-23
NUMA node3 CPU(s): 24-31
NUMA node4 CPU(s): 32-39
NUMA node5 CPU(s): 40-47
NUMA node6 CPU(s): 48-55
NUMA node7 CPU(s): 56-63
root@server:~# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16045 MB
node 0 free: 15419 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16127 MB
node 1 free: 16062 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16127 MB
node 2 free: 16085 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 16127 MB
node 3 free: 16031 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 16127 MB
node 4 free: 16083 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 16127 MB
node 5 free: 16060 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 16127 MB
node 6 free: 16080 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 16126 MB
node 7 free: 16050 MB
node distances:
node 0 1 2 3 4 5 6 7
0: 10 16 16 22 16 22 16 22
1: 16 10 22 16 22 16 22 16
2: 16 22 10 16 16 22 16 22
3: 22 16 16 10 22 16 22 16
4: 16 22 16 22 10 16 16 22
5: 22 16 22 16 16 10 22 16
6: 16 22 16 22 16 22 10 16
7: 22 16 22 16 22 16 16 10
2. There is a small userspace tool of mine which aims to burn different
CPUs and measure the execution time in nsecs:
https://github.com/rouming/ccont
Basically what this tool does is the following (a rough C sketch follows
the list):
o allocates a chunk of page-aligned memory on NUMA node #0.
o starts the desired number of threads and pins them to certain CPUs,
  which belong to different NUMA nodes (which CPUs and nodes to use is
  defined by bitmasks passed as tool parameters).
o burns those CPUs, which execute the same instruction in a loop (which
  instruction to execute is also defined by a tool parameter).
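To make the skeleton concrete, here is a rough, self-contained sketch of
the idea (not the actual ccont source; it assumes libnuma and pthreads,
burns a single CPU with a single operation, and prints the per-operation
time in nsecs; build with 'gcc -O2 sketch.c -lnuma -lpthread'):

#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define CHUNK_SZ (1UL << 20)
#define ITERS    (1UL << 20)

static void *chunk;        /* page-aligned chunk, placed on node 0 */

static void *burn(void *arg)
{
    int cpu = (int)(long)arg;
    struct timespec t0, t1;
    unsigned long i;
    cpu_set_t set;
    double ns;

    /* pin this thread to the requested CPU */
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++) {
        memset(chunk, 0, 256);          /* the measured operation */
        asm volatile("" ::: "memory");  /* keep the compiler honest */
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("cpu %2d: %.3f ns/op\n", cpu, ns / ITERS);
    return NULL;
}

int main(void)
{
    pthread_t thr;

    chunk = numa_alloc_onnode(CHUNK_SZ, 0);   /* memory on NUMA node 0 */
    pthread_create(&thr, NULL, burn, (void *)0L);
    pthread_join(thr, NULL);
    numa_free(chunk, CHUNK_SZ);
    return 0;
}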
The tool provides a set of loads which define what CPUs and nodes to
occupy, e.g.:
o cpu-increase - on each iteration a new thread is created and pinned
                 to the next CPU to burn. This load shows the
                 performance degradation in case of cache line
                 bouncing.
o node-cascade - on each iteration the CPUs of the next node are
                 burned. This load shows the performance difference
                 between nodes.
o cpu-rollover - on each iteration an executor thread rolls over to
                 another CPU on the next node, always keeping the same
                 number of CPUs busy. This load shows the performance
                 difference when threads are spread over all CPUs or
                 concentrated on one specific node.
3. Here I would like to show the results which look strange to me. I hope
someone can help clarify what happens on my machine. The following are
the CPU operations which I measured (a rough sketch of both follows the
list):
o memset256 - memset of 256 bytes.
o cmpxchg - the `cmpxchgq` instruction, which atomically
            compares-and-exchanges an unsigned long (64-bit with gcc).
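In C terms the two loop bodies are roughly the following (a simplified
sketch, not the exact ccont code):

#include <stdint.h>
#include <string.h>

/* memset256: a plain 256-byte store burst into the node-0 chunk */
static inline void op_memset256(void *chunk)
{
    memset(chunk, 0, 256);
}

/* cmpxchg: gcc compiles this builtin to 'lock cmpxchgq' on x86-64 */
static inline void op_cmpxchg(uint64_t *word)
{
    uint64_t expected = *word;

    __atomic_compare_exchange_n(word, &expected, expected + 1,
                                0 /* strong */, __ATOMIC_SEQ_CST,
                                __ATOMIC_SEQ_CST);
}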
Results of doing a memset of 256 bytes of the memory chunk, which is
allocated on NUMA node N0:
# Burn 8 CPUs on local node N0:
root@server:~# ./ccont -o memset256 -n 0
Nodes N0 N1 N2 N3 N4 N5 N6 N7 CPUs operation min max avg stdev
CPUs ******** -------- -------- -------- -------- -------- -------- -------- 8 memset256 1290.655 1343.644 1326.663 22.636
# Burn 8 CPUs on distant node N7:
root@server:~# ./ccont -o memset256 -n 7
Nodes N0 N1 N2 N3 N4 N5 N6 N7 CPUs operation min max avg stdev
CPUs -------- -------- -------- -------- -------- -------- -------- ******** 8 memset256 1800.407 1800.966 1800.785 0.221
# Burn 7 CPUs, one on each of nodes N0-N6:
root@server:~# ./ccont -o memset256 -c 0,8,16,24,32,40,48
Nodes N0 N1 N2 N3 N4 N5 N6 N7 CPUs operation min max avg stdev
CPUs *------- *------- *------- *------- *------- *------- *------- -------- 7 memset256 2596.499 2662.810 2642.929 30.703
Summary:
These results look as expected to me: the avg column shows the average
execution time in nsecs, and local node access is much faster.
Results for the `cmpxchgq` instruction are the following:
# Burn 8 CPUs on local node N0:
root@server:~# ./ccont -op cmpxchg --nodes 0
Nodes N0 N1 N2 N3 N4 N5 N6 N7 CPUs operation min max avg stdev
CPUs ******** -------- -------- -------- -------- -------- -------- -------- 8 cmpxchg 744.714 745.481 745.176 0.281
# Burn 8 CPUs on distant node N7:
root@server:~# ./ccont -op cmpxchg --nodes 7
Nodes N0 N1 N2 N3 N4 N5 N6 N7 CPUs operation min max avg stdev
CPUs -------- -------- -------- -------- -------- -------- -------- ******** 8 cmpxchg 497.700 613.177 553.036 53.784
# Burn 8 CPUs on all nodes:
root@server:~# ./ccont -op cmpxchg --cpu 0,8,16,24,32,40,48,56
Nodes N0 N1 N2 N3 N4 N5 N6 N7 CPUs operation min max avg stdev
CPUs *------- *------- *------- *------- *------- *------- *------- *------- 8 cmpxchg 317.938 476.666 411.078 70.578
Summary:
Local memory access is quite expensive compared to access from the
remote node N7, and the spread access shows the best time of all.
I can't explain these results. According to my understanding, local
access should be faster, and access from different nodes should be the
worst because of cache line bouncing, but the numbers say the opposite.
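(By cache line bouncing I mean the classic pattern sketched below - a
standalone illustration, not part of ccont: every thread does a lock
cmpxchg on the same 64-bit word, so the owning cache line has to migrate
between cores and nodes on every update; build with
'gcc -O2 bounce.c -lpthread'.)

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>

#define NTHREADS 8
#define ITERS    (1UL << 22)

static uint64_t shared_word;    /* a single cache line, fought over */

static void *bouncer(void *arg)
{
    int cpu = (int)(long)arg;
    unsigned long i;
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    for (i = 0; i < ITERS; i++) {
        uint64_t old = __atomic_load_n(&shared_word, __ATOMIC_RELAXED);

        /* lock cmpxchgq: pulls the line exclusive into this core */
        __atomic_compare_exchange_n(&shared_word, &old, old + 1, 0,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }
    return NULL;
}

int main(void)
{
    pthread_t thr[NTHREADS];
    int i;

    /* CPUs 0,8,16,...,56 sit on different NUMA nodes on the box above */
    for (i = 0; i < NTHREADS; i++)
        pthread_create(&thr[i], NULL, bouncer, (void *)(long)(i * 8));
    for (i = 0; i < NTHREADS; i++)
        pthread_join(thr[i], NULL);

    printf("final value: %lu\n", (unsigned long)shared_word);
    return 0;
}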
The question is the following:
What am I missing, and why are my results so contradictory?
PS: I tried the same measurements on a smaller Intel machine and the
results are quite explainable:
Nodes N0 N1 CPUs operation min max avg stdev
CPUs **** ---- 4 cmpxchg 72.287 72.322 72.310 0.016
CPUs **-- **-- 4 cmpxchg 116.803 121.450 119.108 2.658
CPUs ---- **** 4 cmpxchg 72.327 72.333 72.330 0.003
--
Roman