
Cache contention and NUMA nodes locality

Question asked by rouming on Apr 5, 2016

Hi, all.

 

I decided to measure the execution time of some atomic instructions in order
to get a deeper understanding of the penalty associated with memory access
to remote NUMA nodes.

Eventually I got results for a set of instructions which contradict NUMA
theory: in my case, access to remote nodes is faster.  But let me describe
everything in order.

 

1. My configuration is the following:

 

root@server:~# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             4
NUMA node(s):          8
Vendor ID:             AuthenticAMD
CPU family:            21
Model:                 2
Model name:            AMD Opteron(tm) Processor 6386 SE
Stepping:              0
CPU MHz:               2807.199
BogoMIPS:              5616.32
Virtualization:        AMD-V
L1d cache:             16K
L1i cache:             64K
L2 cache:              2048K
L3 cache:              6144K
NUMA node0 CPU(s):     0-7
NUMA node1 CPU(s):     8-15
NUMA node2 CPU(s):     16-23
NUMA node3 CPU(s):     24-31
NUMA node4 CPU(s):     32-39
NUMA node5 CPU(s):     40-47
NUMA node6 CPU(s):     48-55
NUMA node7 CPU(s):     56-63


root@server:~# numactl -H
available: 8 nodes (0-7)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 16045 MB
node 0 free: 15419 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16127 MB
node 1 free: 16062 MB
node 2 cpus: 16 17 18 19 20 21 22 23
node 2 size: 16127 MB
node 2 free: 16085 MB
node 3 cpus: 24 25 26 27 28 29 30 31
node 3 size: 16127 MB
node 3 free: 16031 MB
node 4 cpus: 32 33 34 35 36 37 38 39
node 4 size: 16127 MB
node 4 free: 16083 MB
node 5 cpus: 40 41 42 43 44 45 46 47
node 5 size: 16127 MB
node 5 free: 16060 MB
node 6 cpus: 48 49 50 51 52 53 54 55
node 6 size: 16127 MB
node 6 free: 16080 MB
node 7 cpus: 56 57 58 59 60 61 62 63
node 7 size: 16126 MB
node 7 free: 16050 MB
node distances:
node   0   1   2   3   4   5   6   7 
  0:  10  16  16  22  16  22  16  22 
  1:  16  10  22  16  22  16  22  16 
  2:  16  22  10  16  16  22  16  22 
  3:  22  16  16  10  22  16  22  16 
  4:  16  22  16  22  10  16  16  22 
  5:  22  16  22  16  16  10  22  16 
  6:  16  22  16  22  16  22  10  16 
  7:  22  16  22  16  22  16  16  10 

 

 

2. There is a small userspace tool of mine which aims to burn different
   CPUs and measure the execution time in nsecs:

      https://github.com/rouming/ccont

   Basically, what this tool does is the following:

      o allocates a chunk of page-aligned memory on NUMA node #0.

      o starts the desired number of threads and pins them to certain CPUs,
        which belong to different NUMA nodes (which CPUs and nodes to use
        is defined by bitmasks passed as tool parameters).

      o burns those CPUs, which execute the same instruction in a loop
        (which instruction to execute is also defined by a tool parameter).

   The tool provides a set of loads, which define what CPUs and nodes to
   occupy.  E.g.:

 

     o cpu-increase - on each iteration a new thread is created and pinned
                      to the next CPU to burn.  This load shows the
                      performance degradation in case of cache line
                      bouncing.

     o node-cascade - on each iteration the CPUs of the next node are
                      burned.  This load shows the performance difference
                      on different nodes.

     o cpu-rollover - on each iteration the executor threads roll over to
                      CPUs on the next node, always keeping the same number
                      of CPUs.  This load shows the performance difference
                      when threads are spread over all CPUs or concentrated
                      on one specific node.

 

3. Here I would like to show the results which look strange to me.  I hope
   someone can help clarify what happens on my machine.  The following are
   the CPU operations which I measured:

     o memset256 - memset of 256 bytes.

     o `cmpxchgq` instruction - atomically compares and exchanges an
                                unsigned long (64-bit on gcc).

 

   Results of doing a memset of 256 bytes on the memory chunk, which is
   allocated on NUMA node N0:

 

     # Burn 8 CPUs on local node N0:

     root@server:~# ./ccont -o memset256 -n 0

     Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

      CPUs ******** -------- -------- -------- -------- -------- -------- --------    8    memset256  1290.655  1343.644  1326.663    22.636

 

     # Burn 8 CPUs on distant node N7:

     root@server:~# ./ccont -o memset256 -n 7

     Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

      CPUs -------- -------- -------- -------- -------- -------- -------- ********    8    memset256  1800.407  1800.966  1800.785     0.221

 

     # Burn 7 CPUs, one on each of the nodes N0-N6:

     root@server:~# ./ccont -o memset256 -c 0,8,16,24,32,40,48

     Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

      CPUs *------- *------- *------- *------- *------- *------- *------- --------    7    memset256  2596.499  2662.810  2642.929    30.703

 

   Summary:
     The results are as expected: the avg column shows the average execution
     time in nsecs, and local node access is much faster.

 

   The results for the `cmpxchgq` instruction are the following:

 

     # Burn 8 CPUs on local node N0:

     root@server:~# ./ccont -op cmpxchg --nodes 0

     Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

      CPUs ******** -------- -------- -------- -------- -------- -------- --------    8      cmpxchg   744.714   745.481   745.176     0.281

 

     # Burn 8 CPUs on distant node #7:

     root@server:~# ./ccont -op cmpxchg --nodes 7

     Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

      CPUs -------- -------- -------- -------- -------- -------- -------- ********    8      cmpxchg   497.700   613.177   553.036    53.784

 

     # Burn 8 CPUs on all nodes:

     root@server:~# ./ccont -op cmpxchg --cpu 0,8,16,24,32,40,48,56

     Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

      CPUs *------- *------- *------- *------- *------- *------- *------- *-------    8      cmpxchg   317.938   476.666   411.078    70.578

 

   Summary:
     Local memory access is quite expensive compared to access from the
     remote node N7, and the spread access shows the best time of all.

     I can't explain these results.  According to my understanding, local
     access should be faster and access from different nodes should be the
     worst because of cache line bouncing, but the numbers show the
     opposite.

 

The question is the following:
     What am I missing, and why are my results so contradictory?

 

PS: I tried the same measurements on a smaller Intel machine, and there the
    results are quite explainable:

    Nodes  N0   N1  CPUs    operation       min       max       avg     stdev
     CPUs **** ----    4      cmpxchg    72.287    72.322    72.310     0.016
     CPUs **-- **--    4      cmpxchg   116.803   121.450   119.108     2.658
     CPUs ---- ****    4      cmpxchg    72.327    72.333    72.330     0.003

 

 

--

Roman
