    Cache contention and NUMA nodes locality

    rouming

      Hi, all.

       

      I decided to measure the execution time of some atomic instructions in
      order to get a deeper understanding of the penalty associated with memory
      access to remote NUMA nodes.

       

      Eventually I got results for a set of instructions which contradict NUMA
      theory: in my case access to remote nodes is faster.  But let me describe
      everything in order.

       

      1. My configuration is the following:

       

      root@server:~# lscpu
      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                64
      On-line CPU(s) list:   0-63
      Thread(s) per core:    2
      Core(s) per socket:    8
      Socket(s):             4
      NUMA node(s):          8
      Vendor ID:             AuthenticAMD
      CPU family:            21
      Model:                 2
      Model name:            AMD Opteron(tm) Processor 6386 SE
      Stepping:              0
      CPU MHz:               2807.199
      BogoMIPS:              5616.32
      Virtualization:        AMD-V
      L1d cache:             16K
      L1i cache:             64K
      L2 cache:              2048K
      L3 cache:              6144K
      NUMA node0 CPU(s):     0-7
      NUMA node1 CPU(s):     8-15
      NUMA node2 CPU(s):     16-23
      NUMA node3 CPU(s):     24-31
      NUMA node4 CPU(s):     32-39
      NUMA node5 CPU(s):     40-47
      NUMA node6 CPU(s):     48-55
      NUMA node7 CPU(s):     56-63
      
      
      root@server:~# numactl -H
      available: 8 nodes (0-7)
      node 0 cpus: 0 1 2 3 4 5 6 7
      node 0 size: 16045 MB
      node 0 free: 15419 MB
      node 1 cpus: 8 9 10 11 12 13 14 15
      node 1 size: 16127 MB
      node 1 free: 16062 MB
      node 2 cpus: 16 17 18 19 20 21 22 23
      node 2 size: 16127 MB
      node 2 free: 16085 MB
      node 3 cpus: 24 25 26 27 28 29 30 31
      node 3 size: 16127 MB
      node 3 free: 16031 MB
      node 4 cpus: 32 33 34 35 36 37 38 39
      node 4 size: 16127 MB
      node 4 free: 16083 MB
      node 5 cpus: 40 41 42 43 44 45 46 47
      node 5 size: 16127 MB
      node 5 free: 16060 MB
      node 6 cpus: 48 49 50 51 52 53 54 55
      node 6 size: 16127 MB
      node 6 free: 16080 MB
      node 7 cpus: 56 57 58 59 60 61 62 63
      node 7 size: 16126 MB
      node 7 free: 16050 MB
      node distances:
      node   0   1   2   3   4   5   6   7 
        0:  10  16  16  22  16  22  16  22 
        1:  16  10  22  16  22  16  22  16 
        2:  16  22  10  16  16  22  16  22 
        3:  22  16  16  10  22  16  22  16 
        4:  16  22  16  22  10  16  16  22 
        5:  22  16  22  16  16  10  22  16 
        6:  16  22  16  22  16  22  10  16 
        7:  22  16  22  16  22  16  16  10 
      

       

       

      2. There is a small userspace tool of mine which aims to burn different

         CPUs and measure the execution time in nsecs:

       

            https://github.com/rouming/ccont

       

         Basically, what this tool does is the following (a rough sketch of the
         allocation and pinning steps follows the list):

            o allocates a chunk of page-aligned memory on NUMA node #0.

            o starts the desired number of threads and pins them to certain
              CPUs, which belong to different NUMA nodes (which CPUs and nodes
              to use is defined by bitmasks passed as tool parameters).

            o burns those CPUs, which execute the same instruction in a loop
              (which instruction to execute is also defined by a tool
              parameter).
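
         The following is a minimal sketch of the allocation and pinning
         steps, assuming libnuma and pthreads (the actual ccont code may
         differ, e.g. it could use mmap + mbind instead of numa_alloc_onnode):

            /* sketch.c - allocate on node 0, pin a thread, burn a loop.
             * Build with: gcc -O2 -pthread sketch.c -lnuma */
            #define _GNU_SOURCE
            #include <pthread.h>
            #include <sched.h>
            #include <stdio.h>
            #include <numa.h>

            static void *shared;               /* chunk bound to NUMA node 0 */

            static void *burn(void *arg)
            {
                    long cpu = (long)arg;
                    cpu_set_t set;

                    /* Pin this executor thread to a single CPU. */
                    CPU_ZERO(&set);
                    CPU_SET(cpu, &set);
                    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

                    /* ... execute the measured operation on 'shared' in a loop ... */
                    return NULL;
            }

            int main(void)
            {
                    pthread_t t;

                    /* Page-aligned allocation bound to NUMA node 0. */
                    shared = numa_alloc_onnode(4096, 0);

                    /* One executor thread pinned to CPU 8 (node 1 on the box above). */
                    pthread_create(&t, NULL, burn, (void *)8L);
                    pthread_join(t, NULL);

                    numa_free(shared, 4096);
                    return 0;
            }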

       

         The tool provides a set of loads, which define what CPUs and nodes to
         occupy.  E.g.:

           o cpu-increase - on each iteration a new thread is created and
                            pinned to the next CPU to burn.  This load shows
                            the performance degradation caused by cache line
                            bouncing (a standalone demo of that effect is
                            sketched right after this list).

           o node-cascade - on each iteration the CPUs of the next node are
                            burned.  This load shows the performance
                            difference between nodes.

           o cpu-rollover - on each iteration the executor threads roll over
                            to the CPUs of the next node, always keeping the
                            same number of CPUs busy.  This load shows the
                            performance difference between threads spread over
                            all nodes and threads concentrated on one specific
                            node.
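
         As an illustration of the cache line bouncing mentioned for
         cpu-increase, here is a small standalone demo (not part of ccont):
         two threads are pinned to the given CPUs and hammer the same cache
         line with an atomic add, so the line keeps migrating between their
         caches.  The CPU numbers are just examples for the box above:

            /* bounce.c - build with: gcc -O2 -pthread bounce.c */
            #define _GNU_SOURCE
            #include <pthread.h>
            #include <sched.h>
            #include <stdio.h>
            #include <time.h>

            #define ITERS 10000000UL

            static unsigned long counter;      /* the contended cache line */

            static void *burn(void *arg)
            {
                    long cpu = (long)arg;
                    cpu_set_t set;

                    CPU_ZERO(&set);
                    CPU_SET(cpu, &set);
                    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

                    /* Atomic RMW on the shared line forces it to bounce. */
                    for (unsigned long i = 0; i < ITERS; i++)
                            __sync_fetch_and_add(&counter, 1);
                    return NULL;
            }

            int main(void)
            {
                    long cpus[2] = { 0, 8 };   /* e.g. node 0 vs node 1 */
                    pthread_t t[2];
                    struct timespec a, b;
                    double ns;

                    clock_gettime(CLOCK_MONOTONIC, &a);
                    for (int i = 0; i < 2; i++)
                            pthread_create(&t[i], NULL, burn, (void *)cpus[i]);
                    for (int i = 0; i < 2; i++)
                            pthread_join(t[i], NULL);
                    clock_gettime(CLOCK_MONOTONIC, &b);

                    ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
                    printf("%.3f ns per atomic add\n", ns / ITERS);
                    return 0;
            }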

       

      3. Here I would like to show the results which look strange to me.  I
         hope someone will help clarify what happens on my machine.  The
         following are the CPU operations I measured (a rough sketch of how
         both can be expressed and timed is given after the list):

       

           o memset256 - a memset of 256 bytes.

           o `cmpxchgq` instruction - atomically compares and exchanges an
                                      unsigned long (64 bits on gcc).
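
         As a rough sketch of how these two operations can be expressed and
         how the per-operation time in nsecs can be derived (assuming gcc
         builtins; ccont itself may use inline assembly for `lock cmpxchgq`):

            /* ops.c - build with: gcc -O2 ops.c */
            #include <stdio.h>
            #include <string.h>
            #include <time.h>

            #define ITERS 1000000UL

            /* memset256: clear 256 bytes of the shared chunk. */
            static void op_memset256(void *mem)
            {
                    memset(mem, 0, 256);
            }

            /* cmpxchg: __sync_val_compare_and_swap() on an unsigned long
             * compiles to a `lock cmpxchgq` on x86_64. */
            static void op_cmpxchg(void *mem)
            {
                    unsigned long *p = mem;

                    __sync_val_compare_and_swap(p, 0UL, 1UL);
            }

            /* Run an operation ITERS times and report the average in nsecs. */
            static double ns_per_op(void (*op)(void *), void *mem)
            {
                    struct timespec a, b;
                    unsigned long i;

                    clock_gettime(CLOCK_MONOTONIC, &a);
                    for (i = 0; i < ITERS; i++)
                            op(mem);
                    clock_gettime(CLOCK_MONOTONIC, &b);

                    return ((b.tv_sec - a.tv_sec) * 1e9 +
                            (b.tv_nsec - a.tv_nsec)) / ITERS;
            }

            int main(void)
            {
                    /* Stand-in for the chunk allocated on node 0. */
                    static unsigned long buf[32];     /* 256 bytes */

                    printf("memset256: %.3f ns\n", ns_per_op(op_memset256, buf));
                    printf("cmpxchg:   %.3f ns\n", ns_per_op(op_cmpxchg, buf));
                    return 0;
            }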

       

         Results of doing a memset of 256 bytes on the memory chunk, which is
         allocated on NUMA node N0:

       

           # Burn 8 CPUs on local node N0:

           root@server:~# ./ccont -o memset256 -n 0

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

            CPUs ******** -------- -------- -------- -------- -------- -------- --------    8    memset256  1290.655  1343.644  1326.663    22.636

       

           # Burn 8 CPUs on distant node N7:

           root@server:~# ./ccont -o memset256 -n 7

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

             CPUs -------- -------- -------- -------- -------- -------- -------- ********    8    memset256  1800.407  1800.966  1800.785     0.221

       

           # Burn 7 CPUs, one on each of nodes N0-N6:

           root@server:~# ./ccont -o memset256 -c 0,8,16,24,32,40,48

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

            CPUs *------- *------- *------- *------- *------- *------- *------- --------    7    memset256  2596.499  2662.810  2642.929    30.703

       

         Summary:
           These results are as expected: the avg column shows the average
           execution time in nsecs, and local node access is much faster.

       

         Results of the 'cmpxchgq' instruction are the following:

       

           # Burn 8 CPUs on local node N0:

           root@server:~# ./ccont -op cmpxchg --nodes 0

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

            CPUs ******** -------- -------- -------- -------- -------- -------- --------    8      cmpxchg   744.714   745.481   745.176     0.281

       

           # Burn 8 CPUs on distant node N7:

           root@server:~# ./ccont -op cmpxchg --nodes 7

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

            CPUs -------- -------- -------- -------- -------- -------- -------- ********    8      cmpxchg   497.700   613.177   553.036    53.784

       

           # Burn 8 CPUs, one on each node:

           root@server:~# ./ccont -op cmpxchg --cpu 0,8,16,24,32,40,48,56

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

            CPUs *------- *------- *------- *------- *------- *------- *------- *-------    8      cmpxchg   317.938   476.666   411.078    70.578

       

         Summary:
           Local memory access is quite expensive compared to access from the
           remote node N7, and the spread access shows the best time of all.

           I can't explain these results.  According to my understanding, local
           access should be faster, and access from different nodes should be
           the worst because of cache line bouncing, but the numbers are the
           other way around.

       

      The question is the following:
           What am I missing, and why are my results so contradictory?

       

      PS: I tried the same measurements on a smaller Intel machine, and there
          the results are quite understandable:

       

          Nodes  N0   N1  CPUs    operation       min       max       avg     stdev

           CPUs **** ----    4      cmpxchg    72.287    72.322    72.310     0.016

           CPUs **-- **--    4      cmpxchg   116.803   121.450   119.108     2.658

           CPUs ---- ****    4      cmpxchg    72.327    72.333    72.330     0.003

       

       

      --

      Roman