    Cache contention and NUMA nodes locality

    rouming

      Hi, all.

       

      I decided to measure the execution time of some atomic instructions in
      order to get a deeper understanding of the penalty associated with memory
      access to remote NUMA nodes.

       

      Eventually I got results for a set of instructions which contradict NUMA
      theory: in my case access to remote nodes is faster.  But let me describe
      everything in order.

       

      1. My configuration is the following:

       

      root@server:~# lscpu
      Architecture:          x86_64
      CPU op-mode(s):        32-bit, 64-bit
      Byte Order:            Little Endian
      CPU(s):                64
      On-line CPU(s) list:   0-63
      Thread(s) per core:    2
      Core(s) per socket:    8
      Socket(s):             4
      NUMA node(s):          8
      Vendor ID:             AuthenticAMD
      CPU family:            21
      Model:                 2
      Model name:            AMD Opteron(tm) Processor 6386 SE
      Stepping:              0
      CPU MHz:               2807.199
      BogoMIPS:              5616.32
      Virtualization:        AMD-V
      L1d cache:             16K
      L1i cache:             64K
      L2 cache:              2048K
      L3 cache:              6144K
      NUMA node0 CPU(s):     0-7
      NUMA node1 CPU(s):     8-15
      NUMA node2 CPU(s):     16-23
      NUMA node3 CPU(s):     24-31
      NUMA node4 CPU(s):     32-39
      NUMA node5 CPU(s):     40-47
      NUMA node6 CPU(s):     48-55
      NUMA node7 CPU(s):     56-63
      
      
      root@server:~# numactl -H
      available: 8 nodes (0-7)
      node 0 cpus: 0 1 2 3 4 5 6 7
      node 0 size: 16045 MB
      node 0 free: 15419 MB
      node 1 cpus: 8 9 10 11 12 13 14 15
      node 1 size: 16127 MB
      node 1 free: 16062 MB
      node 2 cpus: 16 17 18 19 20 21 22 23
      node 2 size: 16127 MB
      node 2 free: 16085 MB
      node 3 cpus: 24 25 26 27 28 29 30 31
      node 3 size: 16127 MB
      node 3 free: 16031 MB
      node 4 cpus: 32 33 34 35 36 37 38 39
      node 4 size: 16127 MB
      node 4 free: 16083 MB
      node 5 cpus: 40 41 42 43 44 45 46 47
      node 5 size: 16127 MB
      node 5 free: 16060 MB
      node 6 cpus: 48 49 50 51 52 53 54 55
      node 6 size: 16127 MB
      node 6 free: 16080 MB
      node 7 cpus: 56 57 58 59 60 61 62 63
      node 7 size: 16126 MB
      node 7 free: 16050 MB
      node distances:
      node   0   1   2   3   4   5   6   7 
        0:  10  16  16  22  16  22  16  22 
        1:  16  10  22  16  22  16  22  16 
        2:  16  22  10  16  16  22  16  22 
        3:  22  16  16  10  22  16  22  16 
        4:  16  22  16  22  10  16  16  22 
        5:  22  16  22  16  16  10  22  16 
        6:  16  22  16  22  16  22  10  16 
        7:  22  16  22  16  22  16  16  10 
      

       

       

      2. There is a small userspace tool of mine which aims to burn different

         CPUs and measure the execution time in nsecs:

       

            https://github.com/rouming/ccont

       

         Basically, what this tool does is the following (a rough sketch of the
         allocation and pinning steps follows the list):

            o allocates a chunk of page-aligned memory on NUMA node #0.

            o starts the desired number of threads and pins them to certain
              CPUs, which belong to different NUMA nodes (which CPUs and nodes
              to use is defined by bitmasks passed as tool parameters).

            o burns those CPUs, which execute the same instruction in a loop
              (which instruction to execute is also defined by a tool
              parameter).
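
         The following is a minimal sketch of the allocation and pinning
         steps, assuming libnuma and pthreads (the actual ccont code may
         differ, e.g. it could use mmap + mbind instead of numa_alloc_onnode):

            /* sketch.c - allocate on node 0, pin a thread, burn a loop.
             * Build with: gcc -O2 -pthread sketch.c -lnuma */
            #define _GNU_SOURCE
            #include <pthread.h>
            #include <sched.h>
            #include <stdio.h>
            #include <numa.h>

            static void *shared;               /* chunk bound to NUMA node 0 */

            static void *burn(void *arg)
            {
                    long cpu = (long)arg;
                    cpu_set_t set;

                    /* Pin this executor thread to a single CPU. */
                    CPU_ZERO(&set);
                    CPU_SET(cpu, &set);
                    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

                    /* ... execute the measured operation on 'shared' in a loop ... */
                    return NULL;
            }

            int main(void)
            {
                    pthread_t t;

                    /* Page-aligned allocation bound to NUMA node 0. */
                    shared = numa_alloc_onnode(4096, 0);

                    /* One executor thread pinned to CPU 8 (node 1 on the box above). */
                    pthread_create(&t, NULL, burn, (void *)8L);
                    pthread_join(t, NULL);

                    numa_free(shared, 4096);
                    return 0;
            }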

       

         The tool provides a set of loads, which define what CPUs and nodes to
         occupy.  E.g.:

           o cpu-increase - on each iteration a new thread is created and
                            pinned to the next CPU to burn.  This load shows
                            the performance degradation caused by cache line
                            bouncing (a standalone demo of that effect is
                            sketched right after this list).

           o node-cascade - on each iteration the CPUs of the next node are
                            burned.  This load shows the performance
                            difference between nodes.

           o cpu-rollover - on each iteration the executor threads roll over
                            to the CPUs of the next node, always keeping the
                            same number of CPUs busy.  This load shows the
                            performance difference between threads spread over
                            all nodes and threads concentrated on one specific
                            node.
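
         As an illustration of the cache line bouncing mentioned for
         cpu-increase, here is a small standalone demo (not part of ccont):
         two threads are pinned to the given CPUs and hammer the same cache
         line with an atomic add, so the line keeps migrating between their
         caches.  The CPU numbers are just examples for the box above:

            /* bounce.c - build with: gcc -O2 -pthread bounce.c */
            #define _GNU_SOURCE
            #include <pthread.h>
            #include <sched.h>
            #include <stdio.h>
            #include <time.h>

            #define ITERS 10000000UL

            static unsigned long counter;      /* the contended cache line */

            static void *burn(void *arg)
            {
                    long cpu = (long)arg;
                    cpu_set_t set;

                    CPU_ZERO(&set);
                    CPU_SET(cpu, &set);
                    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

                    /* Atomic RMW on the shared line forces it to bounce. */
                    for (unsigned long i = 0; i < ITERS; i++)
                            __sync_fetch_and_add(&counter, 1);
                    return NULL;
            }

            int main(void)
            {
                    long cpus[2] = { 0, 8 };   /* e.g. node 0 vs node 1 */
                    pthread_t t[2];
                    struct timespec a, b;
                    double ns;

                    clock_gettime(CLOCK_MONOTONIC, &a);
                    for (int i = 0; i < 2; i++)
                            pthread_create(&t[i], NULL, burn, (void *)cpus[i]);
                    for (int i = 0; i < 2; i++)
                            pthread_join(t[i], NULL);
                    clock_gettime(CLOCK_MONOTONIC, &b);

                    ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
                    printf("%.3f ns per atomic add\n", ns / ITERS);
                    return 0;
            }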

       

      3. Here I would like to show the results which look strange to me.  I
         hope someone will help clarify what happens on my machine.  The
         following are the CPU operations I measured (a rough sketch of how
         both can be expressed and timed is given after the list):

       

           o memset256 - a memset of 256 bytes.

           o `cmpxchgq` instruction - atomically compares and exchanges an
                                      unsigned long (64 bits on gcc).
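
         As a rough sketch of how these two operations can be expressed and
         how the per-operation time in nsecs can be derived (assuming gcc
         builtins; ccont itself may use inline assembly for `lock cmpxchgq`):

            /* ops.c - build with: gcc -O2 ops.c */
            #include <stdio.h>
            #include <string.h>
            #include <time.h>

            #define ITERS 1000000UL

            /* memset256: clear 256 bytes of the shared chunk. */
            static void op_memset256(void *mem)
            {
                    memset(mem, 0, 256);
            }

            /* cmpxchg: __sync_val_compare_and_swap() on an unsigned long
             * compiles to a `lock cmpxchgq` on x86_64. */
            static void op_cmpxchg(void *mem)
            {
                    unsigned long *p = mem;

                    __sync_val_compare_and_swap(p, 0UL, 1UL);
            }

            /* Run an operation ITERS times and report the average in nsecs. */
            static double ns_per_op(void (*op)(void *), void *mem)
            {
                    struct timespec a, b;
                    unsigned long i;

                    clock_gettime(CLOCK_MONOTONIC, &a);
                    for (i = 0; i < ITERS; i++)
                            op(mem);
                    clock_gettime(CLOCK_MONOTONIC, &b);

                    return ((b.tv_sec - a.tv_sec) * 1e9 +
                            (b.tv_nsec - a.tv_nsec)) / ITERS;
            }

            int main(void)
            {
                    /* Stand-in for the chunk allocated on node 0. */
                    static unsigned long buf[32];     /* 256 bytes */

                    printf("memset256: %.3f ns\n", ns_per_op(op_memset256, buf));
                    printf("cmpxchg:   %.3f ns\n", ns_per_op(op_cmpxchg, buf));
                    return 0;
            }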

       

         Results of doing a memset of 256 bytes on the memory chunk, which is
         allocated on NUMA node N0:

       

           # Burn 8 CPUs on local node N0:

           root@server:~# ./ccont -o memset256 -n 0

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

            CPUs ******** -------- -------- -------- -------- -------- -------- --------    8    memset256  1290.655  1343.644  1326.663    22.636

       

           # Burn 8 CPUs on distant node N7:

           root@server:~# ./ccont -o memset256 -n 7

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

             CPUs -------- -------- -------- -------- -------- -------- -------- ********    8    memset256  1800.407  1800.966  1800.785     0.221

       

           # Burn 7 CPUs, one on each of nodes N0-N6:

           root@server:~# ./ccont -o memset256 -c 0,8,16,24,32,40,48

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

            CPUs *------- *------- *------- *------- *------- *------- *------- --------    7    memset256  2596.499  2662.810  2642.929    30.703

       

         Summary:
           These results are as expected: the avg column shows the average
           execution time in nsecs, and local node access is much faster.

       

         Results of the 'cmpxchgq' instruction are the following:

       

           # Burn 8 CPUs on local node N0:

           root@server:~# ./ccont -op cmpxchg --nodes 0

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

            CPUs ******** -------- -------- -------- -------- -------- -------- --------    8      cmpxchg   744.714   745.481   745.176     0.281

       

           # Burn 8 CPUs on distant node N7:

           root@server:~# ./ccont -op cmpxchg --nodes 7

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

            CPUs -------- -------- -------- -------- -------- -------- -------- ********    8      cmpxchg   497.700   613.177   553.036    53.784

       

           # Burn 8 CPUs, one on each node:

           root@server:~# ./ccont -op cmpxchg --cpu 0,8,16,24,32,40,48,56

           Nodes    N0       N1       N2       N3       N4       N5       N6       N7    CPUs    operation       min       max       avg     stdev

            CPUs *------- *------- *------- *------- *------- *------- *------- *-------    8      cmpxchg   317.938   476.666   411.078    70.578

       

         Summary:
           Local memory access is quite expensive compared to access from the
           remote node N7, and the spread access shows the best time of all.

           I can't explain these results.  According to my understanding, local
           access should be faster, and access from different nodes should be
           the worst because of cache line bouncing, but the numbers are the
           other way around.

       

      The question is the following:
           What am I missing, and why are my results so contradictory?

       

      PS: I tried the same measurements on a smaller Intel machine, and there
          the results are quite understandable:

       

          Nodes  N0   N1  CPUs    operation       min       max       avg     stdev

           CPUs **** ----    4      cmpxchg    72.287    72.322    72.310     0.016

           CPUs **-- **--    4      cmpxchg   116.803   121.450   119.108     2.658

           CPUs ---- ****    4      cmpxchg    72.327    72.333    72.330     0.003

       

       

      --

      Roman