tinux
Adept I

poor bandwidth on dual Radeon Pro VII GPUs

Hello everyone,

We recently added a second Radeon Pro VII to our simulation system. Unfortunately, the GPUs do not seem to talk to each other, even though they are directly connected with an Infinity Fabric Link Bridge.

The system usually runs Arch Linux, where I also started a discussion about the issue, but testing with Ubuntu shows the same behavior. Everything posted here was done on the Ubuntu system. I have also opened an issue on GitHub, but it has seemingly gone unnoticed for three weeks now, so help here would be appreciated.

system

hardware setup
  • GPUs: 2 AMD Radeon Pro VII
  • CPU: AMD Ryzen Threadripper 2950X
  • mainboard: Asus X399-A

The GPUs are connected with an Infinity Fabric Link Bridge.

software
  • OS: Ubuntu 20.04.3
  • kernel: 5.11
  • ROCm: installed via "sudo amdgpu-install --usecase=rocm", with "amdgpu-install" from here

other requirements

I did verify that the critical requirements according to the ROCm supported hardware page are met, e.g. the hardware itself (see above), but also the following:

IOMMU
$ sudo dmesg | grep -i iommu
[    0.271162] iommu: Default domain type: Translated 
[    0.471020] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    0.471076] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    0.471120] pci 0000:00:01.0: Adding to iommu group 0
[    0.471133] pci 0000:00:01.1: Adding to iommu group 1
[    0.471146] pci 0000:00:01.2: Adding to iommu group 2
[    0.471166] pci 0000:00:02.0: Adding to iommu group 3
[    0.471183] pci 0000:00:03.0: Adding to iommu group 4
[    0.471195] pci 0000:00:03.1: Adding to iommu group 5
[    0.471213] pci 0000:00:04.0: Adding to iommu group 6
[    0.471231] pci 0000:00:07.0: Adding to iommu group 7
[    0.471243] pci 0000:00:07.1: Adding to iommu group 8
[    0.471261] pci 0000:00:08.0: Adding to iommu group 9
[    0.471273] pci 0000:00:08.1: Adding to iommu group 10
[    0.471297] pci 0000:00:14.0: Adding to iommu group 11
[    0.471308] pci 0000:00:14.3: Adding to iommu group 11
[    0.471368] pci 0000:00:18.0: Adding to iommu group 12
[    0.471379] pci 0000:00:18.1: Adding to iommu group 12
[    0.471390] pci 0000:00:18.2: Adding to iommu group 12
[    0.471401] pci 0000:00:18.3: Adding to iommu group 12
[    0.471414] pci 0000:00:18.4: Adding to iommu group 12
[    0.471425] pci 0000:00:18.5: Adding to iommu group 12
[    0.471436] pci 0000:00:18.6: Adding to iommu group 12
[    0.471447] pci 0000:00:18.7: Adding to iommu group 12
[    0.471506] pci 0000:00:19.0: Adding to iommu group 13
[    0.471517] pci 0000:00:19.1: Adding to iommu group 13
[    0.471529] pci 0000:00:19.2: Adding to iommu group 13
[    0.471542] pci 0000:00:19.3: Adding to iommu group 13
[    0.471553] pci 0000:00:19.4: Adding to iommu group 13
[    0.471565] pci 0000:00:19.5: Adding to iommu group 13
[    0.471577] pci 0000:00:19.6: Adding to iommu group 13
[    0.471588] pci 0000:00:19.7: Adding to iommu group 13
[    0.471622] pci 0000:01:00.0: Adding to iommu group 14
[    0.471635] pci 0000:01:00.1: Adding to iommu group 14
[    0.471649] pci 0000:01:00.2: Adding to iommu group 14
[    0.471654] pci 0000:02:00.0: Adding to iommu group 14
[    0.471658] pci 0000:02:01.0: Adding to iommu group 14
[    0.471662] pci 0000:02:02.0: Adding to iommu group 14
[    0.471666] pci 0000:02:03.0: Adding to iommu group 14
[    0.471670] pci 0000:02:04.0: Adding to iommu group 14
[    0.471674] pci 0000:02:09.0: Adding to iommu group 14
[    0.471678] pci 0000:05:00.0: Adding to iommu group 14
[    0.471683] pci 0000:08:00.0: Adding to iommu group 14
[    0.471695] pci 0000:09:00.0: Adding to iommu group 15
[    0.471707] pci 0000:0a:00.0: Adding to iommu group 16
[    0.471719] pci 0000:0b:00.0: Adding to iommu group 17
[    0.471744] pci 0000:0c:00.0: Adding to iommu group 18
[    0.471759] pci 0000:0c:00.1: Adding to iommu group 19
[    0.471772] pci 0000:0d:00.0: Adding to iommu group 20
[    0.471784] pci 0000:0d:00.2: Adding to iommu group 21
[    0.471798] pci 0000:0d:00.3: Adding to iommu group 22
[    0.471810] pci 0000:0e:00.0: Adding to iommu group 23
[    0.471825] pci 0000:0e:00.2: Adding to iommu group 24
[    0.471838] pci 0000:0e:00.3: Adding to iommu group 25
[    0.471856] pci 0000:40:01.0: Adding to iommu group 26
[    0.471872] pci 0000:40:02.0: Adding to iommu group 27
[    0.471890] pci 0000:40:03.0: Adding to iommu group 28
[    0.471902] pci 0000:40:03.1: Adding to iommu group 29
[    0.471920] pci 0000:40:04.0: Adding to iommu group 30
[    0.471937] pci 0000:40:07.0: Adding to iommu group 31
[    0.471949] pci 0000:40:07.1: Adding to iommu group 32
[    0.471968] pci 0000:40:08.0: Adding to iommu group 33
[    0.471981] pci 0000:40:08.1: Adding to iommu group 34
[    0.471994] pci 0000:41:00.0: Adding to iommu group 35
[    0.472006] pci 0000:42:00.0: Adding to iommu group 36
[    0.472031] pci 0000:43:00.0: Adding to iommu group 37
[    0.472048] pci 0000:43:00.1: Adding to iommu group 38
[    0.472061] pci 0000:44:00.0: Adding to iommu group 39
[    0.472074] pci 0000:44:00.2: Adding to iommu group 40
[    0.472086] pci 0000:44:00.3: Adding to iommu group 41
[    0.472100] pci 0000:45:00.0: Adding to iommu group 42
[    0.472113] pci 0000:45:00.2: Adding to iommu group 43
[    0.502585] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.502595] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.503499] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    0.503517] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    1.017979] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
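To avoid scanning that wall of output by eye, the lines for the two GPUs can be filtered out mechanically. A small sketch (the PCI addresses 0c:00.0 and 43:00.0 are assumed from the lspci and benchmark output in this post):

```shell
# gpu_iommu_lines: keep only the IOMMU group assignments for the two
# GPUs (PCI functions 0c:00.0 and 43:00.0, per the lspci output).
gpu_iommu_lines() {
    grep -E '0000:(0c|43):00\.0: Adding to iommu group'
}
# On the live system:
#   sudo dmesg | gpu_iommu_lines
```

In the log above this yields groups 18 and 37, i.e. each GPU sits alone in its own IOMMU group, so grouping does not look like the culprit here.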
CRAT
$ sudo dmesg | grep -i crat
[    0.000000] ACPI: CRAT 0x0000000077CDE878 001DF8 (v01 AMD    AMD CRAT 00000001 AMD  00000001)
[    0.000000] ACPI: Reserving CRAT table memory at [mem 0x77cde878-0x77ce066f]
[    1.168518] amdgpu: Ignoring ACPI CRAT on non-APU system
[    1.168521] amdgpu: Virtual CRAT table created for CPU
[    2.265620] amdgpu: Virtual CRAT table created for GPU
[    3.261272] amdgpu: Virtual CRAT table created for GPU
Atomics
$ sudo dmesg | grep -i kfd
[    2.177738] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    2.265959] kfd kfd: amdgpu: added device 1002:66a1
[    3.169496] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    3.261636] kfd kfd: amdgpu: added device 1002:66a1

and

$ sudo lspci -vvv -s 43:00.0 | grep Atomic
			 AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
			 AtomicOpsCtl: ReqEn+
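The same atomics check can be run over both cards in one go. A small sketch (the addresses are again assumed from the output above; adjust them to your system):

```shell
# atomic_caps: filter the PCIe atomic-operation capability and control
# lines out of `lspci -vvv` output.
atomic_caps() {
    grep -E 'AtomicOpsCap|AtomicOpsCtl'
}
# On the live system:
#   for bdf in 0c:00.0 43:00.0; do
#       echo "== $bdf =="
#       sudo lspci -vvv -s "$bdf" | atomic_caps
#   done
```

Both GPUs should report "AtomicOpsCtl: ReqEn+" for ROCm to use them as atomics-capable devices.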

issues

It seems the GPUs are not connected to each other, despite the fact that they are physically connected with an Infinity Fabric Link Bridge.

tests with "rocm-smi"
$ sudo rocm-smi --shownodesbw


======================= ROCm System Management Interface =======================
================================== Bandwidth ===================================
       GPU0         GPU1         
GPU0   N/A          0-0          
GPU1   0-0          N/A          
Format: min-max; Units: mps
"0-0" min-max bandwidth indicates devices are not connected dirrectly
============================= End of ROCm SMI Log ==============================

I also ran a few other tests, but I cannot really make sense of their output given the result of the command above.

$ sudo rocm-smi --showtopoaccess


======================= ROCm System Management Interface =======================
===================== Link accessibility between two GPUs ======================
       GPU0         GPU1         
GPU0   True         True         
GPU1   True         True         
============================= End of ROCm SMI Log ==============================

and

$ sudo rocm-smi --showtopo


======================= ROCm System Management Interface =======================
=========================== Weight between two GPUs ============================
       GPU0         GPU1         
GPU0   0            15           
GPU1   15           0            

============================ Hops between two GPUs =============================
       GPU0         GPU1         
GPU0   0            1            
GPU1   1            0            

========================== Link Type between two GPUs ==========================
       GPU0         GPU1         
GPU0   0            XGMI         
GPU1   XGMI         0            

================================== Numa Nodes ==================================
GPU[0]		: (Topology) Numa Node: 0
GPU[0]		: (Topology) Numa Affinity: 4294967295
GPU[1]		: (Topology) Numa Node: 0
GPU[1]		: (Topology) Numa Affinity: 4294967295
============================= End of ROCm SMI Log ==============================
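For scripting this up (e.g. in a health check), the "0-0" cells can be detected mechanically. A small sketch, assuming the table layout shown above; the here-doc sample is copied from this post and would be replaced by `sudo rocm-smi --shownodesbw` on a live system:

```shell
# check_bw: read the `rocm-smi --shownodesbw` table on stdin and flag
# GPU pairs whose min-max bandwidth is "0-0", i.e. not directly linked.
check_bw() {
    awk '/^GPU[0-9]/ {
        for (i = 2; i <= NF; i++)
            if ($i == "0-0")
                printf "%s <-> GPU%d: no direct link\n", $1, i - 2
    }'
}
# Sample input taken from the output above:
check_bw <<'EOF'
       GPU0         GPU1
GPU0   N/A          0-0
GPU1   0-0          N/A
EOF
```

On this system it reports both directions of the GPU0/GPU1 pair as having no direct link, matching the "0-0" note printed by rocm-smi itself.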
other benchmarks

I also ran a benchmark from the RCCL repository, which runs much slower on 2 GPUs than on a single GPU.

2 GPUs
$ sudo ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThreads: 1 nGpus: 2 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   2916 on  ultrafast device  0 [0000:0c:00.0] AMD Radeon (TM) Pro VII
#   Rank  1 Pid   2916 on  ultrafast device  1 [0000:43:00.0] AMD Radeon (TM) Pro VII
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum    24.85    0.00    0.00  0e+00    21.89    0.00    0.00  0e+00
          16             4     float     sum    20.07    0.00    0.00  0e+00    19.69    0.00    0.00  0e+00
          32             8     float     sum    19.91    0.00    0.00  0e+00    19.60    0.00    0.00  0e+00
          64            16     float     sum    19.61    0.00    0.00  0e+00    19.63    0.00    0.00  0e+00
         128            32     float     sum    19.78    0.01    0.01  0e+00    21.51    0.01    0.01  0e+00
         256            64     float     sum    19.76    0.01    0.01  0e+00    19.83    0.01    0.01  0e+00
         512           128     float     sum    19.98    0.03    0.03  0e+00    19.97    0.03    0.03  0e+00
        1024           256     float     sum    35.68    0.03    0.03  0e+00    35.36    0.03    0.03  0e+00
        2048           512     float     sum    20.42    0.10    0.10  0e+00    20.13    0.10    0.10  0e+00
        4096          1024     float     sum    37.20    0.11    0.11  0e+00    37.01    0.11    0.11  0e+00
        8192          2048     float     sum    36.14    0.23    0.23  0e+00    33.72    0.24    0.24  0e+00
       16384          4096     float     sum    33.62    0.49    0.49  0e+00    32.19    0.51    0.51  0e+00
       32768          8192     float     sum    32.93    1.00    1.00  0e+00    32.84    1.00    1.00  0e+00
       65536         16384     float     sum    34.00    1.93    1.93  0e+00    33.47    1.96    1.96  0e+00
      131072         32768     float     sum    35.17    3.73    3.73  0e+00    34.86    3.76    3.76  0e+00
      262144         65536     float     sum    38.97    6.73    6.73  0e+00    38.77    6.76    6.76  0e+00
      524288        131072     float     sum    49.84   10.52   10.52  0e+00    49.69   10.55   10.55  0e+00
     1048576        262144     float     sum    66.13   15.86   15.86  0e+00    65.54   16.00   16.00  0e+00
     2097152        524288     float     sum    97.07   21.61   21.61  0e+00    97.34   21.55   21.55  0e+00
     4194304       1048576     float     sum    160.2   26.19   26.19  0e+00    160.3   26.16   26.16  0e+00
     8388608       2097152     float     sum    284.9   29.45   29.45  0e+00    285.0   29.43   29.43  0e+00
    16777216       4194304     float     sum    532.9   31.48   31.48  0e+00    536.1   31.30   31.30  0e+00
    33554432       8388608     float     sum   1043.1   32.17   32.17  0e+00   1056.0   31.77   31.77  0e+00
    67108864      16777216     float     sum   2072.9   32.37   32.37  0e+00   2074.7   32.35   32.35  0e+00
   134217728      33554432     float     sum   4095.4   32.77   32.77  0e+00   4096.3   32.77   32.77  0e+00
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 9.86367 
#
1 GPU
$ sudo ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThreads: 1 nGpus: 1 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 validation: 1 
#
# Using devices
#   Rank  0 Pid   3122 on  ultrafast device  0 [0000:0c:00.0] AMD Radeon (TM) Pro VII
#
#                                                       out-of-place                       in-place          
#       size         count      type   redop     time   algbw   busbw  error     time   algbw   busbw  error
#        (B)    (elements)                       (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)       
           8             2     float     sum     8.38    0.00    0.00  0e+00     4.89    0.00    0.00  0e+00
          16             4     float     sum     7.77    0.00    0.00  0e+00     5.24    0.00    0.00  0e+00
          32             8     float     sum     7.27    0.00    0.00  0e+00     8.03    0.00    0.00  0e+00
          64            16     float     sum     7.46    0.01    0.00  0e+00     3.99    0.02    0.00  0e+00
         128            32     float     sum     8.42    0.02    0.00  0e+00     3.82    0.03    0.00  0e+00
         256            64     float     sum     7.72    0.03    0.00  0e+00     4.25    0.06    0.00  0e+00
         512           128     float     sum     8.05    0.06    0.00  0e+00     4.40    0.12    0.00  0e+00
        1024           256     float     sum     7.66    0.13    0.00  0e+00     4.01    0.26    0.00  0e+00
        2048           512     float     sum     9.16    0.22    0.00  0e+00     4.56    0.45    0.00  0e+00
        4096          1024     float     sum     7.51    0.55    0.00  0e+00     4.10    1.00    0.00  0e+00
        8192          2048     float     sum     7.88    1.04    0.00  0e+00     3.92    2.09    0.00  0e+00
       16384          4096     float     sum     7.84    2.09    0.00  0e+00     3.71    4.41    0.00  0e+00
       32768          8192     float     sum     7.42    4.42    0.00  0e+00     3.80    8.63    0.00  0e+00
       65536         16384     float     sum     7.45    8.80    0.00  0e+00     4.27   15.35    0.00  0e+00
      131072         32768     float     sum     8.17   16.05    0.00  0e+00     4.47   29.31    0.00  0e+00
      262144         65536     float     sum     9.10   28.81    0.00  0e+00     3.71   70.69    0.00  0e+00
      524288        131072     float     sum    39.66   13.22    0.00  0e+00     3.69  142.27    0.00  0e+00
     1048576        262144     float     sum    12.87   81.45    0.00  0e+00     3.96  264.85    0.00  0e+00
     2097152        524288     float     sum    14.53  144.29    0.00  0e+00     2.92  718.67    0.00  0e+00
     4194304       1048576     float     sum    23.76  176.54    0.00  0e+00     3.21  1308.00    0.00  0e+00
     8388608       2097152     float     sum    36.37  230.62    0.00  0e+00     3.60  2330.23    0.00  0e+00
    16777216       4194304     float     sum    67.07  250.16    0.00  0e+00     3.30  5079.62    0.00  0e+00
    33554432       8388608     float     sum    123.2  272.40    0.00  0e+00     3.19  10509.90    0.00  0e+00
    67108864      16777216     float     sum    240.4  279.14    0.00  0e+00     3.18  21079.55    0.00  0e+00
   134217728      33554432     float     sum    470.8  285.08    0.00  0e+00     4.88  27490.11    0.00  0e+00
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 0 
#

Any help is highly appreciated.

1 Reply
gc9
Adept III

(A search for "ROCm" "xGMI" error turned up a couple of links that sound relevant to xGMI problems.)

https://docs.amd.com/bundle/ROCm-CLI-Guide-v5.0/page/Usage.html

--showxgmierr

Show XGMI error information since last read

--resetxgmierr

Reset XGMI error count


https://www.kernel.org/doc/html/v5.9/gpu/amdgpu.html#amdgpu-xgmi-support

AMDGPU XGMI Support

XGMI is a high speed interconnect that joins multiple GPU cards into a homogeneous memory space that is organized by a collective hive ID and individual node IDs, both of which are 64-bit numbers.

The file xgmi_device_id contains the unique per GPU device ID and is stored in the /sys/class/drm/card${cardno}/device/ directory.

Inside the device directory a sub-directory ‘xgmi_hive_info’ is created which contains the hive ID and the list of nodes.

The hive ID is stored in:

/sys/class/drm/card${cardno}/device/xgmi_hive_info/xgmi_hive_id

The node information is stored in numbered directories:

/sys/class/drm/card${cardno}/device/xgmi_hive_info/node${nodeno}/xgmi_device_id

Each device has its own xgmi_hive_info directory with a mirror set of node sub-directories.

The XGMI memory space is built by contiguously adding the power of two padded VRAM space from each node to each other.
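Based on the sysfs layout described above, here is a sketch that dumps the hive ID and node device IDs for each DRM card. It assumes the GPUs enumerate as card0/card1 under /sys/class/drm; if the bridge were working, both cards should report the same hive ID, and a missing xgmi_hive_info directory would suggest the driver never formed a hive at all:

```shell
# show_xgmi BASE: for every card under BASE, print the XGMI hive ID and
# the per-node device IDs, or note that no hive was formed.
# On a real system, BASE is /sys/class/drm.
show_xgmi() {
    for card in "$1"/card*/device; do
        name=$(basename "$(dirname "$card")")
        hive="$card/xgmi_hive_info/xgmi_hive_id"
        if [ -r "$hive" ]; then
            echo "$name: hive ID $(cat "$hive")"
            cat "$card"/xgmi_hive_info/node*/xgmi_device_id 2>/dev/null
        else
            echo "$name: no xgmi_hive_info (hive not formed)"
        fi
    done
}
show_xgmi /sys/class/drm
```

Comparing the two cards' hive IDs (or noticing the directory is absent) would narrow down whether the driver sees the Infinity Fabric link at all.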