Hello everyone
We recently added a second Radeon Pro VII to our simulation system. Unfortunately, though, it seems the GPUs do not want to talk to each other, although they are directly connected with an Infinity Fabric Link Bridge.
The system usually runs Arch Linux, where I also started a discussion about the issue, but testing with Ubuntu shows the same behaviour. Everything posted here was done on the Ubuntu system. I have also already opened an issue on GitHub. Still, help would be appreciated, as it has seemingly gone unnoticed for 3 weeks now.
The GPUs are connected with an Infinity Fabric Link Bridge.
I verified that the critical requirements listed on the ROCm supported hardware page are met, e.g. the hardware (see above), but also the following:
$ sudo dmesg | grep -i iommu
[sudo] password for tinux:
[    0.271162] iommu: Default domain type: Translated
[    0.471020] pci 0000:00:00.2: AMD-Vi: IOMMU performance counters supported
[    0.471076] pci 0000:40:00.2: AMD-Vi: IOMMU performance counters supported
[    0.471120] pci 0000:00:01.0: Adding to iommu group 0
[    0.471133] pci 0000:00:01.1: Adding to iommu group 1
[    0.471146] pci 0000:00:01.2: Adding to iommu group 2
[    0.471166] pci 0000:00:02.0: Adding to iommu group 3
[    0.471183] pci 0000:00:03.0: Adding to iommu group 4
[    0.471195] pci 0000:00:03.1: Adding to iommu group 5
[    0.471213] pci 0000:00:04.0: Adding to iommu group 6
[    0.471231] pci 0000:00:07.0: Adding to iommu group 7
[    0.471243] pci 0000:00:07.1: Adding to iommu group 8
[    0.471261] pci 0000:00:08.0: Adding to iommu group 9
[    0.471273] pci 0000:00:08.1: Adding to iommu group 10
[    0.471297] pci 0000:00:14.0: Adding to iommu group 11
[    0.471308] pci 0000:00:14.3: Adding to iommu group 11
[    0.471368] pci 0000:00:18.0: Adding to iommu group 12
[    0.471379] pci 0000:00:18.1: Adding to iommu group 12
[    0.471390] pci 0000:00:18.2: Adding to iommu group 12
[    0.471401] pci 0000:00:18.3: Adding to iommu group 12
[    0.471414] pci 0000:00:18.4: Adding to iommu group 12
[    0.471425] pci 0000:00:18.5: Adding to iommu group 12
[    0.471436] pci 0000:00:18.6: Adding to iommu group 12
[    0.471447] pci 0000:00:18.7: Adding to iommu group 12
[    0.471506] pci 0000:00:19.0: Adding to iommu group 13
[    0.471517] pci 0000:00:19.1: Adding to iommu group 13
[    0.471529] pci 0000:00:19.2: Adding to iommu group 13
[    0.471542] pci 0000:00:19.3: Adding to iommu group 13
[    0.471553] pci 0000:00:19.4: Adding to iommu group 13
[    0.471565] pci 0000:00:19.5: Adding to iommu group 13
[    0.471577] pci 0000:00:19.6: Adding to iommu group 13
[    0.471588] pci 0000:00:19.7: Adding to iommu group 13
[    0.471622] pci 0000:01:00.0: Adding to iommu group 14
[    0.471635] pci 0000:01:00.1: Adding to iommu group 14
[    0.471649] pci 0000:01:00.2: Adding to iommu group 14
[    0.471654] pci 0000:02:00.0: Adding to iommu group 14
[    0.471658] pci 0000:02:01.0: Adding to iommu group 14
[    0.471662] pci 0000:02:02.0: Adding to iommu group 14
[    0.471666] pci 0000:02:03.0: Adding to iommu group 14
[    0.471670] pci 0000:02:04.0: Adding to iommu group 14
[    0.471674] pci 0000:02:09.0: Adding to iommu group 14
[    0.471678] pci 0000:05:00.0: Adding to iommu group 14
[    0.471683] pci 0000:08:00.0: Adding to iommu group 14
[    0.471695] pci 0000:09:00.0: Adding to iommu group 15
[    0.471707] pci 0000:0a:00.0: Adding to iommu group 16
[    0.471719] pci 0000:0b:00.0: Adding to iommu group 17
[    0.471744] pci 0000:0c:00.0: Adding to iommu group 18
[    0.471759] pci 0000:0c:00.1: Adding to iommu group 19
[    0.471772] pci 0000:0d:00.0: Adding to iommu group 20
[    0.471784] pci 0000:0d:00.2: Adding to iommu group 21
[    0.471798] pci 0000:0d:00.3: Adding to iommu group 22
[    0.471810] pci 0000:0e:00.0: Adding to iommu group 23
[    0.471825] pci 0000:0e:00.2: Adding to iommu group 24
[    0.471838] pci 0000:0e:00.3: Adding to iommu group 25
[    0.471856] pci 0000:40:01.0: Adding to iommu group 26
[    0.471872] pci 0000:40:02.0: Adding to iommu group 27
[    0.471890] pci 0000:40:03.0: Adding to iommu group 28
[    0.471902] pci 0000:40:03.1: Adding to iommu group 29
[    0.471920] pci 0000:40:04.0: Adding to iommu group 30
[    0.471937] pci 0000:40:07.0: Adding to iommu group 31
[    0.471949] pci 0000:40:07.1: Adding to iommu group 32
[    0.471968] pci 0000:40:08.0: Adding to iommu group 33
[    0.471981] pci 0000:40:08.1: Adding to iommu group 34
[    0.471994] pci 0000:41:00.0: Adding to iommu group 35
[    0.472006] pci 0000:42:00.0: Adding to iommu group 36
[    0.472031] pci 0000:43:00.0: Adding to iommu group 37
[    0.472048] pci 0000:43:00.1: Adding to iommu group 38
[    0.472061] pci 0000:44:00.0: Adding to iommu group 39
[    0.472074] pci 0000:44:00.2: Adding to iommu group 40
[    0.472086] pci 0000:44:00.3: Adding to iommu group 41
[    0.472100] pci 0000:45:00.0: Adding to iommu group 42
[    0.472113] pci 0000:45:00.2: Adding to iommu group 43
[    0.502585] pci 0000:00:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.502595] pci 0000:40:00.2: AMD-Vi: Found IOMMU cap 0x40
[    0.503499] perf/amd_iommu: Detected AMD IOMMU #0 (2 banks, 4 counters/bank).
[    0.503517] perf/amd_iommu: Detected AMD IOMMU #1 (2 banks, 4 counters/bank).
[    1.017979] AMD-Vi: AMD IOMMUv2 driver by Joerg Roedel <jroedel@suse.de>
$ sudo dmesg | grep -i crat
[    0.000000] ACPI: CRAT 0x0000000077CDE878 001DF8 (v01 AMD AMD CRAT 00000001 AMD 00000001)
[    0.000000] ACPI: Reserving CRAT table memory at [mem 0x77cde878-0x77ce066f]
[    1.168518] amdgpu: Ignoring ACPI CRAT on non-APU system
[    1.168521] amdgpu: Virtual CRAT table created for CPU
[    2.265620] amdgpu: Virtual CRAT table created for GPU
[    3.261272] amdgpu: Virtual CRAT table created for GPU

$ sudo dmesg | grep -i kfd
[    2.177738] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    2.265959] kfd kfd: amdgpu: added device 1002:66a1
[    3.169496] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    3.261636] kfd kfd: amdgpu: added device 1002:66a1
and
$ sudo lspci -vvv -s 43:00.0 | grep Atomic
AtomicOpsCap: 32bit+ 64bit+ 128bitCAS-
AtomicOpsCtl: ReqEn+
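The check above only covers the GPU at 43:00.0. A minimal sketch for running the same check on both cards follows, assuming the other GPU sits at 0c:00.0 (the address the RCCL output further down reports). The function names has_pcie_atomics and check_gpu_atomics are made up for illustration and are not part of any ROCm tool.

```shell
# Succeeds only if the lspci -vvv text on stdin advertises both 32-bit and
# 64-bit atomics on its first AtomicOpsCap line.
has_pcie_atomics() {
    cap=$(grep -m1 'AtomicOpsCap:') || return 1
    case $cap in
        *32bit+*64bit+*) return 0 ;;
        *) return 1 ;;
    esac
}

# Runs the check for every PCI address passed as an argument.
check_gpu_atomics() {
    for addr in "$@"; do
        if sudo lspci -vvv -s "$addr" | has_pcie_atomics; then
            echo "$addr: 32/64-bit atomics advertised"
        else
            echo "$addr: atomics missing or not reported"
        fi
    done
}
```

It would be called as `check_gpu_atomics 0c:00.0 43:00.0`. This only looks at what the endpoints advertise; it says nothing about the bridges between the GPUs and the CPU.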
It seems the GPUs are not connected to each other, despite the fact that they are physically connected with an Infinity Fabric Link Bridge.
$ sudo rocm-smi --shownodesbw
======================= ROCm System Management Interface =======================
================================== Bandwidth ===================================
       GPU0         GPU1
GPU0   N/A          0-0
GPU1   0-0          N/A
Format: min-max; Units: mps
"0-0" min-max bandwidth indicates devices are not connected dirrectly
============================= End of ROCm SMI Log ==============================
I also ran a few other tests, but I cannot really make sense of the results, given the output of the command above.
$ sudo rocm-smi --showtopoaccess
======================= ROCm System Management Interface =======================
===================== Link accessibility between two GPUs ======================
       GPU0         GPU1
GPU0   True         True
GPU1   True         True
============================= End of ROCm SMI Log ==============================
and
$ sudo rocm-smi --showtopo
======================= ROCm System Management Interface =======================
=========================== Weight between two GPUs ============================
       GPU0         GPU1
GPU0   0            15
GPU1   15           0
============================ Hops between two GPUs =============================
       GPU0         GPU1
GPU0   0            1
GPU1   1            0
========================== Link Type between two GPUs ==========================
       GPU0         GPU1
GPU0   0            XGMI
GPU1   XGMI         0
================================== Numa Nodes ==================================
GPU[0] : (Topology) Numa Node: 0
GPU[0] : (Topology) Numa Affinity: 4294967295
GPU[1] : (Topology) Numa Node: 0
GPU[1] : (Topology) Numa Affinity: 4294967295
============================= End of ROCm SMI Log ==============================
I also ran a benchmark from the RCCL repository, which is much slower on 2 GPUs than on a single one.
$ sudo ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2
# nThreads: 1 nGpus: 2 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank 0 Pid 2916 on ultrafast device 0 [0000:0c:00.0] AMD Radeon (TM) Pro VII
#   Rank 1 Pid 2916 on ultrafast device 1 [0000:43:00.0] AMD Radeon (TM) Pro VII
#
#                                                out-of-place                         in-place
#      size      count    type  redop     time    algbw    busbw   error     time     algbw    busbw   error
#       (B) (elements)                    (us)   (GB/s)   (GB/s)             (us)    (GB/s)   (GB/s)
          8          2   float    sum    24.85     0.00     0.00   0e+00    21.89      0.00     0.00   0e+00
         16          4   float    sum    20.07     0.00     0.00   0e+00    19.69      0.00     0.00   0e+00
         32          8   float    sum    19.91     0.00     0.00   0e+00    19.60      0.00     0.00   0e+00
         64         16   float    sum    19.61     0.00     0.00   0e+00    19.63      0.00     0.00   0e+00
        128         32   float    sum    19.78     0.01     0.01   0e+00    21.51      0.01     0.01   0e+00
        256         64   float    sum    19.76     0.01     0.01   0e+00    19.83      0.01     0.01   0e+00
        512        128   float    sum    19.98     0.03     0.03   0e+00    19.97      0.03     0.03   0e+00
       1024        256   float    sum    35.68     0.03     0.03   0e+00    35.36      0.03     0.03   0e+00
       2048        512   float    sum    20.42     0.10     0.10   0e+00    20.13      0.10     0.10   0e+00
       4096       1024   float    sum    37.20     0.11     0.11   0e+00    37.01      0.11     0.11   0e+00
       8192       2048   float    sum    36.14     0.23     0.23   0e+00    33.72      0.24     0.24   0e+00
      16384       4096   float    sum    33.62     0.49     0.49   0e+00    32.19      0.51     0.51   0e+00
      32768       8192   float    sum    32.93     1.00     1.00   0e+00    32.84      1.00     1.00   0e+00
      65536      16384   float    sum    34.00     1.93     1.93   0e+00    33.47      1.96     1.96   0e+00
     131072      32768   float    sum    35.17     3.73     3.73   0e+00    34.86      3.76     3.76   0e+00
     262144      65536   float    sum    38.97     6.73     6.73   0e+00    38.77      6.76     6.76   0e+00
     524288     131072   float    sum    49.84    10.52    10.52   0e+00    49.69     10.55    10.55   0e+00
    1048576     262144   float    sum    66.13    15.86    15.86   0e+00    65.54     16.00    16.00   0e+00
    2097152     524288   float    sum    97.07    21.61    21.61   0e+00    97.34     21.55    21.55   0e+00
    4194304    1048576   float    sum    160.2    26.19    26.19   0e+00    160.3     26.16    26.16   0e+00
    8388608    2097152   float    sum    284.9    29.45    29.45   0e+00    285.0     29.43    29.43   0e+00
   16777216    4194304   float    sum    532.9    31.48    31.48   0e+00    536.1     31.30    31.30   0e+00
   33554432    8388608   float    sum   1043.1    32.17    32.17   0e+00   1056.0     31.77    31.77   0e+00
   67108864   16777216   float    sum   2072.9    32.37    32.37   0e+00   2074.7     32.35    32.35   0e+00
  134217728   33554432   float    sum   4095.4    32.77    32.77   0e+00   4096.3     32.77    32.77   0e+00
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth : 9.86367
#
$ sudo ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
# nThreads: 1 nGpus: 1 nRanks: 1 minBytes: 8 maxBytes: 134217728 step: 2(factor) warmupIters: 5 iters: 20 validation: 1
#
# Using devices
#   Rank 0 Pid 3122 on ultrafast device 0 [0000:0c:00.0] AMD Radeon (TM) Pro VII
#
#                                                out-of-place                         in-place
#      size      count    type  redop     time    algbw    busbw   error     time     algbw    busbw   error
#       (B) (elements)                    (us)   (GB/s)   (GB/s)             (us)    (GB/s)   (GB/s)
          8          2   float    sum     8.38     0.00     0.00   0e+00     4.89      0.00     0.00   0e+00
         16          4   float    sum     7.77     0.00     0.00   0e+00     5.24      0.00     0.00   0e+00
         32          8   float    sum     7.27     0.00     0.00   0e+00     8.03      0.00     0.00   0e+00
         64         16   float    sum     7.46     0.01     0.00   0e+00     3.99      0.02     0.00   0e+00
        128         32   float    sum     8.42     0.02     0.00   0e+00     3.82      0.03     0.00   0e+00
        256         64   float    sum     7.72     0.03     0.00   0e+00     4.25      0.06     0.00   0e+00
        512        128   float    sum     8.05     0.06     0.00   0e+00     4.40      0.12     0.00   0e+00
       1024        256   float    sum     7.66     0.13     0.00   0e+00     4.01      0.26     0.00   0e+00
       2048        512   float    sum     9.16     0.22     0.00   0e+00     4.56      0.45     0.00   0e+00
       4096       1024   float    sum     7.51     0.55     0.00   0e+00     4.10      1.00     0.00   0e+00
       8192       2048   float    sum     7.88     1.04     0.00   0e+00     3.92      2.09     0.00   0e+00
      16384       4096   float    sum     7.84     2.09     0.00   0e+00     3.71      4.41     0.00   0e+00
      32768       8192   float    sum     7.42     4.42     0.00   0e+00     3.80      8.63     0.00   0e+00
      65536      16384   float    sum     7.45     8.80     0.00   0e+00     4.27     15.35     0.00   0e+00
     131072      32768   float    sum     8.17    16.05     0.00   0e+00     4.47     29.31     0.00   0e+00
     262144      65536   float    sum     9.10    28.81     0.00   0e+00     3.71     70.69     0.00   0e+00
     524288     131072   float    sum    39.66    13.22     0.00   0e+00     3.69    142.27     0.00   0e+00
    1048576     262144   float    sum    12.87    81.45     0.00   0e+00     3.96    264.85     0.00   0e+00
    2097152     524288   float    sum    14.53   144.29     0.00   0e+00     2.92    718.67     0.00   0e+00
    4194304    1048576   float    sum    23.76   176.54     0.00   0e+00     3.21   1308.00     0.00   0e+00
    8388608    2097152   float    sum    36.37   230.62     0.00   0e+00     3.60   2330.23     0.00   0e+00
   16777216    4194304   float    sum    67.07   250.16     0.00   0e+00     3.30   5079.62     0.00   0e+00
   33554432    8388608   float    sum    123.2   272.40     0.00   0e+00     3.19  10509.90     0.00   0e+00
   67108864   16777216   float    sum    240.4   279.14     0.00   0e+00     3.18  21079.55     0.00   0e+00
  134217728   33554432   float    sum    470.8   285.08     0.00   0e+00     4.88  27490.11     0.00   0e+00
# Errors with asterisks indicate errors that have exceeded the maximum threshold.
# Out of bounds values : 0 OK
# Avg bus bandwidth : 0
#
Any help is highly appreciated.
(A search for "ROCm" "xGMI" error turned up a couple links that sound relevant to xGMI problems.)
https://docs.amd.com/bundle/ROCm-CLI-Guide-v5.0/page/Usage.html
--showxgmierr | Show XGMI error information since last read |
--resetxgmierr | Reset XGMI error count |
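One way these two flags might be combined is to reset the counter, run the benchmark, and then read the counter again, so that anything reported belongs to that run. This is only a sketch: xgmi_errors_during is a made-up helper name, the ROCM_SMI override variable is my own addition, and I am assuming the counter behaves as the table above describes.

```shell
# Hypothetical helper: reset the XGMI error count, run a workload, then show
# the XGMI error information collected during that workload.
# ROCM_SMI may be overridden; the default mirrors the commands in this thread.
xgmi_errors_during() {
    smi=${ROCM_SMI:-sudo rocm-smi}
    # $smi is left unquoted on purpose so "sudo rocm-smi" splits into two words.
    $smi --resetxgmierr >/dev/null || return 1
    # Run the workload passed as arguments; its exit status is ignored here.
    "$@" || true
    $smi --showxgmierr
}
```

For example: `xgmi_errors_during ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 2`. If the link is never used at all, I would expect no errors to show up, so an empty result does not prove the bridge works.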
https://www.kernel.org/doc/html/v5.9/gpu/amdgpu.html#amdgpu-xgmi-support
XGMI is a high speed interconnect that joins multiple GPU cards into a homogeneous memory space that is organized by a collective hive ID and individual node IDs, both of which are 64-bit numbers.
The file xgmi_device_id contains the unique per GPU device ID and is stored in the /sys/class/drm/card${cardno}/device/ directory.
Inside the device directory a sub-directory ‘xgmi_hive_info’ is created which contains the hive ID and the list of nodes.
The hive ID is stored in:
/sys/class/drm/card${cardno}/device/xgmi_hive_info/xgmi_hive_id
The node information is stored in numbered directories:
/sys/class/drm/card${cardno}/device/xgmi_hive_info/node${nodeno}/xgmi_device_id
Each device has its own xgmi_hive_info directory with a mirror set of node sub-directories.
The XGMI memory space is built by contiguously adding the power of two padded VRAM space from each node to each other.
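To compare what the driver reports for each card, the sysfs files described above could be read with something like the following. It is a rough sketch: list_xgmi_hives is a made-up name, the optional root argument exists only so the function can be pointed at a different directory, and the paths are taken from the kernel documentation quoted above rather than verified on this system.

```shell
# Hypothetical helper: print device ID, hive ID and node count for every card
# that exposes xgmi_device_id. The root directory defaults to /sys/class/drm.
list_xgmi_hives() {
    root=${1:-/sys/class/drm}
    for dev in "$root"/card*/device; do
        # Skip connector entries and cards without XGMI information.
        [ -r "$dev/xgmi_device_id" ] || continue
        card=${dev%/device}
        card=${card##*/}
        hive=$(cat "$dev/xgmi_hive_info/xgmi_hive_id" 2>/dev/null || echo none)
        # Count the node sub-directories listed in this card's hive.
        nodes=0
        for n in "$dev"/xgmi_hive_info/node*; do
            if [ -e "$n/xgmi_device_id" ]; then
                nodes=$((nodes + 1))
            fi
        done
        echo "$card device_id=$(cat "$dev/xgmi_device_id") hive_id=$hive nodes=$nodes"
    done
}
```

Running `list_xgmi_hives` prints one line per card. My expectation, which I have not confirmed, is that two bridged GPUs the driver has placed in one hive would show the same hive_id and nodes=2, while hive_id=none or differing hive IDs would suggest the driver does not see the bridge.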