Hello,
I have access to a pair of MI100 GPUs connected only via PCIe.
I used a ROCm Docker image based on Ubuntu 22.04, in which I installed RCCL.
I then installed MPI and cloned the RCCL test repository to test communication between the GPUs. After installing, when I run the examples shown in the repository's usage section on 2 GPUs, the execution hangs in what appears to be a deadlock, with VRAM usage at maximum and core utilization at 0%. I tried many versions of ROCm and RCCL in case this was a bug in the latest releases, but it happens every time.
I couldn't find any information about this, so I wanted to ask: what could be the problem? For comparison, I followed the same procedure on NVIDIA hardware, and the examples run without any problems on my 2 V100s connected via PCIe.
From what I understand, RCCL should support PCIe connections, so I don't think that is the problem, and apparently this happens with the latest 2 versions of ROCm/RCCL. Do you have any suggestions?