I have a cluster of 4x AMD Instinct MI100 GPUs that I recently picked up an Infinity Fabric bridge for. The host is Proxmox 7.3-3, the guest is an Ubuntu 22.04 virtual machine, and I'm running ROCm 5.7.
I can pass the GPUs themselves through fine, and they still work with the bridge installed, but checking the interconnect shows they're still communicating over PCIe.
This GitHub issue shows there should be a log entry when the bridge initializes, but running that search on the VM returns nothing. I don't see the bridge as its own PCIe device to pass through, so I assume it's some low-level thing outside the OS's view. The ROCm documentation says the topology output should report XGMI if the bridge is in use, which mine sadly does not.
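For reference, the checks I'm running are roughly the following (the exact dmesg wording may vary by driver version, so treat the grep as a starting point):

    rocm-smi --showtopotype    # link type between each GPU pair; reports PCIE for me, should be XGMI with the bridge active
    rocm-smi --showtopo        # full topology view (link type, hops, weight)
    sudo dmesg | grep -i xgmi  # search the kernel log for the XGMI/bridge initialization message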
Here is the rocm-smi --showxgmierr output:
Note: The system is running 2x Intel Xeon Gold 6148s.
This is something for AMD moderator @fsadough, who might be able to help you once you post all your system information.
Please identify yourself if you are an enterprise customer. The MI100 is not a retail product. How did you get these GPUs in the first place? We don't provide end-user support on our recent Instinct series.
I purchased all of the items secondhand for AI workloads at a small company, and figured that since they were one or two generations back, support might be available from the community. We aren't big enough for the meaty MI250X rigs, so MI100s were right up our alley.
Where do I need to go to get support on these items if not the Community Forums?
You can certainly post your problem on the Community Forum and hope someone comes up with an answer; this portal, however, is not an official AMD support portal. If you don't see the LEDs light up on the bridge, something is not right: either the bridge is not properly connected, or you might have a defective bridge.
https://github.com/ROCm/ROCm/issues/2722
After getting guidance from an amazing person on the ROCm issue tracker, it turns out the bridge wasn't fully seated on the cards. Make sure to push the bridge all the way down and tighten the screws until they don't move anymore. On boot, the bridge illuminates an LED on top for each detected card.
Once that was done, everything was detected perfectly fine in the VM guest.
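If you want an extra sanity check beyond the logs and the topology output, rocm-bandwidth-test (available in the ROCm repos, assuming you have it installed) is a simple way to confirm the links are actually being used; device-to-device copy bandwidth between the MI100s should come out noticeably higher than what PCIe allows once XGMI is active.

    rocm-bandwidth-test    # with no arguments it runs its default copy tests across the visible devices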