
AI Discussions

ENDzZ
Journeyman III

How to Utilize Multi-GPU Infinity Fabric Link in ML

We know that an Infinity Fabric (IF) Link (XGMI) bridge can greatly improve inter-GPU communication performance, much like NVLink. I'm a user with two Radeon Pro VII cards connected by an IF Link bridge, and I'm sure the question is the same for anyone with four MI100s connected this way. So, the main question is: how can we take advantage of the Infinity Fabric Link in machine learning? For example, in PyTorch, can we use the high inter-GPU bandwidth and the shared memory space offered by IF Link to run bigger models more efficiently?

(Specifically, I tried running Stable Diffusion. Once the memory of a single card was full, HIP gave me an OOM error while the second card's memory usage stayed at 0. I don't know whether this is a bug, or whether AMD is aware of it.)

Searching the internet got me nowhere; all the material I could find is about NVLink. For the Infinity Fabric Link, I don't even know whether PyTorch supports the bridge at all. Can any developers, users, or AMD officials share some information on this? Thank you so much!

blakeblossom
Journeyman III

PyTorch doesn't expose Infinity Fabric Link as a user-facing feature, but you can still benefit from it. On ROCm, PyTorch's distributed backend is built on RCCL (AMD's counterpart to NCCL), and RCCL uses XGMI/IF Link automatically for GPU-to-GPU collectives when the bridge is present, much as NCCL uses NVLink. So for data-parallel training with DistributedDataParallel, the IF Link bandwidth is used without any special configuration.

What IF Link does not give you is a single pooled memory space that PyTorch can see: a model that doesn't fit on one GPU won't automatically spill onto the second card, which is why Stable Diffusion hit an OOM while your second GPU sat idle. That is expected behavior rather than a bug. To use both cards' memory for one model, you need model (or pipeline) parallelism, i.e., explicitly placing different parts of the model on different devices.
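Here is a minimal data-parallel sketch, assuming two GPUs and a ROCm build of PyTorch; the toy model and sizes are placeholders, not anything specific to your setup. Launch it with: torchrun --nproc_per_node=2 train.py. On ROCm the "nccl" backend name maps to RCCL, so the gradient all-reduce in backward() should travel over the IF Link bridge.

    import os
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        # "nccl" is backed by RCCL on ROCm builds of PyTorch.
        dist.init_process_group(backend="nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)  # ROCm reuses the "cuda" device namespace

        # Placeholder model; substitute your real network here.
        model = torch.nn.Linear(1024, 1024).cuda(local_rank)
        ddp_model = DDP(model, device_ids=[local_rank])
        optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

        for _ in range(10):
            x = torch.randn(32, 1024, device=local_rank)
            loss = ddp_model(x).square().mean()
            optimizer.zero_grad()
            loss.backward()  # gradient all-reduce runs through RCCL here
            optimizer.step()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()

Note that DDP replicates the whole model on each GPU, so it speeds up training but does not help with a model that is too big for one card.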

 

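For the OOM itself, the fix is to split the model across the two cards. Here is a hedged sketch of manual model parallelism; the layer sizes and split point are made up for illustration:

    import torch
    import torch.nn as nn

    class TwoGPUModel(nn.Module):
        def __init__(self):
            super().__init__()
            # First half of the network lives on GPU 0, second half on GPU 1.
            self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
            self.part2 = nn.Linear(4096, 1024).to("cuda:1")

        def forward(self, x):
            x = self.part1(x.to("cuda:0"))
            # Cross-device copy; with peer-to-peer access enabled this can go
            # directly over XGMI instead of bouncing through host memory.
            return self.part2(x.to("cuda:1"))

    model = TwoGPUModel()
    out = model(torch.randn(8, 1024))
    print(out.shape, out.device)  # expected: torch.Size([8, 1024]) cuda:1

You can check whether peer-to-peer access between the two cards is available with torch.cuda.can_device_access_peer(0, 1). For Stable Diffusion specifically, recent versions of Hugging Face diffusers/accelerate also offer device_map options that place different pipeline components on different GPUs; check their current documentation, since support varies by version.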