
AI Discussions

ENDzZ
Journeyman III

How to Utilize Multi-GPU Infinity Fabric Link in ML

We know that the Infinity Fabric (IF) Link (xGMI) bridge can greatly improve inter-GPU communication performance, much like NVLink. I have two Radeon Pro VII cards connected with IF Link, and I'm sure the same question applies to anyone with four MI100s connected the same way.

The main question is: how can we take advantage of the Infinity Fabric Link in machine learning? For example, in PyTorch, can we use the high inter-GPU bandwidth and the shared memory space offered by IF Link to run bigger models more efficiently? (So far I have only tried running Stable Diffusion. Once the memory of one card was full, HIP gave me an OOM error while the second card's memory usage stayed at 0. I don't know whether this is a bug, or whether AMD is aware of it.)

I couldn't find an answer by searching the internet; all the material I found is about NVLink. For the Infinity Fabric Link, I don't even know whether PyTorch supports this bridge. Could any developers, users, or AMD staff share some information on this? Thank you so much!

5 Replies
blakeblossom
Journeyman III

Currently, PyTorch doesn't offer native support for Infinity Fabric Link specifically. However, you can still utilize IF Link's high bandwidth for distributed training with some additional configuration.
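
To make "additional configuration" a little more concrete, here is a minimal sketch of multi-GPU data-parallel training with DistributedDataParallel, offered as an illustration rather than a confirmed recipe for this hardware. It assumes a ROCm build of PyTorch, where AMD GPUs appear under the `cuda` device name and the `nccl` backend is provided by RCCL. Whether the gradient traffic then actually travels over the IF Link depends on RCCL detecting the bridge, which this sketch does not verify. The helper `worker_device`, the toy `Linear` model, and the script name below are hypothetical placeholders.

```python
import os


def worker_device(local_rank: int) -> str:
    # ROCm builds of PyTorch expose AMD GPUs under the "cuda" device name.
    return f"cuda:{local_rank}"


def main() -> None:
    # Imported here so the module can be loaded without PyTorch installed.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun for each process
    device = worker_device(local_rank)
    torch.cuda.set_device(local_rank)

    # "nccl" is the backend name; on ROCm builds it is backed by RCCL.
    dist.init_process_group(backend="nccl")

    # Toy stand-in for a real model (assumption, not from the thread).
    model = torch.nn.Linear(1024, 1024).to(device)
    ddp_model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    x = torch.randn(32, 1024, device=device)
    loss = ddp_model(x).sum()
    loss.backward()   # gradients are all-reduced across the GPUs here
    optimizer.step()

    dist.destroy_process_group()


# torchrun sets LOCAL_RANK, so training starts only when launched through it.
if "LOCAL_RANK" in os.environ:
    main()
```

Saved as, say, `ddp_sketch.py`, it could be launched on two cards with `NCCL_DEBUG=INFO torchrun --standalone --nproc_per_node=2 ddp_sketch.py`; the debug log shows which transport the collective library selected, and `rocm-smi --showtopo` reports the link type between the GPUs. Note that data parallelism keeps a full copy of the model on each card, so it speeds up training but does not by itself let a larger model fit.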

 


If I understand you correctly, PyTorch doesn't support RCCL, so for the time being it can't use IF Link to sync in DataParallel or DistributedDataParallel, right?

Sophiadavis
Adept I

Did you find a solution for this?


Nope.

bensou
Journeyman III

Hi guys, it's hard to really take advantage of Infinity Fabric Link in PyTorch right now because its documentation and support are weak compared to NVLink. You'll probably need to manage memory manually or wait for further development and support from both PyTorch and AMD. The best option is to keep watching for updates from ROCm and the PyTorch community; taking part in discussions and forums on this issue can help improve understanding and surface potential solutions over time.
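
As one way to picture "managing memory manually": PyTorch puts tensors only on the device you name, which would explain a second card sitting at 0 while the first runs out of memory. A rough sketch of the planning step is below. It greedily assigns model components to GPUs by estimated size. The function `plan_placement`, the component names, and every size and capacity figure are made-up assumptions for illustration, not measurements of Stable Diffusion.

```python
from typing import Dict, List


def plan_placement(sizes_gib: Dict[str, float],
                   capacity_gib: List[float]) -> Dict[str, str]:
    """Assign each component, largest first, to the first GPU with enough room."""
    free = list(capacity_gib)
    plan: Dict[str, str] = {}
    ordered = sorted(sizes_gib.items(), key=lambda kv: kv[1], reverse=True)
    for name, size in ordered:
        for idx, room in enumerate(free):
            if size <= room:
                free[idx] -= size
                plan[name] = f"cuda:{idx}"
                break
        else:
            # No GPU had enough free memory for this component.
            raise MemoryError(f"{name} ({size} GiB) does not fit on any GPU")
    return plan


# Hypothetical component sizes and per-GPU budgets, in GiB.
sizes = {"unet": 12.0, "vae": 3.0, "text_encoder": 1.5}
plan = plan_placement(sizes, capacity_gib=[14.0, 14.0])
print(plan)
```

With a plan like this, each module would be moved with `module.to(plan[name])`, and intermediate tensors moved with `.to(...)` wherever the pipeline crosses from one card to the other. Those cross-device copies are where a fast inter-GPU link would be expected to help, assuming the runtime routes them over it.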
