This issue applies to AMD Instinct™ accelerators using ROCm in a Linux environment. Note that it is observed in all Linux environments.
The IOMMU virtualizes the address space for the guest environment. Therefore, the GPU and RDMA devices must have the same guest physical address space for the peer-to-peer functionality to work correctly within the guest environment.
However, without SR-IOV virtualization, the IOMMU gives each device its own input-output virtual address space for security on a bare-metal system. In this scenario, the peer-to-peer functionality fails because each device sees a different address space.
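To see how the kernel has partitioned devices for isolation, the IOMMU groups exposed under sysfs can be listed. The following is a minimal sketch, not part of the original guidance: the function name list_iommu_groups and its optional base-directory argument are illustrative assumptions, and it relies only on the standard /sys/kernel/iommu_groups layout. Devices in different groups can be placed in separate translation domains.

```shell
# list_iommu_groups [BASE_DIR]
# Prints one line per device: "group <N>: <PCI address>".
# BASE_DIR defaults to /sys/kernel/iommu_groups; the argument is a
# hypothetical convenience so the function can be tried on a sample tree.
list_iommu_groups() {
    base="${1:-/sys/kernel/iommu_groups}"
    for dev in "$base"/*/devices/*; do
        # If the glob did not match, no groups exist (IOMMU disabled or absent).
        [ -e "$dev" ] || continue
        group="${dev%/devices/*}"          # strip "/devices/<device>"
        echo "group ${group##*/}: ${dev##*/}"
    done
}
```

Calling list_iommu_groups with no argument reads the live system. Empty output suggests that the IOMMU is disabled or not exposed by the platform.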
If AMD ROCm is installed, the system may report failures or errors when running workloads such as the bandwidth test, clinfo, and HelloWorld.cl. These failures may also result in a system crash.
In a correct ROCm installation, the system must not encounter the errors mentioned above.
The Input-Output Memory Management Unit (IOMMU) option must be enabled with the iommu=pt (passthrough) setting.
When the IOMMU is enabled in passthrough mode, the input-output virtual addresses match the system’s physical addresses. All devices then have the same view of memory, which avoids address remapping issues and page faults.
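Whether the running kernel was booted with the passthrough setting can be checked from its command line. The sketch below is illustrative only: the function name has_iommu_pt and its optional file argument are assumptions, and it does nothing more than look for the exact token iommu=pt in /proc/cmdline.

```shell
# has_iommu_pt [CMDLINE_FILE]
# Exits 0 if the kernel command line contains the token iommu=pt.
# CMDLINE_FILE defaults to /proc/cmdline; the argument is a hypothetical
# convenience so the check can be tried against a sample file.
has_iommu_pt() {
    cmdline_file="${1:-/proc/cmdline}"
    # Put each option on its own line, then require an exact match so that
    # options such as amd_iommu=on are not mistaken for iommu=pt.
    tr ' ' '\n' < "$cmdline_file" | grep -qx 'iommu=pt'
}
```

For example, `has_iommu_pt && echo "passthrough requested"` prints the message only when the option is present on the current boot.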
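Follow the steps below to enable IOMMU passthrough in Ubuntu.
1. Add iommu=pt to the GRUB_CMDLINE_LINUX_DEFAULT entry in /etc/default/grub.
2. Regenerate the GRUB configuration:
sudo update-grub
3. Reboot the system so that the new kernel command line takes effect.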
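Adding the kernel option before running update-grub can be scripted. The following is a minimal sketch under stated assumptions: the helper name add_iommu_pt is hypothetical, the target file is assumed to contain a double-quoted GRUB_CMDLINE_LINUX_DEFAULT line (as in a default Ubuntu /etc/default/grub), and GNU sed is assumed for the -i option.

```shell
# add_iommu_pt GRUB_FILE
# Appends iommu=pt to the GRUB_CMDLINE_LINUX_DEFAULT line of GRUB_FILE,
# unless the option is already there. A backup is written to GRUB_FILE.bak.
add_iommu_pt() {
    grub_file="$1"
    # Already present as a separate option: leave the file untouched.
    if grep -Eq '^GRUB_CMDLINE_LINUX_DEFAULT=.*[" ]iommu=pt[" ]' "$grub_file"; then
        return 0
    fi
    cp "$grub_file" "$grub_file.bak"
    # Insert the option just before the closing double quote.
    sed -i 's/^\(GRUB_CMDLINE_LINUX_DEFAULT="[^"]*\)"/\1 iommu=pt"/' "$grub_file"
}
```

On a real system this would be run as root against /etc/default/grub, followed by sudo update-grub and a reboot. If the line is unquoted or single-quoted, the sed expression leaves it unchanged, so the result should be reviewed before updating GRUB.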
For other Linux operating systems, or for any questions about enabling the IOMMU passthrough setting, contact the designated AMD Field Application Engineer.
Note: The IOMMU must be enabled to use x2APIC. Refer to the AMD Application Note section in the Workload Tuning Guide for AMD EPYC™ 7002 Series Processor-Based Servers.
A fatal error is observed in the RHEL environment when running ROCm with more than one MI100 GPU installed.
To resolve the issue, follow the steps below to configure the IOMMU and the boot configuration file.