Affected systems: AMD Instinct™ accelerators using ROCm in a Linux environment. This issue is observed in all Linux environments.
The IOMMU virtualizes the address space for the guest environment. Therefore, the GPU and RDMA devices must have the same guest physical address space for the peer-to-peer functionality to work correctly within the guest environment.
However, without SR-IOV virtualization on a bare-metal system, the IOMMU gives each device its own input-output virtual address space for security. In this scenario, the peer-to-peer functionality fails because each device sees a different address space.
If AMD ROCm is installed, the system may report failures or errors when running workloads such as the bandwidth test, clinfo, and HelloWorld.cl. It may also result in a system crash. Typical error messages include:
IO PAGE FAULT
IRQ remapping doesn’t support X2APIC mode
In a correct ROCm installation, the system does not encounter the errors mentioned above.
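One way to check for these errors is to search the kernel log for the two messages. The following is a minimal sketch, not an AMD-provided tool: the function name find_iommu_errors is hypothetical, and the pattern assumes the messages appear in the kernel log roughly as quoted above. It also accepts the underscore spelling IO_PAGE_FAULT.

```shell
#!/bin/sh
# Hypothetical helper: print the lines read from stdin that match either
# error message quoted above. Matching is case-insensitive, and a space or
# an underscore is accepted between the words of "IO PAGE FAULT".
find_iommu_errors() {
    grep -i -E 'IO[_ ]PAGE[_ ]FAULT|IRQ remapping.*X2APIC'
}
```

For example, `sudo dmesg | find_iommu_errors` filters the current kernel log. No output, with a non-zero exit status from grep, means that neither message was found.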
The Input-Output Memory Management Unit (IOMMU) option must be enabled with the iommu=pt (passthrough) setting.
When IOMMU passthrough is enabled, the input-output virtual addresses match the system’s physical addresses. This gives all devices the same view of memory and avoids address remapping issues and page faults.
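Whether the running kernel was booted with the passthrough setting can be checked from its command line. This is a small sketch under stated assumptions: the helper name has_iommu_pt is hypothetical, and the check only looks for the literal parameter iommu=pt in /proc/cmdline. It does not confirm that the IOMMU is enabled in firmware.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the kernel command line passed as $1
# contains iommu=pt as a separate, space-delimited parameter.
has_iommu_pt() {
    case " $1 " in
        *" iommu=pt "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Report on the running kernel; /proc/cmdline holds its boot parameters.
if [ -r /proc/cmdline ]; then
    if has_iommu_pt "$(cat /proc/cmdline)"; then
        echo "iommu=pt is set"
    else
        echo "iommu=pt is not set"
    fi
fi
```

Padding the string with spaces keeps a different parameter such as amd_iommu=pt from being mistaken for iommu=pt.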
Addressing Environment-Specific Issues
Enabling IOMMU Passthrough in Ubuntu
Follow the steps below to enable the Input-Output Memory Management Unit (IOMMU) passthrough in Ubuntu.
Add GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt" to /etc/default/grub
Update the boot config file using the following command: sudo update-grub
Reboot the system.
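The edit to the GRUB configuration file in the steps above can be scripted. The sketch below is illustrative only: add_grub_param is a hypothetical helper, and it assumes GRUB_CMDLINE_LINUX is defined on a single double-quoted line with nothing after the closing quote, and that the existing value contains no |, & or backslash characters. Unlike the manual step, it appends to any existing value rather than replacing it.

```shell
#!/bin/sh
# Hypothetical helper: add kernel parameter $2 to GRUB_CMDLINE_LINUX in
# file $1 unless it is already there. Other settings, including
# GRUB_CMDLINE_LINUX_DEFAULT, are left untouched.
add_grub_param() {
    file=$1
    param=$2
    # No GRUB_CMDLINE_LINUX line yet: append one and stop.
    if ! grep -q '^GRUB_CMDLINE_LINUX="' "$file"; then
        printf 'GRUB_CMDLINE_LINUX="%s"\n' "$param" >> "$file"
        return 0
    fi
    # Extract the current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $param "*) return 0 ;;   # already present, nothing to do
    esac
    new=${current:+$current }$param
    tmp=$(mktemp)
    # Rewrite only the GRUB_CMDLINE_LINUX line, then copy the result back
    # so the original file keeps its ownership and permissions.
    sed "s|^GRUB_CMDLINE_LINUX=\".*\"\$|GRUB_CMDLINE_LINUX=\"$new\"|" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

Run as root, after backing up the file, the helper would be called once per parameter: `add_grub_param /etc/default/grub amd_iommu=on`, then `add_grub_param /etc/default/grub iommu=pt`. On Ubuntu, `sudo update-grub` then regenerates the boot configuration before the reboot.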
For other Linux operating systems, or for any questions about enabling the IOMMU passthrough setting, contact the designated AMD Field Application Engineer.