Product Impacted - AMD Instinct™
AMD Instinct™ accelerators using ROCm in a Linux environment. Note: this issue is observed in all Linux environments.
Background
The IOMMU virtualizes the address space for the guest environment. Therefore, the GPU and RDMA devices must have the same guest physical address space for the peer-to-peer functionality to work correctly within the guest environment.
However, on a bare-metal system without SR-IOV virtualization, the IOMMU gives each device its own input-output virtual address space for security. In this scenario, the peer-to-peer functionality fails because each device sees a different address space.
Known Impact
If AMD ROCm is installed, the system may report failures or errors such as the following when running workloads such as the bandwidth test, clinfo, and HelloWorld.cl. Note: the issue may also result in a system crash.
- IO PAGE FAULT
- IRQ remapping doesn’t support X2APIC mode
- NMI error
In a correct ROCm installation, the system should not encounter any of the errors listed above.
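As a rough aid for spotting the errors listed above, the kernel log can be filtered for them. The sketch below is only illustrative: scan_iommu_errors is a hypothetical helper name, and the patterns are assumptions based on the wording in this article, since the exact message text varies between kernel versions.

```shell
#!/usr/bin/env bash
# Sketch: filter kernel log text for the error signatures named in this article.
# The patterns are assumptions; exact kernel wording differs between versions.

scan_iommu_errors() {
    # Reads log text on stdin and prints every line that matches a signature.
    # Exit status is 0 if at least one line matched, 1 if none matched.
    grep -i -E \
        -e 'IO[_ ]PAGE[_ ]FAULT' \
        -e 'IRQ remapping doesn.{1,3}t support X2APIC mode' \
        -e '\bNMI\b'
}
```

A typical invocation would pipe the kernel log into the helper, for example: sudo dmesg | scan_iommu_errors. No output and a non-zero exit status suggest that none of the signatures were logged.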
Required Action
The Input-Output Memory Management Unit (IOMMU) must be enabled with the iommu=pt (passthrough) kernel parameter.
When IOMMU passthrough is enabled, the input-output virtual addresses match the system’s physical addresses. All devices then have the same view of memory, which avoids address remapping issues and page faults.
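To confirm that the running kernel was actually booted with the passthrough setting, the kernel command line can be inspected. The following is a minimal sketch; has_iommu_pt is a hypothetical helper name, and it only checks for the literal iommu=pt parameter.

```shell
#!/usr/bin/env bash
# Sketch: check whether a kernel command line contains the iommu=pt parameter.

has_iommu_pt() {
    # $1: kernel command line text, for example the contents of /proc/cmdline.
    # Returns 0 only if iommu=pt appears as its own space-delimited parameter,
    # so values such as amd_iommu=on or iommu=ptx do not count as a match.
    case " $1 " in
        *" iommu=pt "*) return 0 ;;
        *) return 1 ;;
    esac
}
```

On a live system the helper could be called as: has_iommu_pt "$(cat /proc/cmdline)" && echo "passthrough requested".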
Addressing Environment-Specific Issues
Enabling IOMMU Passthrough in Ubuntu
Follow the steps below to enable the Input-Output Memory Management Unit (IOMMU) passthrough in Ubuntu.
- Add GRUB_CMDLINE_LINUX="iommu=pt" to /etc/default/grub
- Update the boot config file using the following command:
sudo update-grub
- Reboot the system.
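The edit to the GRUB defaults file in the steps above can also be scripted. The sketch below is one possible approach, not an official AMD procedure: add_iommu_pt is a hypothetical helper that appends iommu=pt to GRUB_CMDLINE_LINUX only when it is missing, and it assumes GNU sed and a double-quoted value on a single line.

```shell
#!/usr/bin/env bash
# Sketch: idempotently add iommu=pt to GRUB_CMDLINE_LINUX in a GRUB defaults file.
# Assumes GNU sed and a double-quoted, single-line GRUB_CMDLINE_LINUX value.

add_iommu_pt() {
    # $1: path to the GRUB defaults file (try it on a copy first).
    local file="$1"

    # Nothing to do if iommu=pt is already a parameter on that line.
    if grep -E -q '^GRUB_CMDLINE_LINUX="([^"]* )?iommu=pt( [^"]*)?"' "$file"; then
        return 0
    fi

    if grep -q '^GRUB_CMDLINE_LINUX="' "$file"; then
        # Append inside the existing quotes, keeping current parameters;
        # the second expression removes the leading space left by an empty value.
        sed -i -E \
            -e 's/^(GRUB_CMDLINE_LINUX="[^"]*)"/\1 iommu=pt"/' \
            -e 's/^(GRUB_CMDLINE_LINUX=") /\1/' "$file"
    else
        # The variable is not set at all, so add it as a new line.
        echo 'GRUB_CMDLINE_LINUX="iommu=pt"' >> "$file"
    fi
}
```

After running the helper against /etc/default/grub with root privileges, the remaining steps are unchanged: run sudo update-grub and reboot.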
For other Linux operating systems, or for any questions about enabling the IOMMU passthrough setting, contact the designated AMD Field Application Engineer.
Note - IOMMU is required for using x2APIC. Refer to the AMD Application Note section in Workload Tuning Guide for AMD EPYC™ 7002 Series Processor-Based Servers.
Addressing Fatal Error Issue in RHEL
A fatal error issue is observed in the RHEL environment when running ROCm with more than one MI100 GPU installed.
To resolve the issue, follow the steps below to set the IOMMU option and update the boot configuration file.
- Append iommu=pt to the end of the GRUB_CMDLINE_LINUX line in the /etc/default/grub configuration file.
- Regenerate the grub.cfg file using the following command:
grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg
- Reboot the host for these changes to take effect.
- After the host comes up, the RVS (ROCm Validation Suite) stress test starts as expected and the fatal errors in LC logs are no longer observed.
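The grub.cfg path used in the steps above is the UEFI location. As a hedged sketch, the helper below picks an output path based on whether the host booted through UEFI. It assumes that legacy BIOS hosts keep the file at /boot/grub2/grub.cfg, which may not hold for every RHEL release, so confirm the path against the documentation for the installed version. rhel_grub_cfg_path and its root-prefix parameter are hypothetical names used for illustration.

```shell
#!/usr/bin/env bash
# Sketch: choose the grub.cfg output path for grub2-mkconfig on a RHEL host.
# Assumption: BIOS-booted hosts use /boot/grub2/grub.cfg.

rhel_grub_cfg_path() {
    # $1: optional root prefix (hypothetical parameter, handy for dry runs).
    local root="${1:-}"

    # /sys/firmware/efi exists only when the kernel was booted through UEFI.
    if [ -d "$root/sys/firmware/efi" ]; then
        echo "/boot/efi/EFI/redhat/grub.cfg"
    else
        echo "/boot/grub2/grub.cfg"
    fi
}
```

The result could then be passed to the regeneration step, for example: grub2-mkconfig -o "$(rhel_grub_cfg_path)".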