IOMMU Advisory for AMD Instinct™

Product Impacted - AMD Instinct™

AMD Instinct™ accelerators using ROCm in a Linux environment. Note that this issue is observed in all Linux environments.

Background

The IOMMU virtualizes the address space for the guest environment. Therefore, the GPU and RDMA devices must have the same guest physical address space for the peer-to-peer functionality to work correctly within the guest environment. 

However, on a bare-metal system without SR-IOV virtualization, the IOMMU gives each device its own input-output virtual address space for security. In this scenario, the peer-to-peer functionality fails because each device sees a different address space.

Known Impact

If AMD ROCm is installed, the system may report the following failures or errors when running workloads such as the bandwidth test, clinfo, and HelloWorld.cl. The issue may also result in a system crash.

  • IO PAGE FAULT
  • IRQ remapping doesn’t support X2APIC mode
  • NMI error

In a correct ROCm installation, the system must not encounter the errors mentioned above.
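One way to check whether a system shows these symptoms is to search the kernel log for the messages listed above. The following is a minimal sketch, not an official AMD tool: the function name scan_iommu_errors is hypothetical, and the exact wording of the messages in the kernel log is an assumption (for example, underscores may appear instead of spaces), so the patterns are deliberately loose.

```shell
#!/bin/sh
# Hypothetical helper: reads kernel log text on stdin and prints the lines
# that resemble the errors listed above. The patterns are loose on purpose,
# because the exact kernel message wording is an assumption.
scan_iommu_errors() {
    grep -i -E 'IO[_ ]PAGE[_ ]FAULT|IRQ remapping doesn.*t support x2apic|NMI'
}

# Usage (reading the kernel log typically requires root):
#   sudo dmesg | scan_iommu_errors
```

The function exits with a non-zero status when no matching line is found, so it can also be used in a conditional.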

Required Action

The Input-Output Memory Management Unit (IOMMU) option must be enabled with the iommu=pt (passthrough) setting.

When IOMMU passthrough is enabled, the input-output virtual addresses match the system’s physical addresses. All devices then have the same view of memory, which avoids address remapping issues and page faults.

Addressing Environment-Specific Issues

Enabling IOMMU Passthrough in Ubuntu

Follow the steps below to enable the Input-Output Memory Management Unit (IOMMU) passthrough in Ubuntu.

  1. Add GRUB_CMDLINE_LINUX="amd_iommu=on iommu=pt" to /etc/default/grub
  2. Update the boot config file using the following command:

           sudo update-grub

  3. Reboot the system.
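After the reboot, the running kernel's command line (exposed by Linux in /proc/cmdline) can be checked for both parameters. The following is a minimal sketch; the function name has_iommu_passthrough is hypothetical and not part of ROCm or Ubuntu.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if both amd_iommu=on and iommu=pt appear
# as whole, space-separated tokens in the kernel command line passed as $1.
has_iommu_passthrough() {
    found_on=0
    found_pt=0
    for token in $1; do
        case "$token" in
            amd_iommu=on) found_on=1 ;;
            iommu=pt)     found_pt=1 ;;
        esac
    done
    [ "$found_on" -eq 1 ] && [ "$found_pt" -eq 1 ]
}

# Usage after rebooting:
#   has_iommu_passthrough "$(cat /proc/cmdline)" && echo "passthrough parameters active"
```

Matching whole tokens rather than substrings avoids a false positive from a different parameter that merely contains the same text.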

For other Linux operating systems, or for questions about enabling the IOMMU passthrough setting, contact the designated AMD Field Application Engineer.

Note - IOMMU is required for using x2APIC. Refer to the AMD Application Note section in Workload Tuning Guide for AMD EPYC™ 7002 Series Processor-Based Servers.

Addressing Fatal Error Issue in RHEL

A fatal error issue is observed in the RHEL environment when running ROCm with more than one MI100 GPU installed.

To resolve the issue, follow the steps below for IOMMU and boot config file setting.

  1. Append amd_iommu=on and iommu=pt to the end of the GRUB_CMDLINE_LINUX line in the /etc/default/grub configuration file.

  2. Regenerate the grub.cfg file using the following command:

           grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

  3. Reboot the host for these changes to take effect.

  4. After the host comes up, verify that the rvs stress test starts as expected and that the fatal errors are no longer observed in the LC logs.
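
The edit to the GRUB_CMDLINE_LINUX line in step 1 can also be scripted. The following is a minimal sketch under stated assumptions: the function name append_iommu_args is hypothetical, and the file is assumed to hold the setting on a single line of the form GRUB_CMDLINE_LINUX="...". It keeps a backup copy and does nothing if iommu=pt is already present on that line.

```shell
#!/bin/sh
# Hypothetical helper: appends amd_iommu=on and iommu=pt inside the closing
# quote of the GRUB_CMDLINE_LINUX line in the file passed as $1.
# Assumes the setting is a single line of the form GRUB_CMDLINE_LINUX="...".
append_iommu_args() {
    file=$1
    if grep -q '^GRUB_CMDLINE_LINUX=.*iommu=pt' "$file"; then
        return 0    # already configured; leave the file untouched
    fi
    cp "$file" "$file.bak"    # keep a backup before editing
    # Replace the closing quote at the end of the line with the new parameters.
    sed '/^GRUB_CMDLINE_LINUX=/ s/"[[:space:]]*$/ amd_iommu=on iommu=pt"/' \
        "$file.bak" > "$file"
}

# Usage (as root), followed by steps 2 and 3 above:
#   append_iommu_args /etc/default/grub
```

Reviewing the resulting file before regenerating grub.cfg is advisable, since a malformed kernel command line can prevent the host from booting.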

Version history: Revision 3 of 3. Last update: 08-11-2021 12:15 PM