IOMMU Advisory for AMD Instinct™

Product Impacted - AMD Instinct™

AMD Instinct™ accelerators using ROCm in a Linux environment. Note: This issue is observed in all Linux environments.

Background

The IOMMU virtualizes the address space for the guest environment. Therefore, the GPU and RDMA devices must have the same guest physical address space for the peer-to-peer functionality to work correctly within the guest environment. 

However, on a bare-metal system without SR-IOV virtualization, the IOMMU gives each device its own input-output virtual address (IOVA) space for security. In this scenario, peer-to-peer functionality fails because each device sees a different address space.

Known Impact

If AMD ROCm is installed, the system may report failures or errors when running workloads such as the bandwidth test, clinfo, and HelloWorld.cl, and in some cases may crash. The errors reported include:

  • IO PAGE FAULT
  • IRQ remapping doesn’t support X2APIC mode
  • NMI error

In a correct ROCm installation, the system must not encounter the errors mentioned above.
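
One way to check whether a system is affected is to search the kernel log for the error signatures listed above. The following is a minimal sketch, not an AMD-provided tool. The function name scan_iommu_errors and the patterns are assumptions, and the wording your kernel prints may differ from the bullet text (for example, underscores instead of spaces).

```shell
# Sketch: filter kernel log text (read from stdin) for the error signatures
# listed in this advisory. The patterns are assumptions; adjust them to the
# messages your kernel actually prints.
scan_iommu_errors() {
    # "IO[ _]PAGE[ _]FAULT" accepts both space- and underscore-separated forms.
    # "doesn.t" accepts any single character as the apostrophe.
    # The "NMI" pattern is deliberately broad and can also match benign lines
    # such as NMI watchdog messages, so review the output by hand.
    grep -E 'IO[ _]PAGE[ _]FAULT|IRQ remapping doesn.t support [xX]2APIC mode|NMI'
}

# Example usage (reading the kernel log may require root):
#   sudo dmesg | scan_iommu_errors || echo "No matching messages found"
```

The function exits with a non-zero status when nothing matches, so it can be used directly in a conditional.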

Required Action

The Input-Output Memory Management Unit (IOMMU) option must be enabled with the iommu=pt (passthrough) setting.

When IOMMU passthrough is enabled, the input-output virtual addresses match the system’s physical addresses. All devices then have the same view of memory, which avoids address remapping issues and page faults.

Addressing Environment-Specific Issues

Enabling IOMMU Passthrough in Ubuntu

Follow the steps below to enable the Input-Output Memory Management Unit (IOMMU) passthrough in Ubuntu.

  1. Add GRUB_CMDLINE_LINUX="iommu=pt" to /etc/default/grub
  2. Update the boot config file using the following command:

    sudo update-grub

  3. Reboot the system.
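
Step 1 above can be scripted so that it is safe to run more than once. The following is a minimal sketch under stated assumptions: add_iommu_pt is a hypothetical helper name, the file is assumed to use double-quoted values, and the helper appends to an existing GRUB_CMDLINE_LINUX value rather than overwriting it. Try it on a copy of the file first.

```shell
# Sketch: ensure iommu=pt is present in GRUB_CMDLINE_LINUX in a GRUB defaults
# file. add_iommu_pt is a hypothetical helper, not part of GRUB or ROCm.
# A backup of the original file is written to "<file>.bak".
add_iommu_pt() {
    file=$1

    # Nothing to do if the parameter is already set on that line.
    if grep -E '^GRUB_CMDLINE_LINUX=' "$file" | grep -Eq '[" ]iommu=pt[" ]'; then
        return 0
    fi

    if grep -q '^GRUB_CMDLINE_LINUX="' "$file"; then
        # Append to the existing value, then drop the leading space that
        # results when the existing value was empty.
        sed -i.bak -E \
            's/^(GRUB_CMDLINE_LINUX=")(.*)"[[:space:]]*$/\1\2 iommu=pt"/; s/^(GRUB_CMDLINE_LINUX=") /\1/' \
            "$file"
    else
        # No such line yet: keep a backup and add one.
        cp "$file" "$file.bak"
        printf '%s\n' 'GRUB_CMDLINE_LINUX="iommu=pt"' >> "$file"
    fi
}

# Example usage (from a root shell), followed by steps 2 and 3:
#   add_iommu_pt /etc/default/grub && update-grub
```

The pattern only matches GRUB_CMDLINE_LINUX itself, so a GRUB_CMDLINE_LINUX_DEFAULT line in the same file is left untouched.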

           

For other Linux operating systems, or for any questions about enabling the IOMMU passthrough setting, contact the designated AMD Field Application Engineer.

Note - IOMMU is required for using x2APIC. Refer to the AMD Application Note section in Workload Tuning Guide for AMD EPYC™ 7002 Series Processor-Based Servers.

Addressing Fatal Error Issue in RHEL

A fatal error issue is observed in the RHEL environment when running ROCm with more than one MI100 GPU installed.

To resolve the issue, follow the steps below to set the IOMMU option and update the boot config file.

  1. Append iommu=pt to the end of the GRUB_CMDLINE_LINUX line in the /etc/default/grub configuration file.

  2. Refresh the grub.cfg file using the following command:

    grub2-mkconfig -o /boot/efi/EFI/redhat/grub.cfg

  3. Reboot the host for these changes to take effect.

  4. After the host comes up, verify that the rvs stress test starts as expected and that the fatal errors no longer appear in the LC logs.
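
After the reboot, it can also help to confirm that the parameter actually reached the running kernel. The following is a minimal sketch: has_iommu_pt is a hypothetical helper name, and it takes the command line as an argument so it can be tried on sample strings. On Linux, the running kernel's command line is exposed in /proc/cmdline.

```shell
# Sketch: check whether a kernel command line contains the iommu=pt parameter.
# has_iommu_pt is a hypothetical helper, not part of ROCm or rvs.
has_iommu_pt() {
    # Pad with spaces so the parameter is matched as a whole word; this
    # avoids false matches on parameters that merely end in "iommu=pt".
    case " $1 " in
        *" iommu=pt "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example usage on the running system:
#   has_iommu_pt "$(cat /proc/cmdline)" && echo "iommu=pt is active"
```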
Comments

Should these pages be scrubbed as some of them seem outdated or incomplete which could be misleading?

@Roopa_Malavally  I see you updated this article 2/20/2024 and changed the grub recommendation from:

amd_iommu=on iommu=pt

to just

iommu=pt

I had already applied the original fix with the 2 additions prior to this update, and haven't noticed any ill effects. Can you please comment on if there is any harm in leaving amd_iommu=on in grub, and any pros/cons?

Thank you so much for reaching out. I will check with an AMD subject matter expert and get back to you.


@Roopa_Malavally  Thank you!
