
13+ GPUs: Fatal error during GPU init, Ubuntu 16.04

Question asked by enzo on Jun 28, 2017
Latest reply on Jul 15, 2017 by enzo

Good day.

I have 13 RX 480/580 GPUs on an s2011v1 system running amdgpu-pro 17.10.

# lspci | grep VGA

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev e7)

05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev e7)

09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev e7)

0f:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

11:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

14:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

15:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

16:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

17:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev e7)

1c:04.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
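With this many cards, tallying the listing by hand is error-prone. A minimal sketch of how the `lspci` output above could be summarized, assuming it has been captured as text. The function names and the three-line sample (an excerpt of the listing above) are illustrative only:

```python
import re
from collections import Counter

# Matches lines such as:
# "01:00.0 VGA compatible controller: ... Device 67df (rev c7)"
VGA_LINE = re.compile(
    r"^(?P<addr>[0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f]) "
    r"VGA compatible controller: (?P<desc>.*)$"
)
REVISION = re.compile(r"\(rev ([0-9a-f]+)\)")


def list_vga_devices(lspci_text):
    """Return (address, description) pairs for every VGA line."""
    devices = []
    for line in lspci_text.splitlines():
        match = VGA_LINE.match(line.strip())
        if match:
            devices.append((match.group("addr"), match.group("desc")))
    return devices


def count_amd_by_revision(devices):
    """Count AMD/ATI devices, grouped by the PCI revision ID lspci prints."""
    counts = Counter()
    for _addr, desc in devices:
        if "[AMD/ATI]" in desc:
            rev = REVISION.search(desc)
            counts[rev.group(1) if rev else "unknown"] += 1
    return counts


# Excerpt of the listing above, used here only as sample input.
SAMPLE = """\
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)
04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev e7)
1c:04.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)
"""

if __name__ == "__main__":
    found = list_vga_devices(SAMPLE)
    print(len(found), "VGA devices;", dict(count_amd_by_revision(found)))
```

Feeding it the full output of `lspci | grep VGA` instead of `SAMPLE` gives the number of AMD cards the kernel actually enumerated, separate from the onboard Matrox controller.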

When I connect one more PCIe device (e.g. another GPU or an HBA controller), I get an error:

amdgpu 0000:04:00.0: Fatal error during GPU init

amdgpu: probe of 0000:04:00.0 failed with error -12

I can't find what error code -12 means. If it is an OS error code, I think it means "not enough resources", but I'm not sure.
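The "probe of ... failed with error" message prints a negated Linux errno value, so -12 can be looked up in the standard errno table. A minimal sketch using only the Python standard library (the helper name `describe_errno` is illustrative). On Linux, errno 12 is `ENOMEM`:

```python
import errno
import os


def describe_errno(code):
    """Map a kernel-style negative return code to (errno name, message)."""
    number = abs(code)
    name = errno.errorcode.get(number, "UNKNOWN")
    return name, os.strerror(number)


if __name__ == "__main__":
    name, message = describe_errno(-12)
    print(name, "-", message)
```

`ENOMEM` is a generic allocation failure, so the code alone does not say which resource ran out. The surrounding kernel log lines (for example the PCI BAR messages) are needed to narrow it down.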

Interestingly, the failing card is not the last one, which is what I would expect if this were a driver limit on the number of GPUs.

 

This card also reports the following errors:

[drm:amdgpu_device_init [amdgpu]] *ERROR* Unable to find PCI I/O BAR

[drm:amdgpu_device_init [amdgpu]] *ERROR* Unable to find PCI I/O BAR; using MMIO for ATOM IIO

This error can also appear for another card that otherwise works fine.
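One way to see which BARs a given card actually has is the per-device `resource` file in sysfs, where each line holds the start, end and flags of one region in hex. The sketch below parses that format and reports whether any region is an I/O port BAR. It assumes the `IORESOURCE_IO` (0x100) and `IORESOURCE_MEM` (0x200) flag bits from the kernel's `ioport.h`. The sample values are made up for illustration and are not taken from this system:

```python
from pathlib import Path

# Resource type bits as defined in the Linux kernel's include/linux/ioport.h.
IORESOURCE_IO = 0x100
IORESOURCE_MEM = 0x200


def parse_resource(text):
    """Parse a sysfs PCI 'resource' file: one 'start end flags' line per region."""
    regions = []
    for index, line in enumerate(text.splitlines()):
        fields = line.split()
        if len(fields) != 3:
            continue
        start, end, flags = (int(field, 16) for field in fields)
        if flags & IORESOURCE_IO:
            kind = "io"
        elif flags & IORESOURCE_MEM:
            kind = "mem"
        else:
            kind = "unused"
        size = end - start + 1 if flags else 0
        regions.append({"index": index, "kind": kind, "start": start, "size": size})
    return regions


def has_io_bar(regions):
    """True if at least one region is an I/O port range."""
    return any(region["kind"] == "io" for region in regions)


def read_device(address):
    """Read the resource file for an address such as '0000:04:00.0' (Linux only)."""
    path = Path("/sys/bus/pci/devices") / address / "resource"
    return parse_resource(path.read_text())


# Hypothetical sample contents, for illustration only.
SAMPLE = """\
0x00000000c0000000 0x00000000cfffffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x00000000d0000000 0x00000000d01fffff 0x000000000014220c
0x0000000000000000 0x0000000000000000 0x0000000000000000
0x000000000000e000 0x000000000000e0ff 0x0000000000040101
0x00000000fea00000 0x00000000fea3ffff 0x0000000000040200
"""

if __name__ == "__main__":
    for region in parse_resource(SAMPLE):
        print(region)
    print("I/O BAR present:", has_io_bar(parse_resource(SAMPLE)))
```

Running `read_device("0000:04:00.0")` on the affected machine and comparing the result with a working card would show whether the failing GPU is missing an I/O or memory region that the others received.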

 

If I swap PCIe slots, only the address of the failing GPU changes, from 04:00 to 03:00 or 05:00, etc.

I have connected up to 15 GPUs. All of them appeared in the lspci output, but the 14th and 15th GPUs produce the same errors.

 

Can I fix this somehow?
