2 Replies Latest reply on Jul 15, 2017 3:42 AM by enzo

    13+ GPUs: Fatal error during GPU init, Ubuntu 16.04

    enzo

      Good day.

      I have 13 RX 480/580 GPUs on a socket 2011v1 system with amdgpu-pro 17.10.

      # lspci | grep VGA

      01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

      04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev e7)

      05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

      06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev e7)

      09:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

      0a:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

      0b:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev e7)

      0f:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

      11:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

      14:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

      15:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

      16:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)

      17:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev e7)

      1c:04.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)

      When I connect one more PCIe device (e.g. another GPU or an HBA controller), I get an error:

      amdgpu 0000:04:00.0: Fatal error during GPU init

      amdgpu: probe of 0000:04:00.0 failed with error -12

      I can't find what error code -12 means. If it is an OS error code, I think it means "not enough resources", but I'm not sure.
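
      For reference: Linux kernel drivers generally report failures as negative errno values, so -12 most likely corresponds to ENOMEM (errno 12, "cannot allocate memory"). Below is a minimal sketch that pulls the code out of the log line quoted above and looks it up in Python's errno table. The helper name explain_probe_error is hypothetical, and the sketch assumes the code really is a standard errno.

```python
import errno
import os
import re

# Log line copied from the post above.
LOG_LINE = "amdgpu: probe of 0000:04:00.0 failed with error -12"


def explain_probe_error(line):
    """Hypothetical helper: map a 'failed with error -N' line to an errno name.

    Assumes the number is a negated standard errno, which is the usual
    kernel convention for probe failures.
    """
    match = re.search(r"failed with error (-?\d+)", line)
    if match is None:
        return None
    code = abs(int(match.group(1)))
    name = errno.errorcode.get(code, "unknown")
    # os.strerror returns the platform's human-readable text for the code.
    return code, name, os.strerror(code)


print(explain_probe_error(LOG_LINE))
```

      If the lookup gives ENOMEM, that would fit the "not enough resources" guess, although in a driver probe it may refer to PCI address space rather than system RAM.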

      Interestingly, the problem card is not the last one, which is what I would expect if there were a driver limit.

       

      Also, this card reports the following error:

      [drm:amdgpu_device_init [amdgpu]] *ERROR* Unable to find PCI I/O BAR

      [drm:amdgpu_device_init [amdgpu]] *ERROR* Unable to find PCI I/O BAR; using MMIO for ATOM IIO

      This error can also appear for another card that works fine.

       

      If I change PCIe slots, the address of the failing GPU just changes from 04:00 to 03:00 or 05:00, etc.

      I have connected up to 15 GPUs. All of them appeared in the lspci output, but the 14th and 15th GPUs had the same errors.

       

      Can I fix it somehow?

        • Re: 13+ GPUs: Fatal error during GPU init, Ubuntu 16.04
          ray_m

          I have never seen a mining rig with 15 GPUs, so this is something new for me :D

           

          Can you provide your complete system specs, including the motherboard model and how the 15 cards are physically installed? I'll follow up with Engineering on whether this is a supported configuration.

          • Re: 13+ GPUs: Fatal error during GPU init, Ubuntu 16.04
            enzo

            Good day!

            Me neither, only 13 GPUs.

            I have a Supermicro X9SRL-F motherboard. It has a single 2011v1 socket and 7 PCIe slots (1 on the PCH), with a Xeon E5-1620 or E5-2620 (both tested) and 2x 2 kW PSUs.

            All unnecessary devices were switched off in the BIOS (such as the USB controller, serial ports, etc.). All PCIe slots are set to Gen2 mode, which gives the best hashrate. It makes no difference which slots are used, so I use all of them. I have 3 PCIe splitters, such as these:

            post-5592-0-56830900-1470373222.jpg

            post-6973-0-48333900-1496643969_thumb.png

            I tried different motherboard slots for the splitters, even the PCH slot, and also tried cascading them. All cards work fine in any topology as long as there are <=13 video cards.

             

            With 12+ video cards connected, the motherboard enters the BIOS by itself and shows this message:

            PCI OUT OF RESOURCES CONDITION:

             

             

            Error: Insufficient PCI Resources Detected!!!

             

             

            System is running with Insufficient PCI Resources!

            In order to display this message some

            PCI devices were set to disabled state!

            It is strongly recommended to Power Off the system and remove some PCI/PCI Express cards from the system!

            To continue booting, proceed to Menu Option and select Boot Device or .

             

             

            WARNING: If you choose to continue booting some Operating

            Systems might not be able to complete boot correctly!

            But in this case I can manually choose the boot device and the OS starts.

            So I think the problem is PCI resources, but I don't understand why lspci still recognizes all the cards.
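
            One possible explanation, offered as an assumption rather than a diagnosis: lspci lists a device as soon as its configuration space answers during bus enumeration, while the driver also needs the device's memory BARs to be assigned an address range. If the firmware runs out of address space, a card can be visible in lspci and still have BARs that were never assigned. The sketch below scans `lspci -v` text for regions that pciutils prints as `<unassigned>` or `<ignored>`. The sample text is illustrative only (it is not output from this system), and the function name devices_with_unset_bars is hypothetical.

```python
import re

# Illustrative sample only: NOT captured from the machine in this thread.
SAMPLE = """\
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev c7)
\tMemory at c0000000 (64-bit, prefetchable) [size=256M]
\tMemory at d0000000 (64-bit, prefetchable) [size=2M]

04:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 67df (rev e7)
\tMemory at <unassigned> (64-bit, prefetchable) [disabled]
\tMemory at <ignored> (64-bit, prefetchable) [disabled]
"""

# A device header starts at column 0 with bus:device.function
# (assumes lspci was run without -D, i.e. no PCI domain prefix).
DEVICE_RE = re.compile(r"^([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]) ")
# A BAR line whose address was not assigned by firmware or kernel.
UNSET_RE = re.compile(r"(?:Memory|I/O ports) at <(?:unassigned|ignored)>")


def devices_with_unset_bars(lspci_text):
    """Hypothetical helper: count unassigned/ignored BARs per device."""
    counts = {}
    current = None
    for line in lspci_text.splitlines():
        header = DEVICE_RE.match(line)
        if header:
            current = header.group(1)
            continue
        if current and UNSET_RE.search(line):
            counts[current] = counts.get(current, 0) + 1
    return counts


print(devices_with_unset_bars(SAMPLE))
```

            To try this on a real system, the sample could be replaced with the text from `lspci -v` run as root. Any device it reports would be a candidate for the card that fails to initialize.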

             

            I want to try a dual-socket motherboard, because it will have more PCIe lanes, but I still want to know what error code -12 means.