5 Replies Latest reply on Feb 23, 2018 6:13 AM by admmedlifer

    amdgpu-pro linux 4.10 GPU fault on cape verde

    jpsollie

      Dear AMD people,

       

      I am posting this because one of your colleagues here e-mailed me to say I should contact the driver team:

      My system

      -Mainboard: supermicro H8DGI

      -CPU: 2x opteron 6276 (32 cores total)

      -GPU1 Matrox MGAG200, onboard, no drivers installed

      -GPU2: AMD R9 NANO (Fiji)

      -GPU3: AMD 7750  (Cape Verde)

      - RAM: 128GB registered DDR3

      -OS: gentoo linux

      -kernel version: 4.10, amdGPU with SI and CIK experimental support enabled.

      -amdgpu-pro driver version: 16.60.3

      I slightly modified the module source (in /usr/src/amd...) to be compatible with the 4.10 kernel; it is attached here.  The changes involve adapting to renamed functions and headers.

      -psu: rosewill 80+ gold 1000W

      This computer is headless: X is installed but rarely used, and the machine is operated via ssh.

      When executing an OpenCL-based program, I get the following error when launching the kernel on the Cape Verde device:

      [  349.362694] amdgpu 0000:41:00.0: GPU fault detected: 146 0x062a770c

      [  349.362700] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001032B1

      [  349.362702] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C

      [  349.362707] amdgpu 0000:41:00.0: VM fault (0x0c, vmid 5) at page 1061553, read from '' (0x00000000) (119)
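For anyone trying to read these registers: the summary line can be reproduced from the two raw values. Below is a minimal Python sketch. The bit layout is an assumption inferred from the log lines in this thread (it reproduces the vmid, read/write flag, client id and page that the driver prints), not an authoritative register description, and the names `VmFault` and `decode_vm_fault` are made up for illustration.

```python
from typing import NamedTuple


class VmFault(NamedTuple):
    protections: int  # low byte, printed as "VM fault (0x..)"
    client_id: int    # memory client id, printed last in parentheses
    is_write: bool    # True for "write from", False for "read from"
    vmid: int         # VM context id
    page: int         # faulting page, printed in decimal


def decode_vm_fault(status: int, addr: int) -> VmFault:
    """Split a FAULT_STATUS/FAULT_ADDR pair into the fields dmesg prints.

    Assumed layout (inferred from the logs): bits 0-7 protections,
    bits 12-19 client id, bit 24 read/write, bits 25-28 vmid.
    """
    return VmFault(
        protections=status & 0xFF,
        client_id=(status >> 12) & 0xFF,
        is_write=bool((status >> 24) & 0x1),
        vmid=(status >> 25) & 0xF,
        page=addr,
    )


# Values taken from the first fault above.
fault = decode_vm_fault(0x0A07700C, 0x001032B1)
print(fault)
```

With the first fault's values this gives protections 0x0c, vmid 5, a read, client id 119 and page 1061553, the same as the driver's own summary line.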

       

      When loading the amdgpu_pro module with vm_debug=1, I get these errors:

      [  538.042397] AMD-Vi: Event logged [

      [  538.042402] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x00000000fe6c3000 flags=0x0000]

      [  538.042403] AMD-Vi: Event logged [

      [  538.042404] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x00000000fe6c3040 flags=0x0000]

      [  538.042405] AMD-Vi: Event logged [

      [  538.042406] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x00000000fe6c3080 flags=0x0000]

      [  538.042406] AMD-Vi: Event logged [

      [  538.042407] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x00000000fe6c30c0 flags=0x0000]

      [  539.604731] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00020802

      [  539.604735] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00800000

      [  539.604737] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02008002

      [  539.604740] amdgpu 0000:41:00.0: VM fault (0x02, vmid 1) at page 8388608, read from '' (0x00000000) (8)

      [  539.604745] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00020802

      [  539.604747] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000087

      [  539.604748] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02008002

      [  539.604751] amdgpu 0000:41:00.0: VM fault (0x02, vmid 1) at page 135, read from '' (0x00000000) (8)

      [  539.604755] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00e20802

      [  539.604757] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x002B1831

      [  539.604759] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03008002

      [  539.604761] amdgpu 0000:41:00.0: VM fault (0x02, vmid 1) at page 2824241, write from '' (0x00000000) (8)

      [  539.604765] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00020802

      [  539.604767] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0DAB3B9C

      [  539.604768] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03048001

      [  539.604771] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 229325724, write from '' (0x00000000) (72)

      [  539.604775] amdgpu 0000:41:00.0: GPU fault detected: 147 0x06230802

      [  539.604777] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0BC38C5D

      [  539.604779] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03008001

      [  539.604781] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 197364829, write from '' (0x00000000) (8)

      [  539.604785] amdgpu 0000:41:00.0: GPU fault detected: 147 0x02630401

      [  539.604787] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x036735B0

      [  539.604789] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03044001

      [  539.604791] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 57095600, write from '' (0x00000000) (68)

      [  539.604795] amdgpu 0000:41:00.0: GPU fault detected: 147 0x03430401

      [  539.604797] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x03A955DA

      [  539.604798] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03004001

      [  539.604801] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 61429210, write from '' (0x00000000) (4)

      [  539.604805] amdgpu 0000:41:00.0: GPU fault detected: 147 0x0f230401

      [  539.604807] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0B35588A

      [  539.604808] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03048001

      [  539.604811] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 188045450, write from '' (0x00000000) (72)

      [  539.604815] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00030401

      [  539.604816] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x03706988

      [  539.604818] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03044001

      [  539.604820] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 57698696, write from '' (0x00000000) (68)

      [  539.604824] amdgpu 0000:41:00.0: GPU fault detected: 147 0x04430401

      [  539.604826] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x09DB0521

      [  539.604828] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03044001

      [  539.604830] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 165348641, write from '' (0x00000000) (68)

      [  539.613589] AMD-Vi: Event logged [

      [  539.613593] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x0000003af8a6a000 flags=0x0000]

      As you can see, every fault is reported for PCI address 41:00.0.

      Bus 41 is where the Cape Verde device sits:

      00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 Northbridge only dual slot (2x16) PCI-e GFX Hydra part (rev 02)

      00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD/ATI] RD990 I/O Memory Management Unit (IOMMU)

      00:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port B)

      00:0d.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (external gfx1 port B)

      00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]

      00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller

      00:12.1 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller

      00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller

      00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller

      00:13.1 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller

      00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller

      00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller (rev 3d)

      00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller

      00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge

      00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller

      00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0

      00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1

      00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2

      00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3

      00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4

      00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5

      00:19.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0

      00:19.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1

      00:19.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2

      00:19.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3

      00:19.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4

      00:19.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5

      00:1a.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0

      00:1a.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1

      00:1a.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2

      00:1a.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3

      00:1a.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4

      00:1a.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5

      00:1b.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0

      00:1b.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1

      00:1b.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2

      00:1b.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3

      00:1b.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4

      00:1b.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5

      01:04.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)

      02:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)

      02:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)

      03:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)

      40:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 Northbridge only dual slot (2x16) PCI-e GFX Hydra part (rev 02)

      40:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD/ATI] RD990 I/O Memory Management Unit (IOMMU)

      40:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port B)

      40:0b.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (NB-SB link)

      41:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde PRO [Radeon HD 7750/8740 / R7 250E]

      41:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]

      42:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji XT [Radeon R9 FURY X] (rev ca)

      42:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aae8
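The observation that every fault belongs to 41:00.0 can also be checked mechanically on a longer log. This is a rough Python sketch that counts fault lines per PCI address. The regular expressions are assumptions based only on the message formats pasted above, and `count_faults_by_device` is a hypothetical helper name.

```python
import re
from collections import Counter

# Patterns modelled on the two kinds of fault lines shown in this post.
AMDGPU_RE = re.compile(
    r"amdgpu (?:[0-9a-f]{4}:)?([0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): GPU fault detected"
)
IOMMU_RE = re.compile(r"IO_PAGE_FAULT device=([0-9a-f]{2}:[0-9a-f]{2}\.[0-7])")


def count_faults_by_device(log_text: str) -> Counter:
    """Count GPU and IOMMU fault lines per PCI bus:device.function."""
    counts = Counter()
    for line in log_text.splitlines():
        for pattern in (AMDGPU_RE, IOMMU_RE):
            match = pattern.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts


# Three lines copied from the log above; only two of them start a fault.
sample = """\
[  538.042402] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x00000000fe6c3000 flags=0x0000]
[  539.604731] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00020802
[  539.604735] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00800000
"""
print(count_faults_by_device(sample))
```

Feeding it the full dmesg output instead of `sample` would show at a glance whether any address other than 41:00.0 ever appears.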

      The kernel does not report this as a bug, but the program complains that it is unable to enqueue the kernel task and events are logged in dmesg.

       

      I suspect the problem is related to the following:

      - My Fiji has 4 GB of RAM and is a 64-bit GCN device

      - My Cape Verde has 1 GB of RAM and is 32-bit (AMD, please confirm!)

      - I have to export GPU_FORCE_64BIT_PTR=1 to avoid segmentation faults, but could this also be what makes the Cape Verde card unavailable?

       

      Does anyone have a solution?

        • Re: amdgpu-pro linux 4.10 GPU fault on cape verde
          jpsollie

          Tonight I found something interesting:

          When the IOMMU functionality is disabled on the mainboard, the system no longer reports any errors, but enqueueNDRangeKernel still does not work.

          Even worse: the call to clCreateCommandQueue never returns for the Cape Verde device, while for the Fiji device there is no problem.

          All suggestions are welcome!

           

          *Update: it seems to be a thread-safety issue: when I set a breakpoint in GDB at clCreateCommandQueue and "continue" every time the breakpoint is hit, there is no problem.  Either way, that is not a workable solution.

          To summarize, these are the steps I take to get to the compiler stage:

          -boot with IOMMU disabled

          -load the module with dpm=1 vm_debug=1

          -adjust the LD_LIBRARY_PATH

          -force the use of 64-bit pointers
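The last two steps (library path and 64-bit pointers) can be captured in a small launcher so they are not forgotten between runs. This is only a sketch in Python: the library directory is a placeholder, since the actual path is not given in this thread, and `build_opencl_env` is a hypothetical helper. The only variable name taken from the posts is GPU_FORCE_64BIT_PTR.

```python
import os


def build_opencl_env(lib_dir, base_env=None):
    """Return a copy of the environment with the two settings from the steps above.

    lib_dir is whatever directory holds the OpenCL libraries (placeholder here).
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_LIBRARY_PATH", "")
    # Prepend so the chosen libraries take precedence over system ones.
    env["LD_LIBRARY_PATH"] = lib_dir + (os.pathsep + existing if existing else "")
    # Force 64-bit pointers, as described in the first post.
    env["GPU_FORCE_64BIT_PTR"] = "1"
    return env


# Placeholder path; replace with the real library location.
env = build_opencl_env("/path/to/amdgpu-pro/lib", base_env={"LD_LIBRARY_PATH": "/usr/lib"})
print(env["LD_LIBRARY_PATH"])
print(env["GPU_FORCE_64BIT_PTR"])
```

The resulting dictionary can be handed to `subprocess.run([...], env=env)` to start the OpenCL program. Booting with the IOMMU disabled and loading the module with dpm=1 vm_debug=1 still have to be done separately.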

           

          Where should I file a bug for this?

            • Re: amdgpu-pro linux 4.10 GPU fault on cape verde
              jpsollie

              To everyone following this topic: I have good and bad news for you.

              The good news:

              - I got through it

              The bad news:

              - Being one step closer to the end doesn't mean we have reached the end

               

              What you need to do (in addition): use the cl-no-optimizations option to disable all optimizations that clang and LLVM make (in standard OpenCL the build option for this is spelled -cl-opt-disable).

              Result: the kernel gets invoked on my 3 devices, and finishes(!) on the Fiji device. This is a big step forward!

              Side effect: look at this. This is my dmesg while my program was waiting for the other kernels to finish:

              [  172.221831] amdgpu 0000:41:00.0: GPU fault detected: 146 0x020c770c

              [  172.221834] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100010

              [  172.221836] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0C07700C

              [  172.221838] amdgpu 0000:41:00.0: VM fault (0x0c, vmid 6) at page 1048592, read from '' (0x00000000) (119)

              [  182.425935] amdgpu 0000:41:00.0: GPU fault detected: 146 0x002e3d0c

              [  182.425937] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100D81

              [  182.425939] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E03D00C

              [  182.425941] amdgpu 0000:41:00.0: VM fault (0x0c, vmid 7) at page 1052033, read from '' (0x00000000) (61)

              [  182.426602] amdgpu 0000:41:00.0: GPU fault detected: 146 0x04033014

              [  182.426603] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00100DA0

              [  182.426604] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03030014

              [  182.426606] amdgpu 0000:41:00.0: VM fault (0x14, vmid 1) at page 1052064, write from '' (0x00000000) (48)

              [  182.426610] amdgpu 0000:41:00.0: GPU fault detected: 146 0x06033014

              [  182.426611] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101A10

              [  182.426612] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03030014

              [  182.426613] amdgpu 0000:41:00.0: VM fault (0x14, vmid 1) at page 1055248, write from '' (0x00000000) (48)

              [  182.426617] amdgpu 0000:41:00.0: GPU fault detected: 146 0x06033014

              [  182.426618] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101AB0

              [  182.426619] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03030014

              [  182.426620] amdgpu 0000:41:00.0: VM fault (0x14, vmid 1) at page 1055408, write from '' (0x00000000) (48)

              [  182.426624] amdgpu 0000:41:00.0: GPU fault detected: 146 0x0a033014

              [  182.426625] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101B50

              [  182.426626] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03030014

              [  182.426627] amdgpu 0000:41:00.0: VM fault (0x14, vmid 1) at page 1055568, write from '' (0x00000000) (48)

              [  182.426631] amdgpu 0000:41:00.0: GPU fault detected: 146 0x0e033014

              [  182.426632] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101BF0

              [  182.426633] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03030014

              [  182.426634] amdgpu 0000:41:00.0: VM fault (0x14, vmid 1) at page 1055728, write from '' (0x00000000) (48)

              [  182.426638] amdgpu 0000:41:00.0: GPU fault detected: 146 0x02033014

              [  182.426639] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00101C70

              [  182.426640] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03030014

              [  182.426641] amdgpu 0000:41:00.0: VM fault (0x14, vmid 1) at page 1055856, write from '' (0x00000000) (48)

               

               

              Any hints?

            • Re: amdgpu-pro linux 4.10 GPU fault on cape verde
              jpsollie

              For people still reading this topic: I removed the R9 Nano from my system and ran the test again.  The result was the same, except that the page faults never stopped: as long as the OpenCL program ran, it produced page faults.  I clearly have to look for a driver bug in the Cape Verde code here.  I'll post an update as soon as I find something.