
amdgpu-pro linux 4.10 GPU fault on cape verde

Question asked by jpsollie on Feb 21, 2017
Latest reply on Feb 23, 2018 by admmedlifer

Dear AMD people,

 

I am posting this because one of your colleagues here emailed me to say I should contact the driver team.

My system

-Mainboard: supermicro H8DGI

-CPU: 2x opteron 6276 (32 cores total)

-GPU1 Matrox MGAG200, onboard, no drivers installed

-GPU2: AMD R9 NANO (Fiji)

-GPU3: AMD 7750  (Cape Verde)

- RAM: 128GB registered DDR3

-OS: gentoo linux

-kernel version: 4.10, amdgpu with experimental SI and CIK support enabled.

-amdgpu-pro driver version: 16.60.3

I slightly modified the module source (in /usr/src/amd...) to be compatible with the 4.10 kernel; the modified source is attached. The changes involve adapting to renamed functions and headers.

-psu: rosewill 80+ gold 1000W

This computer is headless: X is installed but rarely used, and the machine is operated via ssh.

When I run an OpenCL-based program, I get the following error as soon as it launches a kernel on the Cape Verde device:

[  349.362694] amdgpu 0000:41:00.0: GPU fault detected: 146 0x062a770c

[  349.362700] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001032B1

[  349.362702] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C

[  349.362707] amdgpu 0000:41:00.0: VM fault (0x0c, vmid 5) at page 1061553, read from '' (0x00000000) (119)
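For anyone reading the numbers: the values in the "VM fault" line appear to be derived from the two registers printed just before it. The sketch below decodes them. The bit layout is an assumption inferred from the log lines in this post (it reproduces the values the driver printed), not something taken from AMD documentation, and the function names are made up for illustration.

```python
import re

# ASSUMPTION: bit positions inferred from the log lines in this post.
# They reproduce the values the driver printed, nothing more.
def decode_fault_status(status):
    return {
        "protections": status & 0xFF,               # printed as 0x0c
        "mc_id": (status >> 12) & 0xFF,             # printed as (119)
        "rw": "write" if (status >> 24) & 1 else "read",
        "vmid": (status >> 25) & 0xF,               # printed as vmid 5
    }

ADDR_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_ADDR\s+0x([0-9A-Fa-f]+)")
STATUS_RE = re.compile(r"VM_CONTEXT1_PROTECTION_FAULT_STATUS\s+0x([0-9A-Fa-f]+)")

def decode_log(text):
    """Pair each FAULT_ADDR value with the FAULT_STATUS value that follows it."""
    pages = [int(m, 16) for m in ADDR_RE.findall(text)]
    statuses = [int(m, 16) for m in STATUS_RE.findall(text)]
    return [dict(page=p, **decode_fault_status(s)) for p, s in zip(pages, statuses)]

sample = """\
[  349.362694] amdgpu 0000:41:00.0: GPU fault detected: 146 0x062a770c
[  349.362700] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x001032B1
[  349.362702] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A07700C
[  349.362707] amdgpu 0000:41:00.0: VM fault (0x0c, vmid 5) at page 1061553, read from '' (0x00000000) (119)
"""

for fault in decode_log(sample):
    print(fault)
```

For the fault above this gives page 1061553 (0x001032B1 in decimal), vmid 5, a read, protections 0x0c and client id 119, which matches the driver's own "VM fault" line.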

 

When I load the amdgpu-pro module with vm_debug=1, I get these errors instead:

[  538.042397] AMD-Vi: Event logged [

[  538.042402] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x00000000fe6c3000 flags=0x0000]

[  538.042403] AMD-Vi: Event logged [

[  538.042404] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x00000000fe6c3040 flags=0x0000]

[  538.042405] AMD-Vi: Event logged [

[  538.042406] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x00000000fe6c3080 flags=0x0000]

[  538.042406] AMD-Vi: Event logged [

[  538.042407] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x00000000fe6c30c0 flags=0x0000]

[  539.604731] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00020802

[  539.604735] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00800000

[  539.604737] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02008002

[  539.604740] amdgpu 0000:41:00.0: VM fault (0x02, vmid 1) at page 8388608, read from '' (0x00000000) (8)

[  539.604745] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00020802

[  539.604747] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000087

[  539.604748] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02008002

[  539.604751] amdgpu 0000:41:00.0: VM fault (0x02, vmid 1) at page 135, read from '' (0x00000000) (8)

[  539.604755] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00e20802

[  539.604757] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x002B1831

[  539.604759] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03008002

[  539.604761] amdgpu 0000:41:00.0: VM fault (0x02, vmid 1) at page 2824241, write from '' (0x00000000) (8)

[  539.604765] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00020802

[  539.604767] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0DAB3B9C

[  539.604768] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03048001

[  539.604771] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 229325724, write from '' (0x00000000) (72)

[  539.604775] amdgpu 0000:41:00.0: GPU fault detected: 147 0x06230802

[  539.604777] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0BC38C5D

[  539.604779] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03008001

[  539.604781] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 197364829, write from '' (0x00000000) (8)

[  539.604785] amdgpu 0000:41:00.0: GPU fault detected: 147 0x02630401

[  539.604787] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x036735B0

[  539.604789] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03044001

[  539.604791] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 57095600, write from '' (0x00000000) (68)

[  539.604795] amdgpu 0000:41:00.0: GPU fault detected: 147 0x03430401

[  539.604797] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x03A955DA

[  539.604798] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03004001

[  539.604801] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 61429210, write from '' (0x00000000) (4)

[  539.604805] amdgpu 0000:41:00.0: GPU fault detected: 147 0x0f230401

[  539.604807] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0B35588A

[  539.604808] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03048001

[  539.604811] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 188045450, write from '' (0x00000000) (72)

[  539.604815] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00030401

[  539.604816] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x03706988

[  539.604818] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03044001

[  539.604820] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 57698696, write from '' (0x00000000) (68)

[  539.604824] amdgpu 0000:41:00.0: GPU fault detected: 147 0x04430401

[  539.604826] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x09DB0521

[  539.604828] amdgpu 0000:41:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x03044001

[  539.604830] amdgpu 0000:41:00.0: VM fault (0x01, vmid 1) at page 165348641, write from '' (0x00000000) (68)

[  539.613589] AMD-Vi: Event logged [

[  539.613593] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x0000003af8a6a000 flags=0x0000]

As you can see, every fault is reported for PCI device 41:00.0, which is the Cape Verde card, as the lspci output shows:

00:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 Northbridge only dual slot (2x16) PCI-e GFX Hydra part (rev 02)

00:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD/ATI] RD990 I/O Memory Management Unit (IOMMU)

00:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port B)

00:0d.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (external gfx1 port B)

00:11.0 SATA controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 SATA Controller [AHCI mode]

00:12.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller

00:12.1 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller

00:12.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller

00:13.0 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI0 Controller

00:13.1 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0 USB OHCI1 Controller

00:13.2 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB EHCI Controller

00:14.0 SMBus: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 SMBus Controller (rev 3d)

00:14.3 ISA bridge: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 LPC host controller

00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge

00:14.5 USB controller: Advanced Micro Devices, Inc. [AMD/ATI] SB7x0/SB8x0/SB9x0 USB OHCI2 Controller

00:18.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0

00:18.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1

00:18.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2

00:18.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3

00:18.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4

00:18.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5

00:19.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0

00:19.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1

00:19.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2

00:19.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3

00:19.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4

00:19.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5

00:1a.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0

00:1a.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1

00:1a.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2

00:1a.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3

00:1a.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4

00:1a.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5

00:1b.0 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 0

00:1b.1 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 1

00:1b.2 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 2

00:1b.3 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 3

00:1b.4 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 4

00:1b.5 Host bridge: Advanced Micro Devices, Inc. [AMD] Family 15h Processor Function 5

01:04.0 VGA compatible controller: Matrox Electronics Systems Ltd. MGA G200eW WPCM450 (rev 0a)

02:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)

02:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01)

03:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)

40:00.0 Host bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 Northbridge only dual slot (2x16) PCI-e GFX Hydra part (rev 02)

40:00.2 IOMMU: Advanced Micro Devices, Inc. [AMD/ATI] RD990 I/O Memory Management Unit (IOMMU)

40:02.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (PCI express gpp port B)

40:0b.0 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (NB-SB link)

41:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde PRO [Radeon HD 7750/8740 / R7 250E]

41:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Cape Verde/Pitcairn HDMI Audio [Radeon HD 7700/7800 Series]

42:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji XT [Radeon R9 FURY X] (rev ca)

42:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aae8
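To check mechanically that every fault line points at the same device, something like the following could be run over the dmesg output. It is only a sketch: the regular expression is written against the two line formats seen in this post (`amdgpu 0000:41:00.0:` and `IO_PAGE_FAULT device=41:00.0`), and `lines_per_device` is a name made up here.

```python
import re
from collections import Counter

# Matches the device address in both line formats that appear in this post.
DEVICE_RE = re.compile(
    r"(?:amdgpu\s+[0-9a-f]{4}:|IO_PAGE_FAULT device=)"
    r"([0-9a-f]{2}:[0-9a-f]{2}\.[0-9a-f])"
)

def lines_per_device(dmesg_text):
    """Count how many fault-related log lines mention each PCI device."""
    return Counter(DEVICE_RE.findall(dmesg_text))

sample = """\
[  538.042402] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x00000000fe6c3000 flags=0x0000]
[  539.604731] amdgpu 0000:41:00.0: GPU fault detected: 147 0x00020802
[  539.613593] IO_PAGE_FAULT device=41:00.0 domain=0x0012 address=0x0000003af8a6a000 flags=0x0000]
"""

print(lines_per_device(sample))
```

If the Fiji card (42:00.0) were also faulting, it would show up as a second key in the counter.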

The kernel does not report this as a bug, but the program complains that it is unable to enqueue the kernel task, and the events above are logged in dmesg.

 

I suspect the problem is related to the following:

- My Fiji has 4GB RAM and is a 64-bit GCN device

- My Cape Verde has 1GB RAM and is 32-bit (AMD, please confirm!)

- I have to export GPU_FORCE_64BIT_PTR=1 to avoid segmentation faults, but could this also be what makes the Cape Verde card unavailable?
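One way to test the last point would be to run the same program twice, once with GPU_FORCE_64BIT_PTR=1 and once with the variable removed, and compare dmesg after each run. The sketch below only toggles the variable; the command is a stand-in so that it runs anywhere, and `run_with_ptr_setting` is a hypothetical helper, not part of any AMD tool.

```python
import os
import subprocess
import sys

def run_with_ptr_setting(cmd, force_64bit):
    """Run cmd with GPU_FORCE_64BIT_PTR set to "1" or removed from the environment."""
    env = dict(os.environ)
    if force_64bit:
        env["GPU_FORCE_64BIT_PTR"] = "1"
    else:
        env.pop("GPU_FORCE_64BIT_PTR", None)
    result = subprocess.run(cmd, env=env, capture_output=True, text=True)
    return result.returncode, result.stdout

# Stand-in command that just prints the variable it sees;
# replace it with the real OpenCL program to run the comparison.
probe = [sys.executable, "-c",
         "import os; print(os.environ.get('GPU_FORCE_64BIT_PTR'))"]

for setting in (True, False):
    print(setting, run_with_ptr_setting(probe, setting))
```

If the Cape Verde faults only appear in the run with the variable set, that would support the suspicion above; if they appear in both runs, the variable is probably not the cause.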

 

Does anyone have a solution?
