
GPU fault detected: 147 0x0122c402 Ubuntu 16.10

Question asked by vitkor on Apr 19, 2017
Latest reply on Jun 3, 2017 by beanow

Hi All!

I work with neural networks using cltorch.

 

For a while everything worked well with driver version 16.60.
Then messages like these suddenly appeared in the logs:
[22.863042] amdgpu 0000:02:00.0: GPU fault detected: 147 0x01324402
[22.863044] amdgpu 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001E0E26
[22.863045] amdgpu 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05044002
[22.863047] amdgpu 0000:02:00.0: VM fault (0x02, vmid 2) at page 1969702, write from 'TC5' (0x54433500) (68)
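
For anyone reading these lines, two of the fields in the "VM fault" message can be cross-checked against the raw values: the page number is the FAULT_ADDR register written in decimal, and the client value's bytes spell the ASCII tag. A minimal sketch in POSIX shell; the function names hex_to_dec and decode_client are hypothetical helpers, not part of any driver tool.

```shell
#!/bin/sh
# Hypothetical helpers for reading the amdgpu fault lines quoted above.

# Convert a hex register value (e.g. VM_CONTEXT1_PROTECTION_FAULT_ADDR) to decimal.
hex_to_dec() {
    printf '%d\n' "$(( $1 ))"
}

# Decode a 32-bit client value into its ASCII tag, skipping zero bytes.
decode_client() {
    value=$(( $1 ))
    out=""
    for bits in 24 16 8 0; do
        byte=$(( (value >> bits) & 255 ))
        if [ "$byte" -ne 0 ]; then
            # Build an octal escape such as \124 and let printf expand it.
            out="$out$(printf "\\$(printf '%03o' "$byte")")"
        fi
    done
    printf '%s\n' "$out"
}

hex_to_dec 0x001E0E26      # FAULT_ADDR from the log
decode_client 0x54433500   # client value from the log
```

With the values from the log, these print 1969702 and TC5, which matches "at page 1969702" and "write from 'TC5'" in the last line.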

 

One of the GPUs stopped running at full speed, although it is still present in the system:

lspci | grep -i VGA

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c7)

02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c7)

05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c7)
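
To match the faults to one of the cards listed above, the PCI address can be pulled out of the kernel log. A small sketch, assuming the log lines have the usual dmesg form "amdgpu 0000:02:00.0: GPU fault detected"; the function name faulting_devices is hypothetical, and the sample line is the one quoted in this post.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the PCI address
# of every amdgpu device that reported a GPU fault, one address per line.
faulting_devices() {
    sed -n 's/.*amdgpu \([0-9a-f:.]*\): GPU fault detected.*/\1/p' | sort -u
}

# Example with the log line from this post.
faulting_devices <<'EOF'
[22.863042] amdgpu 0000:02:00.0: GPU fault detected: 147 0x01324402
EOF
```

On a live system the same function could be fed from the kernel log, for example `dmesg | faulting_devices`. lspci prints the address without the leading "0000:" domain, so 0000:02:00.0 corresponds to the 02:00.0 entry.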

 

I wrote a simple GPU test. Below is the short Lua program for Torch:

require('sys')
require('cltorch')
require('clnn')

local count = cltorch.getDeviceCount()

print("AMD device count: " .. count)

for i = 1, count do
        local time = sys.clock()

        local test = cltorch.test(i)

        -- elapsed time for this device, converted to milliseconds below
        time = sys.clock() - time

        print("\n==> time to test device " .. i .. ": " .. (time*1000) .. " ms")
end

Test execution time:

==> time to test device 1: 3923.1538772583 ms (problem GPU)

==> time to test device 2: 580.60598373413 ms (normal GPU)

==> time to test device 3: 569.08583641052 ms (normal GPU)

 

I found that the memory of the problem GPU is running at 300 MHz:

cat /sys/class/drm/card2/device/pp_dpm_mclk

0: 300Mhz *

1: 2000Mhz

 

I tried to force the higher memory clock level, but this did not affect the speed of the GPU:

echo 1 > /sys/class/drm/card1/device/pp_dpm_mclk
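
Since the card index under /sys/class/drm does not necessarily follow PCI bus order, it may help to list the active memory clock of every card together with its PCI address before writing to any of them. A rough sketch in POSIX shell; the function name show_active_mclk is hypothetical, and it assumes each amdgpu card exposes pp_dpm_mclk as shown above.

```shell
#!/bin/sh
# Hypothetical helper: for every card under a DRM sysfs root (default
# /sys/class/drm), print the card name, its PCI address, and the memory
# clock level currently marked with '*'.
show_active_mclk() {
    root="${1:-/sys/class/drm}"
    for f in "$root"/card*/device/pp_dpm_mclk; do
        [ -r "$f" ] || continue                   # skip entries without the file
        dev=$(dirname "$f")
        card=$(basename "$(dirname "$dev")")
        pci=$(basename "$(readlink -f "$dev")")   # device symlink resolves to the PCI node
        printf '%s (%s): %s\n' "$card" "$pci" "$(grep -F '*' "$f")"
    done
}

show_active_mclk

# To pin a level (needs root). This assumes the driver exposes
# power_dpm_force_performance_level and expects it to be "manual" first:
#   echo manual | sudo tee /sys/class/drm/card2/device/power_dpm_force_performance_level
#   echo 1      | sudo tee /sys/class/drm/card2/device/pp_dpm_mclk
```

The commented lines use tee because a plain `sudo echo 1 > file` performs the redirection as the unprivileged user. Whether the manual performance level is required on this driver version is an assumption worth checking against the driver documentation.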

 

 

My system:

Ubuntu 16.10 with kernel 4.8.0-46-generic

Swap: 18 GB

AMDGPU-PRO drivers: I used versions 16.40 and 16.60, now 17.10

8 GB system memory, 2400 MHz

SSD: AMD Radeon, 120 GB

Motherboard: MSI Z170A KRAIT GAMING 3X

PSU: 1500 W ATX Corsair AX1500i

AMD RX 480 GPU x 3 (two Sapphire Radeon RX 480 Nitro+ OC, one HIS Radeon RX 480 IceQ X2 Roaring OC)

It is a server machine, so no monitor is connected.

All GPUs are connected to the motherboard via PCI-e 1.0 risers.

 

What I tried to do:
Swapped the GPUs between slots: no effect.
Changed the risers: no effect.
Installed the new driver version 17.10: no effect.
Checked the GPU with Optiminer (a ZEC miner): the GPU works, but the speed is very low, 0.9 Sol/s.
Tried kernel 4.10 for Ubuntu: the drivers do not work correctly with it.

I installed the GPU in a system with Windows 10, and there it runs normally at full speed, both through the riser and in a PCI-e 3.0 slot.

 

Since the GPU works fine in Windows, and previously worked fine in this same system too, I think the problem is in the drivers.

I hope the community can help.
