I work on neural networks using CLtorch.
Until recently everything worked well with driver version 16.60.
Then the following messages suddenly appeared in the kernel log:
[22.863042] amdgpu 0000:02:00.0: GPU fault detected: 147 0x01324402
[22.863044] amdgpu 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR 0x001E0E26
[22.863045] amdgpu 0000:02:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05044002
[22.863047] amdgpu 0000:02:00.0: VM fault (0x02, vmid 2) at page 1969702, write from 'TC5' (0x54433500) (68)
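A side note on reading the fault lines above: the page number in the last message appears to be the VM_CONTEXT1_PROTECTION_FAULT_ADDR value written in decimal, and the hex word after 'TC5' is the same tag spelled as ASCII bytes. The sketch below checks both. The function names are made up for illustration, and the interpretation is inferred from the numbers in this log, not taken from driver documentation.

```shell
#!/bin/sh
# Convert a fault address such as 0x001E0E26 to its decimal page number.
fault_page() {
    printf '%d\n' "$1"
}

# Decode a client tag such as 0x54433500 into ASCII, skipping zero bytes.
# Assumes an even number of hex digits; returns 1 otherwise.
decode_tag() {
    hex=${1#0x}
    [ $(( ${#hex} % 2 )) -eq 0 ] || return 1
    out=
    while [ -n "$hex" ]; do
        byte=${hex%"${hex#??}"}      # first two hex digits
        hex=${hex#??}                # drop them from the remainder
        [ "$byte" = "00" ] && continue
        # hex byte -> octal escape -> character
        out="$out$(printf "\\$(printf '%03o' "0x$byte")")"
    done
    printf '%s\n' "$out"
}

fault_page 0x001E0E26    # prints 1969702
decode_tag 0x54433500    # prints TC5
```

So the two address values in the log describe the same location, and the hex tag adds no information beyond the 'TC5' name.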
One of the GPUs stopped running at full speed, although it is still detected by the system:
lspci | grep -i VGA
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c7)
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c7)
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480] (rev c7)
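The fault messages identify the GPU by PCI address (0000:02:00.0), while sysfs identifies it as cardN, so it helps to confirm which card number belongs to which address. A minimal sketch, assuming the usual /sys/class/drm/cardN/device symlink and GNU readlink; pci_of_card is a hypothetical helper name:

```shell
#!/bin/sh
# Print the PCI address behind a DRM card directory such as /sys/class/drm/card1.
pci_of_card() {
    basename "$(readlink -f "$1/device")"
}

# List every card found on this machine together with its PCI address.
for c in /sys/class/drm/card[0-9]; do
    [ -e "$c/device" ] || continue   # skip if the glob matched nothing
    printf '%s -> %s\n' "${c##*/}" "$(pci_of_card "$c")"
done
```

Whichever card maps to 0000:02:00.0 is the one whose sysfs files are worth inspecting.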
I wrote a simple GPU test. Below is the Lua program for Torch:
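require 'cltorch'
require 'sys'

count = cltorch.getDeviceCount()
print("Amd device quantity " .. count)
for i = 1, count do
  -- time the cltorch self-test on device i
  time = sys.clock()
  test = cltorch.test(i)
  time = sys.clock() - time
  -- note: the device index and the time are printed without a separator
  print("\n==> time to test device " .. i .. (time*1000) .. 'ms')
end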
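Test execution time (the script prints the device index immediately followed by the time, so device 1 took about 3923 ms, while devices 2 and 3 took about 581 ms and 569 ms):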
==> time to test device 13923.1538772583ms (Problem GPU)
==> time to test device 2580.60598373413ms (Normal GPU)
==> time to test device 3569.08583641052ms (Normal GPU)
I found that the memory of the problem GPU is running at 300 MHz:
0: 300Mhz *
I tried to force the required memory clock state, but this did not change the speed of the GPU:
echo 1 > /sys/class/drm/card1/device/pp_dpm_mclk
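To check whether the write took effect, the active state can be read back: in pp_dpm_mclk the line ending in an asterisk is the current one, as in the "0: 300Mhz *" line above. A rough sketch that prints the active memory clock for every card, assuming the same sysfs path as in the command above; active_mclk is a hypothetical helper name:

```shell
#!/bin/sh
# Read pp_dpm_mclk-style text on stdin and print the clock of the active
# state, i.e. the second field of the line that ends with an asterisk.
active_mclk() {
    awk '/\*[[:space:]]*$/ { print $2 }'
}

# Report the active memory clock of each card that exposes pp_dpm_mclk.
for f in /sys/class/drm/card[0-9]/device/pp_dpm_mclk; do
    [ -r "$f" ] || continue          # skip cards without this file
    printf '%s: %s\n' "$f" "$(active_mclk < "$f")"
done
```

Running it before and after the echo shows whether the driver accepted the new state or fell back to the 300 MHz one.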
System configuration:
Ubuntu 16.10 with kernel 4.8.0-46-generic
Swap - 18 GB
AMDGPU-PRO drivers: I used versions 16.40 and 16.60, now 17.10
8 GB system memory - 2400 MHz
SSD - AMD Radeon 120 GB
Motherboard - MSI Z170A KRAIT GAMING 3X
PSU 1500W ATX Corsair AX1500i
AMD RX 480 GPU x 3 (two Sapphire Radeon RX 480 Nitro+ OC, one HIS Radeon RX 480 IceQ X2 Roaring OC)
It is a server machine, so no monitor is connected.
All GPUs are connected to the motherboard via PCI-e 1.0 risers.
What I have tried:
Swapped the GPUs between slots - no effect.
Changed the risers - no effect.
Installed the new driver version 17.10 - no effect.
Checked the GPU with the Optimizer (ZEC miner) - the GPU works, but the speed is very low: 0.9 Sol/s.
Tried kernel 4.10 for Ubuntu - the drivers do not work correctly with it.
Installed the GPU in a system running Windows 10 - it runs normally, at full speed, both through the riser and directly in a PCI-e 3.0 slot.
Since the GPU works fine in Windows, and previously worked fine in this same system, I think the problem is in the drivers.
I hope the community can help.