I have several processes working with OpenCL. After some time, everything hung on clEnqueueWriteBuffer. I checked clinfo and one of the GPUs was missing from its output. My software could not find that GPU either.
lspci still showed both GPUs.
sensors showed a temperature of over 500 degrees and 0 RPM for that GPU.
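The values that sensors reports can also be read directly from sysfs, which makes it easy to log them until the next hang. This is only a minimal sketch: it assumes the driver exposes the standard hwmon files temp1_input (millidegrees Celsius) and pwm1 under each card, and gpu_snapshot with its optional root argument is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Print one line per GPU: card name, temperature in deg C, fan PWM (0-255).
# The optional first argument overrides the sysfs root (an assumption made
# here so the function can be pointed at a copy or a test tree).
gpu_snapshot() {
    drm_root="${1:-/sys/class/drm}"
    for hw in "$drm_root"/card[0-9]*/device/hwmon/hwmon[0-9]*; do
        # Skip unmatched globs and cards without a temperature sensor.
        [ -r "$hw/temp1_input" ] || continue
        card=${hw#"$drm_root"/}   # strip the root prefix
        card=${card%%/*}          # keep only "cardN"
        milli=$(cat "$hw/temp1_input")
        pwm=$(cat "$hw/pwm1" 2>/dev/null || echo "n/a")
        echo "$card temp=$((milli / 1000))C pwm=$pwm"
    done
}
```

Calling gpu_snapshot periodically and appending the output to a file would leave a record of the last readings before a GPU disappears.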
System:
Ubuntu 16.04, 2x RX 470, amdgpu-pro 17.10 driver
On the first attempts, the performance level was set to high and the fan speed was set to maximum (echo 255 > /sys/class/drm/card$i/device/hwmon/hwmon$i/pwm1).
On the last attempt it was left on auto. After that last hang, I ran everything on the remaining GPU alone, and it ran for 3 days without failing.
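For reference, the manual settings from the first attempts could be applied to every card roughly like this. It is a sketch under assumptions: the performance level is assumed to be set through the amdgpu sysfs file power_dpm_force_performance_level, pwm1_enable is written with 1 to switch the fan to manual control, and set_gpu_mode is a hypothetical helper name. The hwmon directory is matched with a glob because its index does not necessarily equal the card index.

```shell
#!/bin/sh
# Force a performance level and a fixed fan PWM on every GPU.
# Usage: set_gpu_mode LEVEL PWM [DRM_ROOT]   (needs root on a real system)
# DRM_ROOT defaults to /sys/class/drm; the override is only for dry runs.
set_gpu_mode() {
    level="$1"
    pwm="$2"
    drm_root="${3:-/sys/class/drm}"
    for dev in "$drm_root"/card[0-9]*/device; do
        # Skip entries that are not GPUs with DPM control.
        [ -w "$dev/power_dpm_force_performance_level" ] || continue
        echo "$level" > "$dev/power_dpm_force_performance_level"
        for hw in "$dev"/hwmon/hwmon[0-9]*; do
            [ -w "$hw/pwm1" ] || continue
            echo 1 > "$hw/pwm1_enable"   # 1 = manual fan control
            echo "$pwm" > "$hw/pwm1"     # 0-255, 255 = full speed
        done
    done
}
```

With this helper, set_gpu_mode high 255 would correspond to the configuration of the first attempts.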
Log from dmesg:
[ 9795.684331] [drm] Atomic commit: SET crtc id 0: [ffff880446777000]
[ 9795.684334] [drm] dc_commit_targets: 1 targets
[ 9795.684336] [drm] core_target 0x460ad060: stream_count=1
[ 9795.684338] [drm] core_stream 0x91ff5400: src: 0, 0, 1280, 1024; dst: 0, 0, 1280, 1024;
[ 9795.684340] [drm] pix_clk_khz: 108000, h_total: 1688, v_total: 1066
[ 9795.684341] [drm] sink name: SyncMaster, serial: 1146302775
[ 9795.684342] [drm] link: 3
[ 9795.685580] [drm] [Mode] [DVI][ConnIdx:3] {1280x1024, 1688x1066@108000Khz}^
[ 9795.685599] [drm] dc_pre_update_surfaces_to_target: commit 1 surfaces to target 0x460ad060
The last line repeats several times.
What could the problem be?