
One of two rx470 fell off OpenCL

Question asked by zarubkin on Jul 31, 2017
Latest reply on Aug 2, 2017 by dipak

I have several processes working with OpenCL. After some time, everything hung on clEnqueueWriteBuffer. I checked clinfo and one of the GPUs was missing from its output. One of the GPUs also wasn't found by the software.

lspci showed both of them.

sensors showed a temperature over 500 degrees and 0 RPM for that GPU.
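To cross-check what sensors reports, the raw hwmon values can be read straight from sysfs. The following is only a sketch: it assumes the amdgpu driver exposes temp1_input (in millidegrees Celsius, per the hwmon convention) under each card's hwmon directory, that there are fewer than ten cards, and the function name and the 120 degree plausibility threshold are made up for illustration.

```shell
#!/bin/sh
# Print the temperature of each GPU and flag readings that cannot be real.
# $1: sysfs root (defaults to /sys; overridable for dry runs, an assumption)
# $2: plausibility limit in degrees C (hypothetical default of 120)
check_gpu_temps() {
    root="${1:-/sys}"
    limit_c="${2:-120}"
    for f in "$root"/class/drm/card[0-9]/device/hwmon/hwmon*/temp1_input; do
        [ -r "$f" ] || continue          # skip unmatched globs and unreadable files
        milli=$(cat "$f")                # hwmon reports millidegrees Celsius
        c=$((milli / 1000))
        if [ "$c" -gt "$limit_c" ]; then
            echo "$f: ${c} C (implausible)"
        else
            echo "$f: ${c} C"
        fi
    done
}

check_gpu_temps "$@"
```

A reading above 500 degrees is not physically plausible, so a check like this would flag it as a bad read from the card rather than a real temperature.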

System:

Ubuntu 16.04, 2 x RX 470, amdgpu-pro 17.10 driver

 

On the first tries, the performance level was set to high and the fan speed was set to maximum (echo 255 > /sys/class/drm/card$i/device/hwmon/hwmon$i/pwm1)
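For reference, the fan and performance settings described above could be applied to every card with a loop like the one below. This is a sketch under assumptions: the hwmon index is not guaranteed to equal the card index, so it is globbed instead of reusing $i; pwm1_enable is set to 1 (manual control) before writing pwm1; power_dpm_force_performance_level is assumed to be the file used for the performance level; and the function name and the overridable sysfs root are hypothetical. It must be run as root to have any effect on real hardware.

```shell
#!/bin/sh
# Force maximum fan speed and the "high" performance level on every GPU.
# $1: sysfs root (defaults to /sys; overridable for dry runs, an assumption)
set_max_fan() {
    root="${1:-/sys}"
    for dev in "$root"/class/drm/card[0-9]/device; do
        [ -d "$dev" ] || continue
        # The hwmon number may differ from the card number, so glob it.
        for hw in "$dev"/hwmon/hwmon*; do
            [ -w "$hw/pwm1" ] || continue
            echo 1 > "$hw/pwm1_enable"   # 1 = manual fan control
            echo 255 > "$hw/pwm1"        # 255 = full duty cycle
        done
        if [ -w "$dev/power_dpm_force_performance_level" ]; then
            echo high > "$dev/power_dpm_force_performance_level"
        fi
    done
}

set_max_fan "$@"
```

Writing auto to power_dpm_force_performance_level and 2 to pwm1_enable would hand control back to the driver, which corresponds to the automatic setting used on the last try.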

On the last try, both were left on automatic. After that last failure I ran everything on the remaining GPU, and it ran for 3 days without failing.

 

Log from dmesg:

[ 9795.684331] [drm] Atomic commit: SET crtc id 0: [ffff880446777000]

[ 9795.684334] [drm] dc_commit_targets: 1 targets

[ 9795.684336] [drm] core_target 0x460ad060: stream_count=1

[ 9795.684338] [drm] core_stream 0x91ff5400: src: 0, 0, 1280, 1024; dst: 0, 0, 1280, 1024;

[ 9795.684340] [drm]     pix_clk_khz: 108000, h_total: 1688, v_total: 1066

[ 9795.684341] [drm]     sink name: SyncMaster, serial: 1146302775

[ 9795.684342] [drm]     link: 3

[ 9795.685580] [drm] [Mode]    [DVI][ConnIdx:3] {1280x1024, 1688x1066@108000Khz}^

[ 9795.685599] [drm] dc_pre_update_surfaces_to_target: commit 1 surfaces to target 0x460ad060

The last line repeats several times.

 

What could be the problem?
