1 Reply Latest reply on Aug 2, 2017 6:41 AM by dipak

    One of two rx470 fell off OpenCL

    zarubkin

      I have several processes using OpenCL. After some time, everything hung on clEnqueueWriteBuffer. I checked clinfo and one of the GPUs was missing. My software could not find that GPU either.

      lspci showed both of them.
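
The mismatch described here (a GPU present on the PCI bus but gone from OpenCL) can be checked with a small script. This is only a sketch: the helper names are hypothetical, and the grep patterns assume that `lspci -nn` labels the cards as VGA or display controllers with AMD's vendor ID 1002, and that clinfo prints one "Device Type" line containing "GPU" per GPU device. The exact clinfo layout varies between versions, so the pattern may need adjusting.

```shell
# Count AMD GPUs in `lspci -nn` output read from stdin.
# Assumption: GPUs appear as VGA/Display controllers with vendor ID 1002.
count_pci_amd_gpus() {
    grep -c -E '(VGA compatible controller|Display controller).*\[1002:' || true
}

# Count GPU devices in `clinfo` output read from stdin.
# Assumption: each GPU device has a "Device Type" line mentioning GPU.
count_opencl_gpus() {
    grep -c -E 'Device Type.*GPU' || true
}

# Compare the two counts and print a one-line summary.
report_missing() {
    local pci="$1" ocl="$2"
    if [ "$ocl" -lt "$pci" ]; then
        echo "$((pci - ocl)) GPU(s) visible on PCI but missing from OpenCL"
    else
        echo "all $pci GPU(s) visible to OpenCL"
    fi
}

# Example usage on a live system:
#   pci=$(lspci -nn | count_pci_amd_gpus)
#   ocl=$(clinfo | count_opencl_gpus)
#   report_missing "$pci" "$ocl"
```

Running this periodically while the OpenCL processes are active would show roughly when the card drops out, which could then be matched against timestamps in dmesg.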

      sensors showed a temperature of over 500 degrees and 0 RPM on that GPU.
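
A reading above 500 degrees together with 0 RPM is far outside what a working sensor normally reports, so it may mean the GPU stopped answering rather than that it was really that hot. The sketch below reads the raw hwmon values directly and flags such readings. It assumes the standard hwmon convention that `temp1_input` is in millidegrees Celsius. The helper names and the 90 and 150 degree thresholds are illustrative assumptions, not values taken from the driver.

```shell
# Classify a raw hwmon temperature given in millidegrees Celsius.
# Expects a non-negative integer; anything else is reported as unreadable.
# Thresholds (90 and 150 degrees C) are illustrative assumptions.
classify_temp() {
    local milli="$1" deg
    case "$milli" in
        ''|*[!0-9]*) echo "unreadable"; return 0 ;;
    esac
    deg=$((milli / 1000))
    if [ "$deg" -gt 150 ]; then
        echo "implausible"   # likely a dead or unresponsive sensor
    elif [ "$deg" -ge 90 ]; then
        echo "hot"
    else
        echo "ok"
    fi
}

# Print one line per GPU temperature sensor found under the DRM root.
report_gpu_temps() {
    local root="${1:-/sys/class/drm}" f milli
    for f in "$root"/card[0-9]/device/hwmon/hwmon*/temp1_input; do
        [ -r "$f" ] || continue
        milli=$(cat "$f" 2>/dev/null)
        echo "$f $(classify_temp "$milli")"
    done
}

# Example usage on a live system:
#   report_gpu_temps
```

Logging this output every few seconds before the hang would show whether the temperature climbs steadily beforehand or jumps straight to an impossible value.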

      System:

      Ubuntu 16.04, two RX 470 cards, amdgpu-pro 17.10 driver

       

      On the first attempts, the performance level was set to high and the fan speed was set to maximum (echo 255 > /sys/class/drm/card$i/device/hwmon/hwmon$i/pwm1)
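
The command above uses the same `$i` for the card and the hwmon directory, but the hwmon index is not guaranteed to match the card index. A minimal sketch that globs the hwmon directory instead is shown here. It assumes the amdgpu hwmon interface exposes `pwm1_enable` (where 1 selects manual fan control) next to `pwm1` (0 to 255). `set_fan_pwm` is a hypothetical helper name, and writing to sysfs requires root.

```shell
# Set a manual fan PWM value (0-255) on every GPU found under a DRM root.
# Prints the number of fans that were updated.
# Assumption: pwm1_enable=1 means manual control on this driver.
set_fan_pwm() {
    local value="$1"
    local root="${2:-/sys/class/drm}"
    local hwmon count=0

    # Validate the PWM value before touching sysfs.
    case "$value" in
        ''|*[!0-9]*) echo "invalid pwm value: $value" >&2; return 2 ;;
    esac
    if [ "$value" -gt 255 ]; then
        echo "invalid pwm value: $value" >&2
        return 2
    fi

    # Glob hwmon* rather than assuming hwmon$i matches card$i.
    for hwmon in "$root"/card[0-9]/device/hwmon/hwmon*; do
        [ -e "$hwmon/pwm1" ] || continue
        echo 1 > "$hwmon/pwm1_enable"   # switch to manual fan control
        echo "$value" > "$hwmon/pwm1"   # 255 = full speed
        count=$((count + 1))
    done
    echo "$count"
}

# Example usage on a live system (as root):
#   set_fan_pwm 255
```

If the printed count is lower than the number of installed cards, one of the fans was never actually set, which would be worth ruling out before blaming the hardware.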

      On the last attempt it was set to auto. After that failure, I ran everything on the remaining GPU, and it ran for 3 days without failing.

       

      Log from dmesg

      [ 9795.684331] [drm] Atomic commit: SET crtc id 0: [ffff880446777000]

      [ 9795.684334] [drm] dc_commit_targets: 1 targets

      [ 9795.684336] [drm] core_target 0x460ad060: stream_count=1

      [ 9795.684338] [drm] core_stream 0x91ff5400: src: 0, 0, 1280, 1024; dst: 0, 0, 1280, 1024;

      [ 9795.684340] [drm]     pix_clk_khz: 108000, h_total: 1688, v_total: 1066

      [ 9795.684341] [drm]     sink name: SyncMaster, serial: 1146302775

      [ 9795.684342] [drm]     link: 3

      [ 9795.685580] [drm] [Mode]    [DVI][ConnIdx:3] {1280x1024, 1688x1066@108000Khz}^

      [ 9795.685599] [drm] dc_pre_update_surfaces_to_target: commit 1 surfaces to target 0x460ad060

      The last line repeats several times.

       

      What could the problem be?

        • Re: One of two rx470 fell off OpenCL
          dipak

          Hi,

          From the description above, it's difficult to pinpoint anything specific at this point. As you mentioned, the sensor showed a high temperature (though I'm not sure about the 500 degrees), so overheating might be the issue here. The following information may help us understand the problem better:

          1. Have you observed the issue before, or only this time? In other words, does it occur frequently?

          2. Is the problem triggered by any particular application or process?

          3. After a hang, does a reboot get the system working again?

          4. You tried two performance settings. Did you observe any difference between them?

           

          Regards,