0 Replies Latest reply on Oct 3, 2017 7:45 AM by ianmcc

    FirePro W8100 GPU fault

    ianmcc

      Hi,

       

      I have a brand-new FirePro W8100, running under Ubuntu, which I bought for double-precision OpenCL calculations.

       

      Unfortunately, it is very unstable.  I have tried several combinations of drivers and kernels, and I consistently get errors in the log like:

       

      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: GPU fault detected: 147 0x096ac802                                                                                                                                             
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0011B328                                                                                                                                 
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A104002                                                                                                                                 
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM fault (0x02, vmid 5) at page 1159976, read from 'TC3' (0x54433300) (260)                                                                                                    
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: GPU fault detected: 147 0x014a0402                                                                                                                                             
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0004E107                                                                                                                                 
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A184002                                                                                                                                 
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM fault (0x02, vmid 5) at page 319751, read from 'TC5' (0x54433500) (388)                                                                                                     
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: GPU fault detected: 147 0x062a0402                                                                                                                                             
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x0004DD06                                                                                                                                 
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0A1C4002                                                                                                                                 
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM fault (0x02, vmid 5) at page 318726, read from 'TC7' (0x54433700) (452)                                                                                                     
      Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: GPU fault detected: 147 0x01eac402                                                                                                                                              

       

      I have found many similar reports via Google, but I haven't been able to find a definitive solution.  This error is usually unrecoverable: the process running the calculation freezes, and trying to kill that process locks up the whole machine.

       

      A calculation that seems to reliably trigger this problem is 'make alltuners' from the CLBlast project (GitHub: CNugteren/CLBlast, "Tuned OpenCL BLAS").
      It dies at the double-precision complex ZGEMM tuner (if it makes it that far).
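
      In case it helps anyone compare logs, here is a minimal Python sketch that pulls the fields out of the "VM fault" lines. It assumes the line format is exactly as in the log quoted above; only "read from" appears there, so accepting other access types is a guess. The function name parse_vm_faults and the dictionary keys are made up for illustration. In the quoted lines, the page number is the decimal form of VM_CONTEXT1_PROTECTION_FAULT_ADDR (0x0011B328 = 1159976), and the hex word after the client name is its ASCII encoding (0x54433300 = 'TC3').

```python
import re

# Regex built only from the "VM fault" lines quoted in this post.
# Assumption: other kernels/driver versions may print a different format.
VM_FAULT = re.compile(
    r"VM fault \((0x[0-9a-fA-F]+), vmid (\d+)\) at page (\d+), "
    r"(\w+) from '([^']*)' \(0x([0-9a-fA-F]{1,8})\) \((\d+)\)"
)


def parse_vm_faults(log_text):
    """Return one dict per 'VM fault' line found in log_text (hypothetical helper)."""
    faults = []
    for line in log_text.splitlines():
        m = VM_FAULT.search(line)
        if m is None:
            continue  # skip the "GPU fault detected" and register-dump lines
        client_word = int(m.group(6), 16)
        # The 32-bit word appears to be the client name packed as ASCII bytes,
        # padded with trailing zero bytes (0x54433300 -> b"TC3\x00").
        client_ascii = (
            client_word.to_bytes(4, "big").rstrip(b"\x00").decode("ascii", "replace")
        )
        faults.append(
            {
                "vmid": int(m.group(2)),
                "page": int(m.group(3)),
                "page_hex": hex(int(m.group(3))),
                "access": m.group(4),
                "client": m.group(5),
                "client_from_word": client_ascii,
                "client_id": int(m.group(7)),
            }
        )
    return faults


# Two lines copied from the log above, used as sample input.
sample = """\
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM fault (0x02, vmid 5) at page 1159976, read from 'TC3' (0x54433300) (260)
Sep 23 09:14:46 ryzen kernel: amdgpu 0000:08:00.0: VM fault (0x02, vmid 5) at page 319751, read from 'TC5' (0x54433500) (388)
"""

faults = parse_vm_faults(sample)
for f in faults:
    print(f["vmid"], f["page_hex"], f["access"], f["client"], f["client_id"])
```

      To run it on a real log, the text can be fed in from a saved copy of the kernel log, e.g. parse_vm_faults(open("kern.log").read()), where "kern.log" is just a placeholder file name.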