2 Replies Latest reply on May 17, 2017 1:14 AM by ekran

    GPU fault detected: 147 0x02007702

    ekran

      OS: CentOS 7.3 with properly installed AMD-APP-SDK-v3.0.130.136-GA-linux64.sh and amdgpu-pro-17.10-410326.tar.xz.

       

      I am running a cryptocurrency mining rig with 5 AMD GPUs (2x RX 580 and 3x RX 480). Mining starts perfectly and runs smoothly for a while (anywhere from a few hours to a few days), then boom! This shows up in the system log:

       

      May 13 13:13:06 agamemnon kernel: amdgpu 0000:04:00.0: GPU fault detected: 147 0x02007702

      May 13 13:13:06 agamemnon kernel: amdgpu 0000:04:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000C40

      May 13 13:13:06 agamemnon kernel: amdgpu 0000:04:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x0E077002

      May 13 13:13:06 agamemnon kernel: amdgpu 0000:04:00.0: VM fault (0x02, vmid 7) at page 3136, read from 'SDM0'
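
For anyone trying to read these four lines together: the last line appears to be a decoded summary of the two register values above it. The sketch below is only an illustration. The bit layout in `decode_fault_status` is an assumption (it is not stated anywhere in this thread), chosen because it reproduces the "0x02, vmid 7 ... read" values the kernel printed; the function name is hypothetical.

```python
# Register values copied from the log lines above.
FAULT_ADDR = 0x00000C40
FAULT_STATUS = 0x0E077002


def decode_fault_status(status):
    """Split the status word into fields.

    ASSUMPTION: the bit positions below are a guess that happens to match
    the decoded line in the log; treat them as illustrative only.
    """
    return {
        "protections": status & 0xFF,               # bits 0-7
        "memory_client_id": (status >> 12) & 0xFF,  # bits 12-19
        "access": "write" if (status >> 24) & 0x1 else "read",  # bit 24
        "vmid": (status >> 25) & 0xF,               # bits 25-28
    }


decoded = decode_fault_status(FAULT_STATUS)

# The fault address register holds the page number; the kernel prints it in decimal.
print("page:", FAULT_ADDR)
print("protections: 0x%02x" % decoded["protections"])
print("vmid:", decoded["vmid"])
print("access:", decoded["access"])
```

The page number is the one part that needs no assumption: 0xC40 in hex is 3136 in decimal, which is the page the last log line reports.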

       

      The mining stops and I have to reboot the rig in order to make things run again. Question is, which GPU is this? Is this a driver issue or could it be something wrong with my hardware?
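
On the "which GPU" part: each log line carries a PCI address (`0000:04:00.0` here), which identifies the slot the faulting device sits in rather than the card model. A rough sketch of pulling that address out of a log and looking up the device's IDs under `/sys/bus/pci/devices` follows. The helper names `faulting_devices` and `describe` are hypothetical, and the regular expression only assumes the line format shown in this post.

```python
import re
from pathlib import Path

# Matches the "GPU fault detected" line format quoted in this post.
FAULT_LINE = re.compile(
    r"amdgpu (?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): GPU fault detected"
)


def faulting_devices(log_text):
    """Return the unique PCI addresses that reported a GPU fault, in log order."""
    seen = []
    for match in FAULT_LINE.finditer(log_text):
        addr = match.group("addr")
        if addr not in seen:
            seen.append(addr)
    return seen


def describe(addr, sysfs_root=Path("/sys/bus/pci/devices")):
    """Read the PCI ID files for addr from sysfs; a value is None if the file is missing."""
    info = {}
    for name in ("vendor", "device", "subsystem_vendor", "subsystem_device"):
        path = Path(sysfs_root) / addr / name
        info[name] = path.read_text().strip() if path.is_file() else None
    return info


sample = (
    "May 13 13:13:06 agamemnon kernel: amdgpu 0000:04:00.0: "
    "GPU fault detected: 147 0x02007702\n"
)
for addr in faulting_devices(sample):
    print(addr, describe(addr))
```

With identical cards on risers, the IDs alone may not tell two cards apart, so the address is mainly useful for checking whether the fault always comes from the same slot.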

       

      I have seen people getting the same error elsewhere (of course I googled it first), but I haven't seen anyone offer a good solution.

       

      Any help or tips would be appreciated.

       

      Rig info:

      CPU: AMD Athlon II X4 860K Black Processor - 3.7 GHz
      RAM: Corsair 4GB DDR3 1600MHz Vengeance
      MB: ASRock FM2A88X+ BTC Motherboard - AMD A88X
      Disk: 1TB SATA Seagate
      GFX: Sapphire Radeon RX 580 8GB Pulse
      GFX: Sapphire Radeon RX 580 8GB Pulse
      GFX: Sapphire Radeon RX 480 4GB NITRO+
      GFX: Sapphire Radeon RX 480 4GB NITRO+
      GFX: Sapphire Radeon RX 480 4GB NITRO+
      PSU: XFX ProSeries XXX Edition 850W Bronze
      PSU: Corsair VS650, 650W PSU

      OS: CentOS 7.3
      Case: Custom + 5 USB/PCI-e Risers

        • Re: GPU fault detected: 147 0x02007702
          ali_d

          It looks like one of the graphics cards is defective. However, I am looking into this internally and will get back to you ASAP with a response.

          • Re: GPU fault detected: 147 0x02007702
            ekran

            Thanks. If it is a hardware problem, then I have narrowed it down to 2 of the RX 480s or their respective PCI-e risers. The hunt continues, but it is a bit of a slow process. Do you know if there is any test software for AMD cards on Linux? I am also a bit unsure what to tell the shop where I bought the card; I don't think I can just say 'This card will crash my machine after 1-2 days if I run it continuously doing crypto coin calculations, but it works fine as a "normal" graphics card.'