9 Replies Latest reply on May 10, 2012 2:59 PM by pwvdendr

    GPU stability test?

    pwvdendr

      Is there an OpenCL stability test available somewhere to check whether a GPU computes without errors? I think one of my HD7970s has a hardware fault (statistically unlikely events occur on it far too regularly), but I need to verify this with a standardized test before I can send it back to the factory. Linux only (although that shouldn't matter for OpenCL).

        • Re: GPU stability test?
          viscocoa

          The video card is an important part of a computer. A faulty video card is a serious problem and often causes the computer to stall. Considering that a GPU survives many heavy-duty 3D games, I guess the GPU is as stable as a CPU.

           

          When I started programming on GPUs, I also doubted their stability. Later, I found that most problems were caused by buggy kernels.

            • Re: GPU stability test?
              pwvdendr

              It is not my primary GPU, so those arguments don't work. The computer has 8 GPUs. And 7 out of 8 perform as expected, but one does weird things. Consistently. So I would need a formal test.
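One way to turn "7 out of 8 perform as expected" into a reproducible check is to run the same deterministic workload on every device and flag any device whose output differs from the majority. The sketch below shows only the comparison step in plain Python; the device names and byte buffers are hypothetical stand-ins for whatever each GPU actually returns, and it assumes the workload is bit-for-bit deterministic.

```python
import hashlib
from collections import Counter

def find_outlier_devices(outputs):
    """Return the names of devices whose output differs from the majority.

    `outputs` maps a device name to the raw bytes that device produced
    for the same deterministic workload (hypothetical data here).
    """
    digests = {name: hashlib.sha256(data).hexdigest()
               for name, data in outputs.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(name for name, d in digests.items() if d != majority_digest)

# Hypothetical example: eight devices, one of which flipped a single bit.
expected = bytes(range(256)) * 4
corrupted = bytearray(expected)
corrupted[100] ^= 0x01
outputs = {f"GPU{i}": expected for i in range(8)}
outputs["GPU6"] = bytes(corrupted)

print(find_outlier_devices(outputs))  # ['GPU6']
```

Repeating the run many times and logging which device is flagged each time gives a record that does not depend on disclosing the real algorithm.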

                • Re: GPU stability test?
                  kbrafford

                  If you move a known good GPU card into the slot now occupied by the questionable card, does that known good card still work?  And if you move the questionable card into a slot where a known good card has been known to work correctly, does the questionable card still fail?

                   

                  If so, send it back, you've demonstrated enough to get an RMA.

                   

                  If not, then it could be that slot of the motherboard has issues.  Or it could be that your power supply isn't rated for the demands of 8 HD7970s.  Perhaps it's always the 8th one that's going to have trouble?

                   

                  --Keith Brafford
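The swap test above can be written down as a small decision table. This is only one reading of the reasoning in the post: the function name and the three result labels are hypothetical, and the "inconclusive" branch is an added assumption for the mixed outcomes that the post does not spell out.

```python
def diagnose(good_card_ok_in_suspect_slot, suspect_card_fails_in_good_slot):
    """Classify the result of swapping the suspect card with a known good one.

    Returns one of three hypothetical labels:
      "card"          - the fault follows the card, so the card is suspect
      "slot_or_power" - the fault stays with the slot (or its power feed)
      "inconclusive"  - mixed result, e.g. an intermittent fault
    """
    if good_card_ok_in_suspect_slot and suspect_card_fails_in_good_slot:
        return "card"
    if not good_card_ok_in_suspect_slot and not suspect_card_fails_in_good_slot:
        return "slot_or_power"
    return "inconclusive"

# The fault moved with the card: grounds for an RMA.
print(diagnose(True, True))    # card
# The fault stayed in the slot: check the motherboard slot and power supply.
print(diagnose(False, False))  # slot_or_power
```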

                    • Re: GPU stability test?
                      pwvdendr

                      > If you move a known good GPU card into the slot now occupied by the questionable card, does that known good card still work?  And if you move the questionable card into a slot where a known good card has been known to work correctly, does the questionable card still fail?

                      >

                      > If so, send it back, you've demonstrated enough to get an RMA.

                      Not sure how I have demonstrated enough then. AMD doesn't know what tests I ran (and I don't want to disclose my algorithm for now). So that's why I'm looking for a standard test to confirm.

                      (But thanks for the hint, I'll try that too to be certain.)

                       

                      > Or it could be that your power supply isn't rated for the demands of 8 HD7970s.  Perhaps it's always the 8th one that's going to have trouble?

                      Always the 7th (GPU6). Dunno its physical location though, I believe somewhere in the middle. And I have 2x1200W, which is slightly below 300W per card, since the CPU is barely used... should be enough, right?
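The power budget can be checked with a few lines of arithmetic. The two PSU ratings and the card count come from the post; the per-card board power (250 W) and the draw of the rest of the system (150 W) are assumptions made for illustration only and should be replaced with measured or vendor-specified values.

```python
PSU_COUNT = 2
PSU_WATTS = 1200
GPU_COUNT = 8

# Assumptions, not taken from the thread:
ASSUMED_BOARD_POWER_W = 250      # nominal per-card board power
ASSUMED_REST_OF_SYSTEM_W = 150   # CPU, motherboard, drives, fans

total_supply_w = PSU_COUNT * PSU_WATTS                       # 2400
per_card_ceiling_w = total_supply_w / GPU_COUNT              # 300.0
available_per_card_w = (total_supply_w - ASSUMED_REST_OF_SYSTEM_W) / GPU_COUNT
headroom_w = available_per_card_w - ASSUMED_BOARD_POWER_W

print(f"total supply:       {total_supply_w} W")
print(f"ceiling per card:   {per_card_ceiling_w} W")
print(f"available per card: {available_per_card_w} W")
print(f"headroom per card:  {headroom_w} W")
```

A total-wattage figure like this says nothing about how the cards are split across the two supplies or about per-rail and per-connector limits, so a positive headroom here does not rule out a power problem on one particular card.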

                • Re: GPU stability test?
                  mikism

                  Although this is a memory test, you can try MemtestCL from Stanford:

                  http://folding.stanford.edu/English/DownloadUtils

                   

                  I just ran it yesterday on both of my 6850s and found that both of them produce errors in the "Random Blocks" test. However, I can't believe that both cards are faulty, as I can run all the latest games for hours without any glitches. But then again, when I ran the same test on a CPU, it detected no errors, so the test itself should be fine.
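For readers unfamiliar with how a pattern-based memory test works in general, here is a minimal sketch of the write/read-back idea. It is not MemtestCL's implementation: the `StuckBitMemory` class is a hypothetical stand-in for a device buffer, with an optional simulated fault where bit 0 of one cell always reads back as 1.

```python
import random

class StuckBitMemory:
    """Hypothetical buffer; if bad_index is set, bit 0 of that cell reads as 1."""
    def __init__(self, size, bad_index=None):
        self.cells = bytearray(size)
        self.bad_index = bad_index

    def write(self, data):
        self.cells[:] = data

    def read(self):
        out = bytearray(self.cells)
        if self.bad_index is not None:
            out[self.bad_index] |= 0x01  # simulated stuck-at-1 fault
        return bytes(out)

def pattern_test(memory, size, seed=0):
    """Write a seeded random pattern and its complement; return bad indices."""
    rng = random.Random(seed)
    pattern = bytes(rng.randrange(256) for _ in range(size))
    inverted = bytes(b ^ 0xFF for b in pattern)
    bad = set()
    for data in (pattern, inverted):
        memory.write(data)
        readback = memory.read()
        bad.update(i for i in range(size) if readback[i] != data[i])
    return sorted(bad)

print(pattern_test(StuckBitMemory(64, bad_index=10), 64))  # [10]
print(pattern_test(StuckBitMemory(64), 64))                # []
```

Writing both the pattern and its complement means every bit is tested as a 0 and as a 1, which is why the stuck bit is caught regardless of the random seed. Games can tolerate an occasional wrong bit in a texture or vertex buffer without visible glitches, whereas a test like this reports every mismatch.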

                   

                  My own OpenCL kernel (FEM based elastic wave model) does weird things when working on a bigger model (a grid of 96x96 elements and larger). Strangely enough, the same thing happens when I run the same kernel on the CPU. So this probably indicates that there is something wrong with my kernel, as suggested (the APP SDK samples seem to work fine). However, I'm wondering whether the problems might also be caused by some glitch in the runtime (a bug or a broken installation)?
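When a kernel misbehaves only above a certain problem size, and does so on both CPU and GPU, a size sweep against a slow but trusted reference can pin down where it starts to go wrong. The sketch below is plain Python with a made-up bug (a fixed-size scratch limit) chosen purely for illustration; it is not a diagnosis of the kernel described above, and all function names are hypothetical.

```python
import math

def first_failing_size(candidate, reference, sizes, rel_tol=1e-9):
    """Return the first n in `sizes` where candidate(n) disagrees with reference(n)."""
    for n in sizes:
        got, want = candidate(n), reference(n)
        same = len(got) == len(want) and all(
            math.isclose(g, w, rel_tol=rel_tol, abs_tol=1e-12)
            for g, w in zip(got, want)
        )
        if not same:
            return n
    return None

SCRATCH_CELLS = 95 * 95  # hypothetical hard-coded scratch size

def reference_row_sums(n):
    """Trusted reference: sum of cell indices in each row of an n x n grid."""
    return [float(sum(i * n + j for j in range(n))) for i in range(n)]

def buggy_row_sums(n):
    """Illustrative bug: silently ignores cells beyond SCRATCH_CELLS."""
    sums = [0.0] * n
    for cell in range(min(n * n, SCRATCH_CELLS)):
        sums[cell // n] += cell
    return sums

print(first_failing_size(buggy_row_sums, reference_row_sums,
                         [32, 64, 95, 96, 128]))  # 96
```

If the first failing size is the same on every device and every run, the kernel or host code is the likelier culprit; results that vary between runs or devices point more toward the hardware or runtime.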