
pwvdendr
Adept II

GPU stability test?

Is there an OpenCL stability test available somewhere, to check whether a GPU can compute without errors? I think one of my HD7970s has a hardware fault (an implausibly high number of statistically unlikely events occurs on a regular basis), but I need to verify this with some standardized test before I can send it back to the factory. Linux only (although that shouldn't matter for OpenCL).

0 Likes
9 Replies
viscocoa
Adept I

The video card is an important part of a computer. A faulty video card is a serious problem and often causes the computer to hang. Considering that GPUs survive many demanding 3D games, I guess a GPU is as stable as a CPU.

When I started programming on GPUs, I also doubted their stability. Later, I found that most problems were caused by a buggy kernel.

0 Likes

It is not my primary GPU, so those arguments don't work. The computer has 8 GPUs. And 7 out of 8 perform as expected, but one does weird things. Consistently. So I would need a formal test.
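One way to make "7 out of 8 agree, one doesn't" more formal, as a rough sketch: run the same deterministic workload on every card, reduce each result to a checksum, and flag whichever card disagrees with the majority. The function name and the checksum values below are made up for illustration; the actual OpenCL run that would produce the checksums is not shown.

```python
from collections import Counter

def find_outliers(results):
    """Return the names of devices whose result differs from the majority result."""
    counts = Counter(results.values())
    majority_value, _ = counts.most_common(1)[0]
    return sorted(name for name, value in results.items() if value != majority_value)

# Hypothetical checksums from one deterministic workload run on 8 devices.
checksums = {f"GPU{i}": 0xDEADBEEF for i in range(8)}
checksums["GPU6"] = 0xDEADBEEE  # simulated single-bit error on one card

print(find_outliers(checksums))
```

Repeating the run many times and counting how often each card ends up in the outlier list would separate a consistent fault from a one-off glitch.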

0 Likes

If you move a known good GPU card into the slot now occupied by the questionable card, does that known good card still work?  And if you move the questionable card into a slot where a known good card has been known to work correctly, does the questionable card still fail?

If so, send it back, you've demonstrated enough to get an RMA.

If not, then it could be that slot of the motherboard has issues.  Or it could be that your power supply isn't rated for the demands of 8 HD7970s.  Perhaps it's always the 8th one that's going to have trouble?
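The swap test above boils down to a small decision table. Here is a minimal sketch of it; the function name and the "inconclusive" branch for mixed outcomes are my own additions, not part of the advice above.

```python
def diagnose(good_card_ok_in_suspect_slot, suspect_card_fails_in_good_slot):
    """Interpret the two swap-test observations."""
    if good_card_ok_in_suspect_slot and suspect_card_fails_in_good_slot:
        # The fault follows the card, not the slot.
        return "card"
    if not good_card_ok_in_suspect_slot and not suspect_card_fails_in_good_slot:
        # The fault stays with the slot position: motherboard slot or power delivery.
        return "slot or power"
    # Mixed outcomes need more testing (assumption: not covered by the advice above).
    return "inconclusive"

print(diagnose(True, True))
print(diagnose(False, False))
```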

--Keith Brafford

0 Likes

If you move a known good GPU card into the slot now occupied by the questionable card, does that known good card still work?  And if you move the questionable card into a slot where a known good card has been known to work correctly, does the questionable card still fail?

If so, send it back, you've demonstrated enough to get an RMA.

Not sure how I have demonstrated enough then. AMD doesn't know what tests I ran (and I don't want to disclose my algorithm for now). So that's why I'm looking for a standard test to confirm.

(But thanks for the hint, I'll try that too to be certain.)

Or it could be that your power supply isn't rated for the demands of 8 HD7970s.  Perhaps it's always the 8th one that's going to have trouble?

Always the 7th (GPU6). Dunno its physical location though, I believe somewhere in the middle. And I have 2x1200W, which leaves slightly below 300W per card, since the CPU is barely used... should be enough, right?
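For reference, the back-of-the-envelope arithmetic behind that figure, as a small sketch. The 100 W for the rest of the system and the 250 W per-card draw are placeholder assumptions, not measurements; the per-card figure should be checked against the card's specification.

```python
def per_card_budget_watts(psu_watts, num_cards, other_load_watts=0.0):
    """Average power available per card after subtracting the non-GPU load."""
    return (sum(psu_watts) - other_load_watts) / num_cards

# 2 x 1200 W supplies, 8 cards, assumed 100 W for CPU, board, and drives.
budget = per_card_budget_watts([1200, 1200], 8, other_load_watts=100)

ASSUMED_CARD_DRAW_WATTS = 250  # placeholder per-card draw, to be verified
headroom = budget - ASSUMED_CARD_DRAW_WATTS

print(budget, headroom)
```

This only checks the total. It says nothing about how the load is split across the two supplies or their individual rails, which would also matter for a card in one particular position.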

0 Likes
mikism
Adept I

Although this is a memory test, you can try MemtestCL from Stanford:

http://folding.stanford.edu/English/DownloadUtils

I just ran it yesterday on both of my 6850s and found that both of them produce errors in the "Random Blocks" test. However, I can't believe that both cards are faulty, as I can run all the latest games for hours without any glitches. But then again, when I ran the same test on a CPU, it detected no errors, so the test itself should be fine.

My own OpenCL kernel (FEM based elastic wave model) does weird things when working on a bigger model (a grid of 96x96 elements and larger). Strangely enough, the same thing happens when I run the same kernel on the CPU. So this probably indicates that there is something wrong with my kernel, as suggested (the APP SDK samples seem to work fine). However, I'm wondering whether the problems may also be caused by some glitch in the runtime (a bug or a broken installation)?
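To illustrate the general idea behind a random-pattern memory test (write pseudo-random data block by block, read it back, count mismatches), here is a host-side sketch on a plain bytearray. It is not MemtestCL's actual algorithm and does not touch the GPU; the function names and the fault injection hook are hypothetical.

```python
import random

def random_block_test(memory, block_size, seed, inject_fault=None):
    """Fill memory with seeded pseudo-random bytes, then count read-back mismatches."""
    rng = random.Random(seed)
    expected = bytearray(rng.getrandbits(8) for _ in range(len(memory)))
    # Write phase: copy the pattern into memory one block at a time.
    for start in range(0, len(memory), block_size):
        memory[start:start + block_size] = expected[start:start + block_size]
    # Optional hook that simulates a hardware fault between write and read.
    if inject_fault is not None:
        inject_fault(memory)
    # Verify phase: compare every byte against the expected pattern.
    return sum(1 for got, want in zip(memory, expected) if got != want)

def flip_one_bit(memory):
    memory[10] ^= 0x01  # simulated single-bit error

healthy_errors = random_block_test(bytearray(4096), 256, seed=1)
faulty_errors = random_block_test(bytearray(4096), 256, seed=1, inject_fault=flip_one_bit)
print(healthy_errors, faulty_errors)
```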

Although this is a memory test, you can try MemtestCL from Stanford:

http://folding.stanford.edu/English/DownloadUtils

Oowh yeah now we're talking! Finally at least "something" that works on Linux. I'll test it out asap, thx!

0 Likes

Interestingly, even in the first test with 50 iterations, all my HD7970 cards give errors in every single iteration of the random blocks test and in no other tests. My HD5450 at home yields errors in 3/50 iterations of that test and none in other tests. So I'm afraid that it is indeed bugged...

0 Likes

It seems to be a bug, I am having similar results. See:

http://devgurus.amd.com/message/1280938

0 Likes

yurtesen wrote:

It seems to be a bug, I am having similar results. See:

http://devgurus.amd.com/message/1280938

I'll reply here since the other topic seems broken (I get a note "Currently Being Moderated" and my post is not visible once I log out).

Edit: and it happens here as well...

<test test>

Edit 2: now it works. Apparently something in my message triggers a moderation alert... I'll try again somewhat later.

0 Likes