
yurtesen
Miniboss

memtestCL-1.00-linux64 Random blocks errors on Tahiti

I am running memtestCL-1.00-linux64 on Tahiti and it is giving Random blocks errors. Is this because of a bug in AMD's OpenCL SDK?

0 Likes
37 Replies
mikism
Adept I

The same test fails for me on Win7 x64 for both of my 6850s (on the CPU it does not detect any errors).

0 Likes
nou
Exemplar

It is the same for my 5850. Sometimes it reports around 950 errors in the Random blocks test.

0 Likes

The number I am getting on Tahiti is crazy high. Over a million errors in each iteration.

I get no errors on the 5870, the 6320 (E-450 APU GPU, tested under Windows), and NVIDIA GTX 580/680.

0 Likes

I get no errors on the 5870, the 6320 (E-450 APU GPU, tested under Windows), and NVIDIA GTX 580/680.

No errors or very few? My HD5450 gave 400-500 errors in 3 out of 50 iterations (and none in the other 47). Given that it's always this one test, and everyone seems to be getting errors in the very same test (and never in any other test), it might also be a problem in the test itself. How extensively have you tested on the NVIDIA GTX 580/680?

0 Likes

I ran the test several times, with the default memory setting (256), and also tried the 1024 command-line option. In all cases I got 0 errors on Cypress XT (5870).

I also ran the tests at least once or twice to the end (50 iterations) on the 580 with the 1024 command-line option, and I can tell from nvidia-smi that it is allocating the memory; it shows 1171 MB allocated. I should still re-test the 680 to the end, though; I never waited for it to run 50 iterations.

If I were you, I would use the aticonfig utility to reduce the core and memory clocks, re-test, and see whether the problem still occurs. There might be borderline cases where your memory actually fails in some operations. Let us know if you get the same results. I tried this with Tahiti and it still produces millions of errors, which sort of makes me think there is a bug in AMD's OpenCL.

0 Likes

Correction: I re-ran the tests with different sizes and it appears I sometimes do get errors from the 5870.
The default 128 gave no errors (but I did not run it many times), 256 gave errors once in a while after a few runs, 512 gave errors, and 1024 also gave errors in a few iterations...

I reduced the memory speed from 1200 to 600 and then ran 1024 again. I still got errors...

On the NVIDIA GTX 580, I ran the tests a few times up to 1024 and got no errors reported.

I will come back with results from the 680; NVIDIA seems to be having their own problems, the 680 is stuck and the machine needs a reboot.

Also, the tests take roughly 8 to 10 times longer on the GTX 580 compared to Cypress; Tahiti is only 3-4 times slower.

0 Likes

I ran a test with 10000 iterations using 2.5 GB of memory on an NVIDIA Tesla M2050. I think AMD should fix this problem...

Test summary:

-----------------------------------------

10000 iterations over 2504 MiB of memory on device Tesla M2050

      Moving inversions (ones and zeros): 0 failed iterations

                                         (0 total incorrect bits)

                 Memtest86 walking 8-bit: 0 failed iterations

                                         (0 total incorrect bits)

              True walking zeros (8-bit): 0 failed iterations

                                         (0 total incorrect bits)

               True walking ones (8-bit): 0 failed iterations

                                         (0 total incorrect bits)

              Moving inversions (random): 0 failed iterations

                                         (0 total incorrect bits)

             True walking zeros (32-bit): 0 failed iterations

                                         (0 total incorrect bits)

              True walking ones (32-bit): 0 failed iterations

                                         (0 total incorrect bits)

                           Random blocks: 0 failed iterations

                                         (0 total incorrect bits)

                     Memtest86 Modulo-20: 0 failed iterations

                                         (0 total incorrect bits)

                           Integer logic: 0 failed iterations

                                         (0 total incorrect bits)

                 Integer logic (4 loops): 0 failed iterations

                                         (0 total incorrect bits)

            Integer logic (local memory): 0 failed iterations

                                         (0 total incorrect bits)

   Integer logic (4 loops, local memory): 0 failed iterations

                                         (0 total incorrect bits)

Final error count: 0 errors

0 Likes

So step 1 is fetching the source code.

https://simtk.org/project/xml/downloads.xml?group_id=385#package_id906 offers it, but you need a SimTK account. Does anyone have a SimTK account, and can they post it here in a way that needs no such external accounts?

0 Likes

You know it is free to get an account, right? Here it is...

0 Likes

yurtesen wrote:

You know it is free to get an account, right? Here it is...

Oh? I didn't. It doesn't make sense to restrict it to registered users then. Thanks anyway. I just looked at the source, and it seems clear and readable, and there are no backdoor tricks that should behave differently between platforms.

I hope now someone from AMD can investigate this RandomBlocks test and see if they can reproduce the errors we experience:

* errors in 2-15 percent of iterations, a few hundred errors per iteration, on all tested pre-GCN devices, even brand new ones;

* errors in 100 percent of iterations, millions (or billions) of errors per iteration, on all tested HD7970s.

0 Likes

Ahh, I finally get why my post didn't get through the first 10+ attempts. I added "so the ba_ll's in AMD's camp now" (without the _) at the end of the first paragraph, and that was the culprit. Well, talk about oversensitive moderation flagging...

0 Likes

I am not sure why they want you to register... it makes no sense to me; also, the code is LGPL licensed... Anyway...

I am getting exactly the same results as you do on GCN and pre-GCN devices.

0 Likes

And then it was silent... perhaps we should make a reduced package with just that 1 test, and post that as a new topic?

0 Likes

It is not our problem if AMD doesn't care that their products do not function. We tried to raise some voice at least...

I have several problems with AMD's OpenCL; in one case it even crashes when building a kernel:

http://devgurus.amd.com/thread/158889

or the new Tahiti card I have here crashes after several kernel enqueues:

http://devgurus.amd.com/thread/159073

or the Open64 compiler won't function properly:

http://devgurus.amd.com/thread/159192

And sure enough, OpenCL does not work well on Linux. AMD should at least get the cards to function without X running on them! Or we must be able to reset the card without needing to reboot the whole system.

In each case, nobody seems to be interested (I even tried to report at least the OpenCL build crash through AMD's support system as well, but got no feedback).

I am not surprised that nothing is working properly, since these problems are not getting fixed (well, to be honest, I could compile mpich2 with Open64; I was surprised that it managed to compile a whole program!). As you can see, I am quite disappointed in this situation.

0 Likes

It is not our problem if AMD doesn't care that their products do not function. We tried to raise some voice at least...

I don't know your situation, but as an owner of, and a programmer on, AMD cards, it is my problem. Not that that solves the problem, of course.

I have several problems with AMD's OpenCL; in one case it even crashes when building a kernel:

http://devgurus.amd.com/thread/158889

or the new Tahiti card I have here crashes after several kernel enqueues:

http://devgurus.amd.com/thread/159073

or the Open64 compiler won't function properly:

http://devgurus.amd.com/thread/159192

None of your topics seems to contain enough information to reproduce the issue on their side, so that is different from this topic. The last time I reported a bug it was addressed within days, but apparently they're no longer following this topic.

0 Likes

I think the responses might depend on the type of bug reported. If it is easy to fix, they fix it right away. If it is a little bit difficult, they probably silently ignore it. Sure enough, I can't expect them to fix every bug at the same speed, but I would at least have wanted to know that they are trying.

What do you mean? For the topic with the segmentation fault at kernel build, I have suggested giving them the kernel code (and that's all they need, since it also crashes the KernelAnalyzer), but they didn't tell me how I can submit it without putting it on the forum. It crashes at the compilation stage!

About the Open64 compiler: I was simply compiling the ATLAS library (open source), the output is there, including which directory it crashes in etc., and it clearly prints an internal error message. What else can I provide? (And if there is something else I can provide, then AMD can feel free to ask!)

About the kernel enqueuing problem: I would also happily provide the code if I could somehow submit it. I have a basic program which loads some data, detects devices, and enqueues a kernel. I couldn't get any simpler than that, unfortunately. The program works properly everywhere else, even on pre-Tahiti cards (tested on a 5870) and using the Intel and NVIDIA SDKs on several different machines. Yet I don't know how I can provide it to AMD without uploading the whole thing to the forum (although the code is a few KB, there are data files which are loaded into memory).

And here is the memtest problem: I have even provided the source code, and it is confirmed by several people. Why is AMD silent?

0 Likes

It is not our problem if AMD doesn't care that their products do not function. We tried to raise some voice at least...

The problem seems to be a bug in memtestCL.exe, in a kernel that writes blocks of random values to memory. The kernel has a main loop where each of the 256 work items in a workgroup generates a random value and stores it in a local memory block. Then each work item reads that value back from local memory and writes it to global memory. In the loop, all work items read the same random value from local memory location #255 to use as the seed for the next iteration.

The problem is a not-so-obvious missing local barrier needed after the line

      seed = randomBlock[blockDim -1];     // blockDim=256

At first glance it looks like the barrier after writing randomBlock[threadIdx] should be enough, because there are no other local memory writes in the loop.

The problem occurs when all threads read the same value at randomBlock[255]. If thread #255 (that is, its wave) reads the value first while the others are still waiting, it has a free execution path all the way to the next write of randomBlock[255], and so it can overwrite the value before the slower threads/waves have read it.

In GCN, each wave executes on a single SIMD; I'm guessing this makes for more flexible execution paths, so the bug is most obvious on Tahiti because...

................. yes ....... GCN is so powerful!

__kernel void deviceWriteRandomBlocks(__global uint* base, uint N, int seed, __local uint* randomBlock) {
    if (seed == 0) seed = 123459876 + blockIdx;
    uint bitSeed = deviceRan0p(seed + threadIdx, threadIdx);
    for (uint i = 0; i < N; i++) {
        // Generate block of random numbers in parallel
        randomBlock[threadIdx] = deviceRan0p(seed, threadIdx) |
                (deviceIrbit2(&bitSeed) << 31);
        barrier(CLK_LOCAL_MEM_FENCE);
        // Set the seed for the next round to the last number
        // calculated in this round
        seed = randomBlock[blockDim-1];
//=============================================
        barrier(CLK_LOCAL_MEM_FENCE);      //! ADD EXTRA LOCAL BARRIER HERE, so every
                                           //  work item has read the shared seed before
                                           //  it is overwritten in the next iteration
//=============================================
        // Blit shmem block out to global memory
        *(THREAD_ADDRESS(base,N,i)) = randomBlock[threadIdx];
    }
}

After making this change the program runs fine on Tahiti. I use MinGW on Windows, which is not a supported environment for building memtestCL, but if anyone wants the binary, just let me know.

memtestCL is copyrighted by Stanford U., where a lot of the early GPU development was done. It might be interesting to hear whether they have any thoughts or comments on GCN. There is a feedback channel for memtestCL through SimTK.org, where it comes from.

drallan

drallan, thank you very much for your effort.

On another note, I have a simple kernel which causes Tahiti to crash (after several enqueues); if you are interested in having a look, let me know.

0 Likes

On another note, I have a simple kernel which causes Tahiti to crash (after several enqueues); if you are interested in having a look, let me know.

Hi yurtesen,

If you put it in a new thread then everyone can have a look. If you can, zip up your source and attach it.

By the way, what driver version are you using?

drallan

0 Likes

I am using Catalyst 12.4 and APP SDK 2.7 right now, but I had this problem with Catalyst 12.3 and SDK 2.6 also. The strange thing is that the program works fine on Cypress and nearly everything else that can run OpenCL.

There are 2 problems:

1. The input files total a little over 100 MB in size.

2. I would rather send a download link to whoever wants to look at the code than make it publicly available (although there doesn't seem to be a private message function in this forum). Any suggestions on how to accomplish this?

0 Likes

yurtesen,

There are 2 problems:

1. The input files total a little over 100 MB in size.

2. I would rather send a download link to whoever wants to look at the code than make it publicly available (although there doesn't seem to be a private message function in this forum). Any suggestions on how to accomplish this?

Have you tried making a copy of everything and reducing the copy until the error disappears? If you can make a package which reproduces the error with minimal other things around it, you have a triple win:

1) The file size will be much smaller (probably closer to 100 KB than 100 MB).

2) You will not give away unnecessary details about what you are doing.

3) Due to its much smaller size, it becomes more likely that AMD will attempt to fix it.

0 Likes

drallan, please take a look at this thread: http://devgurus.amd.com/message/1280958#1280958

At first glance, the compiler emits incorrect code.

0 Likes

Hello drallan,

After a long pause, I managed to put up a thread for everybody to see, as you requested.

If you accept the challenge, the program sources are included:

http://devgurus.amd.com/thread/159588

Good luck, and thanks,

Evren

0 Likes

Drallan, you are awesome.

Have you reported this to the memtestCL authors?

0 Likes

I reported it, and drallan is indeed awesome. This is the reply I got from the author:

Thanks for bringing this to my attention; I hadn't seen the thread before. This actually answers a longstanding question with ATI performance on this test, which my contact at ATI had previously ascribed to a different source (see slide 18 in http://cs.stanford.edu/people/ihaque/talks/gpuser_lacss_oct_2010.pdf).

I'll get on fixing this.

I started a long test with the largest amount of memory I could allocate on both Cypress and Tahiti. So far the results are not bad:

Test iteration 1473 on 1032 MiB of memory on device 0 (Cypress): 0 errors so far

        Moving Inversions (ones and zeros): 0 errors (129 ms)

        Moving Inversions (random): 0 errors (129 ms)

        Memtest86 Walking 8-bit: 0 errors (1031 ms)

        True Walking zeros (8-bit): 0 errors (516 ms)

        True Walking ones (8-bit): 0 errors (513 ms)

        Memtest86 Walking zeros (32-bit): 0 errors (2040 ms)

        Memtest86 Walking ones (32-bit): 0 errors (2042 ms)

        Random blocks: 0 errors (452 ms)

        Memtest86 Modulo-20: 0 errors (5090 ms)

        Logic (one iteration): 0 errors (72 ms)

        Logic (4 iterations): 0 errors (129 ms)

        Logic (local memory, one iteration): 0 errors (122 ms)

        Logic (local memory, 4 iterations): 0 errors (318 ms)

Test iteration 1877 on 2780 MiB of memory on device 1 (Tahiti): 0 errors so far

        Moving Inversions (ones and zeros): 0 errors (107 ms)

        Moving Inversions (random): 0 errors (110 ms)

        Memtest86 Walking 8-bit: 0 errors (874 ms)

        True Walking zeros (8-bit): 0 errors (438 ms)

        True Walking ones (8-bit): 0 errors (438 ms)

        Memtest86 Walking zeros (32-bit): 0 errors (1742 ms)

        Memtest86 Walking ones (32-bit): 0 errors (1729 ms)

        Random blocks: 0 errors (257 ms)

        Memtest86 Modulo-20: 0 errors (4023 ms)

        Logic (one iteration): 0 errors (55 ms)

        Logic (4 iterations): 0 errors (55 ms)

        Logic (local memory, one iteration): 0 errors (56 ms)

        Logic (local memory, 4 iterations): 0 errors (78 ms)

0 Likes

Yurtesen,

Thanks for reporting this to the memtestCL authors; I was not sure where to respond.

Their answer was quite interesting. The bug could in principle affect other architectures, because it results in a race condition, which is hardware dependent.

drallan

0 Likes

Well, I targeted 10k iterations and left memtestCL running, but I think I will have to cancel it because I need the cards. However, the results were great.

Test iteration 3997 on 1032 MiB of memory on device 0 (Cypress): 0 errors so far

        Moving Inversions (ones and zeros): 0 errors (128 ms)

        Moving Inversions (random): 0 errors (129 ms)

        Memtest86 Walking 8-bit: 0 errors (1030 ms)

        True Walking zeros (8-bit): 0 errors (515 ms)

        True Walking ones (8-bit): 0 errors (514 ms)

        Memtest86 Walking zeros (32-bit): 0 errors (2044 ms)

        Memtest86 Walking ones (32-bit): 0 errors (2045 ms)

        Random blocks: 0 errors (454 ms)

        Memtest86 Modulo-20: 0 errors (5084 ms)

        Logic (one iteration): 0 errors (73 ms)

        Logic (4 iterations): 0 errors (128 ms)

        Logic (local memory, one iteration): 0 errors (120 ms)

        Logic (local memory, 4 iterations): 0 errors (319 ms)

This Tahiti is actually an overclocked MSI card (although the memory is not overclocked?) and it still works great:

http://www.msi.com/product/vga/R7970-2PMD3GD5-OC.html

Test iteration 5064 on 2780 MiB of memory on device 1 (Tahiti): 0 errors so far

        Moving Inversions (ones and zeros): 0 errors (107 ms)

        Moving Inversions (random): 0 errors (107 ms)

        Memtest86 Walking 8-bit: 0 errors (868 ms)

        True Walking zeros (8-bit): 0 errors (433 ms)

        True Walking ones (8-bit): 0 errors (435 ms)

        Memtest86 Walking zeros (32-bit): 0 errors (1735 ms)

        Memtest86 Walking ones (32-bit): 0 errors (1743 ms)

        Random blocks: 0 errors (257 ms)

        Memtest86 Modulo-20: 0 errors (4052 ms)

        Logic (one iteration): 0 errors (55 ms)

        Logic (4 iterations): 0 errors (55 ms)

        Logic (local memory, one iteration): 0 errors (56 ms)

        Logic (local memory, 4 iterations): 0 errors (78 ms)

0 Likes
ihaque
Journeyman III

Hi,

I'm the developer of MemtestCL. With the permission of Dr. Pande (my old PI), I've forked the development of MemtestCL from SimTK and put it up on GitHub:

https://github.com/ihaque/memtestCL

The version of the code there should have the suggested fix. Please have a look and let me know if it solves your problems. Pull requests accepted too!

Cheers,

Imran

0 Likes

It crashes on Tahiti if the following is defined in the kernels file (it is currently defined):

#define MODX_WITHOUT_MOD

Can you use $(AMDAPPSDKROOT) in your makefile? The AMD APP SDK is not always in /opt/AMDAPP; I had to change it several times.

0 Likes

@yurtesen: I found the bug with MODX_WITHOUT_MOD (I forgot to reset an offset pointer before an iteration loop). Fixed in cac2beea. The current repo HEAD has this and a number of other fixes; tests clean on my Radeon 6870 and on my CPU as well. Please take a look.
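(As an aside, the general shape of that kind of bug looks like the sketch below. This is purely an illustration, not the actual memtestCL code; every name in it is hypothetical.)

#include <stddef.h>

/* Illustrative sketch only -- not memtestCL source. Shows the generic
   "offset pointer not reset before the outer iteration loop" mistake. */
void touch_blocks(unsigned int *base, size_t wordsPerBlock, size_t nBlocks, int passes)
{
    for (int pass = 0; pass < passes; pass++) {
        unsigned int *p = base;          /* the per-pass reset that is easy to forget;
                                            hoisting p out of this loop without resetting
                                            it makes later passes walk past the buffer */
        for (size_t b = 0; b < nBlocks; b++) {
            for (size_t w = 0; w < wordsPerBlock; w++)
                p[w] ^= 0xFFFFFFFFu;     /* stand-in for the real per-block work */
            p += wordsPerBlock;          /* advance to the next block */
        }
    }
}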

@drallan: If you'd like to submit your mingw make setup in a pull request, I'd be happy to add it to the repo.

0 Likes

@drallan: If you'd like to submit your mingw make setup in a pull request, I'd be happy to add it to the repo.

Hi ihaque,

Currently my GPU system is not online, so working through GitHub is not so simple; however, adding the makefile is.

If you could add it for me, I'll be happy to maintain it through GitHub once the system is online.

The changes are in the attached zip file, and they are:

1. Add makefile.mingw to the Makefiles directory.

2. Two minor preprocessor changes at the start of memtestCL_cli.cpp.

The changes are in memtestCL_cli_rab.cpp (zip file), marked with "add"/"remove" comments; just search for my initials, RAB.

I also included AMD's environment variable AMDAPPSDKROOT in the makefile.

BTW, your program is very useful.

drallan

0 Likes

Hi  drallan,

Thank you for contributing your mingw makefiles.

Also many thanks to ihaque for making the code available on GitHub.

I have a question regarding the barrier. In my case I needed to add the barrier not only to deviceWriteRandomBlocks but also to deviceVerifyRandomBlocks before the errors went away. Previously, adding it only to deviceWriteRandomBlocks (as it is now on GitHub) reduced the errors but did not eliminate all of them (the Random blocks test was still failing sometimes, especially when testing small sizes like the default 128 MB on my 7970).

As the race condition in deviceVerifyRandomBlocks is very similar to the one in deviceWriteRandomBlocks, I'm surprised I'm the only one to have stumbled upon it. Or maybe when you (drallan) fixed your code you did it in both places, while the code on GitHub has just half of the fix?

tkg

0 Likes

Hi tkg,

Yes, I think you are right. There should be an extra barrier in deviceVerifyRandomBlocks(), just like the one in deviceWriteRandomBlocks(). The race condition occurs because the seed for loop N+1 uses the data from loop N.

loop
    randomdata[thread] = function(seed);
    barrier
    seed = randomdata[last_thread of workgroup]
    .... (new barrier)
endloop

Thus there is a small but real possibility that one wave (the last wave, in fact) can loop around and set randomdata for pass N+1 before some other wave has read randomdata; the second barrier fixes this. Although both tests use this same logic, the problem may not occur in deviceVerifyRandomBlocks() because it uses a slower local memory array to calculate its data.

loop
    read/write_localmemoryarray();    // *slow*
    randomdata[thread] = function(seed);
    barrier
    seed = randomdata[last_thread of workgroup]
    .... (new barrier)
endloop

I never looked at this because I was so happy the problem disappeared after fixing deviceWriteRandomBlocks(), and I never saw it fail after that, so it's good you did!

However, there is an easier fix for the race condition that also proves the point. The race condition can only occur when a workgroup has multiple waves, so calling the kernels with a workgroup size of 64 (one wave) will also prevent the problem without the extra barrier, which I just verified; see the host-side sketch below. It also does not slow down the test as far as I can see.
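A minimal host-side sketch of that workaround (the wrapper function and variable names here are hypothetical, not memtestCL's actual host code); the only change is the local_work_size handed to clEnqueueNDRangeKernel:

#include <CL/cl.h>

/* Hedged sketch, not memtestCL's real launcher: enqueue the kernel with
   one-wave (64 work-item) workgroups instead of 256. Kernel arguments are
   assumed to be set already. */
cl_int enqueueWriteRandomBlocks(cl_command_queue queue, cl_kernel kernel,
                                size_t totalWorkItems)
{
    size_t global_work_size = totalWorkItems;  /* must stay a multiple of local_work_size */
    size_t local_work_size  = 64;              /* one GCN wavefront per workgroup */

    return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                  &global_work_size, &local_work_size,
                                  0, NULL, NULL);
}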

Maybe ihaque can pop in and suggest the best solution, or I'll contact him.

drallan

Hi drallan

You are quite right.

Setting the workgroup size to 64 does seem to fix the problem for me as well (without the need for the barrier) in this particular case, but to be sure that other architectures will not have this problem, it is better to have the barrier added there. I did some testing, and even for workgroups of 256 items the barrier did not seem to slow down the execution by a significant amount compared to workgroups of 64 items (with or without the barrier). This was a bit surprising (I assume that because of the slow external memory accesses the waves are actually paused and interleaved, so they finish quite quickly one after another), but given that this Random blocks test is one of the fastest compared to the other ones, the overall performance impact of the barrier should be minimal. Obviously, on architectures other than GCN there might be some penalty, but at least it should stop giving false errors for this test on current and future hardware architectures that might use other wave sizes.

Also, from my testing I found that executing the Random blocks test alone (50 iterations of 128 MB or 2500 MB), with 256 items in the workgroup and no barrier in verify, did not give errors. However, running it together with at least the "True walking ones" test (which is significantly slower than Random blocks) did give some errors in some loops, which also seemed a bit strange.

All the best,

tkg

0 Likes

Hi,

No one contacted me until today (or maybe it just got filtered!). In any case, reports filtered up through the Folding@home forum, and I've just pushed a fix for deviceVerifyRandomBlocks to the GitHub repo. Thanks again for the bug report and the easy fix.

ihaque

0 Likes

I tried to send you email several times. There were some other bugs in your GitHub repo, but I think I made a bug report for them... hmm, I can't remember anymore. I hope you made/will make Windows binaries which include the fix as well.

0 Likes

Hi ihaque,

Sorry, I even said above that I would contact you, and then I got distracted before I could.

Yes, it should be easy to fix. If there is anything I can help with wrt the Windows part (now or later), please let me know. (... perhaps better than I let you know.)

Cheers,

drallan


0 Likes