wayne_static
Adept II

Different results with HD 7970 and HD 7750

Hello,

I have written a kernel that performs a dynamic programming routine, targeting the GCN architecture in particular. Recently, I tried to optimize the kernel by replacing its if-else constructs with select. The modified kernel works fine on my HD 7970 GPUs, with some improvement in speed, but the strange thing is that the same kernel does not work correctly on the HD 7750 GPUs.

By not working I mean the following: the output of the kernel is a huge table of values. After each kernel execution I verify it against a sequential implementation on the CPU. The HD 7970 results are always correct, but the results from the HD 7750 are somewhere between 60% and 90% correct. For example, 4,193,984 out of 4,194,304 values pass verification.

Again, the ONLY thing I did was replace if-else with select in the kernel. Could anyone please shed some light on this strange behavior? I can provide the kernel code if necessary. Many thanks.

0 Likes
1 Solution

Hi Wayne,

A cursory glance at your code revealed some race conditions in your kernel.

A very similar scenario was reported in NVIDIA forums some 5 years back - where everyone thought it was a hardware bug.

But it turned out to be a race condition.

Here is what I found (there could be others hiding, so please comb through your code):

1. dps1_kernel: a single "barrier" in the middle of the FOR loop leaves a race between the UPPER and LOWER halves of the loop body. Work-items that get past the barrier can start the upper half of the next iteration while slower work-items are still executing the lower half of the current one.

   This is a very subtle race that can escape even trained eyes.

   You need another barrier towards the end of the FOR loop body.

2. dps1_kernel: a "barrier" cannot safely be used in the middle of a FOR loop that reads for(x = tid; x < constantN; x += localSize)

   Technically, the work-items can run different numbers of iterations, so some of them leave the loop early and never reach a "barrier" that the others are waiting on...

   Unless you know for sure that "localSize" divides "constantN" exactly.

   Otherwise, you need to write something like this:

       for (x = 0; x < N; x += localSize)
       {
           if ((x + localId) < N)
           { DO WORK }

           barrier();

           if ((x + localId) < N)
           { DO SOME MORE WORK }

           barrier(); // This is important!
       }

I have not checked the other kernels. I hope you will be able to refactor your code with this input.

If the bug remains, please post here.

- Bruhaspati
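
The second point above can be checked on the host before a kernel is launched. The following is a minimal sketch in plain C, not code from the attached project: the function names are hypothetical, and it only counts loop iterations per work-item. A barrier inside the strided loop is safe only if every work-item in the group runs the loop the same number of times.

```c
#include <assert.h>

/* Iterations a work-item with local id `tid` performs in
 *   for (x = tid; x < n; x += local_size)
 * Every iteration that contains a barrier must be reached by all
 * work-items, so these counts have to match across the group. */
int strided_trip_count(int tid, int n, int local_size)
{
    assert(local_size > 0 && tid >= 0 && tid < local_size);
    int trips = 0;
    for (int x = tid; x < n; x += local_size)
        ++trips;
    return trips;
}

/* Iterations of the restructured loop
 *   for (x = 0; x < n; x += local_size)
 * The bounds do not depend on tid, so the count is the same
 * for every work-item by construction. */
int uniform_trip_count(int n, int local_size)
{
    assert(local_size > 0);
    int trips = 0;
    for (int x = 0; x < n; x += local_size)
        ++trips;
    return trips;
}

/* Returns 1 when every work-item runs the strided loop equally often,
 * 0 when at least one work-item would skip a barrier. */
int strided_loop_is_uniform(int n, int local_size)
{
    int first = strided_trip_count(0, n, local_size);
    for (int tid = 1; tid < local_size; ++tid)
        if (strided_trip_count(tid, n, local_size) != first)
            return 0;
    return 1;
}
```

With n = 256 and local_size = 64 the strided loop is uniform. With n = 250, work-item 0 runs four iterations while work-item 63 runs only three, so a barrier in the fourth iteration would never be reached by the whole group.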


0 Likes
18 Replies
nou
Exemplar

It may be a bug in the driver or faulty hardware. The best thing would be for you to provide a test case.

0 Likes

Hi nou, thanks for the reply. I am not ruling out your suggestion, but I should also mention that this behavior exists on NVIDIA hardware as well, the GeForce 650 and 680 GTX to be precise. I don't know what this means with respect to drivers. Could you please elaborate on what you mean by a test case in this situation? Thanks.

0 Likes

As the code is failing on NVIDIA hardware as well as on the 7750, I would guess it is passing on the 7970 by accident. The 7750 and 7970 are both GCN, so it is hard to imagine them giving different results. I would guess you have different drivers installed on the 7750 and 7970 machines. Are they running the same OS, and do they have the latest APP SDK? The latest Catalyst driver is recommended (13.8 beta as of today).

Before sharing your kernel, I would suggest you check the verification logic and all the places where you used select instead of if-else. You might have some silly bug somewhere 😉 If nothing rings a bell, feel free to share your kernels here. It is recommended to attach a test case that anyone can download and compile with little hassle. Use the advanced editor for attaching.
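
One thing worth checking when reviewing each select is the argument order. In OpenCL C the scalar form select(a, b, c) yields b when c is non-zero and a otherwise, so the "else" value comes first; the vector forms instead test the most significant bit of each component of c. The sketch below is plain C, not the poster's kernel: select_scalar is a hypothetical stand-in for the scalar built-in, and the max-style update is an invented example of a dynamic programming step.

```c
#include <assert.h>

/* Plain-C stand-in for OpenCL's scalar select(a, b, c):
 * returns b when c is non-zero, a otherwise.
 * Hypothetical helper, for illustration only. */
int select_scalar(int a, int b, int c)
{
    return c ? b : a;
}

/* Branching form of a max-style update (invented example). */
int update_branch(int cur, int cand)
{
    if (cand > cur)
        return cand;
    else
        return cur;
}

/* Branch-free form: the else-value goes first, the then-value second.
 * Swapping the first two arguments would silently compute a min. */
int update_select(int cur, int cand)
{
    return select_scalar(cur, cand, cand > cur);
}
```

Also note that, unlike an if-else, both value arguments of select are evaluated before the choice is made, so any memory read that the branch used to guard now always happens.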

0 Likes

Thanks for the reply. I agree with you that it is hard to imagine such behavior. At the moment, all machines are running identical drivers, i.e., Catalyst version 13.4 and AMD APP version 1124.2, which comes with the latest SDK version 2.8.1. All machines are also running the same Windows 7 Enterprise 64-bit. Maybe I should also mention that the machine with the GeForce 680 GTX has the same version of the OS and does not use the AMD APP SDK.

I usually work with a single project in Visual Studio 2012 and then copy the project to whichever machine I want to run tests on. All results are integers, so there are no floating-point headaches. The input data is randomized and the output data is a table, so the verification process is simply a matter of looping through the GPU values and comparing them with the sequential CPU results. All of this happens in one execution of the code.
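
The verification loop described above can be sketched as follows. This is a minimal illustration in plain C, not the code from the project: the function name and the first_bad parameter are assumptions. Recording where the first mismatch occurs, and not only how many there are, can help when the failures turn out to cluster at particular positions in the table.

```c
#include <assert.h>
#include <stddef.h>

/* Count entries where the device table differs from the CPU reference.
 * If first_bad is non-NULL, it receives the index of the first mismatch,
 * or n when the two tables agree everywhere. */
size_t count_mismatches(const int *gpu, const int *cpu, size_t n,
                        size_t *first_bad)
{
    assert(gpu != NULL && cpu != NULL);
    size_t bad = 0;
    if (first_bad)
        *first_bad = n;
    for (size_t i = 0; i < n; ++i) {
        if (gpu[i] != cpu[i]) {
            if (bad == 0 && first_bad)
                *first_bad = i;
            ++bad;
        }
    }
    return bad;
}
```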

Do you suggest I update to the catalyst driver 13.8 beta and try again before providing a test case? Thanks.

0 Likes

Yes, try the latest drivers, as there is a chance that it has already been fixed.

0 Likes

I have updated the machine with the HD 7750 GPU to Catalyst version 13.8 beta2, but it still fails verification. This machine is also equipped with an A10-5800K APU, and the kernel also fails on the HD 7660D GPU attached to it.

0 Likes

Please provide us the testcase

0 Likes

Hi, I am not able to access the above link due to internal security reasons. Please give us a direct link or attach the file/project directly to this thread. Please don't post any third-party URLs.

0 Likes

Apologies, I attached the wrong project. Please find attached the original test case I attempted to link to. Many thanks.

0 Likes

Thanks for the feedback. I will work on these right away and get back to you ASAP.

0 Likes

Wow! Thanks very much, I am glad and very impressed.

I started with your first point and added another barrier at the end of the for-loop and ran the code a couple of times with different input sizes. Things are now looking the way they should and working correctly on both GPUs.

Regarding your second point, for this implementation localSize must always divide constantN exactly (the formulation depends on this too), so I guess it's not much of an issue now. However, I will definitely keep this in mind for future reference.

Once again thanks very much for your help.

0 Likes

You are welcome! Glad it worked!

Thanks for marking it as "Answered"... It helps 🙂

0 Likes
aldep
Journeyman III

Beautiful benchmarks, but what does this have to do with the original question?

I have the same problem (between the 7750 and the 7950). Any news on the subject, AMD?

0 Likes

I don't know whether to feel relief that someone else has encountered a similar situation. However, it would be really helpful if anyone from AMD could give us some update on the situation regarding this issue. Thanks.

0 Likes

Hey Wayne,

Sorry about the delay from our side. We do track all threads, and yours is still in the Unresolved state.

So this will get our attention anyway.

I just downloaded your package. I will let you know whether I can reproduce the problem here.

I still need to find out whether I can get hold of a 7970 and a 7750. If I can, I will experiment and check it out.

Thanks for your time.

The experiments will take some time, so please bear with us.

Thanks,

- Bruhaspati

0 Likes

Thanks for the reply. That is no problem I will await your response.

0 Likes

The "Beautiful benchmarks" post is nothing but camouflaged spam.

If you look towards the bottom, there is a link to laptop prices etc.

Spammers have become very intelligent today. They can beat all these text-mining algorithms.

Nowadays we have to be very vigilant about these types of messages.

Sigh..

0 Likes