wayne_static
Adept II

Different results with HD 7970 and HD 7750

Hello,

I have written a kernel that performs a dynamic programming routine, specifically targeting the GCN architecture. Recently I tried to optimize it by replacing the if-else constructs with select. On my HD 7970 GPUs the modified kernel works correctly and runs somewhat faster, but the strange thing is that the same kernel does not produce correct results on the HD 7750 GPUs.

By "not working" I mean the following: the output of the kernel is a huge table of values, which I verify against a sequential CPU implementation after each kernel execution. The HD 7970 results are always correct, but the results from the HD 7750 are somewhere between 60% and 90% correct. For example, 4,193,984 out of 4,194,304 values pass verification.

Again, the ONLY thing I did was replace if-else with select in the kernel. Could anyone please shed some light on this strange behavior? I can provide the kernel code if necessary. Many thanks.
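For readers following the rewrite described above, here is a small plain-C sketch of what OpenCL's select(a, b, c) computes. The helper names select_scalar, select_lane and pick are hypothetical stand-ins written for illustration, not OpenCL built-ins. In OpenCL C the scalar form returns b when c is nonzero, while the vector form tests the most significant bit of each component of c. This only illustrates the semantics of the rewrite; it does not explain the HD 7750 results.

```c
#include <stdint.h>

/* Scalar form: select(a, b, c) behaves like (c ? b : a).
   Note the argument order: the "false" value comes first. */
static int32_t select_scalar(int32_t a, int32_t b, int32_t c)
{
    return c ? b : a;
}

/* One component of the vector form: b is chosen only when the most
   significant bit of c is set, which for a two's-complement int32
   means c is negative. */
static int32_t select_lane(int32_t a, int32_t b, int32_t c)
{
    return (c < 0) ? b : a;
}

/* "if (cond) r = hi; else r = lo;" rewritten without a branch. */
static int32_t pick(int32_t lo, int32_t hi, int cond)
{
    return select_scalar(lo, hi, cond);
}
```

OpenCL vector comparisons yield -1 (all bits set) in the lanes where the comparison holds, so their results can be passed directly as the third argument of a vector select. A hand-written lane value of 1 would pick a, not b.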

1 Solution

Hi Wayne,

A cursory glance at your code revealed some race conditions in your kernel.

A very similar scenario was reported in the NVIDIA forums some 5 years back, where everyone thought it was a hardware bug, but it turned out to be a race condition.

Here is what I found (there could be others hiding, so I request you to go through your code carefully):

1. dps1_kernel - A "barrier" in the middle of the FOR loop will cause race conditions between the UPPER and LOWER halves of the loop body: a fast work-item can start the UPPER half of the next iteration while a slower one is still executing the LOWER half of the current one.
   This is a very subtle race that can dodge even trained eyes.
   You need another barrier towards the end of the FOR loop.

2. dps1_kernel - A "barrier" cannot be used in the middle of a FOR loop that reads for(x=tid; x<constantN; x+=localSize)
   Technically, some threads may not enter the loop on the last pass, and the "barrier" will never be reached by them,
   unless you know for sure that "localSize" divides "constantN" perfectly.
   When it does not, you need to write something like this:

       for (x = 0; x < constantN; x += localSize)
       {
           if ((x + localId) < constantN)
           { DO WORK }

           barrier();

           if ((x + localId) < constantN)
           { DO SOME MORE WORK }

           barrier(); // This is important!
       }
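To make point 2 concrete, here is a minimal plain-C sketch (not OpenCL, and not taken from the original kernel) that counts how many barriers each work-item would reach under the two loop shapes. The function names and the sample sizes are assumptions chosen for illustration.

```c
#include <stdio.h>

/* Barriers reached by work-item `tid` when every work-item strides on
   its own: for (x = tid; x < n; x += local_size) { ...; barrier(); } */
static int barriers_strided(int tid, int n, int local_size)
{
    int hits = 0;
    for (int x = tid; x < n; x += local_size)
        hits++;                      /* one barrier per iteration entered */
    return hits;
}

/* Barriers reached in the restructured loop: the loop bounds no longer
   depend on tid, only the work itself is guarded. */
static int barriers_uniform(int tid, int n, int local_size)
{
    int hits = 0;
    for (int x = 0; x < n; x += local_size) {
        if (x + tid < n) { /* DO WORK */ }
        hits++;                      /* first barrier */
        if (x + tid < n) { /* DO SOME MORE WORK */ }
        hits++;                      /* second barrier */
    }
    return hits;
}

/* Prints one line per work-item so the two counts can be compared. */
static void print_barrier_counts(int n, int local_size)
{
    for (int tid = 0; tid < local_size; tid++)
        printf("tid %d: strided=%d uniform=%d\n", tid,
               barriers_strided(tid, n, local_size),
               barriers_uniform(tid, n, local_size));
}
```

Calling print_barrier_counts(10, 4) from a main function shows work-items 0 and 1 reaching three barriers in the strided loop while work-items 2 and 3 reach only two, so the work-group disagrees on how many barriers to execute. In the restructured loop every work-item reaches the same number.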

I have not checked the other kernels. I hope you will be able to refactor your code with this input.

If the bug remains, please post here.

- Bruhaspati

18 Replies