Hi Wayne,
A quick look at your code turned up some race conditions in your kernel.
A very similar scenario was reported on the NVIDIA forums some 5 years back, where everyone thought it was a hardware bug, but it turned out to be a race condition.
Here is what I found (there could be others hiding, so please go through the rest of your code as well):
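1. dps1_kernel - A single "barrier" in the middle of the FOR loop leaves a race between the UPPER and LOWER halves of the loop body.
A work-item that finishes the LOWER half can wrap around and start the UPPER half of the next iteration while other work-items are still executing the LOWER half of the current one.
This is a very subtle race that can escape even trained eyes.
You need another barrier at the end of the FOR loop body.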
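2. dps1_kernel - A "barrier" cannot be used inside a FOR loop of the form for(x=tid; x<constantN; x += localSize)
Work-items with a larger tid may run fewer iterations than the others, so on the last iteration some work-items never reach the "barrier" while the rest wait at it.
Every work-item in the work-group must reach the barrier, otherwise the behaviour is undefined (typically a hang or corrupted results).
The only exception is when you know for sure that "localSize" divides "constantN" exactly.
Otherwise, you need to write something like this: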
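for (x = 0; x < N; x += localSize)
{
    if ((x + localId) < N)
    { DO WORK }
    barrier(CLK_LOCAL_MEM_FENCE); // or CLK_GLOBAL_MEM_FENCE, depending on which memory is shared
    if ((x + localId) < N)
    { DO SOME MORE WORK }
    barrier(CLK_LOCAL_MEM_FENCE); // This is important!
}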
I have not checked the other kernels. I hope you will be able to refactor your code with this input.
If the bug remains, please post here.
- Bruhaspati