marcuse
Journeyman III

Problem with a kernel in SDK 2.6 that worked fine in SDK 2.5

Hi, the kernel does something similar to this:

__local float sdata[LOCAL_SIZE]; // LOCAL_SIZE is the workgroup size

sdata[local_id] = 0;

for (wavefront_id = .......) {
    ....
    sdata[local_id] = 1234;
    if (local_id < 4) result = sdata[local_id + 4];
}
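
For reference, a complete but made-up version of what the kernel does would be something like the sketch below; the loop bounds, the output buffer and the LOCAL_SIZE value are placeholders, not my real code:

// Made-up, self-contained version of the pattern above; the only part that
// matters is the local-memory write followed by the unsynchronized read of
// a neighbour's element.
#define LOCAL_SIZE 64 // placeholder workgroup size

__kernel void repro(__global float *out)
{
    __local float sdata[LOCAL_SIZE];
    int local_id = get_local_id(0);

    sdata[local_id] = 0.0f;
    float result = 0.0f;

    for (int wavefront_id = 0; wavefront_id < 4; ++wavefront_id) { // placeholder bounds
        /* ... the real per-iteration work goes here ... */
        sdata[local_id] = 1234.0f;            // every work-item writes its own slot
        if (local_id < 4)
            result = sdata[local_id + 4];     // a few work-items then read a neighbour's slot, with nothing in between
    }

    if (local_id < 4)
        out[get_group_id(0) * 4 + local_id] = result;
}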

On an A4-Series Fusion APU with SDK 2.5 (Catalyst 11.11), "result" is 1234.

On an ATI V7800 with SDK 2.6 (Catalyst 11.12), "result" is also 1234.

But on the same A4-Series Fusion APU with SDK 2.6 (Catalyst 12.3), "result" is 0 for every thread.

I know there is a potential race condition there, but the "if" statement seems to act as an implicit barrier in the first two cases and not in the last one.

I cannot use an explicit barrier like barrier(CLK_LOCAL_MEM_FENCE), because the last iteration of the for-loop is not executed by all the threads, and the GPU crashes.

I solved it using a modulo operation, like this:

if (local_id < 4) result = sdata[(local_id + 4) % LOCAL_SIZE];

But...

1. Performance decreases by 5-10%.

2. I don't understand why this solution works, or whether it will be reliable in all scenarios.

3. I don't know why my original kernel works on some GPUs and SDKs but not on others.

Any ideas??

Thanks.

0 Likes
5 Replies

sdata[local_id] = 1234;
if (local_id < 4) result = sdata[local_id + 4];

This is a race condition; you need a barrier between the write and the read.

Also, if you write outside of a buffer, the results are undefined.

0 Likes
pwvdendr
Adept II

marcuse wrote:

I cannot use an explicit barrier like barrier(CLK_LOCAL_MEM_FENCE), because the last iteration of the for-loop is not executed by all the threads, and the GPU crashes.

Of course you can; there are plenty of workarounds for this.

For example:

for (all your iterations) {
    if (not last iteration || thread should execute last iteration) {
        // execute();
    }
    barrier(CLK_LOCAL_MEM_FENCE);
}

Indeed, under some circumstances your unsafe code may execute as you expect, and whether it does or not can depend on completely irrelevant details such as the hardware or the compiler version. In any case, it is not safe to rely on. There is no implicit barrier of any kind in an if-statement, or for that matter in any non-barrier statement.
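
Applied to the simplified kernel you posted, one possible form is sketched below. The per-work-item iteration count, the workgroup-wide maximum and the buffer names are placeholders; note that the barrier has to sit between the write and the read, so the loop body is split around it:

// Every work-item runs the same number of iterations and therefore reaches
// every barrier(); only the work-items that still have real work execute the
// guarded parts of the body.
#define LOCAL_SIZE 64 // placeholder workgroup size

__kernel void repro_with_barriers(__global float *out,
                                  __global const int *my_iters, // per-work-item iteration count (placeholder)
                                  int max_iters)                // maximum over the whole workgroup (placeholder)
{
    __local float sdata[LOCAL_SIZE];
    int local_id = get_local_id(0);
    int my_iterations = my_iters[get_global_id(0)];

    sdata[local_id] = 0.0f;
    float result = 0.0f;

    for (int i = 0; i < max_iters; ++i) {
        int active = (i < my_iterations);     // "this work-item should execute this iteration"

        if (active)
            sdata[local_id] = 1234.0f;        // the write from the original kernel

        barrier(CLK_LOCAL_MEM_FENCE);         // reached by every work-item, every iteration

        if (active && local_id < 4)
            result = sdata[local_id + 4];     // the read, now safely after the barrier

        barrier(CLK_LOCAL_MEM_FENCE);         // keep the next iteration's write from overtaking this read
    }

    if (local_id < 4)
        out[get_group_id(0) * 4 + local_id] = result;
}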

marcuse
Journeyman III

Thanks for the replies, but I am aware of the race-condition problem. NVIDIA itself proposes this method for the sum reduction in one of the kernels of its SpMV (sparse matrix-vector) product (it's CUDA code):

for (IndexType jj = row_start + thread_lane; jj < row_end; jj += WARP_SIZE)
    sdata[threadIdx.x] += Ax[jj] * fetch_x<UseCache>(Aj[jj], x);

sdata[threadIdx.x] += sdata[threadIdx.x + 16];
sdata[threadIdx.x] += sdata[threadIdx.x + 8];
sdata[threadIdx.x] += sdata[threadIdx.x + 4];
sdata[threadIdx.x] += sdata[threadIdx.x + 2];
sdata[threadIdx.x] += sdata[threadIdx.x + 1];

I guess there is no race condition because all the threads that execute the reduction belong to the same warp, so they execute "concurrently". But the same should be true in my code (what I've posted is a simplification), with all the threads belonging to the same wavefront, and it worked fine until I moved to SDK 2.6 and the new Catalyst drivers. Perhaps wavefronts don't behave exactly like CUDA warps.

I tried your suggestions, pwvdendr, and they worked, but they require quite a lot of comparisons and performance degrades badly. Another thing that worked was replacing statements like

if (local_id < 4) result = sdata[local_id + 4];

with a call to a function that executes them (is that reliable?), but performance degrades just as much.

So far, the best solution I've found is the modulo operation, but I don't think it's reliable and I don't even understand why it works. I'll keep trying.

Thanks !!

0 Likes

Finally, the problem was not a race condition (it could not be, since all the work-items executing those statements belong to the same wavefront), but the lack of a "volatile" qualifier on the local array "sdata". The changes to sdata were not visible to the other work-items until the end of the kernel.

But a "volatile" declaration degraded performance a lot. Instead, with only a "memfence" (not a barrier) the problem gets solved correctly with little impact on performance. Anyway, the modulo operation still wins in terms of performance (but it's less "elegant", and perhaps less robust, but I am not sure about the last).

Thanks.

0 Likes
marcuse
Journeyman III

Thanks for replying, but as I understand it, in this particular case a barrier is not needed, because all the work-items that could produce the race condition belong to the same wavefront (perhaps this is not evident from the extract of code I posted), so they execute concurrently. Moreover, that is exactly how NVIDIA wrote the code I quoted earlier. And I am not writing outside of a buffer, either.

Anyway, the problem is solved, and it was the one I described in my previous post: writes and reads in local (or global) memory are not kept coherent unless you force that coherence with a mem_fence, or with a "volatile" qualifier in the definition of the local array.
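
For reference, the OpenCL analogue of the warp-synchronous reduction tail I quoted earlier, written with the "volatile" qualifier, would look roughly like the sketch below. It assumes a 64-wide workgroup (exactly one wavefront on this hardware) and made-up buffer names; the guards only keep the reads inside the array:

// Wavefront-synchronous reduction tail. The volatile pointer forces every
// step to re-read sdata from local memory instead of reusing a value cached
// in a register, which is the visibility problem described above. No
// barrier() is used below because all 64 work-items belong to one wavefront
// and execute in lock-step.
#define LOCAL_SIZE 64

__kernel void wavefront_reduce(__global const float *in, __global float *out)
{
    __local float sdata[LOCAL_SIZE];
    int lid = get_local_id(0);

    sdata[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);             // make the initial values visible once

    volatile __local float *v = sdata;
    if (lid < 32) v[lid] += v[lid + 32];
    if (lid < 16) v[lid] += v[lid + 16];
    if (lid <  8) v[lid] += v[lid +  8];
    if (lid <  4) v[lid] += v[lid +  4];
    if (lid <  2) v[lid] += v[lid +  2];
    if (lid <  1) v[lid] += v[lid +  1];

    if (lid == 0)
        out[get_group_id(0)] = v[0];
}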

0 Likes