Problem with a kernel in SDK 2.6 that worked fine in SDK 2.5

Question asked by marcuse on Apr 13, 2012
Latest reply on Jun 19, 2012 by marcuse

Hi, the kernel does something similar to this:


__local float sdata[LOCAL_SIZE]; // LOCAL_SIZE is the work-group size

sdata[local_id] = 0;

for (wavefront_id= .......){
    sdata[local_id] = 1234;
    if (local_id < 4) result = sdata[local_id + 4];
}



On a Fusion A4-series APU with SDK 2.5 (Catalyst 11.11), "result" is 1234.

On an ATI V7800 with SDK 2.6 (Catalyst 11.12), "result" is 1234.


But on the same Fusion A4-series APU with SDK 2.6 (Catalyst 12.3), "result" is 0 for every thread.


I know there is a potential race condition there. The "if" statement seems to act as an implicit barrier in the first two cases, but not in the third.
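To make the race concrete, here is a rough host-side sketch in plain C (not OpenCL, and not a claim about what any driver or compiler actually does). It emulates two possible orderings of the same write and read: work-items running one after another, and all writes completing before any read. The function names and the small LOCAL_SIZE are assumptions made only for this illustration.

```c
#include <assert.h>

#define LOCAL_SIZE 8 /* hypothetical work-group size, for the emulation only */

/* Ordering 1: work-items run one after another in ascending local_id order.
   Each one writes its own slot and immediately reads slot local_id + 4,
   which its neighbour has not written yet. */
static float serial_result(int reader)
{
    float sdata[LOCAL_SIZE] = {0};
    float result[LOCAL_SIZE] = {0};
    assert(reader >= 0 && reader < 4);
    for (int lid = 0; lid < LOCAL_SIZE; ++lid) {
        sdata[lid] = 1234.0f;
        if (lid < 4) result[lid] = sdata[lid + 4];
    }
    return result[reader];
}

/* Ordering 2: every work-item finishes its write before any work-item reads.
   This is the ordering a barrier would guarantee, and the one that lockstep
   execution happens to produce. */
static float lockstep_result(int reader)
{
    float sdata[LOCAL_SIZE] = {0};
    float result[LOCAL_SIZE] = {0};
    assert(reader >= 0 && reader < 4);
    for (int lid = 0; lid < LOCAL_SIZE; ++lid)
        sdata[lid] = 1234.0f;
    for (int lid = 0; lid < 4; ++lid)
        result[lid] = sdata[lid + 4];
    return result[reader];
}
```

Calling serial_result(0) gives the stale initial value, while lockstep_result(0) gives the freshly written one. Without a barrier, nothing in the kernel source forces the second ordering.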


I cannot use an explicit barrier such as barrier(CLK_LOCAL_MEM_FENCE), because not all threads execute the last iteration of the for loop, and the GPU crashes.
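One common way around a barrier inside a divergent loop is to give every work-item the same trip count and guard only the work, so that all work-items reach the barrier in every iteration. The sketch below emulates that structure on the host in plain C (again not OpenCL). It assumes each work-item can compute the group-wide maximum trip count; the trips array, run_group, and LOCAL_SIZE value are hypothetical names for this illustration. The boundary between two inner loops stands in for barrier(CLK_LOCAL_MEM_FENCE).

```c
#include <assert.h>

#define LOCAL_SIZE 8 /* hypothetical work-group size, for the emulation only */

/* Emulates one work-group. trips[lid] is the number of iterations that
   work-item lid really needs (hypothetical input). */
static void run_group(const int trips[LOCAL_SIZE], float result[LOCAL_SIZE])
{
    float sdata[LOCAL_SIZE];
    int max_trips = 0;

    for (int lid = 0; lid < LOCAL_SIZE; ++lid) {
        assert(trips[lid] >= 0);
        sdata[lid] = 0.0f;
        result[lid] = 0.0f;
        if (trips[lid] > max_trips) max_trips = trips[lid];
    }

    /* Uniform loop: every work-item iterates max_trips times, so every
       work-item reaches both barriers in every iteration. */
    for (int it = 0; it < max_trips; ++it) {
        /* Phase 1: only work-items that are still active write. */
        for (int lid = 0; lid < LOCAL_SIZE; ++lid)
            if (it < trips[lid]) sdata[lid] = 1234.0f;

        /* barrier(CLK_LOCAL_MEM_FENCE) would go here in the kernel */

        /* Phase 2: only active work-items with local_id < 4 read. */
        for (int lid = 0; lid < LOCAL_SIZE; ++lid)
            if (it < trips[lid] && lid < 4) result[lid] = sdata[lid + 4];

        /* a second barrier would go here, before the next iteration writes */
    }
}
```

With trips = {3, 3, 3, 3, 3, 3, 2, 2}, the first four entries of result end up as 1234 even though two work-items sit out the last iteration. In a real kernel the guard condition replaces the divergent loop bound, and the barrier calls stay outside the guarded branches.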


I worked around it with a modulo operation, like this:


if (local_id < 4) result= sdata[(local_id + 4)%LOCAL_SIZE];


But ...


1. Performance decreases by 5-10%.

2. I don't understand why this solution works, or whether it will be reliable in all scenarios.

3. I don't know why my original kernel works on some GPUs and SDKs but not on others.


Any ideas?