Hi,
i am trying to write some kernels, which use local memory and parallel reduction. With this I have some problems.
My kernel tries to evaluate the min value and the max value of an array.
Kernel:
http://pastebin.com/gRbDcgyU
This kernel doesn't work. Whenever it runs, it crashes and CL_INVALID_COMMAND_QUEUE is returned by clfinish().
What I find is very interesting is, if I replace this:
for(int offset = local_size/2;offset>0;offset/=2)
{
barrier(CLK_LOCAL_MEM_FENCE);
if(gid_local<offset)
{
Minvals[gid_local] = fmin(Minvals[gid_local],Minvals[gid_local+offset]);
Maxvals[gid_local] = fmax(Maxvals[gid_local],Maxvals[gid_local+offset]);
}
}
barrier(CLK_GLOBAL_MEM_FENCE);
By this:
for(int offset = local_size/2;gid_local<offset;offset/=2)
{
barrier(CLK_LOCAL_MEM_FENCE);
Minvals[gid_local] = fmin(Minvals[gid_local],Minvals[gid_local+offset]);
Maxvals[gid_local] = fmax(Maxvals[gid_local],Maxvals[gid_local+offset]);
}
barrier(CLK_GLOBAL_MEM_FENCE);
The kernel is working and even returns the correct values, although many threads in the same work-group don't reach the barrier-statement in my loop, which shouldn't be possible.
My idea of parallel reduction is pretty simple. I take the half of my work-group size and use that as an offset for an work-item to access the right elements.
This offset is halfed by each iteration until there is only one work-item alive. Both for loops are doing the same(in my opinion). If I use the first one, the kernel crashes. Sadly I don't know the reason for this. The second loop should not work...but it works.
Many thanks in advance