Hello,
I'm trying to implement a radix sort along one dimension of a 2-dimensional array. So, global_size(0) and local_size(0) are both 64, and global_size(1) = numRows with local_size(1) = 1. The second work dimension selects the row that a work group should sort. Unfortunately, the length of my rows isn't guaranteed to be a multiple of 64, which causes heartache with respect to loops and barriers.
Basically, I have this for loop with barriers (other code removed):
for(unsigned int j = get_local_id(0); j < len; j += get_local_size(0))
{
    barrier(CLK_LOCAL_MEM_FENCE);
}
where len = 100. The problem I'm having is that thread divergence causes some threads to execute the barrier and others to skip it. I'm not sure if this is a problem on the GPU, since threads within a warp/wavefront execute in lockstep, but on the CPU it definitely is. I know the OpenCL spec says all work items in a work group must execute the barrier, but can I do the following? The intent is for each thread to have executed the same number of barriers by the time it moves past the if statement.
unsigned int j = get_local_id(0);
for(; j < len; j += get_local_size(0))
{
    barrier(CLK_LOCAL_MEM_FENCE);
}
if(j >= len)
{
    barrier(CLK_LOCAL_MEM_FENCE);
}
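One way to check the idea is to count barriers per work item on the host. The sketch below is plain C, not kernel code. It assumes a local size of 64 and len = 100 as above, and the function names are made up for illustration. Each count++ stands in for a barrier(CLK_LOCAL_MEM_FENCE) call. It also tries a third variant that is not in the snippets above: rounding the loop bound up to a multiple of the local size and guarding only the work, so the barrier sits outside the guard.

```c
/* Host-side model only: counts how many barriers one work item would hit.
   LOCAL_SIZE and LEN mirror the values in the post; the function names
   are hypothetical and exist just for this sketch. */
#define LOCAL_SIZE 64u   /* get_local_size(0) */
#define LEN        100u  /* row length */

/* Original loop: one barrier per iteration. */
unsigned barriers_strided(unsigned lid)
{
    unsigned count = 0;
    for (unsigned j = lid; j < LEN; j += LOCAL_SIZE)
        count++;                 /* barrier(CLK_LOCAL_MEM_FENCE) */
    return count;
}

/* Proposed variant: same loop, then a barrier guarded by j >= LEN. */
unsigned barriers_with_trailing_if(unsigned lid)
{
    unsigned count = 0;
    unsigned j = lid;
    for (; j < LEN; j += LOCAL_SIZE)
        count++;                 /* barrier inside the loop */
    if (j >= LEN)                /* always true once the loop has exited */
        count++;                 /* trailing barrier */
    return count;
}

/* Assumed alternative: loop bound rounded up to a multiple of LOCAL_SIZE,
   so every work item runs the same number of iterations. */
unsigned barriers_rounded_up(unsigned lid)
{
    unsigned count = 0;
    unsigned padded = ((LEN + LOCAL_SIZE - 1u) / LOCAL_SIZE) * LOCAL_SIZE;
    for (unsigned j = lid; j < padded; j += LOCAL_SIZE) {
        if (j < LEN) {
            /* real work on element j would go here */
        }
        count++;                 /* barrier outside the j < LEN guard */
    }
    return count;
}
```

Under these assumptions, work item 0 hits 2 barriers and work item 63 hits 1 in the original loop. With the trailing if they hit 3 and 2, because j >= len always holds once the loop exits, so the mismatch stays. With the rounded-up bound both hit 2.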