rick.weber

barriers in kernels

Discussion created by rick.weber on Feb 4, 2011
Latest reply on Feb 4, 2011 by nou

Hello,

I'm trying to implement a radix sort along one dimension of a 2-dimensional array. So global_size(0) = 64, as does local_size(0), and global/local_size(1) = numRows. The second work dimension selects the row that the work-group should sort. Unfortunately, the lengths of my rows aren't guaranteed to be a multiple of 64, which causes heartache with respect to loops and barriers.

Basically, I have this for loop with barriers (other code removed):

for(unsigned int j = get_local_id(0); j < len; j += get_local_size(0))
{
    // ... other code removed ...
    barrier(CLK_LOCAL_MEM_FENCE);
}

where len = 100. The problem I'm having is that thread divergence causes some threads to execute the barrier more times than others: with 64 work-items and len = 100, local ids 0 through 35 run the loop body twice, while ids 36 through 63 run it only once. I'm not sure if this is a problem on the GPU, since threads execute in lockstep within a block, but on the CPU it definitely is. I know the OpenCL spec says all work-items in a group must execute the barrier, but can I do the following? The intent is that each thread has executed the same number of barriers by the time it moves past the if statement.

unsigned int j = get_local_id(0);
for(; j < len; j += get_local_size(0))
{
    barrier(CLK_LOCAL_MEM_FENCE);
}
if(j >= len)
{
    barrier(CLK_LOCAL_MEM_FENCE);
}
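
One way to sanity-check the barrier counts is a small host-side model in plain C (not kernel code) that counts how many barriers a single work-item would reach. This is only a sketch: the function names barriers_strided_plus_if and barriers_uniform_rounds are made up for illustration, and the second function models an alternative loop shape that is not in my kernel, where every work-item runs ceil(len / local_size) rounds and only the element access is guarded.

```c
/* Host-side model only: counts barriers reached by one work-item.
   lid = local id, lsz = local size, len = row length. */

/* Loop shape from the post: strided loop, then one extra barrier
   behind `if (j >= len)`. */
unsigned barriers_strided_plus_if(unsigned lid, unsigned lsz, unsigned len)
{
    unsigned count = 0;
    unsigned j = lid;
    for (; j < len; j += lsz)
        count++;                /* barrier inside the loop body */
    if (j >= len)               /* always true once the loop has exited */
        count++;                /* the extra barrier */
    return count;
}

/* Hypothetical alternative: the same number of rounds for every
   work-item, with the barrier outside the bounds check. */
unsigned barriers_uniform_rounds(unsigned lid, unsigned lsz, unsigned len)
{
    unsigned count = 0;
    unsigned rounds = (len + lsz - 1) / lsz;    /* ceil(len / lsz) */
    for (unsigned r = 0; r < rounds; r++) {
        unsigned j = lid + r * lsz;
        if (j < len) {
            /* per-element work would go here */
        }
        count++;                /* barrier reached by every work-item */
    }
    return count;
}
```

With lsz = 64 and len = 100, the first function gives 3 for local ids 0 through 35 and 2 for ids 36 through 63, because the condition j >= len holds for every work-item after the loop, so the extra barrier is added to all of them rather than only to the ones that ran fewer iterations. The second function gives 2 for every local id.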
