Archives Discussions

rick_weber · ‎02-04-2011

Hello,

I'm trying to implement radix sort in 1 dimension on a 2 dimensional array. So, global_size(0) = 64, as does local_size(0), and global/local_size(1) = numRows. The second work dimension corresponds to the row that work group should sort. Unfortunately, the data in my rows aren't guaranteed to be a multiple of 64, which causes heartache with respect to loops and barriers.

Basically, I have this for loop with barriers (other code removed):

for(unsigned int j = get_local_id(0); j < len; j += get_local_size(0))
{
barrier(CLK_LOCAL mem);
}

where len = 100. The problem I'm having is that thread divergence causes some threads to execute the barrier and some to not. I'm not sure if this is a problem on the GPU since threads execute in lockstep within a block, but on the CPU, it definitely is. I know the OpenCL spec says all work items in a group must execute the barrier, but can I do the following? Each thread should execute the same number of barriers before moving past the if statement.

unsigned int j = get_local_id(0);
for(; j < len; j += get_local_size(0))
{
barrier(CLK_LOCAL mem);
}
if(j >= len)
{
barrier(CLK_LOCAL mem);
}

Meteorhead · ‎02-04-2011

As far as I know it should work, although what you wrote code does not do what you explained in text.

If it doesn't, you can definately rearrange your code so that all work-items hit the same barrier.

while( i < longestRun )
{
    if( I_am_thread_to_do_work )
    { ...
    }

i += whatever;
barrier();
}

Hope that helps.

rick_weber · ‎02-04-2011

Yeah, that's pretty much what I ended up doing. Unfortunately, my code has tons of barriers in the main loop, because I have to compute a scan in shared memory each iteration and update some indices. I can get rid of a bunch of them if I use a private register instead of a local memory variable. But this has the unfortunate side effect of hurting memory bandwidth when finding the pivot point. I guess I'll have to experiment to see which is the lesser evil.

nou · ‎02-04-2011

if you use local memory as private array then you don't need a barrier.

Archives Discussions

barriers in kernels