I'm trying to implement the scan primitive, but I came across a weird behavior. Whenever the dimension of the global array 'x' is greater than 64 (the wavefront size), let's say 128, the following kernel hangs up:
kernel void scan_dead_lock( global float * x )
{
int id = get_local_id(0);
for( int s = 1; s < get_local_size(0); s <<= 1 )
{
if( id >= s ) {
float tmp = x[id] + x[id - s];
barrier( CLK_GLOBAL_MEM_FENCE );
x[id] = tmp;
}
else barrier( CLK_GLOBAL_MEM_FENCE );
barrier( CLK_GLOBAL_MEM_FENCE );
}
}
But this one, functionally equivalent, does not:
kernel void scan( global float * x )
{
float tmp;
int id = get_local_id(0);
for( int s = 1; s < get_local_size(0); s <<= 1 )
{
if( id >= s ) tmp = x[id] + x[id - s];
barrier( CLK_GLOBAL_MEM_FENCE );
if( id >= s ) x[id] = tmp;
barrier( CLK_GLOBAL_MEM_FENCE );
}
}
I'm using 'global size' = 'local size', so there is only a work-group in flight. Is the OpenCL compiler doing something wrong, maybe removing the "else barrier( CLK_GLOBAL_MEM_FENCE );" part from the first kernel?
The dead lock occurs both with Catalyst (fglrx) 12.4 and 12.6 using an AMD HD5750 GPU on Debian GNU/Linux.
Thank you.
Solved! Go to Solution.
yes. this can lead to deadlock
if(get_local_id(0)>32)barrier();
else barrier();
Hi,
GLOBAL barrier means "each threads stops here and wait until all threads come", if you put the barrier in the selection, some threads will never get the same place so it waits forever.
barrier must be hitted with all or none thread in work-group. otherwise it lead to undefined behaviour.
Thank you for your replies.
I knew that a barrier must be hit by all threads in the work-group, but I totally forgot that each barrier is uniquely identified and so all threads must hit the *same* barrier. Right?
Thanks.
yes. this can lead to deadlock
if(get_local_id(0)>32)barrier();
else barrier();