My understanding is that the following reduction does not require memory fences as long as the work-group size equals one wavefront (64 work-items on the HD 7950).
However, it gives incorrect results unless I add a mem_fence(CLK_LOCAL_MEM_FENCE) after each write. I am using __attribute__((reqd_work_group_size(64,1,1))) on my kernel.
I have used this code successfully before, so either something has changed or I have introduced a bug in my own code.
// Reduction min
dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+32] );
dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+16] );
dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+8 ] );
dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+4 ] );
dt_min_local[loc_mempos] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+2 ] );
// Write to global memory
if (loc_mempos == 0)
{
    d_min_max_workgroup[get_group_id(0)] = fmin( dt_min_local[loc_mempos] , dt_min_local[loc_mempos+1 ] );
}
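For comparison, here is a minimal sketch of a synchronized version of the same min reduction. It is not the original kernel: the kernel name, the dt_in argument, and the 64-element size of dt_min_local are assumptions made for illustration. It also uses barrier(CLK_LOCAL_MEM_FENCE) rather than the mem_fence mentioned above, because barrier is the portable way to make one work-item's local writes visible to the others. Each step is guarded so that only the lower half of the active range reads from the upper half.

```c
// Sketch only: reduce_min_sketch and dt_in are hypothetical names.
// Assumes the global size is a multiple of 64 and dt_in holds one
// float per work-item.
__kernel __attribute__((reqd_work_group_size(64, 1, 1)))
void reduce_min_sketch(__global const float *dt_in,
                       __global float *d_min_max_workgroup)
{
    // Assumed size: one slot per work-item in the work-group.
    __local float dt_min_local[64];
    const uint loc_mempos = get_local_id(0);

    // Stage each work-item's value in local memory.
    dt_min_local[loc_mempos] = dt_in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    // Tree reduction: strides 32, 16, 8, 4, 2, 1.
    for (uint stride = 32; stride > 0; stride >>= 1) {
        if (loc_mempos < stride) {
            dt_min_local[loc_mempos] = fmin(dt_min_local[loc_mempos],
                                            dt_min_local[loc_mempos + stride]);
        }
        // Outside the branch, so every work-item reaches the barrier.
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Work-item 0 now holds the work-group minimum.
    if (loc_mempos == 0) {
        d_min_max_workgroup[get_group_id(0)] = dt_min_local[0];
    }
}
```

Besides synchronizing the work-items, the barrier stops the compiler from reusing a register copy of a local-memory value across steps. That may be part of why the unfenced version misbehaves, though I have not confirmed it for this driver.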
Info:
Fedora 19. ATI 13.8. HD 7950