Trouble reading from global memory after atom_inc
I know there's no global barrier in OpenCL, but I'm trying to create a workaround using the following code:
    void barrier(__global uint* scratch) {
        uint nThreads = get_global_size(0);
        atom_inc(scratch);
        /* this loop never terminates */
        while (scratch[0] < nThreads) {
            continue;
        }
    }
The idea is that each thread loops until all threads have incremented that memory.
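To make the intended behavior concrete, here is a minimal host-side sketch of the same idea in plain C11 with pthreads. It is only an analogue, not OpenCL: the names `spin_barrier`, `run_barrier_demo`, `barrier_args` and `MAX_THREADS` are made up for illustration, and `atomic_load` stands in for a read that is guaranteed to go back to memory on every iteration, which the plain `scratch[0]` read in my kernel may not be.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 16  /* hypothetical upper bound for this sketch */

typedef struct {
    atomic_uint *counter;  /* plays the role of scratch[0] */
    unsigned n_threads;    /* plays the role of get_global_size(0) */
    atomic_uint *passed;   /* counts threads that left the spin loop */
} barrier_args;

/* Each thread announces its arrival, then spins until everyone has arrived. */
static void *spin_barrier(void *p) {
    barrier_args *a = p;
    atomic_fetch_add(a->counter, 1u);  /* analogue of atom_inc(scratch) */
    /* atomic_load re-reads the counter on every iteration */
    while (atomic_load(a->counter) < a->n_threads) {
        /* busy-wait */
    }
    atomic_fetch_add(a->passed, 1u);
    return NULL;
}

/* Starts n_threads threads and returns how many got past the barrier. */
unsigned run_barrier_demo(unsigned n_threads) {
    assert(n_threads > 0 && n_threads <= MAX_THREADS);

    atomic_uint counter, passed;
    atomic_init(&counter, 0u);
    atomic_init(&passed, 0u);

    barrier_args args = { &counter, n_threads, &passed };
    pthread_t tid[MAX_THREADS];

    for (unsigned i = 0; i < n_threads; ++i) {
        int rc = pthread_create(&tid[i], NULL, spin_barrier, &args);
        assert(rc == 0);
        (void)rc;
    }
    for (unsigned i = 0; i < n_threads; ++i) {
        pthread_join(tid[i], NULL);
    }
    return atomic_load(&passed);
}
```

Calling `run_barrier_demo(4)` should return 4 once every thread has left the loop. On the host this terminates because the OS eventually schedules every thread; I am not assuming a GPU gives the same guarantee for all work-items.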
I know the memory is being incremented because scratch[0] == nThreads when I read scratch back to host memory, but the loops never terminate. When I have each thread write out the value it sees at scratch[0] to another buffer, each one just writes the result of its own atom_inc.
Reading from global memory normally works fine for me, so what am I missing here?