In my OpenCL kernel, I need to pass values between work items in the same workgroup many times, roughly equivalent to the following:
local float *local_array = ...; // a local memory array
float a, b;
int wid = get_local_id(0);
int wgsize = get_local_size(0);
for (int i = 0; i < 12; ++i)
{
    a = ...; // do some math
    local_array[wid] = a;
    barrier(CLK_LOCAL_MEM_FENCE); // How costly is this?
    b = local_array[(wid + 1) % wgsize]; // neighbor's array element
    // then work with a and b
}
Can I expect the compiler to eliminate this barrier if my workgroup size is smaller than or equal to the hardware SIMD width (warp/wavefront size)?
I would like this kernel to work well on both CPU and GPU devices. I've read elsewhere that barrier() on a CPU results in a lightweight context switch every time it is hit. How can I avoid that?
In my algorithm, there is a trade-off between the number of barriers needed and the size of the local memory array: fewer barriers require more local memory per work item. Does that help guide my decision?
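To make the trade-off concrete, here is a sketch of the batched variant I have in mind: several iterations write into a strided region of the local array, so one barrier covers a whole batch. The UNROLL factor and indexing scheme are my own assumptions, and this only works if iteration i+1's math does not depend on the neighbor value b from iteration i:

#define UNROLL 4
// local_array must now hold UNROLL * wgsize floats instead of wgsize
for (int i = 0; i < 12; i += UNROLL)
{
    for (int j = 0; j < UNROLL; ++j)
    {
        a = ...; // do some math for iteration i + j
        local_array[j * wgsize + wid] = a;
    }
    barrier(CLK_LOCAL_MEM_FENCE); // one barrier per UNROLL iterations
    for (int j = 0; j < UNROLL; ++j)
    {
        b = local_array[j * wgsize + (wid + 1) % wgsize];
        // then work with a and b for iteration i + j
    }
}

So the question is whether paying 4x the local memory for 1/4 the barriers is generally a win.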
By the way, I've spent too much time determining whether mem_fence() is an acceptable substitute for barrier() in this case, and I've concluded that it is not. Googling turns up pages and threads arguing both ways. Reading the OpenCL spec, I've decided mem_fence() has nothing to do with synchronization between work items; it only orders a single work item's own memory operations. But please comment if I'm wrong.
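To illustrate my reading of the spec with the kernel above (this is my interpretation, so please correct it if it's wrong): mem_fence() orders the calling work item's own loads and stores, but it makes nobody wait, so the neighbor read can still race ahead of the neighbor's write:

local_array[wid] = a;
mem_fence(CLK_LOCAL_MEM_FENCE);  // orders *my* memory ops only; no one waits
b = local_array[(wid + 1) % wgsize]; // neighbor may not have written yet

With barrier(CLK_LOCAL_MEM_FENCE) in that spot, every work item is guaranteed to have completed its store before any work item proceeds to the read, which is exactly what the neighbor exchange needs.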