I am coming to OpenCL from a CUDA background and was hoping to find a fast thread-communication function similar to CUDA's __shfl(). I came across the
work_group_reduce() function available in OpenCL and was wondering how it performs when used to reduce values held in registers.
work_group_reduce is part of OpenCL 2.0, and the AMD stack for OpenCL 2.0 is not yet out. OpenCL 2.0 will also have support for shuffle.
Do you mean performance compared to a hand coded kernel?
By performance, I mean: does this use any memory buffers for the reduction (i.e. local or global), or just private memory?
Regarding shuffle, do you mean that OpenCL 2.0 will have an instruction similar to CUDA's __shfl() (exchange of private memory values within a wavefront), or that it will support OpenCL's shuffle, which looks like a convenience function for reordering the numbers in a vector?
The work_group_reduce function takes built-in data types (e.g. half, int, uint, long, ulong, float or double) as input. It receives the input argument by value, not as a pointer, so you can pass any value read through a global or local memory pointer, as follows.
// The kernel name "reduce_min" is illustrative.
kernel void reduce_min(global const int *inData, global int *outData)
{
    uint gid = get_global_id(0);
    outData[gid] = work_group_reduce_min(inData[gid]);
}
In the above example, either inData or outData could instead be a local memory pointer.
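To make the semantics of the kernel above concrete, here is a small host-side reference model in plain C (not OpenCL C). It is only a sketch of what outData should contain after the kernel runs, assuming a 1-D launch where consecutive global IDs are split into work-groups of group_size. The function name and its parameters are made up for illustration, and it says nothing about how a real implementation performs the reduction internally (registers, local memory or otherwise).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference model: mimics the result of calling
 * work_group_reduce_min(inData[gid]) in every work-item of a 1-D launch.
 * Assumes work-group g covers global IDs [g*group_size, (g+1)*group_size). */
void reference_work_group_reduce_min(const int *in_data, int *out_data,
                                     size_t n, size_t group_size)
{
    assert(group_size > 0);

    for (size_t start = 0; start < n; start += group_size) {
        /* The last group may be smaller if n is not a multiple of group_size. */
        size_t end = (start + group_size < n) ? start + group_size : n;

        /* Minimum over the values contributed by this work-group. */
        int group_min = in_data[start];
        for (size_t i = start + 1; i < end; ++i) {
            if (in_data[i] < group_min)
                group_min = in_data[i];
        }

        /* Every work-item in the group receives the same reduced value. */
        for (size_t i = start; i < end; ++i)
            out_data[i] = group_min;
    }
}
```

For example, with the input {5, 3, 9, 7, 2, 8, 6, 4} and a work-group size of 4, the first four outputs are all 3 and the last four are all 2. Comparing a kernel's output against a model like this is one way to check a hand-coded reduction before timing it against the built-in.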