I am using OpenCL on my Apple MacBook Pro, which has a GPGPU-capable graphics card and a 2.66 GHz Intel Core 2 Duo, and I want to run OpenCL on both the CPU and the GPU. It works fine on the GPU, and also on the CPU apart from one problem:
The maximum work-group size returned by the OpenCL device query is 1, which means there can be only one work-item (thread) per work-group (thread block). So how can I implement, for example, a reduction, or the many other kernels that need more than one work-item per work-group, on the CPU OpenCL implementation? Please advise, as I could not find any help on this.
Thanks in advance!
On the CPU, you can use atomics to synchronize globally across work-groups. This would allow you to do a parallel reduction: just read and write global memory directly. Since memory is cached on the CPU, there isn't much benefit to using local memory anyway.
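As a rough illustration of the atomics idea, here is a minimal sketch in plain C11 rather than OpenCL C, so it can be compiled without an OpenCL runtime. The names `reduce_chunk` and `reduce_all` are hypothetical, and the sketch assumes each work-group of size 1 owns a contiguous chunk of the input. In an OpenCL 1.1 kernel the corresponding call would be `atomic_add` on a `__global int`; note that the OpenCL 1.x atomic add is defined for 32-bit integers, not floats.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for one work-group of size 1: it sums its own
   contiguous chunk privately, then publishes the result with a single
   atomic add on a shared accumulator (the "global memory" counter). */
void reduce_chunk(const int *input, size_t chunk_len,
                  size_t group_id, atomic_int *total)
{
    int partial = 0;
    for (size_t i = 0; i < chunk_len; ++i)
        partial += input[group_id * chunk_len + i];

    /* One atomic operation per group keeps contention low. */
    atomic_fetch_add(total, partial);
}

/* Runs the groups one after another to keep the sketch portable.
   Because the only shared write is atomic, the same reduce_chunk
   would stay correct if the groups ran on concurrent threads. */
int reduce_all(const int *input, size_t chunk_len, size_t num_groups)
{
    atomic_int total;
    atomic_init(&total, 0);

    for (size_t g = 0; g < num_groups; ++g)
        reduce_chunk(input, chunk_len, g, &total);

    return atomic_load(&total);
}
```

Accumulating privately and touching the shared counter only once per group is a design choice of this sketch, not something stated in the answer above; doing an atomic add per element would also work but would serialize far more often.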
I seem to recall a maximum work-group size of 1 being the case in Apple's OpenCL CPU implementation. I think AMD's allows more threads in a work-group, but I'm not sure. Unfortunately, there's no real way to enforce that threads execute in lock-step on a CPU.
I'm going to take a different approach to this point. If you want to do a reduction on the CPU, do this:
    for (int i = 0; i < number; ++i)
        sum = sum + input[i + groupid];
where sum is a vec4 (float4 in OpenCL C).
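A slightly fuller version of the loop above might look like the following sketch, written in plain C so it compiles without OpenCL. The `vec4` struct is a hypothetical stand-in for OpenCL's `float4`, and the function names are made up for illustration. The sketch also assumes each work-group owns a contiguous chunk of `number` elements, so it indexes with `groupid * number + i` rather than the `i + groupid` shorthand used above; that layout is an assumption, not something the post states.

```c
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's float4 vector type. */
typedef struct { float x, y, z, w; } vec4;

/* Sum the `number` vec4 elements owned by work-group `groupid`,
   component by component, in a single sequential loop. */
vec4 reduce_group(const vec4 *input, size_t number, size_t groupid)
{
    vec4 sum = {0.0f, 0.0f, 0.0f, 0.0f};
    const vec4 *chunk = input + groupid * number;

    for (size_t i = 0; i < number; ++i) {
        sum.x += chunk[i].x;
        sum.y += chunk[i].y;
        sum.z += chunk[i].z;
        sum.w += chunk[i].w;
    }
    return sum;
}

/* Collapse the four lanes into one scalar once the loop is done. */
float horizontal_sum(vec4 v)
{
    return v.x + v.y + v.z + v.w;
}
```

In a real kernel the four component additions would be a single `float4` addition, which a CPU OpenCL compiler can typically map onto SIMD instructions.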
Using a work-group of more than one work-item and doing a parallel reduction over it is just horribly inefficient on a CPU. The CPU is not a GPU; program it like a CPU. No kernel needs more than one work-item per work-group except by design: you can always replace the work-items with a loop, and on a CPU that will usually be more efficient anyway.