Hello,
I am using OpenCL on my Apple MacBook Pro, which has a GPGPU-capable graphics card and a 2.66 GHz Intel Core 2 Duo, and I want to run OpenCL on both the CPU and the GPU. It works fine on the GPU and also on the CPU, except for one problem when running on the CPU:
The work-group size returned by the OpenCL device query is 1, which means there will be only one work-item (thread) per work-group (thread block). So how can I implement, e.g., a reduction, or the many other kernels that need more than one work-item per work-group, with the CPU OpenCL implementation? Please advise, as I could not find any help on this.
Thanks in advance!
--
Usman.
On the CPU, you can use atomics to synchronize globally across work-groups. This lets you do a parallel reduction: just read and write global memory. Since memory is cached on the CPU, there isn't much benefit to using local memory anyway.
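To make the atomics idea concrete, here is a minimal sketch in plain C11 rather than OpenCL C. It only emulates the pattern: each "work-group" holds a single work-item, sums its own slice privately, and then publishes the result with one atomic add to a shared accumulator. The names work_group_body, reduce_with_atomics and items_per_group are made up for illustration. In a real kernel the accumulator would be a __global int updated with atomic_add (atom_add under the OpenCL 1.0 extension), which as far as I know is integer-only in OpenCL 1.x, and the runtime would run the groups concurrently instead of in a loop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for one work-group that contains a single work-item.
   It sums its own slice into a private variable, then publishes that partial
   sum with one atomic add to the shared ("global memory") accumulator. */
static void work_group_body(size_t group_id, size_t items_per_group,
                            const int *input, size_t n, atomic_int *global_sum)
{
    size_t begin = group_id * items_per_group;
    size_t end = begin + items_per_group;
    if (begin > n) begin = n;   /* clamp the last (partial) slice */
    if (end > n) end = n;

    int partial = 0;
    for (size_t i = begin; i < end; ++i)
        partial += input[i];

    /* One atomic update per group, not one per element. */
    atomic_fetch_add(global_sum, partial);
}

/* Driver that plays the role of the runtime. The groups run one after another
   here; the atomic add is what would keep the result correct if they ran
   concurrently. */
int reduce_with_atomics(const int *input, size_t n, size_t items_per_group)
{
    assert(items_per_group > 0);

    atomic_int global_sum;
    atomic_init(&global_sum, 0);

    size_t num_groups = (n + items_per_group - 1) / items_per_group;
    for (size_t g = 0; g < num_groups; ++g)
        work_group_body(g, items_per_group, input, n, &global_sum);

    return atomic_load(&global_sum);
}
```

Accumulating privately and touching the shared counter once per group keeps contention on the atomic low, which matters more as the number of groups grows.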
Thanks for your reply.
So on the CPU (no matter how many cores it has), we get only one work-item per work-group?
I seem to recall that being the case in Apple's OpenCL CPU implementation. I think AMD's allows more threads in a workgroup, but I'm not sure. Unfortunately, there's no real way to enforce that threads execute in lock-step on a CPU.
Does AMD write the GPU driver for Apple, or does Apple write the whole driver in-house?
I'm going to take a different approach here. If you want to do a reduction on the CPU, do:
    for (int i = 0; i < number; ++i)
        sum = sum + input[i + groupid];
where sum is a vec4.
Using a work-group of more than one work-item and doing a parallel reduction over it is just horribly inefficient on a CPU. The CPU is not a GPU; program it like a CPU. No kernel needs more than one work-item per work-group except by design: you can always replace the extra work-items with a loop, and on a CPU that will usually be more efficient anyway.
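For anyone who wants to see the loop approach spelled out, here is a rough sketch in plain C (not OpenCL C). It assumes each work-group owns a contiguous block of number vectors, so the index is groupid * number + i; that layout is my assumption, not something stated above. The vec4 struct stands in for OpenCL's float4, and reduce_group, reduce_all and partial are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-wide vector type standing in for OpenCL's float4. */
typedef struct { float x, y, z, w; } vec4;

static vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* What the single work-item of work-group `groupid` would do: loop over its
   own block of `number` vectors (assumed contiguous) and store one partial sum. */
void reduce_group(size_t groupid, size_t number, const vec4 *input, vec4 *partial)
{
    vec4 sum = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (size_t i = 0; i < number; ++i)
        sum = vec4_add(sum, input[groupid * number + i]);
    partial[groupid] = sum;
}

/* Final pass, e.g. on the host: run every group, then fold the per-group
   vector sums into a single scalar. */
float reduce_all(const vec4 *input, size_t num_groups, size_t number, vec4 *partial)
{
    assert(input != NULL && partial != NULL);

    float total = 0.0f;
    for (size_t g = 0; g < num_groups; ++g) {
        reduce_group(g, number, input, partial);
        total += partial[g].x + partial[g].y + partial[g].z + partial[g].w;
    }
    return total;
}
```

With one work-item per work-group there is nothing to synchronize inside a group, so no barriers or local memory are involved; the only cross-group step is the final fold over the partial sums.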