Archives Discussions

enliten · 01-11-2011

Hello,

I am using OpenCL on my Apple MacBook Pro, which has a GPGPU-capable graphics card and a 2.66 GHz Intel Core 2 Duo, and I want to run OpenCL on both the CPU and the GPU. It works fine on the GPU, and also on the CPU except for one problem:
The maximum work-group size returned by the OpenCL device query for the CPU is 1, which means there can be only one work-item (thread) per work-group (thread block). So how can I do, e.g., a reduction, or the many other kernels that need more than one work-item per work-group, on the CPU OpenCL implementation? Please let me know, as I could not find any help on this.

Thanks in advance!

--
Usman.

rick_weber · 01-11-2011

On the CPU, you can use atomics to globally synchronize across work groups. This would allow you to do a parallel reduction. Just read and write to global memory. Since memory is cached on the CPU, there isn't much benefit to using local memory anyways.
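As a rough illustration of the atomics approach described above, here is a minimal sketch in plain C11 (not OpenCL kernel code). It assumes each work-group holds a single work-item that sums its own contiguous chunk and then publishes the partial sum with one atomic add. The function names, the chunk layout, and the use of int data are assumptions made for the example; int is used because the built-in atomic add in OpenCL 1.x works on integers, not floats. The groups run one after another here, whereas on a real device they would run concurrently.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* One single-work-item "work-group" (hypothetical helper): sum a private
   chunk, then publish it with one atomic add to the shared accumulator,
   which would live in global memory on a real device. */
static void group_reduce_atomic(const int *input, size_t group_id,
                                size_t chunk, atomic_int *result) {
    int partial = 0;
    for (size_t i = 0; i < chunk; ++i)
        partial += input[group_id * chunk + i];
    atomic_fetch_add(result, partial);
}

/* Sequential stand-in for launching num_groups work-groups. */
static int reduce_with_atomics(const int *input, size_t num_groups,
                               size_t chunk) {
    assert(chunk > 0);
    atomic_int result;
    atomic_init(&result, 0);
    for (size_t g = 0; g < num_groups; ++g)
        group_reduce_atomic(input, g, chunk, &result);
    return atomic_load(&result);
}
```

Doing one atomic add per group, rather than one per element, keeps contention on the shared accumulator low.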

enliten · 01-11-2011

Thanks for your reply.

Does this mean that on the CPU (no matter how many cores it has) we get only one work-item per work-group?

rick_weber · 01-11-2011

I seem to recall that being the case in Apple's OpenCL CPU implementation. I think AMD's allows more threads in a workgroup, but I'm not sure. Unfortunately, there's no real way to enforce that threads execute in lock-step on a CPU.

MicahVillmow · 01-11-2011

enliten,
All issues with the Mac version of OpenCL need to be directed to Apple. Their implementation is quite different from ours, and only they can answer questions specific to it. This is true for AMD GPUs on Apple hardware as well.

nou · 01-11-2011

Does AMD write the GPU driver for Apple, or does Apple write the whole driver in-house?

MicahVillmow · 01-11-2011

AMD works with Apple to provide drivers for AMD products, but support is 100% an Apple issue.

LeeHowes · 01-12-2011

I'm going to take a different approach to this point. If you want to do a reduction on the CPU, do:

    for (int i = 0; i < number; ++i)
        sum = sum + input[i + groupid];

where sum is a vec4.

Using a work-group of more than one work-item and doing a parallel reduction over it is just horribly inefficient on a CPU. The CPU is not a GPU; program it like a CPU. There is no kernel where you need more than one work-item except by design: you can always replace the work-items with a loop, and on a CPU that will usually be more efficient anyway.
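To make the loop-based reduction above concrete, here is a minimal sketch in plain C that emulates it on the host. The vec4 struct is a hypothetical stand-in for OpenCL's float4, and the helper names are invented for the example. It also assumes each group owns a contiguous chunk starting at group_id * chunk, which is one possible layout and not necessarily the indexing intended in the post.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical 4-wide float vector standing in for OpenCL's float4. */
typedef struct { float x, y, z, w; } vec4;

static vec4 vec4_add(vec4 a, vec4 b) {
    vec4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* Work done by one work-group containing a single work-item:
   a plain loop over that group's contiguous chunk of the input. */
static vec4 reduce_group(const vec4 *input, size_t group_id, size_t chunk) {
    vec4 sum = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (size_t i = 0; i < chunk; ++i)
        sum = vec4_add(sum, input[group_id * chunk + i]);
    return sum;
}

/* Host-side emulation: run every group in turn, combine the partial
   vector sums, then add the four lanes into one scalar. */
static float reduce_all(const vec4 *input, size_t num_groups, size_t chunk) {
    assert(chunk > 0);
    vec4 total = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (size_t g = 0; g < num_groups; ++g)
        total = vec4_add(total, reduce_group(input, g, chunk));
    return total.x + total.y + total.z + total.w;
}
```

With one group per core, each core streams through its own chunk with vector adds and no barriers, which is the CPU-style structure the post recommends.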


OpenCL on Intel Core 2 duo CPU work group size problem