enliten
Journeyman III

OpenCL on Intel Core 2 duo CPU work group size problem

Hello,

I am using OpenCL on my Apple MacBook Pro, which has a GPGPU-capable graphics card and a 2.66 GHz Intel Core 2 Duo, and I want to run OpenCL on both the CPU and the GPU. It works fine on the GPU, and also on the CPU except for one problem:
The work-group size returned by the OpenCL device query for the CPU is 1, which means there will be only one thread in a thread block. So how can I do, for example, a reduction, or the many other kernels that need more than one work-item per work-group, with the CPU OpenCL implementation? Please let me know, as I could not find any help on this.

Thanks in advance!

--
Usman.

7 Replies
rick_weber
Adept II

On the CPU, you can use atomics to synchronize globally across work-groups. This would allow you to do a parallel reduction: just read and write global memory. Since memory is cached on the CPU, there isn't much benefit to using local memory anyway.
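
To make the suggestion above a bit more concrete, here is a minimal sketch of the pattern. It is written in plain C11 rather than OpenCL C so that it compiles without an OpenCL runtime. Each call to `work_group_body` stands in for one work-group holding a single work-item, and `atomic_fetch_add` plays the role that an atomic add on a `__global int` would play in a kernel. The function names, the contiguous chunk layout and the integer data are all assumptions made for illustration; none of them come from the thread.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Stand-in for one work-group that contains a single work-item.
   It sums its own chunk privately, then publishes the partial sum
   with one atomic add on a shared ("global") accumulator. */
void work_group_body(size_t group_id, size_t chunk,
                     const int *input, size_t n,
                     atomic_int *global_sum)
{
    size_t begin = group_id * chunk;
    size_t end = (begin + chunk < n) ? begin + chunk : n;
    int partial = 0;

    for (size_t i = begin; i < end; ++i)
        partial += input[i];

    atomic_fetch_add(global_sum, partial);
}

/* Hypothetical driver. The groups run one after another here; a real
   runtime may run them concurrently, which is why the add is atomic. */
int reduce_with_atomics(const int *input, size_t n, size_t num_groups)
{
    assert(num_groups > 0);

    size_t chunk = (n + num_groups - 1) / num_groups; /* ceil(n / num_groups) */
    atomic_int global_sum;
    atomic_init(&global_sum, 0);

    for (size_t g = 0; g < num_groups; ++g)
        work_group_body(g, chunk, input, n, &global_sum);

    return atomic_load(&global_sum);
}
```

Doing one atomic add per work-group, rather than one per element, keeps contention on the shared accumulator low.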

Thanks for your reply.

Does this mean that on a CPU, no matter how many cores it has, we get only one work-item per work-group?

I seem to recall that being the case in Apple's OpenCL CPU implementation. I think AMD's allows more threads in a workgroup, but I'm not sure. Unfortunately, there's no real way to enforce that threads execute in lock-step on a CPU.

enliten,
All issues with the Mac version of OpenCL need to be directed to Apple. Their implementation is quite different from ours, and only they can answer questions specific to it. This is also true for AMD GPUs on Apple hardware.

Does AMD write the GPU driver for Apple, or does Apple write the whole driver in house?

AMD works with Apple to provide drivers for AMD products, but support is 100% an Apple issue.

I'm going to take a different approach to this point. If you want to do a reduction on the CPU, do:

  for( int i = 0; i < number; ++i )
    sum = sum + input[i + groupid];

where sum is a vec4.

Using a work-group of more than one work-item and doing a parallel reduction over it is just horribly inefficient. The CPU is not a GPU, so program it like a CPU. No kernel needs more than one work-item except by design: you can always replace the extra work-items with a loop, and on a CPU that will usually be more efficient anyway.
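
For readers who want to try the loop-per-work-item idea, here is a minimal sketch in plain C. The `vec4` struct stands in for a 4-wide vector type such as OpenCL's `float4`. `reduce_group` mirrors what one single-work-item work-group would do, and `reduce_all` plays the host, folding the per-group partial sums into one scalar. The function names, the caller-provided `partial` buffer and the indexing `groupid * number + i` (each group owning a contiguous block of `number` vectors) are assumptions for illustration; the post itself indexes with `input[i + groupid]` and does not spell out the data layout.

```c
#include <assert.h>
#include <stddef.h>

/* Plain-C stand-in for a 4-wide vector type such as OpenCL's float4. */
typedef struct { float x, y, z, w; } vec4;

vec4 vec4_add(vec4 a, vec4 b)
{
    vec4 r = { a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w };
    return r;
}

/* Work done by one single-work-item work-group: a serial loop over its
   own block of `number` vectors, producing one partial sum. */
void reduce_group(size_t groupid, size_t number,
                  const vec4 *input, vec4 *partial)
{
    vec4 sum = { 0.0f, 0.0f, 0.0f, 0.0f };

    for (size_t i = 0; i < number; ++i)
        sum = vec4_add(sum, input[groupid * number + i]);

    partial[groupid] = sum;
}

/* Host side: run every group, then fold the partial sums and their four
   lanes into one scalar. `partial` must hold num_groups entries. */
float reduce_all(const vec4 *input, size_t num_groups, size_t number,
                 vec4 *partial)
{
    assert(input != NULL && partial != NULL);

    for (size_t g = 0; g < num_groups; ++g)
        reduce_group(g, number, input, partial);

    vec4 total = { 0.0f, 0.0f, 0.0f, 0.0f };
    for (size_t g = 0; g < num_groups; ++g)
        total = vec4_add(total, partial[g]);

    return total.x + total.y + total.z + total.w;
}
```

With one work-group per core, each core streams through its own block with no synchronization until the final fold over the partial sums.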
