Archives Discussions

Atmapuri · 05-18-2011

Hi!

I am testing this kernel:

int gid = get_global_id(0);
Dst[DstIdx + gid] = Src1[Src1Idx + gid] + Src2[Src2Idx + gid];

If I don't specify a local work size, the kernel runs fast for any vector size (get_global_size(0)), including prime-number sizes, for which no usable local_work_size greater than 1 divides the global size.

If I do specify local_work_size, then for prime vector lengths it has to be 1 (the only value that divides the global size), and performance takes a steep dive (100x).

So I was wondering: what value does clEnqueueNDRangeKernel use for local_work_size when the user specifies none and the global size is a prime number?

Thanks!
Atmapuri

MicahVillmow · 05-18-2011

If you do not set the local work size, we use a default size, which is device dependent but usually 256 (or 64). By setting it to 1, you are effectively using only 1/64th, 1/32nd, or 1/16th of each wavefront on the machine, depending on the device.
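
To put a number on the wavefront point above, here is a minimal sketch in plain C. The function name lane_utilization is hypothetical, and it assumes a simplified model in which every work-group occupies whole wavefronts; real hardware scheduling may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of SIMD lanes doing useful work when a work-group of
 * `local_size` work-items is mapped onto wavefronts of
 * `wavefront_size` lanes. Simplifying assumption: each work-group
 * occupies a whole number of wavefronts. */
double lane_utilization(size_t local_size, size_t wavefront_size)
{
    assert(local_size > 0 && wavefront_size > 0);
    /* Number of wavefronts needed to hold one work-group (rounded up). */
    size_t wavefronts = (local_size + wavefront_size - 1) / wavefront_size;
    return (double)local_size / (double)(wavefronts * wavefront_size);
}
```

Under this model, a local size of 1 on a 64-wide wavefront uses 1/64 of the lanes, while a local size of 256 uses all of them.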

Atmapuri · 05-18-2011

That I understand, but I can't set it to more than 1 if the vector length is a prime number. clEnqueueNDRangeKernel returns "invalid work group size", which is exactly what is documented. If I don't specify anything, it works fine and fast. Does the driver internally have the ability to set the work-group size to something that clEnqueueNDRangeKernel would otherwise not accept?
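
The divisibility constraint described above can be illustrated with a small plain-C sketch. The helper name largest_valid_local_size is hypothetical, and the maximum work-group size is passed in as a parameter rather than queried from a device.

```c
#include <assert.h>
#include <stddef.h>

/* Largest local size that evenly divides `global_size` and does not
 * exceed `max_work_group_size`. This mirrors the divisibility rule
 * discussed in the thread; it is not a driver routine. */
size_t largest_valid_local_size(size_t global_size, size_t max_work_group_size)
{
    assert(global_size > 0 && max_work_group_size > 0);
    size_t start = global_size < max_work_group_size ? global_size
                                                     : max_work_group_size;
    /* Walk down from the largest candidate until one divides evenly. */
    for (size_t candidate = start; candidate > 1; --candidate) {
        if (global_size % candidate == 0)
            return candidate;
    }
    return 1; /* 1 always divides the global size */
}
```

For a prime global size such as 1009 with a limit of 256, the only divisor that fits is 1, which is the situation the question describes.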

rick_weber · 05-18-2011

You can almost always set the local size to whatever you want. Round the global size up to a multiple of the local size, pass the vector length to the kernel, and put an if (get_global_id(0) < length) around your addition.
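
A minimal sketch of the padding-plus-guard approach suggested above, written as plain C that emulates the launch on the CPU so it can be run without an OpenCL device. All function names here are hypothetical; in real code the guard would sit inside the OpenCL kernel and the rounded-up size would be passed as global_work_size.

```c
#include <assert.h>
#include <stddef.h>

/* Round `length` up to the next multiple of `local_size`. */
size_t round_up_global_size(size_t length, size_t local_size)
{
    assert(local_size > 0);
    return ((length + local_size - 1) / local_size) * local_size;
}

/* CPU stand-in for one work-item of the guarded kernel:
 * work-items past the end of the vector do nothing. */
void guarded_add(size_t gid, size_t length,
                 float *dst, const float *src1, const float *src2)
{
    if (gid < length)
        dst[gid] = src1[gid] + src2[gid];
}

/* Emulates the NDRange launch by looping over the padded global size. */
void launch_guarded_add(size_t length, size_t local_size,
                        float *dst, const float *src1, const float *src2)
{
    size_t global_size = round_up_global_size(length, local_size);
    for (size_t gid = 0; gid < global_size; ++gid)
        guarded_add(gid, length, dst, src1, src2);
}
```

With a vector length of 1009 and a local size of 64, the global size becomes 1024, and the 15 extra work-items fall through the guard without touching memory.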

MicahVillmow · 05-18-2011

Atmapuri,
Yes. The CL API applies restrictions that are not necessarily restrictions in hardware.

Atmapuri · 05-19-2011

I see. Thanks for the confirmation. As a side note, Intel is releasing OpenCL Beta drivers and recommends to "leave the local_size undefined", with which (considering the CPU as the target) I can agree.

maximmoroz · 05-19-2011

Originally posted by: Atmapuri I see. Thanks for the confirmation. As a side note, Intel is releasing OpenCL Beta drivers and recommends to "leave the local_size undefined", with which (considering the CPU as the target) I can agree.

Intel OpenCL Alpha also had that sentence, and I wondered how to reconcile it with this statement from the AMD APP OpenCL Programming Guide:

it is recommended that the developer explicitly specify the global (#work-groups) and local (#work-items/work-group) dimensions, rather than rely on OpenCL to set these automatically (by setting local_work_size to NULL in clEnqueueNDRangeKernel).

I ended up setting the local work size based on the kernel's preferred work-group size multiple (CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE).
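
One way this choice could look, as a rough sketch in plain C. The helper choose_local_size is hypothetical, and the two limits are passed in as plain numbers; in a real host program they would come from clGetKernelWorkGroupInfo (CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and CL_KERNEL_WORK_GROUP_SIZE). The sketch also assumes the global size is padded and the kernel is guarded against out-of-range IDs.

```c
#include <assert.h>
#include <stddef.h>

/* Pick a local size that is a multiple of `preferred_multiple`,
 * fits within `max_work_group_size`, and is not needlessly larger
 * than the (padded) problem of `length` elements. */
size_t choose_local_size(size_t preferred_multiple,
                         size_t max_work_group_size,
                         size_t length)
{
    assert(preferred_multiple > 0 && max_work_group_size > 0);

    /* Largest multiple of the preferred value that fits the limit. */
    size_t local = (max_work_group_size / preferred_multiple) * preferred_multiple;
    if (local == 0)
        local = max_work_group_size; /* preferred multiple does not fit */

    /* For small inputs, shrink to the padded problem size. */
    size_t padded = ((length + preferred_multiple - 1) / preferred_multiple)
                    * preferred_multiple;
    if (padded > 0 && padded < local)
        local = padded;

    return local;
}
```

For example, with a preferred multiple of 64 and a limit of 256, a 1009-element vector gets a local size of 256, while a 100-element vector gets 128.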

Atmapuri · 05-19-2011

I am not sure it is worth pursuing the idea of having the same kernels run on PC-based OpenCL and GPU-based OpenCL. The concepts are so different that a one-size-fits-all solution seems impossible.

For example, coalescing reads within workgroups requires a stride different from 1 on GPU. Anything other than 1 on CPU is a complete performance failure.

maximmoroz · 05-20-2011

Originally posted by: Atmapuri I am not sure it is worth pursuing the idea of having the same kernels run on PC-based OpenCL and GPU-based OpenCL. The concepts are so different that a one-size-fits-all solution seems impossible.

For example, coalescing reads within workgroups requires a stride different from 1 on GPU. Anything other than 1 on CPU is a complete performance failure.

It definitely depends on the goal and/or the kernel. One of the benefits of OpenCL is the ability to compute on different devices without any changes to the code. There are certainly cases (projects/products) where this potential benefit becomes a real one, and I would argue there are a lot of them. For example, a kernel may be designed to run mostly on the GPU but still need to be able to run on the CPU, where performance is not a high-priority issue at all.

MicahVillmow · 05-19-2011

The local work-group size is an algorithmic design decision. If you don't use local memory, then there is no point in setting it. If the algorithm is optimal for a single work-group size, then set it at compile time with the reqd_work_group_size attribute. If the algorithm can handle multiple sizes, then set it at runtime.

Local work size setting!