

Atmapuri
Journeyman III

Local work size setting!

Hi!

I am testing this kernel:

    int gid = get_global_id(0);
    Dst[DstIdx + gid] = Src1[Src1Idx + gid] + Src2[Src2Idx + gid];

If I don't specify a local work size, the kernel runs fast for any vector size (get_global_size(0)), including prime sizes, for which no local_work_size greater than 1 divides the global size evenly.

If I do specify local_work_size, then for vector lengths that are prime numbers it has to be 1 (the only usable value that divides global_size evenly), and performance takes a steep dive (100x).
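To make the divisibility constraint concrete, here is a minimal sketch in plain C (no OpenCL calls) that searches for the largest local size that evenly divides a given global size. The name largest_valid_local_size and the max_local parameter are made up for illustration; max_local stands in for whatever work-group size limit the device reports.

```c
#include <stddef.h>

/* Hypothetical helper: largest d <= max_local such that global % d == 0.
 * Falls back to 1, which divides every global size. */
static size_t largest_valid_local_size(size_t global, size_t max_local)
{
    size_t d = (max_local < global) ? max_local : global;
    for (; d > 1; --d) {
        if (global % d == 0) {
            return d;
        }
    }
    return 1;
}
```

With a limit of 256, a global size of 1024 gives 256 and a global size of 1000 gives 250, but the prime 1021 leaves only 1.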

So I was wondering: what value does clEnqueueNDRangeKernel use for "local_work_size" when the user specifies none and the global size is a prime number?

Thanks!
Atmapuri

 

0 Likes
9 Replies

If you do not set the local work size, we use a default size, which is device dependent but usually 256 (or 64). By setting it to 1, you are effectively using only 1/[64|32|16]th of each wavefront on the machine.
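The wavefront arithmetic can be sketched in plain C. This assumes a simplified model in which every work-group occupies whole wavefronts and the unused lanes sit idle; the name lane_utilization is hypothetical, and both arguments are assumed to be greater than zero.

```c
#include <stddef.h>

/* Fraction of wavefront lanes doing useful work for one work-group,
 * under the simplified model that a group is padded to whole wavefronts. */
static double lane_utilization(size_t local_size, size_t wavefront_width)
{
    size_t wavefronts = (local_size + wavefront_width - 1) / wavefront_width;
    return (double)local_size / (double)(wavefronts * wavefront_width);
}
```

For a wavefront width of 64, a local size of 1 uses 1/64 of the lanes, while 64 or 256 use all of them.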
0 Likes

That I understand, but I can't set it to more than 1 if the vector length is a prime number. clEnqueueNDRangeKernel returns "invalid work group size", which is exactly what is documented. If I don't specify anything, it works fine and fast. Does the driver internally have the ability to set the work-group size to something that clEnqueueNDRangeKernel would otherwise not accept?

0 Likes

You can almost always set the local size to whatever you want. Round the global size up to a multiple of the local size, pass the vector length to the kernel, and put an if (get_global_id(0) < length) around your addition.
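A minimal sketch of that approach, written as plain C so it runs without an OpenCL runtime. The host loop stands in for the NDRange, and gid plays the role of get_global_id(0). All function names here are made up for illustration, and float is an assumed element type. In a real program the padded size would be passed as global_work_size, and length would be an extra kernel argument.

```c
#include <stddef.h>

/* Round n up to the next multiple of local (local must be > 0). */
static size_t round_up_global(size_t n, size_t local)
{
    return ((n + local - 1) / local) * local;
}

/* Stand-in for the kernel body: work-items past the end do nothing. */
static void vec_add_guarded(float *dst, const float *src1, const float *src2,
                            size_t length, size_t gid)
{
    if (gid < length) {
        dst[gid] = src1[gid] + src2[gid];
    }
}

/* Emulates launching the kernel over the padded global range. */
static void run_padded(float *dst, const float *src1, const float *src2,
                       size_t length, size_t local)
{
    size_t global = round_up_global(length, local);
    for (size_t gid = 0; gid < global; ++gid) {
        vec_add_guarded(dst, src1, src2, length, gid);
    }
}
```

For a prime length such as 1021 and a local size of 64, the padded global size is 1024, and the three extra work-items fall through the guard without touching memory.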

0 Likes

Atmapuri,
Yes. The CL API applies restrictions that are not necessarily restrictions in hardware.
0 Likes

I see. Thanks for the confirmation. On a side note, Intel is releasing OpenCL Beta drivers and recommends to "leave the local_size undefined", with which (considering the CPU as the target) I can agree.

0 Likes

Originally posted by: Atmapuri I see. Thanks for the confirmation. On a side note, Intel is releasing OpenCL Beta drivers and recommends to "leave the local_size undefined", with which (considering the CPU as the target) I can agree.


Intel OpenCL Alpha also had that sentence, and I wondered how to reconcile it with this statement from the AMD APP OpenCL Programming Guide:

it is recommended that the developer explicitly specify the global (#work-groups) and local (#work-items/work-group) dimensions, rather than rely on OpenCL to set these automatically (by setting local_work_size to NULL in clEnqueueNDRangeKernel).

I ended up setting the local work size based on the preferred work-group size multiple parameter (CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE).
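One possible way to turn that parameter into a local size, sketched in plain C. This is an assumption about how one might do it, not necessarily what the poster did. The two inputs would normally come from clGetKernelWorkGroupInfo (CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and CL_KERNEL_WORK_GROUP_SIZE); here they are plain arguments, and the name pick_local_size is hypothetical.

```c
#include <stddef.h>

/* Largest multiple of preferred_multiple that fits within
 * max_work_group_size; falls back to 1 if no multiple fits. */
static size_t pick_local_size(size_t preferred_multiple,
                              size_t max_work_group_size)
{
    if (preferred_multiple == 0 || preferred_multiple > max_work_group_size) {
        return 1;
    }
    return (max_work_group_size / preferred_multiple) * preferred_multiple;
}
```

Taking the largest fitting multiple is only one policy. A smaller multiple may perform better for some kernels, so it is worth measuring. The global size then needs to be a multiple of the chosen value.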

0 Likes

I am not sure it is worth pursuing the idea of having the same kernels run on CPU-based OpenCL and GPU-based OpenCL. The concepts are so different that a one-size-fits-all solution seems impossible.

For example, coalescing reads within workgroups requires a stride different from 1 on GPU. Anything other than 1 on CPU is a complete performance failure.

0 Likes

Originally posted by: Atmapuri I am not sure it is worth pursuing the idea of having the same kernels run on CPU-based OpenCL and GPU-based OpenCL. The concepts are so different that a one-size-fits-all solution seems impossible.

For example, coalescing reads within workgroups requires a stride different from 1 on GPU. Anything other than 1 on CPU is a complete performance failure.

It definitely depends on the goal and/or the kernel. One of the benefits of OpenCL is the ability to compute on different devices without any changes to the code. There are certainly cases (projects/products) where this potential benefit becomes a real one, and I would argue there are a lot of them. For example, a kernel may be designed to run mostly on the GPU but should still be able to run on the CPU, where performance is not a high priority at all.

0 Likes

The local work-group size is an algorithmic design decision. If you don't use local memory, then there is no point in setting it. If the algorithm is optimal for a single work-group size, then set it at compile time with the reqd_work_group_size attribute. If the algorithm can handle multiple sizes, then set it at runtime.
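For the compile-time case, the sketch below shows what the attribute looks like in kernel source, held in a C string as it would be before being handed to clCreateProgramWithSource. The kernel name vec_add, the float element type, and the size 64 are assumptions for illustration. The helper launch_is_valid is hypothetical and only mirrors two checks: the local size passed at enqueue time has to match the attribute, and the global size has to be a multiple of it.

```c
#include <stddef.h>

#define REQD_LOCAL_SIZE 64

/* OpenCL C source for a kernel pinned to a work-group size of 64. */
static const char *kVecAddSource =
    "__kernel __attribute__((reqd_work_group_size(64, 1, 1)))\n"
    "void vec_add(__global float *dst,\n"
    "             __global const float *src1,\n"
    "             __global const float *src2,\n"
    "             uint length)\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    if (gid < length)\n"
    "        dst[gid] = src1[gid] + src2[gid];\n"
    "}\n";

/* Hypothetical host-side sanity check before enqueueing the kernel. */
static int launch_is_valid(size_t global, size_t local)
{
    return local == REQD_LOCAL_SIZE && global % local == 0;
}
```

A launch with global size 1024 and local size 64 passes the check. A global size of 1021, or a local size of 128, does not.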
0 Likes