Hi!
I am testing this kernel:
int gid = get_global_id(0);
Dst[DstIdx + gid] = Src1[Src1Idx + gid] + Src2[Src2Idx + gid];
If I don't specify a local work size, the kernel runs fast for any vector length (get_global_size(0)), including prime lengths for which there is no local_work_size greater than 1 by which the global size is divisible.
If I do specify local_work_size, then for prime vector lengths it has to be set to 1 (the only number by which the global size is divisible), and performance takes a steep dive (100x).
So I was wondering: to which value does clEnqueueNDRangeKernel set "local_work_size" when none is specified by the user and the specified global size is a prime number?
Thanks!
Atmapuri
That I understand, but I can't set it to more than 1 if the vector length is a prime number. clEnqueueNDRangeKernel returns "invalid work group size", which is exactly what is documented. If I don't specify anything, it works fine and fast. Does the driver internally have the ability to set the work-group size to something which is otherwise not accepted by clEnqueueNDRangeKernel?
You can almost always set the local size to whatever you want. You need to pass the vector length to the kernel and put an if (get_global_id(0) < length) guard around your addition.
I see. Thanks for the confirmation. On a side note, Intel is releasing OpenCL Beta drivers and recommends to "leave the local_size undefined", with which (considering the CPU as the target) I can agree.
Originally posted by: Atmapuri I see. Thanks for the confirmation. On a side note, Intel is releasing OpenCL Beta drivers and recommends to "leave the local_size undefined", with which (considering the CPU as the target) I can agree.
The Intel OpenCL Alpha also had that sentence, and I wondered how to reconcile it with this statement from the AMD APP OpenCL Programming Guide:
it is recommended that the developer explicitly specify the global (#work-groups) and local (#work-items/work-group) dimensions, rather than rely on OpenCL to set these automatically (by setting local_work_size to NULL in clEnqueueNDRangeKernel).
I ended up setting the local work size based on the preferred work-group size multiple parameter.
I am not sure it is worth pursuing the idea of having the same kernels run on CPU-based OpenCL and GPU-based OpenCL. The concepts are so different that it seems impossible to make a one-size-fits-all solution.
For example, coalescing reads within work-groups requires a stride different from 1 on the GPU. Anything other than a stride of 1 on the CPU is a complete performance failure.
Originally posted by: Atmapuri I am not sure it is worth pursuing the idea of having the same kernels run on CPU-based OpenCL and GPU-based OpenCL. The concepts are so different that it seems impossible to make a one-size-fits-all solution.
For example, coalescing reads within work-groups requires a stride different from 1 on the GPU. Anything other than a stride of 1 on the CPU is a complete performance failure.
Definitely, it depends on the goal and/or the kernel. One of the benefits of OpenCL is the ability to compute on different devices without any changes to the code. There are certainly cases (projects/products) where this potential benefit becomes a real one, and I would argue that there are a lot of such cases: for example, when the kernel is designed to run mostly on the GPU but should also be able to run on the CPU, while performance on the CPU is not a high-priority issue at all.