I am testing this kernel:
int gid = get_global_id(0);
Dst[DstIdx + gid] = Src1[Src1Idx + gid] + Src2[Src2Idx + gid];
If I don't specify a local work size, the kernel runs fast for any vector size (get_global_size(0)), including prime sizes, for which no local_work_size other than 1 divides global_size evenly.
If I do specify local_work_size, then for prime vector lengths it ends up as 1 (the only value within the work-group size limit that divides global_size) and performance takes a steep dive (100x slower).
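To make the divisibility problem concrete, here is a minimal sketch of how I pick the local size on the host. The helper name largest_divisor_at_most and the limit of 256 in the examples are made up for illustration and are not part of the OpenCL API; the real limit would come from the device (for example CL_DEVICE_MAX_WORK_GROUP_SIZE).

```c
#include <stddef.h>

/* Hypothetical helper, not an OpenCL call: returns the largest
   work-group size <= max_wg that divides global_size evenly.
   Returns 0 if either argument is 0. */
size_t largest_divisor_at_most(size_t global_size, size_t max_wg)
{
    if (global_size == 0 || max_wg == 0)
        return 0;

    /* A work-group can never be larger than the global size. */
    size_t start = max_wg < global_size ? max_wg : global_size;

    /* Walk downwards until a divisor is found. */
    for (size_t wg = start; wg > 1; --wg) {
        if (global_size % wg == 0)
            return wg;
    }

    /* Nothing larger than 1 divides global_size (e.g. a prime
       that exceeds max_wg), so 1 is the only option left. */
    return 1;
}
```

With an assumed limit of 256, a global size of 1024 gives 256, but the prime 1021 gives 1, which is the case where the slowdown shows up.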
So I was wondering: what value does clEnqueueNDRangeKernel use for local_work_size when the user specifies none and the global size is a prime number?