Local work size setting!

Discussion created by Atmapuri on May 18, 2011
Latest reply on May 20, 2011 by maximmoroz


I am testing this kernel:

    int gid = get_global_id(0);
    Dst[DstIdx + gid] = Src1[Src1Idx + gid] + Src2[Src2Idx + gid];

If I don't specify a local work size, the kernel runs fast for any vector size (get_global_size(0)), including prime sizes, which no local_work_size other than 1 (or the size itself) divides evenly.

If I do specify local_work_size, then for prime vector lengths it ends up set to 1 (the only usable value that divides the global_size evenly), and performance takes a steep dive (100x).
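To make the arithmetic concrete, here is a minimal sketch of how a host program might pick the largest local size that divides the global size evenly. The helper name largest_dividing_local_size and the limit of 256 in the examples are assumptions for illustration only; the real limit would come from the device, and this is not necessarily how any OpenCL runtime chooses a size when none is given.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the OpenCL API): returns the largest
 * value d with 1 <= d <= max_local such that global_size % d == 0.
 * Returns 0 if either argument is 0.
 */
size_t largest_dividing_local_size(size_t global_size, size_t max_local)
{
    if (global_size == 0 || max_local == 0) {
        return 0;
    }

    /* Start from the upper bound and walk down to the first divisor. */
    size_t d = (max_local < global_size) ? max_local : global_size;
    while (global_size % d != 0) {
        --d; /* always terminates: d == 1 divides everything */
    }
    return d;
}
```

With an assumed limit of 256, a global size of 1024 gives 256 and 1000 gives 250, but a prime such as 1009 gives 1, which is exactly the case where performance collapses.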

So I was wondering: what value does clEnqueueNDRangeKernel use for local_work_size when the user specifies none and the global_size is a prime number?