__kernel void mem_clear(__global float *array, int n)
{
    int gid = get_global_id(0);
    if (gid < n)
        array[gid] = 0.0f;
}
I do not know why the runtime would refuse to run more than 32 instances of this kernel concurrently. Are you using clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE (or something like that) to determine what work-group size your kernel is able to run in?
And you are wrong about one thing: there is a benefit to running small kernels like this in larger groups. Writing to global memory is a long-latency operation, and small writes like this end up in the write buffer, which is flushed every so many cycles. But the kernel only finishes once all of its memory operations are actually done. If you have to wipe some memory, doing it in smaller groups means flushing the write cache more times than if you had launched bigger groups, which could finish faster.
Pulec,
Can you confirm whether your problem is solved?
pulec,
I think if you specify both the local and global work sizes explicitly, this error can only occur if the global work size is not divisible by the local work size.
Once my system was also reporting a lower work-group size, but later I found it was because of an environment variable, GPU_MAX_WORKGROUP_SIZE, if I remember correctly.
There is some issue with barrier() on the 4xxx series cards, so you should use the wavefront size as the maximum work-group size when you use barrier(). The wavefront size is mostly 64, and AFAIK on the RV710 it is 32.