
pulec
Journeyman III

What influences a kernel's maximum work-group size?

Hi, I'd be interested to know whether anyone knows what influences a kernel's maximum work-group size. I know it could be the number of registers or the amount of LDS the kernel needs. But can anyone explain to me why such a simple kernel as the one listed below is limited to only 32 work-items per work-group? (For my card, the maximum work-group size is 128, and that size is indeed allowed for other, more complex kernels.)
EDIT: The code is meant just as an example; obviously there won't be any benefit from using more work-items per work-group here. But I have some other kernels that could benefit from more work-items per work-group. I'd only like to know what could cause the reduction of the allowed work-group size.

__kernel void mem_clear(__global float *array, int n)
{
    if (get_global_id(0) < n)
        array[get_global_id(0)] = 0;
}

Meteorhead
Challenger

I do not know why the API would not want to run more than 32 instances of this kernel concurrently. Are you using clGetKernelInfo(PREFERRED_MAX_WORK_GROUP_SIZE) (or something like that) to determine the work-group size your kernel is able to run with?

And you are wrong about one thing. There is a benefit to running small kernels like this in larger groups. Writing to global memory is a long operation, and small writes like this end up in the write buffer, which is emptied every few cycles. But the kernel only finishes once all its memory operations are actually done. If you have to wipe some memory and do it in smaller groups, the write cache has to be flushed more times than if you had launched bigger groups, which could finish faster.


Yes, I do. Moreover, the limit really has to be respected; otherwise I get a CLError with CL_INVALID_WORK_GROUP_SIZE (C++ wrapper). (Actually, I found out how to obtain the maximum work-group size only after I had received those errors with a work-group size of 128.)

Thanks for the latter information. That is something I wasn't aware of. (I don't know AMD hardware in full detail yet.)

Pulec,

Can you confirm if your problem is solved?


Well, I wouldn't call it a problem. I was just curious why I have to run such a simple kernel with only 32 work-items per work-group. That still remains the same, but it is absolutely not a problem (for me), because I wrote myself a wrapper method (it is OOP) that checks the maximum work-group size: if it is equal to or greater than the preferred work-group size, the preferred size is used; otherwise it picks the maximum size allowed for the kernel.
(Anyway, that project is just a school assignment, and I haven't had time to play with it since then.)
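
Editor's sketch: the fallback rule described in this post could look roughly like the plain C function below. The function and parameter names are hypothetical and are not taken from the poster's wrapper. In a real host program, the per-kernel limit would typically come from clGetKernelWorkGroupInfo with CL_KERNEL_WORK_GROUP_SIZE; here it is simply passed in as a number.

```c
#include <assert.h>
#include <stddef.h>

/* Pick the work-group size for a launch (hypothetical helper).
 * preferred  - size the application would like to use
 * kernel_max - per-kernel limit reported by the runtime
 * Returns the preferred size when the kernel allows it,
 * otherwise the largest size the kernel allows. */
size_t choose_local_size(size_t preferred, size_t kernel_max)
{
    assert(preferred > 0 && kernel_max > 0);
    return (kernel_max >= preferred) ? preferred : kernel_max;
}
```

With the numbers from this thread, a preferred size of 128 and a kernel limit of 32 would give 32.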

pulec,

I think that if you specify the local and global work sizes explicitly, this error can only occur if the global work size is not divisible by the local work size.

Once my system was also reporting a lower work-group size, but later I found it was because of an environment variable, GPU_MAX_WORKGROUP_SIZE if I remember correctly.
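
Editor's sketch: one common way to satisfy the divisibility requirement mentioned above is to round the global work size up to the next multiple of the local work size. The helper name below is hypothetical. It assumes the kernel guards against out-of-range work-items, as the mem_clear kernel in the first post does with its get_global_id(0) < n check, so the padding work-items do nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of local_size so that the global
 * work size is evenly divisible by the work-group size
 * (hypothetical helper, not part of the OpenCL API). */
size_t padded_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    size_t remainder = n % local_size;
    return (remainder == 0) ? n : n + (local_size - remainder);
}
```

For example, clearing 1000 floats with 32 work-items per work-group would launch 1024 work-items, and the last 24 would fail the bounds check and write nothing.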

 


Well, I haven't set this variable, so it shouldn't be the problem.
Anyway, I have just run some examples from the SDK, and some of them also used a fallback value of 32 work-items (while the others used 128). Namely, BinomialOption, DCT and FFT used only 32.
But I have a rather old and low-end card (beta-supported Mobility HD 4570/RV710), so it probably isn't an issue on newer or higher-end cards.

EDIT: According to the Khronos man pages, this can be exactly the case in which CL_INVALID_WORK_GROUP_SIZE is returned (when CL_DEVICE_MAX_WORK_GROUP_SIZE is exceeded). http://www.khronos.org/registr...ueueNDRangeKernel.html
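
Editor's sketch: the two causes of CL_INVALID_WORK_GROUP_SIZE discussed in this thread (a local size above the allowed maximum, and a global size that is not divisible by the local size) can be checked on the host before enqueueing. The enum and function below are hypothetical and only mirror those two conditions for a one-dimensional launch. They are not a full restatement of the specification, which lists further cases.

```c
#include <assert.h>
#include <stddef.h>

/* Result of a pre-launch sanity check (hypothetical names). */
typedef enum {
    LAUNCH_OK = 0,
    LAUNCH_LOCAL_ZERO,
    LAUNCH_LOCAL_TOO_LARGE,
    LAUNCH_GLOBAL_NOT_DIVISIBLE
} launch_check;

/* Check a 1-D launch against the limit that applies to it.
 * max_local_size is whichever limit is smaller for this kernel,
 * e.g. the per-kernel value rather than the device-wide one. */
launch_check check_launch_1d(size_t global_size, size_t local_size,
                             size_t max_local_size)
{
    assert(max_local_size > 0);
    if (local_size == 0)
        return LAUNCH_LOCAL_ZERO;
    if (local_size > max_local_size)
        return LAUNCH_LOCAL_TOO_LARGE;
    if (global_size % local_size != 0)
        return LAUNCH_GLOBAL_NOT_DIVISIBLE;
    return LAUNCH_OK;
}
```

Under these assumptions, a launch of 1024 work-items in groups of 128 against a limit of 32 would be reported as LAUNCH_LOCAL_TOO_LARGE, which matches the situation described in the first post.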

There is an issue with barrier() on 4xxx-series cards, so you can use the default wavefront size as the maximum work-group size when you use barrier(). The wavefront size is mostly 64, and AFAIK on RV710 it is 32.

pulec
Journeyman III

Thank you for the information. It is likely.