Say I want to allocate X bytes of local memory, and clGetDeviceInfo reports that the maximum local memory size is Y.
I also have a one-dimensional global work size of T and a one-dimensional local work size of L, so I have W = T/L work-groups.
How do I calculate the effective amount of local memory consumed? Is it just X (per work-group), or is it X*W (summed across all work-groups)?
I have an AMD HD4850, and I have implemented the following example:
- Number of work-items (1D): 384 000
- Number of work-items per work-group (1D): 256 (the maximum for my GPU)
- Local memory used: 40 bytes
- Maximum local memory: 16384 bytes
In this scenario, clEnqueueNDRangeKernel returns the error CL_INVALID_WORK_GROUP_SIZE. Interestingly, if the number of work-items per work-group is set to 64 instead, it works fine.
What am I missing?
Thanks for your replies.
Edit: By the way, CPU execution works fine in all of the above scenarios... is this a driver bug?