Hi,
I'm the maintainer of Pyrit and am currently trying to make my OpenCL code work smoothly with AMD's implementation. I'm running Ubuntu 9.04 with the 9.12-hotfix on an HD4850.
While investigating performance problems when running my code on the GPU device, I found that CPU usage spikes whenever the global work size (global_work_size) is not a multiple of 256. The total performance of the kernel drops by ~80% in that case...
Since I cannot predict the global work size in advance, I currently work around this by rounding global_work_size up to a multiple of CL_DEVICE_MAX_WORK_GROUP_SIZE before calling clEnqueueNDRangeKernel(). While this makes the kernel compute up to GROUP_SIZE - 1 "dummy" units, it saves me from the performance penalty described above.
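To illustrate the rounding, here is a minimal sketch in plain C. The helper name round_up_to_multiple is hypothetical (it is not part of OpenCL or Pyrit), and I assume the alignment value has already been queried from the device:

```c
#include <stddef.h>

/* Hypothetical helper, not part of the OpenCL API.
 * Rounds n up to the next multiple of `multiple`.
 * Returns n unchanged if it is already aligned or if multiple is 0. */
static size_t round_up_to_multiple(size_t n, size_t multiple)
{
    if (multiple == 0)
        return n;  /* nothing sensible to align to */

    size_t remainder = n % multiple;
    return remainder == 0 ? n : n + (multiple - remainder);
}
```

With an alignment of 256, a request for 1000 work-items becomes 1024, so 24 of them are "dummy" units. The padded value is what gets passed as global_work_size to clEnqueueNDRangeKernel().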
Using CL_DEVICE_MAX_WORK_GROUP_SIZE (which happens to be 256 for the HD4850) seemed more reasonable than just hardcoding an alignment of 256. However, I do not actually know whether the performance drop is connected to the global work size not being a multiple of that value...