Unaligned kernel size seems to cause performance penalty

Discussion created by ebfe on Jan 17, 2010
Latest reply on Jan 17, 2010 by nou


I'm the maintainer of Pyrit and am currently trying to make my OpenCL code work smoothly with AMD's implementation. I'm running Ubuntu 9.04 with the 9.12 hotfix on an HD4850.

While investigating performance problems with my code on the GPU device, I found that CPU usage spikes whenever the size of the kernel (global_work_size) is not a multiple of 256. The total performance of the kernel drops by ~80% in that case...

Since I cannot predict the size of the kernel, I currently work around this by rounding global_work_size up to the next multiple of CL_DEVICE_MAX_WORK_GROUP_SIZE before calling clEnqueueNDRangeKernel(). While this causes the kernel to compute up to GROUP_SIZE - 1 "dummy" units, it saves me from the performance penalty described above.
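To illustrate the rounding, here is a minimal sketch in C. It is not the actual Pyrit code: the helper name round_up_to_multiple is made up for this post, and I assume the group size has already been queried from the device (for example via clGetDeviceInfo() with CL_DEVICE_MAX_WORK_GROUP_SIZE).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of OpenCL or Pyrit): round n up to the
 * next multiple of `multiple`. Values that are already aligned are
 * returned unchanged. */
static size_t round_up_to_multiple(size_t n, size_t multiple)
{
    assert(multiple > 0);
    size_t remainder = n % multiple;
    return remainder == 0 ? n : n + (multiple - remainder);
}

/* Intended use, assuming group_size holds the device's
 * CL_DEVICE_MAX_WORK_GROUP_SIZE (256 on the HD4850):
 *
 *     size_t padded = round_up_to_multiple(work_items, group_size);
 *
 * and then passing &padded as global_work_size to
 * clEnqueueNDRangeKernel(). The padded - work_items extra work-items
 * are the "dummy" units; the buffers must be large enough for them,
 * or the kernel has to skip them. */
```

With a group size of 256, a request for 1000 work-items is padded to 1024, so 24 dummy units get computed.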

Using CL_DEVICE_MAX_WORK_GROUP_SIZE (which happens to be 256 for the HD4850) seemed more reasonable than just hardcoding an alignment of 256. However, I do not actually know whether the performance drop is connected to the kernel size not being a multiple of that particular value...