cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

ebfe
Journeyman III

Unaligned kernel size seems to cause performance penalty

Hi,

I'm the maintainer of Pyrit and currently try to make my OpenCL-code work smoothly with AMD's implementation. I'm running Ubuntu 9.04 with 9.12-hotfix using a HD4850.

Investigating performance problems running my code on the GPU-device, I found out that there is a spike of CPU-usage whenever the size of the kernel (global_work_size) is not a multiple of 256. The total performance of the kernel drops by ~80% in such case...

Since the size of the kernel is not predictable for me, I currently solve this by aligning global_work_size to CL_DEVICE_MAX_WORK_GROUP_SIZE before calling clEnqueueNDRangeKernel(). While this causes the kernel to compute up to GROUP_SIZE - 1 "dummy" units, it saves me from the performance penalty described above.

Using CL_DEVICE_MAX_WORK_GROUP_SIZE (which happens to be 256 for the HD4850) seemed more reasonable than just hardcoding an alignment to 256. I actually do not know if there is a connection between the performance drop and not aligning the size of the kernel to that value...

0 Likes
3 Replies
nou
Exemplar

global work size must be divisible by local work size. workgroups are runing in wavefronts which size is 64 for ATI GPU. for nvidia it is 32. for optimal performance workgroup size should be multiple of 64. read more here http://developer.amd.com/gpu/ATIStreamSDK/assets/ATI_Stream_SDK_Performance_Notes.pdf

so runing CL_DEVICE_MAX_WORK_GROUP_SIZE is imho good idea.

0 Likes
ebfe
Journeyman III

I see your point. The kernel I'm referring to operates on 1-dimensional items so I actually do not need work groups. Therefor local_work_size is NULL when calling clEnqueueNDRangeKernel().

The OpenCL-specification says

local_work_size can also be a NULL value in which case the OpenCL implementation will determine how to be break the global work-items into appropriate work-group instances.

 

IMHO the implementation should go ahead an align the total size of the kernel to a size of a wavefront itself. This is possible as each work-group for this kernel is defined to contain only one item. Therefor the dummy-values that some execution paths compute on will not interfer with real values.

0 Likes

no he cant go ahead. even if you specify NULL in local work size you can use local memory thus interfere with computation. and you maybe do not have 1:1 computation pattern you can write result quite random thus write from dummy work item to regular ouput.

0 Likes