Global work-offset implies performance hit?

I have a kernel that processes a large image (OpenCL 1.1, data type is image2d_t). Sometimes I only want to process a region of this image. The obvious solution is to use a global work-offset. I would expect this to yield a performance gain, but so far I only get worse execution time with non-zero offsets!


Image is 4096x4096 pixels. Local work size is 8x8.

A: Entire image processed, no offset:

globalWorkSize = { 4096, 4096 };
globalWorkOffset = { 0, 0 };
Execution time is 38 seconds

B: Sub-image processed using offsets:

globalWorkSize = { 3296, 3296 };
globalWorkOffset = { 400, 400 };
Execution time is 58 seconds

C: Cropped image at 3296x3296 pixels, no offset:
globalWorkSize = { 3296, 3296 };
globalWorkOffset = { 0, 0 };
Execution time is 28 seconds

Can somebody please explain why I get these results? Makes no sense!

6 Replies

There will be a problem with the local work size. Make the offset 512, 512.


How come? 400 is also divisible by 8.

Please tell me how I can calculate necessary offset padding to get decent performance. Thanks!
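For reference, a minimal sketch in plain C of how a region could be padded outward to a chosen alignment. The thread does not confirm that any particular alignment fixes the slowdown, so `align` is an assumption to experiment with, and `range1d` and `align_range` are hypothetical names. The kernel would still have to skip pixels that fall outside the region actually wanted.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: one dimension of an NDRange launch. */
typedef struct {
    size_t offset; /* candidate value for the global work-offset */
    size_t size;   /* candidate value for the global work-size   */
} range1d;

/* Expand [start, start + length) outward so that both ends land on
 * multiples of `align`, clamped to `limit` (the image extent).
 * `align` is assumed to be a multiple of the local work size. */
static range1d align_range(size_t start, size_t length,
                           size_t align, size_t limit)
{
    assert(align > 0 && start + length <= limit);
    size_t lo = (start / align) * align;                        /* round down */
    size_t hi = ((start + length + align - 1) / align) * align; /* round up   */
    if (hi > limit) {
        hi = limit;
    }
    range1d r = { lo, hi - lo };
    return r;
}
```

With the numbers from the question, `align_range(400, 3296, 64, 4096)` gives offset 384 and size 3328, while an alignment of 512 expands the region to the whole 4096-pixel extent, which is simply case A again.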


Have you run the profiler on your app to find out why it is behaving differently? There should be no difference between a program that uses global offsets and one that does not; the calculations for computing the IDs are the same in both cases. Most likely you are hitting cache issues, but the profiler data can give you a hint about what is going on.
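To make the ID calculation concrete: OpenCL 1.1 defines the global ID in each dimension as the group ID times the local size, plus the local ID, plus the global offset. The plain C sketch below only mirrors that arithmetic on the host; the function name `global_id` is hypothetical and nothing here measures performance.

```c
#include <assert.h>
#include <stddef.h>

/* Global ID of a work-item in one dimension, following the OpenCL 1.1
 * definition: group_id * local_size + local_id + global_offset. */
static size_t global_id(size_t group_id, size_t local_size,
                        size_t local_id, size_t global_offset)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id + global_offset;
}
```

In case B (offset 400, local size 8), the first work-item of group 0 gets global ID 400, which is the same pixel that group 50 starts at in a zero-offset launch. The offset changes which pixels a group touches, not how much arithmetic is done per work-item.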


This could be an issue with how global offsets are implemented. Can you try to launch workitems for the entire image but do masking inside the kernel, e.g.

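/* Group 400/8 = 50 is the first one inside the region, hence >= and not >. */
/* An upper bound, get_group_id(n) < (400+3296)/8, is needed to cut off the far edges. */
if (get_group_id(0) >= (400/8) && get_group_id(1) >= (400/8)) {

     do processing;

     store result;

}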
By using the group IDs you ensure that there is no divergence inside the work groups. The work groups that do not enter the processing branch should retire almost immediately, with no big performance penalty.
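As a sanity check on the group bounds, here is a small host-side simulation in plain C. It is a sketch, not kernel code: the constants come from the question (8x8 local size, 4096x4096 image, region starting at 400 and spanning 3296 pixels), and `group_is_active` and `count_active_groups` are hypothetical names standing in for the `get_group_id` test inside the kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum { LOCAL = 8, IMAGE = 4096, FIRST = 400, COUNT = 3296 };

/* True if work-group (gx, gy) lies entirely inside the region of interest.
 * Valid because FIRST and COUNT are both multiples of LOCAL. */
static bool group_is_active(size_t gx, size_t gy)
{
    const size_t lo = FIRST / LOCAL;           /* first active group, inclusive */
    const size_t hi = (FIRST + COUNT) / LOCAL; /* one past the last active group */
    return gx >= lo && gx < hi && gy >= lo && gy < hi;
}

/* Count active groups when the whole image is launched with zero offset. */
static size_t count_active_groups(void)
{
    assert(IMAGE % LOCAL == 0);
    const size_t groups = IMAGE / LOCAL;
    size_t n = 0;
    for (size_t gy = 0; gy < groups; ++gy) {
        for (size_t gx = 0; gx < groups; ++gx) {
            if (group_is_active(gx, gy)) {
                ++n;
            }
        }
    }
    return n;
}
```

The active groups run from 50 to 461 in each dimension, which is 412 x 412 groups of 64 work-items and matches the 3296 x 3296 pixels of the region. A strict `>` against 400/8 would drop group 50 and lose the first 8 rows and columns.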

So it's plausible to call this a bug in this OpenCL implementation?

Using the group ID is an interesting idea. The question is: will this be faster than just cropping the image before the kernel is executed? What would you guess? Sure, I can test this, but currently I have a working solution using a cropped image, so...


I'm not sure whether it's a bug or not. I don't have any experience with global offsets.

Regarding cropped images vs. masking in kernels: It depends on whether you use the image data just once or several times. If you use it just once you might be better off cropping, because that way you reduce the amount of data transferred to the GPU. If you're doing a rectangular crop you might also want to look into making the cropping part of the data transfer. You could create a buffer for the entire image and a sub-buffer for the rectangle you'd like to crop. Then you can transfer just the sub-buffer.
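One caveat on the transfer idea: a sub-buffer covers a contiguous byte range, so it can describe a band of full rows but not an arbitrary rectangle. For a true rectangle, OpenCL 1.1 provides `clEnqueueWriteBufferRect`, and `clEnqueueWriteImage` accepts an origin, a region and an input row pitch. The plain C sketch below shows the same row-pitch idea as a host-side crop; `crop_rect` and its parameters are hypothetical names, and it assumes a tightly packed destination.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy a width x height rectangle whose top-left pixel is (x0, y0) out of
 * a source image with rows src_pitch bytes apart, into a tightly packed
 * destination of width * height * bytes_per_pixel bytes. */
static void crop_rect(const unsigned char *src, size_t src_pitch,
                      size_t x0, size_t y0, size_t width, size_t height,
                      size_t bytes_per_pixel, unsigned char *dst)
{
    const size_t row_bytes = width * bytes_per_pixel;
    /* The rectangle must fit inside one source row. */
    assert(x0 * bytes_per_pixel + row_bytes <= src_pitch);
    for (size_t row = 0; row < height; ++row) {
        memcpy(dst + row * row_bytes,
               src + (y0 + row) * src_pitch + x0 * bytes_per_pixel,
               row_bytes);
    }
}
```

For the region in the question this would be called with `x0 = y0 = 400` and `width = height = 3296`, after which the packed result can be uploaded and processed with a zero offset, as in case C.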