6 Replies Latest reply on Sep 6, 2012 9:22 AM by dmeiser

    Global work-offset implies performance hit?

    znakeeye

      I have a kernel that processes a large image (OpenCL 1.1, data type is image2d_t). Sometimes I only want to process a region of this image. The obvious solution is to use a global work-offset. I would expect this to yield a performance gain, but so far I only get worse execution time with non-zero offsets!

       

      Example

      Image is 4096x4096 pixels. Local work size is 8x8.

      A: Entire image processed, no offset:

      globalWorkSize = { 4096, 4096 };
      globalWorkOffset = { 0, 0 };
      Execution time is 38 seconds


      B: Sub-image processed using offsets:

      globalWorkSize = { 3296, 3296 };
      globalWorkOffset = { 400, 400 };
      Execution time is 58 seconds

       

      C: Cropped image at 3296x3296 pixels, no offset:
      globalWorkSize = { 3296, 3296 };
      globalWorkOffset = { 0, 0 };
      Execution time is 28 seconds
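To make the three cases easier to compare, the timings can be normalised per work-item. The following is only a small arithmetic sketch using the numbers quoted above; the function names are made up for illustration and are not part of any OpenCL API.

```c
#include <assert.h>
#include <stdio.h>

/* Microseconds spent per work-item for a square NDRange with the given side. */
double us_per_work_item(unsigned side, double seconds)
{
    assert(side > 0);
    double items = (double)side * (double)side;
    return seconds * 1e6 / items;
}

/* Time the sub-region would take if cost scaled only with the number of
 * work-items, extrapolated from the full-image measurement. */
double scaled_seconds(unsigned full_side, double full_seconds, unsigned sub_side)
{
    assert(full_side > 0);
    double ratio = ((double)sub_side * (double)sub_side)
                 / ((double)full_side * (double)full_side);
    return full_seconds * ratio;
}

/* Prints the normalised figures for cases A, B and C. */
void print_summary(void)
{
    printf("A: %.2f us per work-item\n", us_per_work_item(4096, 38.0));
    printf("B: %.2f us per work-item\n", us_per_work_item(3296, 58.0));
    printf("C: %.2f us per work-item\n", us_per_work_item(3296, 28.0));
    printf("B/C if cost scaled with area: %.1f s\n",
           scaled_seconds(4096, 38.0, 3296));
}
```

With these inputs, A and C come out at roughly 2.3 and 2.6 microseconds per work-item, while B costs about 5.3. Pure area scaling from A would predict about 24.6 seconds for the 3296x3296 region, which is close to C and far from B.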

       

       

      Can somebody please explain why I get these results? Makes no sense!

        • Global work-offset implies performance hit?
          nou

          There will be a problem with the local work size. Try making the offset 512, 512.
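The reply does not say why 512 should help. Note that 400 and 3296 are already multiples of the 8x8 local work size (OpenCL 1.1 requires the global work size, but not the offset, to be divisible by the local size when one is given). One possible reading, and it is only an assumption, is that the offset should be aligned to some coarser granularity such as 64. A minimal sketch for checking and rounding to such a granularity, with hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if value is a whole multiple of granularity, 0 otherwise. */
int is_multiple_of(size_t value, size_t granularity)
{
    assert(granularity > 0);
    return value % granularity == 0;
}

/* Rounds value down to the nearest multiple of granularity. */
size_t round_down(size_t value, size_t granularity)
{
    assert(granularity > 0);
    return value / granularity * granularity;
}

/* Rounds value up to the nearest multiple of granularity. */
size_t round_up(size_t value, size_t granularity)
{
    assert(granularity > 0);
    return (value + granularity - 1) / granularity * granularity;
}
```

Under the assumed granularity of 64, the region starting at 400 and ending at 3696 would widen to start at `round_down(400, 64)` and end at `round_up(3696, 64)`, so the kernel would then have to skip the extra border pixels itself.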

          • Re: Global work-offset implies performance hit?
            dmeiser

            This could be an issue with how global offsets are implemented. Can you try launching work-items for the entire image and doing the masking inside the kernel, e.g.

             

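            // Work groups are 8x8, so pixels 400..3695 map to group ids 50..461.
            if (get_group_id(0) >= 400/8 && get_group_id(0) < (400 + 3296)/8 &&
                get_group_id(1) >= 400/8 && get_group_id(1) < (400 + 3296)/8) {

                 do processing;

                 store result;

            }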

             

            By using the group IDs you ensure that there is no divergence inside the work groups. The work groups that do not enter the processing branch should retire almost immediately, with no big performance penalty.
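The group-level mask can be checked on the host before it goes into a kernel. The sketch below assumes the region of case B (offset 400, size 3296, 8x8 work groups), which corresponds to group ids 50 to 461 inclusive in each dimension. The function names are hypothetical; in an actual kernel the two arguments would come from `get_group_id(0)` and `get_group_id(1)`.

```c
#include <assert.h>
#include <stddef.h>

enum { LOCAL_SIZE = 8, REGION_OFFSET = 400, REGION_SIZE = 3296 };

/* Returns 1 if work group (gx, gy) lies entirely inside the region.
 * Because offset and size are multiples of LOCAL_SIZE, a group is either
 * fully inside or fully outside, so no work group diverges. */
int group_in_region(size_t gx, size_t gy)
{
    const size_t first = REGION_OFFSET / LOCAL_SIZE;                 /* inclusive */
    const size_t end = (REGION_OFFSET + REGION_SIZE) / LOCAL_SIZE;   /* exclusive */
    return gx >= first && gx < end && gy >= first && gy < end;
}

/* Counts how many groups would do real work when the full square image
 * of side image_side is launched with no global offset. */
size_t count_active_groups(size_t image_side)
{
    const size_t groups_per_side = image_side / LOCAL_SIZE;
    size_t active = 0;
    for (size_t gy = 0; gy < groups_per_side; ++gy) {
        for (size_t gx = 0; gx < groups_per_side; ++gx) {
            if (group_in_region(gx, gy)) {
                ++active;
            }
        }
    }
    return active;
}
```

For the 4096x4096 image this leaves 412 x 412 of the 512 x 512 launched groups active, which matches the number of groups in case B. Whether the idle groups really retire cheaply depends on the device and driver, so this would need to be measured.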

              • Re: Global work-offset implies performance hit?
                znakeeye

                So it's plausible to call this a bug in this OpenCL implementation?

                 

                Using the group ID is an interesting idea. The question is: will this be faster than just cropping the image before the kernel is executed? What would you guess? Sure, I can test this, but currently I have a working solution using a cropped image, so...

                  • Re: Global work-offset implies performance hit?
                    dmeiser

                    I'm not sure whether it's a bug or not. I don't have any experience with global offsets.

                     

                    Regarding cropped images vs. masking in kernels: it depends on whether you use the image data just once or several times. If you use it just once, you might be better off cropping, because that way you reduce the amount of data transferred to the GPU. If you're doing a rectangular crop, you might also want to look into making the cropping part of the data transfer. You could create a buffer for the entire image and a sub-buffer for the rectangle you'd like to crop. Then you can transfer just the sub-buffer.
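One caveat worth checking when cropping during the transfer: a sub-buffer made with `clCreateSubBuffer` describes a contiguous byte range (an origin and a size), whereas a rectangle cut out of a wider image is strided. OpenCL 1.1 also provides `clEnqueueWriteBufferRect`, and `clEnqueueWriteImage` accepts a host row pitch, either of which may be a closer fit. The following host-side sketch only illustrates the strided copy that such a rectangular transfer amounts to; `copy_rect` and its parameters are hypothetical and not an OpenCL call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies a width x height pixel rectangle starting at (x0, y0) out of a
 * row-major source image whose rows are src_pitch bytes apart, into a
 * tightly packed destination of width * height * bytes_per_pixel bytes. */
void copy_rect(const unsigned char *src, size_t src_pitch, size_t bytes_per_pixel,
               size_t x0, size_t y0, size_t width, size_t height,
               unsigned char *dst)
{
    const size_t row_bytes = width * bytes_per_pixel;

    /* The rectangle must not run past the end of a source row. */
    assert((x0 + width) * bytes_per_pixel <= src_pitch);

    for (size_t y = 0; y < height; ++y) {
        /* Start of the rectangle within source row (y0 + y). */
        const unsigned char *row = src + (y0 + y) * src_pitch + x0 * bytes_per_pixel;
        memcpy(dst + y * row_bytes, row, row_bytes);
    }
}
```

For the case discussed in this thread, the call would use an origin of (400, 400) and a 3296x3296 region, with the source pitch being the byte width of one full 4096-pixel row.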