7 Replies Latest reply on Dec 22, 2014 9:25 PM by boxerab

    Most cache friendly tiling of OpenCL image ?

    boxerab

      I have an OpenCL image that is broken into tiles of 64x64 pixels. I am designing a kernel to run through all tiles and process the pixels. Target is AMD GCN.

      Currently, I process the tiles in raster order: left to right, top to bottom.

      Is there a better way of organizing the tiles to maximize use of image cache?

      For example, I thought about clockwise strips, starting from the origin:

       

      1  2  9 10

      4  3  8 11

      5  6  7 12

      .   .  .  13

       

      etc.

      Any ideas?
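      To make the intent concrete, here is a small host-side sketch (Python, purely for illustration; the function name is my own) that enumerates tile coordinates in the clockwise-strip order above, where strip k holds every tile with max(row, col) == k:

```python
def clockwise_strips(n):
    """Enumerate tile coordinates (row, col) of an n x n tile grid in
    L-shaped 'clockwise strips' from the origin: strip k contains all
    tiles with max(row, col) == k."""
    order = [(0, 0)]
    for k in range(1, n):
        if k % 2 == 1:
            # odd strips: down column k, then left along row k
            order += [(r, k) for r in range(k + 1)]
            order += [(k, c) for c in range(k - 1, -1, -1)]
        else:
            # even strips: right along row k, then up column k
            order += [(k, c) for c in range(k + 1)]
            order += [(r, k) for r in range(k - 1, -1, -1)]
    return order

# The first 13 entries for a 4x4 grid reproduce the numbering shown above.
```

      The kernel could consume this list as a tile-index buffer, though whether it actually beats raster order would need measuring.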

        • Re: Most cache friendly tiling of OpenCL image ?
          cgrant78@netzero.com

           Without manually rearranging the data, I don't think there is much you can do to make it more cache friendly. Remember that memory is linear, so unless you are willing to write out each block linearly in a pre-pass and then access those blocks in the kernel, there isn't much to gain. Alternatively, if the kernel is uniform (no branching), then instead of 64x64 squares you can process 64x64 runs of pixels using a 1D workgroup domain...
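           To illustrate that pre-pass idea, here is a host-side sketch (Python for clarity; `img` is a row-major pixel list and the function name is hypothetical) that copies each tile into one contiguous run, so a kernel with a 1D work domain can read tile t at offset t * tile * tile with unit stride:

```python
def linearize_tiles(img, width, height, tile=64):
    """Pre-pass: copy each tile x tile block of a row-major image into
    one contiguous run. A kernel using a 1D work domain can then read
    tile t at offset t * tile * tile with unit stride."""
    out = []
    for ty in range(height // tile):
        for tx in range(width // tile):
            for y in range(tile):
                start = (ty * tile + y) * width + tx * tile
                out.extend(img[start:start + tile])
    return out
```

           In practice you would do this copy on the device (or pay the cost once if the image is reused many times).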

          • Re: Most cache friendly tiling of OpenCL image ?
            dipak

            Hi,

             There can be many possible physical memory layouts for OpenCL images. To accelerate image accesses, the runtime can rearrange the data layout to take advantage of special hardware (say, texture buffers) present in the device. In most cases, applications access images in a tiled manner, and many systems take advantage of this by mapping the image accordingly. However, the actual layout formats (e.g. the size of the tiles/blocks, the arrangement of blocks, etc.) vary from device to device. Generally, the translation from the user's address space to the tiled arrangement is transparent to the user, but the user may apply some known assumptions to extract the best performance. I'd refer you to the following section in AMD's OpenCL Optimization Guide to get an idea:

            Chapter 2 OpenCL Performance and Optimization for GCN Devices->2.8 Additional Performance Guidance->2.8.2 Memory Tiling



            Regards,

              • Re: Most cache friendly tiling of OpenCL image ?
                cgrant78@netzero.com

                dipak, I didn't know the hardware could remap the image to enable better cache coherency. Good info. Sorry for hijacking your thread, boxerab.

                • Re: Most cache friendly tiling of OpenCL image ?
                  boxerab

                  Thanks, Dipak. So, if I want to follow the tiling assumption about memory layout, how should I order my tiles in the kernel? Tiles are of size 32 x 32.

                    • Re: Most cache friendly tiling of OpenCL image ?
                      dipak

                      Hi,

                      Image objects are mapped to the tiled arrangement by the runtime, and the programmer has no control over that procedure. Images can still be used like a regular 2D array, which improves programmability. To get optimum performance from images, you need to experiment with the tile size and access pattern. As mentioned in that section, a workgroup size of 16x16 (or 8x8) work-items is preferable for accessing a tiled image on a GCN device. I would suggest writing a sample test (without any manual re-ordering) and running a few experiments with different tile sizes, especially 32x32. Also check whether mapping a 32x32 tile to four (2x2) sub-tiles, each of size 16x16, improves performance or not.
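                      For what it's worth, the 32x32-tile to four-16x16-sub-tile mapping can be sanity-checked with a bit of host-side index math (a Python sketch of my own; in the actual kernel these values would come from get_group_id(0)/get_group_id(1)):

```python
def subtile_origin(group_x, group_y, wg=16, tile=32):
    """Given a 16x16 work-group's 2D id, return its pixel origin, the
    32x32 application tile it falls in, and which of that tile's four
    16x16 sub-tiles (0..3, row-major) it covers. In a kernel the ids
    would come from get_group_id(0) / get_group_id(1)."""
    px, py = group_x * wg, group_y * wg            # pixel origin of the group
    tile_xy = (px // tile, py // tile)             # enclosing 32x32 tile
    sub = ((py % tile) // wg) * 2 + (px % tile) // wg
    return (px, py), tile_xy, sub
```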

                      Another point: you may also use local memory here, especially when the tile is small enough to fit in local memory and the access pattern is regular and known (i.e. does not create many bank conflicts). In that case the programmer has to transfer the data into local memory explicitly, depending on the usage/access pattern, which gives better control over the data accesses. You may later try implementing the same kernel using local memory to see whether it gains you any extra advantage.
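                      One thing worth checking with the local-memory route is the bank-conflict point above. A quick host-side sketch (Python; it assumes GCN's 32 LDS banks of 4-byte words, and the one-element row padding shown in the comments is a standard workaround rather than anything specific to this thread) computes which bank each of 32 work-items hits when they read down a column:

```python
def lds_banks(stride, column, num_items=32, banks=32):
    """Bank hit by each of num_items work-items when item t reads
    element [t][column] of a float array in local memory laid out with
    row length `stride` (GCN LDS: 32 banks of 4-byte words). Repeated
    bank numbers mean those accesses are serialized."""
    return [(t * stride + column) % banks for t in range(num_items)]

# stride 32: every work-item hits the same bank (32-way conflict);
# stride 33 (one float of padding per row): all 32 banks are distinct.
```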

                       

                      Regards,