7 Replies Latest reply on Mar 6, 2010 5:33 AM by n0thing

    How to select number of work items?


      Imagine you are performing a 3x3 convolution on each element of a large (say 8K x 8K) matrix A. For each A[i,j], you compute A'[i,j] = A[i-1,j-1]*C[0,0] + A[i-1,j]*C[0,1] + A[i-1,j+1]*C[0,2] + ... + A[i+1,j+1]*C[2,2].
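For concreteness, the per-element computation above can be sketched in plain C (a sketch with a hypothetical helper name; it assumes interior indices only, with boundary handling done elsewhere):

```c
#include <assert.h>

/* 3x3 convolution at interior element (i, j) of an n x n row-major
 * matrix A, using the 3x3 filter C. Boundary rows/columns are assumed
 * to be handled elsewhere: i and j must be interior indices. */
float convolve3x3(const float *A, int n, const float C[3][3], int i, int j)
{
    float sum = 0.0f;
    for (int di = -1; di <= 1; di++)
        for (int dj = -1; dj <= 1; dj++)
            sum += A[(i + di) * n + (j + dj)] * C[di + 1][dj + 1];
    return sum;
}
```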


      Since I can have at most 256x256 work-items on the RV770, each work-item must process more than a single element of A. If I use all 256x256 work-items, each one will process a 32x32 sub-matrix. But I could use fewer work-items, each processing a larger sub-matrix. For example, I could use 128x128 work-items, each processing a 64x64 sub-matrix, or 128x256, or ...
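The arithmetic behind these tilings is just division of the matrix dimensions by the work-item grid (a sketch with a hypothetical helper, assuming the dimensions divide evenly):

```c
#include <assert.h>

/* Per-work-item tile size when an n x n matrix is split across a
 * gx x gy grid of work-items (assumes n divides evenly by both). */
void tile_size(int n, int gx, int gy, int *tile_w, int *tile_h)
{
    *tile_w = n / gx;   /* columns handled by each work-item */
    *tile_h = n / gy;   /* rows handled by each work-item */
}
```

For the cases in the question: 8192/256 = 32, so a 256x256 grid gives 32x32 tiles, a 128x128 grid gives 64x64 tiles, and a 128x256 grid gives 64x32 tiles.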

      Likewise, I could organize a 256x256 work-item solution into workgroups of 16x16, 32x8, etc.

      The question is how to pick among these various approaches. Since there is no need for blocking/synchronizing across work-items, that plays no role in organizing the work-groups. Memory access patterns will presumably play a role. Pretty much any approach will have far more work-groups than processors, so keeping all the processors busy won't be affected by the choice. What other factors should I take into account in choosing the number of work-items and their mapping into work-groups?

        • How to select number of work items?

          There is no limit on the number of work-items in the global index; the restriction applies only to the local index, i.e. 256 work-items per work-group. So in your case you can have 8K x 8K work-items and then divide them into groups.
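A quick sketch of the group count that results (hypothetical helper; OpenCL 1.x requires the global size to be a multiple of the local size in each dimension):

```c
#include <assert.h>

/* Number of work-groups launched for a 2-D NDRange, given global and
 * local sizes. OpenCL 1.x requires each global dimension to be an
 * exact multiple of the corresponding local dimension. */
int num_groups(int global_x, int global_y, int local_x, int local_y)
{
    return (global_x / local_x) * (global_y / local_y);
}
```

Note that with an 8K x 8K global size, every 256-item group shape (256x1, 64x4, 16x16) yields the same number of groups; only the shape of each group differs.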

          If you are not using intra-group sharing (i.e. local memory), then you should select a linear work-group, for example 256 x 1, as this gives good memory performance when you access global memory linearly.

          I ran the SobelFilter sample in the SDK with 3 different work-group configurations:

          a) 256 x 1 = 1.05 ms kernel time

          b) 64 x 4 = 1.05 ms

          c) 16 x 16 = 1.4 ms

          The idea is that all threads in a single wavefront should access global memory linearly to get good bandwidth. Hence cases a and b give good performance.
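One way to see this: in a row-major image, consecutive work-items in a 256x1 group touch consecutive addresses, while in a 16x16 group the access pattern jumps to the next image row after every 16 items. A sketch, assuming a hypothetical one-texel-per-work-item read:

```c
#include <assert.h>

/* Row-major global address touched by the work-item with flattened
 * local id t in the (0,0) work-group of shape lx_dim x ly_dim, for an
 * image of width w -- one texel per work-item, a simplification. */
int addr_of(int t, int lx_dim, int w)
{
    int lx = t % lx_dim;   /* local x coordinate */
    int ly = t / lx_dim;   /* local y coordinate */
    return ly * w + lx;
}
```

For a 256x1 group the stride between neighbouring work-items is always 1; for a 16x16 group on an 8192-wide image, every 16th work-item jumps ahead by 8192 - 15 addresses, so a 64-thread wavefront touches four widely separated rows.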

          But if you are using local memory, then you can reduce the total number of fetches from global memory by using a square work-group instead of a linear one. For example, take 2 cases in your algorithm:

          1) a 4x4 work-group leads to 36 fetches from global memory into local memory (with the 3x3 filter size)

          2) a 16x1 work-group leads to 54 fetches from global memory using the same filter size
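These counts follow from tile-plus-apron arithmetic: a wx x wy group applying an f x f filter per work-item must stage a (wx + f - 1) x (wy + f - 1) tile. A sketch (hypothetical helper name):

```c
#include <assert.h>

/* Texels a wx x wy work-group must stage from global into local
 * memory when each work-item applies an f x f filter: the group's
 * tile plus an (f-1)-wide apron in each dimension. */
int fetch_count(int wx, int wy, int f)
{
    return (wx + f - 1) * (wy + f - 1);
}
```

The square group wins because, for a fixed area, a square tile has the smallest perimeter and therefore the smallest apron.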


            • How to select number of work items?

              I don't understand how you are assigning data to local vs. global memory to come up with your fetch count. Could you help me see?



                • How to select number of work items?

                  The fetch count is the number of fetches from global memory into local memory for a work-group. It includes all the data (or texels) needed by all the threads inside the work-group.

                  A 4x4 work-group needs to fetch a 6x6 tile of texels from global memory if a 3x3 filter is used for each thread, whereas a 16x1 group needs an 18x3 tile.

                  If you don't use local memory, each thread requires 9 texel fetches from global memory; but if you use local memory with a 4x4 block size, you require only 36/16 = 2.25 fetches per thread on average.
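The per-thread average can be checked with the same tile-plus-apron arithmetic (hypothetical helper name):

```c
#include <assert.h>

/* Average global-memory fetches per work-item when the shared tile is
 * staged in local memory, vs. f*f fetches per item without it. */
float avg_fetches_with_local(int wx, int wy, int f)
{
    return (float)((wx + f - 1) * (wy + f - 1)) / (float)(wx * wy);
}
```

For a 4x4 group with a 3x3 filter this gives 36/16 = 2.25 average fetches per thread, versus 9 without local memory; the 16x1 group manages only 54/16 = 3.375.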