Imagine you are performing a 3x3 convolution on each element of a large (say 8K x 8K) matrix A. For each A[i,j], you compute A'[i,j] = A[i-1,j-1] * C[0,0] + A[i-1,j] * C[0,1] + A[i-1,j+1] * C[0,2] + ... + A[i+1,j+1] * C[2,2]
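For reference, the per-element computation above can be sketched on the host side in Python/NumPy (a hypothetical illustration, not the GPU kernel; zero-padding at the matrix border is an assumption of this sketch):

```python
import numpy as np

def convolve3x3(A, C):
    """Compute A'[i,j] = sum over (di,dj) of C[di,dj] * A[i+di-1, j+dj-1],
    matching the formula above, with zeros assumed outside A's borders."""
    n, m = A.shape
    padded = np.zeros((n + 2, m + 2), dtype=A.dtype)
    padded[1:-1, 1:-1] = A
    out = np.zeros_like(A)
    for di in range(3):
        for dj in range(3):
            # Each shifted slice of `padded` contributes one filter tap.
            out += C[di, dj] * padded[di:di + n, dj:dj + m]
    return out
```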

Since I can have at most 256x256 work-items on the RV770, each work-item must process more than a single element of A. If I use all 256x256 work-items, each one will process a 32x32 sub-matrix. But I could use fewer work-items, each processing a larger sub-matrix. For example, I could use 128x128 work-items, each processing a 64x64 sub-matrix, or 128x256, or ...
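The per-work-item tile size is just the matrix dimension divided by the work-item count in each dimension; a quick sketch of that arithmetic (function name is illustrative):

```python
N = 8192  # the matrix is 8K x 8K

def tile_size(items_x, items_y, n=N):
    """Sub-matrix handled by each work-item for a given work-item grid,
    assuming the matrix divides evenly across the work-items."""
    return (n // items_x, n // items_y)

print(tile_size(256, 256))  # (32, 32)
print(tile_size(128, 128))  # (64, 64)
print(tile_size(128, 256))  # (64, 32)
```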

Likewise, I could organize a 256x256 work-item solution into workgroups of 16x16, 32x8, etc.

The question is how to pick among these various approaches. Since there is no need for blocking/synchronizing across work-items, that plays no role in organizing them into work-groups. Memory access patterns will presumably play a role. Pretty much any approach will have far more work-groups than processors, so keeping all the processors busy won't be affected by the choice. What other factors should I take into account in choosing the number of work-items and their mapping into work-groups?

There is no limit on the number of work-items in the global index; the restriction is only on the local index, i.e., 256 for the work-group size. So in your case above you can have 8K x 8K work-items and then divide them into groups.

If you are using no intra-group sharing (i.e., no local memory), then you should select a linear work-group, for example 256 x 1, as this will give good memory performance if you are accessing your global memory linearly.

I ran the SobelFilter sample in the SDK for three different work-group configurations:

a) 256 x 1 = 1.05 ms kernel time

b) 64 x 4 = 1.05 ms

c) 16 x 16 = 1.4 ms

The idea is that all threads in a single wavefront should access global memory linearly to get good bandwidth. Hence cases a) and b) give good performance.
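To see why, consider which addresses the first wavefront touches under a row-major matrix layout. This sketch assumes a 64-thread wavefront and row-major work-item linearization within the group (both assumptions about the hardware and runtime, not stated in the thread):

```python
def wavefront_offsets(group_w, group_h, row_stride=8192, wavefront=64):
    """Linear memory offsets read by the first wavefront of a work-group,
    assuming one element per work-item and row-major linearization."""
    offsets = []
    for lid in range(min(wavefront, group_w * group_h)):
        lx, ly = lid % group_w, lid // group_w
        offsets.append(ly * row_stride + lx)
    return offsets

# 256x1 (and 64x4): the wavefront reads 64 consecutive elements.
print(wavefront_offsets(256, 1)[:4])  # [0, 1, 2, 3]
# 16x16: four separate 16-element runs, each one row-stride apart.
print(wavefront_offsets(16, 16)[16])  # 8192
```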

But if you are using local memory, then you can reduce the total number of fetches from global memory by using a square work-group rather than a linear one. For example, take two cases in your algorithm:

1) A 4x4 work-group leads to 36 fetches from global memory to local memory: with a 3x3 filter, the group needs its own tile plus a one-element border, i.e., (4+2) x (4+2) = 36 elements.

2) A 16x1 work-group leads to 54 fetches with the same filter size: (16+2) x (1+2) = 54 elements, for the same 16 output elements.
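The fetch counts follow directly from the halo arithmetic: each work-group loads its tile plus an r-wide border on every side, where r=1 for a 3x3 filter. A quick check:

```python
def local_fetches(group_w, group_h, r=1):
    """Elements copied from global to local memory per work-group,
    for a (2r+1) x (2r+1) filter: the tile plus an r-wide halo."""
    return (group_w + 2 * r) * (group_h + 2 * r)

print(local_fetches(4, 4))   # 36 fetches for 16 output elements
print(local_fetches(16, 1))  # 54 fetches for the same 16 elements
```

The squarer group wins because the halo grows with the tile's perimeter, and a square minimizes perimeter for a given area.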