Imagine you are performing a 3x3 convolution on each element of a large (say 8K x 8K) matrix A. For each A[i,j], you compute A'[i,j] = A[i-1,j-1] * C[0,0] + A[i-1,j] * C[0,1] + A[i-1,j+1] * C[0,2] + ... + A[i+1,j+1] * C[2,2].
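To pin down the computation, here is a minimal serial sketch of that formula in plain Python. The function name `convolve3x3` is made up for illustration, and copying border elements unchanged is my assumption, since the formula above doesn't say what happens at the edges.

```python
def convolve3x3(a, c):
    """Apply the 3x3 kernel c to every interior element of the 2D list a.

    Implements A'[i][j] = sum over di, dj in 0..2 of
    A[i + di - 1][j + dj - 1] * C[di][dj].
    Border elements are copied unchanged (an assumption, not from the post).
    """
    rows, cols = len(a), len(a[0])
    out = [row[:] for row in a]  # copy so the input is read-only
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            total = 0
            for di in range(3):
                for dj in range(3):
                    total += a[i + di - 1][j + dj - 1] * c[di][dj]
            out[i][j] = total
    return out
```

Each output element depends only on the original A, never on another output, which is why the work items don't need to synchronize with each other.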
Since I can have at most 256x256 work items on the RV770, each work item must process more than a single element of A. If I use all 256x256 work items, each work item will process a 32x32 sub-matrix (8192 / 256 = 32 along each dimension). But I could use fewer work items, each processing a larger sub-matrix. For example, I could use 128x128 work items, each processing a 64x64 sub-matrix, or 128x256, or ...
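The arithmetic behind those options can be sketched as follows. Both helper names are hypothetical. The read count assumes each work item fetches every input element it needs exactly once, including a one-element halo around its tile, which is an idealization rather than anything measured on the hardware.

```python
def tile_shape(n, items_0, items_1):
    """Sub-matrix size per work item for an n x n matrix split over an
    items_0 x items_1 grid of work items (assumes even division)."""
    if n % items_0 or n % items_1:
        raise ValueError("matrix size must divide evenly by the grid")
    return (n // items_0, n // items_1)


def reads_per_output(tile_0, tile_1):
    """Input elements read per output element for a 3x3 stencil, assuming
    each work item reads its tile plus a one-element halo exactly once."""
    return (tile_0 + 2) * (tile_1 + 2) / (tile_0 * tile_1)


for grid in [(256, 256), (128, 128), (128, 256)]:
    shape = tile_shape(8192, *grid)
    print(grid, "->", shape, "reads/output:", reads_per_output(*shape))
```

Under that assumption, larger tiles have proportionally less halo, so the redundant reads shrink as the tile grows. Whether that matters in practice depends on caching, which this sketch ignores.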
Likewise, I could organize a 256x256 work-item solution into workgroups of 16x16, 32x8, etc.
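For concreteness, this is how the workgroup count falls out of a given choice. It is a sketch with a hypothetical helper name, and it assumes the global size divides evenly by the workgroup size in each dimension.

```python
def workgroup_grid(global_size, local_size):
    """Number of workgroups along each dimension for a 2D range.

    Assumes each global dimension is a multiple of the matching
    workgroup dimension.
    """
    g0, g1 = global_size
    l0, l1 = local_size
    if g0 % l0 or g1 % l1:
        raise ValueError("global size must be a multiple of the workgroup size")
    return (g0 // l0, g1 // l1)


print(workgroup_grid((256, 256), (16, 16)))
print(workgroup_grid((256, 256), (32, 8)))
```

Both 16x16 and 32x8 put 256 work items in a workgroup and produce 256 workgroups in total. Only the shape of the region each workgroup covers differs.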
The question is how do I pick among these various approaches. Since there is no need for barriers or synchronization across work items, synchronization plays no role in how I organize them into workgroups. Memory access patterns will presumably play a role. Pretty much any approach will have far more workgroups than processors, so the choice shouldn't affect whether all the processors stay busy. What other factors should I take into account in choosing the number of work items and their mapping into workgroups?