Imagine you are performing a 3x3 convolution on each element of a large (say 8K x 8K) matrix A. For each A[i,j], you compute A'[i,j] = A[i-1,j-1]*C[0,0] + A[i-1,j]*C[0,1] + A[i-1,j+1]*C[0,2] + ... + A[i+1,j+1]*C[2,2].
Since I can have at most 256x256 work-items on the RV770, each work-item must process more than a single element of A. If I use all 256x256 work-items, each one will process a 32x32 sub-matrix. But I could use fewer work-items, each processing a larger sub-matrix. For example, I could use 128x128 work-items, each processing a 64x64 sub-matrix, or 128x256, or ...
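The trade-off can be enumerated in a few lines (a sketch; the 8K matrix size and the 256-per-dimension work-item cap are the numbers assumed in my post above, and the power-of-two counts are just illustrative choices):

```python
MATRIX = 8192       # 8K x 8K matrix A
MAX_ITEMS = 256     # assumed per-dimension work-item cap

# For each square work-item count, the sub-matrix each
# work-item must then process so that items * sub == 8192.
for items in (32, 64, 128, 256):
    sub = MATRIX // items
    assert items * sub == MATRIX
    print(f"{items}x{items} work-items -> {sub}x{sub} sub-matrix each")
```

Non-square splits (128x256 work-items processing 64x32 tiles, etc.) work the same way, as long as the per-dimension products cover the full matrix.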
Likewise, I could organize a 256x256 work-item solution into workgroups of 16x16, 32x8, etc.
The question is how to pick among these various approaches. Since there is no need for blocking/synchronizing across work-items, that plays no role in organizing into work-groups. Memory access patterns will presumably play a role. Pretty much any approach will have far more work-groups than processors, so keeping all the processors busy won't be affected by the choice. What other factors should I take into account in choosing the number of work-items and their mapping into work-groups?
There is no limit on the number of work-items in the global index; the restriction is only on the local index, i.e. 256 for the work-group size. So in your case above you can have 8K x 8K work-items and then divide them into groups.
If you are not using intra-group sharing (i.e. local memory), then you should select a linear work-group, for example 256 x 1, as this gives good memory performance when you access global memory linearly.
I ran the SobelFilter sample in the SDK with 3 different work-group configurations:
a) 256 x 1 = 1.05 ms kernel time
b) 64 x 4 = 1.05 ms
c) 16 x 16 = 1.4 ms
The idea is that all threads in a single wavefront should access global memory linearly to get good bandwidth. Hence cases a and b give good performance.
But if you are using local memory, then you can reduce the total number of fetches from global memory by using a square work-group rather than a linear one. For example, take 2 cases in your algorithm:
1) 4x4 work-group will lead to 36 fetches from global memory to local memory (3x3 filter size)
2) 16x1 will lead to 54 fetches from global memory using the same filter size
The fetch count is the number of fetches from global memory to local memory for a work-group. It includes all the data (or texels) needed by all the threads inside the work-group.
A 4x4 work-group needs to fetch a 6x6 tile of texels from global memory when a 3x3 filter is applied by each thread, whereas a 16x1 group requires an 18x3 tile.
If you don't use local memory, each thread requires 9 texel fetches from global memory, but if you use local memory with a 4x4 block size you require only 36/16 = 2.25 fetches per thread on average.
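The tile arithmetic is easy to check: for an f x f filter, a (wx x wy) work-group must fetch its own footprint plus a one-texel apron on each side, i.e. a (wx + f - 1) x (wy + f - 1) tile. A quick sketch:

```python
def tile_fetches(wx, wy, f=3):
    """Texels a (wx x wy) work-group must stage in local memory
    for an f x f filter: the group's footprint plus the apron."""
    return (wx + f - 1) * (wy + f - 1)

print(tile_fetches(4, 4))       # 6x6 tile  -> 36 fetches
print(tile_fetches(16, 1))      # 18x3 tile -> 54 fetches
print(tile_fetches(4, 4) / 16)  # average per thread -> 2.25
```

The square group wins because its perimeter (where the apron lives) is smaller relative to its area than that of a long thin group.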
Please note that for error-free processing on 7xx-based GPUs, the maximum work-group size is 64. Specifying a larger size can result in undefined behavior.
7XX can work fine with larger work-groups, up to what the runtime reports and what the kernel info says the kernel can be run at. Some 7XX devices may actually report less than 64 as the maximum work-group size for a given kernel, depending on the attributes of the kernel, barrier usage for example.
n0thing - I've looked at the matrix multiply code and read more of the spec (it's big and dense, as I'm sure you know) and now see what you're saying about local memory. I do have a question about code design, though. Let's say we're doing a 4x4 work-group. One approach would be to have each thread in the group fetch the input matrix element corresponding to the thread and place that element in the corresponding local memory location. But we still need to fetch the values around the periphery of the sub-matrix represented by the work-group. Does one thread do this while the others just do nothing? Alternatively, one thread from the work-group could do all the copying and then make the barrier call, while the others do nothing except make the barrier call. Or is there a more efficient way to do this?
Part of my earlier confusion stems from the fact that my device has an RV770, which reports no local memory, causing me to initially ignore this part of the model.
See page 6 of this paper : http://developer.download.nvidia.com/compute/cuda/sdk/website/projects/convolutionSeparable/doc/convolutionSeparable.pdf
It explains how to load the pixels in the apron. Shared memory in CUDA is the same as local memory in OpenCL.
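The usual pattern (and roughly what that sample does) is to spread the whole tile, apron included, across all the threads with a strided loop, so no thread sits idle during the loads. Here is a host-side sketch in Python of which tile element each of the 16 threads in a 4x4 group would stage; the indexing scheme is illustrative, not lifted from the NVIDIA sample:

```python
GROUP_W, GROUP_H = 4, 4
FILTER = 3
TILE_W = GROUP_W + FILTER - 1   # 6
TILE_H = GROUP_H + FILTER - 1   # 6
NTHREADS = GROUP_W * GROUP_H    # 16

loads = {}   # flat tile index -> thread id that fetched it
for tid in range(NTHREADS):
    # each thread grabs tile elements tid, tid+16, tid+32, ...
    for idx in range(tid, TILE_W * TILE_H, NTHREADS):
        loads[idx] = tid

assert len(loads) == TILE_W * TILE_H   # full 6x6 tile is covered
per_thread = [sum(1 for t in loads.values() if t == tid)
              for tid in range(NTHREADS)]
print(max(per_thread))   # -> 3, i.e. at most ceil(36/16) loads per thread
```

In the kernel itself the flat index would be rebuilt from get_local_id(0)/get_local_id(1), the loop body would copy from global to __local memory, and a single barrier(CLK_LOCAL_MEM_FENCE) after the loop would make the tile visible to every thread before the convolution runs.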