Thanks for the reply. The only other thing using the GPU is the X server. Your comment gave me the idea that the compositing done by the window manager might be interfering with the GPGPU execution, so I logged out and I'm now running the test via ssh with the environment variable COMPUTE=:0.
The results are much more consistent, both in terms of runtimes and in terms of reported wavefronts, so it looks like the compositing was indeed interfering with the profiling. (I guess this is one of the reasons why being able to do GPGPU without an X server running is such a frequently requested feature.)
There is one discrepancy I'm still seeing, between when the workgroup size is specified and when it's left NULL. The occupancy calculator tells me that when no workgroup size is specified, 256 workitems are issued per group. This is the comparison of the NULL launch vs all the standard combinations of workgroup dimensions that total 256 workitems:
GlobalWorkSize | WorkGroupSize | Time    | Wavefronts |
{ 3712 3712 1} | NULL          | 3.07344 | 237568.00  |
{ 3840 3712 1} | { 256 1 1}    | 2.62100 | 222720.00  |
{ 3712 3712 1} | { 128 2 1}    | 2.59211 | 215296.00  |
{ 3712 3712 1} | { 64 4 1}     | 2.61056 | 215296.00  |
{ 3712 3712 1} | { 32 8 1}     | 2.58033 | 215296.00  |
{ 3712 3712 1} | { 16 16 1}    | 2.60367 | 215296.00  |
{ 3712 3712 1} | { 8 32 1}     | 2.54200 | 215296.00  |
{ 3712 3712 1} | { 4 64 1}     | 2.62467 | 215296.00  |
{ 3712 3712 1} | { 2 128 1}    | 3.54155 | 215296.00  |
{ 3712 3840 1} | { 1 256 1}    | 7.68667 | 222720.00  |
Notice that for the 256x1 and 1x256 cases the global work size is adjusted so that each dimension is an integer multiple of the workgroup size, as per the specification, and the number of wavefronts grows correctly to match the new total number of items. However, in the NULL case the number of wavefronts grows to 237568. At 64 workitems per wavefront (the ratio all the other rows give, e.g. 3712*3712/64 = 215296), that means a grand total of 3712*4096 items are being processed, even though the global work size I pass is still 3712x3712. That is surprising.
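As a sanity check on the table, here is a minimal Python sketch of the arithmetic. It assumes 64 workitems per wavefront, which is the ratio implied by 3712*3712 items producing 215296 wavefronts; the helper names (`padded_global`, `wavefronts`) are made up for illustration and are not part of any profiler or OpenCL API.

```python
WAVEFRONT_SIZE = 64  # assumption: implied by 3712*3712 items / 215296 wavefronts


def round_up(n, multiple):
    """Smallest integer multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple


def padded_global(global_size, group_size):
    """Pad each global dimension up to a multiple of the workgroup dimension."""
    return tuple(round_up(g, l) for g, l in zip(global_size, group_size))


def wavefronts(global_size, wavefront_size=WAVEFRONT_SIZE):
    """Wavefront count for a launch, assuming every wavefront is full."""
    total = 1
    for g in global_size:
        total *= g
    return total // wavefront_size


if __name__ == "__main__":
    requested = (3712, 3712, 1)
    for group in [(256, 1, 1), (128, 2, 1), (16, 16, 1), (1, 256, 1)]:
        actual = padded_global(requested, group)
        print(group, actual, wavefronts(actual))

    # The NULL launch reported 237568 wavefronts; convert that back to workitems.
    null_items = 237568 * WAVEFRONT_SIZE
    print(null_items, null_items // 3712)
```

Under that assumption the explicit workgroup sizes reproduce the 215296 and 222720 figures exactly, and the NULL launch works out to 15204352 workitems, i.e. 3712*4096.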
Is there some way to get information about the actual launch configuration used when a NULL workgroup size is given at kernel launch? Neither the standard profiling output, nor the occupancy output, nor the API trace seems to reveal it.