I'm analyzing the impact of workgroup size on the performance of a kernel, in a rather naive, brute-force way: I upload the kernel parameters once, then execute the kernel repeatedly with different workgroup size configurations, starting from the kernel's preferred workgroup size multiple and doubling each time. For each size, I also try different workgroup shapes (from size x 1 to 1 x size, again halving/doubling for each test), and each configuration is run multiple times to also measure 'jitter' in the execution.
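The sweep described above can be sketched in pure Python (no OpenCL calls shown). The preferred multiple of 64 and maximum workgroup size of 256 are assumptions for illustration; in real host code these come from clGetKernelWorkGroupInfo with CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE and CL_KERNEL_WORK_GROUP_SIZE.

```python
def workgroup_configs(preferred_multiple=64, max_size=256):
    """Yield (width, height) workgroup shapes for each size, starting at
    the preferred multiple and doubling up to max_size; for each size,
    walk the shapes from size x 1 down to 1 x size by halving the width
    (and so doubling the height) each step."""
    size = preferred_multiple
    while size <= max_size:
        width = size
        while width >= 1:
            yield (width, size // width)
            width //= 2
        size *= 2

# For 64..256 this produces 7 + 8 + 9 = 24 configurations,
# e.g. (64, 1), (32, 2), (16, 4), ... for the size-64 group.
configs = list(workgroup_configs())
```

Each (width, height) pair would then be passed as the local work size to clEnqueueNDRangeKernel, repeated several times per configuration.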
I noticed that even with the exact same configuration, the profiler sometimes reports a _different_ number of wavefronts, ALU instructions, fetch instructions, write instructions, etc. This doesn't seem to be correlated with performance, in the sense that the runtime outliers in the repeated tests are not necessarily the ones with the different number of wavefronts etc. An extract is in the table that follows. The work size is always 3712*3712, needing exactly 215296 wavefronts. The actual WGS for the NULL configuration is unknown (is there a way to tell? the only thing I could get is that it uses 256-sized workgroups, from the occupancy table). Notice how the 64x1 configuration has two repetitions with too many wavefronts, lower ALUInsts etc. than the third 64x1 repetition, similarly for the last 32x2 vs the first two 32x2, and see how the 12ms 16x4 run is identical to the other 16x4 runs.
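As a sanity check on the expected wavefront count: a wavefront on this GPU family is 64 work-items, so with a global work size that is an exact multiple of 64 the count is fully determined (assuming no partial wavefronts).

```python
# Wavefront size on AMD Cayman-class GPUs (HD 6970) is 64 work-items.
WAVEFRONT_SIZE = 64

global_work_items = 3712 * 3712          # the 2D global work size used above
assert global_work_items % WAVEFRONT_SIZE == 0  # no partial wavefronts
wavefronts = global_work_items // WAVEFRONT_SIZE
print(wavefronts)  # 215296
```

Any profiler report that deviates from this number for the same launch is therefore surprising, which is the core of the question.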
It is my understanding that the overlong kernel runs are probably caused by the X server preempting the GPU (a Radeon HD 6970) (by the way, when will OpenCL be available on AMD cards _without_ an X server?), but what could cause the odd wavefront counts? Is it a bug in the profiler (2.4.1317 for amd64, running on Ubuntu), in OpenCL (AMD APP 2.6), or in the video driver (fglrx 8.91.4, from fglrx-updates on Ubuntu)?