
gbilotta
Adept III

Inconsistent number of wavefronts reported by sprofile

I'm analyzing the impact of workgroup size on the performance of a kernel, in a rather naive and brute-force way: I upload the kernel parameters once, and then execute the kernel repeatedly with different workgroup size configurations, starting from the kernel-preferred workgroup size multiple and doubling each time. For each size, I also try different workgroup shapes (from size x 1 to 1 x size, again halving/doubling for each test), and each configuration is run multiple times to also measure the 'jitter' in the execution.
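For reference, this is a minimal sketch of the sweep I'm running; the queue, kernel and problem size stand in for the real setup, and the kernel is assumed to guard against out-of-range work-items:

/* Sketch of the brute-force workgroup sweep: for each total size (starting
 * from the preferred multiple and doubling up to the 256 device maximum)
 * try every shape from size x 1 down to 1 x size, repeating each launch
 * a few times to expose jitter. */
#include <CL/cl.h>
#include <stdio.h>

static void sweep(cl_command_queue queue, cl_kernel kernel, size_t pref_mult)
{
    const size_t gx = 3712, gy = 3712;
    const int repetitions = 3;

    for (size_t size = pref_mult; size <= 256; size *= 2) {
        for (size_t wx = size; wx >= 1; wx /= 2) {
            size_t wy = size / wx;
            size_t lws[2] = { wx, wy };
            /* round the global size up to a multiple of the workgroup size,
             * as the OpenCL 1.x spec requires */
            size_t gws[2] = { ((gx + wx - 1) / wx) * wx,
                              ((gy + wy - 1) / wy) * wy };
            for (int r = 0; r < repetitions; ++r) {
                cl_int err = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                                    gws, lws, 0, NULL, NULL);
                if (err != CL_SUCCESS)
                    fprintf(stderr, "launch %zux%zu failed: %d\n", wx, wy, err);
                clFinish(queue); /* serialize runs so the profiler sees them separately */
            }
        }
    }
}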

I noticed that even with the exact same configuration, the profiler sometimes reports a _different_ number of wavefronts, ALU instructions, fetch instructions, write instructions etc. This doesn't seem to be correlated with performance, in the sense that the runtime outliers in the repeated tests are not necessarily the ones with the different number of wavefronts. An extract is in the table that follows. The work size is always 3712*3712 (13778944 work-items, i.e. exactly 215296 wavefronts of 64 work-items each on this GPU). The actual WGS for the NULL configuration is unknown (is there a way to tell? the only thing I could find is that it uses 256-sized workgroups, from the occupancy table). Notice how the 64x1 configuration has two repetitions with too many wavefronts and lower ALUInsts etc. than the third 64x1 repetition, similarly the last 32x2 vs. the first two 32x2, and see how the 12ms 16x4 run reports the same counters as the other 16x4 runs.

WorkGroupSize  Time (ms)  Wavefronts  ALUInsts  FetchInsts  WriteInsts  ALUBusy  ALUFetchRatio  ALUPacking
NULL           3.06689    278594.00   43.11     4.20        1.39        67.58    10.26          69.42
NULL           3.05967    278594.00   43.11     4.20        1.39        67.72    10.26          69.42
NULL           3.07189    278594.00   43.11     4.20        1.39        67.78    10.26          69.42
64x1           3.32733    256322.00   42.93     4.19        1.40        57.79    10.24          69.50
64x1           3.32289    256322.00   42.93     4.19        1.40        57.86    10.24          69.50
64x1           3.31767    215296.00   50.73     4.90        1.47        62.72    10.36          69.50
32x2           3.31378    215296.00   48.85     4.70        1.39        61.15    10.39          69.21
32x2           3.29900    215296.00   48.85     4.70        1.39        60.99    10.39          69.21
32x2           3.30444    256322.00   41.35     4.03        1.33        56.01    10.27          69.21
16x4           3.30778    215296.00   47.53     4.56        1.33        59.43    10.42          68.99
16x4           3.28522    215296.00   47.53     4.56        1.33        59.43    10.42          68.99
16x4           12.92222   215296.00   47.53     4.56        1.33        59.36    10.42          68.99

It is my understanding that the overlong kernel runs are probably caused by the X server preempting the GPU (Radeon HD 6970) (by the way, when will OpenCL be available on AMD cards _without_ an X server?), but what could be the cause of the odd wavefront counts? Is it a bug in the profiler (2.4.1317 for amd64 running on Ubuntu), in OpenCL (AMD APP 2.6), or in the video driver (fglrx 8.91.4, from fglrx-updates on Ubuntu)?

lbin
Staff

Thanks for reporting this issue. When you were collecting the performance counters, were there any other applications utilizing the GPU? That could affect the counter results.

gbilotta
Adept III

Thanks for the reply. The only other thing using the GPU is the X server. Your comment triggered the idea that the compositing from the window manager might be interfering with the GPGPU execution, so I logged out and I'm now running the test via ssh with the environment variable COMPUTE=:0.

The results are much more consistent, both in terms of runtimes and in terms of reported wavefronts, so it looks like the compositing was indeed interfering with the profiling. (This is probably one of the reasons why being able to do GPGPU without a running X server is such a frequently requested feature, I guess.)

There is one discrepancy I'm still seeing between the case when the workgroup size is specified and when it's left NULL. The occupancy calculator tells me that when no workgroup size is specified, 256 work-items are issued per group. Below is a comparison of the NULL launch vs. all the standard combinations of workgroup dimensions that total 256 work-items:

GlobalWorkSize      WorkGroupSize    Time (ms)  Wavefronts
{3712  3712    1}   NULL             3.07344    237568.00
{3840  3712    1}   { 256    1   1}  2.62100    222720.00
{3712  3712    1}   { 128    2   1}  2.59211    215296.00
{3712  3712    1}   {  64    4   1}  2.61056    215296.00
{3712  3712    1}   {  32    8   1}  2.58033    215296.00
{3712  3712    1}   {  16   16   1}  2.60367    215296.00
{3712  3712    1}   {   8   32   1}  2.54200    215296.00
{3712  3712    1}   {   4   64   1}  2.62467    215296.00
{3712  3712    1}   {   2  128   1}  3.54155    215296.00
{3712  3840    1}   {   1  256   1}  7.68667    222720.00

Notice that for the 256x1 and 1x256 cases the global work size is adjusted so that the dimensions are integer multiples of the workgroup size, as per the specification, and the number of wavefronts grows accordingly to match the new total number of work-items (3840*3712/64 = 222720). However, in the NULL case the number of wavefronts grows to 237568, which at 64 work-items per wavefront means a grand total of 3712*4096 items being processed. That is surprising.
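Just to make the arithmetic explicit, this is how I compute the expected counts; the round-up is what my host code does to satisfy the requirement that the global size be a multiple of the local size, and the 64-wide wavefront is the HD 6970 (Cayman) value:

/* Expected wavefront counts for the launches in the table above, assuming
 * 64-wide wavefronts. */
#include <stdio.h>

static size_t round_up(size_t x, size_t m) { return ((x + m - 1) / m) * m; }

int main(void)
{
    const size_t gx = 3712, gy = 3712, wave = 64;
    const size_t shapes[][2] = { {256, 1}, {128, 2}, {64, 4}, {1, 256} };

    for (size_t i = 0; i < sizeof(shapes) / sizeof(shapes[0]); ++i) {
        size_t wx = shapes[i][0], wy = shapes[i][1];
        size_t ax = round_up(gx, wx), ay = round_up(gy, wy);
        printf("%zux%zu -> global %zux%zu -> %zu wavefronts\n",
               wx, wy, ax, ay, (ax * ay) / wave);
        /* 256x1 and 1x256 give 3840*3712/64 = 222720; the others 215296 */
    }
    /* the NULL launch instead reports 237568 wavefronts,
     * i.e. 237568*64 = 15204352 = 3712*4096 work-items */
    return 0;
}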

Is there some way to get the information about the actual launch configuration used when a NULL workgroup size is given at kernel launch? Neither the standard profiling output, nor the occupancy data, nor the API trace seems to reveal it.
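For reference, the closest I know of is querying the kernel workgroup properties with clGetKernelWorkGroupInfo (minimal sketch below, error checking omitted; 'kernel' and 'device' stand in for the real objects), but that only gives limits and the preferred multiple, not the shape the runtime actually picks for a NULL launch:

/* Query what the runtime exposes about the kernel's workgroup limits:
 * the maximum workgroup size and the preferred size multiple. This does
 * not expose the shape actually chosen when local_work_size is NULL. */
#include <CL/cl.h>
#include <stdio.h>

static void print_wg_info(cl_kernel kernel, cl_device_id device)
{
    size_t max_wg = 0, pref_mult = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(max_wg), &max_wg, NULL);
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(pref_mult), &pref_mult, NULL);
    printf("max workgroup size: %zu, preferred multiple: %zu\n",
           max_wg, pref_mult);
}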
