
    Possible OpenCL Compiler Bug

    n1kunj

      I've found what looks like a compiler bug: commenting out a single line of code causes the kernel to run 10 times slower. Tested on an AMD 6630M 1GB (Turks) and an Intel i5 2410M. The APP KernelAnalyzer shows that the kernel grows from 11714 lines of assembly to 17416 lines after commenting out that one line, which seems a little excessive. Scratch register usage also goes through the roof.

       

      Attached are NotWorking, which has the line of code commented out, and Working, which has correct output.

        • Re: Possible OpenCL Compiler Bug
          drallan

          I don't think this is a compiler bug. The math function pow(x,y) can be very slow because each call expands to about 250 instructions. When a loop that contains pow() is unrolled, the code can also grow very large. These are the symptoms you see.

           

          However, OpenCL also defines powr(x,y), a variant of pow that requires the base x to be non-negative, and a fast native version, native_powr(x,y), which uses hardware instructions where available and is therefore hardware dependent. I know that the AMD 7000 series GPUs have good hardware support.

           

          You can probably test this easily by using native_powr(fabs(x), y) in your example code.
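
           

          For reference, here is a minimal sketch of that substitution (the kernel, its arguments, and the USE_NATIVE switch are made up for illustration, not taken from the attached files; note that native_powr also trades precision for speed). Build with -D USE_NATIVE=1 to take the native path:

                  __kernel void pow_test(__global const float* y,
                                         __global float* out,
                                         float e)                // exponent
                  {
                      int gid = get_global_id(0);
                      float accum = 0.0f;
                      for (int j = 0; j < 16; j++) {
                          float v = y[gid * 16 + j];
                  #if USE_NATIVE
                          // hardware path: native_powr requires a non-negative base
                          accum += native_powr(fabs(v), e);
                  #else
                          // full-precision software pow, ~250 instructions per call
                          accum += pow(v, e);
                  #endif
                      }
                      out[gid] = accum;
                  }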

            • Re: Possible OpenCL Compiler Bug
              n1kunj

              You've got it the wrong way round - removing the pow makes the code slower!

                • Re: Possible OpenCL Compiler Bug
                  drallan

                  You've got it the wrong way round - removing the pow makes the code slower!

                  Oops! You're right, I got it backwards. Then it probably is a compiler issue.

                   

                  However, I can't see why either case needs so many registers (Tahiti GCN used 256 VGPRs and 284 scratch registers, more than 500!). The problem seems to be the first line, which mixes float16 and scalar variables and has some terms that do not need to be in the loop. Altogether, this may prevent the compiler from making rather ordinary optimizations. (I'm using an earlier compiler version, around 12.8; v13.1 may be yet another issue.)

                   

                  Simplifying this statement reduces register use roughly fourfold, down to normal levels (36 on Turks / 117 on Tahiti), and prevents register spilling. Note: the #if below switches between the two cases.

                   

                  Assuming float16 x, float z, and params[0] do not depend on j:

                   

                        float16 base = pow(x, params[0]) + pow(z, params[0]);

                        for (int j = 0; j < 16; j++) {
                  #if 0     // original: everything recomputed inside the loop
                            float16 v = pow(x, params[0]) + pow(yptr[j], params[0]) + pow(z, params[0]);
                  #else     // break up the statement, move the loop-invariant parts outside
                            float16 v = base + pow(yptr[j], params[0]);
                  #endif
                            v = pow(v, 1.f / params[0]);
                            float16 res = sin(v) / v;
                            accum += sumFloat16(res);
                        }

                    • Re: Possible OpenCL Compiler Bug
                      n1kunj

                      I've since significantly modified the code in a different way for a massive speedup, in part by moving things outside of the loop, but I have noticed a similar problem myself. The compiler can't seem to optimise the order of operations.

                       

                      This code:

                       

                      float16 out = (float16)a + (float)b + (float)c;

                       

                      Compiles to more instructions than this code:

                       

                      float16 out = (float16)a + ( (float)b + (float)c );

                       

                      I'm not sure whether the OpenCL spec actually requires operations to be evaluated left to right even when mixing scalars and vectors, but it's a little annoying that I need to go back through my code and explicitly group the scalar ops so they happen before the vector ops. It does explain some of the strange performance quirks I've been seeing: moving things around in ways that really shouldn't change the emitted code actually changes performance.
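
                       

                      As a minimal sketch of the workaround (the kernel and variable names are made up, not from the attached code): grouping the scalar terms gives one scalar add plus a single 16-wide add, instead of two 16-wide adds.

                              __kernel void group_scalars(__global float16* out, float b, float c)
                              {
                                  int gid = get_global_id(0);
                                  float16 a = out[gid];

                                  // left to right: (a + b) is a 16-wide add, then (+ c) is another 16-wide add
                                  float16 slow = a + b + c;

                                  // grouped: b + c is one scalar add, then a single 16-wide add against a
                                  float16 fast = a + (b + c);

                                  out[gid] = slow + fast;   // keep both results live so neither is optimised away
                              }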

                • Re: Possible OpenCL Compiler Bug
                  nou

                  I don't see a difference with the older driver; it uses the same 97 scratch registers on Cypress. But with the 13.1 driver I can see a huge difference: it uses 18 scratch registers with pow() and 197 without. With native_powr() it increases to 202 on Cypress.