
    Possible OpenCL Compiler Bug

    n1kunj

      I've found what looks like a compiler bug: commenting out a single line of code causes the kernel to run 10 times slower. Tested on an AMD 6630M 1GB (Turks) and an Intel i5 2410M. The APP KernelAnalyzer shows that the kernel grows from 11714 lines of assembly to 17416 lines after commenting out that one line, which seems a little excessive. Scratch register usage also goes through the roof.

       

      Attached are NotWorking, which has the line of code commented out, and Working, which has correct output.

        • Re: Possible OpenCL Compiler Bug
          drallan

          I don't think this is a compiler bug. The math function pow(x,y) can be very slow because each call expands to about 250 instructions. When a loop that contains pow() is unrolled, the code can also grow very large. These are the symptoms you see.

           

          However, OpenCL also defines powr(x,y), a variant of pow that requires the base x to be non-negative, and a fast native version, native_powr(x,y), which uses hardware instructions where available and is therefore hardware dependent. I know that the AMD 7000 series GPUs have good hardware support.

           

          You can probably test this easily by using native_powr(fabs(x), y) in your example code.
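
           

          For reference, here is a minimal sketch of that substitution (the kernel, its arguments, and the USE_NATIVE switch are made up for illustration, not taken from the attached files; note that native_powr also trades precision for speed). Build with -D USE_NATIVE=1 to take the native path:

                  __kernel void pow_test(__global const float* y,
                                         __global float* out,
                                         float e)                // exponent
                  {
                      int gid = get_global_id(0);
                      float accum = 0.0f;
                      for (int j = 0; j < 16; j++) {
                          float v = y[gid * 16 + j];
                  #if USE_NATIVE
                          // hardware path: native_powr requires a non-negative base
                          accum += native_powr(fabs(v), e);
                  #else
                          // full-precision software pow, ~250 instructions per call
                          accum += pow(v, e);
                  #endif
                      }
                      out[gid] = accum;
                  }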

            • Re: Possible OpenCL Compiler Bug
              n1kunj

              You've got it the wrong way round - removing the pow makes the code slower!

                • Re: Possible OpenCL Compiler Bug
                  drallan

                  You've got it the wrong way round - removing the pow makes the code slower!

                  Oops! You're right, I got it backwards. Then it probably is a compiler issue.

                   

                  However, I can't see why either case needs so many registers (Tahiti GCN used 256 VGPRs and 284 scratch registers, more than 500!). The problem seems to be the first line, which mixes float16 and scalar variables and has some terms that do not need to be in the loop. Altogether, this may prevent the compiler from making rather ordinary optimizations. (I'm using an earlier compiler version, around 12.8; v13.1 may be yet another issue.)

                   

                  Simplifying this statement reduces register use roughly fourfold, down to normal levels (36 on Turks / 117 on Tahiti), and prevents register spilling. Note: the #if below switches between the two cases.

                   

                  Assuming float16 x, float z, and params[0] do not depend on j:

                   

                        float16 base = pow(x, params[0]) + pow(z, params[0]);

                        for (int j = 0; j < 16; j++) {
                  #if 0     // original: everything recomputed inside the loop
                            float16 v = pow(x, params[0]) + pow(yptr[j], params[0]) + pow(z, params[0]);
                  #else     // break up the statement, move the loop-invariant parts outside
                            float16 v = base + pow(yptr[j], params[0]);
                  #endif
                            v = pow(v, 1.f / params[0]);
                            float16 res = sin(v) / v;
                            accum += sumFloat16(res);
                        }

                    • Re: Possible OpenCL Compiler Bug
                      n1kunj

                      I've since significantly modified the code in a different way for a massive speedup, in part by moving things outside of the loop, but I have noticed a similar problem myself. The compiler can't seem to optimise the order of operations.

                       

                      This code:

                       

                      float16 out = (float16)a + (float)b + (float)c;

                       

                      Compiles to more instructions than this code:

                       

                      float16 out = (float16)a + ( (float)b + (float)c );

                       

                      I'm not sure whether the OpenCL spec actually requires operations to be evaluated left to right even when mixing scalars and vectors, but it's a little annoying that I need to go back through my code and explicitly group the scalar ops so they happen before the vector ops. It does explain some of the strange performance quirks I've been seeing: moving things around in ways that really shouldn't change the emitted code actually changes performance.
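
                       

                      As a minimal sketch of the workaround (the kernel and variable names are made up, not from the attached code): grouping the scalar terms gives one scalar add plus a single 16-wide add, instead of two 16-wide adds.

                              __kernel void group_scalars(__global float16* out, float b, float c)
                              {
                                  int gid = get_global_id(0);
                                  float16 a = out[gid];

                                  // left to right: (a + b) is a 16-wide add, then (+ c) is another 16-wide add
                                  float16 slow = a + b + c;

                                  // grouped: b + c is one scalar add, then a single 16-wide add against a
                                  float16 fast = a + (b + c);

                                  out[gid] = slow + fast;   // keep both results live so neither is optimised away
                              }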

                • Re: Possible OpenCL Compiler Bug
                  nou

                  I don't see a difference with the older driver; it uses the same 97 scratch registers on Cypress. But with the 13.1 driver I can see a huge difference: it uses 18 scratch registers with pow() and 197 without. With native_powr() it increases to 202 on Cypress.