3 Replies Latest reply on Jun 21, 2011 11:31 AM by Kerilk

    Performance regression with 11.6

    Kerilk
      Some kernels show a performance loss of up to 10% when using Catalyst 11.6 compared to 11.5.

      We recently acquired a Radeon HD 6970 to run portability tests of BigDFT (http://inac.cea.fr/L_Sim/BigDFT/), an electronic structure simulation code.

      Porting the code was rather straightforward, and with minor optimizations the code performs similarly on NVIDIA and AMD GPUs. As the optimizations for the Cayman architecture have not been thorough yet, we hope to have the Radeon HD 6970 outperform the Tesla C2070 we have available. (The whole OpenCL part is double precision only.)

      Nonetheless, we saw a serious performance regression using the Catalyst 11.6 drivers (Linux x86_64). Some kernels suffer a 10% performance loss compared to the 11.5 version. It seems related to memory accesses, as the problem seems worse with certain problem sizes. Some kernels seem to benefit slightly from the upgrade, though (about 1-5%).

      Do you have better insight into what is causing this regression? Is there a way I could help find a solution?

      Thanks in advance

      Brice Videau

      CEA

       

      Edit: my first assumption was apparently wrong; the problem seems related to the auto-vectorizer in the compiler.

      Previously, a sequence of:

      double tt = 0.0;

      tt = fma( *buff++, CONST0, tt);

      tt = fma( *buff++, CONST1, tt);

      ....

      tt = fma( *buff++, CONST15, tt);

      was automatically vectorized (I presume). Using:

      double2 tt2 = (double2)(0.0, 0.0);
      __local double2 *tmp2 = (__local double2 *)buff;
      tt2 = mad(*tmp2++, (double2)(CONST0,CONST1), tt2);

      tt2 = mad(*tmp2++, (double2)(CONST2,CONST3), tt2);

      ....

      tt2 = mad(*tmp2++, (double2)(CONST14,CONST15), tt2);

      double tt = tt2.x + tt2.y;

      gave the performance back. I liked the previous behavior better.
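For readers following along, the regrouping described in this post can be sketched in plain C (not OpenCL C). This is only an illustration under assumptions: `double2_t`, `mad_like`, `sum_scalar` and `sum_two_lane` are hypothetical names, the CONST0..CONST15 values are passed in as an array, and nothing here reflects what the Catalyst compiler actually generates.

```c
#include <stddef.h>

/* Hypothetical stand-in for OpenCL's double2 (two lanes: x and y). */
typedef struct { double x, y; } double2_t;

/* Stand-in for OpenCL mad(): a*b+c. A C compiler may or may not fuse it. */
static double mad_like(double a, double b, double c)
{
    return a * b + c;
}

/* Scalar form: a single accumulator, one multiply-add per tap. */
double sum_scalar(const double *buff, const double *coef, size_t n)
{
    double tt = 0.0;
    for (size_t i = 0; i < n; ++i)
        tt = mad_like(buff[i], coef[i], tt);
    return tt;
}

/* Two-lane form: even taps accumulate in lane x, odd taps in lane y,
 * and the lanes are added once at the end. n is assumed to be even. */
double sum_two_lane(const double *buff, const double *coef, size_t n)
{
    double2_t tt2 = { 0.0, 0.0 };
    for (size_t i = 0; i + 1 < n; i += 2) {
        tt2.x = mad_like(buff[i],     coef[i],     tt2.x);
        tt2.y = mad_like(buff[i + 1], coef[i + 1], tt2.y);
    }
    return tt2.x + tt2.y;
}
```

The two-lane form changes the order of the additions, so for general inputs the two functions can differ in the last bits; for small integer-valued inputs both are exact and agree.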

        • Performance regression with 11.6
          himanshu.gautam

          Hi Kerilk,

          Thanks for reporting it. 

          It will be fixed soon.

          • Performance regression with 11.6
            jeff_golds

            Which kernel in BigDFT are you looking at?  In any event, mad and fma don't necessarily have the same performance (or accuracy), so it's not clear to me how you are comparing code with fma to code with mad.

            Jeff

              • Performance regression with 11.6
                Kerilk

                Well,

                tt2 += *tmp2++ * (double2)(CONST0,CONST1);

                was not recognized as a mad with -cl-mad-enable by the AMD OpenCL compiler, so the performance loss compared to NVIDIA was quite significant.

                On our AMD and NVIDIA GPUs, mad and fma have the same precision and performance in double precision, so I first used fma. Then I ran some OpenCL tests on CPU and found extremely low performance. The reason is that on the CPU fma is emulated and extremely costly (no native support), whereas mad just translates to a regular multiplication followed by an addition. So I switched to mad for now.

                Nonetheless, on current GPUs in double precision, fma and mad are identical.

                If the AMD compiler uses mad with -cl-mad-enable for:

                tt2 += *tmp2++ * (double2)(CONST0,CONST1);

                I'll switch back to this notation, which is more generic.

                 

                Sorry for the mix-up, but I pasted actual code samples.

                 

                Brice

                PS: this is ongoing work that should be released in the next version of BigDFT. For now, AMD support is only on the bzr server.