1 Reply Latest reply on May 22, 2012 8:54 AM by chipf

    ACML 5.1.0 and zgemm performance

    mattiass

      Hi!

       

      This is basically the same question as in my other post: is the problem known, can you estimate when a fix will be available, and do you have any other suggestions?

       

      This may not provide much insight, but one example of a problematic input size for zgemm in ACML 5.1.0 is

      transa=C transb=N m=166  n=6    k=5124 lda=5240 ldb=5240 ldc=166.

      As far as I can tell, this affects only zgemm and not dgemm (I have not tested sgemm or cgemm).
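      For scale, here is a rough sketch of how much arithmetic one such call involves. It assumes the usual count of about 8*m*n*k real floating-point operations for a complex GEMM, and the function name `zgemm_flops` is only illustrative. It also checks that the leading dimensions are consistent with transa=C, where A is stored as a k-by-m array.

```python
# Rough work estimate for the zgemm call described above (illustrative only).

def zgemm_flops(m, n, k):
    # Assumption: each complex multiply-add costs 4 real multiplies and
    # 4 real adds, so a complex GEMM does about 8*m*n*k real operations.
    return 8 * m * n * k

m, n, k = 166, 6, 5124
lda = ldb = 5240
ldc = 166

# With transa=C, A is stored as k-by-m (op(A) = A^H is m-by-k), so lda >= k.
# B is k-by-n (ldb >= k) and C is m-by-n (ldc >= m).
assert lda >= k and ldb >= k and ldc >= m

flops_per_call = zgemm_flops(m, n, k)
print(flops_per_call)  # 40828032
```

      The output matrix is tiny (166 by 6) while k is large, so any per-call copying of A is a large share of the total work.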

       

      ACML 5.1.0 seems to end up using far more instructions than ACML 4.4.0 for the same zgemm call. In particular, the 5.1.0 version executes a huge number of extra branches. The non-FMA4 build of 5.1.0 behaves the same with respect to the instruction and branch counters, but of course performs worse.

       

      $ perf stat -e cycles:u,instructions:u,branches:u ./zgemm-510 1000 > /dev/null < slow_example

       

      Performance counter stats for './zgemm-510 1000':

       

          31,127,361,668 cycles:u                  #    0.000 GHz

          48,709,297,348 instructions:u            #    1.56  insns per cycle

           3,643,810,589 branches:u

       

             9.882520082 seconds time elapsed

       

      $ perf stat -e cycles:u,instructions:u,branches:u ./zgemm-440 1000 > /dev/null < slow_example

       

      Performance counter stats for './zgemm-440 1000':

       

          19,648,033,362 cycles:u                  #    0.000 GHz

          34,677,863,136 instructions:u            #    1.76  insns per cycle

             130,858,225 branches:u

       

             6.325148731 seconds time elapsed

       

      $ perf stat -e cycles:u,instructions:u,branches:u ./zgemm-510-wofma4 1000 > /dev/null < slow_example

       

      Performance counter stats for './zgemm-510-wofma4 1000':

       

          42,072,987,121 cycles:u                  #    0.000 GHz

          48,280,631,523 instructions:u            #    1.15  insns per cycle

           3,643,943,130 branches:u

       

            12.903534633 seconds time elapsed

       

      $ cat slow_example

      transa C transb N m 166  n 6    k 5124 lda 5240 ldb 5240 ldc 166
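      To put the three runs side by side, this small sketch computes each counter relative to the 4.4.0 run. The numbers are copied from the perf output above; the dictionary layout and the helper name `relative_to` are just for illustration.

```python
# Counter values copied from the three perf stat runs above.
runs = {
    "zgemm-510": {
        "cycles": 31_127_361_668,
        "instructions": 48_709_297_348,
        "branches": 3_643_810_589,
        "seconds": 9.882520082,
    },
    "zgemm-440": {
        "cycles": 19_648_033_362,
        "instructions": 34_677_863_136,
        "branches": 130_858_225,
        "seconds": 6.325148731,
    },
    "zgemm-510-wofma4": {
        "cycles": 42_072_987_121,
        "instructions": 48_280_631_523,
        "branches": 3_643_943_130,
        "seconds": 12.903534633,
    },
}

def relative_to(run, base):
    # Ratio of every counter in `run` to the same counter in `base`.
    return {key: run[key] / base[key] for key in base}

baseline = runs["zgemm-440"]
for name, run in runs.items():
    ratio = relative_to(run, baseline)
    ipc = run["instructions"] / run["cycles"]
    print(f"{name:18s} ipc={ipc:.2f} "
          f"instr x{ratio['instructions']:.2f} "
          f"branches x{ratio['branches']:.1f} "
          f"time x{ratio['seconds']:.2f}")
```

      Relative to 4.4.0, the 5.1.0 run executes roughly 1.4 times the instructions and nearly 28 times the branches, and the non-FMA4 build shows almost identical instruction and branch counts.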

       

      If it's in any way relevant, the branches seem to take place in zmmblkcaf:

       

      # Events: 10K branches

      #

      # Overhead    Command      Shared Object

      # ........  .........  .................

      #

          93.84%  zgemm-510  libacml.so         [.] zmmblkcaf_

                  |

                  --- zmmblkcaf_

       

       

      Best regards,

      Mattias

        • Re: ACML 5.1.0 and zgemm performance
          chipf

          Thank you for the detailed description.  It is very helpful.

           

          zmmblkcaf is our Fortran conjugate-case block copy.  It's not immediately obvious why there is such a disparity compared to 4.4.0.
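          For readers unfamiliar with the term, a conjugate-case block copy in a blocked GEMM generally packs a block of A into a contiguous buffer while applying the conjugate transpose. The sketch below only illustrates that general idea in plain Python; it is not the ACML routine, and the function name, argument order, and output layout are all assumptions.

```python
def conj_transpose_block_copy(a, lda, row0, col0, rows, cols):
    """Pack the conjugate transpose of a block of column-major A.

    `a` is a flat list holding A in column-major order with leading
    dimension `lda`. The block starts at (row0, col0) and is rows-by-cols.
    The result is a contiguous column-major cols-by-rows buffer.
    """
    out = [0j] * (rows * cols)
    for j in range(cols):
        for i in range(rows):
            # Element (i, j) of the block becomes element (j, i) of the output.
            out[j + i * cols] = a[(row0 + i) + (col0 + j) * lda].conjugate()
    return out

# Hypothetical 3-by-2 matrix stored with lda=4 (one padding entry per column).
a = [1 + 1j, 2 + 2j, 3 + 3j, 0j,
     4 - 1j, 5 - 2j, 6 - 3j, 0j]
packed = conj_transpose_block_copy(a, lda=4, row0=0, col0=0, rows=3, cols=2)
print(packed)
```

          A copy like this is pure data movement with one branch per loop iteration, so how the loops are blocked and unrolled determines the branch count.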

           

          I notice the profile is showing the shared object being used.  Have you tried this case with static linking?