
ACML 5.1.0 and zgemm performance

Question asked by mattiass on May 22, 2012
Latest reply on May 22, 2012 by chipf

Hi!

 

Basically the same question as in my other post (is it known? is it possible to estimate when a fix will be available? other suggestions?)

 

Maybe this does not provide much insight, but one example of a problematic input size for zgemm on ACML 5.1.0 is:

transa=C transb=N m=166  n=6    k=5124 lda=5240 ldb=5240 ldc=166.

This, as far as I can tell, only affects zgemm and not dgemm (I have not tested sgemm or cgemm).
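To make the shape of this call concrete, here is a minimal Python sketch that checks the leading dimensions against the standard BLAS rules and estimates the work per call. The function name `zgemm_shape_summary` and the returned dictionary are only for illustration; the 8*m*n*k figure is the conventional real-flop count for a complex matrix multiply, and 16 bytes per element assumes double-precision complex.

```python
def zgemm_shape_summary(transa, transb, m, n, k, lda, ldb, ldc):
    """Check BLAS leading-dimension rules and estimate work for one zgemm call."""
    # With trans = 'N' the stored A is m x k; with 'T' or 'C' it is k x m.
    a_rows, a_cols = (m, k) if transa == "N" else (k, m)
    # With trans = 'N' the stored B is k x n; with 'T' or 'C' it is n x k.
    b_rows, b_cols = (k, n) if transb == "N" else (n, k)

    # Standard BLAS requirement: each leading dimension >= stored row count.
    for name, ld, rows in (("lda", lda, a_rows), ("ldb", ldb, b_rows), ("ldc", ldc, m)):
        if ld < max(1, rows):
            raise ValueError(f"{name}={ld} is smaller than the required {max(1, rows)}")

    bytes_per_element = 16  # assumption: double-precision complex
    return {
        "a_elements": a_rows * a_cols,
        "b_elements": b_rows * b_cols,
        "c_elements": m * n,
        "a_bytes": a_rows * a_cols * bytes_per_element,
        # Conventional count: one complex multiply-add = 8 real flops.
        "real_flops": 8 * m * n * k,
    }


summary = zgemm_shape_summary("C", "N", m=166, n=6, k=5124, lda=5240, ldb=5240, ldc=166)
for key, value in summary.items():
    print(f"{key:12s} {value:>12,d}")
```

The case is a long, skinny inner product: k is much larger than m and n, so the result matrix is tiny (166 x 6) while A spans several megabytes.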

 

It seems ACML-5.1.0 ends up using far more instructions for the same zgemm call than ACML-4.4.0. In particular, there is a huge number of extra branches when running the 5.1.0 version. The non-FMA4 version behaves the same with respect to the instruction and branch counters, but of course performs worse.

 

$ perf stat -e cycles:u,instructions:u,branches:u ./zgemm-510 1000 > /dev/null < slow_example

 

Performance counter stats for './zgemm-510 1000':

 

    31,127,361,668 cycles:u                  #    0.000 GHz

    48,709,297,348 instructions:u            #    1.56  insns per cycle

     3,643,810,589 branches:u

 

       9.882520082 seconds time elapsed

 

$ perf stat -e cycles:u,instructions:u,branches:u ./zgemm-440 1000 > /dev/null < slow_example

 

Performance counter stats for './zgemm-440 1000':

 

    19,648,033,362 cycles:u                  #    0.000 GHz

    34,677,863,136 instructions:u            #    1.76  insns per cycle

       130,858,225 branches:u

 

       6.325148731 seconds time elapsed

 

$ perf stat -e cycles:u,instructions:u,branches:u ./zgemm-510-wofma4 1000 > /dev/null < slow_example

 

Performance counter stats for './zgemm-510-wofma4 1000':

 

    42,072,987,121 cycles:u                  #    0.000 GHz

    48,280,631,523 instructions:u            #    1.15  insns per cycle

     3,643,943,130 branches:u

 

      12.903534633 seconds time elapsed
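The three runs can be compared against the 4.4.0 baseline with a few lines of Python. This is only a sketch: the numbers are copied from the perf output above, while the dictionary layout and the helper name `ratios` are made up for illustration.

```python
# Counter values copied from the three `perf stat` runs.
counters = {
    "zgemm-440": {
        "cycles": 19_648_033_362,
        "instructions": 34_677_863_136,
        "branches": 130_858_225,
        "seconds": 6.325148731,
    },
    "zgemm-510": {
        "cycles": 31_127_361_668,
        "instructions": 48_709_297_348,
        "branches": 3_643_810_589,
        "seconds": 9.882520082,
    },
    "zgemm-510-wofma4": {
        "cycles": 42_072_987_121,
        "instructions": 48_280_631_523,
        "branches": 3_643_943_130,
        "seconds": 12.903534633,
    },
}


def ratios(run, baseline="zgemm-440"):
    """Return each counter of `run` divided by the same counter of `baseline`."""
    base = counters[baseline]
    return {name: counters[run][name] / base[name] for name in base}


for run in ("zgemm-510", "zgemm-510-wofma4"):
    line = ", ".join(f"{name} x{value:.2f}" for name, value in ratios(run).items())
    print(f"{run}: {line}")
```

Both 5.1.0 builds execute roughly 1.4 times the instructions and about 28 times the branches of the 4.4.0 build, which is why the branch counter stands out.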

 

$ cat slow_example

transa C transb N m 166  n 6    k 5124 lda 5240 ldb 5240 ldc 166

 

If it's in any way relevant, the branches seem to occur in zmmblkcaf_:

 

# Events: 10K branches

#

# Overhead    Command      Shared Object

# ........  .........  .................

#

    93.84%  zgemm-510  libacml.so         [.] zmmblkcaf_

            |

            --- zmmblkcaf_

 

 

Best regards,

Mattias
