Basically the same questions as in my other post: Is this a known issue? Is it possible to estimate when a fix will be available? Are there any other suggestions?
This may not provide much insight, but one example of a problematic input size for zgemm on ACML 5.1.0 is
transa=C transb=N m=166 n=6 k=5124 lda=5240 ldb=5240 ldc=166.
This, as far as I can tell, only affects zgemm and not dgemm (I have not tested sgemm or cgemm).
It seems ACML 5.1.0 ends up using far more instructions for the same zgemm call than ACML 4.4.0. For example, there seem to be a huge number of extra branches when running the 5.1.0 version. The non-FMA4 version seems to behave the same with respect to the instruction and branch counters, but of course performs worse.
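To make the shape of this case explicit, here is a small Python sketch that parses the parameter line and checks the leading dimensions against the reference BLAS rules for zgemm. The helper names (parse_case, check_zgemm_dims) are hypothetical and only for illustration, and the 8*m*n*k figure is the conventional real-flop count for a complex matrix multiply, not something reported by ACML.

```python
def parse_case(line):
    """Parse 'key value key value ...' into a dict; sizes become ints."""
    tokens = line.split()
    pairs = dict(zip(tokens[0::2], tokens[1::2]))
    return {k: (v if k.startswith("trans") else int(v)) for k, v in pairs.items()}


def check_zgemm_dims(p):
    """Check lda/ldb/ldc against the reference BLAS zgemm requirements."""
    m, n, k = p["m"], p["n"], p["k"]
    rows_a = m if p["transa"] == "N" else k  # stored rows of A
    rows_b = k if p["transb"] == "N" else n  # stored rows of B
    return (p["lda"] >= max(1, rows_a)
            and p["ldb"] >= max(1, rows_b)
            and p["ldc"] >= max(1, m))


case = parse_case("transa C transb N m 166 n 6 k 5124 lda 5240 ldb 5240 ldc 166")
dims_ok = check_zgemm_dims(case)
# Conventional estimate: one complex multiply-add counts as 8 real flops.
flops_per_call = 8 * case["m"] * case["n"] * case["k"]
print(dims_ok, flops_per_call)
```

So the call is valid, and it produces a small 166 x 6 result from a long inner dimension (k=5124).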
$ perf stat -e cycles:u,instructions:u,branches:u ./zgemm-510 1000 > /dev/null < slow_example
Performance counter stats for './zgemm-510 1000':
31,127,361,668 cycles:u # 0.000 GHz
48,709,297,348 instructions:u # 1.56 insns per cycle
9.882520082 seconds time elapsed
$ perf stat -e cycles:u,instructions:u,branches:u ./zgemm-440 1000 > /dev/null < slow_example
Performance counter stats for './zgemm-440 1000':
19,648,033,362 cycles:u # 0.000 GHz
34,677,863,136 instructions:u # 1.76 insns per cycle
6.325148731 seconds time elapsed
$ perf stat -e cycles:u,instructions:u,branches:u ./zgemm-510-wofma4 1000 > /dev/null < slow_example
Performance counter stats for './zgemm-510-wofma4 1000':
42,072,987,121 cycles:u # 0.000 GHz
48,280,631,523 instructions:u # 1.15 insns per cycle
12.903534633 seconds time elapsed
$ cat slow_example
transa C transb N m 166 n 6 k 5124 lda 5240 ldb 5240 ldc 166
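To put the three runs side by side, here is a short Python sketch that computes each run's cost relative to the ACML 4.4.0 binary. The numbers are copied from the perf output above; the dictionary layout and the helper names (relative, ipc) are just my own choices for illustration.

```python
runs = {
    "zgemm-510": {"cycles": 31_127_361_668,
                  "instructions": 48_709_297_348,
                  "seconds": 9.882520082},
    "zgemm-440": {"cycles": 19_648_033_362,
                  "instructions": 34_677_863_136,
                  "seconds": 6.325148731},
    "zgemm-510-wofma4": {"cycles": 42_072_987_121,
                         "instructions": 48_280_631_523,
                         "seconds": 12.903534633},
}


def relative(run, base):
    """Ratio of every counter in run to the same counter in base."""
    return {name: run[name] / base[name] for name in base}


def ipc(run):
    """Instructions per cycle, as reported in the perf stat comment column."""
    return run["instructions"] / run["cycles"]


base = runs["zgemm-440"]
for name, run in runs.items():
    ratios = relative(run, base)
    print(f"{name:18s} ipc={ipc(run):.2f} "
          f"cycles x{ratios['cycles']:.2f} "
          f"instructions x{ratios['instructions']:.2f} "
          f"time x{ratios['seconds']:.2f}")
```

Relative to 4.4.0, the 5.1.0 build retires about 40% more instructions and needs about 58% more cycles for the same input.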
If it's in any way relevant, the branches seem to take place in zmmblkcaf:
# Events: 10K branches
# Overhead Command Shared Object
# ........ ......... .................
93.84% zgemm-510 libacml.so [.] zmmblkcaf_