We are using a program that makes a lot of zgemm calls (up to 80% of the program's run time).
I'm testing the program on Bulldozer CPUs with ACML 5.1, once with FMA4 and once without.
Both builds were compiled with Intel Fortran Compiler 12.
I cannot find any major difference in the run times. Is this expected, or should FMA4 speed up zgemm?
With best regards
You should see a speedup in this case. The FMA4 ZGEMM is not as well tuned as DGEMM, but it should still provide some benefit compared to running the SSE library. If you can run the application under gdb and stop it inside the ZGEMM kernel, you should see AVX/FMA4 instructions. If you do not, the application is somehow picking up the wrong library. If you do see FMA4 instructions (vfmaddpd, etc.), then this proves we need to spend more time on tuning our ZGEMM kernels.
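
If reading through the disassembly by eye is tedious, the check described above can be roughly automated. The following is only a sketch, assuming you have saved the disassembly of the ZGEMM kernel (for example from gdb) as text; the helper name `count_fma4` and the sample listing are made up for illustration and are not real ACML output.

```python
import re
from collections import Counter

# FMA4 mnemonics (vfmaddpd, vfnmsubps, ...) carry no operand-order digits,
# unlike FMA3 mnemonics such as vfmadd231pd, so this pattern skips FMA3.
FMA4_RE = re.compile(r"\bvfn?m(?:addsub|subadd|add|sub)(?:pd|ps|sd|ss)\b")


def count_fma4(disassembly: str) -> Counter:
    """Count FMA4 mnemonics found in a block of disassembly text."""
    return Counter(m.group(0) for m in FMA4_RE.finditer(disassembly.lower()))


# Hypothetical excerpt, invented for this example.
sample = """
   0x0000: vmovapd (%rdi),%ymm0
   0x0004: vfmaddpd %ymm3,%ymm1,%ymm0,%ymm3
   0x000a: vfmaddpd %ymm4,%ymm2,%ymm0,%ymm4
   0x0010: mulpd  %xmm1,%xmm0
"""

counts = count_fma4(sample)
```

An empty result for the real kernel would suggest the SSE code path is being used, which points to the wrong library being linked or loaded; a non-empty result means the FMA4 kernel is running and the missing speedup is a tuning issue.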