I observed something surprising (to me) about the performance of DSYMM vs. DGEMM in ACML 4.4.0. Here is a snippet of F90 code that applies a symmetric 1536-by-1536 matrix to a 1536-by-25 matrix:
INTEGER*4, PARAMETER :: m=1536, n=25
REAL*8 :: A(m,m), B(m,n), C(m,n)
! Call external routine to populate A and B. A is symmetric.
CALL popab(A, B, m, n)
! Option 1: C <- A*B using dgemm
CALL dgemm('N', 'N', m, n, m, 1.d0, A, m, B, m, 0.d0, C, m)
! Option 2: C <- A*B using dsymm
CALL dsymm('L', 'U', m, n, 1.d0, A, m, B, m, 0.d0, C, m)
I was surprised to see that option 2 is much slower than option 1. For example, on a dual-CPU Opteron 4334 "Seoul" machine at 3.1 GHz (12 cores total, 64 GB RAM), option 2 takes about 2.3 times as long to execute as option 1. On a Phenom II N950 mobile machine (4 cores, 8 GB RAM), option 2 takes about 3 times as long as option 1. Both machines run a 64-bit Linux OS. This performance difference appears to be specific to ACML: with ATLAS, for example, option 2 takes only about 1.1 times as long as option 1.
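For anyone who wants to reproduce the comparison without a Fortran build, here is a minimal sketch using scipy's low-level BLAS wrappers (an assumption on my part: scipy must be installed, and the timings reflect whatever BLAS scipy was built against, not necessarily ACML). It also verifies that the two call paths produce the same C, which they should, since DSYMM only exploits symmetry for speed, not for a different result:

```python
import time
import numpy as np
from scipy.linalg.blas import dgemm, dsymm

m, n = 1536, 25
rng = np.random.default_rng(0)
A = rng.standard_normal((m, m))
A = A + A.T                       # make A symmetric
B = rng.standard_normal((m, n))

# Option 1: C <- A*B via DGEMM (treats A as a general matrix)
t0 = time.perf_counter()
C1 = dgemm(1.0, A, B)
t1 = time.perf_counter()

# Option 2: C <- A*B via DSYMM; side=0 is 'L' (A on the left),
# lower=0 means the upper triangle of A is referenced
C2 = dsymm(1.0, A, B, side=0, lower=0)
t2 = time.perf_counter()

assert np.allclose(C1, C2)
print(f"dgemm: {t1 - t0:.4f} s   dsymm: {t2 - t1:.4f} s")
```

A single call this size finishes in milliseconds, so for a meaningful ratio you would loop each option many times and compare totals; but even this one-shot version makes the disparity visible on my machines.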
Has this performance disparity been resolved in a more recent version of ACML? It would be very difficult for our group to upgrade to a newer ACML because of compiler-compatibility issues, but we might consider attempting the upgrade if the performance of DSYMM has been substantially improved for matrices of sizes similar to those in my example code.
Thank you in advance for your assistance.