The problem size for DGEMM when alpha and beta are not equal to 1 or 0 is 3*N*K*M, but time_dgemm.f in ACML-GPU (Linux x64) uses DNFLOP = 2.0D-6*DBLE(M)*DBLE(N)*DBLE(K).
I suppose the correct value is DNFLOP = 3.0D-6*DBLE(M)*DBLE(N)*DBLE(K).
DGEMM computes C = alpha*A*B + beta*C. Applying alpha and beta costs N*M operations each, so you could add 2*N*M to the total flops. But since this is n squared versus n cubed for the matrix multiply itself, it is typically ignored. If N = 10000, then the extra alpha and beta overhead is only about 1/10000 of the operations (2*N*M against 2*N*M*K).
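To make the size of that overhead concrete, here is a minimal sketch of the two flop counts. The function name `dgemm_flops` and its `include_scaling` flag are made up for illustration and are not part of ACML; the count assumes one multiply and one add per inner-product term, plus one multiply per element of C for each of alpha and beta.

```python
def dgemm_flops(m, n, k, include_scaling=False):
    """Flop count for C = alpha*A*B + beta*C, with A m-by-k and B k-by-n.

    Hypothetical helper for illustration only; not an ACML routine.
    """
    flops = 2 * m * n * k  # one multiply and one add per inner-product term
    if include_scaling:
        flops += 2 * m * n  # one multiply by alpha and one by beta per element of C
    return flops


n = 10000
base = dgemm_flops(n, n, n)                        # the 2*M*N*K count
full = dgemm_flops(n, n, n, include_scaling=True)  # with the alpha/beta terms added
overhead = (full - base) / base                    # relative size of the extra work
mflop = 2.0e-6 * n * n * n                         # same formula as DNFLOP in time_dgemm.f
print(base, full, overhead, mflop)
```

With these assumptions the relative overhead is 2*M*N / (2*M*N*K) = 1/K, so it shrinks as the matrices grow.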
It depends on the implementation.
for int i=1..N
  for int j=1..M
    for int r=1..K
      q[i,j] += a[i,r]*b[r,j] //2*N*M*K (MAD 2 flop)
    end
    c[i,j] = beta*c[i,j] + alfa*q[i,j] //3*N*M (MAD + MUL 3 flop)
  end
end
for int i=1..N
  for int j=1..M
    c[i,j] = beta*c[i,j] //N*M (MUL 1 flop)
    for int r=1..K
      c[i,j] += alfa*a[i,r]*b[r,j] //3*N*M*K (MUL + MAD 3 flop)
    end
  end
end
So I think you use the 2*N*M*K variant.
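The two loop orderings can be checked with a small runnable sketch that counts flops as it goes. This is plain Python on nested lists, written only to illustrate the counting argument; the function names are hypothetical and it says nothing about how ACML actually implements DGEMM.

```python
def gemm_scale_last(alpha, a, b, beta, c):
    """Accumulate the dot product first, then apply alpha and beta once per element."""
    n, k, m = len(a), len(b), len(b[0])
    out = [row[:] for row in c]  # work on a copy so the caller's C is untouched
    flops = 0
    for i in range(n):
        for j in range(m):
            q = 0.0
            for r in range(k):
                q += a[i][r] * b[r][j]  # multiply + add
                flops += 2
            out[i][j] = beta * out[i][j] + alpha * q  # two multiplies + one add
            flops += 3
    return out, flops


def gemm_scale_inner(alpha, a, b, beta, c):
    """Scale C by beta once, then apply alpha inside the innermost loop."""
    n, k, m = len(a), len(b), len(b[0])
    out = [row[:] for row in c]
    flops = 0
    for i in range(n):
        for j in range(m):
            out[i][j] = beta * out[i][j]  # one multiply
            flops += 1
            for r in range(k):
                out[i][j] += alpha * a[i][r] * b[r][j]  # two multiplies + one add
                flops += 3
    return out, flops


# Small example: A is N-by-K, B is K-by-M, C is N-by-M with N=2, K=3, M=4.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 2.0], [3.0, 1.0, 0.0, 1.0]]
c = [[1.0] * 4 for _ in range(2)]

c1, flops1 = gemm_scale_last(2.0, a, b, 0.5, c)
c2, flops2 = gemm_scale_inner(2.0, a, b, 0.5, c)
print(flops1, flops2)
```

Both functions compute the same C, but the first performs 2*N*M*K + 3*N*M flops and the second 3*N*M*K + N*M, which is why the count depends on where alpha is applied.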