The problem size for DGEMM when alpha and beta are not equal to 1 or 0 is 3*N*K*M, but time_dgemm.f in ACML-GPU (Linux x64) uses DNFLOP = 2.0D-6*DBLE(M)*DBLE(N)*DBLE(K)

I suppose the correct value is DNFLOP = 3.0D-6*DBLE(M)*DBLE(N)*DBLE(K)

It depends on how the scaling by alpha and beta is implemented.

// Variant 1: accumulate the product first, then scale once per element
for int i=1..N
    for int j=1..M
        q[i,j] = 0
        for int r=1..K
            q[i,j] += a[i,r]*b[r,j]            // 2*N*M*K (MAD, 2 flops)
        end
        c[i,j] = beta*c[i,j] + alpha*q[i,j]    // 3*N*M (MAD + MUL, 3 flops)
    end
end

// Variant 2: apply alpha inside the innermost loop
for int i=1..N
    for int j=1..M
        c[i,j] = beta*c[i,j]                   // N*M (MUL, 1 flop)
        for int r=1..K
            c[i,j] += alpha*a[i,r]*b[r,j]      // 3*N*M*K (MUL + MAD, 3 flops)
        end
    end
end

So I think you use the 2*N*M*K variant.
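The two loop orderings above can be checked by counting operations directly. The following is only a sketch in Python: the function names are made up for illustration, and the loops count flops instead of touching any matrix data.

```python
def count_flops_scale_after(n, m, k):
    """Variant 1: accumulate a*b into q, then scale once per element."""
    flops = 0
    for i in range(n):
        for j in range(m):
            for r in range(k):
                flops += 2  # q += a*b: one multiply, one add
            flops += 3      # c = beta*c + alpha*q: two multiplies, one add
    return flops


def count_flops_scale_inside(n, m, k):
    """Variant 2: apply alpha inside the innermost loop."""
    flops = 0
    for i in range(n):
        for j in range(m):
            flops += 1      # c = beta*c: one multiply
            for r in range(k):
                flops += 3  # c += alpha*a*b: two multiplies, one add
    return flops


n, m, k = 4, 5, 6
# Closed forms matching the comments in the pseudocode.
assert count_flops_scale_after(n, m, k) == 2 * n * m * k + 3 * n * m
assert count_flops_scale_inside(n, m, k) == 3 * n * m * k + n * m
```

Only the second ordering gives a leading term of 3*N*M*K. An implementation that scales after accumulating stays at 2*N*M*K plus a lower-order term.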

DGEMM is C = alpha*A*B + beta*C. Applying alpha and applying beta each take N*M operations, so you could add 2*N*M to the total flops. But since this is n squared versus n cubed for the matrix multiply operations, it is typically ignored. If N = M = K = 10000, then the extra alpha and beta overhead is only about 1/10000 of the 2*N*M*K flops, or 1/5000 if each multiply-add is counted as a single operation.
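To put numbers on that estimate, here is a small Python sketch. The helper names are hypothetical; the only thing taken from time_dgemm.f is the 2.0D-6*M*N*K formula quoted earlier in the thread, and the 2*N*M scaling cost is the figure from this reply.

```python
def dgemm_mflop(m, n, k):
    """Nominal work in millions of flops, mirroring DNFLOP = 2.0D-6*M*N*K."""
    return 2.0e-6 * m * n * k


def scaling_overhead_fraction(m, n, k, extra_per_element=2, flops_per_mad=2):
    """Ratio of the alpha/beta scaling work to the matrix-multiply work.

    extra_per_element: extra operations per element of C (2 as in the reply).
    flops_per_mad: 2 counts a multiply-add as two flops, 1 as one operation.
    """
    extra = extra_per_element * m * n
    main = flops_per_mad * m * n * k
    return extra / main


n = 10000
print(dgemm_mflop(n, n, n))                                 # nominal MFLOP count
print(scaling_overhead_fraction(n, n, n))                   # about 1/10000
print(scaling_overhead_fraction(n, n, n, flops_per_mad=1))  # about 1/5000
```

Either way the correction shrinks like 1/K, which is why the benchmark can reasonably leave it out for large matrices.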