I've run a benchmark that does dense matrix-matrix multiplication (the DGEMM operation in BLAS level 3) in double precision on a Radeon 7970 GPU. When I use the dgemm function provided by clAmdBlas, I measure about 150 GFLOP/s. When I run the same benchmark using ViennaCL, I get about 220 GFLOP/s, i.e. it's significantly faster. Could this be an issue of clAmdBlas not being tuned for Tahiti yet? As far as I can tell, the kernels in clAmdBlas are precompiled into the .so file. Is it possible that the compiler wasn't as well tuned when the .so file was generated (December, based on the clAmdBlas release date) as it is now? ViennaCL, by contrast, compiles its kernels from source.
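For reference, here is a minimal sketch of how I compute the GFLOP/s figures: a square dgemm of size N performs roughly 2*N^3 floating-point operations (N^3 multiply-adds). The matrix size and timing below are illustrative placeholders, not my actual measurements.

```python
def gemm_gflops(n, seconds):
    """GFLOP/s for an n-by-n dgemm that took `seconds` of wall time."""
    flops = 2.0 * n ** 3  # n^3 multiply-adds -> 2*n^3 floating-point ops
    return flops / seconds / 1e9

# Illustrative only: a 4096x4096 dgemm finishing in ~0.916 s
# would correspond to ~150 GFLOP/s.
print(round(gemm_gflops(4096, 0.916), 1))
```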
Another reason I suspect this could be a compiler optimization issue is that in single precision I get slightly better performance on a 6870 than I do on the 7970.
Shouldn't I expect to reach a significant fraction of the theoretical peak FLOPS, in both single and double precision, for this ALU-bound computation?
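To put the 220 GFLOP/s number in perspective, here is the rough peak-FLOPS arithmetic I'm assuming (reference-clocked HD 7970: 925 MHz, 2048 ALUs, 2 flops per cycle via fused multiply-add, and a 1/4-rate double-precision unit, all per the published specs):

```python
# Assumed HD 7970 reference specs (not measured by me):
clock_hz = 925e6   # reference engine clock
alus = 2048        # stream processors
sp_peak = clock_hz * alus * 2   # ~3.79 TFLOP/s single precision (FMA = 2 flops)
dp_peak = sp_peak / 4           # ~947 GFLOP/s double precision (1/4 rate)

# Fraction of DP peak that the 220 GFLOP/s ViennaCL result represents:
fraction = 220e9 / dp_peak
print(fraction)  # roughly 0.23, i.e. about a quarter of DP peak
```

So even the faster ViennaCL result is only around a quarter of the card's double-precision peak, which is what prompts the question.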
Thanks a lot in advance.