Does anyone have any results for DGEMM for AMD GPUs?

Some Fermi-based Teslas are only getting around 180 GFLOPS for DGEMM in some benchmarks, which I find very surprising...

...I would think that AMD GPUs would perform better on this type of algorithm, but I'm having a hard time finding any DGEMM results for the 58xx series.

Anyone?

If alpha = 1.0 and beta = 0.0; m and n are multiples of 4; k is a multiple of 2; and m <= 16384, n <= 8192, k < 8192, you can use nnsan's IL kernel to achieve just under 500 GFLOPS.
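Since the fast path only applies under those conditions, a host-side guard is an easy way to decide at runtime whether to dispatch to the IL kernel or fall back to a general DGEMM. A minimal sketch (the function name is mine, not part of nnsan's code; the limits are just the ones quoted above):

```c
#include <stdbool.h>

/* Returns true if this DGEMM call satisfies the constraints of the
 * fast IL kernel: alpha == 1, beta == 0, m and n multiples of 4,
 * k a multiple of 2, and m <= 16384, n <= 8192, k < 8192.
 * Hypothetical helper, not from the linked code. */
static bool can_use_il_dgemm(double alpha, double beta,
                             int m, int n, int k)
{
    return alpha == 1.0 && beta == 0.0 &&
           m % 4 == 0 && n % 4 == 0 && k % 2 == 0 &&
           m <= 16384 && n <= 8192 && k < 8192;
}
```

If the check fails you'd call whatever general-purpose GEMM you already have rather than losing correctness for speed.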

http://galaxy.u-aizu.ac.jp/trac/note/wiki/MatrixMultiply

See thread:

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=127963&STARTPAGE=4&FTVAR_FORUMVIEWTMP=Linear

Unfortunately, this kernel reads A and B from images rather than global memory, so there are more constraints on their dimensions. Also, A has to be transposed. C is written to global memory. So, if your matrices normally live in global memory on the GPU, you'll have to write some kernels to handle the packing and transposition into the images. Furthermore, this kernel is row major, so you'll have to come up with an analogous kernel for column major. But if the stars align just right, you can build on this impressive work to run circles around Fermi.
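To make the staging step concrete, here's what the transposition of A amounts to, written as plain host code (in practice you'd do this in a GPU kernel before copying into the image; buffer names are illustrative, not from nnsan's code). Note also the standard trick for column major: since C = A*B in column-major storage occupies the same memory as C^T = B^T * A^T in row major, you can often reuse a row-major kernel by swapping the A and B arguments instead of writing a second kernel.

```c
/* Transpose a rows-by-cols row-major matrix src into dst, which
 * becomes cols-by-rows row-major. This is the layout change the
 * IL kernel requires for A before it is packed into an image.
 * Illustrative sketch only. */
void transpose(const double *src, double *dst, int rows, int cols)
{
    for (int i = 0; i < rows; ++i)
        for (int j = 0; j < cols; ++j)
            dst[j * rows + i] = src[i * cols + j];
}
```

A naive out-of-place transpose like this is fine for staging, since its cost is O(m*k) against the O(m*n*k) of the multiply itself.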