Matrix multiplication performance

Discussion created by clop on Nov 26, 2009
Latest reply on Dec 2, 2009 by clop
Real-life performance numbers requested


I'm contemplating porting my single-precision numerical CUDA code to the ATI/AMD platform.

Obviously, I need to justify this effort. Unfortunately, I have so far failed to find a real-life performance comparison of the new Radeon chips with the new NVidia chips (Fermi), or at least with those of the previous generation (GT200), on GPGPU tasks. The two companies advertise their theoretical peak FLOPS quite a bit, but those figures are not very useful to me.

V. Volkov published a few papers in which he analyzed the performance of NVidia chips. His open-source code appears to be state-of-the-art for those chips. Among other things, he wrote matrix-matrix multiplication code that found its way into the NVidia CUBLAS library.

Hence my question: if ATI/AMD truly believes that its high-end chips are faster than NVidia's for GPGPU applications, would it mind publishing performance comparison numbers for some standard numerical linear algebra tasks, such as single-precision matrix-matrix multiplication? It would be particularly useful to see these numbers produced by OpenCL code of reasonable complexity: I cannot afford to port and maintain my code in any kind of assembler-level language. It would also be educational to compare the complexity of Volkov's CUDA matrix multiplication code with that of OpenCL code for Radeon.
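To be concrete about the kind of number I am asking for: the usual figure for matrix multiplication is the sustained rate in GFLOP/s, counting 2*N^3 floating-point operations for an N x N single-precision product and dividing by the measured wall-clock time. Below is a minimal sketch in plain C, only to pin down the arithmetic. The names sgemm_naive and sgemm_gflops are my own, made up for illustration; the kernel is a naive reference loop, not Volkov's code and not anything from CUBLAS or an OpenCL library.

```c
#include <stddef.h>

/* Naive reference SGEMM (hypothetical helper, for illustration only):
   C = A * B for square n x n matrices stored in row-major order. */
void sgemm_naive(size_t n, const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += a[i * n + k] * b[k * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* Sustained rate in GFLOP/s for an n x n product that took `seconds`:
   n^3 multiplies plus n^3 adds, i.e. 2*n^3 operations in total. */
double sgemm_gflops(size_t n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

For example, a 4096 x 4096 product that completes in 0.5 s corresponds to roughly 275 GFLOP/s. That is the sort of measured figure, as opposed to the theoretical peak, that I would like to see for both vendors.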

On the NVidia forums I see a lot of pessimism about the ATI platform for GPGPU tasks, even in single precision, and I would assume that much of this pessimism (if unjustified) could be dispelled by such a publication. I find it somewhat funny that the following Google search `radeon 5870 "matrix multiplication"' returns more NVidia-related than Radeon-related references.