7 Replies Latest reply on Dec 2, 2009 2:06 AM by clop

    Matrix multiplication performance

    clop
      Real-life performance numbers requested

      Hello,

      I'm contemplating porting my single-precision numerical CUDA code to ATI/AMD platform.

      Obviously, I need to justify this effort. Unfortunately, I have so far failed to find real-life performance comparisons of the new Radeon chips with the new NVidia chips (Fermi), or at least with those of the previous generation (GT200), on GPGPU tasks. The two companies advertise their theoretical peak FLOPS quite a bit, but those numbers are not very useful to me.

      V. Volkov has published a few papers in which he analyzed the performance of NVidia chips (see http://www.cs.berkeley.edu/~volkov/). His open-source code appears to be state-of-the-art for those chips. Among other things, he wrote matrix-matrix multiplication code that found its way into NVidia's CUBLAS library.

      Hence my question: if ATI/AMD truly believes that its high-end chips are faster than NVidia's for GPGPU applications, would it mind publishing performance comparison numbers for some standard numerical linear algebra tasks, such as single-precision matrix-matrix multiplication? It would be particularly useful to see these numbers produced by OpenCL code of reasonable complexity: I cannot afford to port and maintain my code in any kind of assembler-level language. It would also be educational to compare the complexity of Volkov's matrix multiplication CUDA code with that of OpenCL code for Radeon.

      I see a lot of pessimism w.r.t. the ATI platform on GPGPU tasks, even in single precision, on NVidia forums, and I would assume much of this pessimism (if unjustified) could be dispelled by such a publication. I find it somewhat funny that the Google search `radeon 5870 "matrix multiplication"` returns more NVidia-related references than Radeon-related ones.

      Thanks!

        • Matrix multiplication performance
          AndreasStahl


          Originally posted by: clop Hello, I'm contemplating porting my single-precision numerical CUDA code to ATI/AMD platform. Obviously, I need to justify this effort.



          I would recommend, as an exercise, porting a simple kernel that is critical to your program to OpenCL and comparing the running times of OpenCL on NV, OpenCL on ATI, and CUDA on NV. Don't leave it to the PR departments: if ATI or NV published OpenCL performance analyses against each other, who do you think would come out on top? (Besides, all drivers except CUDA's are as of yet still in beta, and it shows, performance-wise.)

          Post back here with the results, as I'd be interested to see them, too. I would hope satisfying our interest is justification enough.

          http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-ATI-Stream-v2.0-Beta.aspx is a nice article on how to port the most common CUDA constructs to OpenCL; NVidia has a similar paper, if memory serves.
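          From what I remember, the mechanical part of such a port is mostly a one-to-one renaming. My own summary (not taken from either paper) of the correspondence:

```c
// CUDA                          ->  OpenCL
// __global__ void k(...)        ->  __kernel void k(...)
// __shared__ float s[N];        ->  __local float s[N];
// __syncthreads();              ->  barrier(CLK_LOCAL_MEM_FENCE);
// threadIdx.x                   ->  get_local_id(0)
// blockIdx.x * blockDim.x
//     + threadIdx.x             ->  get_global_id(0)
// cudaMalloc / cudaMemcpy       ->  clCreateBuffer / clEnqueue{Read,Write}Buffer
// kernel<<<grid, block>>>(...)  ->  clSetKernelArg + clEnqueueNDRangeKernel
```

          The renaming is the easy half; tuning for the hardware is the part no table captures.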


          Originally posted by: clop I have so far failed to find real-life performance comparisons of the new Radeon chips with the new NVidia chips (Fermi)


          Fermi is still in development, hence there is no real-life performance data yet, only PR.


            • Matrix multiplication performance
              clop

              Thank you for your pointers and advice.

              As I mentioned in my opening post, I'm interested not only in the speed of the code but also in its simplicity. My kernel code is fairly complex, and I won't port it unless I see that a well-performing OpenCL implementation of an operation as simple as matrix-matrix multiplication (or something similar) is concise and easy to understand.
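              For a sense of scale, the naive OpenCL kernel I would want to compare against is only a few lines (an untested sketch; the kernel name and row-major layout are my own choices):

```c
// One work-item per element of C; A, B, C are size x size, row-major.
__kernel void sgemm_naive(const int size,
                          __global const float *A,
                          __global const float *B,
                          __global float *C)
{
    const int col = get_global_id(0);
    const int row = get_global_id(1);

    float acc = 0.0f;
    for (int k = 0; k < size; ++k)
        acc += A[row * size + k] * B[k * size + col];
    C[row * size + col] = acc;
}
```

              The question is how much more code, and of what kind, it takes to make something like this actually fast on a Radeon.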

              I might not have invested enough effort in researching ATI/AMD's technologies, but I was under the impression that programming NVidia chips for GPGPU tasks used to be easier, and that this drew a lot of programmers (like myself) into NVidia's realm. Specifically, I didn't find a way of explicitly addressing shared (a.k.a. local) memory in Brook+, and this is critical for my application. I'm working on research-grade code that I modify quite frequently, hence any kind of assembler-level programming is out of the question for me.
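              To be concrete, the kind of explicit local-memory control I rely on in CUDA would look roughly like this in OpenCL terms (an untested sketch; the 16x16 tile size and the assumption that the matrix dimension divides evenly by it are mine):

```c
#define TILE 16  // work-group is TILE x TILE

__kernel void sgemm_tiled(const int size,
                          __global const float *A,
                          __global const float *B,
                          __global float *C)
{
    __local float Asub[TILE][TILE];  // explicitly addressed local memory,
    __local float Bsub[TILE][TILE];  // the analogue of CUDA's __shared__

    const int lx = get_local_id(0), ly = get_local_id(1);
    const int col = get_global_id(0), row = get_global_id(1);
    float acc = 0.0f;

    for (int t = 0; t < size; t += TILE) {
        // each work-item stages one element of each tile
        Asub[ly][lx] = A[row * size + (t + lx)];
        Bsub[ly][lx] = B[(t + ly) * size + col];
        barrier(CLK_LOCAL_MEM_FENCE);  // wait until the tiles are loaded

        for (int k = 0; k < TILE; ++k)
            acc += Asub[ly][k] * Bsub[k][lx];
        barrier(CLK_LOCAL_MEM_FENCE);  // done reading this tile
    }
    C[row * size + col] = acc;
}
```

              If something of this shape runs well on a Radeon, that would answer most of my concerns.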

              I am under the impression (but can offer no evidence) that NVidia currently owns a very significant share of the GPGPU market. If ATI/AMD is interested in selling its chips to current NVidia customers who use them for GPGPU purposes, it seems reasonable for ATI/AMD to publish tutorials and best-practices guides for porting CUDA code to the ATI/AMD platform. Obviously, such literature should include a green-apple-to-red-apple speed comparison. Some state-of-the-art CUDA code (e.g. Volkov's matrix-matrix multiplication, referenced in my opening post) is open source and could be a good starting point for such a tutorial. Quite frankly, I believe this step is ATI/AMD's job, not mine.

              > http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-ATI-Stream-v2.0-Beta.aspx is a nice article on how to port the most general of CUDA to OpenCL

              I read this article before, but having gained some experience with CUDA, I suspect that porting CUDA code to the ATI/AMD platform is trickier than replacing a bunch of API calls and qualifiers. For example, I don't quite understand how to replace my multi-threaded scalar CUDA code with OpenCL vector code, which seems to be required for efficient usage of ATI/AMD chips. NVidia took a great approach by offering many examples in its tutorials, and I would certainly benefit from seeing those or similar ones in ATI/AMD's tutorials.

                • Matrix multiplication performance
                  genaganna


                  Originally posted by: clop http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-ATI-Stream-v2.0-Beta.aspx is a nice article on how to port the most common CUDA constructs to OpenCL. I read this article before, but having gained some experience with CUDA, I suspect that porting CUDA code to the ATI/AMD platform is trickier than replacing a bunch of API calls and qualifiers. For example, I don't quite understand how to replace my multi-threaded scalar CUDA code with OpenCL vector code, which seems to be required for efficient usage of ATI/AMD chips. NVidia took a great approach by offering many examples in its tutorials, and I would certainly benefit from seeing those or similar ones in ATI/AMD's tutorials.


                  clop,

                  Converting scalar CUDA code to OpenCL vector code is similar to converting scalar CPU code to SSE CPU code. It is not always easy to convert scalar code to vector code; vectorizing RadixSort, for example, could be a challenging task. We could help you if you point us to a sample where you are facing a vectorization issue.
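                  To illustrate the reshaping with plain CPU code (a portable sketch; the 4-wide accumulator stands in for an OpenCL float4 or an SSE register, and the names and square-matrix assumption are illustrative):

```c
#include <assert.h>
#include <stddef.h>

enum { N = 8 };  /* small demo size; square N x N row-major matrices */

/* Scalar version: one output element at a time, the direct analogue
   of a one-thread-per-element CUDA kernel. */
static void matmul_scalar(const float *A, const float *B, float *C) {
    for (size_t row = 0; row < N; ++row)
        for (size_t col = 0; col < N; ++col) {
            float acc = 0.0f;
            for (size_t k = 0; k < N; ++k)
                acc += A[row * N + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
}

/* "Vectorized" version: each step computes four adjacent output
   columns, the same reshaping one would express with float4 in
   OpenCL (or an SSE register on the CPU).  Requires N % 4 == 0. */
static void matmul_vec4(const float *A, const float *B, float *C) {
    for (size_t row = 0; row < N; ++row)
        for (size_t col = 0; col < N; col += 4) {
            float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
            for (size_t k = 0; k < N; ++k) {
                const float a = A[row * N + k]; /* scalar broadcast, like a.xxxx */
                for (int v = 0; v < 4; ++v)     /* one 4-wide multiply-add */
                    acc[v] += a * B[k * N + col + v];
            }
            for (int v = 0; v < 4; ++v)
                C[row * N + col + v] = acc[v];
        }
}
```

                  The mechanical rewrite above is the easy half; the hard part on the GPU is choosing which loop becomes the vector dimension and making the memory accesses line up with it.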


                    • Matrix multiplication performance
                      hazeman

                      I've been following posts about OpenCL on ATI's GPUs for some time now.

                      As I see it, ATI's OpenCL compiler isn't currently usable for real computations.

                      For the 48xx series, OpenCL is much slower than Brook (simple tests show a 3x slowdown). The problem is the missing local memory (simulated by global memory) and the memory access (no caching). I've asked whether ATI is going to solve this any time soon, and the answer was no.

                      It looks like OpenCL on the 58xx should work much better (it has real local memory, and there should be no problem with global memory). But in real applications it can be slower than a GeForce 8800GT ( http://cerberus.fileburst.net/showthread.php?t=55291&page=3 ).

                      IMHO the best thing to do for now is to wait a few months. First we'll see the real performance of Fermi, and maybe ATI will improve its OpenCL compiler.

                      And if you must use OpenCL now, it's probably better to stay with NVidia.


                        • Matrix multiplication performance
                          clop

                          Hazeman,

                          Thank you for the insightful post.

                          As far as I understand, explicitly addressable local memory is a fairly new feature of AMD/ATI's chips. Maybe you're right that it's better to wait a few months until the engineers who are willing to stay on the bleeding edge gain experience with it.

                          I find it unfortunate that AMD/ATI brings what it claims to be a multi-teraflop chip to market, yet declines to publish open-source example code, as simple as matrix-matrix multiplication, that illustrates the chip's real-life performance and ease of programming.