2 Replies Latest reply on Nov 2, 2009 8:22 PM by claudio_albanese

    SGEMM variations

    claudio_albanese
      Looking for help porting variations on SGEMM to ATI cards

      I wonder if anyone on this forum would like to help with a port project. 

      I recently released an open source pricing library based on GPU computing. You may find it on my homepage at www.albanese.co.uk by following the link to OPLib. The library includes a set of low-level routines written in CUDA and in C to which one can reduce most valuation and risk management tasks. In OPLib I also give an orchestration example for Monte Carlo pricing.

      With CUDA on a 4-GPU system with Tesla C1060 cards I achieve a sustained performance of 340 GF/sec per card, i.e. about 1.36 TF/sec of sustained performance on a calibration task. Calibration is a very flop-hungry operation: it takes about 5 petaflop of computation per risk factor, give or take a factor of two. 340 GF/sec is excellent if one considers that peak performance for multiplication of large matrices on a Tesla C1060 is 370 GF/sec, while I have rather small matrices of size 512, and the sustained benchmark I mentioned includes all the high-level orchestration and glue code needed for a real-life implementation. This makes me hope that once the crucial routines are optimized, sustained performance on one of the latest ATI cards could reach 2 TF/sec per card.

      Achieving this depends on the ability to port a few routines which I released in the public domain in OPLib, namely:

      (i) SGEMM4, a routine that operates on an array of pairs of matrices and multiplies them concurrently.

      (ii) SGEMV3, a routine that takes as arguments a matrix and an array of vectors stored non-contiguously in memory, and applies the matrix to each of those vectors.

      (iii) SGEMV4, a routine that batches a number of SGEMV3 calls.

      (iv) SDOT2, a routine that batches a number of calls to SDOT, storing the dot products in an array in global GPU memory.

      (v) SCOPY2, a routine that batches a number of calls to SCOPY.
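      For anyone considering the port, here is a plain-C sketch of the semantics I understand these routines to have. The names, signatures, and row-major layout are my own illustration for reference purposes, not OPLib's actual interface; see the OPLib sources for the real definitions.

```c
#include <stddef.h>

/* Batched C = A*B over an array of matrix pairs (SGEMM4-like semantics).
   Row-major n x n matrices; batch index b selects the b-th pair. */
static void sgemm4_ref(int batch, int n,
                       const float *A, const float *B, float *C)
{
    for (int b = 0; b < batch; ++b) {
        const float *a  = A + (size_t)b * n * n;
        const float *bm = B + (size_t)b * n * n;
        float       *c  = C + (size_t)b * n * n;
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < n; ++k)
                    acc += a[i * n + k] * bm[k * n + j];
                c[i * n + j] = acc;
            }
    }
}

/* One matrix applied to an array of vectors scattered in memory
   (SGEMV3-like semantics): y_ptrs[p] = M * x_ptrs[p]. */
static void sgemv3_ref(int n, const float *M, int nvec,
                       const float *const *x_ptrs, float **y_ptrs)
{
    for (int p = 0; p < nvec; ++p)
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int j = 0; j < n; ++j)
                acc += M[i * n + j] * x_ptrs[p][j];
            y_ptrs[p][i] = acc;
        }
}

/* Batched dot products stored into an output array (SDOT2-like semantics). */
static void sdot2_ref(int batch, int n,
                      const float *x, const float *y, float *out)
{
    for (int b = 0; b < batch; ++b) {
        float acc = 0.0f;
        for (int i = 0; i < n; ++i)
            acc += x[b * n + i] * y[b * n + i];
        out[b] = acc;
    }
}
```

      On the GPU each batch element would naturally map to a thread block, which is where the optimization effort goes; the loops above only pin down what the kernels must compute.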

      The single-precision variants of these routines are my first priority. I would also be interested in double-precision variants of course, but that is of secondary importance: this sort of algorithm is quite robust even in single precision, with errors typically well below the tolerance level.

      If anyone on this forum is interested in finance applications and can optimize handwritten IL code, I would be very grateful for advice, or ideally for a contribution to OPLib. This could be a good topic for graduate students, or for anyone who would like exposure to the finance sector by writing a paper that I can assure you would find a broad readership.

      Regards, Claudio

      email: claudio@albanese.co.uk 

        • SGEMM variations
          st-cyclone

          Wow! I've been into exactly the same stuff for a while now, for my B.Sc. thesis. I am very interested in hand-optimized IL, but on the other hand (someone correct me if I'm wrong) didn't AMD already port (and optimize) ACML's SGEMM? See here: http://developer.amd.com/gpu_assets/ATI%20Stream%20Computing%20-%20ACML-GPU%20SGEMM%20Optimization%20Illustration.ppt

          I took a quick dive into IL, but I am concentrating on OpenCL for now. Still, I've been imagining how fast an n-Hemlock system running hand-optimized IL could churn through MC simulations.

          Currently I am working on a Monte Carlo and/or a Black-Scholes OpenCL implementation; which one will be decided with the help of a local specialist, depending on what local investment firms need.

          My plan was to write the thing in OpenCL and then hand-optimize it if performance falls short. Hopefully I won't need to, since I am no IL guru! But I have a feeling I will have to become one...

          As you may expect from a computer engineering undergraduate (with just one microeconomics course), my finance knowledge is whatever Google/Wikipedia could provide!

            • SGEMM variations
              claudio_albanese

              > didn't AMD already port (and optimize) ACML's SGEMM?

              True, but someone online who goes by prunedtree posted code that runs at 980 GF/sec on an RV770. See here:

              http://forum.beyond3d.com/showthread.php?t=54842

              By extrapolating, on the next-generation part one should be able to reach around 2 TF/sec, which would be terrific. On the Tesla C1060 hardware that I am currently using, the same benchmark stands at only 360 GF/sec.

              OpenCL matrix-multiply code on ATI hardware currently performs at around 120 GF/sec.


              http://forums.amd.com/forum/messageview.cfm?catid=390&threadid=120413&forumid=9



              Looks like the OpenCL compiler is unable to get anywhere close to the performance of handwritten IL code. 

              That's a problem for me because I need not SGEMM itself but batched versions of it. See my web page on OPLib.

              With CUDA it was easier because the compiler generates nearly optimal code, so I could write the extensions myself based on Vasily Volkov's code. I am now trying to learn IL to do the same, but it's tough! As a compromise one could perhaps use Brook+, but that too is only about 30% efficient.

              Regarding Monte Carlo pricing, you may want to look at OPLib. There I have implementations for both the Tesla and the Nehalem, and I find the Nehalem to be 3 times faster than the Tesla. There has been a lot of hype about using GPUs for Monte Carlo pricing, but in my opinion it is misplaced: the talk of 200x speed-ups, if you look carefully into it, only reflects poorly optimized CPU code.

              Also notice that what really matters is generating scenarios not for Black-Scholes but for generic stochastic-vol processes. For those I generate about 230 million evals per second on a Tesla C1060 and about 670 million evals per second on a pair of Nehalems. I am using CUDA, but I doubt OpenCL would do any better. The problem is that SIMD architectures cannot deal well with asynchronous branching, and on a CPU there is much more one can do in the way of cache optimization. See my code in OPLib for more details.

              My opinion is that GPUs should be used for high throughput algorithms such as SGEMM variations. That is where they truly shine. Scenario simulation is best done CPU side.


              > As you may expect from a computer engineering undergraduate (with just one microeconomics course), my finance knowledge is whatever Google/Wikipedia could provide!


              This paper contains all the theory one needs and takes you straight to the cutting edge:

              http://www.level3finance.com/change.pdf

              I wrote it with engineers in mind, but perhaps it is better suited to graduate students. If you need a more elementary introduction to finance, I am also posting my King's College lectures online here:

              http://www.level3finance.com/teaching.html