SGEMM variations

Discussion created by claudio_albanese on Oct 26, 2009
Latest reply on Nov 2, 2009 by claudio_albanese
Looking for help porting variations on SGEMM to ATI cards

I wonder if anyone on this forum would like to help with a porting project.

I recently released an open source pricing library based on GPU computing. You can find it on my homepage by following the link to OPLib. The library includes a set of low-level routines, written in CUDA and in C, to which most valuation and risk management tasks can be reduced. In OPLib I also give an orchestration example for Monte Carlo pricing.

With CUDA on a 4-GPU system of Tesla 1060 cards I achieve a sustained performance of 340 GF/sec per card, i.e. about 1.36 TF/sec in total, on a calibration task. Calibration is a very flop-intensive operation: it takes about 5 petaflop of work (5 × 10^15 floating-point operations) per risk factor, give or take a factor of two. 340 GF/sec is excellent considering that peak performance for multiplying large matrices on the Tesla 1060 is 370 GF/sec, whereas my matrices are rather small (size 512) and the sustained figure I quoted includes all the high-level orchestration and glue code needed for a real-life implementation. This makes me hope that, once the crucial routines are optimized, sustained performance on one of the latest ATI cards can reach 2 TF/sec per card.
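As a rough sanity check on these figures, here is a small C sketch of the arithmetic. It assumes the standard count of 2n^3 floating-point operations for a dense n x n matrix product and reads the "5 petaflops" above as a total of 5 × 10^15 operations. The helper names `sgemm_flops` and `seconds_at` are made up for this illustration and are not part of OPLib.

```c
#include <stddef.h>

/* Floating-point operations in one dense n x n matrix product:
 * n*n output entries, each needing n multiplies and n adds. */
double sgemm_flops(double n)
{
    return 2.0 * n * n * n;
}

/* Seconds needed to execute `flops` operations at a sustained
 * rate of `gflops_per_sec` (1 GF/sec = 1e9 operations per second). */
double seconds_at(double flops, double gflops_per_sec)
{
    return flops / (gflops_per_sec * 1e9);
}
```

Under these assumptions, one 512 x 512 product costs `sgemm_flops(512)` = 268,435,456 operations, or roughly 0.8 ms at 340 GF/sec, and `seconds_at(5e15, 4 * 340.0)` gives about 3,676 seconds, i.e. roughly an hour per risk factor on the 4-GPU system.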

Achieving this depends on the ability to port a few routines which I released in the public domain in OPLib, namely:

(i) SGEMM4, a routine which operates on an array of pairs of matrices and multiplies them concurrently.

(ii) SGEMV3, a routine that takes as arguments a matrix and an array of vectors stored non-contiguously in memory and applies the matrix to those vectors.

(iv) SGEMV4, a routine that batches a number of SGEMV3 calls.

(v) SDOT2, a routine that batches a number of calls to SDOT while storing the dot products in an array in global GPU memory.

(vi) SCOPY2, a routine that batches a number of calls to SCOPY. 
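To make the intended semantics concrete for anyone considering the port, here is a minimal CPU reference sketch in C of three of the batched operations described above. It is only an illustration based on the descriptions in this post. The function names, argument order, row-major layout, square matrices and the absence of alpha/beta scaling are all assumptions, and the actual OPLib signatures may differ.

```c
#include <stddef.h>

/* SGEMM4-style reference (assumed semantics): for each pair b, compute
 * C[b] = A[b] * B[b]. Matrices are n x n, row-major, stored back to back. */
void batched_sgemm_ref(size_t batch, size_t n,
                       const float *A, const float *B, float *C)
{
    for (size_t b = 0; b < batch; ++b) {
        const float *a = A + b * n * n;
        const float *m = B + b * n * n;
        float *c = C + b * n * n;
        for (size_t i = 0; i < n; ++i)
            for (size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (size_t k = 0; k < n; ++k)
                    acc += a[i * n + k] * m[k * n + j];
                c[i * n + j] = acc;
            }
    }
}

/* SGEMV3-style reference (assumed semantics): apply one n x n row-major
 * matrix M to `count` vectors that need not be contiguous in memory;
 * x[v] and y[v] point to the v-th input and output vector. */
void multi_sgemv_ref(size_t count, size_t n, const float *M,
                     const float *const *x, float *const *y)
{
    for (size_t v = 0; v < count; ++v)
        for (size_t i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += M[i * n + k] * x[v][k];
            y[v][i] = acc;
        }
}

/* SDOT2-style reference (assumed semantics): out[v] = dot(x[v], y[v])
 * for `count` pairs of length-n vectors; results go into one array. */
void batched_sdot_ref(size_t count, size_t n,
                      const float *const *x, const float *const *y,
                      float *out)
{
    for (size_t v = 0; v < count; ++v) {
        float acc = 0.0f;
        for (size_t k = 0; k < n; ++k)
            acc += x[v][k] * y[v][k];
        out[v] = acc;
    }
}
```

In each case the outer loop over the batch is independent across iterations, which is what makes these routines natural candidates for a single GPU kernel launch rather than many small BLAS calls. A reference of this kind is mainly useful for checking a ported kernel's output on small inputs.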

The single precision variants of these routines are my first priority. I would of course also be interested in double precision variants, but that is of secondary importance, as this sort of algorithm is quite robust in single precision, with errors typically well below the tolerance level.

If anyone on this forum is interested in finance applications and can optimize handwritten IL code, I would be very grateful if they would contact me with advice or, ideally, consider contributing to OPLib. This could be a good topic for graduate students or anyone who would like exposure to the finance sector by writing a paper that, I can assure you, would find a broad readership.

Regards, Claudio