I wonder if anyone on this forum would like to help with a porting project.

I recently released an open source pricing library based on GPU computing. You may find it on my homepage at www.albanese.co.uk by following the link to OPLib. The library includes a set of low-level routines written in CUDA and in C to which one can reduce most valuation and risk management tasks. In OPLib I also give an orchestration example for Monte Carlo pricing.

With CUDA on a 4-GPU system of Tesla 1060 cards I achieve a sustained performance of 340 GF/sec per card, i.e. about 1.36 TF/sec in total, on a calibration task. Calibration is very flop-intensive: it takes about 5 petaflop (5 x 10^15 floating-point operations) per risk factor, give or take a factor of two. 340 GF/sec is excellent considering that peak performance for multiplication of large matrices on a Tesla 1060 is 370 GF/sec, while my matrices are rather small (size 512), and the sustained benchmark I mentioned counts all the high-level orchestration and glue code needed for a real-life implementation. This makes me hope that, once the crucial routines are optimized, sustained performance on one of the latest ATI cards can reach 2 TF/sec per card.
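To make the arithmetic behind these figures explicit, here is a minimal sketch in C. The function names are made up for illustration, and it assumes that "5 petaflops per risk factor" means a total of 5 x 10^15 floating-point operations.

```c
#include <assert.h>

/* Aggregate throughput (GF/sec) of n_cards cards that each sustain
   gf_per_card GF/sec. */
double aggregate_gflops(int n_cards, double gf_per_card) {
    return n_cards * gf_per_card;
}

/* Fraction of the peak rate that the sustained rate represents. */
double efficiency(double sustained_gf, double peak_gf) {
    assert(peak_gf > 0.0);
    return sustained_gf / peak_gf;
}

/* Seconds needed to execute total_flop operations at rate_gf GF/sec.
   total_flop is an operation count (an assumption about the post's units). */
double seconds_for(double total_flop, double rate_gf) {
    assert(rate_gf > 0.0);
    return total_flop / (rate_gf * 1e9);
}
```

With the numbers quoted above, aggregate_gflops(4, 340.0) gives 1360 GF/sec, efficiency(340.0, 370.0) is about 0.92, and seconds_for(5e15, 1360.0) is roughly 3,700 seconds, i.e. about an hour per risk factor under that assumption.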

Achieving this depends on the ability to port a few routines which I released in the public domain in OPLib, namely:

(i) SGEMM4, a routine which operates on an array of pairs of matrices and multiplies each pair concurrently.

(ii) SGEMV3, a routine that takes as arguments a matrix and an array of vectors stored non-contiguously in memory, and applies the matrix to those vectors.

(iii) SGEMV4, a routine that batches a number of SGEMV3 calls.

(iv) SDOT2, a routine that batches a number of calls to SDOT while storing the dot products in an array in global GPU memory.

(v) SCOPY2, a routine that batches a number of calls to SCOPY.
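For anyone who wants a plain statement of what these routines compute before looking at the GPU code, here is a rough CPU reference sketch in C. It is not the OPLib code: the function names, the argument order, the square n x n shapes and the row-major layout are all assumptions made for this illustration, and the real routines may differ.

```c
#include <assert.h>

/* SGEMM4-like reference: C[b] = A[b] * B[b] for every pair b.
   Matrices are n x n and row-major (layout chosen for this sketch). */
void sgemm_batch_ref(int batch, int n, const float *const *A,
                     const float *const *B, float *const *C) {
    assert(batch >= 0 && n > 0);
    for (int b = 0; b < batch; ++b)
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < n; ++k)
                    acc += A[b][i * n + k] * B[b][k * n + j];
                C[b][i * n + j] = acc;
            }
}

/* SGEMV3-like reference: y[v] = A * x[v] for one matrix A and many
   vectors. The vectors are reached through pointers, so they need not
   be contiguous in memory. */
void sgemv_multi_ref(int nvec, int n, const float *A,
                     const float *const *x, float *const *y) {
    assert(nvec >= 0 && n > 0);
    for (int v = 0; v < nvec; ++v)
        for (int i = 0; i < n; ++i) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += A[i * n + k] * x[v][k];
            y[v][i] = acc;
        }
}

/* SDOT2-like reference: out[v] = dot(x[v], y[v]), with all results
   written to one output array. */
void sdot_batch_ref(int nvec, int n, const float *const *x,
                    const float *const *y, float *out) {
    assert(nvec >= 0 && n > 0);
    for (int v = 0; v < nvec; ++v) {
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += x[v][k] * y[v][k];
        out[v] = acc;
    }
}
```

SGEMV4 and SCOPY2 follow the same pattern: an outer loop over independent SGEMV3 or SCOPY calls. On the GPU that outer loop is what gets spread across thread blocks, so that many small problems fill the card.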

The single precision variants of these routines are my first priority. I would also be interested in double precision variants of course, but that's of secondary importance, as this sort of algorithm is quite robust in single precision too, with errors typically well below the tolerance level.
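One simple way to get a feel for the single precision error is to compare a float-accumulated dot product against a double-accumulated one on the same inputs. The helper below is a hypothetical sketch for that check only; it says nothing about the error of a full calibration run.

```c
#include <assert.h>

/* Relative difference between a dot product accumulated in float and
   the same dot product accumulated in double, for float inputs. */
double sdot_rel_error(const float *x, const float *y, int n) {
    assert(n > 0);
    float s = 0.0f;   /* single precision accumulator */
    double d = 0.0;   /* double precision reference */
    for (int i = 0; i < n; ++i) {
        s += x[i] * y[i];
        d += (double)x[i] * (double)y[i];
    }
    double diff = (double)s - d;
    if (diff < 0.0) diff = -diff;
    double mag = d < 0.0 ? -d : d;
    /* Fall back to the absolute difference if the reference is zero. */
    return mag > 0.0 ? diff / mag : diff;
}
```

Running this on vectors of length 512 (the matrix size mentioned above) and comparing the result with your pricing tolerance gives a quick sanity check before committing to a single precision port.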

If anyone on this forum is interested in finance applications and can optimize handwritten IL code, I would be very grateful if they would contact me with advice or, ideally, consider contributing to OPLib. This could be a good topic for graduate students or anyone who would like exposure to the finance sector by writing a paper that, I can assure you, would find a broad readership.

Regards, Claudio

email: claudio@albanese.co.uk

Wow! I've been into the exact same stuff for a while now, for my B.Sc. thesis. I am very interested in hand-optimized IL, but on the other hand (someone correct me if I'm wrong), didn't AMD port (and optimize) its ACML SGEMM? See here: http://developer.amd.com/gpu_assets/ATI%20Stream%20Computing%20-%20ACML-GPU%20SGEMM%20Optimization%20Illustration.ppt

I took a quick dive into IL, but I am concentrating on OpenCL (ocl) for now. But heck, I've been imagining how quick an n-Hemlock system running hand-optimized IL would be at MC simulations.

Currently I am working on an MC and/or Black-Scholes ocl implementation; which one will be decided with the help of a local specialist, depending on what local investment firms need.

My plan was to write the thing in ocl and then hand-optimize it if performance isn't maximized. Hopefully I won't need to, since I'm no IL guru! But I have a feeling that I will have to become one...

As you may expect from a computer engineering undergraduate (with just a microeconomics course), my finance knowledge is just whatever Google/Wikipedia could provide me!