GPU Compute Programmers,

I have a C++ program that currently relies on ACML (LAPACK) to invert and multiply fairly large matrices of single-precision fp values (e.g. 4,000 x 4,000). These matrices are very sparse, although they do not always fit nicely into a diagonal matrix, so I cannot presently reduce them. I also have to do this invert-and-multiply several times (serially) as part of a Newton-Raphson iteration. However, I have several thousand permutations that can be done in parallel, each with a small change to the matrix before again calculating and inverting the Jacobian. This is all single-precision fp and seems perfectly suited to the GPU. My question is this...
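
To make the workload concrete, here is a minimal sketch of the loop structure described above, shrunk to a toy 2 x 2 nonlinear system. Everything in it is an assumption for illustration: the names solve_dense and newton_toy are hypothetical, the test equations are made up, and a plain dense Gaussian elimination with partial pivoting stands in for the ACML LAPACK calls. Note that it solves J * dx = -F directly instead of forming the inverse of the Jacobian and multiplying, which is a swapped-in technique and not what the program currently does.

```cpp
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Solve A x = b for a dense, row-major n x n matrix using Gaussian
// elimination with partial pivoting. A is overwritten and b receives x.
// Returns false if a pivot is too close to zero.
bool solve_dense(std::vector<float>& A, std::vector<float>& b, std::size_t n) {
    for (std::size_t k = 0; k < n; ++k) {
        // Pick the row with the largest entry in column k as the pivot row.
        std::size_t p = k;
        for (std::size_t i = k + 1; i < n; ++i)
            if (std::fabs(A[i * n + k]) > std::fabs(A[p * n + k])) p = i;
        if (std::fabs(A[p * n + k]) < 1e-12f) return false;
        if (p != k) {
            for (std::size_t j = 0; j < n; ++j) std::swap(A[k * n + j], A[p * n + j]);
            std::swap(b[k], b[p]);
        }
        // Eliminate column k from the rows below the pivot.
        for (std::size_t i = k + 1; i < n; ++i) {
            const float m = A[i * n + k] / A[k * n + k];
            for (std::size_t j = k; j < n; ++j) A[i * n + j] -= m * A[k * n + j];
            b[i] -= m * b[k];
        }
    }
    // Back substitution on the upper-triangular system.
    for (std::size_t i = n; i-- > 0;) {
        float s = b[i];
        for (std::size_t j = i + 1; j < n; ++j) s -= A[i * n + j] * b[j];
        b[i] = s / A[i * n + i];
    }
    return true;
}

// Newton-Raphson on a made-up system:
//   f1(x, y) = x^2 + y^2 - 4 = 0
//   f2(x, y) = x * y - 1     = 0
// Each iteration rebuilds the Jacobian and solves J * dx = -F.
std::vector<float> newton_toy(float x0, float y0, int max_iter = 20, float tol = 1e-5f) {
    std::vector<float> x = {x0, y0};
    for (int it = 0; it < max_iter; ++it) {
        const float f1 = x[0] * x[0] + x[1] * x[1] - 4.0f;
        const float f2 = x[0] * x[1] - 1.0f;
        if (std::fabs(f1) + std::fabs(f2) < tol) break;  // converged
        std::vector<float> J = {2.0f * x[0], 2.0f * x[1],
                                x[1],        x[0]};
        std::vector<float> rhs = {-f1, -f2};
        if (!solve_dense(J, rhs, 2)) break;  // singular Jacobian, give up
        x[0] += rhs[0];
        x[1] += rhs[1];
    }
    return x;
}
```

In the real program each of the several thousand permutations would run a loop like newton_toy independently on its own slightly modified matrix, which is the part that looks parallel; the iterations inside one loop remain serial.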

I suspect I will need to use the Accelerated Parallel Processing Math Libraries (APPML), as that is the only thing available with BLAS functionality, although I do not see the LAPACK dgetrf_ and dgetri_ functions included in APPML (yes, these are fp64, but I don't need that precision). Would C++ AMP be a better alternative? I am very interested in the HSA feature of passing pointers rather than copying data, as there is a lot of data in flight here and some calculations are still done on the CPU. Ultimately, performance is the key, and I want to make the right architectural decisions to set myself up for the most performance I can wring out of the HSA GPUs coming out over the next 6 months.

Any thoughts, additional questions, or discussion would be greatly appreciated!