
Matrix inversion
rick.weber Jan 11, 2011 3:24 PM (in response to galmok)From what I remember in linear algebra, matrix inversion is just about the worst thing you can do from a numerical stability standpoint. What is the condition number of your matrix? Blocked LU with partial pivoting is fairly stable and is commonly used to solve linear systems. Furthermore, it's one of the best things you can do on a GPU as you can acheive a significant fraction of the GPU's peak performance (since most of your time is spent in matrix multiplications and symmetric rankk updates). People have even looked at outofcore solvers (though, I can't find the paper at the moment) running on multipleGPUs. So, in short, no, this problem is the epitome of what should be done using a GPU.

Matrix inversion
galmok Jan 11, 2011 7:01 PM (in response to rick.weber)Just so there is no confusion. When I wrote a matrix size of 23000, I meant 23000x23000 in doubles which is over 4GB (same amount of memory required for inverted matrix). This is currently being inverted on a normal CPU on a PC with plenty of host memory.
I don't have the condition number of the matrix, sorry. Blocked LU seems to use similar math to Schur's decomposition (which introduces significant noise of the order 1e04 in a matrix filled with random numbers from 01).
I do know that the matrix we are using is dense.

Matrix inversion
rick.weber Jan 11, 2011 7:49 PM (in response to galmok)Yeah, I got that you meant a 23kx23k matrix. You'll definitely have to stream parts of the matrix in and out of the GPU. Another option to look at is to store the matrix across multiple GPUs, giving you more aggregate memory.
When doing LU, you need to use pivoting, but when you do it's one of the most stable and quickest ways to solve fullrank linear systems. You shouldn't be seeing 1e4 error for a random matrix that size in double precision, as each element has O(N) operations applied to it. I would expect you should see errors at most around 1e12 (double epsilon ~=1e17 * 1e5 operations = 1e12).
Here's a decent description of a decent LU algorithm:

