I am pondering how I am going to invert a matrix that isn't going to fit in GPU memory. The matrix size is about 23000 currently but is expected to grow significantly over time.
I tried looking a matrix block decomposition, but accuracy suffers greatly (Matlab simulation).
Is this perhaps one problem not suitable for the GPU due to memory limitations?
From what I remember in linear algebra, matrix inversion is just about the worst thing you can do from a numerical stability standpoint. What is the condition number of your matrix? Blocked LU with partial pivoting is fairly stable and is commonly used to solve linear systems. Furthermore, it's one of the best things you can do on a GPU as you can acheive a significant fraction of the GPU's peak performance (since most of your time is spent in matrix multiplications and symmetric rank-k updates). People have even looked at out-of-core solvers (though, I can't find the paper at the moment) running on multiple-GPUs. So, in short, no, this problem is the epitome of what should be done using a GPU.
Just so there is no confusion. When I wrote a matrix size of 23000, I meant 23000x23000 in doubles which is over 4GB (same amount of memory required for inverted matrix). This is currently being inverted on a normal CPU on a PC with plenty of host memory.
I don't have the condition number of the matrix, sorry. Blocked LU seems to use similar math to Schur's decomposition (which introduces significant noise of the order 1e-04 in a matrix filled with random numbers from 0-1).
I do know that the matrix we are using is dense.
Yeah, I got that you meant a 23kx23k matrix. You'll definitely have to stream parts of the matrix in and out of the GPU. Another option to look at is to store the matrix across multiple GPUs, giving you more aggregate memory.
When doing LU, you need to use pivoting, but when you do it's one of the most stable and quickest ways to solve full-rank linear systems. You shouldn't be seeing 1e-4 error for a random matrix that size in double precision, as each element has O(N) operations applied to it. I would expect you should see errors at most around 1e-12 (double epsilon ~=1e-17 * 1e5 operations = 1e-12).
Here's a decent description of a decent LU algorithm: