Yeah, I got that you meant a 23kx23k matrix. You'll definitely have to stream parts of the matrix in and out of the GPU. Another option to look at is to store the matrix across multiple GPUs, giving you more aggregate memory.
When doing LU, you need to use pivoting, but when you do it's one of the most stable and quickest ways to solve full-rank linear systems. You shouldn't be seeing 1e-4 error for a random matrix that size in double precision, as each element has O(N) operations applied to it. I would expect you should see errors at most around 1e-12 (double epsilon ~=1e-17 * 1e5 operations = 1e-12).
Here's a decent description of a decent LU algorithm:
http://www.cs.utexas.edu/~echan/vpapers/SPAA2010.pdf