I have implemented the naive Gaussian elimination algorithm, but it seems you run into global synchronization issues and therefore have to call the kernel iteratively, once per elimination step.
Did you make any progress?
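
To make the global sync issue concrete, here is a rough sketch in plain Python (not OpenCL) of how I understand the "one kernel call per iteration" pattern. The names `elimination_step` and `gauss_forward` are made up for illustration: `elimination_step` stands in for a single kernel launch, and the host loop in `gauss_forward` plays the role of the global barrier, since step k + 1 must not start before every work-item of step k has finished. It assumes no pivoting is needed (non-zero pivots).

```python
def elimination_step(a, k):
    """Stand-in for ONE kernel launch: eliminate column k below the pivot row."""
    n = len(a)
    # Work-items read the old matrix and write a new one, so the result does
    # not depend on the order in which the (i, j) updates are executed.
    out = [row[:] for row in a]
    for i in range(k + 1, n):            # on a GPU, each (i, j) pair would be one work-item
        factor = a[i][k] / a[k][k]       # assumption: a[k][k] != 0 (no pivoting)
        for j in range(k, len(a[i])):
            out[i][j] = a[i][j] - factor * a[k][j]
    return out


def gauss_forward(a):
    """Host-side loop: one 'launch' per pivot; returning from each call acts as the global sync."""
    n = len(a)
    for k in range(n - 1):
        a = elimination_step(a, k)
    return a


upper = gauss_forward([[2.0, 1.0, 1.0],
                       [4.0, 3.0, 3.0],
                       [8.0, 7.0, 9.0]])
print(upper)
```

For an n x n matrix this means n - 1 launches, which is where the launch overhead comes from.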
I found this:
"Performance Comparison of Cholesky Decomposition on GPUs and FPGAs" (http://saahpc.ncsa.illinois.edu/10/papers/paper_45.pdf).
I've e-mailed two of the authors asking for the OpenCL source, but I haven't heard anything yet.