Consider using Nelder-Mead instead of NLCG or LMA. I'm not sure whether any existing library provides it,
but it would be simple to implement in OpenCL. NM needs only function values (no derivatives), so it is
flexible and sounds like a good "fit" for your problem statement - I've used a C++ implementation I wrote
for similar work for many years. I think there's a partial implementation of NM in Rob Farber's CUDA book,
but it's probably quicker to write a new routine from scratch than to try to translate his code from CUDA to OpenCL.
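
To give an idea of how little code NM needs, here is a minimal sketch in plain Python (not OpenCL, and not the C++ implementation mentioned above). The function name `nelder_mead`, its parameters, and the default tolerances are all made up for illustration. It uses the textbook coefficients (reflection 1, expansion 2, contraction 0.5, shrink 0.5) and assumes the objective takes a list of floats and returns a float.

```python
def nelder_mead(f, x0, step=0.5, ftol=1e-12, xtol=1e-8, max_iter=2000):
    """Minimise f(list_of_floats) -> float with the basic Nelder-Mead moves.

    All names and default values are illustrative, not taken from any library.
    """
    n = len(x0)
    # Initial simplex: x0 plus one extra vertex offset along each axis.
    simplex = [list(x0)]
    for i in range(n):
        vertex = list(x0)
        vertex[i] += step
        simplex.append(vertex)
    values = [f(v) for v in simplex]

    for _ in range(max_iter):
        # Order vertices from best (lowest f) to worst.
        order = sorted(range(n + 1), key=lambda i: values[i])
        simplex = [simplex[i] for i in order]
        values = [values[i] for i in order]

        # Stop once the simplex is both flat in f and small in size.
        size = max(abs(a - b) for v in simplex[1:] for a, b in zip(v, simplex[0]))
        if values[-1] - values[0] < ftol and size < xtol:
            break

        worst = simplex[-1]
        centroid = [sum(v[j] for v in simplex[:-1]) / n for j in range(n)]

        def along(t):
            # Point centroid + t * (centroid - worst); t = 1 is the reflection.
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        f_ref = f(reflected)
        if f_ref < values[0]:
            # Better than the current best: try going twice as far.
            expanded = along(2.0)
            f_exp = f(expanded)
            if f_exp < f_ref:
                simplex[-1], values[-1] = expanded, f_exp
            else:
                simplex[-1], values[-1] = reflected, f_ref
        elif f_ref < values[-2]:
            # Better than the second worst: accept the reflection.
            simplex[-1], values[-1] = reflected, f_ref
        else:
            # Contract outside if the reflection beat the worst, else inside.
            t = 0.5 if f_ref < values[-1] else -0.5
            contracted = along(t)
            f_con = f(contracted)
            if f_con < min(f_ref, values[-1]):
                simplex[-1], values[-1] = contracted, f_con
            else:
                # Nothing helped: shrink every vertex halfway toward the best.
                best = simplex[0]
                simplex = [best] + [
                    [b + 0.5 * (x - b) for b, x in zip(best, v)]
                    for v in simplex[1:]
                ]
                values = [values[0]] + [f(v) for v in simplex[1:]]

    best_index = min(range(n + 1), key=values.__getitem__)
    return simplex[best_index], values[best_index]


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [1.0, 3.0, 5.0, 7.0]  # made-up data lying on y = 2x + 1

    def sse(p):
        # Sum of squared residuals for the line p[0] * x + p[1].
        return sum((p[0] * x + p[1] - y) ** 2 for x, y in zip(xs, ys))

    print(nelder_mead(sse, [0.0, 0.0]))
```

For a GPU port, the objective evaluations are the natural part to move into a kernel, since each one is typically a sum over all data points. The simplex bookkeeping itself is tiny and could stay on the host.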
Thank you for your feedback, ajhill. I went down the road of implementing LMA myself (currently in progress). For anyone running into the same problem: the following post describes the math and the various methods for calculating the iterative step: Linear And Nonlinear Least-Squares With Math.NET | Imaging Shop
Most of the operations are straightforward to do with the routines implemented in math libraries (I used ViennaCL - Linear Algebra Library using CUDA, OpenCL, and OpenMP, since it provides an SVD implementation). What's left to do is to calculate the Jacobian using forward (or centered) differences and to figure out a way to let users easily supply model functions, which will be used to calculate the residuals in an OpenCL kernel with arbitrary kernel arguments.
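
For reference, the finite-difference Jacobian can be sketched on the host side like this. This is plain Python rather than an OpenCL kernel, and every name in it (`residuals`, `numerical_jacobian`, `exp_model`, `rel_step`) is hypothetical. It assumes the user-supplied model is a callable `model(x, params)` and that residuals are defined as model value minus observation.

```python
import math


def residuals(model, params, xs, ys):
    """Residual vector r_i = model(x_i, params) - y_i."""
    return [model(x, params) - y for x, y in zip(xs, ys)]


def numerical_jacobian(model, params, xs, ys, rel_step=1e-6, centered=True):
    """Approximate J[i][j] = d r_i / d params[j] by finite differences.

    centered=True costs two residual evaluations per parameter;
    centered=False (forward differences) costs one plus a shared baseline.
    """
    m, n = len(xs), len(params)
    jac = [[0.0] * n for _ in range(m)]
    base = None if centered else residuals(model, params, xs, ys)
    for j in range(n):
        # Scale the step with the parameter so large and small values both work.
        h = rel_step * max(1.0, abs(params[j]))
        plus = list(params)
        plus[j] += h
        r_plus = residuals(model, plus, xs, ys)
        if centered:
            minus = list(params)
            minus[j] -= h
            r_minus = residuals(model, minus, xs, ys)
            column = [(rp - rm) / (2.0 * h) for rp, rm in zip(r_plus, r_minus)]
        else:
            column = [(rp - r0) / h for rp, r0 in zip(r_plus, base)]
        for i in range(m):
            jac[i][j] = column[i]
    return jac


def exp_model(x, params):
    """Example user-supplied model (made up for this sketch): a * exp(b * x)."""
    a, b = params
    return a * math.exp(b * x)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [2.0, 1.2, 0.7, 0.4]  # made-up observations
    for row in numerical_jacobian(exp_model, [2.0, -0.5], xs, ys):
        print(row)
```

Each Jacobian column depends only on one perturbed parameter, so the columns are independent of each other, and within a column each row depends only on one data point. That makes both loops candidates for parallel work-items if the model function can be compiled into the kernel.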