Consider using Nelder-Mead (NM) instead of nonlinear conjugate gradient (NLCG) or Levenberg-Marquardt
(LMA). I'm not sure whether it is in any existing library, but it would be simple to implement in
OpenCL. NM is flexible: it is derivative-free and needs only objective function evaluations, so it
sounds like a good fit for your problem statement. I've used a C++ implementation I wrote for similar
work for many years.
I think there's a partial implementation of NM in Rob Farber's CUDA book, but it's probably quicker to
write a new routine from scratch than to try to translate his code from CUDA to OpenCL.