I stumbled upon an optimisation that I don't quite understand, and was hoping somebody could shed some light on it.
I have two OpenCL kernels, model_setup and model_run. The model_run kernel runs a Monte Carlo-style simulation over a large number of different parameter sets. One of the selling points of OpenCL is that you can hand it a huge amount of work and it will figure out how to schedule it efficiently (if I'm not mistaken?). However, I found a sizable performance increase (around 11%) by enqueueing the kernels several times with smaller NDRanges, as opposed to enqueueing them once over the full data set. The work-group size was the same in both cases.