Hi,
I stumbled upon an optimisation that I don't quite understand, and was hoping somebody could shed some light on it.
I have two OpenCL kernels, model_setup and model_run. The model_run kernel does a Monte Carlo-style simulation over a large number of different parameter sets. One of the selling points of OpenCL is that you can hand it a huge amount of work and it will figure out how to schedule it efficiently (if I'm not mistaken?). However, I stumbled upon a sizeable performance increase (around 11%) by enqueueing several executions of the kernels, each with a smaller NDRange, instead of one execution covering all the data. The work-group size was not changed.
Any ideas?
Dale
I don't know what card you are using or how much memory bandwidth your algorithm requires, but generally, when you lower the number of threads in flight (i.e. enqueue several smaller NDRanges instead of one big one), two things happen:

1. The cache hit ratio goes up, because fewer work-items compete for the same cache — a speedup.
2. Latency hiding gets worse, because the scheduler has fewer wavefronts to switch to while others wait on memory — a slowdown.

I guess you gain more from the first than you lose from the second...
Aah yes, I see the cache hit rate is higher on the multiple kernel runs. Thanks for the quick and helpful response!