I don't know which card you are using or how much memory bandwidth your algorithm needs, but in general, when you raise the number of threads, two things happen:
- More compute units become active (if not all of them are active already) -> speedup
- More memory is accessed at any given time, so L2 caching becomes less effective -> slowdown
I guess you gain more from the first than you lose from the second...
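To make the trade-off a bit more concrete, here is a toy Python model of the two effects. Every number in it is an assumption I made up for illustration (32 compute units, 64 threads per group, a 4 MiB L2, 256 bytes touched per thread, a miss costing 4x a hit); it is not a measurement of any real card, and the function names are hypothetical.

```python
# Toy model only: all hardware numbers below are illustrative assumptions.

def active_compute_units(num_threads, num_cus=32, threads_per_group=64):
    """Compute units that receive at least one thread group (one group per CU first)."""
    groups = -(-num_threads // threads_per_group)  # ceiling division
    return min(num_cus, groups)

def l2_hit_rate(num_threads, bytes_per_thread=256, l2_bytes=4 * 1024 * 1024):
    """Crude hit rate: 1.0 while the working set fits in L2, then it shrinks."""
    working_set = num_threads * bytes_per_thread
    return min(1.0, l2_bytes / working_set)

def relative_throughput(num_threads, miss_penalty=4.0):
    """Active CUs divided by the average cost of a memory access (hit = 1.0)."""
    hit = l2_hit_rate(num_threads)
    avg_access_cost = hit * 1.0 + (1.0 - hit) * miss_penalty
    return active_compute_units(num_threads) / avg_access_cost

if __name__ == "__main__":
    for n in (64, 2048, 65536):
        print(n, active_compute_units(n), l2_hit_rate(n), relative_throughput(n))
```

In this sketch, throughput climbs while more compute units fill up and only drops once the working set outgrows the assumed L2. Whether your kernel sits on the rising or the falling side depends on the real card and access pattern, so profile it rather than trusting the model.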
Ah yes, I see the cache hit rate is higher on the multiple kernel runs. Thanks for the quick and helpful response!