My guess is that calling the kernel only once will be much faster. I have noticed quite a bit of overhead associated with each EnqueueNDRangeKernel call, so if you can run all 100 iterations inside a single kernel call instead of calling the kernel 100 times, you pay that overhead once rather than 100 times and should see a decent speedup.
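To make the amortization argument concrete, here is a rough cost model in C. It is only a sketch: the function names and the idea of a fixed per-launch overhead are my own assumptions for illustration, not measurements of any real device. It also assumes the iterations can actually be moved into the kernel, which is only true if they do not need synchronization across work-groups between iterations.

```c
#include <assert.h>

/* Hypothetical cost model, not a measurement: every kernel enqueue is
   assumed to cost a fixed launch overhead plus the work it performs. */

/* Estimated total time when the host enqueues the kernel once per iteration. */
double time_many_launches(int iterations, double launch_overhead_us, double work_us)
{
    assert(iterations >= 0);
    return iterations * (launch_overhead_us + work_us);
}

/* Estimated total time when a single launch loops over all iterations
   inside the kernel, so the launch overhead is paid only once. */
double time_single_launch(int iterations, double launch_overhead_us, double work_us)
{
    assert(iterations >= 0);
    return launch_overhead_us + iterations * work_us;
}
```

With made-up numbers of 20 microseconds of overhead per launch and 5 microseconds of work per iteration, 100 iterations come to 2500 microseconds for 100 launches versus 520 microseconds for a single launch. The real ratio depends on your device and driver, so time both versions before committing to either.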