Which is faster for multiplying 100 matrices? One option is to loop 99 times over a kernel that multiplies two matrices. The other is to execute the kernel only once and have it handle all 100 matrices.
My guess is that it is much faster to call the kernel only once. I have noticed quite a bit of overhead associated with each EnqueueNDRangeKernel call, so if you can pay that overhead once by doing all of the multiplications in a single kernel call, as opposed to calling the kernel 99 times, you should see a decent speedup.