I'm queuing kernels that modify a buffer over and over again and am wondering if there's a more efficient way to do what I'm doing.
    for (int q = 0; q < iterations; q++) {
        // kernelA and kernelB are both re-enqueued on every iteration
        clEnqueueNDRangeKernel(cq, kernelA, args...);
        clEnqueueNDRangeKernel(cq, kernelB, args...);
    }
In my case, kernelB must wait for kernelA. However, none of their arguments change, and no communication with the host needs to happen until the for loop completes. The problem is that for my data set the loop needs to iterate thousands of times, and clEnqueueNDRangeKernel seems to have a non-trivial cost when called that often, so a lot of time is spent just queuing the kernels. It seems like it would be easier to somehow tell OpenCL to re-run them N times. Is that possible?
My global_work_size is in the millions and B must wait for A, so I don't think it's possible to do something clever like putting the iteration loop inside the kernel.
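To illustrate why I think moving the loop into the kernel is out, here is a minimal sketch in plain C (not OpenCL) that models work-items as loop iterations. It assumes kernelB reads elements that other work-items wrote in kernelA. The kernel bodies, the names `kernel_a`, `kernel_b`, `run_separate`, `run_fused`, and `results_differ`, and the tiny sizes are all made up for the example and are not my real kernels.

```c
#include <string.h>

#define N 4           /* stand-in for a global_work_size in the millions */
#define ITERATIONS 2  /* stand-in for thousands of iterations */

/* Hypothetical kernelA: each work-item touches only its own element. */
static void kernel_a(int gid, const int *in, int *out) {
    out[gid] = in[gid] + 1;
}

/* Hypothetical kernelB: each work-item also reads its neighbour's
 * element, which a different work-item produced in kernelA. */
static void kernel_b(int gid, const int *in, int *out) {
    out[gid] = in[gid] + in[(gid + 1) % N];
}

/* Separate launches: every work-item finishes A before any starts B. */
static void run_separate(int *buf) {
    int tmp[N];
    for (int q = 0; q < ITERATIONS; q++) {
        for (int gid = 0; gid < N; gid++) kernel_a(gid, buf, tmp);
        for (int gid = 0; gid < N; gid++) kernel_b(gid, tmp, buf);
    }
}

/* Loop inside the kernel: each work-item runs all iterations on its own,
 * with no global synchronization between A and B. Work-items are run
 * one after another here, which is only one possible schedule. */
static void run_fused(int *buf) {
    int tmp[N];
    memcpy(tmp, buf, sizeof tmp);
    for (int gid = 0; gid < N; gid++) {
        for (int q = 0; q < ITERATIONS; q++) {
            kernel_a(gid, buf, tmp);
            kernel_b(gid, tmp, buf); /* neighbour's tmp value may be stale */
        }
    }
}

/* Returns 1 if the two strategies disagree on the same input. */
static int results_differ(void) {
    int a[N] = {0, 1, 2, 3};
    int b[N] = {0, 1, 2, 3};
    run_separate(a);
    run_fused(b);
    return memcmp(a, b, sizeof a) != 0;
}
```

With these stand-in kernels the two versions disagree, because in the fused version a work-item reads a neighbour's value before that neighbour has run its kernelA step. As far as I know a barrier inside a kernel only synchronizes one work-group, which is why the size of my global_work_size matters here.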