I have a kernel that operates on a 2D matrix, using each entry's nearest neighbors to compute its updated value. The matrix is far too large for a single work-group. I would like to perform this update (think, for example, Conway's Game of Life) many times on each call to the kernel. The barrier function only synchronizes within a work-group, so I can't synchronize the update of the whole matrix that way. As far as I can tell, the only thing I can do is perform a single update in each call to the kernel, wait for it to complete, then enqueue the kernel again. But this is very costly. I did a simple test comparing looping inside the kernel (ignoring the synchronization problem) with looping over the enqueue call. Enqueuing took about 10x longer. There's got to be a better way.

What takes the enqueuing operation so long? If I change my matrix size, the run times all change approximately proportionally to matrix size, which means that the kernel enqueuing operation is sensitive to buffer size and I don't see why. The matrices (buffers) are passed as pointers, so they should only be written to the device when I do a clEnqueueReadBuffer or clEnqueueWriteBuffer. Where is all the overhead coming from?

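Try enqueuing the kernel ten times in a row and calling clFinish() only once, after the last enqueue. On an in-order command queue (the default), each enqueued kernel starts only after the previous one has completed, so the host does not need to wait between updates.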
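For the neighbor update to stay correct across back-to-back launches, each generation has to read one buffer and write another, with the two swapped between launches. Below is a minimal CPU-only sketch of that ping-pong pattern in plain C. It is not OpenCL code: life_step and run_generations are hypothetical names, life_step stands in for one kernel launch, and the comments mark where clSetKernelArg, clEnqueueNDRangeKernel and the single clFinish would go in a real host loop. Cells outside the grid are assumed dead.

```c
#include <assert.h>

/* Hypothetical stand-in for ONE kernel launch: computes one Game of Life
   generation, reading only from src and writing only to dst.
   The grid is w x h, row-major; cells outside the grid count as dead. */
static void life_step(const unsigned char *src, unsigned char *dst, int w, int h)
{
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int n = 0; /* number of live neighbors */
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int yy = y + dy, xx = x + dx;
                    if ((dx || dy) && yy >= 0 && yy < h && xx >= 0 && xx < w)
                        n += src[yy * w + xx];
                }
            }
            unsigned char alive = src[y * w + x];
            dst[y * w + x] = (unsigned char)(n == 3 || (alive && n == 2));
        }
    }
}

/* Runs `generations` updates, swapping the roles of a and b after each one.
   Returns whichever buffer holds the final state. */
static unsigned char *run_generations(unsigned char *a, unsigned char *b,
                                      int w, int h, int generations)
{
    assert(a != b && w > 0 && h > 0 && generations >= 0);
    unsigned char *src = a, *dst = b;
    for (int g = 0; g < generations; ++g) {
        /* OpenCL equivalent: clSetKernelArg for src and dst, then
           clEnqueueNDRangeKernel, with no wait inside the loop. */
        life_step(src, dst, w, h);
        unsigned char *tmp = src; /* swap: this output is the next input */
        src = dst;
        dst = tmp;
    }
    /* OpenCL equivalent: one clFinish (or a blocking read) after the loop. */
    return src;
}
```

Calling run_generations(a, b, w, h, 10) corresponds to ten enqueues followed by one wait. With an even generation count the result ends up back in a, with an odd count in b, which is why the function returns the pointer instead of assuming one or the other.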