I have a kernel that operates on a 2D matrix, doing ops on nearest neighbors to compute an updated value for each matrix entry. The matrix is far too large for a single work-group. I would like to perform this update of the matrix (think, for example, Conway's Game of Life) many times on each call to the kernel. The barrier function only operates within a work-group, so I can't synchronize my update of the matrix that way. As far as I can tell, the only thing I can do is a single update of the matrix in each call to the kernel, wait for it to complete, then enqueue the kernel again. But this is very costly. I did a simple test comparing looping inside the kernel (ignoring the sync problem) against looping over the enqueue call, and the enqueue loop takes about 10X longer. There's got to be a better way.
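To make the constraint concrete, here's a minimal plain-C version of the kind of per-step update I mean (a Game of Life step with toroidal wrap; the names and grid size are illustrative, not my actual kernel). The point is that every cell reads only the previous generation, so all work-items must finish generation n before any of them can start generation n+1 — which is exactly the global barrier I don't have:

```c
/* One Game of Life step on a W x H toroidal grid.  src is read-only
   and dst is write-only: this is the constraint a single kernel
   launch satisfies, and why a global barrier is needed between
   steps if you try to loop inside the kernel. */
#define W 5
#define H 5

static int alive(const unsigned char *g, int x, int y) {
    /* toroidal wrap on both axes */
    return g[((y + H) % H) * W + ((x + W) % W)];
}

void life_step(const unsigned char *src, unsigned char *dst) {
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++) {
            int n = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    if (dx || dy)
                        n += alive(src, x + dx, y + dy);
            /* standard Life rule: born with 3, survives with 2 or 3 */
            dst[y * W + x] = (n == 3) || (n == 2 && src[y * W + x]);
        }
}
```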
What takes the enqueuing operation so long? If I change my matrix size, the run times all change roughly in proportion to matrix size, which means the kernel-enqueue operation is sensitive to buffer size, and I don't see why. The matrices (buffers) are passed as pointers, so they should only be transferred to or from the device when I call clEnqueueWriteBuffer or clEnqueueReadBuffer. Where is all the overhead coming from?
Try enqueuing the kernel ten times and after that call clFinish().
Originally posted by: nou Try enqueuing the kernel ten times and after that call clFinish().
Does your approach ensure global memory consistency between each kernel enqueue call?
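As far as I can tell, it could be made consistent by combining two things: an in-order command queue (kernel i+1 won't start executing until kernel i's writes to global memory are complete) and two buffers ping-ponged by swapping the kernel's src/dst arguments between enqueues. A sketch of the buffer scheduling — the OpenCL calls are shown as comments, and all names here are hypothetical:

```c
/* At step i, the kernel reads bufs[src_of(i)] and writes
   bufs[dst_of(i)], so consecutive enqueues alternate the two
   buffers and never read a half-updated generation. */
int src_of(int step) { return step % 2; }
int dst_of(int step) { return (step + 1) % 2; }

/* Hypothetical host loop (OpenCL calls as comments):

   for (int i = 0; i < nsteps; i++) {
       // clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufs[src_of(i)]);
       // clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufs[dst_of(i)]);
       // clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL,
       //                        0, NULL, NULL);
   }
   // clFinish(queue);   // one wait, after all the steps
*/
```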
Originally posted by: nou Try enqueuing the kernel ten times and after that call clFinish().
OK, I just tried that. It makes no difference time-wise. I did not attempt to verify output correctness. The enqueuing occurs very quickly, but it takes just as long to finish executing the calls in the queue. What is going on in passing the kernel to the device?
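One way to find out is event profiling: create the queue with CL_QUEUE_PROFILING_ENABLE, attach an event to the enqueue, and after clFinish() query CL_PROFILING_COMMAND_QUEUED / _SUBMIT / _START / _END with clGetEventProfilingInfo. That separates queueing/submission overhead from actual kernel execution. A sketch — the OpenCL calls are shown as comments, and only the unit-conversion helper is real code:

```c
/* Profiling timestamps come back as cl_ulong nanoseconds; this
   converts a start/end pair into milliseconds. */
double elapsed_ms(unsigned long long start_ns, unsigned long long end_ns) {
    return (double)(end_ns - start_ns) / 1.0e6;
}

/* Hypothetical usage (OpenCL calls as comments):

   // cl_event ev;
   // clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, NULL,
   //                        0, NULL, &ev);
   // clFinish(queue);
   // cl_ulong t_start, t_end;
   // clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
   //                         sizeof t_start, &t_start, NULL);
   // clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
   //                         sizeof t_end, &t_end, NULL);
   // printf("kernel body: %.3f ms\n", elapsed_ms(t_start, t_end));
*/
```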
For what it's worth, here's roughly what my program looks like:
allocate and initialize matrices on host
find device, create context and queue, compile kernel
create buffers for each matrix with clCreateBuffer and CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR
set kernel args
clEnqueueNDRangeKernel
I also just tested creating the buffers with CL_MEM_USE_HOST_PTR instead of CL_MEM_COPY_HOST_PTR and that makes no difference in run time either.
One last thought: I'm executing this kernel on a dual-core Intel CPU as the device. I'm not on my box with the FireStream card and don't have it handy to compare with.
Volkov has implemented a synchronization mechanism using atomic operations (http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf).
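The basic idea looks something like the following device-code sketch (OpenCL C, not from the paper — the names are illustrative). Big caveat: an in-kernel global barrier like this only works if every work-group is simultaneously resident on the device; otherwise the spinning groups starve the ones that haven't launched yet and it deadlocks. On OpenCL 1.0 devices it also needs the global int32 base atomics extension.

```c
/* Sketch of an atomic-counter global barrier (OpenCL C device code).
   NOT safe unless all work-groups are resident at once. */
void global_barrier(__global volatile int *counter,
                    int num_groups, int step)
{
    barrier(CLK_GLOBAL_MEM_FENCE);       /* flush this group's writes */
    if (get_local_id(0) == 0 && get_local_id(1) == 0) {
        atomic_inc(counter);             /* this group has arrived */
        /* cumulative count: after step s the counter reaches
           num_groups * (s + 1), so no reset (and no reset race) */
        while (*counter < num_groups * (step + 1))
            ;                            /* spin until all groups arrive */
    }
    barrier(CLK_GLOBAL_MEM_FENCE);       /* release the rest of the group */
}
```

Called once per iteration of an in-kernel loop, this would let a single launch run many generations, but given the residency restriction the ten-enqueues-then-clFinish() approach is the portable one.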