Archives Discussions

drstrip
Journeyman III

synchronizing across (not within) work groups

I have a kernel that operates on a 2D matrix, doing ops on nearest neighbors to compute an updated value for each matrix entry. The matrix is far too large for a single workgroup. I would like to perform this update of the matrix (think, for example, Conway's Game of Life) numerous times on each call to the kernel. The barrier function only operates within a work group, so I can't synchronize my update of the matrix that way. As far as I can tell, the only thing I can do is a single update of the matrix in each call to the kernel, wait for it to complete, then enqueue the kernel again. But this is very costly: in a simple test comparing looping inside the kernel (ignoring the synch problem) against looping over the enqueue call, the enqueuing version took 10X longer. There's got to be a better way.
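To make the synchronization requirement concrete, here is a minimal double-buffered update in plain Python (no OpenCL; the grid size, `step` helper, and blinker pattern are illustrative choices, not anything from the original post). The swap from `src` to `dst` between generations is exactly the global barrier point that OpenCL only provides between kernel launches:

```python
# Minimal double-buffered Game of Life step (plain Python sketch).
# No cell of generation n+1 may be computed from a partially written
# generation n, which is why each full pass needs a global sync point.

def step(src):
    """Compute one full generation from src into a fresh grid (toroidal wrap)."""
    h, w = len(src), len(src[0])
    dst = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            n = sum(src[(y + dy) % h][(x + dx) % w]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
            dst[y][x] = 1 if n == 3 or (src[y][x] and n == 2) else 0
    return dst

# A horizontal blinker oscillates with period 2.
grid = [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [0, 1, 1, 1, 0],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]
once = step(grid)    # vertical blinker
twice = step(once)   # back to the horizontal start
```

Each call to `step` corresponds to one kernel launch in the scheme described above; the cost being complained about is the host round-trip between those calls.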

 

Why does the enqueuing operation take so long? If I change my matrix size, the run times all change approximately proportionally to matrix size, which means the kernel enqueuing operation is sensitive to buffer size, and I don't see why. The matrices (buffers) are passed as pointers, so they should only be written to the device when I do a clEnqueueReadBuffer or clEnqueueWriteBuffer. Where is all the overhead coming from?

 

5 Replies
nou
Exemplar

Try enqueuing the kernel ten times and then calling clFinish().
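On an in-order command queue (the default), enqueued commands execute one after another, so each kernel sees the global-memory writes of the previous one; batching only removes the host round-trip per launch. A toy model of that in-order semantics, in plain Python (the `InOrderQueue` class and its `enqueue`/`finish` methods are stand-ins, not the real OpenCL API):

```python
# Toy model of an in-order command queue (plain Python; not OpenCL).
# "Kernels" are closures mutating a shared buffer; finish() drains them
# in submission order, so kernel k+1 observes everything kernel k wrote.

class InOrderQueue:
    def __init__(self):
        self.pending = []

    def enqueue(self, kernel, *args):
        # Returns immediately, like clEnqueueNDRangeKernel.
        self.pending.append((kernel, args))

    def finish(self):
        # Blocks until all queued work is done, like clFinish.
        for kernel, args in self.pending:
            kernel(*args)
        self.pending.clear()

def increment_all(buf):
    for i in range(len(buf)):
        buf[i] += 1

buf = [0, 0, 0]
q = InOrderQueue()
for _ in range(10):              # enqueue the kernel ten times...
    q.enqueue(increment_all, buf)
q.finish()                       # ...then wait once for all of them
```

This only models ordering on the host; on a real device an in-order queue additionally guarantees that each kernel's global-memory writes are visible to the next kernel in the queue.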

Fr4nz
Journeyman III

Originally posted by nou: Try enqueuing the kernel ten times and then calling clFinish().


Does your approach ensure global memory consistency between each kernel enqueue call?


Originally posted by nou: Try enqueuing the kernel ten times and then calling clFinish().

OK, I just tried that. It makes no difference time-wise (I did not attempt to verify output correctness). The enqueuing itself happens very quickly, but it takes just as long for the queued calls to finish executing. What is going on when the kernel is handed to the device?

For what it's worth, here's roughly what my program looks like:

    allocate and initialize matrices on host

    find device, create context and queue, compile kernel

    create buffers for each matrix with clCreateBuffer and CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR

    set kernel args

    clEnqueueNDRangeKernel

 

I also just tested creating the buffers with CL_MEM_USE_HOST_PTR instead of CL_MEM_COPY_HOST_PTR and that makes no difference in run time either.

 

One last thought - I'm executing this kernel on a dual-core Intel CPU as the device. I'm not on my box with the FireStream card and don't have it handy to compare with.

 



Originally posted by drstrip: One last thought - I'm executing this kernel on a dual-core Intel CPU as the device. I'm not on my box with the FireStream card and don't have it handy to compare with.
So you could use a CPU-based profiler like CodeAnalyst or VTune to see exactly where the bottleneck lies (at least which call into which DLL).
gaurav_garg
Adept I

Originally posted by drstrip: I have a kernel that operates on a 2D matrix, doing ops on nearest neighbors to compute an updated value for each matrix entry. The matrix is far too large for a single workgroup. I would like to perform this update of the matrix (think, for example, Conway's Game of Life) numerous times on each call to the kernel. The barrier function only operates within a work group, so I can't synchronize my update of the matrix that way. As far as I can tell, the only thing I can do is a single update of the matrix in each call to the kernel, wait for it to complete, then enqueue the kernel again.


Volkov has implemented a global synchronization mechanism using atomic operations (http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf).
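The idea behind such an atomic-counter barrier can be sketched on the CPU with Python threads. This is a hedged analogue only, not Volkov's actual GPU code: the `CounterBarrier` class below is illustrative, a lock stands in for the atomic fetch-and-add, and on a GPU this scheme deadlocks unless every work-group is resident on the device simultaneously, which is the key caveat of atomic-based global barriers.

```python
import threading

# Sense-reversing barrier built from one shared counter, mimicking the
# atomic-increment global barrier idea. CPU analogue only: on a GPU this
# spins forever unless all work-groups are scheduled at once.

class CounterBarrier:
    def __init__(self, n):
        self.n = n
        self.count = 0
        self.sense = False
        self.lock = threading.Lock()   # stands in for an atomic fetch-add

    def wait(self):
        with self.lock:
            self.count += 1
            arrived, local_sense = self.count, not self.sense
        if arrived == self.n:          # the last arrival flips the sense...
            self.count = 0
            self.sense = local_sense
        else:                          # ...everyone else spins on it
            while self.sense != local_sense:
                pass

N = 4
results = []                            # list.append is atomic in CPython
barrier = CounterBarrier(N)

def worker(i):
    results.append(("before", i))
    barrier.wait()                      # global sync point across all threads
    results.append(("after", i))

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
```

Every "before" entry lands in `results` ahead of every "after" entry, since no thread passes the barrier until all have arrived.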
