5 Replies Latest reply on Mar 26, 2010 11:18 AM by gaurav.garg

    synchronizing across (not within) work groups

    drstrip

      I have a kernel that operates on a 2D matrix, doing ops on nearest neighbors to compute an updated value for each matrix entry. The matrix is far too large for a single work group. I would like to perform this update of the matrix (think, for example, Conway's Game of Life) numerous times on each call to the kernel. The barrier function only operates within a work group, so I can't synchronize my update of the matrix that way. As far as I can tell, the only thing I can do is a single update of the matrix in each call to the kernel, wait for it to complete, then enqueue the kernel again. But this is very costly. I did a simple test comparing looping inside the kernel (ignoring the sync problem) against looping over the enqueue call: the enqueuing version takes about 10X longer. There's got to be a better way.

       

      Why does the enqueuing operation take so long? If I change my matrix size, the run times all change approximately proportionally to matrix size, which means the kernel enqueuing operation is sensitive to buffer size, and I don't see why. The matrices (buffers) are passed as pointers, so data should only move between host and device when I call clEnqueueReadBuffer or clEnqueueWriteBuffer. Where is all the overhead coming from?

       

        • synchronizing across (not within) work groups
          nou

          Try enqueuing the kernel ten times and then call clFinish().

            • synchronizing across (not within) work groups
              Fr4nz

               

              Originally posted by nou: Try enqueuing the kernel ten times and then call clFinish().


              Does your approach ensure global memory consistency between each kernel enqueue call?

              • synchronizing across (not within) work groups
                drstrip

                 

                Originally posted by nou: Try enqueuing the kernel ten times and then call clFinish().

                 

                 

                OK, I just tried that. It makes no difference time-wise. I did not attempt to verify output correctness. The enqueuing occurs very quickly, but it takes just as long to finish executing the calls in the queue. What is going on in passing the kernel to the device?

                For what it's worth, here's roughly what my program looks like:

                    allocate and initialize matrices on host

                    find device, create context and queue, compile kernel

                    create buffers for each matrix with clCreateBuffer and CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR

                    set kernel args

                    clEnqueueNDRangeKernel

                 

                I also just tested creating the buffers with CL_MEM_USE_HOST_PTR instead of CL_MEM_COPY_HOST_PTR and that makes no difference in run time either.

                 

                One last thought: I'm executing this kernel on a dual-core Intel CPU as the device. I'm not on my box with the FireStream card and don't have it handy to compare with.

                 

              • synchronizing across (not within) work groups
                gaurav.garg

                 

                I have a kernel that operates on a 2D matrix, doing ops on nearest neighbors to compute an  updated value for each matrix entry. The matrix is far too large for a single workgroup. I would like to perform this update of the matrix (think, for example, Conway's Game of Life) numerous times on each call to the kernel. The barrier function only operates within a work group, so I can't synchronize my update of the matrix that way. As far as I can tell, the only thing I can do is a single update of the matrix in each call to the kernel, wait for it to complete, then enqueue the kernel again.


                Volkov has implemented a synchronization mechanism (http://mc.stanford.edu/cgi-bin/images/6/65/SC08_Volkov_GPU.pdf) using atomic operations.