11 Replies Latest reply on Apr 11, 2012 12:52 PM by rick.weber

    GPU_ASYNC_MEM_COPY=2 doesn't seem to work


      I know this is a preview feature, but I can't seem to interleave compute and transfers on my Radeon 7970s. I'm using SDK 2.6 and the 8.921 Linux 64-bit drivers on Ubuntu.


      I do the following:

      1) Create two command queues

      2) Create 6 buffers

      3) set to queue 0

      4) enqueue transfer transfer execute

      5) set to queue 1

      6) enqueue transfer transfer execute

      7) keep doing 3-6 a few times

      8) reduce outputs from queues 0 and 1

      9) transfer reduced output back to host

      10) flush both queues


      I'm allocating the buffers with CL_MEM_READ_WRITE (no other flags) and calling clEnqueueWriteBuffer() with CL_FALSE for blocking_write to do the transfers. I daisy-chain events for safety reasons (long story short, the library does this), but because each queue only touches 3 of the 6 buffers, there are no dependencies between the queues (other than the reduction).


      When I plot the profiling data on a timeline, it's immediately clear the driver isn't overlapping transfers and execution: queue 0 does some transfers and executions, then stalls while queue 1 does some, and so on back and forth.


      Is there anything I'm missing with this feature, like needing to use pinned memory and clEnqueueMapBuffer() and such? I want this to work in a general-purpose way, and I really don't want to resort to clEnqueueMapBuffer() and pinned memory unless I absolutely have to, since I'd need to break transfers into pieces, stream them through pinned memory, and watch events to know when to start the next memcpy. Furthermore, I assumed the runtime already did this anyway.

        • Re: GPU_ASYNC_MEM_COPY=2 doesn't seem to work

          Would it be possible to get the exact requirements for using this feature? Everything I've read is fairly vague and I've tried what I found in other threads.

          • Re: GPU_ASYNC_MEM_COPY=2 doesn't seem to work

            Hi Rick,


            Will you please post your code here for review? 




              • Re: GPU_ASYNC_MEM_COPY=2 doesn't seem to work

                Here is the salient part of the code that issues tasks into queues. This is all built on the clutil abstraction, so you won't see any actual OpenCL calls here, and I don't want to post the source for the entire library. I think it's more useful to describe what my library's functions do on the backend. If you disagree and want the whole thing, PM me and I can point you to the Google Code repository.


                Basically, this example is matrix multiplication that automatically uses multiple OpenCL devices. The ParallelFor loop issues the user-defined lambda whenever a device isn't doing any work. The runtime manages this behind the scenes by enqueueing a marker and polling its event for completion. As you can imagine, this requires everything in the lambda to be nonblocking and merely issue tasks to the device.


                Each device has two command queues (more can be configured, but that's beside the point), and the user controls which queue subsequent commands go into by calling curDevice.setCommandQueue(number).


                Buffer objects maintain the cl_mem handle to the underlying data they allocate upon construction, as well as the size. When constructed, these buffers allocate memory using CL_MEM_READ_WRITE as the only flag. When copy constructed they call clRetainMemObject(), and when destructed they call clReleaseMemObject(). They also maintain a cl_event for the last task that referenced the object (be it a kernel or a data transfer). Buffer.get() and Buffer.put() wrap clEnqueueReadBuffer() and clEnqueueWriteBuffer() respectively, forwarding the pointer the user sends. They pass the event from the last referencing task as a dependency, release that event, then store the event output from the clEnqueue*Buffer() call as the new last task. No calls to clEnqueue*Buffer() are blocking.


                clUtilEnqueueKernel() is a bit of variadic template black magic where you pass the name of the kernel you want as argument 1 and an Rvalue reference to a variadic "grid" object defining global/local work group arrangements as argument 2. Arguments 3 through N are literally passed along to clSetKernelArg() UNLESS they happen to be Buffer objects, in which case they pass their underlying cl_mem handle instead. Additionally, their last referencing cl_event is appended to the event wait list. We then call clEnqueueNDRangeKernel(), passing the wait list of all the events from all the Buffer objects. We then take the output event and update every Buffer's last reference.


                sgemm is a BLAS call that essentially just calls clUtilEnqueueKernel().


                In essence, this code alternates between two command queues, each with its own associated buffers, and calls matrix multiply on blocks of the matrix. Each task is only dependent on tasks previously issued to the same queue, since we double buffer. After all our sgemms finish, we add the two buffers, introducing the only inter-queue dependencies at the end. However, we still don't see overlap if we comment this out.


                We preallocate two a, b, and c buffers for each device and call a kernel that zeros them, then call clFinish() (not displayed here) to force lazy allocation to quit being lazy and just do it.


                We are seeing correct results and are indeed using all 3 Radeon 7970s in the machine, as evidenced by our profiling trace, but our performance scalability is bad because the sgemm kernel is just too fast and transfers become the bottleneck. We need to overlap transfer and computation to even have a prayer of improving scalability (and even then I remain skeptical).


                #include "Matrix.h"

                bool _IMatrix::RuntimeInitialized = false;

                std::unique_ptr<std::shared_ptr<clUtil::Buffer>[]> _IMatrix::aDevice;
                std::unique_ptr<std::shared_ptr<clUtil::Buffer>[]> _IMatrix::bDevice;
                std::unique_ptr<std::shared_ptr<clUtil::Buffer>[]> _IMatrix::cDevice;

                void multiply(BlockedMatrix<float>& c,
                              const BlockedMatrix<float>& a,
                              const BlockedMatrix<float>& b)
                {
                  using namespace clUtil;
                  using namespace std;

                  if((a.mRowsPerBlock != b.mRowsPerBlock) ||
                     (b.mRowsPerBlock != c.mRowsPerBlock))
                  {
                    throw BlockedMatrixBlockMismatch();
                  }

                  if((a.mColsPerBlock != b.mColsPerBlock) ||
                     (b.mColsPerBlock != c.mColsPerBlock))
                  {
                    throw BlockedMatrixBlockMismatch();
                  }

                  if((a.mRows != c.mRows) || (b.mCols != c.mCols) || (a.mCols != b.mRows))
                  {
                    throw MatrixBadDimensions();
                  }

                  char transA = 'N';
                  char transB = 'N';

                  //Use double buffering to overlap transfer and compute
                  unsigned int k = a.getCols();

                  ParallelFor(0, 1, c.numRowBlocks() * c.numColBlocks() - 1,
                  [&](size_t startIdx, size_t endIdx)
                  {
                    size_t curDeviceNum = Device::GetCurrentDeviceNum();

                    shared_ptr<Buffer> curA = _IMatrix::aDevice[2 * curDeviceNum];
                    shared_ptr<Buffer> altA = _IMatrix::aDevice[2 * curDeviceNum + 1];
                    shared_ptr<Buffer> curB = _IMatrix::bDevice[2 * curDeviceNum];
                    shared_ptr<Buffer> altB = _IMatrix::bDevice[2 * curDeviceNum + 1];
                    shared_ptr<Buffer> curC = _IMatrix::cDevice[2 * curDeviceNum];
                    shared_ptr<Buffer> altC = _IMatrix::cDevice[2 * curDeviceNum + 1];

                    Device& curDevice = Device::GetCurrentDevice();

                    size_t curQueue = 0;

                    for(size_t curBlockID = startIdx; curBlockID <= endIdx; curBlockID++)
                    {
                      unsigned int blockRow = curBlockID % c.numRowBlocks();
                      unsigned int blockCol = curBlockID / c.numRowBlocks();
                      unsigned int blockSize = c.mRowsPerBlock * c.mColsPerBlock;

                      //Zero our output buffers
                      //[The clUtilEnqueueKernel() calls that zero curC and altC
                      // were eaten by the forum formatting; only their grid
                      // arguments survived: clUtilGrid(blockSize, 64)]

                      //Go back and forth between the two queues, multiplying blocks of
                      //A and B. This should give us transfer/kernel interleaving
                      for(unsigned int curK = 0; curK < k / c.mRowsPerBlock; curK++)
                      {
                        curA->put(a.getBlock(blockRow, curK), sizeof(float) * blockSize);
                        curB->put(b.getBlock(curK, blockCol), sizeof(float) * blockSize);

                        //[The sgemm() call (and presumably a
                        // curDevice.setCommandQueue(curQueue) call) were also
                        // eaten by the forum formatting]

                        //Swap the buffers
                        shared_ptr<Buffer> tmp;

                        tmp = curA;
                        curA = altA;
                        altA = tmp;

                        tmp = curB;
                        curB = altB;
                        altB = tmp;

                        tmp = curC;
                        curC = altC;
                        altC = tmp;

                        //Change the queue the next batch of tasks will go into
                        curQueue = curQueue == 0 ? 1 : 0;
                      }

                      //Add the two c buffers
                      //[Another clUtilEnqueueKernel() call lost to the forum
                      // formatting; its grid argument was clUtilGrid(blockSize, 64)]

                      //Finally, write c out to its correct place
                      curC->get(c.getBlock(curBlockID), sizeof(float) * blockSize);
                    }
                  });
                }