

rick_weber
Adept II

GPU_ASYNC_MEM_COPY=2 doesn't seem to work

I know this is a preview feature, but I can't seem to interleave compute and transfers on my Radeon 7970s. I'm using SDK 2.6 and the 8.921 Linux 64-bit drivers on Ubuntu.

I do the following:

1) Create two command queues

2) Create 6 buffers

3) set to queue 0

4) enqueue transfer, transfer, execute

5) set to queue 1

6) enqueue transfer, transfer, execute

7) keep doing 3-6 a few times

8) reduce outputs from queues 0 and 1

9) transfer reduced output back to host

10) flush both queues

I'm allocating the buffers with CL_MEM_READ_WRITE (no other flags) and calling clEnqueueWriteBuffer() with CL_FALSE for blocking to do the transfers. I daisy chain events for safety reasons (long story short, the library does this), but because each queue only touches 3 of the 6 buffers, there are no dependencies between them (other than the reduction).

When I plot the profiling data on a timeline, it's immediately clear the driver isn't overlapping computation and data transfer: queue 0 does some transfers and kernel executions, then stalls while queue 1 does the same, and so on back and forth.

Is there anything I'm missing with this feature, like needing to use pinned memory and clEnqueueMapBuffer() and such? I want this to work in a general-purpose way, and I'd really rather not use clEnqueueMapBuffer() and pinned memory unless I have to, since I'd need to break transfers into pieces, stream them through pinned memory, and watch events to know when to start the next memcpy. Furthermore, I assumed the runtime already did this anyway.
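
For reference, here's a stripped-down sketch of the pattern I'm describing, in raw OpenCL rather than through my library (this is not my actual code; issueRound() and the kernel signature are made up, and the context, queues, kernel, and buffers are assumed to already exist):

#include <CL/cl.h>

// Issue one transfer/transfer/execute round into the given queue, chaining
// each command off the previous one in this queue via events. The caller
// alternates calls between queue 0 and queue 1, then clFlush()es both.
static cl_event issueRound(cl_command_queue q, cl_kernel kernel,
                           cl_mem bufA, cl_mem bufB, cl_mem bufC,
                           const float* hostA, const float* hostB,
                           size_t bytes, size_t elems, cl_event prev)
{
  cl_event evA, evB, evK;

  // Non-blocking writes (CL_FALSE), exactly as described above
  clEnqueueWriteBuffer(q, bufA, CL_FALSE, 0, bytes, hostA,
                       prev != NULL ? 1 : 0, prev != NULL ? &prev : NULL, &evA);
  clEnqueueWriteBuffer(q, bufB, CL_FALSE, 0, bytes, hostB, 1, &evA, &evB);

  clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);

  // Round the global size up to the work group size; the kernel is assumed
  // to guard against out-of-range work items.
  size_t local = 64;
  size_t global = ((elems + local - 1) / local) * local;
  clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, &local, 1, &evB, &evK);

  clReleaseEvent(evA);
  clReleaseEvent(evB);
  return evK;  // the next round in this queue chains off this event
}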

0 Likes
11 Replies
rick_weber
Adept II

Would it be possible to get the exact requirements for using this feature? Everything I've read is fairly vague and I've tried what I found in other threads.

0 Likes

Or simply an example that demonstrates it?

(I'd obviously, and wrongly, assumed this kind of thing was a given: a PC or device sitting around polling for I/O is such an IBM PC XT-era technological limitation that I'm pretty much shocked it's even in a shipping product in this day and age.)

0 Likes
kcarney
Staff

Hi Rick,

Will you please post your code here for review? 

Thanks!

Kristen

0 Likes

Here is the salient part of the code that issues tasks into queues. This is all built on the clutil abstraction, so you won't see any actual OpenCL calls here, and I don't want to post the source for the entire library. I think it's more useful to describe what the functions in my library do on the back end. If you disagree and want the entire thing, PM me and I can point you to the Google Code repository.

Basically, this example is a matrix multiplication that automatically uses multiple OpenCL devices. The ParallelFor loop issues the user-defined lambda every time a device isn't doing any work. The runtime manages this behind the scenes by enqueueing a marker and polling its event for completion. As you can imagine, this requires that everything in the lambda be nonblocking and merely issue tasks to the device.
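
To be concrete, the marker-and-poll idea looks roughly like this (a sketch of the behavior I just described, not the actual clutil source; the function names are made up):

#include <CL/cl.h>

// After the lambda has issued its batch of commands, drop a marker into the
// queue; its event completes once everything enqueued before it has finished.
cl_event markEndOfBatch(cl_command_queue q)
{
  cl_event marker;
  clEnqueueMarker(q, &marker);
  clFlush(q);
  return marker;
}

// The scheduler polls this; once it returns true the device is free for the
// next ParallelFor iteration.
bool batchFinished(cl_event marker)
{
  cl_int status = CL_QUEUED;
  clGetEventInfo(marker, CL_EVENT_COMMAND_EXECUTION_STATUS,
                 sizeof(status), &status, NULL);
  return status == CL_COMPLETE;
}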

Each device has two command queues (more can be configured, but that's a non sequitur here), and the user manages which queue subsequent commands go into. Users change which command queue they submit to by calling curDevice.setCommandQueue(number).

Buffer objects maintain the cl_mem handle to the underlying data they allocate upon construction, as well as the size. When constructed, these buffers allocate memory using CL_MEM_READ_WRITE as the only flag. When copy constructed, they call clRetainMemObject(), and when destructed they call clReleaseMemObject(). They also maintain a cl_event for the last task that referenced the object (be it a kernel or a data transfer). Buffer.get() and Buffer.put() wrap clEnqueueReadBuffer() and clEnqueueWriteBuffer() respectively, forwarding the pointer the user supplies. They pass the event from the last referencing task as a dependency, release that event, then store the event returned by the clEnqueue*Buffer() call as the new last task. No calls to clEnqueue*Buffer() are blocking.
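
In other words, a Buffer behaves roughly like this (a simplified sketch of what I described, not the clutil source; assignment operators and the copy's event handling are elided):

#include <CL/cl.h>
#include <cstddef>

class Buffer
{
public:
  Buffer(cl_context ctx, size_t size) : mSize(size), mLastEvent(NULL)
  {
    mMem = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, NULL);
  }

  //Copies share the cl_mem via retain/release
  Buffer(const Buffer& rhs) : mMem(rhs.mMem), mSize(rhs.mSize), mLastEvent(NULL)
  {
    clRetainMemObject(mMem);
  }

  ~Buffer()
  {
    clReleaseMemObject(mMem);
  }

  //Non-blocking write that waits on whatever last touched this buffer
  void put(cl_command_queue q, const void* src, size_t bytes)
  {
    cl_event ev;
    clEnqueueWriteBuffer(q, mMem, CL_FALSE, 0, bytes, src,
                         mLastEvent != NULL ? 1 : 0,
                         mLastEvent != NULL ? &mLastEvent : NULL,
                         &ev);
    updateLastEvent(ev);
  }

  //Non-blocking read with the same chaining rule
  void get(cl_command_queue q, void* dst, size_t bytes)
  {
    cl_event ev;
    clEnqueueReadBuffer(q, mMem, CL_FALSE, 0, bytes, dst,
                        mLastEvent != NULL ? 1 : 0,
                        mLastEvent != NULL ? &mLastEvent : NULL,
                        &ev);
    updateLastEvent(ev);
  }

private:
  void updateLastEvent(cl_event ev)
  {
    if(mLastEvent != NULL) clReleaseEvent(mLastEvent);
    mLastEvent = ev;
  }

  cl_mem mMem;
  size_t mSize;
  cl_event mLastEvent;
};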

clUtilEnqueueKernel() is a bit of variadic template black magic: you pass the name of the kernel you want as argument 1 and an rvalue reference to a variadic "grid" object defining global/local work group arrangements as argument 2. Arguments 3 through N are passed straight through to clSetKernelArg(), UNLESS they happen to be Buffer objects, in which case their underlying cl_mem handle is passed instead. Additionally, each Buffer's last referencing cl_event is appended to the event wait list. We then call clEnqueueNDRangeKernel(), passing the wait list of all the events from all Buffer objects, take the output event, and update each Buffer's last reference.
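
The argument forwarding boils down to something like this (my own simplified sketch, not the clutil implementation; the real version also accumulates each Buffer's wait event, which is omitted here):

#include <CL/cl.h>

struct Buffer { cl_mem mem; };   //stand-in for the real Buffer class

//Forward declarations so each overload can find the other when recursing
inline void setArgs(cl_kernel, cl_uint) { }   //terminator: all args are set

template<typename T, typename... Rest>
void setArgs(cl_kernel kernel, cl_uint index, const T& arg, const Rest&... rest);

template<typename... Rest>
void setArgs(cl_kernel kernel, cl_uint index, const Buffer& buf, const Rest&... rest);

//Generic case: pass the value straight through to clSetKernelArg()
template<typename T, typename... Rest>
void setArgs(cl_kernel kernel, cl_uint index, const T& arg, const Rest&... rest)
{
  clSetKernelArg(kernel, index, sizeof(T), &arg);
  setArgs(kernel, index + 1, rest...);
}

//Buffer case: substitute the underlying cl_mem handle
template<typename... Rest>
void setArgs(cl_kernel kernel, cl_uint index, const Buffer& buf, const Rest&... rest)
{
  clSetKernelArg(kernel, index, sizeof(cl_mem), &buf.mem);
  setArgs(kernel, index + 1, rest...);
}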

sgemm is a BLAS call that essentially just calls clUtilEnqueueKernel().

In essence, this code alternates between two command queues, each with its own associated buffers, and calls matrix multiply on blocks of the matrix. Each task is only dependent on tasks previously issued to the same queue, since we double buffer. After all our sgemms finish, we add the two buffers, introducing the only inter-queue dependencies at the end. However, we still don't see overlap if we comment this out.

We preallocate two a, b, and c buffers for each device and call a kernel that zeros them, then call clFinish() (not shown here) to force lazy allocation to quit being lazy and actually allocate.
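
Something along these lines (a sketch of the idea, assuming a compiled "bzero"-style kernel that takes a buffer and an element count; not the actual code):

#include <CL/cl.h>

//Run a trivial zeroing kernel over a freshly created buffer and block on it,
//so the runtime commits the allocation before any timed work starts.
void forceAllocation(cl_command_queue q, cl_kernel zeroKernel,
                     cl_mem buf, size_t elems)
{
  cl_uint n = (cl_uint)elems;
  clSetKernelArg(zeroKernel, 0, sizeof(cl_mem), &buf);
  clSetKernelArg(zeroKernel, 1, sizeof(cl_uint), &n);

  //Round the global size up to the work group size; the kernel guards
  //against out-of-range work items.
  size_t local = 64;
  size_t global = ((elems + local - 1) / local) * local;
  clEnqueueNDRangeKernel(q, zeroKernel, 1, NULL, &global, &local,
                         0, NULL, NULL);
  clFinish(q);
}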

We are seeing correct results and are indeed using all 3 Radeon 7970s in the machine, as evidenced by our profiling trace, but our performance scalability is bad because the sgemm kernel is just too fast and transfers become the bottleneck. We need to overlap transfer and computation to even have a prayer of improving scalability (and even then I remain skeptical).

#include "Matrix.h"

bool _IMatrix::RuntimeInitialized = false;

std::unique_ptr<std::shared_ptr<clUtil::Buffer>[]> _IMatrix::aDevice;

std::unique_ptr<std::shared_ptr<clUtil::Buffer>[]> _IMatrix::bDevice;

std::unique_ptr<std::shared_ptr<clUtil::Buffer>[]> _IMatrix::cDevice;

void multiply(BlockedMatrix<float>& c,

              const BlockedMatrix<float>& a,

              const BlockedMatrix<float>& b)

{

  using namespace clUtil;

  using namespace std;

  if((a.mRowsPerBlock != b.mRowsPerBlock) ||

     (b.mRowsPerBlock != c.mRowsPerBlock))

  {

    throw BlockedMatrixBlockMismatch();

  }

  if((a.mColsPerBlock != b.mColsPerBlock) ||

     (b.mColsPerBlock != c.mColsPerBlock))

  {

    throw BlockedMatrixBlockMismatch();

  }

  if((a.mRows != c.mRows) || (b.mCols != c.mCols) || (a.mCols != b.mRows))

  {

    throw MatrixBadDimensions();

  }

  char transA = 'N';

  char transB = 'N';

  //Use double buffering to overlap transfer and compute

  unsigned int k = a.getCols();

  ParallelFor(0, 1, c.numRowBlocks() * c.numColBlocks() - 1,

  [&](size_t startIdx, size_t endIdx)

  {

    size_t curDeviceNum = Device::GetCurrentDeviceNum();

    shared_ptr<Buffer> curA = _IMatrix::aDevice[2 * curDeviceNum];

    shared_ptr<Buffer> altA = _IMatrix::aDevice[2 * curDeviceNum + 1];

    shared_ptr<Buffer> curB = _IMatrix::bDevice[2 * curDeviceNum];

    shared_ptr<Buffer> altB = _IMatrix::bDevice[2 * curDeviceNum + 1];

    shared_ptr<Buffer> curC = _IMatrix::cDevice[2 * curDeviceNum];

    shared_ptr<Buffer> altC = _IMatrix::cDevice[2 * curDeviceNum + 1];

    Device& curDevice = Device::GetCurrentDevice();

    size_t curQueue = 0;

    for(size_t curBlockID = startIdx; curBlockID <= endIdx; curBlockID++)

    {

      unsigned int blockRow = curBlockID % c.numRowBlocks();

      unsigned int blockCol = curBlockID / c.numRowBlocks();

      unsigned int blockSize = c.mRowsPerBlock * c.mColsPerBlock;

      //Zero our output buffers

      curDevice.setCommandQueue(0);

      clUtilEnqueueKernel("bzero",

                          clUtilGrid(blockSize, 64),

                          *curC,

                          blockSize);

      curDevice.setCommandQueue(1);

      clUtilEnqueueKernel("bzero",

                          clUtilGrid(blockSize, 64),

                          *altC,

                          blockSize);

     

      curDevice.setCommandQueue(0);

      //Go back and forth between the two queues, multiplying blocks of

      //A and B. This should give us transfer/kernel interleaving

      for(unsigned int curK = 0; curK < k / c.mRowsPerBlock; curK++)

      {

        curA->put(a.getBlock(blockRow, curK), sizeof(float) * blockSize);

        curB->put(b.getBlock(curK, blockCol), sizeof(float) * blockSize);

       

        sgemm(&transA,

              &transB,

              c.mRowsPerBlock,

              c.mColsPerBlock,

              a.mColsPerBlock,

              1.0,

              *curA,

              a.mRowsPerBlock,

              *curB,

              b.mRowsPerBlock,

              1.0,

              *curC,

              c.mRowsPerBlock);

        //Swap the buffers

        shared_ptr<Buffer> tmp;

        tmp = curA;

        curA = altA;

        altA = tmp;

       

        tmp = curB;

        curB = altB;

        altB = tmp;

       

        tmp = curC;

        curC = altC;

        altC = tmp;

        //Change the queue the next batch of tasks will go into

        curQueue = curQueue == 0 ? 1 : 0;

        curDevice.setCommandQueue(curQueue);

      }

      //Add the two c buffers

      clUtilEnqueueKernel("acc",

                          clUtilGrid(blockSize, 64),

                          *curC,

                          *altC,

                          blockSize);

     

      //Finally, write c out to its correct place

      curC->get(c.getBlock(curBlockID), sizeof(float) * blockSize);

    }

  });

}

0 Likes

Hi Rick,

When using an in-order OCL queue, you don't need to chain events from one command to the next on the same queue, provided you have a set of dedicated buffers per queue. AMD does not support out-of-order queues. However, while the extra chaining may slightly reduce performance, it should not account for the lack of concurrency.
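
For example, with dedicated buffers per queue, something like this is enough on an in-order queue (a rough sketch under the same assumptions as your code, not something I have run; the kernel, queue, and buffers are assumed to exist already):

#include <CL/cl.h>

//The in-order queue already serializes these commands, so no wait lists are
//needed; independent queues are still free to overlap with each other.
void issueRoundInOrder(cl_command_queue q, cl_kernel kernel,
                       cl_mem bufA, cl_mem bufB, cl_mem bufC,
                       const float* hostA, const float* hostB,
                       size_t bytes, size_t elems)
{
  clEnqueueWriteBuffer(q, bufA, CL_FALSE, 0, bytes, hostA, 0, NULL, NULL);
  clEnqueueWriteBuffer(q, bufB, CL_FALSE, 0, bytes, hostB, 0, NULL, NULL);

  clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
  clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
  clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);

  //Round the global size up to the work group size; the kernel is assumed
  //to guard against out-of-range work items.
  size_t local = 64;
  size_t global = ((elems + local - 1) / local) * local;
  clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, &local, 0, NULL, NULL);
}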

If this is an open source project, point me to the repository and I will take a closer look.

Tzachi

0 Likes

I know AMD doesn't support OOO queues. The idea is that when they do support them, I can change how the queues are instantiated and still have correctness guarantees. Furthermore, tracking dependencies provides protection if the user does something stupid with multiple queues, like enqueueing a transfer into queue 0 and a kernel needing that data into queue 1.

0 Likes

Also, how do you private message people?

0 Likes

Rick,

There isn't private message functionality on the forums.

Cheers!

Kristen

0 Likes

Anywho, you can check out the source at

https://code.google.com/p/clutil/source/checkout

You need g++ 4.5 or higher to compile this due to C++11 features.

The example in question is in examples/MatrixMultiply

0 Likes

I have a few comments and questions on your code, but I'd rather take this offline. What is your e-mail address?

Tzachi

0 Likes

rick dot weber dot iii at gmail dot com

0 Likes