Would it be possible to get the exact requirements for using this feature? Everything I've read is fairly vague and I've tried what I found in other threads.
Or simply an example that demonstrates it?
(I'd obviously, and rather wrongly, assumed this kind of thing was a given: the host sitting around polling for I/O is such an IBM PC/XT-era technological limitation that I'm pretty much shocked it's in a shipping product in this day and age.)
Will you please post your code here for review?
Here is the salient part of the code that issues tasks into queues. It's all built on the clUtil abstraction, so you won't see any actual OpenCL calls here, and I don't want to post the source for the entire library. I think it's more useful to describe what the functions in my library do on the backend. If you disagree and want the entire thing, PM me and I can point you to the Google Code repository.
Basically, this example is a matrix multiplication that automatically uses multiple OpenCL devices. The ParallelFor loop issues the user-defined lambda whenever a device isn't doing any work. The runtime manages this behind the scenes by enqueueing a marker and polling its event for completion. As you can imagine, this requires that everything in the lambda be nonblocking and merely issue tasks to the device.
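For what it's worth, the scheduling idea can be modeled in plain C++ without any OpenCL at all. This is just an illustrative sketch (not clUtil's actual implementation, and all names here are invented): a std::future stands in for the marker event the runtime polls, and the scheduler hands the next chunk of the index space to whichever "device" reports idle.

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <future>
#include <vector>

//Toy model of the ParallelFor scheduler described above. clUtil polls an
//enqueued marker's event; here wait_for(0) on a future plays that role.
static void parallelForModel(size_t start, size_t chunk, size_t end,
                             int numDevices,
                             const std::function<void(size_t, size_t)>& body)
{
  std::vector<std::future<void>> inFlight(numDevices);
  size_t next = start;

  while(next <= end)
  {
    for(int dev = 0; dev < numDevices && next <= end; dev++)
    {
      //"Poll the marker": a device is idle if it has no task in flight or
      //its last task's future is already ready
      bool idle = !inFlight[dev].valid() ||
        inFlight[dev].wait_for(std::chrono::milliseconds(0)) ==
          std::future_status::ready;

      if(idle)
      {
        size_t lo = next;
        size_t hi = std::min(end, lo + chunk - 1);
        next = hi + 1;
        inFlight[dev] = std::async(std::launch::async, body, lo, hi);
      }
    }
  }

  for(auto& f : inFlight)
  {
    if(f.valid()) { f.wait(); }
  }
}

int main()
{
  std::vector<int> out(10, 0);

  //Two "devices" cooperatively fill out[i] = i * i over the range [0, 9]
  parallelForModel(0, 3, 9, 2, [&](size_t lo, size_t hi)
  {
    for(size_t i = lo; i <= hi; i++) { out[i] = int(i) * int(i); }
  });

  int sum = 0;
  for(int v : out) { sum += v; }
  std::printf("%d\n", sum); //sum of squares 0..9 = 285
  return 0;
}

Obviously the real runtime deals with OpenCL events rather than futures, but the control flow (dispatch on idle, poll for completion) is the same shape.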
Each device has two command queues (more can be configured, but that's beside the point), and the user controls which queue subsequent commands go into by calling curDevice.setCommandQueue(number).
Buffer objects maintain the cl_mem handle to the underlying data they allocate upon construction as well as the size. When constructed, these buffers allocate memory using CL_MEM_READ_WRITE as the only flag. When copy constructed, they call clRetain() and when destructed they call clRelease(). They also maintain a cl_event to the last task that referenced this object (be it a kernel or data transfer). Buffer.get() and Buffer.put() wrap clEnqueueReadBuffer() and clEnqueueWriteBuffer() respectively, forwarding the pointer the user sends. They pass the event from the last referencing task as a dependency, release that event, then set the event output from the clEnqueue*Buffer() task as the last task. No calls to clEnqueue*Buffer() are blocking.
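The retain/release discipline described above boils down to standard RAII over a refcounted handle. Here's a minimal mock (not the actual clUtil source; the int refcount stands in for the cl_mem reference count that clRetainMemObject/clReleaseMemObject would adjust):

#include <cstddef>
#include <cstdio>

struct MockMem { int refCount = 1; }; //stand-in for a refcounted cl_mem

class Buffer
{
public:
  explicit Buffer(size_t size) : mMem(new MockMem), mSize(size) {}

  //Copy construction retains the underlying handle, like clRetainMemObject()
  Buffer(const Buffer& rhs) : mMem(rhs.mMem), mSize(rhs.mSize)
  {
    mMem->refCount++;
  }

  //Destruction releases it, like clReleaseMemObject()
  ~Buffer()
  {
    if(--mMem->refCount == 0) { delete mMem; }
  }

  int refCount() const { return mMem->refCount; }

private:
  MockMem* mMem;
  size_t mSize;
};

int main()
{
  Buffer a(1024);
  {
    Buffer b = a;                        //retains
    std::printf("%d\n", a.refCount());   //2
  }                                      //b destructs: releases
  std::printf("%d\n", a.refCount());     //1
  return 0;
}

(A real version would also define copy assignment; it's omitted here to keep the sketch short.)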
clUtilEnqueueKernel() is a bit of variadic template black magic where you pass the name of the kernel you want as argument 1 and an rvalue reference to a variadic "grid" object defining global/local work group arrangements as argument 2. Arguments 3 through N are literally passed along to clSetKernelArg() UNLESS they happen to be Buffer objects, in which case they pass their underlying cl_mem handle instead. Additionally, their last referencing cl_event is appended to the event wait list. We then call clEnqueueNDRangeKernel(), passing the wait list of all the events from all Buffer objects. We then take the output event and update all the Buffers' last references.
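To make the dispatch trick concrete, here's a hypothetical mock of that argument forwarding (names invented, not the actual clUtil source): a non-template overload intercepts Buffer arguments so they contribute their handle plus their last event, while everything else passes through untouched.

#include <cstdio>
#include <vector>

struct Buffer { int mem; int lastEvent; }; //stand-ins for cl_mem / cl_event

static std::vector<int> gKernelArgs; //models the clSetKernelArg() calls
static std::vector<int> gWaitList;   //models the cl_event wait list

//Generic arguments are forwarded as-is
template<typename T>
void setArg(const T& arg)
{
  gKernelArgs.push_back(static_cast<int>(arg));
}

//Buffers instead contribute their handle and their last referencing event
void setArg(const Buffer& b)
{
  gKernelArgs.push_back(b.mem);
  gWaitList.push_back(b.lastEvent);
}

void setArgs() {} //recursion base case (C++11-style variadic unpacking)

template<typename First, typename... Rest>
void setArgs(const First& first, const Rest&... rest)
{
  setArg(first);
  setArgs(rest...);
}

template<typename... Args>
void enqueueKernelModel(const char* /*kernelName*/, const Args&... args)
{
  setArgs(args...);
  //...here the real library would call clEnqueueNDRangeKernel() with the
  //wait list, then update each Buffer's last event with the output event
}

int main()
{
  Buffer a{7, 100};
  Buffer b{8, 101};

  enqueueKernelModel("sgemm", a, b, 42);

  //3 kernel args were set, but only the 2 Buffers landed in the wait list
  std::printf("%zu %zu\n", gKernelArgs.size(), gWaitList.size());
  return 0;
}

The same idea works with a grid argument and real OpenCL handles; the only interesting part is the overload resolution picking the Buffer overload over the generic template.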
sgemm is a BLAS call that essentially just calls clUtilEnqueueKernel().
In essence, this code alternates between two command queues, each with its own associated buffers, and calls matrix multiply on blocks of the matrix. Each task is only dependent on tasks previously issued to the same queue, since we double buffer. After all our sgemms finish, we add the two buffers, introducing the only inter-queue dependencies at the end. However, we still don't see overlap if we comment this out.
We preallocate two a, b, and c buffers for each device and call a kernel that zeros them, then call clFinish (not displayed here) to force lazy allocation to quit being lazy and just do it.
We are seeing correct results and are indeed using all 3 Radeon 7970s in the machine, as evidenced by our profiling trace, but our performance scalability is bad because the sgemm kernel is just too fast and transfers become the bottleneck. We need to overlap transfer and compute to even have a prayer of improving scalability (and even then I remain skeptical).
bool _IMatrix::RuntimeInitialized = false;

void multiply(BlockedMatrix<float>& c,
              const BlockedMatrix<float>& a,
              const BlockedMatrix<float>& b)
{
  using namespace clUtil;
  using namespace std;

  if((a.mRowsPerBlock != b.mRowsPerBlock) ||
     (b.mRowsPerBlock != c.mRowsPerBlock))
  {
    //Dimension-mismatch handling elided
  }

  if((a.mColsPerBlock != b.mColsPerBlock) ||
     (b.mColsPerBlock != c.mColsPerBlock))
  {
    //Dimension-mismatch handling elided
  }

  if((a.mRows != c.mRows) || (b.mCols != c.mCols) || (a.mCols != b.mRows))
  {
    //Dimension-mismatch handling elided
  }

  char transA = 'N';
  char transB = 'N';

  //Use double buffering to overlap transfer and compute
  unsigned int k = a.getCols();

  ParallelFor(0, 1, c.numRowBlocks() * c.numColBlocks() - 1,
              [&](size_t startIdx, size_t endIdx)
  {
    size_t curDeviceNum = Device::GetCurrentDeviceNum();

    shared_ptr<Buffer> curA = _IMatrix::aDevice[2 * curDeviceNum];
    shared_ptr<Buffer> altA = _IMatrix::aDevice[2 * curDeviceNum + 1];
    shared_ptr<Buffer> curB = _IMatrix::bDevice[2 * curDeviceNum];
    shared_ptr<Buffer> altB = _IMatrix::bDevice[2 * curDeviceNum + 1];
    shared_ptr<Buffer> curC = _IMatrix::cDevice[2 * curDeviceNum];
    shared_ptr<Buffer> altC = _IMatrix::cDevice[2 * curDeviceNum + 1];

    Device& curDevice = Device::GetCurrentDevice();
    size_t curQueue = 0;

    for(size_t curBlockID = startIdx; curBlockID <= endIdx; curBlockID++)
    {
      unsigned int blockRow = curBlockID % c.numRowBlocks();
      unsigned int blockCol = curBlockID / c.numRowBlocks();
      unsigned int blockSize = c.mRowsPerBlock * c.mColsPerBlock;

      //Zero our output buffers (kernel call elided)

      //Go back and forth between the two queues, multiplying blocks of
      //A and B. This should give us transfer/kernel interleaving
      for(unsigned int curK = 0; curK < k / c.mRowsPerBlock; curK++)
      {
        curA->put(a.getBlock(blockRow, curK), sizeof(float) * blockSize);
        curB->put(b.getBlock(curK, blockCol), sizeof(float) * blockSize);

        //sgemm call elided

        //Swap the buffers
        shared_ptr<Buffer> tmp = curA;
        curA = altA;
        altA = tmp;

        tmp = curB;
        curB = altB;
        altB = tmp;

        tmp = curC;
        curC = altC;
        altC = tmp;

        //Change the queue the next batch of tasks will go into
        curQueue = curQueue == 0 ? 1 : 0;
        curDevice.setCommandQueue(curQueue);
      }

      //Add the two c buffers (kernel call elided)

      //Finally, write c out to its correct place
      curC->get(c.getBlock(curBlockID), sizeof(float) * blockSize);
    }
  });
}
When using an in-order OpenCL queue, you don't need to chain events from one command to the next on the same queue if you have a set of dedicated buffers per queue. AMD does not support out-of-order queues. However, while the extra event chaining may slightly reduce performance, it should not account for the lack of concurrency.
If this is an open source project, point me to the repository and I will take a closer look.
I know AMD doesn't support OOO queues. The idea is that when they do support it, I can change how the queues are instantiated and still have correctness guarantees. Furthermore, tracking dependencies provides protection if the user does something stupid with multiple queues like enqueue a transfer into queue 0 and a kernel needing that data into queue 1.
Also, how do you private message people?
There isn't private message functionality on the forums.
Anywho, you can check out the source at
You need g++4.5 or higher to compile this due to C++11 features.
The example in question is in examples/MatrixMultiply
I have a few comments and questions on your code, but I'd rather take this offline. What is your e-mail address?
rick dot weber dot iii at gmail dot com