I have a PCI data acquisition card that supports P2P. It will be capturing records one after the other at a very rapid rate, and the plan is to write each record to the GPU using DirectGMA, where a kernel will process the data. I can't handle the records sequentially as there is no time between records for the kernel to run. Instead, I was thinking of having two device buffers that the PCI card would alternately write to. After a record arrives in the first buffer I would run the kernel on this data, while at the same time start a wait for data to arrive in the second buffer. Once this data has arrived, I would run the kernel on this, start a wait on the first buffer, and so on.
I'm not familiar with out of order queues and CL events, so I'm looking for some suggestions on how I could achieve this. This is my attempt so far (I haven't included the reading of results back to the host, as this should be trivial):
1. clEnqueueWaitSignalAMD (buffer A)
2. clWaitForEvents (step #1 to complete)
3. clEnqueueNDRangeKernel (arg = buffer A)
4. clEnqueueWaitSignalAMD (buffer B)
5. clWaitForEvents (step #4 to complete)
6. clEnqueueNDRangeKernel (arg = buffer B)
7. Go back to #1
With an out of order queue I'm assuming 3 & 4 will happen "at the same time" (ditto for 6 & 1), however I think this is flawed: at step 5 the host waits on the buffer B wait signal to complete before proceeding, so there is an assumption here that this will always take longer than the kernel to run (#3), but what if it doesn't? Do I need to cater for this in some way, and if so how? Or does this whole pattern rely on the kernel taking less time than a record write, otherwise the program wouldn't be able to keep up with the data throughput?!
Am I on the right lines here? Any pointers would be greatly appreciated.
Currently, AMD's OpenCL runtime does not support out-of-order queue on host side; each host-side queue executes commands in-order fashion (please check "queue property" in the clinfo output). If out-of-order command execution is required, use two or more queues to submit the commands.
I assume that by using two queues (one per buffer), I would just run the commands in sequence on each queue: clEnqueueWaitSignalAMD(), clEnqueueNDRangeKernel(), clEnqueueReadBuffer(). When the latter completes I just enqueue the same three commands again.
Does this mean I have to manage each queue in its own host thread? I can't see how else to independently wait for each queue's clEnqueueReadBuffer() command to complete.
Using a separate thread for each queue (or buffer) might be a good option to independently process each buffer. If single thread is used, then there will be some kind of ordering between the two dependency chains. For example, a typical single thread approach may look like below.
// set a dependency chain for bufferA so that processing of bufferA can start once GMA write to bufferA completes
clEnqueueWaitSignalAMD(queueA, bufferA, marker++, e1)
clEnqueueNDRangeKernel(queueA, bufferA, 1, e1, e2)
clEnqueueReadBuffer(queueA, bufferA, hostBufferA, 1, e2, e3)
clFlush(queueA) // submit all the commands without blocking the host thread
// once GMA write to bufferA completes, bufferB can be used for writing
// now, set a similar dependency chain for bufferB so that processing of bufferB can start once GMA write to bufferB completes
clEnqueueWaitSignalAMD(queueB, bufferB, marker++, e4)
clEnqueueNDRangeKernel(queueB, bufferB, 1, e4, e5)
clEnqueueReadBuffer(queueB, bufferB, hostBufferB, 1, e5, e6)
clFlush(queueB) // submit all the commands without blocking the host thread
clWaitForEvents(e3) // wait for bufferA to complete (blocking call)
readyBufferA = true; // at this moment, bufferA is ready for GMA write once again; send a signal to GMA writer [Note, GMA writer should not use the bufferA until it gets this signal]
host_Process(hostBufferA) // also, a new thread can be launched to process the host buffer in parallel
clWaitForEvents(e6) // wait for bufferB to complete (blocking call)
readyBufferB = true; // at this moment, bufferB is ready for GMA write once again; send a signal to GMA writer [note, GMA writer should not use the bufferB until it gets this signal]
host_Process(hostBufferB) // also, a new thread can be launched to process the host buffer in parallel
Note: some of the events can be omitted when a in-order queue is used.
Thanks Dipak for the comprehensive answer, that will be very useful.
My only comment is that the other PCI card acquires data at a fixed frequency, and there is no mechanism to send a signal telling it that the buffer is ready to write to. I guess I just have to hope that the GPU can "keep up" with the data throughput, and complete the command chain before the buffer is written to again.
No separate signalling mechanism is needed. I just meant to say a new DirectGMA write should start once the GPU finishes its work on the same buffer. For example, if there is an API for the DirectGMA write, then check a corresponding flag or something before calling the API to initiate the DirectGMA write to a buffer. Please note, a buffer should not be used for DirectGMA write while the GPU is reading/processing the same buffer.
Can I just confirm that clEnqueueWaitSignalAMD() does not block the host?
Am I right in saying that the only clEnqueue... commands that do block the host are ReadBuffer & WriteBuffer (and only if the third arg is CL_TRUE)?
Last question: is there any way of being able to see the clEnqueueWaitSignalAMD commands in CodeXL?
In general, clEnqueue<> calls do not block except those which take an extra input argument to explicitly indicate a blocking operation. Regarding clEnqueueWaitSignalAMD(), however, I'm too little bit confused here because this page says: "This command instructs the OpenCL to wait until <value> is written to <buffer> before issuing the next command". If it's a blocking call, then I think multi-threaded approach would be a better solution than the above I described earlier. I will check with the appropriate team and get back to you.
Looks like two different implementations exist for clEnqueueWaitSignalAMD. It’s an asynchronous call in newer PAL OpenCL stack, whereas in ORCA OpenCL stack, it is a blocking call because it waits on CPU. So, I would suggest you to try a simple test before finalizing any design.
PAL OpenCL stack - supports Vega 10 and later products
Legacy (ORCA ) OpenCL stack - supports legacy products older than Vega 10
Thanks for the reply.
I am currently using APP SDK v3.0.130 (Windows). Which implementation would this fall under? What about GPUOpen (Releases · GPUOpen-LibrariesAndSDKs/OCL-SDK · GitHub )?https://github.com/GPUOpen-LibrariesAndSDKs/OCL-SDK/releases
Or is it dependent on the hardware? I'm using a Radeon Pro WX7100.