Here are a couple of good pointers to get started with DirectGMA in OpenCL:
Hopefully the above samples will help you understand the DGMA data transfer and signaling methods.
P.S. You have been whitelisted.
Hi Dipak, thanks for those links. I've seen one or two similar examples that detail how the memory transfer is performed.
The main area that I'm unsure about is how I could create a kernel that runs continuously (or at least for the duration of my data acquisition), which would process each record that arrives via DGMA from the data acquisition card. All the code samples I've seen are very basic, and just run a kernel to do something short-lived (such as adding two arrays together). Due to the critical timing and frequency of the data acquisition records arriving (every 30us) I can't simply run a new kernel to process each one.
Assuming this is possible, my other main question is: after processing a record, how would I pass the results (the found peaks) back to the host program? Is there (say) some kind of eventing mechanism that the kernel can use to inform the host that a record has been processed and the results are available to retrieve from GPU memory?
First, let me clarify a few points.
As per the OpenCL spec, a buffer (except an SVM buffer with atomics support) should not be updated while a kernel is accessing it. OpenCL only enforces memory consistency for buffer objects shared between enqueued commands at synchronization points.
Launching (not enqueuing) a new kernel and the completion of a running kernel act as synchronization points. No buffer update is allowed during a kernel run, and likewise, any update made inside a kernel is not guaranteed to be visible until the kernel finishes. Hence a long-running kernel is not a feasible option if you want to update or access the buffer contents during kernel execution. On some systems, a long-running kernel may also produce side effects such as GUI freezes or even a driver crash.
I have a suggestion, though. Instead of one DGMA buffer, you can use multiple DGMA buffers (which may be of different sizes). Assuming the application uses two buffers that can be processed independently, a typical call sequence looks like this:
- Enqueue a write to buffer-A.
- Once the write is done, enqueue a write to buffer-B and enqueue kernel-A with buffer-A.
- Once kernel-A finishes, enqueue a read to get the result generated by kernel-A. Meanwhile, if the write to buffer-B is done, enqueue kernel-B with buffer-B and reuse buffer-A to enqueue another write. Once result-A is available on the host, process it as required (asynchronously, if possible).
- Swap A and B and repeat the previous step.
The commands should be enqueued asynchronously, and event objects can be used to form a chain of dependencies.
Kernel-A and kernel-B may share the same kernel code; I marked them separately only to indicate different launch parameters and kernel arguments.
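The call sequence above can be sketched in host code roughly as follows. This is only a minimal illustration, not vendor code: the function and parameter names are placeholders, error checking is omitted, and with a single in-order queue the events mainly document the dependencies (using two queues, one per buffer, is what actually lets transfers and kernels overlap).

```c
/* Minimal sketch of the double-buffered (ping-pong) sequence described
 * above. buf[0]/buf[1] are the two input buffers, result[0]/result[1]
 * the corresponding output buffers; src/dst are host-side staging
 * pointers (placeholders for your per-record data). */
#include <CL/cl.h>

void run_ping_pong(cl_command_queue q, cl_kernel kern,
                   cl_mem buf[2], cl_mem result[2],
                   const void *src[2], void *dst[2],
                   size_t in_bytes, size_t out_bytes,
                   size_t gsize, int nblocks)
{
    cl_event prev_run[2] = {NULL, NULL};

    for (int i = 0; i < nblocks; ++i) {
        int cur = i & 1;                 /* alternate between buffer 0 and 1 */
        cl_event wrote, ran, rd;

        /* Write the next record; wait until the previous kernel that
           used this buffer has finished before overwriting it. */
        clEnqueueWriteBuffer(q, buf[cur], CL_FALSE, 0, in_bytes, src[cur],
                             prev_run[cur] ? 1 : 0,
                             prev_run[cur] ? &prev_run[cur] : NULL, &wrote);
        if (prev_run[cur]) clReleaseEvent(prev_run[cur]);

        /* Launch the kernel once the write completes. */
        clSetKernelArg(kern, 0, sizeof(cl_mem), &buf[cur]);
        clSetKernelArg(kern, 1, sizeof(cl_mem), &result[cur]);
        clEnqueueNDRangeKernel(q, kern, 1, NULL, &gsize, NULL,
                               1, &wrote, &ran);
        clReleaseEvent(wrote);

        /* Read the result back as soon as the kernel is done; the host
           can process dst[cur] asynchronously once the read completes. */
        clEnqueueReadBuffer(q, result[cur], CL_FALSE, 0, out_bytes, dst[cur],
                            1, &ran, &rd);
        clReleaseEvent(rd);
        prev_run[cur] = ran;             /* gates the next reuse of buf[cur] */
    }
    clFinish(q);
    if (prev_run[0]) clReleaseEvent(prev_run[0]);
    if (prev_run[1]) clReleaseEvent(prev_run[1]);
}
```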
Regarding your question about how to pass information from a kernel to the host: OpenCL 2.0 introduced fine-grained SVM (Shared Virtual Memory) with atomics support, which allows the same memory to be accessed atomically by the host and device(s). This type of buffer, together with the related atomic functions, can be used to pass data between the host and device(s) while a kernel is running.
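As a rough illustration of that mechanism (assuming a device that reports both CL_DEVICE_SVM_FINE_GRAIN_BUFFER and CL_DEVICE_SVM_ATOMICS; the kernel and function names are made up for this sketch):

```c
/* Sketch: a flag in fine-grained SVM that the kernel sets and the host
 * polls while the kernel is still running. Error checking omitted. */
#include <CL/cl.h>
#include <stdatomic.h>

/* Device side (kernel source shown as a comment, for illustration):
 *
 *   kernel void find_peaks(global const float *record,
 *                          global int *peaks,
 *                          volatile global atomic_int *ready)
 *   {
 *       // ... process the record, write results to 'peaks' ...
 *       atomic_store_explicit(ready, 1, memory_order_release,
 *                             memory_scope_all_svm_devices);
 *   }
 */

atomic_int *make_ready_flag(cl_context ctx)
{
    /* Allocate the flag in fine-grained SVM with atomics support. */
    atomic_int *ready = clSVMAlloc(ctx,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS,
        sizeof(atomic_int), 0);
    atomic_store(ready, 0);
    return ready;
    /* Pass it to the kernel with clSetKernelArgSVMPointer(), and free it
       later with clSVMFree(ctx, ready). */
}

void wait_for_signal(atomic_int *ready)
{
    /* The host sees the store while the kernel is still running. */
    while (atomic_load_explicit(ready, memory_order_acquire) == 0)
        ;   /* spin; a real application would back off or sleep here */
}
```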
Thanks, that makes more sense. I was looking at all this the wrong way, thinking that I could have a kernel that (somehow) kept running for the duration of the acquisition. I've since received an example program from the data acquisition card vendor that does something similar to what you suggest. It's very simple, but it enqueues a set of commands for each record: clEnqueueWaitSignalAMD() followed by clEnqueueCopyBuffer() (although the latter could just as easily be our kernel).
In their example they know in advance how many records they'll be acquiring, so they just enqueue that number of commands up-front. In our scenario we will be acquiring indefinitely, so we can't do this. How could we deal with that? I'm vaguely aware of the concept of callbacks in OpenCL, so I'm guessing I could use one to be informed when a kernel has finished running and then enqueue the next set of commands?
Last question: if something were to go wrong during the data acquisition, I guess the host could sit waiting indefinitely for the remaining GPU command(s) to complete (which they never will, particularly clEnqueueWaitSignalAMD). What is the best way to handle this situation gracefully, i.e. aborting/clearing the queue?
You can execute the above steps inside a loop to process the data block by block. When a block of data is available, enqueue a set of commands and form a chain of dependencies using event objects. Once all the commands associated with a block finish (i.e. the last event object in the chain completes), enqueue another set of commands for the next block. Continue this process as long as new blocks are available. If multiple buffers are used instead of a single one, with each representing a separate block of data, then multiple blocks can be processed simultaneously as well.
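One common way to drive such an open-ended loop is clSetEventCallback() on the last event of each chain. Note that the spec discourages doing heavy or blocking work inside the callback itself, so in this sketch the callback only bumps a counter and the host loop does the enqueuing; acquisition_running() and enqueue_block() are placeholders for your own acquisition and command-chain code:

```c
/* Sketch: keeping the pipeline fed indefinitely via an event callback. */
#include <CL/cl.h>
#include <stdatomic.h>

extern int acquisition_running(void);               /* placeholder */
extern cl_event enqueue_block(cl_command_queue q);  /* placeholder: enqueues
                                                       one block's command
                                                       chain, returns its
                                                       last event */

static atomic_int blocks_done;

static void CL_CALLBACK on_block_done(cl_event ev, cl_int status, void *user)
{
    (void)user;
    if (status == CL_COMPLETE)
        atomic_fetch_add(&blocks_done, 1);  /* wake up the host loop */
    clReleaseEvent(ev);
}

void acquisition_loop(cl_command_queue q)
{
    int enqueued = 0;
    while (acquisition_running()) {
        /* Keep at most two blocks in flight at any time. */
        if (enqueued - atomic_load(&blocks_done) < 2) {
            cl_event last = enqueue_block(q);
            clSetEventCallback(last, CL_COMPLETE, on_block_done, NULL);
            clFlush(q);   /* make sure the commands are actually submitted */
            ++enqueued;
        }
    }
    clFinish(q);
}
```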
Regarding your last question, I would refer you to the link below, which describes how event objects can be used to check for a command failure and what the consequences are.
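In short, instead of blocking forever in clWaitForEvents(), the host can poll the event's execution status with its own timeout. This is a sketch of that idea, not the linked article's code; what recovery is possible after a timeout (e.g. releasing and recreating the queue) is vendor-specific, so treat that part as an assumption to verify:

```c
/* Sketch: host-side timeout while waiting on an OpenCL event. */
#include <CL/cl.h>
#include <time.h>

/* Returns CL_COMPLETE on success, a negative status if the command
 * failed, or a positive status (still pending) if the timeout expired. */
cl_int wait_with_timeout(cl_event ev, double seconds)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (;;) {
        cl_int status;
        clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                       sizeof(status), &status, NULL);
        if (status == CL_COMPLETE || status < 0)
            return status;   /* finished, or terminated with an error */

        clock_gettime(CLOCK_MONOTONIC, &now);
        double elapsed = (now.tv_sec - start.tv_sec)
                       + (now.tv_nsec - start.tv_nsec) * 1e-9;
        if (elapsed > seconds)
            return status;   /* still CL_QUEUED/CL_SUBMITTED/CL_RUNNING:
                                caller can give up and tear down the queue */

        struct timespec ts = {0, 100000};  /* sleep ~100 us between polls */
        nanosleep(&ts, NULL);
    }
}
```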
A while back, I used Mantle (AMD's low-level graphics API) to test DMA to and from the GPU. On an old GDDR5 video card, in 1-megabyte chunks, it ran more than twice as fast as DMA to RAM. I was able to transfer from SSD to GPU and then from GPU to CPU faster than from SSD directly to CPU.