

andyste1
Adept II

A problem to solve with OpenCL and DirectGMA...

I've been tasked with solving a problem that feels like it might be a good fit for a GPU, although I could be wrong...

We have a data acquisition card that generates nearly 8 GB/sec, typically in the form of a 240 KB "record" (60,000 x 32-bit values) every 30 microseconds. A data acquisition "run" can last for a few milliseconds or for many seconds. The DA card supports peer-to-peer transfer, so my initial thought is to write the records straight to graphics card memory, where they can be processed by the GPU. The card is a Radeon Pro WX7100. (The above is our "ideal" throughput, but it can be reduced if it's likely to be too much for the PCIe bus or GPU; we could drop the rate to one record every 60us or even 120us.)

The data processing will involve extracting certain sections of the record that we are interested in (typically 10-20% of the overall record). To do this we would need to pass the GPU a series of "from & to" ranges specifying which sections of the record we want to look at (e.g. "50-175", "1675-1920", "5700-5780", etc.). Within each section we then want to do a simple peak detect, returning details of each found peak (height, width, etc.) back to the host program. The number of ranges will vary (anywhere from 1 to 20), and each range will differ in width.
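To make the processing concrete, this is roughly the kind of per-range kernel I imagine we would need, cobbled together from the tutorial examples. It is purely a sketch built on my own assumptions: the struct layouts, the naive local-maximum test and the fixed number of output slots per range are placeholders, not anything we have working.

// Sketch only: one work-group per range; struct layouts and the
// local-maximum test are illustrative assumptions.
typedef struct { uint from; uint to; } range_t;
typedef struct { uint index; uint height; } peak_t;

__kernel void find_peaks(__global const uint    *record,       // one 60,000-value record
                         __global const range_t *ranges,       // the "from & to" pairs
                         __global peak_t        *peaks,        // num_ranges * max_peaks slots
                         __global uint          *peak_counts,  // one counter per range
                         const uint              max_peaks)
{
    const uint r = get_group_id(0);            // one work-group per range
    const range_t rg = ranges[r];

    // Work-items stride across the range, testing each sample against its neighbours.
    for (uint i = rg.from + 1 + get_local_id(0); i + 1 < rg.to; i += get_local_size(0)) {
        if (record[i] > record[i - 1] && record[i] > record[i + 1]) {
            uint slot = atomic_inc(&peak_counts[r]);
            if (slot < max_peaks) {
                peak_t p;
                p.index = i;
                p.height = record[i];
                peaks[r * max_peaks + slot] = p;
            }
        }
    }
}

The host would launch this with one work-group per range and then read back peak_counts and peaks for each record.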

The upshot is that I'm looking for some (lots of) pointers on where to begin with this. It's totally different from the textbook samples that just run a kernel to add two arrays together, which is the limit of my OpenCL/CUDA knowledge!

Would I still run a kernel to achieve all this? I presume it would have to run indefinitely (while the DA card is acquiring) until stopped by the host program?

How does the kernel know when a new record has "arrived" in memory?

Once the kernel has processed a record, how would it pass all of the peak details back to the "host" program, before moving on to the next record?

Is this even feasible, or suitable for GPU processing?

Thanks in advance


6 Replies
dipak
Big Boss

Here are a couple of good pointers to get started with DirectGMA in OpenCL:

https://www.khronos.org/registry/OpenCL/extensions/amd/cl_amd_bus_addressable_memory.txt

GitHub - ROCm-Developer-Tools/DirectGMA_CL: Simple example showing how to use DGMA in OpenCL

Hopefully the above sample will help you to understand the DGMA data transfer and signaling methods.

P.S. You have been whitelisted.

Thanks.


Hi Dipak, thanks for those links. I've seen one or two similar examples that detail how the memory transfer is performed.

The main area I'm unsure about is how I could create a kernel that runs continuously (or at least for the duration of my data acquisition) and processes each record that arrives via DGMA from the data acquisition card. All the code samples I've seen are very basic, and just run a kernel to do something short-lived (such as adding two arrays together). Given the critical timing and frequency of the data acquisition records arriving (every 30us), I can't simply launch a new kernel to process each one.

Assuming this is possible, my other main question is: after processing a record, how would I pass the results (the found peaks) back to the host program? Is there (say) some kind of eventing mechanism that the kernel can use to inform the host that a record has been processed and the results are available to retrieve from GPU memory?


First, let me clarify a few points.

As per the OpenCL spec, a buffer (except an SVM buffer with atomic support) should not be updated while a kernel is accessing it. OpenCL says that memory consistency for buffer objects shared between enqueued commands is enforced only at synchronization points.

Launching (not just enqueuing) a new kernel and the completion of a running kernel act as synchronization points. No buffer update is allowed while a kernel is running, and similarly, any update made inside a kernel is not guaranteed to be visible until the kernel finishes. Hence a long-running kernel is not a feasible option if you want to update or access the buffer contents during kernel execution. Also, on some systems a long-running kernel may produce side effects such as GUI freezes or even a driver crash.

I have a suggestion, though. Instead of one DGMA buffer, you can use multiple DGMA buffers (they can be of different sizes). Assuming the application uses two buffers and they can be processed independently, a typical call sequence is:

  1. Enqueue a write to buffer-A.
  2. Once the write is done, enqueue a write to buffer-B and enqueue kernel-A with buffer-A.
  3. Once kernel-A finishes, enqueue a read to get the result generated by kernel-A. Meanwhile, if the write to buffer-B is done, enqueue kernel-B with buffer-B and re-use buffer-A to enqueue another write. Once result-A is available on the host, process it as required (asynchronously, if possible).
  4. Swap A and B and repeat the previous step.

The commands should be enqueued asynchronously, and event objects can be used to form a chain of dependencies; the sketch below shows one way to chain them.
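A minimal host-side sketch of that sequence, assuming an out-of-order queue, kernels whose arguments have already been set, and ordinary clEnqueueWriteBuffer calls standing in for the DGMA transfers (error checking omitted):

#include <CL/cl.h>

/* Process two independent records with a ping-pong of buffers A and B. */
void process_pair(cl_command_queue q,
                  cl_kernel kernelA, cl_kernel kernelB,
                  cl_mem bufA, cl_mem bufB,
                  cl_mem resA, cl_mem resB,
                  const void *recA, const void *recB,
                  void *hostResA, void *hostResB,
                  size_t record_bytes, size_t result_bytes,
                  size_t global_size)
{
    cl_event writeA, writeB, runA, runB, readA, readB;

    /* 1. write record A */
    clEnqueueWriteBuffer(q, bufA, CL_FALSE, 0, record_bytes, recA, 0, NULL, &writeA);

    /* 2. write record B; kernel-A depends only on write-A */
    clEnqueueWriteBuffer(q, bufB, CL_FALSE, 0, record_bytes, recB, 0, NULL, &writeB);
    clEnqueueNDRangeKernel(q, kernelA, 1, NULL, &global_size, NULL, 1, &writeA, &runA);

    /* 3. read result-A once kernel-A finishes; kernel-B depends only on write-B */
    clEnqueueReadBuffer(q, resA, CL_FALSE, 0, result_bytes, hostResA, 1, &runA, &readA);
    clEnqueueNDRangeKernel(q, kernelB, 1, NULL, &global_size, NULL, 1, &writeB, &runB);

    /* 4. read result-B, then drain */
    clEnqueueReadBuffer(q, resB, CL_FALSE, 0, result_bytes, hostResB, 1, &runB, &readB);

    cl_event done[2] = { readA, readB };
    clWaitForEvents(2, done);

    clReleaseEvent(writeA); clReleaseEvent(writeB);
    clReleaseEvent(runA);   clReleaseEvent(runB);
    clReleaseEvent(readA);  clReleaseEvent(readB);
}

With an in-order queue everything runs serially; the overlap between the buffer-B write and kernel-A requires either an out-of-order queue (as assumed here) or separate command queues.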

Note:

Kernel-A and Kernel-B may share the same kernel code; I have marked them separately only to indicate different launch parameters and kernel arguments.

Regarding your question about how to pass information from the kernel to the host: OpenCL 2.0 introduced fine-grained SVM (Shared Virtual Memory) with atomic support, which allows the same memory to be accessed atomically by the host and device(s). This type of buffer, together with the related atomic functions, can be used to pass some data between the host and device(s) while a kernel is running.
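A minimal sketch of what that could look like on a device that reports CL_DEVICE_SVM_FINE_GRAIN_BUFFER and CL_DEVICE_SVM_ATOMICS; the "ready flag" protocol itself is just an illustration, not part of the OpenCL API:

#include <CL/cl.h>
#include <stdatomic.h>

/* Allocate a flag that both the host and a running kernel can access atomically. */
volatile atomic_int *alloc_ready_flag(cl_context ctx)
{
    return (volatile atomic_int *)clSVMAlloc(
        ctx,
        CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS,
        sizeof(atomic_int),
        0 /* default alignment */);
}

/* The pointer is passed to the kernel with clSetKernelArgSVMPointer(kernel, n, flag).
 * Inside the kernel the device would do something like:
 *   atomic_store_explicit(flag, 1, memory_order_release, memory_scope_all_svm_devices);
 * and the host can poll it while the kernel is still running: */
void wait_until_ready(volatile atomic_int *flag)
{
    while (atomic_load_explicit(flag, memory_order_acquire) == 0)
        ;  /* spin; a real application would sleep or yield instead */
}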

Thanks, that makes more sense. I was looking at all this the wrong way, thinking that I could have a kernel that (somehow) kept running for the duration of the acquisition. I've since received an example program from the data acquisition card vendor that does something similar to what you suggest. It's very simple, but it enqueues a set of commands for each record: 'clEnqueueWaitSignalAMD' followed by 'clEnqueueCopyBuffer' (although the copy could just as easily be our kernel).
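Roughly like this, if I've understood the extension correctly (the buffer names and marker value are placeholders, and on some setups the extension entry point may have to be fetched via clGetExtensionFunctionAddressForPlatform rather than called directly):

#include <CL/cl.h>
#include <CL/cl_ext.h>   /* cl_amd_bus_addressable_memory */

void enqueue_one_record(cl_command_queue q,
                        cl_mem dgma_buf,     /* created with CL_MEM_BUS_ADDRESSABLE_AMD */
                        cl_mem dest_buf,
                        cl_uint marker,      /* value the DA card writes once the record has landed */
                        size_t record_bytes)
{
    cl_event signalled, copied;

    /* Blocks this command (not the host thread) until the card signals the marker. */
    clEnqueueWaitSignalAMD(q, dgma_buf, marker, 0, NULL, &signalled);

    /* Then copy the freshly written record; this could equally be our peak-detect kernel. */
    clEnqueueCopyBuffer(q, dgma_buf, dest_buf, 0, 0, record_bytes, 1, &signalled, &copied);

    clReleaseEvent(signalled);
    clReleaseEvent(copied);
}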

In their example they know in advance how many records they'll be acquiring, so they just enqueue that number of commands up-front. In our scenario we will be acquiring indefinitely, so we can't do this. How could we deal with that? I'm vaguely aware of the concept of callbacks in OpenCL, so I'm guessing I could use them to be informed when a kernel has finished running, and then enqueue the next set of commands?

Last question: if something were to go wrong during the data acquisition, I guess the host could sit waiting indefinitely for the remaining GPU command(s) to complete (which they never will, particularly 'clEnqueueWaitSignalAMD'). What is the best way to handle this situation gracefully, i.e. aborting/clearing the queue?


You can execute the above steps inside a loop to process the data block by block. When a block of data is available, enqueue a set of commands and form a chain of dependencies using event objects. Once all the commands associated with a block finish (i.e. the last event object in the chain completes), enqueue another set of commands for the next block. Continue this process for as long as new blocks arrive. If, instead of a single buffer, multiple buffers are used and each represents a separate block of data, then multiple blocks can be processed simultaneously as well. A rough sketch of this loop using an event callback follows below.
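For example, a callback registered on the last event in each chain can kick off the next block. This is only a sketch: enqueue_one_block is an assumed helper (not an OpenCL call) that enqueues the wait/kernel/read chain for one block and returns its final event.

#include <CL/cl.h>
#include <stdbool.h>

extern volatile bool acquisition_running;               /* cleared by the host to stop */
extern cl_event enqueue_one_block(cl_command_queue q);  /* assumed helper */

static void CL_CALLBACK on_block_done(cl_event ev, cl_int status, void *user_data)
{
    cl_command_queue q = (cl_command_queue)user_data;
    clReleaseEvent(ev);

    /* Keep callbacks short; the re-enqueue could instead be handed to a host
     * thread that this callback merely wakes up. */
    if (status == CL_COMPLETE && acquisition_running) {
        cl_event next = enqueue_one_block(q);
        clSetEventCallback(next, CL_COMPLETE, on_block_done, q);
    }
}

void start_pipeline(cl_command_queue q)
{
    cl_event first = enqueue_one_block(q);
    clSetEventCallback(first, CL_COMPLETE, on_block_done, q);
}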

Regarding your last question, I would refer you to the link below, which describes how an event object can be used to check for a command failure and what the consequences are.

https://www.khronos.org/registry/OpenCL/specs/opencl-1.2.pdf#page=179
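For example, the execution status of any event in the chain can be polled; a negative value means the command has terminated abnormally. How you then abort (typically by releasing the queue/context and rebuilding) is application-specific; this is only a sketch:

#include <CL/cl.h>
#include <stdio.h>

int command_failed(cl_event ev)
{
    cl_int status = 0;
    clGetEventInfo(ev, CL_EVENT_COMMAND_EXECUTION_STATUS,
                   sizeof(status), &status, NULL);

    if (status < 0) {                  /* negative value = error code */
        fprintf(stderr, "command failed with status %d\n", status);
        return 1;
    }
    return 0;                          /* CL_QUEUED / CL_SUBMITTED / CL_RUNNING / CL_COMPLETE */
}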

tim_reago
Journeyman III

A while back, I used Mantle (built on DirectX socket) to test out DMA to and from the GPU. On an old GDDR5 video card, in 1-megabyte chunks, it ran more than twice as fast as DMA to system RAM. I was able to transfer from SSD to GPU and then from GPU to CPU faster than from SSD directly to CPU.

Tim Reago
