I've been tasked with solving a problem that feels like it might be a good fit for a GPU, although I could be wrong...
We have a data acquisition (DA) card that generates nearly 8 GB/s, typically in the form of a 240 kB "record" (60,000 x 32-bit values) every 30 microseconds. A data acquisition "run" can last for a few milliseconds or for many seconds. The DA card supports peer-to-peer transfer, so my initial thought is to write the records straight to graphics card memory, where they can be processed by the GPU. The graphics card is a Radeon Pro WX7100. (The above is our "ideal" throughput, but it can be reduced if it's likely to be too much for the PCIe bus or the GPU: we could drop the record rate to one every 60 µs or even 120 µs.)
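To make those figures concrete, here is a rough back-of-envelope check in Python. The constant and function names are only illustrative, and the numbers are taken directly from the description above (60,000 values of 4 bytes each, one record every 30, 60 or 120 µs).

```python
# Back-of-envelope data rates for the record sizes and periods described above.
# All names here are illustrative only.

VALUES_PER_RECORD = 60_000   # samples in one record
BYTES_PER_VALUE = 4          # each sample is a 32-bit value


def record_bytes() -> int:
    """Size of one record in bytes."""
    return VALUES_PER_RECORD * BYTES_PER_VALUE


def throughput_gb_per_s(period_us: float) -> float:
    """Sustained data rate in GB/s (10^9 bytes/s) for one record per period.

    bytes per microsecond equals MB/s, so dividing by 1000 gives GB/s.
    """
    return record_bytes() / period_us / 1000


if __name__ == "__main__":
    print(f"record size: {record_bytes()} bytes")
    for period_us in (30, 60, 120):
        rate = throughput_gb_per_s(period_us)
        print(f"one record every {period_us} us: {rate:.1f} GB/s")
```

So halving the record rate halves the sustained bandwidth the bus and GPU have to cope with, which is the fallback I have in mind if the full rate turns out to be unrealistic.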
The data processing will involve extracting the sections of each record that we are interested in (typically 10-20% of the overall record). To do this we would need to pass the GPU a series of "from & to" ranges specifying which sections of the record we want to look at (e.g. "50-175", "1675-1920", "5700-5780", etc). Within each section we then want to do a simple peak detect, returning the details of each peak found (peak height, width, etc) to the host program. The number of ranges will vary (anything from 1 to 20), and the ranges will differ in width.
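As a CPU-side reference for what I mean, here is a minimal Python sketch of the per-range peak detect. Several things in it are assumptions made purely for illustration: the ranges are treated as inclusive sample indices, a "peak" is taken to be a contiguous run of samples above a fixed threshold, "width" is the length of that run, and the names `Peak` and `find_peaks` are hypothetical. The real peak criteria may well differ.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Peak(NamedTuple):
    position: int  # index of the highest sample in the peak
    height: int    # value of that sample
    width: int     # number of contiguous samples above the threshold


def find_peaks(record: Sequence[int],
               ranges: Sequence[Tuple[int, int]],
               threshold: int) -> List[Peak]:
    """Scan only the given (start, end) ranges, inclusive, for peaks.

    Assumption: a peak is a contiguous run of samples above `threshold`.
    """
    peaks: List[Peak] = []
    for start, end in ranges:
        run_start = None  # index where the current above-threshold run began
        for i in range(start, end + 1):
            above = record[i] > threshold
            if above and run_start is None:
                run_start = i
            # Close the run when the signal drops, or at the end of the range.
            if run_start is not None and (not above or i == end):
                run_end = i if above else i - 1
                segment = list(record[run_start:run_end + 1])
                height = max(segment)
                position = run_start + segment.index(height)
                peaks.append(Peak(position, height, run_end - run_start + 1))
                run_start = None
    return peaks


if __name__ == "__main__":
    # Tiny made-up record, just to exercise the function.
    record = [0] * 20
    record[3:6] = [2, 5, 3]
    record[12] = 9
    for peak in find_peaks(record, [(0, 7), (10, 15)], threshold=1):
        print(peak)
```

Each range is independent of the others, which is part of why this feels like it could map onto a GPU (for instance one work-group per range), but that mapping is exactly the part I'm unsure about.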
The upshot is that I'm looking for some (lots of) pointers on where to begin with this. It's totally different from the textbook samples that just run a kernel to add two arrays together, which is the limit of my OpenCL/CUDA knowledge!
Would I still run a kernel to achieve all this? I presume it would have to run indefinitely (while the DA card is acquiring) until stopped by the host program?
How does the kernel know when a new record has "arrived" in memory?
Once the kernel has processed a record, how would it pass all of the peak details back to the "host" program, before moving on to the next record?
Is this even feasible, or suitable for GPU processing?
Thanks in advance