A couple of years ago we implemented an OpenCL solution that acquires and processes scientific data via DirectGMA. It works, but I suspect that is partly down to luck, given our lack of OpenCL experience at the time. Now that we have some OpenCL experience under our belts, I'm wondering whether we can rewrite it to be simpler or more efficient. What follows is an explanation of the process; I would appreciate feedback on how it could be improved.
Our scientific instruments are connected to a PC fitted with a high-speed data acquisition card and a Radeon Pro WX7100. The acq card acquires data at up to 2 gigasamples/sec; this data is collected into "records" and sent to the GPU via DirectGMA. Acquiring a single record typically takes 10-30 microseconds, depending on various user-configurable settings. Records may contain from 6,000 to 40,000 16-bit int values (approximately 12-80 KB), but the record length is always the same within a particular "run". An acquisition run can last from several seconds to several hours.
The acq card can be configured to send multiple records in each DMA buffer, rather than one record at a time, to optimise data transfer.
The acq card repeatedly acquires records and sends them to the GPU via DirectGMA. After each clEnqueueWaitSignalAMD command, we run a kernel that simply copies the records from the DMA buffer into what we call the "working buffer". Once the required number of records has been collected (user-configurable: as low as 100 or as high as 30,000), they are processed by a series of kernels. These perform various analyses on all that data, writing a small number of "results" to a buffer that is then read back by the client software (clEnqueueReadBuffer) and written to disk.
The process then repeats, acquiring and storing further DMA buffers, running the processing kernels, and so on, until the end of the acquisition "run".
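For reference, the host-side loop is shaped roughly like this (a simplified sketch, not our actual code: error checks omitted; the queue, DMA buffers and kernels are set up elsewhere; `next_dma_buffer_index`, `run_processing_kernels` and the other identifiers are stand-ins; clEnqueueWaitSignalAMD is from the cl_amd_bus_addressable_memory extension):

```c
/* Sketch of the current single-queue loop. */
for (;;) {
    size_t collected = 0;
    while (collected < records_required) {          /* 100 .. 30,000 records */
        int i = next_dma_buffer_index();
        /* wait for the acq card to signal that this DMA buffer is full */
        clEnqueueWaitSignalAMD(queue, dma_buf[i], signal_value++, 0, NULL, NULL);
        /* append this batch of records to the working buffer */
        clSetKernelArg(copy_kernel, 0, sizeof(cl_mem), &dma_buf[i]);
        clSetKernelArg(copy_kernel, 1, sizeof(cl_mem), &working_buf);
        clEnqueueNDRangeKernel(queue, copy_kernel, 1, NULL, &gws, NULL,
                               0, NULL, NULL);
        collected += records_per_dma_buffer;
    }
    /* ~200 ms of analysis kernels, then the small results read-back */
    run_processing_kernels(queue);
    clEnqueueReadBuffer(queue, results_buf, CL_TRUE, 0, results_size,
                        results, 0, NULL, NULL);
}
```

Everything goes through the one in-order queue, so each command only starts after the previous one completes.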
Currently we do everything with a single OpenCL queue, but it seems to defy logic as to how it actually works, so I guess there's some part of the process we don't understand! Taking one configuration as an example: the acq card sends 80 records to the GPU in each DMA buffer, and it takes 2 ms for the acq card to acquire those 80 records. The GPU control software repeatedly goes around its "loop", calling clEnqueueWaitSignalAMD and running the "copy" kernel, until it has collected 30,000 records in the working buffer (375 times around the loop). At that point it drops into an "if" block where it executes the processing kernels, before returning to the start of the loop and repeating the whole process. Profiling shows the total time to run the processing kernels is 200 ms, yet we never miss any DMA buffers. It's unclear why, given that the software/GPU is unable to service further DMA buffers during those 200 ms. Perhaps the DMA buffers are buffered somewhere until the next clEnqueueWaitSignalAMD can be issued?
Anyway, I'm wondering whether this whole process could be improved, e.g. by using two queues. I was thinking of having one queue just for repeatedly running the DMA wait and "copy" kernel; after the correct number of iterations, it would launch the "processing" kernels on a second queue, freeing up the first queue to continue dealing with further DMA buffers. What I'm not clear on is how/where you would "wait" (e.g. clFinish, or a "blocking" command), if at all.
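To make the question concrete, this is the sort of shape I have in mind (just a sketch, not working code: error checks omitted; `enqueue_processing_kernels` is a hypothetical helper that gives its first kernel the wait list and returns an event from its last kernel; and I'm assuming a double-buffered working buffer so the next batch of copies can't overwrite data still being processed):

```c
/* Two-queue sketch: acq_queue for DMA waits + copies, proc_queue for analysis. */
cl_event copy_done, proc_done[2] = {NULL, NULL}, read_done[2] = {NULL, NULL};
int wb = 0;                      /* ping-pong index into working_buf[2] */

for (;;) {
    /* Don't refill working_buf[wb] while its previous processing is
     * still running on proc_queue. */
    if (proc_done[wb] != NULL) {
        clEnqueueBarrierWithWaitList(acq_queue, 1, &proc_done[wb], NULL);
        clReleaseEvent(proc_done[wb]);
        proc_done[wb] = NULL;
    }

    /* Queue 1: DMA waits and copy kernels only. */
    for (int n = 0; n < loops_per_batch; n++) {
        int i = next_dma_buffer_index();
        clEnqueueWaitSignalAMD(acq_queue, dma_buf[i], signal_value++, 0, NULL, NULL);
        clSetKernelArg(copy_kernel, 0, sizeof(cl_mem), &dma_buf[i]);
        clSetKernelArg(copy_kernel, 1, sizeof(cl_mem), &working_buf[wb]);
        clEnqueueNDRangeKernel(acq_queue, copy_kernel, 1, NULL, &gws, NULL,
                               0, NULL, (n == loops_per_batch - 1) ? &copy_done : NULL);
    }

    /* Queue 2: processing starts once the last copy finishes -- the event
     * replaces any clFinish, and acq_queue stays free for further DMA buffers. */
    enqueue_processing_kernels(proc_queue, working_buf[wb], 1, &copy_done, &proc_done[wb]);
    clReleaseEvent(copy_done);

    /* Non-blocking read: the host loops straight round to enqueue the next
     * batch of DMA waits, and only waits on read_done[wb] when it is about
     * to write these results to disk. */
    if (read_done[wb] != NULL) clReleaseEvent(read_done[wb]);
    clEnqueueReadBuffer(proc_queue, results_buf, CL_FALSE, 0, results_size,
                        results[wb], 0, NULL, &read_done[wb]);

    wb ^= 1;                     /* next batch fills the other working buffer */
}
```

If the numbers above hold (200 ms of processing against 750 ms to acquire the next batch), the barrier on acq_queue should never actually stall, and the only host-side wait left is clWaitForEvents on read_done[wb] before consuming the results. Does that look like a sensible structure, or is there a better pattern?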