I have a PCI data acquisition card that supports P2P. It will be capturing records one after the other at a very rapid rate, and the plan is to write each record to the GPU using DirectGMA, where a kernel will process the data. I can't handle the records sequentially as there is no time between records for the kernel to run. Instead, I was thinking of having two device buffers that the PCI card would alternately write to. After a record arrives in the first buffer I would run the kernel on this data, while at the same time start a wait for data to arrive in the second buffer. Once this data has arrived, I would run the kernel on this, start a wait on the first buffer, and so on.
I'm not familiar with out of order queues and CL events, so I'm looking for some suggestions on how I could achieve this. This is my attempt so far (I haven't included the reading of results back to the host, as this should be trivial):
1. clEnqueueWaitSignalAMD (buffer A)
2. clWaitForEvents (step #1 to complete)
3. clEnqueueNDRangeKernel (arg = buffer A)
4. clEnqueueWaitSignalAMD (buffer B)
5. clWaitForEvents (step #4 to complete)
6. clEnqueueNDRangeKernel (arg = buffer B)
7. Go back to #1
With an out of order queue I'm assuming 3 & 4 will happen "at the same time" (ditto for 6 & 1), however I think this is flawed: at step 5 the host waits on the buffer B wait signal to complete before proceeding, so there is an assumption here that this will always take longer than the kernel to run (#3), but what if it doesn't? Do I need to cater for this in some way, and if so how? Or does this whole pattern rely on the kernel taking less time than a record write, otherwise the program wouldn't be able to keep up with the data throughput?!
Am I on the right lines here? Any pointers would be greatly appreciated.