13 Replies Latest reply on Nov 29, 2018 1:57 AM by dipak

    Processing two buffers using an out of order queue

    andyste1

      I have a PCI data acquisition card that supports P2P. It will be capturing records one after the other at a very rapid rate, and the plan is to write each record to the GPU using DirectGMA, where a kernel will process the data. I can't handle the records sequentially as there is no time between records for the kernel to run. Instead, I was thinking of having two device buffers that the PCI card would alternately write to. After a record arrives in the first buffer I would run the kernel on this data, while at the same time start a wait for data to arrive in the second buffer. Once this data has arrived, I would run the kernel on this, start a wait on the first buffer, and so on.

       

      I'm not familiar with out of order queues and CL events, so I'm looking for some suggestions on how I could achieve this. This is my attempt so far (I haven't included the reading of results back to the host, as this should be trivial):

       

      1. clEnqueueWaitSignalAMD (buffer A)

      2. clWaitForEvents (step #1 to complete)

       

      3. clEnqueueNDRangeKernel (arg = buffer A)

      4. clEnqueueWaitSignalAMD (buffer B)

       

      5. clWaitForEvents (step #4 to complete)

      6. clEnqueueNDRangeKernel (arg = buffer B)

       

      7. Go back to #1

       

      With an out of order queue I'm assuming 3 & 4 will happen "at the same time" (ditto for 6 & 1), however I think this is flawed: at step 5 the host waits on the buffer B wait signal to complete before proceeding, so there is an assumption here that this will always take longer than the kernel to run (#3), but what if it doesn't? Do I need to cater for this in some way, and if so how? Or does this whole pattern rely on the kernel taking less time than a record write, otherwise the program wouldn't be able to keep up with the data throughput?!

       

      Am I on the right lines here? Any pointers would be greatly appreciated.

        • Re: Processing two buffers using an out of order queue
          dipak

          Currently, AMD's OpenCL runtime does not support out-of-order queue on host side; each host-side queue executes commands in-order fashion (please check "queue property" in the clinfo output). If out-of-order command execution is required, use two or more queues to submit the commands.

           

          Thanks.

            • Re: Processing two buffers using an out of order queue
              andyste1

              I assume that by using two queues (one per buffer), I would just run the commands in sequence on each queue: clEnqueueWaitSignalAMD(), clEnqueueNDRangeKernel(), clEnqueueReadBuffer(). When the latter completes I just enqueue the same three commands again.

               

              Does this mean I have to manage each queue in its own host thread? I can't see how else to independently wait for each queue's clEnqueueReadBuffer() command to complete.

                • Re: Processing two buffers using an out of order queue
                  dipak

                  Using a separate thread for each queue (or buffer) might be a good option to independently process each buffer. If single thread is used, then there will be some kind of ordering between the two dependency chains. For example, a typical single thread approach may look like below.

                   

                  while(1)

                  {

                  // set a dependency chain for bufferA so that processing of bufferA can start once GMA write to bufferA completes

                  clEnqueueWaitSignalAMD(queueA, bufferA, marker++, e1)

                  clEnqueueNDRangeKernel(queueA, bufferA, 1, e1, e2)

                  clEnqueueReadBuffer(queueA, bufferA, hostBufferA, 1, e2, e3)

                  clFlush(queueA) // submit all the commands without blocking the host thread

                   

                  // once GMA write to bufferA completes, bufferB can be used for writing

                  // now, set a similar dependency chain for bufferB so that processing of bufferB can start once GMA write to bufferB completes

                  clEnqueueWaitSignalAMD(queueB, bufferB, marker++, e4)

                  clEnqueueNDRangeKernel(queueB, bufferB, 1, e4, e5)

                  clEnqueueReadBuffer(queueB, bufferB, hostBufferB, 1, e5, e6)

                  clFlush(queueB) // submit all the commands without blocking the host thread

                   

                  clWaitForEvents(e3) // wait for bufferA to complete (blocking call)

                  readyBufferA = true; // at this moment, bufferA is ready for GMA write once again; send a signal to GMA writer [Note, GMA writer should not use the bufferA until it gets this signal]

                  host_Process(hostBufferA) // also, a new thread can be launched to process the host buffer in parallel

                   

                  clWaitForEvents(e6) // wait for bufferB to complete (blocking call)

                  readyBufferB = true; // at this moment, bufferB is ready for GMA write once again; send a signal to GMA writer [note, GMA writer should not use the bufferB until it gets this signal]

                  host_Process(hostBufferB) // also, a new thread can be launched to process the host buffer in parallel

                  }

                   

                  Note: some of the events can be omitted when a in-order queue is used.

                   

                  Thanks.

                  1 of 1 people found this helpful