0 Replies Latest reply on Sep 13, 2016 8:14 PM by tugrul_512bit

    OpenCL 1.2: "3 in-order queues with events" versus "1 out-of-order queue with barriers" (an R7-240 pipeline example) — performance and event latency

    tugrul_512bit

      Edit:

       

      Not a solution, but it's better now. Instead of the vertical pipelining with 3 queues (each queue dedicated to one type of operation) from the original question, I used 4 queues, each issuing its own read+compute+write operations, and dropped all events. The overlapping was much better, with no more holes:

       

      pipeline4.png

       

      but I fear this relies on the GPU's ACEs and may not be performance portable (e.g. to Nvidia, older AMD, or Intel chips).

       

      I could solve this by measuring timings and choosing the proper pipeline shape for the next iteration, but that would break the multi-GPU load balancer's self-adjustment if the timings change wildly. Maybe I should pause load balancing while choosing the pipeline shape?

       

      Bonus question: since this GPU has 5 compute units, should I avoid using more than 5 queues because it cannot run that many kernels at the same time — or am I wrong, and it can context-switch in nanoseconds and serve an enormous number of kernel executions from an enormous number of queues? In OpenCL, I couldn't find a way to query the maximum (or preferred) number of queues.

       

      Thank you.

       

       

      Original question:

       

      pipeline2.png

       

      See the holes between commands in the queues? How can I decrease those latencies (they come from events)?

       

      I have an R7-240 low-profile card. If the latencies don't originate from the events, maybe this GPU cannot do read+compute+write at the same time?

       

      The event chaining is like this:

      1-) write

      2-) write,   3-) compute after waiting for (1)

      4-) write after waiting for (3),   5-) compute after waiting for (2),   6-) read after waiting for (3 and 2)

      7-) write after waiting for (5 and 6),   8-) compute after waiting for (4 and 6),   9-) read after waiting for (4 and 5)

      ....

      so each command waits on up to two events before executing. When I simplify it to one event per command, the scheduler doesn't overlap the read+compute+write operations (see the picture below)

       

      pipeline3.png

      and it becomes slower.

       

      So I had to do the overlapping explicitly, using 3 in-order queues with the help of 2 events per command.

       

      Is out-of-order execution doable on an AMD GPU? Does it behave well on other vendors' hardware?

       

      Events are purely device-side, aren't they? Those empty stretches in the timelines become apparent with greater pipeline depth or with more kernels enqueued using events. The R7-240 has two asynchronous compute engines (ACEs) on the GPU. Won't these two modules help OpenCL?

       

      Thank you for your time.

       

       

      Here are the ACE modules, visible in the upper-left corner:

       

      chip1.jpg

       

      Message was edited by: Huseyin Tugrul BUYUKISIK 10 minutes after question creation