
OpenCL 1.2: "3 in-order queues with events" versus "1 out-of-order queue with barriers" (an R7-240 pipeline example), performance and event latency

Question asked by tugrul_512bit on Sep 11, 2016

Edit:

 

Not a solution, but it's better now. Instead of vertical pipelining with 3 queues (each queue dedicated to one type of operation) as in the original question, I used 4 queues, each carrying read+compute+write operations, and dropped all events. The overlap was much better, with no more holes:

 

pipeline4.png
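To reason about why the 4-queue layout described above can overlap better than a single serial queue, here is a minimal toy model in Python. It makes no OpenCL calls and is not a measurement of the R7-240. It assumes (hypothetically) one upload engine, one compute engine and one download engine, each running one command at a time, with every command costing one time unit and no driver or event latency. The function name `makespan` and the round-robin chunk assignment are my own choices for illustration.

```python
def makespan(num_queues, num_chunks, cost=None):
    """Toy model: finish time when chunks are dealt round-robin to in-order queues."""
    # Assumed unit costs; real transfer and kernel times will differ.
    cost = cost or {"write": 1.0, "compute": 1.0, "read": 1.0}

    # Each queue holds write -> compute -> read for every chunk assigned to it.
    queues = [[] for _ in range(num_queues)]
    for chunk in range(num_chunks):
        queues[chunk % num_queues].extend(["write", "compute", "read"])

    engine_free = {kind: 0.0 for kind in cost}  # assumed: one engine per command kind
    queue_free = [0.0] * num_queues             # in-order: previous command must finish
    position = [0] * num_queues                 # index of the next command per queue
    finish = 0.0

    while True:
        # Pick the pending command that can start earliest (lowest queue wins ties).
        best = None
        for q, commands in enumerate(queues):
            if position[q] == len(commands):
                continue
            kind = commands[position[q]]
            start = max(queue_free[q], engine_free[kind])
            if best is None or start < best[0]:
                best = (start, q, kind)
        if best is None:
            return finish  # every queue is drained

        start, q, kind = best
        end = start + cost[kind]
        engine_free[kind] = end
        queue_free[q] = end
        position[q] += 1
        finish = max(finish, end)


print("1 queue :", makespan(1, 4))
print("4 queues:", makespan(4, 4))
```

Under these assumptions, 4 chunks take 12 time units in one queue and 6 in four queues, which is the fully overlapped case. Whether real hardware behaves this way depends on how many independent engines the driver actually exposes, which is exactly the portability concern.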

 

But I fear this relies on the GPU's ACEs and may not be performance-portable to other hardware (such as NVIDIA, older AMD, and Intel chips).

 

I could solve this by measuring timings and choosing the proper pipeline shape in the next iteration, but that would break the multi-GPU load balancer's self-adjustment if the timings change wildly. Maybe I should pause load balancing while choosing the pipeline shape?

 

Bonus question: since this GPU has 5 compute units, should I avoid using more than 5 queues because it cannot run that many kernels at the same time? Or am I wrong, and it can context-switch in nanoseconds and serve 123123124 kernel executions using 12312312 queues? In OpenCL, I couldn't find a way to query the maximum (or preferred) number of queues.

 

Thank you.

 

 

Original question:

 

pipeline2.png

 

See the holes between commands in the queues? How can I reduce those latencies (they appear to be event latencies)?

 

I have an R7-240 low-profile card. If the latencies do not originate from events, maybe this GPU cannot do read+compute+write at the same time?

 

The event chaining looks like this:

 

1-) write

2-) write,   3-) compute after waiting for (1)

4-) write after waiting for (3),   5-) compute after waiting for (2),   6-) read after waiting for (3 and 2)

7-) write after waiting for (5 and 6),   8-) compute after waiting for (4 and 6),   9-) read after waiting for (4 and 5)

...
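The chain above follows a regular pattern: each command waits on the previous step's commands in the other two queues, while ordering inside a queue is implicit because the queues are in-order. A minimal Python sketch of that pattern follows. It only models the dependency graph and makes no OpenCL calls. The function name `build_wait_lists` and the numbering scheme are assumptions chosen to reproduce the list above, and the pipeline drain at the end is not modeled.

```python
def build_wait_lists(steps):
    """Return {command_number: (kind, wait_list)} for a 3-queue pipeline.

    Step s (1-based) issues a write for chunk s, a compute for chunk s-1 and
    a read for chunk s-2, as in the numbered list above. Each command waits
    on the previous step's commands in the *other* two queues.
    """
    kinds = ("write", "compute", "read")
    commands = {}   # command number -> (kind, sorted wait list)
    previous = {}   # kind -> command number issued in the previous step
    next_id = 1
    for step in range(1, steps + 1):
        current = {}
        for lag, kind in enumerate(kinds):
            if step - lag < 1:
                continue  # pipeline not yet filled for this stage
            # Same-queue ordering is implicit, so skip the command's own kind.
            waits = sorted(n for k, n in previous.items() if k != kind)
            commands[next_id] = (kind, waits)
            current[kind] = next_id
            next_id += 1
        previous = current
    return commands


for number, (kind, waits) in build_wait_lists(4).items():
    print(number, kind, "waits for", waits or "nothing")
```

In real host code, each wait list would become the event wait list passed to the corresponding enqueue call, and the event returned by that call would be kept for the next step.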

 

So each command waits on two events before executing. When I simplify this to 1 event per command, the scheduler doesn't overlap the read+compute+write operations (see the picture below).

 

pipeline3.png

and it becomes slower.

 

So I had to do explicit overlapping using 3 in-order queues with the help of 2 events per command.

 

Is out-of-order execution doable on an AMD GPU? Is it friendly with other vendors?

 

Events are purely device-side, aren't they? Those empty gaps in the timeline become more apparent with greater pipeline depth or with more enqueued kernels using events. The R7-240 has two asynchronous compute engines (ACEs) in the GPU. Won't these two modules help OpenCL?

 

Thank you for your time.

 

 

Here are the ACE modules, visible in the upper left corner:

 

chip1.jpg

 

Message was edited by: Huseyin Tugrul BUYUKISIK 10 minutes after question creation
