Archives Discussions

tugrul_512bit · ‎09-11-2016

Edit:

Not a solution, but it's better now. Instead of vertical pipelining with 3 queues (each queue dedicated to one type of operation) as in the original question, I used 4 queues, each carrying read+compute+write operations, and dropped all events. The overlap was then much better, with no more holes:

but I fear this relies on the GPU's ACEs and may not be performance-portable (e.g. to Nvidia, older AMD, and Intel chips).
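A minimal sketch of the 4-queue layout, as a plain C++ model rather than real OpenCL host code. The function name `assignChunks`, the round-robin assignment of chunks to queues, and the write -> compute -> read order inside each chunk are my assumptions for illustration; the post only says each queue carries read+compute+write operations with no events. The point is that an in-order queue already serializes the three steps of a chunk, so no event is needed inside a queue, and overlap can only come from different queues running side by side.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical model (not OpenCL API calls): every in-order queue gets the
// complete write -> compute -> read sequence of the chunks assigned to it.
// Chunks are dealt to queues round-robin (an assumption, not from the post).
std::vector<std::vector<std::string>> assignChunks(int numChunks, int numQueues) {
    assert(numChunks >= 0 && numQueues > 0);
    std::vector<std::vector<std::string>> queues(numQueues);
    for (int chunk = 0; chunk < numChunks; ++chunk) {
        std::vector<std::string>& q = queues[chunk % numQueues];
        const std::string tag = std::to_string(chunk);
        q.push_back("write" + tag);    // host -> device copy of this chunk
        q.push_back("compute" + tag);  // kernel working on this chunk
        q.push_back("read" + tag);     // device -> host copy of the result
    }
    return queues;
}
```

Calling `assignChunks(8, 4)` gives four queues with six commands each; queue 1, for example, holds the commands for chunks 1 and 5. In a real host program each string would become one enqueue call on the corresponding command queue, with an empty event wait list.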

I could solve this by measuring timings and choosing a proper pipeline shape in the next iteration, but that would break the multi-GPU load balancer's self-adjustment if the timings change wildly. Maybe I should stop load balancing while choosing the pipeline shape?

Bonus question: since this GPU has 5 compute units, should I avoid using more than 5 queues because it cannot run that many kernels at the same time? Or am I wrong, and it can context-switch in nanoseconds and serve 123123124 kernel executions using 12312312 queues? I couldn't find a way to query the maximum (or preferred) number of queues in OpenCL.

Thank you.

Original question:

See the holes between the commands in the queues? How can I decrease those latencies (I assume they come from the events)?

I have an R7-240 low-profile card. If the latencies are not caused by events, maybe this GPU cannot do read+compute+write at the same time?

Event chaining is like:

1-) write

2-) write, 3-) compute after waiting for (1)

4-) write after waiting for (3), 5-) compute after waiting for (2), 6-) read after waiting for (3 and 2)

7-) write after waiting for (5 and 6), 8-) compute after waiting for (4 and 6), 9-) read after waiting for (4 and 5)

....

So each command waits on two events before it executes. When I simplify this to 1 event per command, the scheduler doesn't overlap the read+compute+write operations (see the picture below), and it becomes slower.

So I had to do explicit overlapping using 3 in-order queues with the help of 2 events per command.
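The event chain listed above seems to follow one rule: each command waits for whatever the other two queues enqueued in the previous pipeline step, while ordering against its own queue comes for free because the queues are in-order. Here is a small C++ model of that rule (not OpenCL host code). The names `Command` and `buildPipeline` are made up for illustration, and the chunk numbering (write handles chunk s, compute chunk s-1, read chunk s-2 in step s) is my assumption. Draining the pipeline at the end is not modeled.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of the three in-order queues, one per operation type.
enum Stage { WRITE = 0, COMPUTE = 1, READ = 2 };

struct Command {
    int id;                  // 1-based number, matching the numbered list above
    Stage stage;             // which queue the command is enqueued on
    int chunk;               // data chunk it works on (assumed numbering)
    std::vector<int> waits;  // ids of the commands whose events it waits on
};

// Builds the commands for `steps` pipeline steps. In step s the write queue
// handles chunk s, the compute queue chunk s-1 and the read queue chunk s-2.
std::vector<Command> buildPipeline(int steps) {
    assert(steps >= 0);
    std::vector<Command> cmds;
    // idAt[s][q] = id of the command queue q issued in step s, 0 if none.
    std::vector<std::vector<int>> idAt(steps, std::vector<int>(3, 0));
    int nextId = 1;
    for (int s = 0; s < steps; ++s) {
        for (int q = 0; q < 3; ++q) {
            const int chunk = s - q;
            if (chunk < 0) continue;  // pipeline is still filling up
            Command c{nextId++, static_cast<Stage>(q), chunk, {}};
            if (s > 0) {
                // Wait for the other two queues' commands of the previous step.
                for (int other = 0; other < 3; ++other) {
                    if (other != q && idAt[s - 1][other] != 0) {
                        c.waits.push_back(idAt[s - 1][other]);
                    }
                }
            }
            idAt[s][q] = c.id;
            cmds.push_back(c);
        }
    }
    return cmds;
}
```

`buildPipeline(4)` yields nine commands whose wait lists match the chain above: command 6 (read) waits on 2 and 3, command 7 (write) on 5 and 6, command 8 (compute) on 4 and 6, and command 9 (read) on 4 and 5. In steady state every command therefore carries exactly two events in its wait list.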

Is out-of-order execution doable on an AMD GPU? Is it friendly with other vendors?

Events are purely device-side, aren't they? Those empty stretches in the timelines become apparent when the pipeline is deeper or when more kernels are enqueued with events. The R7-240 has two asynchronous compute engines (ACEs) in the GPU. Won't these two modules help OpenCL?

Thank you for your time.

Here are the ACE modules; they can be seen in the upper-left corner:

Message was edited by: Huseyin Tugrul BUYUKISIK 10 minutes after question creation

OpenCL 1.2: "3 in-order queues with events" versus "1 out-of-order-queue with barriers" (a R7-240 pipeline example) performance and event latency