Edit:
Not a solution, but it's better now. Instead of doing vertical pipelining with 3 queues (each queue dedicated to one type of operation) as in the original question, I used 4 queues, each issuing read+compute+write operations, and dropped all events. The overlapping was much better, with no more holes.
However, I fear this relies on the GPU's ACEs and may not be performance-portable (for example to Nvidia, older AMD and Intel chips).
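To make the new layout concrete, here is a minimal Python sketch of the enqueue order only. It does not call OpenCL. The name horizontal_pipeline and the round-robin assignment of chunks to queues are assumptions for illustration, not necessarily what my real code does. "write" means a host-to-device upload and "read" means a device-to-host download, in the sense of clEnqueueWriteBuffer and clEnqueueReadBuffer.

```python
def horizontal_pipeline(num_chunks, num_queues=4):
    """Model of the event-free layout: every queue gets whole chunks.

    Each chunk contributes write -> compute -> read to ONE in-order queue,
    so ordering inside a chunk comes from the queue itself, not from events.
    Round-robin chunk assignment is an assumption made for this sketch.
    """
    queues = [[] for _ in range(num_queues)]
    for chunk in range(num_chunks):
        target = queues[chunk % num_queues]
        target.append(("write", chunk))    # upload the chunk's input
        target.append(("compute", chunk))  # run the kernel on that chunk
        target.append(("read", chunk))     # download the chunk's result
    return queues


if __name__ == "__main__":
    for index, commands in enumerate(horizontal_pipeline(8)):
        print("queue", index, commands)
```

Any overlap between different chunks then depends entirely on the driver and hardware running the separate queues concurrently, which is exactly the portability worry above.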
I could solve this by measuring timings and choosing the proper pipeline shape in the next iteration, but this would break the multi-GPU load balancer's self-adjustment if the timings change wildly. Maybe I should stop load balancing while choosing the pipeline shape?
Bonus question: given that this GPU has 5 compute units, should I avoid using more than 5 queues because it cannot run that many kernels at the same time? Or am I wrong, and it can context-switch in nanoseconds and serve 123123124 kernel executions using 12312312 queues? In OpenCL, I couldn't find a way to query the maximum (or preferred) number of queues.
Thank you.
Original question:
See the holes between commands in the queues? How can I reduce those latencies (I think they are caused by the events)?
I have an R7-240 low-profile card. If the latencies are not caused by events, maybe this GPU cannot do read+compute+write at the same time?
Event chaining is like this (each line is one pipeline step):
1-) write
2-) write, 3-) compute after waiting for (1)
4-) write after waiting for (3), 5-) compute after waiting for (2), 6-) read after waiting for (3 and 2)
7-) write after waiting for (5 and 6), 8-) compute after waiting for (4 and 6), 9-) read after waiting for (4 and 5)
....
So each command waits on two events before it executes. When I simplify this to 1 event per command, the scheduler doesn't overlap the read+compute+write operations (see the picture below), and it becomes slower.
So I had to do the overlapping explicitly, using 3 in-order queues with the help of 2 events per command.
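The chaining rule in the list above can be written down as a small generator. This is a minimal Python sketch that only rebuilds the wait-lists, it does not call OpenCL. The function name vertical_pipeline is hypothetical, and the rule "wait for the other two kinds of command from the previous step" is my reading of the numbered list, with same-kind ordering left to the in-order queue.

```python
def vertical_pipeline(num_steps):
    """Return {command_id: (kind, wait_list)} for the 3-queue scheme.

    One in-order queue per kind (write, compute, read). A command waits on
    the events of the OTHER kinds issued in the previous step; ordering
    against its own kind is already given by its in-order queue.
    """
    kinds = ("write", "compute", "read")
    commands = {}
    previous_step = {}
    next_id = 1
    for step in range(1, num_steps + 1):
        current_step = {}
        # Step 1 only has a write, step 2 adds compute, step 3+ has all three.
        for kind in kinds[:min(step, 3)]:
            wait_list = sorted(
                cid for other, cid in previous_step.items() if other != kind
            )
            commands[next_id] = (kind, wait_list)
            current_step[kind] = next_id
            next_id += 1
        previous_step = current_step
    return commands


if __name__ == "__main__":
    for cid, (kind, wait_list) in vertical_pipeline(4).items():
        print(cid, kind, "waits for", wait_list)
```

Once the pipeline is full (step 3 onward), every command has exactly two entries in its wait-list, which is the "2 events per command" mentioned above.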
Is an out-of-order queue doable with an AMD GPU? Does it work well with other vendors?
Events are purely device-side, aren't they? Those empty stretches in the timelines become apparent when the pipeline is deeper or when more kernels are enqueued using events. The R7-240 has two asynchronous compute engines (ACEs) in the GPU. Won't these two modules help OpenCL?
Thank you for your time.
Here are the ACE modules; they can be seen in the upper-left corner:
Message was edited by: Huseyin Tugrul BUYUKISIK 10 minutes after question creation