Interesting... What is the average kernel execution time in your supplied examples?
Perhaps the kernel dependencies are resolved on the host without my being aware of it.
I remember it was around 35 microseconds. I've already sold my cards, though.
Those sound like very short kernels to me, so the gaps should be about the same duration.
A benchmark for assessing concurrent kernel execution effectiveness would be interesting.
The gaps were about 60 microseconds. The kernels took a little longer when run in parallel.
I eventually found that NVIDIA cards didn't have this problem, so I switched to CUDA...
I did a test with 2 contexts running in 2 threads, without any events. One context does only compute x14; the other does only read x7 + write x7, interleaved. Either of them takes about 19 ms alone.
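For what it's worth, the shape of that experiment can be mimicked in plain Python. This is a toy model only: `time.sleep` stands in for kernels and transfers, the counts and timings mirror the numbers above, and nothing here is an actual OpenCL call.

```python
# Toy model of two "contexts" in two threads: one doing 14 compute
# "kernels", one doing 7 reads + 7 writes. Sleeps stand in for GPU work.
import threading
import time

KERNEL_MS = 19 / 14   # 14 compute "kernels" -> ~19 ms total
XFER_MS = 19 / 14     # 7 "reads" + 7 "writes" -> ~19 ms total

def compute_context():
    for _ in range(14):
        time.sleep(KERNEL_MS / 1000)   # one "kernel"

def io_context():
    for _ in range(7):
        time.sleep(XFER_MS / 1000)     # one "read"
        time.sleep(XFER_MS / 1000)     # one "write"

def timed(fn):
    t0 = time.perf_counter()
    fn()
    return (time.perf_counter() - t0) * 1000

# Run the two workloads back to back...
serial_ms = timed(compute_context) + timed(io_context)

# ...then at the same time in two threads.
t0 = time.perf_counter()
threads = [threading.Thread(target=compute_context),
           threading.Thread(target=io_context)]
for t in threads:
    t.start()
for t in threads:
    t.join()
overlap_ms = (time.perf_counter() - t0) * 1000

print(f"serial ~{serial_ms:.0f} ms, overlapped ~{overlap_ms:.0f} ms")
```

On real hardware the overlap is done by the device and driver rather than the OS scheduler, but the wall-clock arithmetic is the same: the overlapped run takes roughly the longer of the two workloads instead of their sum.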
To test overlapped compute and read/write, I ran them at the same time:
As you can see, every read or write always ends at the same time as a compute blob. (The whole workload completes in about 20 ms, so they genuinely ran asynchronously.)
Now the same test, but this time there are only 7 compute blobs instead of 14, each with doubled compute intensity:
Again, the read/write blobs end at the same end-point as a compute blob, and clearly there are fewer points for the read/writes to slot into, so half of the read/write jobs only run, much faster, once the compute alone has finished. (Everything finished in 38 ms.)
The same amount of total compute, but with more work per kernel, leads to lower utilization of the async read/write/compute capability.
Now, to stress the async behavior even more, I added another thread with a 3rd context doing just read/write again:
As you can see, MORE read/writes could fit between the beginning and end of the total compute blob area. (Total work is now 40 ms.)
Finally, my understanding is that the more queues you use, the more async operations happen at the same time, and this is on a lowest-end RX550 GPU. So when you use many queues in parallel, you can hide the latencies of events, computes, and read/writes. I think AMD handles many queues better than Nvidia or Intel.
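The latency-hiding idea can also be sketched as a toy pipeline in plain Python (again, sleeps stand in for transfers and kernels, and `CHUNKS`, `XFER_S`, `COMP_S` are made-up numbers, not measurements). With one queue, each chunk is transferred and then computed serially; with a second "queue" (thread), the next chunk's transfer overlaps the current chunk's compute.

```python
# Toy model: one queue does transfer-then-compute per chunk serially;
# two queues overlap the next transfer with the current compute.
import queue
import threading
import time

CHUNKS = 8
XFER_S = 0.004   # per-chunk "transfer" (made-up number)
COMP_S = 0.004   # per-chunk "compute" (made-up number)

def one_queue():
    t0 = time.perf_counter()
    for _ in range(CHUNKS):
        time.sleep(XFER_S)   # "transfer" on the single queue
        time.sleep(COMP_S)   # "compute" on the same queue
    return time.perf_counter() - t0

def two_queues():
    ready = queue.Queue()
    def transfer_worker():
        for i in range(CHUNKS):
            time.sleep(XFER_S)   # "transfer" chunk i
            ready.put(i)
        ready.put(None)          # sentinel: no more chunks
    t0 = time.perf_counter()
    t = threading.Thread(target=transfer_worker)
    t.start()
    # "Compute queue" consumes chunks as soon as they land.
    while ready.get() is not None:
        time.sleep(COMP_S)
    t.join()
    return time.perf_counter() - t0

serial_s = one_queue()
pipe_s = two_queues()
print(f"one queue: {serial_s*1000:.1f} ms, two queues: {pipe_s*1000:.1f} ms")
```

The single queue costs roughly CHUNKS * (XFER_S + COMP_S), while the pipelined version costs roughly XFER_S + CHUNKS * max(XFER_S, COMP_S) — the transfer latency is hidden behind the compute, which is the same effect as the profiler traces above.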