What I'm doing now follows a pattern like "a large job -> several small jobs -> a large job -> small jobs -> ...". Each small job uses no more than 4 CUs, and they don't depend on each other, so they can run concurrently. But when I run them concurrently, I find that a kernel does not start immediately after the event it waits on is signaled. The delay is quite large; putting them all in one queue is actually much faster.
What's more, I did an experiment. I enqueued one kernel to two queues many times, and it ran like this:
Then I made the kernels in Queue 2 "depend" on kernels in Queue 1, and I got this:
This is... fascinating... Kernels that wait for events are delayed not only after the events are signaled but also after the preceding kernels in the same queue finish. The gaps persist even after all jobs in Queue 1 are done. Why?
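One way to picture the behavior described above is a toy timeline model. This is only a sketch under a stated assumption, not a description of what the driver actually does: it assumes that every kernel enqueued with an event wait list pays a fixed launch latency once both its in-queue predecessor and its event are done. The function names, the 100 microsecond kernel time, and the latency parameter are all hypothetical; the 60 microsecond value simply echoes the gap size reported later in this thread.

```python
def simulate(n, kernel_us, wait_latency_us, q2_waits):
    """Toy model of two in-order queues; returns (start, end) lists in microseconds."""
    q1, q2 = [], []
    t1 = 0.0  # time at which queue 1 becomes free
    t2 = 0.0  # time at which queue 2 becomes free
    for i in range(n):
        # Queue 1 has no wait lists: kernels run back to back.
        q1.append((t1, t1 + kernel_us))
        t1 += kernel_us

        # Queue 2, kernel i: it cannot start before its predecessor finishes.
        ready = t2
        if q2_waits:
            # It also waits for the event of kernel i in queue 1.
            ready = max(ready, q1[i][1])
            # ASSUMPTION: a launch with a wait list pays a fixed extra latency.
            ready += wait_latency_us
        q2.append((ready, ready + kernel_us))
        t2 = ready + kernel_us
    return q1, q2


def gaps(spans):
    """Idle time between each kernel's end and the next kernel's start."""
    return [nxt[0] - prev[1] for prev, nxt in zip(spans, spans[1:])]


if __name__ == "__main__":
    q1, q2 = simulate(n=5, kernel_us=100.0, wait_latency_us=60.0, q2_waits=True)
    print("queue 1 finishes at", q1[-1][1], "us")
    print("queue 2 gaps:", gaps(q2))
```

Under this assumption the gaps in Queue 2 stay the same size even for kernels that start after Queue 1 has finished, which resembles the observation. Whether the real cause is a per-launch cost like this is not established here.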
Interesting... What is the average kernel execution time in your supplied examples?
Perhaps the kernel dependencies are resolved on the host, but I'm not sure.
Those sound like very short kernels to me, so the gaps would be about the same duration.
A benchmark for assessing concurrent kernel execution effectiveness would be interesting.
The gaps were about 60 microseconds. The kernels took a little longer when run in parallel.
I ended up finding that NVIDIA cards didn't have this problem, so I switched to CUDA...
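The thread does not say how the gaps were measured. One plausible route is OpenCL event profiling: with a queue created with `CL_QUEUE_PROFILING_ENABLE`, `clGetEventProfilingInfo` returns `CL_PROFILING_COMMAND_START` and `CL_PROFILING_COMMAND_END` in nanoseconds. The sketch below only does the arithmetic on such timestamps; the helper name and the sample numbers are hypothetical, not data from this thread.

```python
def idle_gaps_us(timestamps_ns):
    """Return idle gaps in microseconds between consecutive kernels of one queue.

    timestamps_ns: list of (start_ns, end_ns) pairs, e.g. the
    CL_PROFILING_COMMAND_START / CL_PROFILING_COMMAND_END values of each event.
    """
    ordered = sorted(timestamps_ns)  # order kernels by start time
    return [
        (next_start - prev_end) / 1000.0  # nanoseconds -> microseconds
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


if __name__ == "__main__":
    # Hypothetical timestamps: three 100 us kernels separated by idle time.
    sample = [(0, 100_000), (160_000, 260_000), (320_000, 420_000)]
    print(idle_gaps_us(sample))
```

A negative value would mean two kernels overlapped in time, which is what you would hope to see when comparing kernels across different queues rather than within one.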
I did a test with 2 contexts running in 2 threads, without any events. One context does only compute (x14). The other context does only reads (x7) and writes (x7), interleaved. Each of them takes about 19 ms alone.
To test overlapped compute and read/write, I ran them at the same time:
As you can see, every read or write ends at the same time as a compute blob. (The whole workload completes in about 20 ms, so they really did run asynchronously.)
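To put a number on that overlap, here is a small sketch of the arithmetic, using the figures quoted above (about 19 ms for each workload alone, about 20 ms together). The function names are made up for illustration, and the result is only as precise as those rounded timings.

```python
def overlap_speedup(solo_times_ms, concurrent_ms):
    """Ratio of back-to-back (serial) time to measured concurrent time."""
    return sum(solo_times_ms) / concurrent_ms


def hidden_ms(solo_times_ms, concurrent_ms):
    """Time saved compared with running the workloads one after another."""
    return sum(solo_times_ms) - concurrent_ms


if __name__ == "__main__":
    solo = [19, 19]   # compute alone, read/write alone (ms, approximate)
    together = 20     # both contexts running at once (ms, approximate)
    print("speedup:", overlap_speedup(solo, together))
    print("hidden time (ms):", hidden_ms(solo, together))
```

A speedup close to 2 for two workloads means that almost all of one workload was hidden behind the other; a value close to 1 would mean they effectively ran one after another.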
Now the same test, but this time with only 7 compute blobs instead of 14, each with doubled compute intensity:
Again, read/write blobs end at the same point as a compute blob, and there are clearly fewer slots for the reads/writes to catch, so half of the read/write jobs complete, much faster, only after the compute finishes. (Everything finished in 38 ms.)
The same total amount of compute with more work per kernel leads to lower utilization of the async read/write/compute capability.
Now, to stress the async behavior even more, I added another thread with a 3rd context doing just reads/writes again:
As can be seen, MORE reads/writes fit between the beginning and end of the total compute area. (Total time is now 40 ms.)
Finally, my understanding is that the more queues you use, the more operations run asynchronously at the same time, and this is with a lowest-end RX550 GPU. So when you use many queues in parallel, you can hide the latencies of events, computes, and reads/writes. I think AMD handles many queues better than Nvidia or Intel.