
bluewanderer
Journeyman III

Efficiency of event

My workload follows a pattern like "a large job -> several small jobs -> a large job -> small jobs -> ...". Each small job uses no more than 4 CUs, and they don't depend on each other, so they can run concurrently. But when I run them concurrently, I find that a kernel does not start immediately when the event it waits on is signaled. The delay is so large that putting them all in one queue is actually much faster.

What's more, I did an experiment: I enqueued the same kernel to two queues many times, and it ran like this:

1.PNG

Then I made the kernels in Queue 2 "depend" on kernels in Queue 1, and I got:

2.PNG

This is... fascinating... Kernels that wait for events are delayed not only after the events are signaled but also after the preceding kernels in the same queue. The gaps persist even after all jobs in Queue 1 are done. Why?
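To make the observation concrete, here is a minimal sketch of a toy timeline model. It is not a description of how the driver actually schedules work; it only assumes that a kernel in Queue 2 which waits on an event starts a fixed gap after both its event and its predecessor in the same queue have finished. The function name `simulate` and its parameters are made up for illustration, and the 35 µs and 60 µs values are the rough figures quoted elsewhere in this thread.

```python
def simulate(n, kernel_us, gap_us, dependent):
    """Return per-kernel (start, end) times in microseconds for two queues.

    Assumed toy model: Queue 1 runs its kernels back to back. If
    `dependent` is True, kernel i in Queue 2 waits on kernel i of
    Queue 1 and starts `gap_us` after both that event and the previous
    Queue 2 kernel have finished. Otherwise Queue 2 also runs back to back.
    """
    q1, q2 = [], []
    t1 = 0.0  # time at which Queue 1 becomes free
    t2 = 0.0  # time at which Queue 2 becomes free
    for i in range(n):
        q1.append((t1, t1 + kernel_us))
        t1 += kernel_us
        if dependent:
            start = max(q1[i][1], t2) + gap_us
        else:
            start = t2
        q2.append((start, start + kernel_us))
        t2 = start + kernel_us
    return q1, q2


KERNEL_US = 35.0  # approximate kernel time reported in the thread
GAP_US = 60.0     # approximate gap reported in the thread
N = 4             # kernels per queue, chosen arbitrarily for the example

_, independent_q2 = simulate(N, KERNEL_US, GAP_US, dependent=False)
q1, dependent_q2 = simulate(N, KERNEL_US, GAP_US, dependent=True)
single_queue_us = 2 * N * KERNEL_US  # all kernels serialized in one queue

print("two independent queues finish at", independent_q2[-1][1], "us")
print("Queue 2 waiting on Queue 1 finishes at", dependent_q2[-1][1], "us")
print("one queue with all kernels finishes at", single_queue_us, "us")
```

Under these assumptions the dependent case finishes at 415 µs, while all eight kernels in a single queue take 280 µs, which matches the impression that one queue is faster. In the model, Queue 1 is done at 140 µs but Queue 2 keeps paying the gap on every kernel afterwards.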

5 Replies
ekondis
Adept II

Interesting... What is the average kernel execution time in your supplied examples?

Perhaps the kernel dependencies are resolved on the host, but I'm not sure about that.

0 Likes

As I remember, it was around 35 microseconds. I've already sold my cards.

0 Likes

Those sound like very short kernels to me, so the gaps should be about the same duration.

A benchmark for assessing concurrent kernel execution effectiveness would be interesting.

0 Likes

The gaps were about 60 microseconds. The kernels took a little longer when run in parallel.

I ended up finding that NVIDIA cards didn't have this problem, so I switched to CUDA...

tugrul_512bit
Adept III

I did a test with 2 contexts running in 2 threads without any events. One context does only compute (x14). The other context does only reads (x7) and writes (x7), interleaved. Either of them takes about 19 ms when run alone.

To test overlapped compute and read/write, I ran them at the same time:

debug1.png

As you can see, every read or write ends at the same time as a compute blob. (The whole workload completes in about 20 ms, so they really did run asynchronously.)
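As a rough check on the numbers above (about 19 ms for each workload alone, about 20 ms together), the amount of overlap can be estimated with a little arithmetic. This is only a sketch: `overlap_fraction` is a made-up helper, and the metric itself (time hidden divided by the shorter standalone time) is one reasonable choice, not something defined by OpenCL or by the profiler.

```python
def overlap_fraction(t_a_ms, t_b_ms, t_both_ms):
    """Estimate how much of the shorter workload was hidden behind the other.

    Returns 1.0 when the combined run takes no longer than the longer
    workload alone, and 0.0 when the combined run takes as long as
    running both workloads one after the other.
    """
    serial_ms = t_a_ms + t_b_ms        # expected time with no overlap
    hidden_ms = serial_ms - t_both_ms  # time saved by running concurrently
    return hidden_ms / min(t_a_ms, t_b_ms)


compute_alone_ms = 19.0   # compute-only context, as reported
transfer_alone_ms = 19.0  # read/write-only context, as reported
both_ms = 20.0            # both contexts running at the same time

frac = overlap_fraction(compute_alone_ms, transfer_alone_ms, both_ms)
print(f"serial estimate: {compute_alone_ms + transfer_alone_ms:.0f} ms")
print(f"measured together: {both_ms:.0f} ms")
print(f"overlap: {frac:.0%}")
```

With these figures, roughly 18 of the 19 ms of transfers were hidden behind compute. The same calculation would apply to the later runs if the standalone times for those configurations were measured; they are not reported here.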

Now the same test, but this time with only 7 compute blobs instead of 14, each with doubled compute intensity:

debug2.png

Again, read/write blobs end at the same point as a compute blob, and clearly there is less room for the reads/writes to fit in, so half of the read/write jobs complete much faster only once the compute has finished. (Everything finished in 38 ms.)

The same total amount of compute with more work per kernel leads to lower utilization of the async read/write/compute capability.

Now, to stress the async behavior even more, I added another thread with a third context that again does only reads/writes:

debug3.png

As can be seen, MORE reads/writes could fit here between the beginning and end of the total compute area. (The total time is now 40 ms.)

Finally, my understanding is that the more queues you use, the more async operations happen at the same time, and this is on a lowest-end RX550 GPU. So when you use many queues in parallel, you can hide the latencies of events, computes and reads/writes. I think AMD handles many queues better than Nvidia or Intel.
