
    Efficiency of event

    bluewanderer

      What I'm doing now follows a pattern like "a large job -> several small jobs -> a large job -> small jobs -> ...". The small jobs each use no more than 4 CUs and don't depend on each other, so they can run concurrently. But when I run them that way, I find that a kernel does not start immediately when the event it waits on is signaled. The delay is quite large; putting everything in one queue is actually much faster.
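
      Roughly, one round of the pattern is enqueued like this (a simplified sketch; the kernel names, work sizes, and queue setup are placeholders I made up, and error checks are omitted):

          #include <CL/cl.h>

          /* One round of "large job -> N small jobs -> next large job".
             Queues, kernels, and work sizes are assumed to be created
             elsewhere; small_done assumes nsmall <= 16. */
          static void enqueue_round(cl_command_queue *queues, int nqueues,
                                    cl_kernel big, cl_kernel big2,
                                    cl_kernel *small, int nsmall,
                                    size_t big_gws, size_t small_gws)
          {
              cl_event big_done, small_done[16];

              /* The large job runs on queue 0. */
              clEnqueueNDRangeKernel(queues[0], big, 1, NULL, &big_gws,
                                     NULL, 0, NULL, &big_done);

              /* The independent small jobs go to separate queues, each
                 gated on the large job's completion event. */
              for (int i = 0; i < nsmall; ++i)
                  clEnqueueNDRangeKernel(queues[i % nqueues], small[i], 1,
                                         NULL, &small_gws, NULL,
                                         1, &big_done, &small_done[i]);

              /* The next large job waits on all the small jobs. */
              clEnqueueNDRangeKernel(queues[0], big2, 1, NULL, &big_gws,
                                     NULL, nsmall, small_done, NULL);

              /* Flush so the commands are actually submitted. */
              for (int i = 0; i < nqueues; ++i)
                  clFlush(queues[i]);

              clReleaseEvent(big_done);
              for (int i = 0; i < nsmall; ++i)
                  clReleaseEvent(small_done[i]);
          }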

       

      What's more, I did an experiment: I enqueued one kernel to two queues many times, and it ran like this:

      1.PNG

      Then I made the kernels on Queue 2 "depend" on the kernels on Queue 1, and I got:


      2.PNG

      This is... fascinating... Kernels that wait on events are not only delayed after their events are signaled, they are also delayed by the kernels before them in the same queue. The gaps remain even after all the jobs in Queue 1 are done. Why?
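
      For reference, the second experiment is enqueued roughly like this (a sketch; q1, q2, kern, gws, and COUNT are placeholder names for things created elsewhere, and error checks are omitted):

          enum { COUNT = 100 };            /* arbitrary repeat count */
          cl_event ev1[COUNT];

          for (int i = 0; i < COUNT; ++i) {
              /* Queue 1: independent instances of the kernel. */
              clEnqueueNDRangeKernel(q1, kern, 1, NULL, &gws, NULL,
                                     0, NULL, &ev1[i]);
              /* Queue 2: the same kernel, gated on the i-th Queue 1
                 kernel via its event. */
              clEnqueueNDRangeKernel(q2, kern, 1, NULL, &gws, NULL,
                                     1, &ev1[i], NULL);
          }
          clFlush(q1);
          clFlush(q2);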

        • Re: Efficiency of event
          ekondis

          Interesting... What is the average kernel execution time in your supplied examples?

           

          Perhaps the kernel dependencies are resolved on the host, but I don't know for sure.
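
          One way to check could be event profiling: if the dependent kernel's CL_PROFILING_COMMAND_START lags far behind the wait event's CL_PROFILING_COMMAND_END, the extra latency is coming from somewhere outside the device. A sketch (dep_event and kernel_event are placeholders for your two events; the queues must be created with CL_QUEUE_PROFILING_ENABLE):

              #include <stdio.h>

              /* Both events must come from the same device for the
                 timestamps (in nanoseconds) to be comparable. */
              cl_ulong dep_end, kern_start;

              clFinish(q1);                /* ensure both events completed */
              clFinish(q2);

              clGetEventProfilingInfo(dep_event, CL_PROFILING_COMMAND_END,
                                      sizeof(dep_end), &dep_end, NULL);
              clGetEventProfilingInfo(kernel_event, CL_PROFILING_COMMAND_START,
                                      sizeof(kern_start), &kern_start, NULL);

              printf("gap after signal: %.1f us\n",
                     (kern_start - dep_end) / 1000.0);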

          • Re: Efficiency of event
            tugrul_512bit

            I did a test with 2 contexts running in 2 threads, without any events. One context does only compute (x14). The other context does only reads (x7) + writes (x7), interleaved. Either of them takes about 19 ms alone.
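
            The two threads look roughly like this (a sketch using pthreads; each args_t carries one context's own queue, kernel, and buffer, all created elsewhere, and error checks are omitted):

                #include <CL/cl.h>
                #include <pthread.h>

                typedef struct {
                    cl_command_queue q;    /* queue of this thread's own context */
                    cl_kernel        k;
                    cl_mem           buf;
                    float           *host;
                    size_t           n;
                } args_t;

                /* Context 1: compute only, x14. */
                static void *compute_only(void *p) {
                    args_t *a = p;
                    size_t gws = a->n;
                    for (int i = 0; i < 14; ++i)
                        clEnqueueNDRangeKernel(a->q, a->k, 1, NULL, &gws,
                                               NULL, 0, NULL, NULL);
                    clFinish(a->q);
                    return NULL;
                }

                /* Context 2: read x7 + write x7, interleaved. */
                static void *copy_only(void *p) {
                    args_t *a = p;
                    size_t bytes = a->n * sizeof(float);
                    for (int i = 0; i < 7; ++i) {
                        clEnqueueReadBuffer(a->q, a->buf, CL_TRUE, 0, bytes,
                                            a->host, 0, NULL, NULL);
                        clEnqueueWriteBuffer(a->q, a->buf, CL_TRUE, 0, bytes,
                                             a->host, 0, NULL, NULL);
                    }
                    clFinish(a->q);
                    return NULL;
                }

                /* main() pthread_create()s one thread per context, then
                   pthread_join()s both while timing the whole run. */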

             

            To test overlapped compute and read/write, I ran them at the same time:

             

            debug1.png

             

            As you can see, every read or write ends at the same time as a compute blob. (The whole workload completes in about 20 ms, so they really did run asynchronously.)

             

            Now the same test, but this time there are only 7 compute blobs instead of 14, each with doubled compute intensity:

             

            debug2.png

             

            Again, the read/write blobs end at the same end point as a compute blob, and clearly there is less room for reads/writes to squeeze into, so half of the read/write jobs get done much faster only once the compute has finished. (Everything finished in 38 ms.)

             

            The same total amount of compute, but with more work per kernel, leads to lower utilization of the async read/write/compute capability.

             

            Now, to stress the async behavior even more, I added another thread with a third context doing just reads/writes again:

             

            debug3.png

             

            As you can see, MORE reads/writes could fit between the beginning and the end of the total compute blob area. (Total work is now 40 ms.)

             

            Finally, my understanding is that the more queues you use, the more async operations happen at the same time, and this is on a lowest-end RX550 GPU. So when you use many queues in parallel, you can hide the latencies of events, computes, and reads/writes. I think AMD scales better with more queues than Nvidia or Intel.
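
            For example, a minimal multi-queue pattern would be something like this (a sketch; ctx, dev, kern, gws, and njobs are placeholders for things created elsewhere, each queue is a plain in-order queue, and error checks are omitted):

                enum { NQ = 4 };
                cl_command_queue qs[NQ];
                cl_int err;

                /* Several in-order queues on the same context/device. */
                for (int i = 0; i < NQ; ++i)
                    qs[i] = clCreateCommandQueue(ctx, dev, 0, &err);

                /* Spread the independent jobs across the queues round-robin
                   so the driver is free to overlap them. */
                for (int j = 0; j < njobs; ++j)
                    clEnqueueNDRangeKernel(qs[j % NQ], kern, 1, NULL, &gws,
                                           NULL, 0, NULL, NULL);

                for (int i = 0; i < NQ; ++i) clFlush(qs[i]);
                for (int i = 0; i < NQ; ++i) clFinish(qs[i]);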