I'm currently working on reducing the dispatch times of HSA kernels to enable fine-grained offloading.
At the moment I see latencies of about 1 microsecond (using some tricks; see below for details).
I would be very interested in the experiences of other HSA developers.
- What dispatch latencies do you encounter?
- Did you find some tricks, hacks or optimizations to reduce latencies?
Many thanks in advance.
Setup: In my experiments I use a simple "do nothing" kernel. I disable interrupt handling (env HSA_ENABLE_INTERRUPT=0 <hsa_app>) and use busy-waiting instead. In addition, the iGPU's frequency is pinned to 720 MHz.
A synchronous dispatch of a single kernel takes ~7 microseconds (time until the application receives the completion signal, including the AQL enqueue).
Dispatching multiple kernels in batches hides some of the latency: 3.5 microseconds.
Dispatching a busy-waiting kernel in advance and communicating with it via atomics reduces the latency to ~1 microsecond.
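
To make the last approach more concrete, here is a minimal host-only sketch of the handshake, not my actual HSA code. A pthread stands in for the pre-dispatched busy-wait kernel, and C11 atomics stand in for the atomics on memory shared between CPU and GPU. All names (mailbox_t, worker_main, offload, run_round_trips) are made up for illustration, and the "work" is just arg + 1.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical mailbox shared by the host and the long-running worker.
   On real hardware this would live in memory visible to CPU and GPU. */
typedef struct {
    _Atomic uint32_t request;   /* host bumps this to post a new ticket   */
    _Atomic uint32_t response;  /* worker stores the ticket it finished   */
    _Atomic int      quit;      /* host sets this to 1 to stop the worker */
    uint32_t         arg;       /* input, published by the request store   */
    uint32_t         result;    /* output, published by the response store */
} mailbox_t;

/* Stand-in for the busy-wait kernel that was dispatched in advance. */
static void *worker_main(void *p) {
    mailbox_t *mb = p;
    uint32_t seen = 0;
    for (;;) {
        uint32_t req = atomic_load_explicit(&mb->request, memory_order_acquire);
        if (req != seen) {
            mb->result = mb->arg + 1;   /* the "kernel body" (placeholder work) */
            seen = req;
            atomic_store_explicit(&mb->response, req, memory_order_release);
        } else if (atomic_load_explicit(&mb->quit, memory_order_acquire)) {
            return NULL;
        }
    }
}

/* Host side: post one request and spin until the worker acknowledges it. */
static uint32_t offload(mailbox_t *mb, uint32_t arg) {
    mb->arg = arg;
    uint32_t ticket =
        atomic_load_explicit(&mb->request, memory_order_relaxed) + 1;
    atomic_store_explicit(&mb->request, ticket, memory_order_release);
    while (atomic_load_explicit(&mb->response, memory_order_acquire) != ticket) {
        /* busy-wait, no interrupts or sleeping involved */
    }
    return mb->result;
}

/* Runs n round trips and returns how many produced the expected result,
   or -1 if the worker thread could not be started. */
static int run_round_trips(int n) {
    assert(n >= 0);
    mailbox_t mb;
    atomic_init(&mb.request, 0);
    atomic_init(&mb.response, 0);
    atomic_init(&mb.quit, 0);
    mb.arg = 0;
    mb.result = 0;

    pthread_t worker;
    if (pthread_create(&worker, NULL, worker_main, &mb) != 0) {
        return -1;
    }
    int ok = 0;
    for (int i = 0; i < n; i++) {
        ok += (offload(&mb, (uint32_t)i) == (uint32_t)i + 1u);
    }
    atomic_store_explicit(&mb.quit, 1, memory_order_release);
    pthread_join(worker, NULL);
    return ok;
}
```

The release store on the ticket and the matching acquire load are what make the plain arg and result fields safe to read on the other side. In this sketch run_round_trips(100) returns 100. Timing the offload() calls on a CPU only measures cache-line ping-pong between two cores, so the numbers are not comparable to the GPU figures above.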