HSA: Kernel dispatch latencies (What are your experiences?)


I'm currently working on reducing the dispatch times of HSA kernels to enable fine-grained offloading.

At the moment I get latencies of about 1 microsecond (using some tricks; see below for details).

I would be very interested in the experiences of other HSA developers.

- What dispatch latencies do you encounter?

- Have you found any tricks, hacks, or optimizations that reduce latency?

Many thanks in advance.


Setup: In my experiments I use a simple "do nothing" kernel. I disable interrupt handling (env HSA_ENABLE_INTERRUPT=0 <hsa_app>) and use busy-waiting instead. Furthermore, the iGPU's frequency is pinned to 720 MHz.
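To make the busy-waiting setup concrete, here is a minimal C11 sketch. It does not use the HSA runtime: the completion signal is modeled by a hypothetical mock_signal_t (a 64-bit atomic initialized to 1, as HSA samples commonly do), and a host thread plays the GPU by decrementing it. With the real runtime the wait would presumably go through hsa_signal_wait_scacquire with the HSA_WAIT_STATE_ACTIVE hint; all names below are illustrative only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for an HSA completion signal: a 64-bit value that
 * the "device" decrements from 1 to 0 when the kernel finishes. */
typedef struct {
    _Atomic int64_t value;
} mock_signal_t;

/* Written by the simulated kernel, read by the host after completion. */
static int kernel_output;

/* Busy-wait (no interrupt, no sleep) until the signal drops below threshold.
 * The acquire load makes writes done before the decrement visible here. */
static int64_t spin_wait_lt(mock_signal_t *sig, int64_t threshold) {
    int64_t v;
    while ((v = atomic_load_explicit(&sig->value, memory_order_acquire)) >= threshold) {
        /* spin: trades CPU time for lower wake-up latency */
    }
    return v;
}

/* Simulated device: produce the output, then signal completion (release). */
static void *mock_device(void *arg) {
    mock_signal_t *sig = arg;
    kernel_output = 42;
    atomic_fetch_sub_explicit(&sig->value, 1, memory_order_release);
    return NULL;
}

/* One simulated synchronous dispatch: "enqueue" (start the device thread),
 * then busy-wait for completion. Returns the output, or -1 on error. */
static int dispatch_and_wait(void) {
    mock_signal_t sig;
    atomic_init(&sig.value, 1);            /* typical HSA usage: start at 1 */
    pthread_t dev;
    if (pthread_create(&dev, NULL, mock_device, &sig) != 0) return -1;
    spin_wait_lt(&sig, 1);                 /* returns once value < 1 */
    pthread_join(dev, NULL);
    return kernel_output;
}
```

The spin loop keeps the interrupt and the thread wake-up off the critical path, at the price of keeping one CPU core fully busy while waiting.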

A synchronous dispatch of a single kernel takes ~7 microseconds, measured from the AQL enqueue until the application receives the completion signal.
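A round-trip figure like this is easiest to compare across setups when the measurement method is fixed. The following sketch (hypothetical helper names; the dispatch is passed in as a callback) times each synchronous round trip with CLOCK_MONOTONIC, discards warm-up iterations, and reports the median, which is less sensitive to occasional OS noise than the mean. In a real benchmark the callback would write the AQL packet, ring the doorbell, and spin on the completion signal; here noop_dispatch only counts calls.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

static uint64_t now_ns(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static int cmp_u64(const void *a, const void *b) {
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

/* Hypothetical stand-in for one synchronous dispatch; only counts calls. */
static void noop_dispatch(void *ctx) { ++*(size_t *)ctx; }

/* Runs `warmup` untimed calls, then times `iters` synchronous round trips
 * of dispatch_sync and returns the median latency in nanoseconds. */
static uint64_t median_latency_ns(void (*dispatch_sync)(void *), void *ctx,
                                  size_t warmup, size_t iters) {
    for (size_t i = 0; i < warmup; ++i) dispatch_sync(ctx);
    uint64_t *samples = malloc(iters * sizeof *samples);
    if (!samples || iters == 0) { free(samples); return 0; }
    for (size_t i = 0; i < iters; ++i) {
        uint64_t t0 = now_ns();
        dispatch_sync(ctx);
        samples[i] = now_ns() - t0;
    }
    qsort(samples, iters, sizeof *samples, cmp_u64);
    uint64_t median = samples[iters / 2];
    free(samples);
    return median;
}
```

For example, median_latency_ns(my_dispatch, ctx, 100, 10001) would time 10,001 dispatches after 100 warm-up runs; my_dispatch and ctx are placeholders for the real dispatch routine and its state.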

Dispatching multiple kernels in batches hides some of this latency and brings it down to ~3.5 microseconds.
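One plausible explanation for the batching gain is that fixed per-dispatch costs are paid once per batch instead of once per kernel. The sketch below models an AQL-style ring buffer with simplified, hypothetical types (a real AQL dispatch packet is 64 bytes with many more fields): each packet's header is published last with a release store, only the final packet carries a completion signal, and the doorbell is written once for the whole batch. Waiting on the last signal alone assumes the packets complete in order, e.g. with the AQL barrier bit set.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define QUEUE_SIZE 64u   /* power of two, so index & (size - 1) wraps */
#define HEADER_VALID 1u  /* hypothetical "packet ready" header value */

/* Hypothetical, heavily simplified packet. */
typedef struct {
    _Atomic uint16_t header;    /* written last to hand the slot over */
    uint32_t kernel_id;
    int has_completion_signal;
} mock_packet_t;

typedef struct {
    mock_packet_t slots[QUEUE_SIZE];
    uint64_t write_index;
    _Atomic uint64_t doorbell;  /* index of the last published packet */
    unsigned doorbell_rings;    /* counts doorbell writes, for illustration */
} mock_queue_t;

/* Enqueue n >= 1 dispatches, attach a completion signal only to the last
 * one, and ring the doorbell once. Returns the last packet's index.
 * Assumes the consumer keeps up (no full-queue check). */
static uint64_t enqueue_batch(mock_queue_t *q, const uint32_t *kernel_ids, unsigned n) {
    assert(n > 0);
    for (unsigned i = 0; i < n; ++i) {
        mock_packet_t *p = &q->slots[q->write_index & (QUEUE_SIZE - 1)];
        p->kernel_id = kernel_ids[i];
        p->has_completion_signal = (i == n - 1);
        /* publish the body before the header becomes valid */
        atomic_store_explicit(&p->header, HEADER_VALID, memory_order_release);
        q->write_index++;
    }
    atomic_store_explicit(&q->doorbell, q->write_index - 1, memory_order_release);
    q->doorbell_rings++;
    return q->write_index - 1;
}
```

With a batch of 8, the host writes the doorbell once and waits on one signal instead of eight.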

Dispatching a busy-waiting kernel in advance, keeping it running, and communicating with it via atomics reduces the latency to ~1 microsecond.
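The busy-waiting-kernel trick replaces the whole dispatch path with a shared mailbox. Below is a CPU-only model of such a protocol, assuming a hypothetical mailbox_t in memory that is coherent between host and device (on an APU, something like fine-grained system memory): the host publishes an argument and bumps a request sequence number with a release store, the spinning worker picks it up with an acquire load, does the work, and publishes the result through a second sequence number. A pthread stands in for the GPU kernel, so this illustrates the synchronization, not GPU-side code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical mailbox shared by the host and a long-running worker. */
typedef struct {
    _Atomic uint64_t request_seq;  /* bumped by the host to post work */
    _Atomic uint64_t done_seq;     /* set by the worker when finished */
    _Atomic int quit;              /* asks the worker to exit */
    int arg;                       /* payload, published via request_seq */
    int result;                    /* payload, published via done_seq */
} mailbox_t;

/* Worker body: spin until a new request or a quit request arrives. */
static void *persistent_worker(void *p) {
    mailbox_t *mb = p;
    uint64_t seen = 0;
    for (;;) {
        uint64_t req = atomic_load_explicit(&mb->request_seq, memory_order_acquire);
        if (req != seen) {
            mb->result = mb->arg * 2;  /* the "offloaded" work */
            seen = req;
            atomic_store_explicit(&mb->done_seq, seen, memory_order_release);
        } else if (atomic_load_explicit(&mb->quit, memory_order_acquire)) {
            return NULL;
        }
    }
}

/* Host side: post one request and busy-wait for its completion, instead of
 * writing a packet, ringing a doorbell, and waiting on a signal. */
static int offload(mailbox_t *mb, int arg) {
    uint64_t seq = atomic_load_explicit(&mb->request_seq, memory_order_relaxed) + 1;
    mb->arg = arg;
    atomic_store_explicit(&mb->request_seq, seq, memory_order_release);
    while (atomic_load_explicit(&mb->done_seq, memory_order_acquire) != seq) {
        /* spin */
    }
    return mb->result;
}
```

The cost of this design is that the worker occupies GPU resources and spins even when there is no work, and the host also spins while waiting.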
