This is intended more as feedback for improving a future API version. Also, I don't have much experience with async compute yet and may be wrong.
So, my use case is quite different from the typical 'do large compute workloads while rendering shadow maps'.
I work on realtime GI based on a tree of surface samples.
The algorithm is very complex and often requires one indirect dispatch per tree level with barriers in between, so the workloads near the root are tiny.
It also requires many maintenance tasks (e.g. interpolating samples from their parents when they enter the view), again resulting in tiny workloads and mostly zero-work dispatches for each tree level.
Similar problems will arise in almost any algorithm with complex work reduction / distribution, or with variable workloads (e.g. collision detection for all possible pairs of shape primitives in a physics engine). So we really need fine-grained compute, even for dispatches of only a few wavefronts running short programs.
Async compute is just perfect for solving this problem, but it seems the synchronization cost and overhead are still too high to do it with full efficiency.
This is what I'm trying to do:
Usually I use a single command buffer containing indirect dispatches and memory barriers for every potential workload.
To go async, I need to divide it into multiple command buffers on multiple queues at each synchronization point (there seems to be no other way to synchronize two command buffers).
I made my division like this:
Task A (0.1 ms - 34 invocations, mostly zero or tiny workloads)
Task B (0.5 ms - 16 invocations, starting with tiny and ending with heavy workloads)
Task C (0.5 ms - final work; at this point I need the results from both A and B)
So I can do A and B simultaneously. My goal is to hide the runtime of A behind B, and this totally works.
Option 1 (better):
queue1: process A
queue2: process B, wait on A, process C
Option 2:
queue1: process A
queue2: process B
queue3: wait on A and B, process C
The problem is, I end up with a runtime of 1.05 ms, not the expected 1.00 ms.
This is disappointing, because if I remove task C, A+B needs only the time of B (0.5 ms).
The problem persists if I remove the semaphores, so it seems to be more about enqueuing multiple command buffers (additional CPU <-> GPU interaction?).
But I can't be sure of anything: when we talk about 0.05 ms, even using timestamps for profiling has a performance effect larger than that (for some details see Confusing performance with async compute - Graphics Programming and Theory - GameDev.net).
However, if you think this makes sense and indicates an API limitation,
maybe it would work to extend synchronization between queues (something like DX12 split barriers or VK events),
or to enable async compute on a single queue with user-defined dependencies, barriers, etc., to avoid the need to divide any command buffer.
Maybe an improvement is possible on the driver side alone.
Also, let me know if you have an idea for something else I could try.
I'll continue with this when I'm finished with my whole algorithm and have more options for async combinations...