Hi nfogh,
Thank you for your query.
After discussing this with the relevant team, the behavior above appears to be expected. The API timeline trace shows that the next command is not submitted until the previous command (the one with the callback) finishes executing. That is where the delay comes from. The team explained it as follows:
- The OpenCL runtime builds batches of commands before sending them to the KMD / HW
- HW gives SW no feedback for individual commands in a batch, only a status for the entire batch
- If the application requests a callback, batch building has to be interrupted, since the runtime needs the result of that command as soon as possible.
- The runtime then has to start forming the next batch, hence the delay before the next command executes.
- There is also a delay to get the result of the batch, but it’s hidden inside that 0.25 ms delay.
Please note that, in theory, the delay should be much smaller if the application doesn’t delay the commands. One kernel isn’t enough to trigger batch formation, but the application can control this with clFlush. Hence, if the application needs the GPU to stay busy during the callback, it has to enqueue something like this:
o NDRange with callback
o NDRange
o NDRange
o clFlush -> if this flush is missing, the runtime can’t guarantee that the 2 previous commands have started execution
o NDRange
P.S. Assuming it's a Windows setup.
Thanks.