OpenCL

nfogh
Journeyman III

OpenCL: Delay in inter-kernel execution when requesting callbacks

Hi

I have a problem with delays in kernel execution when I request callbacks from OpenCL.

In my application, I need to execute kernels at a "very" high rate (around 300Hz), and I need a callback to my host application every time execution has finished. However, I am seeing large delays in kernel-to-kernel execution when getting these callbacks, even when there is another kernel waiting in the queue.

To investigate, I have created a test program that enqueues 100 kernels into a queue. Looking at the CodeXL timeline trace, all the kernels are executed just after each other with around 1 - 3 us delay.

However, when I request a callback on one of the events, the subsequent kernel has an execution delay of around 0.25 ms, even though the callback is completely empty (just a return statement).

The callback is executed immediately (within a few us), but the next kernel waits for some time before it executes. The attached image shows what happens when I request a callback for every 5th execution of my kernel:

Untitled.png

By my calculation, this delay can cost up to 25% of my GPU processing time when executing at 300 Hz.

Can anybody shed some light on this?


Accepted Solutions
dipak
Staff

Re: OpenCL: Delay in inter-kernel execution when requesting callbacks

Hi nfogh,

Thank you for your query.

After discussing this with the relevant team, the behavior you describe appears to be expected. In the API timeline trace, you can see that the next command is not submitted until the previous command, the one with the callback, has finished executing. That is the source of the delay. As the team explained it:

- The OpenCL runtime builds batches of commands before sending them to the kernel-mode driver (KMD) / hardware.

- The hardware gives the software no feedback for each individual command in a batch, only a status for the entire batch.

- If the application requests a callback, batch building has to be interrupted, because the runtime needs the result of that command as soon as possible.

- A new batch then has to be started, hence the delay before the next command executes.

- There is also a delay in getting the result of the batch, but it’s hidden inside that 0.25 ms delay.

Please note that, in theory, the delay should be much smaller if the application doesn’t hold back the commands that follow. A single kernel isn’t enough to trigger batch formation, but the application can control submission with the clFlush command. Hence, if the application needs the GPU to stay busy during the callback, it has to enqueue something like this:

o NDRange with callback

o NDRange

o NDRange

o clFlush -> without this flush, the runtime can’t guarantee that the 2 previous commands have started execution

o NDRange

P.S. Assuming it's a Windows setup.

Thanks.
