
markr
Adept I

Kernel Timing Anomalies

1. Inaccurate Timings when multiple Kernels Enqueued

When I queue multiple kernels I have been getting some erratic timings. (My example code is attached in the .zip file.)

This example uses a very simple kernel that just applies a gain to an image, so it is memory-bandwidth limited. The minimum possible kernel execution time can therefore be deduced from the image size and the GPU memory bandwidth; for the two image sizes used in my example this works out to 207 uS and 3.2 uS on the GPU I am using.
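For reference, the lower bound comes from the fact that the kernel has to at least read the input image and write the output image, so the minimum time is roughly total bytes moved divided by peak memory bandwidth. A rough sketch of that calculation (the image dimensions and bandwidth figure below are illustrative placeholders, not the values from my attached code):

    // Rough lower bound for a memory-bandwidth-limited gain kernel:
    // it must at least read the input image and write the output image.
    // All numbers below are illustrative placeholders.
    #include <cstdio>

    int main()
    {
        const double widthPx      = 1920.0;     // assumed image width in pixels
        const double heightPx     = 1080.0;     // assumed image height in pixels
        const double bytesPerPx   = 4.0;        // e.g. RGBA8
        const double bandwidthBps = 320.0e9;    // assumed peak GPU memory bandwidth, bytes/s

        const double bytesMoved = 2.0 * widthPx * heightPx * bytesPerPx;   // read + write
        const double minTimeUs  = (bytesMoved / bandwidthBps) * 1.0e6;

        printf("Bandwidth-limited minimum kernel time ~ %.1f uS\n", minTimeUs);
        return 0;
    }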

I enqueue the kernel 4 times on the larger image, followed by 4 times on the smaller image, with the following results:

                                              Queued      Submitted       Start       End

Kernel  1 Execution Time: 240.593 uS   (     0.000 uS     35.351 uS    919.660 uS   1160.253 uS )

Kernel  2 Execution Time: 249.185 uS   (     3.136 uS     87.237 uS   1233.142 uS   1482.327 uS )

Kernel  3 Execution Time: 237.630 uS   (     3.706 uS     95.219 uS   1482.623 uS   1720.253 uS )

Kernel  4 Execution Time: 236.741 uS   (    14.540 uS     99.496 uS   1720.549 uS   1957.290 uS )

Kernel  5 Execution Time:   2.519 uS   (    15.395 uS    102.632 uS   1957.586 uS   1960.105 uS )

Kernel  6 Execution Time:   3.111 uS   (    15.965 uS    105.482 uS   1960.253 uS   1963.364 uS )

Kernel  7 Execution Time:   0.296 uS   (    16.535 uS    108.048 uS   1963.512 uS   1963.808 uS )

Kernel  8 Execution Time:   0.297 uS   (    16.820 uS    110.329 uS   1963.956 uS   1964.253 uS )

TotalTime =   1964.253 uS (First Kernel Queued to last Kernel Finished)

The times are somewhat erratic; in the above example, kernel calls 5, 7 and 8 show impossibly fast execution times. (The .zip contains a log of multiple runs.)

Also shown for each kernel are the queued, submitted, start and end times, relative to when the first kernel was queued. I note that all 8 kernels are queued and submitted before the first kernel starts.
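(For clarity, the four columns come straight from each kernel's cl::Event profiling counters, read after the kernel has completed; something along these lines, with illustrative variable names rather than the ones in my attached code:)

    // Reading the four profiling timestamps from a completed kernel event.
    // The queue must have been created with CL_QUEUE_PROFILING_ENABLE.
    cl::Event ev;
    queue.enqueueNDRangeKernel(gainKernel, cl::NullRange, globalSize, cl::NullRange, NULL, &ev);
    queue.finish();                      // ensure the event has completed before querying it

    cl_ulong tQueued = ev.getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>();   // all in nanoseconds
    cl_ulong tSubmit = ev.getProfilingInfo<CL_PROFILING_COMMAND_SUBMIT>();
    cl_ulong tStart  = ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
    cl_ulong tEnd    = ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();

    double execTimeUs = (tEnd - tStart) * 1.0e-3;   // the "Execution Time" column
    // The Queued/Submitted/Start/End columns are each timestamp minus the
    // CL_PROFILING_COMMAND_QUEUED value of the first kernel, converted to uS.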

Timings are as expected if I wait for each kernel to complete before enqueueing the next (uncomment the #define WAIT_FOR_EACH_KERNEL at the top of the example .cpp file):

                                              Queued      Submitted       Start       End

Kernel  1 Execution Time: 240.148 uS   (     0.000 uS     18.816 uS    647.614 uS    887.762 uS )

Kernel  2 Execution Time: 242.370 uS   (   904.582 uS    948.771 uS   1106.094 uS   1348.464 uS )

Kernel  3 Execution Time: 238.222 uS   (  1372.696 uS   1393.508 uS   1505.373 uS   1743.595 uS )

Kernel  4 Execution Time: 238.518 uS   (  1768.398 uS   1788.924 uS   1901.634 uS   2140.152 uS )

Kernel  5 Execution Time:   5.333 uS   (  2164.099 uS   2184.625 uS   2297.604 uS   2302.937 uS )

Kernel  6 Execution Time:   5.037 uS   (  2327.454 uS   2347.981 uS   2459.829 uS   2464.866 uS )

Kernel  7 Execution Time:   5.334 uS   (  2488.529 uS   2509.055 uS   2621.747 uS   2627.081 uS )

Kernel  8 Execution Time:   5.334 uS   (  2651.313 uS   2672.125 uS   2785.387 uS   2790.721 uS )

TotalTime =   2790.721 uS (First Kernel Queued to last Kernel Finished)

The kernel execution times here seem realistic and show less variation.

To get meaningful individual kernel timings I have therefore had to add an event wait after each kernel enqueue when using AMD GPUs (observed on both the Hawaii (R9 290X) and the Tahiti cards I have access to). I have not noticed this phenomenon on NVIDIA GPUs.
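(The workaround is roughly the following; kernel and queue names are placeholders:)

    // Workaround: wait on each kernel's event before enqueueing the next one,
    // so each kernel's profiling timestamps are taken in isolation.
    for (int k = 0; k < NUM_KERNELS_TO_RUN; k++)
    {
        cl::Event ev;
        queue.enqueueNDRangeKernel(gainKernel, cl::NullRange, globalSize, cl::NullRange, NULL, &ev);
        ev.wait();                       // the WAIT_FOR_EACH_KERNEL path
        // ... read ev.getProfilingInfo<...>() for this kernel here ...
    }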

Has anyone else experienced anything similar?

Can anyone explain what is going on?

2. First Run:

This second point is more about trying to understand what causes the timing differences observed the first time round.

In my example code the multiple-kernel enqueueing sequence is repeated several times. The results from the first time round are markedly different (I ignored these first-round timings in the first issue above):

                                              Queued      Submitted       Start       End

Kernel  1 Execution Time: 252.889 uS   (     0.000 uS     23.377 uS  12794.529 uS  13047.418 uS )

Kernel  2 Execution Time: 234.815 uS   (   181.315 uS   1658.639 uS  13047.566 uS  13282.381 uS )

Kernel  3 Execution Time: 233.037 uS   (   399.977 uS   1706.818 uS  13282.677 uS  13515.714 uS )

Kernel  4 Execution Time: 237.481 uS   (   545.942 uS   1721.643 uS  13515.862 uS  13753.343 uS )

Kernel  5 Execution Time:   2.814 uS   (   797.389 uS   1732.191 uS  13753.492 uS  13756.306 uS )

Kernel  6 Execution Time:   2.963 uS   (  1014.911 uS   1746.161 uS  13756.603 uS  13759.566 uS )

Kernel  7 Execution Time:   0.296 uS   (  1212.762 uS   1762.410 uS  13759.714 uS  13760.010 uS )

Kernel  8 Execution Time:   0.297 uS   (  1327.367 uS   1776.380 uS  13760.306 uS  13760.603 uS )

TotalTime =  13760.603 uS (First Kernel Queued to last Kernel Finished)

I understand the cause may be to do with “Lazy memory allocation” used somewhere in the OpenCL infrastructure and/or the GPU waking up from partial power down.

Can anyone shed more light on what exactly is happening the first time round?

I notice:

- The queuing and submission of the kernels take longer.

- There is a long gap between the last submission and the first kernel start.

- Once the first kernel has started, timings are similar to before.

dipak
Big Boss

Hi,

Thanks for sharing the observations.

Kernel  5 Execution Time:  2.519 uS  (    15.395 uS    102.632 uS  1957.586 uS  1960.105 uS )

Kernel  6 Execution Time:  3.111 uS  (    15.965 uS    105.482 uS  1960.253 uS  1963.364 uS )

Kernel  7 Execution Time:  0.296 uS  (    16.535 uS    108.048 uS  1963.512 uS  1963.808 uS )

Kernel  8 Execution Time:  0.297 uS  (    16.820 uS    110.329 uS  1963.956 uS  1964.253 uS )

Yes, when I checked I got similar findings to yours. However, the output was more or less as expected if I put a "flush()" call after each "enqueueNDRangeKernel()" statement. I'm not sure about the reason for this discrepancy. I'll check with the relevant team and get back to you. Meanwhile, could you please try the "flush()" option and share your findings?
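In other words, something along these lines (the names here are only for illustration):

    // Variant: flush the queue after each enqueue so the command is submitted
    // to the device right away instead of being batched with later commands.
    std::vector<cl::Event> events(NUM_KERNELS_TO_RUN);
    for (int k = 0; k < NUM_KERNELS_TO_RUN; k++)
    {
        queue.enqueueNDRangeKernel(gainKernel, cl::NullRange, globalSize, cl::NullRange, NULL, &events[k]);
        queue.flush();                   // push this command to the device now
    }
    queue.finish();                      // then read the profiling info from the events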

I understand the cause may be to do with “Lazy memory allocation” used somewhere in the OpenCL infrastructure and/or the GPU waking up from partial power down.

"Lazy memory allocation" is  one of the main reasons for taking longer time. You may better realize it if you force the run-time to allocate the buffer before the first kernel run. For example, if you add the below BOLD statement,  hopefully, you may observe lower first-run timing than existing one.

      for (int k = 0; k < NUM_KERNELS_TO_RUN; k++)
      {
          queue.enqueueWriteImage(ClImagesIn, CL_TRUE, Origin, Region, 0, 0, HostDataIn);

          // JUST FOR TESTING THE EFFECT OF "Lazy memory allocation"
          queue.enqueueWriteImage(ClImagesOut, CL_TRUE, Origin, Region, 0, 0, HostDataIn);
      }

Regards,

Hi Dipak,

Yes, forcing the run-time to allocate the buffer before the first kernel run (i.e. adding your extra enqueueWriteImage statement) does reduce the first-run delay. A useful insight that demonstrates the "lazy memory allocation", thanks.

Adding flush() after enqueueing each kernel does generally seem to reduce the variability in the execution times. But the first kernel call on the smaller frame is still often reported as shorter than the minimum time possible, so I conclude the timing is still suspect. For the best timing of individual kernels it still seems better to queue one kernel at a time.

(I experimented further; interestingly, when using flush(), if more than 10 kernels are queued (I set NUM_KERNELS_TO_RUN to 32) the submission times increase quite dramatically.)

Thanks for your input.

Regards,

Mark.


Hi Mark,

I think the event objects are not reflecting the expected timing information when multiple commands are batched together to the device (i.e. with no sync point like finish() or wait()). However, this does not seem to happen if the program is built as OCL 2.0 (i.e. with the build flag "-cl-std=CL2.0"). In that case the timing information is more regular (though the execution time is a little higher than with OCL 1.2; that can happen and is out of scope here). So I suspect it may be an issue with the OCL 1.2 build. I'll check further and report a bug if required.
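(For reference, the flag is just passed as a program build option, roughly like this; the context, source and device variable names are illustrative:)

    // Building the program for OpenCL 2.0 (the device and driver must support it).
    cl::Program program(context, kernelSource);
    if (program.build(devices, "-cl-std=CL2.0") != CL_SUCCESS)
    {
        std::cerr << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(devices[0]) << std::endl;
    }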

For the best timing of individual kernels it still seems better to queue one kernel at a time.

Yes, it may be. However, if you check the overall time, i.e. first kernel queued to last kernel finished, you'll observe that queuing multiple commands as a batch generally gives better performance than executing them one by one (i.e. queue and wait for each to finish). This is also recommended in the AMD programming guide. In a real scenario you should consider the overall time, not only the individual execution times. Multiple queues can also be used to boost performance if the kernels are independent, as sketched below.
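A rough sketch with two queues (illustrative names only):

    // Two queues on the same device; independent kernels enqueued to different
    // queues can be scheduled concurrently by the runtime.
    cl::CommandQueue queueA(context, device, CL_QUEUE_PROFILING_ENABLE);
    cl::CommandQueue queueB(context, device, CL_QUEUE_PROFILING_ENABLE);

    queueA.enqueueNDRangeKernel(kernel1, cl::NullRange, globalSize1, cl::NullRange);
    queueB.enqueueNDRangeKernel(kernel2, cl::NullRange, globalSize2, cl::NullRange);

    queueA.finish();
    queueB.finish();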

Regards,


Hi Dipak,

Thanks for your input again.

Yes, I can confirm the build flag "-cl-std=CL2.0" also improves timing consistency a bit (but with slower execution, as you say).

What I meant previously by:

For the best timing of individual kernels it still seems better to queue one kernel at a time

is that this is just an analysis issue: I find I have to run one kernel at a time to get believable individual kernel execution times when optimising my kernels.

My observation that I can get impossibly short execution times reported when the kernels are multiply enqueued implies to me that the timings are less accurate in this scenario.

Here are some timings from my actual code for enqueueing kernels "one at a time" and "multiply enqueued":

OpenCL Kernel Timings:                 One At A Time   Multiply Enqueued

  Kernel 1      Stats: Mean(100 Calls)    0.220 mS,      0.220 mS

  Kernel 2      Stats: Mean(100 Calls)    3.507 mS,      3.768 mS

  Kernel 3      Stats: Mean(100 Calls)    0.306 mS,      0.000 mS

  Kernel 4      Stats: Mean(100 Calls)    0.676 mS,      0.679 mS

  Kernel 5      Stats: Mean(100 Calls)    1.325 mS,      1.326 mS

  ...

Here kernel 3 is reported at less than 1 uS when the kernels are multiply enqueued, but when run individually it takes 306 uS (it can't be less than 125 uS due to bandwidth considerations). Note that when multiply enqueued, the previous kernel (2) is reported as taking longer.

My suspicion is that, when the kernels are multiply enqueued, the event times for kernel 2's end and kernel 3's start are being delayed relative to reality (the two are always close together).

I have not been able to demonstrate as dramatic an example in the simple test I supplied, but I believe it shows the same principle as what is going on in some of my real code.

Best Regards,

Mark.

(BTW, I'm on holiday next week.)


Hi Mark,

My observation that I can get impossibly short execution times reported when the kernels are multiply enqueued implies to me that the timings are less accurate in this scenario.

As I mentioned earlier, when multiple commands are queued together the timing information does not reflect the actual numbers. This may be a bug; I'll report it to the relevant team.

What I was trying to say is that, in a scenario where multiple commands need to be executed, the batch method may give better overall performance (i.e. queued to end) than executing each command one by one using finish() or waitForEvent(). If there are dependencies, events can be used to mark them, as in the sketch below. It's just a suggestion.
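For example (illustrative names only):

    // Batch the commands but express the dependency explicitly with an event
    // wait list, instead of calling finish() between the two kernels.
    cl::Event firstDone;
    queue.enqueueNDRangeKernel(kernelA, cl::NullRange, globalA, cl::NullRange, NULL, &firstDone);

    std::vector<cl::Event> waitList(1, firstDone);
    queue.enqueueNDRangeKernel(kernelB, cl::NullRange, globalB, cl::NullRange, &waitList, NULL);

    queue.finish();                      // a single sync point at the end of the batch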

BTW, thanks for confirming that the OCL 2.0 build gives better timing consistency.

Enjoy the holidays...

Regards,
