1. Inaccurate timings when multiple kernels are enqueued
When I enqueue multiple kernels I have been getting some erratic timings. (My example code is attached in the zip file.)
This example uses a very simple kernel that just applies a gain to an image. The kernel is memory-bandwidth limited, so the minimum possible kernel execution time can be deduced from the image size and the GPU memory bandwidth. For the two image sizes used in my example this works out to 207 uS and 3.2 uS on the GPU I am using.
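As a rough illustration of that calculation (the image dimensions, pixel size and bandwidth figure below are assumed example values for illustration, not the actual ones from my test):

#include <cstdio>

int main()
{
    // A gain kernel reads each pixel once and writes it once,
    // so the memory traffic is 2 * (image size in bytes).
    // All numbers here are illustrative assumptions.
    const double width      = 2880;      // pixels (assumed)
    const double height     = 2880;      // pixels (assumed)
    const double bytesPerPx = 4;         // e.g. RGBA8 (assumed)
    const double bandwidth  = 320.0e9;   // bytes/s, roughly R9 290X peak (assumed)

    const double traffic = 2.0 * width * height * bytesPerPx;   // read + write
    printf("Minimum kernel time: %.1f uS\n", traffic / bandwidth * 1e6);
    return 0;
}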
I enqueue the kernel 4 times on a larger image, followed by 4 times on a smaller image, with the following results:
Queued Submitted Start End
Kernel 1 Execution Time: 240.593 uS ( 0.000 uS 35.351 uS 919.660 uS 1160.253 uS )
Kernel 2 Execution Time: 249.185 uS ( 3.136 uS 87.237 uS 1233.142 uS 1482.327 uS )
Kernel 3 Execution Time: 237.630 uS ( 3.706 uS 95.219 uS 1482.623 uS 1720.253 uS )
Kernel 4 Execution Time: 236.741 uS ( 14.540 uS 99.496 uS 1720.549 uS 1957.290 uS )
Kernel 5 Execution Time: 2.519 uS ( 15.395 uS 102.632 uS 1957.586 uS 1960.105 uS )
Kernel 6 Execution Time: 3.111 uS ( 15.965 uS 105.482 uS 1960.253 uS 1963.364 uS )
Kernel 7 Execution Time: 0.296 uS ( 16.535 uS 108.048 uS 1963.512 uS 1963.808 uS )
Kernel 8 Execution Time: 0.297 uS ( 16.820 uS 110.329 uS 1963.956 uS 1964.253 uS )
TotalTime = 1964.253 uS (First Kernel Queued to last Kernel Finished)
The times are somewhat erratic; in the above example, kernel calls 5, 7 and 8 show impossibly fast execution times. (The .zip has a log of multiple runs.)
Also shown for each kernel are the queued, submitted, start and end times, relative to when the first kernel was queued. I note all 8 kernels are queued and submitted before the first kernel starts.
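For reference, these numbers come from the standard event profiling counters; a minimal sketch of the batched enqueue and the profiling query using the OpenCL C++ bindings is below (the kernel, queue and size names are placeholders, not the exact code from the attached example):

#include <CL/cl.hpp>
#include <cstdio>
#include <vector>

// Batched enqueue: all kernels are queued back to back with a single finish()
// at the end. The queue must be created with CL_QUEUE_PROFILING_ENABLE.
// gainKernel, width, height and numKernels are placeholder names.
void profileBatch(cl::CommandQueue& queue, cl::Kernel& gainKernel,
                  size_t width, size_t height, int numKernels)
{
    std::vector<cl::Event> events(numKernels);

    for (int k = 0; k < numKernels; ++k)
        queue.enqueueNDRangeKernel(gainKernel, cl::NullRange,
                                   cl::NDRange(width, height),
                                   cl::NullRange, NULL, &events[k]);
    queue.finish();   // single sync point for the whole batch

    cl_ulong base = events[0].getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>();
    for (int k = 0; k < numKernels; ++k)
    {
        cl_ulong q = events[k].getProfilingInfo<CL_PROFILING_COMMAND_QUEUED>();
        cl_ulong s = events[k].getProfilingInfo<CL_PROFILING_COMMAND_SUBMIT>();
        cl_ulong b = events[k].getProfilingInfo<CL_PROFILING_COMMAND_START>();
        cl_ulong e = events[k].getProfilingInfo<CL_PROFILING_COMMAND_END>();
        printf("Kernel %d Execution Time: %.3f uS ( %.3f uS %.3f uS %.3f uS %.3f uS )\n",
               k + 1, (e - b) / 1000.0, (q - base) / 1000.0, (s - base) / 1000.0,
               (b - base) / 1000.0, (e - base) / 1000.0);
    }
}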
Timings are as expected if I wait for each kernel to complete before enqueuing the next (uncomment the #define WAIT_FOR_EACH_KERNEL at the top of the example .cpp file):
Queued Submitted Start End
Kernel 1 Execution Time: 240.148 uS ( 0.000 uS 18.816 uS 647.614 uS 887.762 uS )
Kernel 2 Execution Time: 242.370 uS ( 904.582 uS 948.771 uS 1106.094 uS 1348.464 uS )
Kernel 3 Execution Time: 238.222 uS ( 1372.696 uS 1393.508 uS 1505.373 uS 1743.595 uS )
Kernel 4 Execution Time: 238.518 uS ( 1768.398 uS 1788.924 uS 1901.634 uS 2140.152 uS )
Kernel 5 Execution Time: 5.333 uS ( 2164.099 uS 2184.625 uS 2297.604 uS 2302.937 uS )
Kernel 6 Execution Time: 5.037 uS ( 2327.454 uS 2347.981 uS 2459.829 uS 2464.866 uS )
Kernel 7 Execution Time: 5.334 uS ( 2488.529 uS 2509.055 uS 2621.747 uS 2627.081 uS )
Kernel 8 Execution Time: 5.334 uS ( 2651.313 uS 2672.125 uS 2785.387 uS 2790.721 uS )
TotalTime = 2790.721 uS (First Kernel Queued to last Kernel Finished)
The kernel execution times here seem realistic and show less variation.
To get meaningful individual kernel timings I have therefore had to add an event wait after each kernel enqueue when using AMD GPUs (observed on both the Hawaii (R9 290X) and the Tahiti that I have access to). I have not noticed this phenomenon on Nvidia GPUs.
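In code, the WAIT_FOR_EACH_KERNEL path amounts to replacing the enqueue loop in the sketch above with something like this (again only a sketch with placeholder names):

// WAIT_FOR_EACH_KERNEL variant: block on each kernel's event before
// enqueuing the next one, so only one command is in flight at a time.
for (int k = 0; k < numKernels; ++k)
{
    queue.enqueueNDRangeKernel(gainKernel, cl::NullRange,
                               cl::NDRange(width, height),
                               cl::NullRange, NULL, &events[k]);
    events[k].wait();   // wait for this kernel to finish before enqueuing the next
}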
Has anyone else experienced anything similar?
Can anyone explain what is going on?
2. First Run:
This second point is more about trying to understand what causes the timing differences observed the first time round.
In my example code the multiple-kernel enqueuing sequence is repeated several times. I have observed that the results from the first time round are markedly different (I ignored these first-time-round timings in the first issue above):
Queued Submitted Start End
Kernel 1 Execution Time: 252.889 uS ( 0.000 uS 23.377 uS 12794.529 uS 13047.418 uS )
Kernel 2 Execution Time: 234.815 uS ( 181.315 uS 1658.639 uS 13047.566 uS 13282.381 uS )
Kernel 3 Execution Time: 233.037 uS ( 399.977 uS 1706.818 uS 13282.677 uS 13515.714 uS )
Kernel 4 Execution Time: 237.481 uS ( 545.942 uS 1721.643 uS 13515.862 uS 13753.343 uS )
Kernel 5 Execution Time: 2.814 uS ( 797.389 uS 1732.191 uS 13753.492 uS 13756.306 uS )
Kernel 6 Execution Time: 2.963 uS ( 1014.911 uS 1746.161 uS 13756.603 uS 13759.566 uS )
Kernel 7 Execution Time: 0.296 uS ( 1212.762 uS 1762.410 uS 13759.714 uS 13760.010 uS )
Kernel 8 Execution Time: 0.297 uS ( 1327.367 uS 1776.380 uS 13760.306 uS 13760.603 uS )
TotalTime = 13760.603 uS (First Kernel Queued to last Kernel Finished)
I understand the cause may be to do with “Lazy memory allocation” used somewhere in the OpenCL infrastructure and/or the GPU waking up from partial power down.
Can anyone shed more light on what exactly is happening first time round?
I notice:
- The queuing and submission of the kernels take longer.
- There is a long gap between the last submission and the first kernel start.
- Once the first kernel has started, timings seem similar to before.
Hi,
Thanks for sharing the observations.
Kernel 5 Execution Time: 2.519 uS ( 15.395 uS 102.632 uS 1957.586 uS 1960.105 uS )
Kernel 6 Execution Time: 3.111 uS ( 15.965 uS 105.482 uS 1960.253 uS 1963.364 uS )
Kernel 7 Execution Time: 0.296 uS ( 16.535 uS 108.048 uS 1963.512 uS 1963.808 uS )
Kernel 8 Execution Time: 0.297 uS ( 16.820 uS 110.329 uS 1963.956 uS 1964.253 uS )
Yeah, when I checked I got similar findings to yours. However, the output was as expected if I put a "flush()" call after each "enqueueNDRangeKernel()" statement. I'm not sure about the reason for this discrepancy; I'll check with the relevant team and get back to you. Meanwhile, could you please try the "flush()" option and share your findings?
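For clarity, the change I tried is simply a flush after each enqueue, along these lines (a sketch, with the same placeholder names as in the sketch above):

for (int k = 0; k < numKernels; ++k)
{
    queue.enqueueNDRangeKernel(gainKernel, cl::NullRange,
                               cl::NDRange(width, height),
                               cl::NullRange, NULL, &events[k]);
    queue.flush();   // push the command to the device without blocking the host
}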
I understand the cause may be to do with “Lazy memory allocation” used somewhere in the OpenCL infrastructure and/or the GPU waking up from partial power down.
"Lazy memory allocation" is one of the main reasons for the longer time. You can see its effect more clearly if you force the run-time to allocate the buffers before the first kernel run. For example, if you add the second enqueueWriteImage() call below (the line following the comment), you should hopefully observe a lower first-run time than you do at present.
for (int k = 0; k < NUM_KERNELS_TO_RUN; k++)
{
    queue.enqueueWriteImage(ClImagesIn /* ...arguments truncated in the original post... */);
    // JUST FOR TESTING THE EFFECT OF "Lazy memory allocation"
    queue.enqueueWriteImage(ClImagesOut /* ...arguments truncated in the original post... */);
}
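Written out more fully, the pre-touch could look like the sketch below; the image dimensions, the zero-filled host buffer and the per-kernel image indexing are illustrative assumptions (OpenCL 1.x C++ bindings):

cl::size_t<3> origin;
origin[0] = 0; origin[1] = 0; origin[2] = 0;

cl::size_t<3> region;
region[0] = width;    // image width in pixels (assumed variable)
region[1] = height;   // image height in pixels (assumed variable)
region[2] = 1;

// Dummy host data; a zero-filled buffer is enough to force the allocation.
std::vector<unsigned char> host(width * height * 4, 0);   // assumed RGBA8

for (int k = 0; k < NUM_KERNELS_TO_RUN; k++)
{
    // Touch both images once so the run-time allocates them before the timed runs.
    queue.enqueueWriteImage(ClImagesIn[k],  CL_TRUE, origin, region, 0, 0, host.data());
    queue.enqueueWriteImage(ClImagesOut[k], CL_TRUE, origin, region, 0, 0, host.data());
}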
Regards,
Hi Dipak,
Yes, forcing the run-time to allocate the buffers before the first kernel run (i.e. adding your suggested statement) does reduce the first-run delay. A useful insight that demonstrates the "lazy memory allocation", thanks.
Adding flush() after enqueuing each kernel does seem to generally reduce the variability in the execution times. But the first kernel call on the smaller frame is still often shorter than the minimum time possible, so I conclude the timing is still suspect. For the best timing of individual kernels it still seems better to queue one kernel at a time.
(I experimented further; interestingly, when using flush(), if more than 10 kernels are queued (I set NUM_KERNELS_TO_RUN to 32) the submission times increase quite dramatically.)
Thanks for your input.
Regards,
Mark.
Hi Mark,
I think the event objects are not reflecting the expected timing information when multiple commands are batched together (i.e. with no sync point like finish() or wait()) before reaching the device. However, this is not the case if the program is built as OCL 2.0 (i.e. with the build flag "-cl-std=CL2.0"). In that case the timing information looks more regular (though the execution time seems a little higher than with OCL 1.2; that may happen and is out of scope here). So, as I suspect, it may be an issue with the OCL 1.2 build. I'll check further and report a bug if required.
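For reference, building as OCL 2.0 just means passing the flag at program build time, for example (a sketch; the context, source string and device list are placeholders):

// Build the kernels for the OpenCL 2.0 language.
cl::Program program(context, sourceString);
program.build(devices, "-cl-std=CL2.0");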
For the best timing of individual kernels it still seems better to queue one kernel at a time.
Yes, it may be. However, if you check the overall time, i.e. from the first kernel being queued to the last kernel finishing, you'll observe that queuing multiple commands as a batch generally gives better performance than executing them one by one (i.e. queue and wait for each to finish). This is also recommended in the AMD programming guide. In a real scenario you should consider the overall time, not only the execution time. Multiple queues can also be used to boost performance if the kernels are independent, as sketched below.
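For example, independent kernels could be issued on two queues like this (a sketch; all names are placeholders):

// Two in-order queues on the same device; independent kernels issued on
// separate queues are free to overlap on the hardware.
cl::CommandQueue queueA(context, device, CL_QUEUE_PROFILING_ENABLE);
cl::CommandQueue queueB(context, device, CL_QUEUE_PROFILING_ENABLE);

queueA.enqueueNDRangeKernel(kernelA, cl::NullRange, globalA);
queueB.enqueueNDRangeKernel(kernelB, cl::NullRange, globalB);

queueA.finish();
queueB.finish();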
Regards,
Hi Dipak,
Thanks for your input again.
Yes, I can confirm that the build flag "-cl-std=CL2.0" also improves the timing consistency a bit (but with slower execution, as you say).
What I meant previously by:
For the best timing of individual kernels it still seems better to queue one kernel at a time
is that this is just an analysis issue. I find I have to run one kernel at a time to get believable individual kernel execution times when optimising my kernels.
My observation that impossibly short execution times can be reported when the kernels are multiply enqueued implies to me that the timings are less accurate in this scenario.
Here are some timings from my actual code for enqueuing kernels "one at a time" and "multiply":
OpenCL Kernel Timings: One At A Time, Multiply Enqueued
Kernel 1 Stats: Mean(100 Calls) 0.220 mS, 0.220 mS
Kernel 2 Stats: Mean(100 Calls) 3.507 mS, 3.768 mS
Kernel 3 Stats: Mean(100 Calls) 0.306 mS, 0.000 mS
Kernel 4 Stats: Mean(100 Calls) 0.676 mS, 0.679 mS
Kernel 5 Stats: Mean(100 Calls) 1.325 mS, 1.326 mS
...
Here kernel 3 is reported at less than 1 uS when the kernels are multiply enqueued, but when run individually it takes 306 uS (it can't take less than 125 uS due to bandwidth considerations). Note that when multiply enqueued, the previous kernel (2) is reported as taking longer.
My suspicion is that, when multiply enqueued, the event times for kernel 2's end and kernel 3's start are both being delayed relative to reality (they are always close together).
I have not been able to demonstrate as dramatic an example in the simple test I supplied, but I believe it shows the same principle as what is going on in some of my real code.
Best Regards,
Mark.
(BTW on Holiday Next week)
Hi Mark,
My observation that impossibly short execution times can be reported when the kernels are multiply enqueued implies to me that the timings are less accurate in this scenario.
As I mentioned earlier, when multiple commands are queued together, the timing information does not reflect the actual numbers. This may be a bug. I'll report it to the relevant team.
What I was trying to say is that, in a scenario where multiple commands need to be executed, batching may give better overall performance (i.e. from queued to end) than executing each command one by one using finish() or waitForEvent(). If there is a dependency between commands, events can be used to mark it, as in the sketch below. It's just a suggestion.
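For example, a dependency between two kernels can be expressed through an event wait list instead of a host-side sync point (a sketch; the kernel names and sizes are placeholders):

cl::Event evA;
queue.enqueueNDRangeKernel(kernelA, cl::NullRange, globalA, cl::NullRange,
                           NULL, &evA);

// Kernel B will not start until kernel A's event has completed.
std::vector<cl::Event> waitList(1, evA);
cl::Event evB;
queue.enqueueNDRangeKernel(kernelB, cl::NullRange, globalB, cl::NullRange,
                           &waitList, &evB);

queue.flush();   // submit the batch to the device
evB.wait();      // block only when the result is actually needed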
BTW, thanks for confirming that OCL 2.0 build gives better timing consistency.
Enjoy the holidays...
Regards,