I'm trying to profile OpenCL kernels as described in section 4.3.1 of the APP OpenCL programming guide (using clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END).
The results I'm getting often differ between runs quite dramatically (for example, 0.06 ms and 0.3 ms).
I execute the kernel ~10 times and measure minimum, maximum and average time. The difference between minimum and maximum is in some cases quite low (about 2%), but in other cases can be very high (10 times, for example). It's interesting to note, that first runs are not always longest, in fact, it's exactly the opposite in many cases. Another observation is a difference between program runs: it may happen that during first run the kernel is consistently spending 0.06 ms, while during the second run it is spending 4-5 times longer (also consistently, with 2-3% spread).
Profiling works much better on NVIDIA cards (using CUDA). I'm getting 5-7% difference between runs, which I can tolerate.
I tried to profile with and without a monitor attached to the GPU -- it does not seem to make a difference. The GPU I'm using is HD 5850, the system is Windows 7 64 bit.
Such unpredictable results make profiling absolutely useless. I have several implementations of the same algorithm, and I can't pick the best one, as the execution time changes randomly.
What are the exact steps I should make to get predictable and repeatable measurements of kernel execution times?
I have the same problem, sometimes a first run is much shorter sometimes mush longer but what interests me is the relationship between different profiling methods.
My enviroment is HD 5870, Windows XP, APP v2.2 (i guess). I am playing with JPEG compression - running 2 OpenCl kernels "RGB to YUV" and "DCT" on the GPU. Image size 1024x1024.
I gather execution time from 3 sources: 1) sprofile; 2) by enqueue 2 markers before and after kernel execution and data movement; 3) rdtsc clocks on the host. rdtsc is the most trusted thing but because there is an OpenCL overhead it cannot give correct timing but still is a very interesting parameter.
What I have got (image size 1024x1024):
1) sprofile: kernel time 0.2 + data movement 1.3 ms = 1.5 ms.
2) marker: 5.2 ms.
3) rdtsc: 10 ms.
The same with 4096x4096 image size:
1) sprofile: kernel time 3.5 + data movement 19.2 ms = 22.7 ms.
2) marker: 55 ms.
3) rdtsc: 115 ms.
If to look at the ratio between sprofile times it kind of makes sense: the second image is 16 times bigger so 1.5 gives 24 which is close to 22.7. Second and third times do fit in this formula it should have been 83 and 160. Apparently the more data the less overhead but WHY the ratio between measurement methods does not change much ? It remains 5.2/1.5 = 3.4, 10/5.2=1.9, 10/1.5 = 6.6 and 55/22.7= 2.4, 115/55 = 2.1, 115/22.7 = 5.06. OpenCL runtime still consumes 5 times more time than the kernel execution on GPU (or at least profiling reports this).
My verdict is: they should move these java guys that develop runtime to another project .
I've observed the same behavior of profiler. Moreover it's permanent feature and i think everyone faced with it (=. But it is normal and isn't strange. It's all because of GPU power saving mechanisms. When your GPU device has not enough work it scales down frequency and power consumption that's why your kernel executes slower. When this computation power is not enough it speeds up to provide more performance and in this case kernel is computed faster. Profiler obtains information about kernel execution duration from timestamps provided by runtime and GPU. So it cannot be unreliable it can only have some constant error depended on different overheads and delays.