I'm trying to profile OpenCL kernels as described in section 4.3.1 of the AMD APP OpenCL Programming Guide (using clGetEventProfilingInfo with CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END).
The results I'm getting often differ between runs quite dramatically (for example, 0.06 ms and 0.3 ms).
I execute the kernel about 10 times and record the minimum, maximum and average time. In some cases the difference between minimum and maximum is small (about 2%), but in others it is very large (a factor of 10, for example). Interestingly, the first runs are not always the longest; in many cases it is exactly the opposite. The timings also differ between program runs: in one run of the program the kernel may consistently take 0.06 ms, while in the next run it consistently takes 4-5 times longer (again with only a 2-3% spread).
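A minimal sketch of how the minimum, maximum and average described above could be computed from the raw profiling timestamps. It assumes the start and end values have already been read with clGetEventProfilingInfo (OpenCL reports them in nanoseconds) and copied into plain arrays. The names timing_stats and summarize_timings are hypothetical, chosen only for illustration, and no OpenCL calls are made here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Summary of kernel timings, in milliseconds (hypothetical helper type). */
typedef struct {
    double min_ms;
    double max_ms;
    double mean_ms;
    double max_over_min; /* 1.0 would mean perfectly repeatable timings */
} timing_stats;

/* start_ns[i] and end_ns[i] stand in for the values obtained for
 * CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END of run i.
 * Assumption: both are in nanoseconds and end_ns[i] >= start_ns[i]. */
static timing_stats summarize_timings(const uint64_t *start_ns,
                                      const uint64_t *end_ns,
                                      size_t n)
{
    assert(n > 0);

    timing_stats s = {0};
    double sum = 0.0;

    for (size_t i = 0; i < n; ++i) {
        assert(end_ns[i] >= start_ns[i]);

        /* Convert the elapsed nanoseconds of this run to milliseconds. */
        double ms = (double)(end_ns[i] - start_ns[i]) / 1.0e6;

        if (i == 0 || ms < s.min_ms) s.min_ms = ms;
        if (i == 0 || ms > s.max_ms) s.max_ms = ms;
        sum += ms;
    }

    s.mean_ms = sum / (double)n;
    /* Ratio of slowest to fastest run; 0.0 if the fastest run measured 0 ns. */
    s.max_over_min = (s.min_ms > 0.0) ? s.max_ms / s.min_ms : 0.0;
    return s;
}
```

With this sketch, a 2% spread shows up as max_over_min of about 1.02, and the "10 times" case as about 10.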
Profiling works much better on NVIDIA cards (using CUDA). There I get a 5-7% difference between runs, which I can tolerate.
I tried profiling with and without a monitor attached to the GPU; it does not seem to make a difference. The GPU I'm using is a Radeon HD 5850, and the system is 64-bit Windows 7.
Such unpredictable results make profiling absolutely useless. I have several implementations of the same algorithm, and I can't pick the best one, as the execution time changes randomly.
What exact steps should I take to get predictable and repeatable measurements of kernel execution times?