I think you can run some of the SDK samples and check their profiles. If the results are consistent, then maybe it's not the GPU that caused the problem. Just speculating.
I have the same problem: sometimes a first run is much shorter, sometimes much longer. What interests me, though, is the relationship between the different profiling methods.
My environment is an HD 5870, Windows XP, APP SDK v2.2 (I believe). I am experimenting with JPEG compression, running two OpenCL kernels, "RGB to YUV" and "DCT", on the GPU. Image size is 1024x1024.
I gather execution time from 3 sources: 1) sprofile; 2) two markers enqueued before and after the kernel execution and data movement; 3) rdtsc cycle counts on the host. rdtsc is the measurement I trust most, but because it includes the OpenCL overhead it cannot isolate the kernel timing; still, it is a very interesting number.
What I got (image size 1024x1024):
1) sprofile: kernel time 0.2 ms + data movement 1.3 ms = 1.5 ms.
2) marker: 5.2 ms.
3) rdtsc: 10 ms.
The same with a 4096x4096 image:
1) sprofile: kernel time 3.5 ms + data movement 19.2 ms = 22.7 ms.
2) marker: 55 ms.
3) rdtsc: 115 ms.
Looking at the ratio between the sprofile times, it kind of makes sense: the second image is 16 times bigger, and 1.5 ms x 16 = 24 ms, which is close to 22.7 ms. The second and third measurements do not fit this formula: they should have been 83 ms and 160 ms. Apparently the more data, the less relative overhead. But WHY does the ratio between the measurement methods not change much? It remains 5.2/1.5 = 3.4, 10/5.2 = 1.9, 10/1.5 = 6.6 for the small image, and 55/22.7 = 2.4, 115/55 = 2.1, 115/22.7 = 5.06 for the large one. The OpenCL runtime still consumes about 5 times more time than the kernel execution on the GPU (or at least that is what profiling reports).
My verdict: they should move those Java guys who develop the runtime to another project.
I've observed the same profiler behavior. Moreover, it's a permanent feature, and I think everyone has run into it (=. But it is normal and not strange: it's all because of the GPU's power-saving mechanisms. When the GPU does not have enough work, it scales down its frequency and power consumption, which is why your kernel executes more slowly. When that compute power is not enough, it ramps the clocks back up to provide more performance, and in that case the kernel is computed faster. The profiler obtains kernel execution durations from timestamps provided by the runtime and the GPU, so it cannot be unreliable; it can only carry some constant error that depends on the various overheads and delays.