I am getting exactly the same behavior even when compiz is at 100% CPU (the first iteration was sometimes a little slower in my previous tests too). This is on a 5870, 850 MHz GPU / 1200 MHz GDDR5.
steps per 10 secs : 4638
steps per 10 secs : 5061
steps per 10 secs : 5058
steps per 10 secs : 5062
steps per 10 secs : 5064
steps per 10 secs : 5065
top - 01:12:20 up 4:48, 4 users, load average: 1.37, 1.37, 1.44
Tasks: 199 total, 2 running, 196 sleeping, 0 stopped, 1 zombie
Cpu(s): 2.4%us, 11.9%sy, 0.0%ni, 85.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 16435264k total, 1392672k used, 15042592k free, 35116k buffers
Swap: 16775164k total, 0k used, 16775164k free, 431780k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4066 supremum 20 0 1284m 90m 44m R 100 0.6 147:32.48 compiz
7698 root 20 0 174m 53m 22m S 10 0.3 0:01.83 a.out
Thanks for your test.
Hmm, so maybe it depends on the card.
I'll install a clean Ubuntu to test this program again.
Also, the fact that the 5870 outperforms my 7970 is suspicious.
Just to be sure, please tell me whether or not you are using the latest version of the APP SDK and the graphics drivers.
I have the box with the 7970 up and running, and I will report back soon with some numbers. I am using SDK 2.7 and the 12.6 drivers (I mentioned it earlier). Actually, I just installed Ubuntu 12.04 from scratch on this box.
Hmm, right, something is strange here... GPU load shows 0%
Adapter 0 - AMD Radeon HD 7900 Series
Core (MHz) Memory (MHz)
Current Clocks : 300 150
Current Peak : 1010 1375
Configurable Peak Range : [300-1125] [150-1575]
GPU load : 0%
and the performance is terrible...
steps per 10 secs : 1448
steps per 10 secs : 1465
steps per 10 secs : 1468
steps per 10 secs : 1465
Anyway, there is a problem in your loop too. Aren't you waiting for kernel execution to finish before running clEnqueueReadBuffer? I get 50% better performance if I put a clFinish between the enqueue-kernel and enqueue-read statements. But that is not very efficient... (on the 7970, it now uses 50% of the card with clFinish; you should find a better solution...)
On the other hand, if I add clFinish on the 5870, there is no difference in execution...
Is clFinish necessary? The command queue keeps the order of the clEnqueue* commands, and clEnqueueReadBuffer is blocking. Am I missing something here?
Yes. When you enqueue a kernel, the host program continues and runs the read-buffer command (which will try, in a blocking fashion, to read data from the memory your kernel is working on), because the enqueue-kernel command is not blocking. You should use events to keep track of kernel execution, and (obviously) try not to read from or write to memory areas that the kernel is using while it is executing. I think I am right, but double-check against the manual.