I have enabled profiling in OpenCL and compared the actual throughput of my program with the throughput reported by the stream analyzer. I run each kernel 1000 times and calculate the mean of (number of threads in the NDRange / time in nanoseconds) * 1e9, where time is the profiled kernel runtime, measured as described in AMD's OpenCL guide.
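To make the measurement concrete, here is a minimal sketch of the calculation described above. It assumes the start and end timestamps (in nanoseconds) have already been read from each kernel's event with clGetEventProfilingInfo, using CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END. The helper names threads_per_second and mean_throughput are hypothetical, chosen only for illustration, and this is not my exact code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Throughput of a single kernel launch, in threads per second.
 * start_ns and end_ns are assumed to be the device timestamps (nanoseconds)
 * read from the kernel's profiling event. Returns 0.0 for an invalid interval. */
double threads_per_second(size_t global_threads, uint64_t start_ns, uint64_t end_ns)
{
    if (end_ns <= start_ns) {
        return 0.0;
    }
    double elapsed_ns = (double)(end_ns - start_ns);
    return ((double)global_threads / elapsed_ns) * 1e9;
}

/* Mean of the per-launch throughputs over `runs` launches of the same kernel.
 * start_ns[i] and end_ns[i] hold the timestamps of launch i. */
double mean_throughput(size_t global_threads,
                       const uint64_t *start_ns,
                       const uint64_t *end_ns,
                       size_t runs)
{
    assert(start_ns != NULL && end_ns != NULL);
    if (runs == 0) {
        return 0.0;
    }
    double sum = 0.0;
    for (size_t i = 0; i < runs; ++i) {
        sum += threads_per_second(global_threads, start_ns[i], end_ns[i]);
    }
    return sum / (double)runs;
}
```

Dividing the result by 1e6 gives millions of threads per second. Note that this averages the per-run throughputs, which is not the same as dividing the total thread count by the total time; a few unusually fast runs pull this mean up more than they would the total-based figure.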
I have an HD5970 and ran the analyzer for an HD5870. Here are the results, in millions of threads per second ("Real" is what I measure as described above):
kernel_1 Analyzer: 1190 Real: 131
kernel_2 Analyzer: 1861 Real: 43
kernel_3 Analyzer: 15797 Real: 484
kernel_4 Analyzer: 693 Real: 19
Any ideas on what I can do to get closer to the analyzer's "theoretical performance"? What do these results suggest about what may be happening on real hardware that the analyzer does not account for?