aj_guillon
Adept I

Stream Analyzer Inaccurate? Real vs. Analyzed Measurements

I have enabled profiling in OpenCL and compared the actual throughput of my program with the throughput reported by the stream analyzer. I run each kernel 1000 times and calculate the mean of (number of threads in the NDRange / time in nanoseconds) * 1e9, where time is the profiled kernel runtime, measured as described in AMD's OpenCL guide.
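A minimal sketch of the calculation described above, for readers who want to reproduce it. It assumes the per-launch runtimes in nanoseconds have already been collected (in OpenCL, typically as the difference between `CL_PROFILING_COMMAND_END` and `CL_PROFILING_COMMAND_START` from `clGetEventProfilingInfo`). The function name and the sample timings are hypothetical and are not taken from this post.

```python
def mean_throughput_mthreads_per_s(work_items, runtimes_ns):
    """Mean of the per-launch throughput, in millions of work-items per second.

    work_items  -- number of work-items in the NDRange (same for every launch)
    runtimes_ns -- profiled kernel runtime of each launch, in nanoseconds
    """
    # (work-items / nanoseconds) * 1e9 gives work-items per second for one launch.
    rates = [work_items / t_ns * 1e9 for t_ns in runtimes_ns]
    # Average over all launches, then convert to millions per second.
    return sum(rates) / len(rates) / 1e6


# Hypothetical timings for illustration only (not measurements from this thread).
example_runtimes_ns = [1_000_000, 1_250_000, 800_000]
print(mean_throughput_mthreads_per_s(250_000, example_runtimes_ns))
```

Note that this averages the per-launch rates, as the post describes; dividing the total work-items by the total time would give a slightly different number when runtimes vary.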

I have an HD5970 and have run the analyzer for an HD5870, and here are the results (in M threads / s, Real is what I measure as above):

kernel_1 Analyzer: 1190 Real: 131

kernel_2 Analyzer: 1861 Real: 43

kernel_3 Analyzer: 15797 Real: 484

kernel_4 Analyzer: 693 Real: 19

 

Any ideas on what I can do to attempt to get the "theoretical performance" of the analyzer?  What do these results suggest, in terms of what can be happening with real hardware that the analyzer does not account for?

 

Thanks!

6 Replies
afo
Adept I

Hi,

I think that you are not using enough kernels (work-items). As I understand it, an HD5870 has 64 threads in a wavefront, but two wavefronts run at the same time (while one computes, the other does memory access), so 128 threads are working at once. Your 1000 kernels would then be executed as 8 blocks of 2 wavefronts, so the setup time eats the performance that you are expecting.
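The arithmetic behind the "8 blocks" figure can be sketched as follows. The wavefront size of 64 and the "two wavefronts at a time" figure are the assumptions stated in this reply, not values checked against the hardware documentation; the function name is hypothetical.

```python
import math

WAVEFRONT_SIZE = 64        # threads per wavefront, as stated in the reply
WAVEFRONTS_IN_FLIGHT = 2   # assumption from the reply: one computes, one accesses memory


def blocks_needed(work_items):
    """Number of 2-wavefront blocks needed to cover the given work-items."""
    threads_per_block = WAVEFRONT_SIZE * WAVEFRONTS_IN_FLIGHT  # 128
    return math.ceil(work_items / threads_per_block)


print(blocks_needed(1000))
```

With so few blocks, any fixed per-launch cost would be a large share of the total time, which is the point this reply is making.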

As a reference: I execute 64000 kernels on one GPU. Each kernel is around 60 lines of C code (mainly rotates and logical operations, with 4 accesses to global memory per kernel), and each block of 64K kernels executes in 0.6 milliseconds.

Hope this helps,

Alfonso

I should have put in the number of work items:

kernel_1 223700

kernel_2 334950

kernel_3 671100

kernel_4 334950

Bear in mind that I run 1000 kernel launches. There are 4 kernels here, and each has this many work-items (launched in 1D).
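Combining these work-item counts with the throughput figures from the first post gives the size of the gap per kernel and the runtime per launch that each figure implies. This is only a rough sketch: the implied runtime is work-items divided by throughput, which is approximate because the reported "Real" value is a mean of per-launch rates. The function and variable names are hypothetical.

```python
# (analyzer M threads/s, real M threads/s, work-items per launch), from the posts above
results = {
    "kernel_1": (1190, 131, 223700),
    "kernel_2": (1861, 43, 334950),
    "kernel_3": (15797, 484, 671100),
    "kernel_4": (693, 19, 334950),
}


def summarize(data):
    """Return {kernel: (analyzer/real ratio, analyzer time in us, real time in us)}."""
    summary = {}
    for name, (analyzer, real, items) in data.items():
        ratio = analyzer / real
        # 1 M threads/s equals 1 thread per microsecond, so items / rate is in us.
        analyzer_us = items / analyzer
        real_us = items / real
        summary[name] = (ratio, analyzer_us, real_us)
    return summary


for name, (ratio, analyzer_us, real_us) in summarize(results).items():
    print(f"{name}: {ratio:.1f}x slower, "
          f"{analyzer_us:.0f} us predicted vs {real_us:.0f} us measured")
```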

Hi,

It seems that you have a good number of work-items, so in this case I would suspect that memory transfers are the bottleneck. Could you profile with the profiler that integrates with Visual Studio and see what happens?

best regards,

Alfonso

Unfortunately the profiler does not work with my code; I have a separate post in the tools forum attempting to figure that one out. I might be able to manipulate my code to get it to work, but that's a separate topic.

The real question I am asking is: what makes the stream analyzer think this is the performance? What is it in real hardware that skews the results so drastically? Whatever it is, that is likely the bottleneck in my code (which I have to find indirectly due to problems with the tools at the moment). If I could understand this, perhaps I could find some insight into things I can try to get my performance up.

aj_guillon,

It would be really nice if you could send a simple test case so that we can reproduce the issue.

aj_guillon,

SKA is a static profiling tool, and it does not take into account dynamic factors such as kernel launch overhead and memory latencies.
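To illustrate how costs that a static estimate ignores can pull measured throughput down, here is a toy model. It is an assumption made purely for illustration, not how SKA or the hardware actually works: every dynamic cost (launch overhead, memory stalls, and so on) is lumped into one hypothetical per-launch term `extra_s`.

```python
def effective_rate(work_items, static_rate, extra_s):
    """Throughput (work-items/s) when a launch costs its static time plus extra_s."""
    static_time_s = work_items / static_rate
    return work_items / (static_time_s + extra_s)


def implied_extra_s(work_items, static_rate, measured_rate):
    """Unmodeled time per launch (seconds) that would explain the measured rate."""
    return work_items / measured_rate - work_items / static_rate


# kernel_1 from this thread: 223700 work-items, 1190 M/s predicted, 131 M/s measured.
extra = implied_extra_s(223700, 1190e6, 131e6)
print(f"unexplained time per launch: {extra * 1e6:.0f} us")
```

Under this model, the unexplained time for kernel_1 is several times larger than the predicted runtime itself. The model cannot say which dynamic factor is responsible; that would need the profiler or a reduced test case.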
