I have enabled profiling in OpenCL and compared the actual throughput of my program with the throughput reported by the Stream KernelAnalyzer. I run each kernel 1000 times and compute the mean of (#threads in NDRange / time in nanoseconds) * 1e9, where time is the profiled kernel runtime, as described in AMD's OpenCL guide.
I have an HD5970, and I ran the analyzer for an HD5870. Here are the results (in M threads/s; Real is what I measure as above):
kernel_1 Analyzer: 1190 Real: 131
kernel_2 Analyzer: 1861 Real: 43
kernel_3 Analyzer: 15797 Real: 484
kernel_4 Analyzer: 693 Real: 19
Any ideas on what I can do to attempt to get the "theoretical performance" of the analyzer? What do these results suggest, in terms of what can be happening with real hardware that the analyzer does not account for?
Thanks!
Hi,
I think that you are not using enough kernels... As I understand it, an HD5870 has 64 threads in a wavefront, but two wavefronts run at the same time (while one computes, the other does memory access), so you have 128 threads working concurrently. Your 1000 kernels are therefore executed as 8 blocks of 2 wavefronts, and the setup time eats the performance you are expecting.
As a reference: I execute 64000 kernels on one GPU; each kernel is around 60 lines of C code (mainly rotates and logical operations, with 4 global-memory accesses per kernel), and each block of 64K kernels executes in 0.6 milliseconds.
Hope this helps,
Alfonso
I should have put in the number of work items:
kernel_1 223700
kernel_2 334950
kernel_3 671100
kernel_4 334950
Bear in mind... I run 1000 kernel launches... there are 4 kernels here, and each has this many work-items (launched in 1D).
Hi,
It seems that you have a good number of work-items, so in this case I would suspect that memory transfers are the bottleneck... Could you profile with the profiler that integrates with Visual Studio and see what happens?
best regards,
Alfonso
Unfortunately the profiler does not work with my code; I have a separate post in the tools forum attempting to figure that one out. I might be able to manipulate my code to get it to work, but that's a separate topic.
The real question I am asking is: what makes the stream analyzer think this is the performance? What is it about real hardware that skews the results so drastically? That is likely where the bottleneck in my code lies (which I have to find indirectly, given my current problems with the tools). If I could understand this, perhaps I could get some insight into what to try to improve my performance.
aj_guillon,
It would be really nice if you could send a simple test case so that we can reproduce the issue.
aj_guillon,
SKA is a static profiling tool; it does not take into account dynamic factors like kernel launch overhead and memory latencies.