Yes, I kind of figured that GPR usage played an important role. So none of the KSA measurables take GPR usage into account? It might be helpful to add this in the future, because without it the KSA is mostly useless as a tool to gauge the performance of a kernel. Isn't that the point of the KSA, or am I missing the point? Maybe I misunderstood the use of the KSA?
SKA basically gives you access to the ATI compiler. It uses the Brook+ compiler to compile Brook+ source files to IL. Then it calls the ATI Shader Compiler to compile the IL down to hardware disassembly for various ASICs and under various Catalyst drivers. While you can use the Brook+ compiler directly instead of SKA, you don't have access to the ATI Shader Compiler except through SKA or the ATI driver. In addition, SKA exposes some statistics generated by the Shader Compiler, such as the number of GPRs, ALU instructions, fetch instructions, etc. We also provide some heuristics to compute the estimated cycle times for your kernel. The heuristics are not perfect, as many factors affect the total performance. Please also keep in mind that SKA is a static analysis tool, not a run-time profiler, and thus has its own limitations.
How will all of the above be helpful to you as a game/stream developer?
1. You can tweak your kernel to achieve better performance by looking at the statistics generated by SKA. You should look at all the statistics instead of just focusing on one particular item. The ALU:Fetch ratio gives a hint of the balance of your system. You should also try to minimize the number of GPRs used. Finally, the estimated cycle times should also be low. Some developers also like to look at the hardware disassembly to gain a better understanding of how to tweak their IL kernel.
2. If you want to know how your kernel performs on a particular graphics card, you can use SKA to gauge the performance on that particular graphics card even without having access to the hardware.
3. Similarly, without having to install a new Catalyst driver, you will be able to tell whether a shader bug has been fixed or introduced in the new driver. Or, even better, whether there are performance improvements for your kernel/shader.
I hope this helps.
EDIT: It's also important to note that my GPR usage has gone down with another example (ALU:Fetch going from .88 to 1.07 and GPR going from 13 to 9) and this increases the runtime of the program. This is what is confusing to me.
Is the estimated cycle time higher in the second kernel? You can also post both kernels so we will be able to get a better idea of the problem.
So, my point here is just that there are obviously multiple things that can affect performance, but that it would be great to have a single measurable (along with all the above things) to tell exactly how close to full occupancy you are.
Exactly. However, how close you are to full occupancy is not a measure of the final run-time of your kernel.
Why? Consider the following example:
Let's say we have a hypothetical GPU with 1 ALU unit and 1 Fetch unit. Consider the following two kernels, A and B.
Kernel A generates 100 ALU instructions and 100 Fetch instructions. Thus, its ALU:Fetch ratio is 1.
Kernel B generates 1 ALU instruction and 2 Fetch instructions. Thus, its ALU:Fetch ratio is 0.5.
While kernel A is more optimal in terms of using all the GPU resources (thus it is running at full occupancy), I think we can tell that kernel B's run-time will be much better.
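To make the A/B comparison concrete, here is a minimal Python sketch of that hypothetical GPU. It is only an illustration, not how SKA computes its estimates. The `Kernel` class and its properties are made-up names, and the model assumes each unit issues one instruction per cycle, both units overlap perfectly, and fetch latency is ignored.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    name: str
    alu_instructions: int
    fetch_instructions: int

    @property
    def alu_fetch_ratio(self) -> float:
        # The ALU:Fetch ratio as used in the example above.
        return self.alu_instructions / self.fetch_instructions

    @property
    def cycles(self) -> int:
        # Assumption: 1 ALU unit and 1 Fetch unit run in parallel, one
        # instruction per cycle each, so the busier unit sets the run time.
        return max(self.alu_instructions, self.fetch_instructions)

    @property
    def alu_utilization(self) -> float:
        # Fraction of the run time during which the ALU unit is busy.
        return self.alu_instructions / self.cycles


kernel_a = Kernel("A", alu_instructions=100, fetch_instructions=100)
kernel_b = Kernel("B", alu_instructions=1, fetch_instructions=2)

for k in (kernel_a, kernel_b):
    print(f"Kernel {k.name}: ALU:Fetch = {k.alu_fetch_ratio:.2f}, "
          f"cycles = {k.cycles}, ALU busy = {k.alu_utilization:.0%}")
# Kernel A: ALU:Fetch = 1.00, cycles = 100, ALU busy = 100%
# Kernel B: ALU:Fetch = 0.50, cycles = 2, ALU busy = 50%
```

Under these assumptions kernel A keeps the ALU busy all the time yet takes 100 cycles, while kernel B leaves the ALU idle half the time and still finishes in 2 cycles.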
Also, Micah is under the impression that a ONE ALU:Fetch ratio is not optimal and it depends on the texture fetch times, etc. Is this true?
It depends. Is it optimal for balancing the ALU and Fetch resources? Yes, you can't get better than ONE. Is it optimal for the performance of the system? This depends on the number of kernels in flight (this is used to hide the latency of texture fetch), total length of instruction streams, etc. Please see the above example.
1. Once again, I would like to plug my request for a run-time profiler. I think this could go a long way in promoting AMD Stream Computing. As it stands right now it's 1) easier to code for CUDA and 2) easier to improve performance for CUDA cards. The CUDA profiler gives great info to help developers achieve full occupancy, which brings me to my next point. ISA programming is needed to really gain performance from AMD cards.
Thank you for the suggestion. I'll pass this request to the team.
2. I think we have different definitions of "occupancy". Occupancy to me means that all the ALUs are being used all the time. In the compute world all I really care about is the ALUs. If the ALUs are being fully utilized then that's great, since I use the GPU for computing. If I can make performance increases that's great, but I want to make sure that all the ALUs are being used all the time. That's the goal.
I agree that I'm using the term occupancy a bit differently from how you are using it. I apologize for the confusion. In my mind, the occupancy described in the previous post is the theoretical occupancy (not the actual occupancy in the GPU), which means we are not taking GPRs and other resources into account.
Because the number of GPRs has a direct effect on the number of threads in flight (to hide the memory latency), if you have a kernel that uses a high number of GPRs, you would want your ALU:Fetch ratio to be much higher than 1.0 (to offset the memory latency due to the lower number of threads in flight).
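The relationship between GPR count and threads in flight can be sketched roughly as follows. This is a simplified model, not something SKA reports. The register budget of 256 GPRs per thread slot is an assumed figure used only for illustration (the real value depends on the ASIC), and `max_wavefronts_in_flight` is a hypothetical helper name.

```python
# Assumption: a fixed register budget per thread slot that is shared by all
# wavefronts resident on a SIMD. 256 is illustrative, not a quoted spec.
GPR_BUDGET = 256


def max_wavefronts_in_flight(gprs_per_thread: int,
                             gpr_budget: int = GPR_BUDGET) -> int:
    """Upper bound on resident wavefronts imposed by register usage alone."""
    if gprs_per_thread <= 0:
        raise ValueError("gprs_per_thread must be positive")
    return gpr_budget // gprs_per_thread


for gprs in (9, 13, 32, 64):
    print(f"{gprs:2d} GPRs per thread -> at most "
          f"{max_wavefronts_in_flight(gprs)} wavefronts in flight")
```

In this model, going from 9 to 64 GPRs per thread cuts the register-limited wavefront count from 28 to 4, leaving far fewer threads available to cover fetch latency. It only captures the register limit; other hardware limits and the actual fetch latency are left out.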
3. My only real question: What about measurables in the KSA for wavefront size and/or threads in flight?
4. Thanks for the posts, great insight into KSA!!
This is not currently exposed/calculated. Please keep the good suggestions coming though as we are continually trying to improve this tool.
Originally posted by: dukeleto Hello, could I put in a request for a Linux version of SKA, or at least that attention be paid to allowing the combination brcc+SKA to work properly with Wine under Linux? Currently both programs install approximately but I cannot manage to get the SKA to find the (Windows version of) brcc. Thanks!
Originally posted by: ryta1203 Can we get the printing fixed in the next release of KSA? It would be great to be able to print out the information. Right now, the printing freezes KSA, doesn't print and ends in KSA not responding. Also, a Find/Replace function would be very nice.
I'll add all these requests into our bug tracking system so it can be prioritized for our future releases.
Originally posted by: ryta1203 YET, the 1st kernel runs twice as fast as the second kernel. The KSA gives NO clue, other than examining the ISA, as to the reason this is happening. The ISA is very similar for both kernels.
We hear you. We do believe that a run-time profiler would be a nice thing to have. I'm actually on your side.
However, it is not true that SKA gives NO clue at all for those two kernels. Without SKA, developers would have no idea why one is faster than the other. After all, the ISA is exposed by SKA.