The GPU Developer Tools Team is pleased to announce the release of a new tool, Stream KernelAnalyzer.
This is a tool for analyzing the performance of stream kernels on ATI graphics cards/stream processors (AMD Stream SDK 1.3 is required). It was derived from the GPU ShaderAnalyzer (GSA) tool to specifically target the stream community.
Features of the new tool:
Please do not hesitate to post on the forum if you have any questions.
Sincerely,
GPU Developer Tools Team
Stream KernelAnalyzer (SKA) uses the same analysis modules as GSA, but it has a different interface. For example, some graphics terms have been removed. It has better Brook compiler support (warning levels, etc.) and it supports FireStream series products.
Also, as of GSA 1.49, the support for Brook and IL has been removed from GSA.
What exactly are you looking for in a profiler that SKA/GSA doesn't provide?
Hi Ryta,
Thank you for your feedback.
To answer your questions:
1. No. Currently, we do not have a plan to support this.
2. That was one of the goals when we separated SKA from GSA. Obviously, we haven't done a good job on it yet, but fear not: SKA is still under development (and we plan to release SKA monthly). Perhaps you can help us identify the terminology that doesn't make sense in SKA? I'll try to get that fixed in the next release.
I agree with you that we need to do a better job on the documentation. Green does not necessarily mean good; it means that you are ALU bound instead of fetch bound. Ideally, you want to get the ALU:Fetch ratio as close to one as possible, as this means the system is balanced (you are utilizing both the ALU units and the fetch units in the hardware). So if you see red, it means you can add more ALU instructions without really impacting the performance of the kernel. Likewise, if you see green, you can add more fetch instructions (perhaps you can bake some of your computations into a texture/memory).
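To make the red/green reading concrete, here is a minimal Python sketch of such a classification; the function name and the exact threshold are illustrative assumptions, not SKA's actual code:

```python
def alu_fetch_report(alu_count, fetch_count):
    """Classify a kernel as ALU bound (green) or fetch bound (red).

    Hypothetical helper: SKA computes this internally; the names and
    behavior here are illustrative only.
    """
    ratio = alu_count / fetch_count
    if ratio > 1.0:
        color = "green"   # ALU bound: room to add fetch instructions
    elif ratio < 1.0:
        color = "red"     # fetch bound: room to add ALU instructions
    else:
        color = "balanced"
    return ratio, color

# A kernel with 46 ALU ops and 5 fetches is strongly ALU bound:
print(alu_fetch_report(46, 5))   # (9.2, 'green')
```

The usual reading, as described above: a green (high) ratio suggests baking ALU work into memory lookups, a red (low) ratio suggests the opposite.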
A texture fetch refers to a single memory access.
For the ALU:Fetch ratio, you want to be at ONE (it is a ratio). Yes, ONE means full occupancy.
High (green) numbers are bad too, as they mean the system is not balanced. Red just means fetch bound; it does not necessarily mean bad, and green means ALU bound. For example, it is better to be at 0.9 (red, but close to balanced) than at 10.0 (green).
The next version of SKA (due next week or so) should be able to handle the kernel above. Basically, we made major improvements in handling complex control flows in the analyzer recently.
For your other questions, I'll get back to it when I have more free time to respond.
Meanwhile, can you either post or send us (gputools.support@amd.com) the kernels with the specific problems you mentioned above?
Originally posted by: ryta1203 bpurnomo, I posted the kernel above; you should be able to just copy and paste it, no?
I was actually referring to the kernel that will compile but should not be supported by Brook+, etc.
Originally posted by: ryta1203
Originally posted by: bpurnomo For your other questions, I'll get back to it when I have more free time to respond.
Let me rephrase my Thread/Sec question, since it's obvious that the number of threads in the throughput depends entirely on the given kernel: Is the Threads/Sec marker a better indication of saturation than the ALU:Fetch marker? SKA has no real way of telling you what the saturation point is, so I guess you could continue to tweak forever, possibly wasting a lot of time just to make things worse. This is probably the most annoying feature of the tool.
I don't think Thread/Sec is a better indication than ALU:Fetch. ALU:Fetch guides developers on how to optimize their kernel (they can remove/add ALU/fetch instructions). Thread/Sec is directly related to the estimated cycles of the kernel. Also, keep in mind that the real throughput of the hardware is also affected by the number of registers/GPRs used by your kernel (which is not yet accounted for in the Thread/Sec calculation).
SKA does not yet take into account the number of GPRs used by the kernel in its computation. This is something that we might add in the future.
Basically, if your kernel uses a lot of GPRs, your performance will suffer. This is because the number of GPRs directly relates to the number of possible threads in flight (more GPRs per kernel = fewer threads). Having only a few threads in flight will impact performance, as GPUs rely on having many threads in flight to hide memory latency.
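As a rough illustration of that relationship, here is a Python sketch; the register-file size is an assumed round number for illustration, and real hardware allocates registers per wavefront and imposes additional caps on thread count:

```python
def threads_in_flight(gprs_per_thread, register_file_size=16384):
    """Rough upper bound on simultaneous threads sharing one
    register file. register_file_size is an assumption, not a
    documented figure for any specific part."""
    return register_file_size // gprs_per_thread

# Lowering GPR usage raises the number of threads available
# to hide memory latency:
print(threads_in_flight(13))  # 1260
print(threads_in_flight(9))   # 1820
```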
Originally posted by: ryta1203
EDIT: It's also important to note that my GPR usage has gone down with another example (ALU:Fetch going from 0.88 to 1.07 and GPR going from 13 to 9), and this increases the runtime of the program. This is what is confusing to me.
So, my point here is just that there are obviously multiple things that can affect performance, but it would be great to have a single measurable (along with all the above things) to tell exactly how close to full occupancy you are.
Exactly. However, how close you are to full occupancy is not the measure of your kernel's final run time.
Why? Consider the following example:
Let's say we have a hypothetical GPU with 1 ALU unit and 1 fetch unit. Consider the following two kernels, A and B.
Kernel A generates 100 ALU instructions and 100 Fetch instructions. Thus, its ALU:Fetch ratio is 1.
Kernel B generates 1 ALU instruction and 2 Fetch instructions. Thus, its ALU:Fetch ratio is 0.5.
While kernel A is more optimal in terms of using all the GPU resources (thus it is running at full occupancy), I think we can tell that kernel B's run time will be much better.
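The example can be made concrete with a small sketch. Assuming the two units overlap perfectly and ignoring fetch latency, the busier unit dominates the per-thread cycle count; this is an idealized model for the hypothetical GPU above, not SKA's actual estimator:

```python
def est_cycles(alu, fetch):
    """Idealized cycle estimate on a GPU with one ALU unit and one
    fetch unit, assuming perfect overlap: the busier unit dominates."""
    return max(alu, fetch)

ratio_a, cycles_a = 100 / 100, est_cycles(100, 100)  # kernel A
ratio_b, cycles_b = 1 / 2,     est_cycles(1, 2)      # kernel B
# Kernel B finishes ~50x sooner despite the "worse" 0.5 ratio.
print(ratio_a, cycles_a, ratio_b, cycles_b)
```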
Also, Micah is under the impression that a ONE ALU:Fetch ratio is not optimal and it depends on the texture fetch times, etc. Is this true?
It depends. Is it optimal for balancing the ALU and fetch resources? Yes, you can't get better than ONE. Is it optimal for the performance of the system? That depends on the number of kernels in flight (used to hide the latency of texture fetches), the total length of the instruction streams, etc. Please see the example above.
bpurnomo,
Thanks for the posts, much help!
1. Once again, I would like to plug my request for a run-time profiler. I think this could go a long way in promoting AMD Stream Computing. As it stands right now it's 1) easier to code for CUDA and 2) easier to improve performance for CUDA cards. The CUDA profiler gives great info to help developers achieve full occupancy, which brings me to my next point. ISA programming is needed to really gain performance from AMD cards.
Thank you for the suggestion. I'll pass this request to the team.
2. I think we have different definitions of "occupancy". Occupancy to me means that all the ALUs are being used all the time. In the compute world all I really care about is the ALUs, if the ALUs are being fully utilized then that's great, since I use the GPU for computing. If I can make performance increases that's great, but I want to make sure that all the ALUs are being used all the time, that's the goal.
I agree that I'm using the term occupancy a bit differently than you are. I apologize for the confusion. In my mind, the occupancy described in the previous post is the theoretical occupancy (not the actual occupancy on the GPU), which means we are not taking GPRs and other resources into account.
Because the number of GPRs has a direct effect on the number of threads in flight (to hide the memory latency), if you have a kernel that uses a high number of GPRs, you would want your ALU:Fetch ratio to be much higher than 1.0 (to offset the memory latency due to the lower number of threads in flight).
3. My only real question: What about measurables in SKA for wavefront size and/or threads in flight?
4. Thanks for the posts, great insight into SKA!!
This is not currently exposed/calculated. Please keep the good suggestions coming though as we are continually trying to improve this tool.
I am wondering if anyone wants to explain the other columns - what these metrics mean (does larger mean better or smaller mean better?) and what the target values are. The descriptions of each column in the Readme file really do not give much information.
Here is the list of the columns:
Name -- apparent
Code -- understood
Alu Instructions
Texture Instructions
Global Read Instructions
Interpolator Instructions
Control Flow Instructions
Global Write Instructions
Texture Dependancy Levels
General Purpose Registers
Min Cycles
Max Cycles
Avg Cycles
Estimated Cycles
Estimated Cycles(Bilinear)
Estimated Cycles(Trilinear)
Estimated Cycles(Aniso)
ALU:Fetch Ratio --- understood
ALU:Fetch Ratio(Bilinear)
ALU:Fetch Ratio(Trilinear)
ALU:Fetch Ratio(Aniso)
Bottleneck -- how is this determined?
Bottleneck(Bilinear)
Bottleneck(Trilinear)
Bottleneck(Aniso)
Avg Peak Throughput
Avg Peak Throughput(Bilinear)
Avg Peak Throughput(Trilinear)
Avg Peak Throughput(Aniso)
Avg Throughput Per Clock
Avg Throughput Per Clock(Bilinear)
Avg Throughput Per Clock(Trilinear)
Avg Throughput Per Clock(Aniso)
Max Scratch Registers
Edit: I meant to reply to this post, but accidentally edited it instead.
Originally posted by: FangQ While kernel A is more optimal in the term of using all the GPU resources (thus it is running at full occupancy), I think we can tell that kernel B's run-time will be much better.
I think for beginners like me, comments like this are very useful for understanding ALU:Fetch.
I actually find the statement quite confusing for a few reasons:
1) Full Occupancy, in CUDA terms, means 100% ALU Utilization, and that is what it should mean.
2) Why should it mean that? Because no one cares about fetching, we only care about computing. Computing is done by ALUs and hence if we can get 100% ALU Utilization we don't really care what the fetch units are doing. So I find the term occupancy, in the way AMD is using it, quite wrong and confusing.
3) Sadly, the ALU:Fetch ratio tells you nothing about the percentage of ALU Utilization.
4) If, supposedly, ONE is the ratio we are going for and it turns out it's not the best ratio for performance then why are we going for that to begin with, since all we care about is performance? This makes little sense.
The most useful, and really the only useful, thing about SKA is that it gives you the ISA. That's it. All those measurables ("columns") seem to be somewhat meaningless and misleading, considering they don't take GPR usage into account and therefore can't accurately predict overall system performance, only the performance of one thread.
I actually find the statement quite confusing for a few reasons:
I think we really need someone who has experience with GPU profiling to clarify things. Otherwise, I just feel awkward reading all these numbers without knowing what they can tell me.
I actually find the statement quite confusing for a few reasons:
I just meant that the comments seemed to give me more info than the literal word expansions in the Release notes.
Definitely, explaining the meaning of each item in the help file would be useful; it would be even more useful, as emphasized by your comment, to give guidance on how to interpret and use these metrics in code optimization.
1) Full Occupancy, in CUDA terms, means 100% ALU Utilization, and that is what it should mean.
I don't agree. Full occupancy should be different than 100% ALU Utilization unless the GPU only consists of ALU units.
2) Why should it mean that? Because no one cares about fetching, we only care about computing. Computing is done by ALUs and hence if we can get 100% ALU Utilization we don't really care what the fetch units are doing. So I find the term occupancy, in the way AMD is using it, quite wrong and confusing.
This is incorrect. Fetch/memory operations are as important as ALU operations. If your kernel is not performing any memory operations at all, then its performance might not be optimal. Some of the ALU operations can be replaced by a table/memory lookup instead, and you might end up with better overall performance. This is a standard optimization technique in the graphics world (for example, replacing long ALU computations, such as sqrt, with a table lookup).
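A rough illustration of the lookup-table idea, in Python rather than a kernel language; on the GPU the table would live in a texture/memory buffer and the lookup would be a fetch instruction, and the table size here is an arbitrary choice:

```python
import math

# Precompute sqrt over [0, 1) at 1024 samples. On a GPU this table
# would be baked into a texture and sampled with a fetch.
N = 1024
sqrt_table = [math.sqrt(i / N) for i in range(N)]

def sqrt_lookup(x):
    """Approximate sqrt(x) for x in [0, 1) by table lookup,
    trading ALU work for a memory fetch."""
    return sqrt_table[int(x * N)]

print(abs(sqrt_lookup(0.25) - 0.5) < 1e-12)  # True
```

Whether this wins depends on the balance discussed in this thread: it only helps if the kernel is ALU bound and the extra fetch latency can be hidden.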
3) Sadly, the ALU:Fetch ratio tells you nothing about the percentage of ALU Utilization.
ALU:Fetch ratio is not ALU utilization. They are two different terms.
4) If, supposedly, ONE is the ratio we are going for and it turns out it's not the best ratio for performance then why are we going for that to begin with, since all we care about is performance? This makes little sense.
I should clarify that ONE is the best ratio if you don't take fetch latency into account (if you can hide those latencies by having many threads in flight; this is a point I have made several times in this thread). However, in practice, the more complex your kernel (many ALU ops, many fetch ops, and a lot of GPRs, thus few threads in flight), the higher the ALU:Fetch ratio you should be shooting for.
I am wondering if anyone wants to explain the other columns - what these metrics mean (does larger mean better or smaller mean better?) and what the target values are. The descriptions of each column in the Readme file really do not give much information.
Here is the list of the columns:
Name -- appearent
Code -- understood
Alu Instructions
Texture Instructions
Global Read Instructions
Interpolator Instructions
Control Flow Instructions
Global Write Instructions
The ALU, Texture, Global Read, Interpolator, Control Flow, and Global Write columns give you the count of each type of operation. Thus, a smaller number means less work to be done by your kernel.
Texture Dependancy Levels
Smaller is better for this number. It counts how deep your texture/fetch dependency chains are (i.e., how long the chains of dependency between your fetch operations are). For example, your fetch/memory operation might depend on the result of another fetch/memory operation, which in turn depends on another fetch/memory operation, etc. (long dependency chains should usually be avoided).
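A tiny sketch of what "dependency level" means, using plain Python lists to stand in for GPU memory buffers (the data here is made up for illustration):

```python
# Indirection buffers: each access needs the previous fetch's result
# before it can be issued, so the fetches cannot overlap.
index_a = [2, 0, 1]
index_b = [1, 2, 0]
values  = [10, 20, 30]

def depth3_fetch(i):
    return values[index_b[index_a[i]]]   # depth 3: a -> b -> values

def depth1_fetch(i):
    return values[i]                     # depth 1: one independent fetch

print(depth3_fetch(0))  # 10
```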
General Purpose Registers
The number of registers used by your kernel. A smaller number is better. This one matters a lot for gauging your performance. This number has a direct relationship with the number of possible threads in flight at a time (a smaller number equals more threads in flight). If your kernel contains memory operations, then more threads in flight means more performance, since memory latency can be hidden by many threads (i.e., if a thread is blocked on the GPU because it has to wait on a memory fetch, another thread can be scheduled to run instead).
Min Cycles
Max Cycles
Avg Cycles
Estimated Cycles
Estimated Cycles(Bilinear)
Estimated Cycles(Trilinear)
Estimated Cycles(Aniso)
This is an estimated value (it doesn't take GPRs into account, and is thus highly inaccurate if you have high GPR usage and many fetch/memory ops) based on a magic formula.
Bilinear, trilinear, and aniso come from the graphics world and refer to how the memory fetch is performed (each fetch operation can perform more than one fetch, also retrieving adjacent memory locations for averaging/filtering calculations).
ALU:Fetch Ratio --- understood
ALU:Fetch Ratio(Bilinear)
ALU:Fetch Ratio(Trilinear)
ALU:Fetch Ratio(Aniso)
Bottleneck -- how is this determined?
Bottleneck(Bilinear)
Bottleneck(Trilinear)
Bottleneck(Aniso)
Bottleneck is computed based on the number of ALU, fetch, control flow, and interpolator instructions. Like the estimated cycle computation, this can be inaccurate.
Avg Peak Throughput
Avg Peak Throughput(Bilinear)
Avg Peak Throughput(Trilinear)
Avg Peak Throughput(Aniso)
Avg Throughput Per Clock
Avg Throughput Per Clock(Bilinear)
Avg Throughput Per Clock(Trilinear)
Avg Throughput Per Clock(Aniso)
Typically higher number is better.
Max Scratch Registers
Smaller number is better.
Also, remember the first rule of optimizing your system: find the bottleneck first, then improve the metric related to that bottleneck. If you improve a metric that is unrelated to your bottleneck, it will not improve the performance of the system. For example, say the bottleneck for your system is too many memory/fetch operations. Reducing the number of ALU operations won't improve the performance of your system in this case.
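A minimal sketch of that rule: estimate the cycles each unit spends and attack the slowest one. The unit names and throughput figures below are illustrative assumptions, not SKA's actual bottleneck model:

```python
def find_bottleneck(counts, throughput):
    """Cycles per unit = instruction count / instructions-per-cycle;
    the unit with the most cycles is the bottleneck."""
    cycles = {u: counts[u] / throughput[u] for u in counts}
    return max(cycles, key=cycles.get), cycles

# Made-up kernel stats and per-unit throughputs:
counts     = {"alu": 46, "fetch": 5, "control_flow": 13}
throughput = {"alu": 5.0, "fetch": 1.0, "control_flow": 1.0}
unit, cycles = find_bottleneck(counts, throughput)
print(unit)  # control_flow
```

Here, shaving ALU instructions would not help; only reducing control flow moves the estimated run time.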
Thank you so much, bpurnomo; this is very helpful.
As you said, I am trying to find the bottleneck of a Monte Carlo code I wrote recently. The code appears to be slightly slower than the CPU (Intel Q6700) with a 4650 card, which is unexpected.
The code contains 3 kernels; the first two simulate the movement of a particle, and the 3rd distributes the values of the particle to a 3D grid (using a scatter output; see my earlier post).
First of all, SKA gave me two options for each kernel, one with "_addr" and one without. What is the difference?
Using the versions without _addr, here are the stats for my 3 kernels:
kernel1: ALU:46,TEX:5,CF:13,GlobalWrite:1,GPR:10,ALU_Fetch:1.63,Avg:8.13,Thread\Clock:1.97
kernel2: ALU:68,TEX:3,CF:11,GlobalWrite:1,GPR:10,ALU_Fetch:2.19,Avg:3.28,Thread\Clock:2.44
kernel3: ALU:8,TEX:1,CF:7,GlobalWrite:2,GPR:5,ALU_Fetch:4,Avg:2,Thread\Clock:4.0
In your opinion, what is the key cause of the low performance? (Are the ALU and CF counts too high?)
(I also have a CUDA version of this code and have achieved >100x acceleration on an 8800GT card. Given that the 8800GT only has 112 stream processors and the 4650 has 320, I am expecting an even greater speed-up ratio. Is this a reasonable expectation?)
Originally posted by: bpurnomo
I don't agree. Full occupancy should be different than 100% ALU Utilization unless the GPU only consists of ALU units.
You are actually correct; I apologize for the confusion. It does in fact NOT mean this, only that the max number of warps are in flight, I believe, but I could be wrong.
This is incorrect. Fetch/memory operations are as important as ALU operations. If your kernel is not performing any memory operations at all, then its performance might not be optimal. Some of the ALU operations can be replaced by a table/memory lookup instead and you might end up with better overall performance. This is a standard optimization technique in the graphics world (for example replacing long ALU computations--such as sqrt--- with a table lookup instead).
Wouldn't this only be the case if the ALU computation time exceeds the fetch time?
ALU:Fetch ratio is not ALU utilization. They are two different terms.
Yes, this is why I said it.
I should clarify that ONE is the best ratio if you don't take fetch latency into account (if you can hide those latencies with having many threads in flight---this is a point that I have made several times in this thread). However, in practice, the more complex your kernel (with many ALU ops, fetch ops, and using a lot of GPRs---thus few threads in-flight) then the higher ALU:Fetch ratio you should be shooting for.
But can't you also have too many threads in flight?
Would it be possible to have SKA generate the FULL ISA? For example, SKA generates the IL header file, which can essentially be copied and pasted; would it be possible to have something like this for the ISA in SKA?
Also, in that same mind set, would it be possible to have an IL to ISA compiler in the SKA?
Originally posted by: ryta1203 Also, in that same mind set, would it be possible to have an IL to ISA compiler in the SKA?
You can already do this with the current version of SKA.
Originally posted by: bpurnomo Originally posted by: ryta1203 Also, in that same mind set, would it be possible to have an IL to ISA compiler in the SKA?
You can already do this with the current version of SKA.
Thank you, I wasn't aware of that. Same request for this as for Brook+, though: putting the ISA into copy-and-paste form (i.e., having headers and footers).
Originally posted by: ryta1203 Would it be possible to have the SKA generate FULL ISA? For example, the SKA generates the IL header file, which can essentially be copy and pasted, is it possible to have something like this for ISA in the SKA?
I'll add the request to our bug tracking system.
Originally posted by: bpurnomo Originally posted by: ryta1203 Would it be possible to have the SKA generate FULL ISA? For example, the SKA generates the IL header file, which can essentially be copy and pasted, is it possible to have something like this for ISA in the SKA?
I'll add the request to our bug tracking system.
Also, can you request that they add the "\n" after every line in the IL.h output?
I tried to simply copy and paste the IL, and it wouldn't compile without the "\n" after every line, so I had to manually add "\n" to EVERY line myself. This should be VERY easy to do and would save us a lot of time, thanks.
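For reference, a sketch of the manual fix being described: wrapping raw IL text as a C string-literal header with the trailing \n on each line. The IL snippet and the variable name are placeholders, and this is not SKA's actual output format:

```python
def il_to_header_lines(il_source, var_name="il_kernel"):
    """Emit a C header declaring the IL source as one string literal,
    with the \\n each line needs appended automatically."""
    body = "\n".join('"%s\\n"' % line for line in il_source.splitlines())
    return "const char* %s =\n%s;\n" % (var_name, body)

# Placeholder IL fragment, just to show the shape of the output:
il = "il_ps_2_0\ndcl_output_generic o0\nret_dyn\nend"
print(il_to_header_lines(il))
```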
This should be fixed in the next version.
Where does the extra GPR come from in the SKA?
For example, if the ISA only uses R0 and R1 then the SKA reports GPR=3.
If the ISA only uses R0 then the SKA reports GPR=2.
It seems that if n registers are used (including the Tx registers), then SKA reports n+1 GPRs.
I'm just wondering where the other GPR comes from.