The GPU Developer Tools Team is pleased to announce the release of a new tool, Stream KernelAnalyzer.
This is a tool for analyzing the performance of stream kernels on ATI graphics cards/stream processors (AMD Stream SDK 1.3 is required). It was derived from the GPU ShaderAnalyzer (GSA) tool to specifically target the stream community.
Features of the new tool:
Please do not hesitate to post on the forum if you have any questions.
Sincerely,
GPU Developer Tools Team
Stream KernelAnalyzer (SKA) uses the same analysis modules as GSA, but it has a different interface. For example, some graphics terms have been removed. It has better Brook compiler support (warning levels, etc.) and it supports FireStream series products.
Also, as of GSA 1.49, the support for Brook and IL has been removed from GSA.
What exactly are you looking for in a profiler that SKA/GSA doesn't provide?
Hi Ryta,
Thank you for your feedback.
To answer your questions:
1. No. Currently, we do not have a plan to support this.
2. That was one of the goals when we separated SKA from GSA. Obviously, we haven't done a good job on it yet, but fear not: SKA is still under development (and we plan to release SKA monthly). Perhaps you can help us identify the terminology that doesn't make sense in SKA? I'll try to get that fixed in the next release.
I agree with you that we need to do a better job on the documentation. Green does not necessarily mean good; it means that you are ALU bound instead of fetch bound. Ideally, you want to get the ALU:Fetch ratio as close to one as possible, as this means the system is balanced (you are utilizing both the ALU units and the fetch units in the hardware). So if you see red, it means you can add more ALU instructions without really impacting the performance of the kernel. Likewise, if you see green, you can add more fetch instructions (perhaps you can bake some of your computations into a texture/memory).
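To make the red/green reading concrete, here is a minimal Python sketch of such a classification; the function name and the exact threshold are illustrative assumptions, not SKA's actual code:

```python
def alu_fetch_report(alu_count, fetch_count):
    """Classify a kernel as ALU bound (green) or fetch bound (red).

    Hypothetical helper: SKA computes this internally; the names and
    behavior here are illustrative only.
    """
    ratio = alu_count / fetch_count
    if ratio > 1.0:
        color = "green"   # ALU bound: room to add fetch instructions
    elif ratio < 1.0:
        color = "red"     # fetch bound: room to add ALU instructions
    else:
        color = "balanced"
    return ratio, color

# A kernel with 46 ALU ops and 5 fetches is strongly ALU bound:
print(alu_fetch_report(46, 5))   # (9.2, 'green')
```

The usual reading, as described above: a green (high) ratio suggests baking ALU work into memory lookups, a red (low) ratio suggests the opposite.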
A texture fetch refers to a single memory access.
For the ALU:Fetch ratio, you want to be at ONE (it is a ratio). Yes, ONE means full occupancy.
High (green) numbers are bad too, as they mean the system is not balanced. Red just means fetch bound; it does not necessarily mean bad, and green means ALU bound. For example, it is better to be at 0.9 (red, but close to balanced) than at 10.0 (green).
The next version of SKA (due next week or so) should be able to handle the kernel above. Basically, we made major improvements in handling complex control flows in the analyzer recently.
For your other questions, I'll get back to it when I have more free time to respond.
Meanwhile, can you either post or send us (gputools.support@amd.com) the kernels with the specific problems you mentioned above?
Originally posted by: ryta1203 bpurnomo, I posted the kernel above; you should be able to just copy and paste it, no?
I was actually referring to the kernel that will compile but should not be supported by Brook+, etc.
Originally posted by: ryta1203
Originally posted by: bpurnomo For your other questions, I'll get back to it when I have more free time to respond.
Let me rephrase my Thread/Sec question, since it's obvious that the number of threads in the throughput depends entirely on the given kernel: Is the Threads/Sec marker a better indication of saturation than the ALU:Fetch marker? SKA has no real way of telling you what the saturation point is, so I guess you could continue to tweak forever, possibly wasting a lot of time just to make things worse. This is probably the most annoying feature of the tool.
I don't think Thread/Sec is a better indication than ALU:Fetch. ALU:Fetch guides developers on how to optimize their kernel (they can remove/add ALU/fetch instructions). Thread/Sec is directly related to the estimated cycles of the kernel. Also, keep in mind that the real throughput of the hardware is also affected by the number of registers/GPRs used by your kernel (which is not yet accounted for in the Thread/Sec calculation).
SKA does not yet take into account the number of GPRs used by the kernel in its computation. This is something that we might add in the future.
Basically, if your kernel uses a lot of GPRs, your performance will suffer. This is because the number of GPRs directly relates to the number of possible threads in flight (more GPRs per kernel = fewer threads). Having only a few threads in flight will impact performance, as GPUs rely on having many threads in flight to hide memory latency.
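As a rough illustration of that relationship, here is a Python sketch; the register-file size is an assumed round number for illustration, and real hardware allocates registers per wavefront and imposes additional caps on thread count:

```python
def threads_in_flight(gprs_per_thread, register_file_size=16384):
    """Rough upper bound on simultaneous threads sharing one
    register file. register_file_size is an assumption, not a
    documented figure for any specific part."""
    return register_file_size // gprs_per_thread

# Lowering GPR usage raises the number of threads available
# to hide memory latency:
print(threads_in_flight(13))  # 1260
print(threads_in_flight(9))   # 1820
```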
Originally posted by: ryta1203
EDIT: It's also important to note that my GPR usage has gone down with another example (ALU:Fetch going from 0.88 to 1.07 and GPR going from 13 to 9), and this increases the runtime of the program. This is what is confusing to me.
So, my point here is just that there are obviously multiple things that can affect performance, but it would be great to have a single measurable (along with all the above things) to tell exactly how close to full occupancy you are.
Exactly. However, how close you are to full occupancy is not the measure of your kernel's final run time.
Why? Consider the following example:
Let's say we have a hypothetical GPU with 1 ALU unit and 1 fetch unit. Consider the following two kernels, A and B.
Kernel A generates 100 ALU instructions and 100 Fetch instructions. Thus, its ALU:Fetch ratio is 1.
Kernel B generates 1 ALU instruction and 2 Fetch instructions. Thus, its ALU:Fetch ratio is 0.5.
While kernel A is more optimal in terms of using all the GPU resources (thus it is running at full occupancy), I think we can tell that kernel B's run time will be much better.
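The example can be made concrete with a small sketch. Assuming the two units overlap perfectly and ignoring fetch latency, the busier unit dominates the per-thread cycle count; this is an idealized model for the hypothetical GPU above, not SKA's actual estimator:

```python
def est_cycles(alu, fetch):
    """Idealized cycle estimate on a GPU with one ALU unit and one
    fetch unit, assuming perfect overlap: the busier unit dominates."""
    return max(alu, fetch)

ratio_a, cycles_a = 100 / 100, est_cycles(100, 100)  # kernel A
ratio_b, cycles_b = 1 / 2,     est_cycles(1, 2)      # kernel B
# Kernel B finishes ~50x sooner despite the "worse" 0.5 ratio.
print(ratio_a, cycles_a, ratio_b, cycles_b)
```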
Also, Micah is under the impression that a ONE ALU:Fetch ratio is not optimal and it depends on the texture fetch times, etc. Is this true?
It depends. Is it optimal for balancing the ALU and fetch resources? Yes, you can't get better than ONE. Is it optimal for the performance of the system? That depends on the number of kernels in flight (used to hide the latency of texture fetches), the total length of the instruction streams, etc. Please see the example above.
bpurnomo,
Thanks for the posts, much help!
1. Once again, I would like to plug my request for a run-time profiler. I think this could go a long way in promoting AMD Stream Computing. As it stands right now it's 1) easier to code for CUDA and 2) easier to improve performance for CUDA cards. The CUDA profiler gives great info to help developers achieve full occupancy, which brings me to my next point. ISA programming is needed to really gain performance from AMD cards.
Thank you for the suggestion. I'll pass this request to the team.
2. I think we have different definitions of "occupancy". Occupancy to me means that all the ALUs are being used all the time. In the compute world all I really care about is the ALUs, if the ALUs are being fully utilized then that's great, since I use the GPU for computing. If I can make performance increases that's great, but I want to make sure that all the ALUs are being used all the time, that's the goal.
I agree that I'm using the term occupancy a bit differently than you are. I apologize for the confusion. In my mind, the occupancy described in the previous post is the theoretical occupancy (not the actual occupancy on the GPU), which means we are not taking GPRs and other resources into account.
Because the number of GPRs has a direct effect on the number of threads in flight (to hide the memory latency), if you have a kernel that uses a high number of GPRs, you would want your ALU:Fetch ratio to be much higher than 1.0 (to offset the memory latency due to the lower number of threads in flight).
3. My only real question: What about measurables in SKA for wavefront size and/or threads in flight?
4. Thanks for the posts, great insight into SKA!!
This is not currently exposed/calculated. Please keep the good suggestions coming though as we are continually trying to improve this tool.
I am wondering if anyone wants to explain the other columns - what these metrics mean (does larger mean better or smaller mean better?) and what the target values are. The descriptions of each column in the Readme file really do not give much information.
Here is the list of the columns:
Name -- apparent
Code -- understood
Alu Instructions
Texture Instructions
Global Read Instructions
Interpolator Instructions
Control Flow Instructions
Global Write Instructions
Texture Dependancy Levels
General Purpose Registers
Min Cycles
Max Cycles
Avg Cycles
Estimated Cycles
Estimated Cycles(Bilinear)
Estimated Cycles(Trilinear)
Estimated Cycles(Aniso)
ALU:Fetch Ratio --- understood
ALU:Fetch Ratio(Bilinear)
ALU:Fetch Ratio(Trilinear)
ALU:Fetch Ratio(Aniso)
Bottleneck -- how is this determined?
Bottleneck(Bilinear)
Bottleneck(Trilinear)
Bottleneck(Aniso)
Avg Peak Throughput
Avg Peak Throughput(Bilinear)
Avg Peak Throughput(Trilinear)
Avg Peak Throughput(Aniso)
Avg Throughput Per Clock
Avg Throughput Per Clock(Bilinear)
Avg Throughput Per Clock(Trilinear)
Avg Throughput Per Clock(Aniso)
Max Scratch Registers
Edit: I meant to reply to this post, but accidentally edited it instead.
Originally posted by: FangQ While kernel A is more optimal in the term of using all the GPU resources (thus it is running at full occupancy), I think we can tell that kernel B's run-time will be much better.
I think for beginners like me, comments like this are very useful for understanding ALU:Fetch.
I actually find the statement quite confusing for a few reasons:
1) Full Occupancy, in CUDA terms, means 100% ALU Utilization, and that is what it should mean.
2) Why should it mean that? Because no one cares about fetching, we only care about computing. Computing is done by ALUs and hence if we can get 100% ALU Utilization we don't really care what the fetch units are doing. So I find the term occupancy, in the way AMD is using it, quite wrong and confusing.
3) Sadly, the ALU:Fetch ratio tells you nothing about the percentage of ALU Utilization.
4) If, supposedly, ONE is the ratio we are going for and it turns out it's not the best ratio for performance then why are we going for that to begin with, since all we care about is performance? This makes little sense.
The most useful, and really the only useful, thing about SKA is that it gives you the ISA. That's it. All those measurables ("columns") seem to be somewhat meaningless and misleading, considering they don't take GPR usage into account and therefore can't accurately predict overall system performance, only the performance of one thread.
I actually find the statement quite confusing for a few reasons:
I think we really need someone who has experience with GPU profiling to clarify things. Otherwise, I just feel awkward reading all these numbers without knowing what they can tell me.
I actually find the statement quite confusing for a few reasons:
I just meant that the comments seemed to give me more info than the literal word expansions in the Release notes.
Definitely, explaining the meaning of each item in the help file would be useful; it would be even more useful, as emphasized by your comment, to give guidance on how to interpret and use these metrics in code optimization.
1) Full Occupancy, in CUDA terms, means 100% ALU Utilization, and that is what it should mean.
I don't agree. Full occupancy should be different than 100% ALU Utilization unless the GPU only consists of ALU units.
2) Why should it mean that? Because no one cares about fetching, we only care about computing. Computing is done by ALUs and hence if we can get 100% ALU Utilization we don't really care what the fetch units are doing. So I find the term occupancy, in the way AMD is using it, quite wrong and confusing.
This is incorrect. Fetch/memory operations are as important as ALU operations. If your kernel is not performing any memory operations at all, then its performance might not be optimal. Some of the ALU operations can be replaced by a table/memory lookup instead, and you might end up with better overall performance. This is a standard optimization technique in the graphics world (for example, replacing long ALU computations, such as sqrt, with a table lookup).
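A rough illustration of the lookup-table idea, in Python rather than a kernel language; on the GPU the table would live in a texture/memory buffer and the lookup would be a fetch instruction, and the table size here is an arbitrary choice:

```python
import math

# Precompute sqrt over [0, 1) at 1024 samples. On a GPU this table
# would be baked into a texture and sampled with a fetch.
N = 1024
sqrt_table = [math.sqrt(i / N) for i in range(N)]

def sqrt_lookup(x):
    """Approximate sqrt(x) for x in [0, 1) by table lookup,
    trading ALU work for a memory fetch."""
    return sqrt_table[int(x * N)]

print(abs(sqrt_lookup(0.25) - 0.5) < 1e-12)  # True
```

Whether this wins depends on the balance discussed in this thread: it only helps if the kernel is ALU bound and the extra fetch latency can be hidden.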
3) Sadly, the ALU:Fetch ratio tells you nothing about the percentage of ALU Utilization.
ALU:Fetch ratio is not ALU utilization. They are two different terms.
4) If, supposedly, ONE is the ratio we are going for and it turns out it's not the best ratio for performance then why are we going for that to begin with, since all we care about is performance? This makes little sense.
I should clarify that ONE is the best ratio if you don't take fetch latency into account (if you can hide those latencies by having many threads in flight; this is a point I have made several times in this thread). However, in practice, the more complex your kernel (many ALU ops, many fetch ops, and a lot of GPRs, thus few threads in flight), the higher the ALU:Fetch ratio you should be shooting for.
I am wondering if anyone wants to explain the other columns - what these metrics mean (does larger mean better or smaller mean better?) and what the target values are. The descriptions of each column in the Readme file really do not give much information.
Here is the list of the columns:
Name -- appearent
Code -- understood
Alu Instructions
Texture Instructions
Global Read Instructions
Interpolator Instructions
Control Flow Instructions
Global Write Instructions
The ALU, Texture, Global Read, Interpolator, Control Flow, and Global Write columns give you the count of each type of operation. Thus, a smaller number means less work to be done by your kernel.
Texture Dependancy Levels
Smaller is better for this number. It counts how deep your texture/fetch dependency chains are (i.e., how long the chains of dependency between your fetch operations are). For example, your fetch/memory operation might depend on the result of another fetch/memory operation, which in turn depends on another fetch/memory operation, etc. (long dependency chains should usually be avoided).
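A tiny sketch of what "dependency level" means, using plain Python lists to stand in for GPU memory buffers (the data here is made up for illustration):

```python
# Indirection buffers: each access needs the previous fetch's result
# before it can be issued, so the fetches cannot overlap.
index_a = [2, 0, 1]
index_b = [1, 2, 0]
values  = [10, 20, 30]

def depth3_fetch(i):
    return values[index_b[index_a[i]]]   # depth 3: a -> b -> values

def depth1_fetch(i):
    return values[i]                     # depth 1: one independent fetch

print(depth3_fetch(0))  # 10
```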
General Purpose Registers
The number of registers used by your kernel. A smaller number is better. This one matters a lot for gauging your performance. This number has a direct relationship with the number of possible threads in flight at a time (a smaller number equals more threads in flight). If your kernel contains memory operations, then more threads in flight means more performance, since memory latency can be hidden by many threads (i.e., if a thread is blocked on the GPU because it has to wait on a memory fetch, another thread can be scheduled to run instead).
Min Cycles
Max Cycles
Avg Cycles
Estimated Cycles
Estimated Cycles(Bilinear)
Estimated Cycles(Trilinear)
Estimated Cycles(Aniso)
This is an estimated value (it doesn't take GPRs into account, and is thus highly inaccurate if you have high GPR usage and many fetch/memory ops) based on a magic formula.
Bilinear, trilinear, and aniso come from the graphics world and refer to how the memory fetch is performed (each fetch operation can perform more than one fetch, also retrieving adjacent memory locations for averaging/filtering calculations).
ALU:Fetch Ratio --- understood
ALU:Fetch Ratio(Bilinear)
ALU:Fetch Ratio(Trilinear)
ALU:Fetch Ratio(Aniso)
Bottleneck -- how is this determined?
Bottleneck(Bilinear)
Bottleneck(Trilinear)
Bottleneck(Aniso)
Bottleneck is computed based on the number of ALU, fetch, control flow, and interpolator instructions. Like the estimated cycle computation, this can be inaccurate.
Avg Peak Throughput
Avg Peak Throughput(Bilinear)
Avg Peak Throughput(Trilinear)
Avg Peak Throughput(Aniso)
Avg Throughput Per Clock
Avg Throughput Per Clock(Bilinear)
Avg Throughput Per Clock(Trilinear)
Avg Throughput Per Clock(Aniso)
Typically higher number is better.
Max Scratch Registers
Smaller number is better.
Also, remember the first rule of optimizing your system: find the bottleneck first, then improve the metric related to that bottleneck. If you improve a metric that is unrelated to your bottleneck, it will not improve the performance of the system. For example, say the bottleneck for your system is too many memory/fetch operations. Reducing the number of ALU operations won't improve the performance of your system in this case.
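A minimal sketch of that rule: estimate the cycles each unit spends and attack the slowest one. The unit names and throughput figures below are illustrative assumptions, not SKA's actual bottleneck model:

```python
def find_bottleneck(counts, throughput):
    """Cycles per unit = instruction count / instructions-per-cycle;
    the unit with the most cycles is the bottleneck."""
    cycles = {u: counts[u] / throughput[u] for u in counts}
    return max(cycles, key=cycles.get), cycles

# Made-up kernel stats and per-unit throughputs:
counts     = {"alu": 46, "fetch": 5, "control_flow": 13}
throughput = {"alu": 5.0, "fetch": 1.0, "control_flow": 1.0}
unit, cycles = find_bottleneck(counts, throughput)
print(unit)  # control_flow
```

Here, shaving ALU instructions would not help; only reducing control flow moves the estimated run time.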
Thank you so much, bpurnomo; this is very helpful.
As you said, I am trying to find the bottleneck of a Monte Carlo code I wrote recently. The code appears to be slightly slower than the CPU (Intel Q6700) with a 4650 card, which is unexpected.
The code contains 3 kernels; the first two simulate the movement of a particle, and the 3rd distributes the values of the particle to a 3D grid (using a scatter output; see my earlier post).
First of all, SKA gave me two options for each kernel, one with "_addr" and one without. What is the difference?
Using the versions without _addr, here are the stats for my 3 kernels:
kernel1: ALU:46,TEX:5,CF:13,GlobalWrite:1,GPR:10,ALU_Fetch:1.63,Avg:8.13,Thread\Clock:1.97
kernel2: ALU:68,TEX:3,CF:11,GlobalWrite:1,GPR:10,ALU_Fetch:2.19,Avg:3.28,Thread\Clock:2.44
kernel3: ALU:8,TEX:1,CF:7,GlobalWrite:2,GPR:5,ALU_Fetch:4,Avg:2,Thread\Clock:4.0
In your opinion, what is the key cause of the low performance? (Are the ALU and CF counts too high?)
(I also have a CUDA version of this code and have achieved >100x acceleration on an 8800GT card. Given that the 8800GT only has 112 stream processors and the 4650 has 320, I am expecting an even greater speed-up ratio. Is this a reasonable expectation?)
Originally posted by: bpurnomo
I don't agree. Full occupancy should be different than 100% ALU Utilization unless the GPU only consists of ALU units.
You are actually correct; I apologize for the confusion. It does in fact NOT mean this, only that the max number of warps are in flight, I believe, but I could be wrong.
This is incorrect. Fetch/memory operations are as important as ALU operations. If your kernel is not performing any memory operations at all, then its performance might not be optimal. Some of the ALU operations can be replaced by a table/memory lookup instead and you might end up with better overall performance. This is a standard optimization technique in the graphics world (for example replacing long ALU computations--such as sqrt--- with a table lookup instead).
Wouldn't this only be the case if the ALU computation time exceeds the fetch time?
ALU:Fetch ratio is not ALU utilization. They are two different terms.
Yes, this is why I said it.
I should clarify that ONE is the best ratio if you don't take fetch latency into account (if you can hide those latencies with having many threads in flight---this is a point that I have made several times in this thread). However, in practice, the more complex your kernel (with many ALU ops, fetch ops, and using a lot of GPRs---thus few threads in-flight) then the higher ALU:Fetch ratio you should be shooting for.
But can't you also have too many threads in flight?
Would it be possible to have SKA generate the FULL ISA? For example, SKA generates the IL header file, which can essentially be copied and pasted; would it be possible to have something like this for the ISA in SKA?
Also, in that same mind set, would it be possible to have an IL to ISA compiler in the SKA?
Originally posted by: ryta1203 Also, in that same mind set, would it be possible to have an IL to ISA compiler in the SKA?
You can already do this with the current version of SKA.
Originally posted by: bpurnomo Originally posted by: ryta1203 Also, in that same mind set, would it be possible to have an IL to ISA compiler in the SKA?
You can already do this with the current version of SKA.
Thank you, I wasn't aware of that. Same request for this as for Brook+, though: putting the ISA into copy-and-paste form (i.e., having headers and footers).
Originally posted by: ryta1203 Would it be possible to have the SKA generate FULL ISA? For example, the SKA generates the IL header file, which can essentially be copy and pasted, is it possible to have something like this for ISA in the SKA?
I'll add the request to our bug tracking system.
Originally posted by: bpurnomo Originally posted by: ryta1203 Would it be possible to have the SKA generate FULL ISA? For example, the SKA generates the IL header file, which can essentially be copy and pasted, is it possible to have something like this for ISA in the SKA?
I'll add the request to our bug tracking system.
Also, can you request that they add the "\n" after every line in the IL.h output?
I tried to simply copy and paste the IL, and it wouldn't compile without the "\n" after every line, so I had to manually add "\n" to EVERY line myself. This should be VERY easy to do and would save us a lot of time, thanks.
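For reference, a sketch of the manual fix being described: wrapping raw IL text as a C string-literal header with the trailing \n on each line. The IL snippet and the variable name are placeholders, and this is not SKA's actual output format:

```python
def il_to_header_lines(il_source, var_name="il_kernel"):
    """Emit a C header declaring the IL source as one string literal,
    with the \\n each line needs appended automatically."""
    body = "\n".join('"%s\\n"' % line for line in il_source.splitlines())
    return "const char* %s =\n%s;\n" % (var_name, body)

# Placeholder IL fragment, just to show the shape of the output:
il = "il_ps_2_0\ndcl_output_generic o0\nret_dyn\nend"
print(il_to_header_lines(il))
```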
This should be fixed in the next version.
Where does the extra GPR come from in the SKA?
For example, if the ISA only uses R0 and R1 then the SKA reports GPR=3.
If the ISA only uses R0 then the SKA reports GPR=2.
It seems that if n registers are used (including the Tx registers), then SKA reports n+1 GPRs.
I'm just wondering where the other GPR comes from.