
    Stream KernelAnalyzer is now available!

    bpurnomo

      The GPU Developer Tools Team is pleased to announce the release of a new tool: Stream KernelAnalyzer.

      This is a tool for analyzing the performance of stream kernels on ATI graphics cards and AMD stream processors (AMD Stream SDK 1.3 is required).  It was derived from the GPU ShaderAnalyzer tool specifically to target the stream computing community.

      Features of the new tool:

      • Support for Brook+ and IL kernels.
      • Support for IL compute kernels on ATI Radeon 4800 series graphics cards.
      • Support for ATI Stream SDK 1.3.
      • Support for AMD FireStream 9170, 9250 and 9270 stream processors.
      • Support for ATI Radeon 2000, 3000, and 4000 series graphics cards.

      Please do not hesitate to post on the forum if you have any questions.

      Sincerely,

      GPU Developer Tools Team

        • Stream KernelAnalyzer is now available!
          ryta1203
          Great news! Looking at this, how is it different from the GSA? It looks the same.

          BTW, are there plans for a profiler in the works?
            • Stream KernelAnalyzer is now available!
              bpurnomo

              Stream KernelAnalyzer (SKA) uses the same GSA analysis modules, but it has a different interface.  For example, some graphics terms have been removed.  It has better Brook+ compiler support (warning levels, etc.) and it supports the FireStream series products.

              Also, as of GSA 1.49, the support for Brook and IL has been removed from GSA.

              What exactly are you looking for in a profiler that SKA/GSA doesn't provide?

               

               

                • Stream KernelAnalyzer is now available!
                  ryta1203
                  1. Does the SKA provide runtime analysis?

                  2. And this is the much bigger part: Is it possible to tweak the terminology to make more sense to non-graphics people, and to add documentation describing what the data means?

                  I'm sitting here looking at this data but it really tells me nothing. When I first read this thread I thought you would be making an analyzer specific to GPGPU that would read that way; instead, this seems to just be a simpler GSA aimed at GPGPU. The terminology is confusing and tells non-graphics people little about what is going on with the kernel, IMO.

                  Am I missing something here?

                  Maybe better documentation would do the trick, I don't know. I really can't gather much from the SKA right now other than "red bad, green good".
                    • Stream KernelAnalyzer is now available!
                      bpurnomo

                      Hi Ryta,
                      Thank you for your feedback.

                      To answer your questions:

                      1. No.  Currently, we do not have a plan to support this. 

                      2.  That is one of the goals when we separated SKA from GSA.  Obviously, we haven't done a good job on it yet, but fear not, SKA is still under development (and we plan to release SKA updates monthly).  Perhaps you can help us identify the terminology that doesn't make sense in SKA, and I'll try to get that fixed in the next release.

                      I agree with you that we need to do a better job on the documentation.  Green does not necessarily mean good.  It means that you are ALU bound instead of fetch bound.  Ideally, you want to get the ALU:Fetch ratio as close to one as possible, as this means the system is balanced (you are utilizing both the ALU units and the fetch units in the hardware).  So if you see red, it means you can add more ALU instructions without really impacting the performance of the kernel.  Likewise, if you see green, you can add more fetch instructions (perhaps you can bake some of your computations into texture/memory).
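                      To make that reading concrete, here is a small C sketch of the interpretation above; the utilization model is an illustrative assumption, not SKA's actual formula:

#include <stdio.h>

/* Idealized reading of the ALU:Fetch ratio described above.  This is an
 * illustration of the interpretation, not SKA's actual formula: the busier
 * unit type is treated as ~100% utilized and the other in proportion to
 * the ratio. */
static void interpret_ratio(double r)
{
    double alu_util   = (r < 1.0) ? r : 1.0;        /* fetch bound: ALU has idle cycles   */
    double fetch_util = (r > 1.0) ? 1.0 / r : 1.0;  /* ALU bound: fetch has idle cycles   */
    const char *label = (r < 1.0) ? "red (fetch bound)"
                      : (r > 1.0) ? "green (ALU bound)"
                                  : "balanced";

    printf("ALU:Fetch %.2f -> %s, ALU ~%3.0f%% busy, fetch ~%3.0f%% busy\n",
           r, label, 100.0 * alu_util, 100.0 * fetch_util);
}

int main(void)
{
    interpret_ratio(0.50);  /* fetch bound: room to add ALU work for free      */
    interpret_ratio(1.00);  /* balanced: both unit types kept busy             */
    interpret_ratio(2.00);  /* ALU bound: room to add fetches (e.g. lookups)   */
    return 0;
}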

                       

                        • Stream KernelAnalyzer is now available!
                          ryta1203
                          Maybe it's just the terminology: when you say texture fetch, are you just referring to a memory access? Or a set of memory accesses?

                          So ONE is the number that you ideally want to be at, meaning that at ONE your kernel is running optimally according to the ratio of ALU to fetch instructions?

                          So, I guess really what would be great is to see some kind of occupancy (to use a "CUDA" term), or saturation. Does ONE mean full occupancy, or does it simply mean there exists a nice balance and that it's possible to increase performance and maintain that balance?

                          Please understand that I'm not asking the SKA to tell anything about optimal performance, but to tell that the kernel, the way it is now, is using all the resources and nothing is left to waste. Is this basically what the non-red numbers mean? That makes sense to me.

                          A lot of this might make more sense when documentation is done on performance improvements, I hope when they make that doc that they refer to the SKA and how to use it properly.
                            • Stream KernelAnalyzer is now available!
                              bpurnomo

                              A texture fetch refers to a single memory access.

                              For the ALU:Fetch ratio, you want to be at ONE (it is a ratio).  Yes, ONE means full occupancy.

                              High non-red numbers are bad, as that means the system is not balanced.  Red just means fetch bound; it does not necessarily mean bad, and green means ALU bound.  For example, it is better to be at 0.9 (red, close to balance) than at 10.0 (green).

                               

                                • Stream KernelAnalyzer is now available!
                                  ryta1203
                                  There seems to be an issue with the SKA:

                                  It accepts and analyzes local (kernel) arrays, which are not supported in Brook+. In fact, if you use a local array you will get better results (ALU:Fetch ratio) in the SKA, and the code will compile just fine. ALSO, the code will compile just fine in Brook+ in VS2005 (that's what I'm using, so I can't say about anything else) BUT will NOT produce correct results.

                                  In SKA, if I used temp[4] instead of temp0, temp1, temp2, temp3, I got a much better ALU:Fetch ratio. If I just used the four floats (i.e., temp0, ...) then my ALU:Fetch ratio didn't change. Let me know if you want to see the kernels I am talking about.

                                  I just think this should be fixed because it gives improper results and can be confusing to users.

                                  EDIT: I'm also VERY interested in having the Throughput column explained. What is the max throughput for the 4850/4870? It seems like the Throughput number would give you a more accurate account of how "saturated" you are than the ALU:Fetch ratio, correct? Should this number be 4000 M Threads/Sec for the 4870?

                                  ALSO, a new version should be coming out soon?

                                  ALSO, a Find/Replace would be very useful, IMO.
                                    • Stream KernelAnalyzer is now available!
                                      ryta1203
                                      For the following kernel I get N/A, yet the code compiles into both 4870 and IL assembly just fine. What is the issue?

kernel void advection1_s(float4 Fin1to4<>, float4 Fin5to8<>, float4 Fin9<>, float GEOs[], int gx,
                         int mx, int my, out float4 Fs9<>, out float4 Fs5to8<>, out float4 Fs1to4<>)
{
    int k, IB, xd, yd, xdt, ydt, x, y, idx;
    idx = instance().x;
    x = idx % gx;
    y = (int)floor((float)idx / (float)gx);

    // Bounce back at solid wall
    Fs1to4 = Fin1to4;
    Fs5to8 = Fin5to8;
    Fs9 = Fin9;

    if (GEOs[idx] == 1.0f)
    {
        // loop 0
        xd = x;
        yd = y;
        if ((xd < 1) || (xd > mx) || (yd < 1) || (yd > my))
        {
            xdt = x;
            ydt = y;
            if ((GEOs[xdt + gx*ydt] == 2.0f) || (GEOs[xdt + gx*ydt] == 3.0f))
            {
                Fs1to4.w = Fin1to4.w;
            }
        }
        // loop 1
        xd = x - 1;
        yd = y;
        if ((xd < 1) || (xd > mx) || (yd < 1) || (yd > my))
        {
            xdt = x + 1;
            ydt = y;
            if ((GEOs[xdt + gx*ydt] == 2.0f) || (GEOs[xdt + gx*ydt] == 3.0f))
            {
                Fs1to4.x = Fin1to4.z;
            }
        }
        // loop 2
        xd = x;
        yd = y - 1;
        if ((xd < 1) || (xd > mx) || (yd < 1) || (yd > my))
        {
            xdt = x;
            ydt = y + 1;
            if ((GEOs[xdt + gx*ydt] == 2.0f) || (GEOs[xdt + gx*ydt] == 3.0f))
            {
                Fs1to4.y = Fin5to8.w;
            }
        }
        // loop 3
        xd = x + 1;
        yd = y;
        if ((xd < 1) || (xd > mx) || (yd < 1) || (yd > my))
        {
            xdt = x - 1;
            ydt = y;
            if ((GEOs[xdt + gx*ydt] == 2.0f) || (GEOs[xdt + gx*ydt] == 3.0f))
            {
                Fs1to4.z = Fin1to4.x;
            }
        }
        // loop 4
        xd = x;
        yd = y + 1;
        if ((xd < 1) || (xd > mx) || (yd < 1) || (yd > my))
        {
            xdt = x;
            ydt = y - 1;
            if ((GEOs[xdt + gx*ydt] == 2.0f) || (GEOs[xdt + gx*ydt] == 3.0f))
            {
                Fs5to8.w = Fin1to4.y;
            }
        }
        // loop 5
        xd = x - 1;
        yd = y - 1;
        if ((xd < 1) || (xd > mx) || (yd < 1) || (yd > my))
        {
            xdt = x + 1;
            ydt = y + 1;
            if ((GEOs[xdt + gx*ydt] == 2.0f) || (GEOs[xdt + gx*ydt] == 3.0f))
            {
                Fs5to8.x = Fin5to8.z;
            }
        }
        // loop 6
        xd = x + 1;
        yd = y - 1;
        if ((xd < 1) || (xd > mx) || (yd < 1) || (yd > my))
        {
            xdt = x - 1;
            ydt = y + 1;
            if ((GEOs[xdt + gx*ydt] == 2.0f) || (GEOs[xdt + gx*ydt] == 3.0f))
            {
                Fs5to8.y = Fin9.w;
            }
        }
        // loop 7
        xd = x + 1;
        yd = y + 1;
        if ((xd < 1) || (xd > mx) || (yd < 1) || (yd > my))
        {
            xdt = x - 1;
            ydt = y - 1;
            if ((GEOs[xdt + gx*ydt] == 2.0f) || (GEOs[xdt + gx*ydt] == 3.0f))
            {
                Fs5to8.z = Fin5to8.x;
            }
        }
        // loop 8
        xd = x - 1;
        yd = y + 1;
        if ((xd < 1) || (xd > mx) || (yd < 1) || (yd > my))
        {
            xdt = x + 1;
            ydt = y - 1;
            if ((GEOs[xdt + gx*ydt] == 2.0f) || (GEOs[xdt + gx*ydt] == 3.0f))
            {
                Fs9.w = Fin5to8.y;
            }
        }
    }
}
                                        • Stream KernelAnalyzer is now available!
                                          bpurnomo

                                          The next version of SKA (due next week or so) should be able to handle the kernel above.  Basically, we recently made major improvements in how the analyzer handles complex control flow.

                                          For your other questions, I'll get back to it when I have more free time to respond.

                                          Meanwhile, can you either post or send us (gputools.support@amd.com) the kernels with the specific problem you mentioned above?

                                           

                                           

                                           

                                            • Stream KernelAnalyzer is now available!
                                              ryta1203
                                              bpurnomo,

                                              I posted the kernel above; you should be able to just copy and paste it, no?
                                                • Stream KernelAnalyzer is now available!
                                                  bpurnomo

                                                   

                                                  Originally posted by: ryta1203 bpurnomo, I posted the kernel above, you should be able to just copy and paste it no?


                                                  I was actually referring to the kernel that will compile but should not be supported by Brook+, etc.

                                                   

                                                    • Stream KernelAnalyzer is now available!
                                                      ryta1203
                                                      OK, here are three kernels and data for comparison:

                                                      1st Kernel
kernel void macro_s(float4 fin1to4<>, float4 fin5to8<>, float4 fin9<>, float pin<>, float uin<>, float vin<>,
                    float GEOs<>, int e[], out float ps<>, out float us<>, out float vs<>)
{
    if (GEOs == 2 || GEOs == 3)
    {
        ps = 0.0f;
        us = 0.0f;
        vs = 0.0f;

        ps = fin1to4.w + fin1to4.x + fin1to4.y + fin1to4.z + fin5to8.w + fin5to8.x + fin5to8.y +
             fin5to8.z + fin9.w;
        us = fin1to4.w*e[0+9*0] + fin1to4.x*e[1+9*0] + fin1to4.y*e[2+9*0] + fin1to4.z*e[3+9*0] + fin5to8.w*e[4+9*0]
             + fin5to8.x*e[5+9*0] + fin5to8.y*e[6+9*0] + fin5to8.z*e[7+9*0] + fin9.w*e[8+9*0];
        vs = fin1to4.w*e[0+9*1] + fin1to4.x*e[1+9*1] + fin1to4.y*e[2+9*1] + fin1to4.z*e[3+9*1] + fin5to8.w*e[4+9*1]
             + fin5to8.x*e[5+9*1] + fin5to8.y*e[6+9*1] + fin5to8.z*e[7+9*1] + fin9.w*e[8+9*1];
    }
}

                                                      1st kernel Data
                                                      Name,GPR,Scratch Reg,Min,Max,Avg,Est Cycles,ALU:Fetch,BottleNeck,%s\Clock,Throughput,GlobalRead,GlobalWrite,CF,ALU,TEX
                                                      Radeon HD 4870,23,0,3.00,12.32,5.21,4.60,0.65,Texture Fetch,3.48,2609 M Threads\Sec,0,1,10,44,22


                                                      2nd Kernel:
kernel void macro_s(float4 fin1to4<>, float4 fin5to8<>, float4 fin9<>, float pin<>, float uin<>, float vin<>,
                    float GEOs<>, int e[], out float ps<>, out float us<>, out float vs<>)
{
    float t0, t1;
    if (GEOs == 2 || GEOs == 3)
    {
        ps = 0.0f;
        us = 0.0f;
        vs = 0.0f;

        ps = fin1to4.w + fin1to4.x + fin1to4.y + fin1to4.z + fin5to8.w + fin5to8.x + fin5to8.y +
             fin5to8.z + fin9.w;
        t0 = fin1to4.w*e[0+9*0] + fin1to4.x*e[1+9*0] + fin1to4.y*e[2+9*0] + fin1to4.z*e[3+9*0] + fin5to8.w*e[4+9*0];
        t1 = fin5to8.x*e[5+9*0] + fin5to8.y*e[6+9*0] + fin5to8.z*e[7+9*0] + fin9.w*e[8+9*0];
        us = t0 + t1;
        t0 = fin1to4.w*e[0+9*1] + fin1to4.x*e[1+9*1] + fin1to4.y*e[2+9*1] + fin1to4.z*e[3+9*1] + fin5to8.w*e[4+9*1];
        t1 = fin5to8.x*e[5+9*1] + fin5to8.y*e[6+9*1] + fin5to8.z*e[7+9*1] + fin9.w*e[8+9*1];
        vs = t0 + t1;
    }
}

                                                      2nd Kernel Data:
                                                      Name,GPR,Scratch Reg,Min,Max,Avg,Est Cycles,ALU:Fetch,BottleNeck,%s\Clock,Throughput,GlobalRead,GlobalWrite,CF,ALU,TEX
                                                      Radeon HD 4870,23,0,3.00,12.32,5.21,4.60,0.65,Texture Fetch,3.48,2609 M Threads\Sec,0,1,10,45,22

                                                      FINAL Kernel with local arrays:
kernel void macro_s(float4 fin1to4<>, float4 fin5to8<>, float4 fin9<>, float pin<>, float uin<>, float vin<>,
                    float GEOs<>, int e[], out float ps<>, out float us<>, out float vs<>)
{
    float t[2];
    if (GEOs == 2 || GEOs == 3)
    {
        ps = 0.0f;
        us = 0.0f;
        vs = 0.0f;

        ps = fin1to4.w + fin1to4.x + fin1to4.y + fin1to4.z + fin5to8.w + fin5to8.x + fin5to8.y +
             fin5to8.z + fin9.w;
        t[0] = fin1to4.w*e[0+9*0] + fin1to4.x*e[1+9*0] + fin1to4.y*e[2+9*0] + fin1to4.z*e[3+9*0] + fin5to8.w*e[4+9*0];
        t[1] = fin5to8.x*e[5+9*0] + fin5to8.y*e[6+9*0] + fin5to8.z*e[7+9*0] + fin9.w*e[8+9*0];
        us = t[0] + t[1];
        t[0] = fin1to4.w*e[0+9*1] + fin1to4.x*e[1+9*1] + fin1to4.y*e[2+9*1] + fin1to4.z*e[3+9*1] + fin5to8.w*e[4+9*1];
        t[1] = fin5to8.x*e[5+9*1] + fin5to8.y*e[6+9*1] + fin5to8.z*e[7+9*1] + fin9.w*e[8+9*1];
        vs = t[0] + t[1];
    }
}

                                                      FINAL Kernel Data:
                                                      Name,GPR,Scratch Reg,Min,Max,Avg,Est Cycles,ALU:Fetch,BottleNeck,%s\Clock,Throughput,GlobalRead,GlobalWrite,CF,ALU,TEX
                                                      Radeon HD 4870,7,0,3.00,3.00,3.00,3.00,1.88,Global Write,5.33,4000 M Threads\Sec,0,1,5,15,4



                                                      NOW, you can see that the GPR count has gone from 23 to 7, the ALU:Fetch ratio has gone from 0.65 to 1.88, and the Throughput has gone from 2609 M Threads/sec to 4000 M Threads/sec.

                                                      Now, even if SKA is ignoring certain code to get these results and is simply choosing not to generate unsupported code, I believe that the user should get an error saying that certain code is unsupported; the code should not compile and give results.

                                                      Meanwhile, my other kernel listed above is correct (it compiles and runs correctly with proper output) but produces only N/A for the data. I'm glad this is possibly going to be fixed in the next version!
                                                  • Stream KernelAnalyzer is now available!
                                                    ryta1203
                                                    Originally posted by: bpurnomo



                                                    For your other questions, I'll get back to it when I have more free time to respond.


                                                    Let me rephrase my Thread/Sec question since it's obvious that the number of threads in throughput depends entirely on the given kernel:

                                                    Is the Threads/Sec marker a better indication of saturation than the ALU:Fetch marker?

                                                    The SKA has no real way of telling you what the saturation point is, so I guess you could continue to tweak forever, possibly wasting a lot of time just to make things worse. This is probably the most annoying thing about the tool.
                                                      • Stream KernelAnalyzer is now available!
                                                        bpurnomo

                                                         

                                                        Originally posted by: ryta1203
                                                        Originally posted by: bpurnomo For your other questions, I'll get back to it when I have more free time to respond.
                                                        Let me rephrase my Thread/Sec question since it's obvious that the number of threads in throughput depends entirely on the given kernel: Is the Threads/Sec marker a better indication of saturation than the ALU:Fetch marker? The KSA has no real way of telling you what the saturation point is, so I guess you could continue to tweak forever, possibly wasting a lot of time just to make things worse. This is probably the most annoying feature of the tool.


                                                        I don't think Threads/Sec is a better indication than ALU:Fetch.  ALU:Fetch will guide developers on how to optimize their kernel (they can remove/add ALU/fetch instructions).  Threads/Sec is directly related to the estimated cycles of the kernel.   Also, keep in mind that the real throughput of the hardware is also affected by the number of registers/GPRs used by your kernel (which is not yet accounted for in the Threads/Sec calculation).
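                                                        As a side note, the figures in the CSV data posted earlier look consistent with the Throughput column simply being the per-clock thread rate scaled by the engine clock (750 MHz on the HD 4870). A quick C check of that reading, with the scaling treated as an assumption rather than a documented SKA formula:

#include <stdio.h>

/* Back-of-the-envelope check using the data posted earlier in this thread.
 * Assumption (not a documented SKA formula):
 *   Throughput [M threads/sec] ~= (threads per clock) * engine clock [MHz]
 * The HD 4870 engine clock is 750 MHz. */
int main(void)
{
    const double engine_clock_mhz = 750.0;

    double per_clock[]          = { 3.48, 5.33 };      /* %s\Clock column     */
    double reported_mthreads[]  = { 2609.0, 4000.0 };  /* Throughput column   */

    for (int i = 0; i < 2; ++i) {
        double estimate = per_clock[i] * engine_clock_mhz;  /* M threads/sec */
        printf("per-clock %.2f -> estimated %.0f M threads/sec (SKA reports %.0f)\n",
               per_clock[i], estimate, reported_mthreads[i]);
    }
    return 0;
}

                                                        (3.48 x 750 is about 2610 and 5.33 x 750 is about 3998, which matches the 2609 and 4000 M Threads/Sec figures up to rounding of the per-clock value.)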

                                                         

                                                          • Stream KernelAnalyzer is now available!
                                                            ryta1203
                                                            It would seem to me that a good indication would be to tell the developers how much of the GPU is being used.

                                                            So, to reiterate, if the ALU:Fetch ratio is "1.00", are all the SPs on the GPU being utilized 100%? If the ALU:Fetch ratio doesn't give an indication of that, then that is what is needed: some percentage of the SPs being used.
                                                            • Stream KernelAnalyzer is now available!
                                                              ryta1203
                                                              Originally posted by: bpurnomo

                                                              Originally posted by: ryta1203
                                                              Originally posted by: bpurnomo For your other questions, I'll get back to it when I have more free time to respond.
                                                              Let me rephrase my Thread/Sec question since it's obvious that the number of threads in throughput depends entirely on the given kernel: Is the Threads/Sec marker a better indication of saturation than the ALU:Fetch marker? The KSA has no real way of telling you what the saturation point is, so I guess you could continue to tweak forever, possibly wasting a lot of time just to make things worse. This is probably the most annoying feature of the tool.





                                                              I don't think Thread/Sec is a better indication than ALU:Fetch.  ALU:Fetch will guide developers how to optimize their kernel (they can remove/add more ALU/Fetch instructions).  Thread/Sec is directly related to the estimated cycles of the kernel.   Also, keep in mind that the real throughput of the hardware is also affected by the number of registers/GPRS used by your kernel (which is not accounted yet by the Thread/Sec's calculation).




                                                               



                                                              I have a kernel where, if I make a slight modification to it, the GPR count goes from 13 to 23; HOWEVER, the ALU:Fetch ratio goes from 0.88 to 1.08.

                                                              If I go by the ALU:Fetch ratio as the mark of optimization, then I should use the 1.08 kernel, since it's closer to 1 and not in red or green. The kernel did not run any faster.

                                                              I have other kernels that act much the same way, where I can modify them to get a better ALU:Fetch ratio (closer to 1) but get no speedup.

                                                              I am calling every kernel the same number of times (there are no branches taken in between kernel calls).
                                                                • Stream KernelAnalyzer is now available!
                                                                  bpurnomo

                                                                  SKA doesn't take into account the number of GPRs used by the kernel in its computations.    This is something that we might add in the future.

                                                                  Basically, if your kernel uses a lot of GPRs, your performance will suffer.   This is because the number of GPRs directly relates to the number of possible threads in flight (more GPRs per kernel = fewer threads).  Having only a few threads in flight will impact performance, as GPUs rely on having many threads in flight to hide memory latency.
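                                                                  As a rough illustration of that relationship, here is a small C sketch; the per-SIMD register-file size and the wavefront size are assumptions for this class of hardware, not values reported by SKA:

#include <stdio.h>

/* Rough illustration of why more GPRs per thread means fewer threads in
 * flight.  The constants are assumptions, not values reported by SKA:
 * a per-SIMD register file of 16384 four-component registers and a
 * wavefront of 64 threads.  Real hardware imposes additional caps. */
int main(void)
{
    const int regs_per_simd  = 16384;  /* assumed register-file size per SIMD */
    const int wavefront_size = 64;     /* threads per wavefront               */

    int gpr_counts[] = { 7, 13, 23 };  /* GPR counts seen in this thread      */

    for (int i = 0; i < 3; ++i) {
        int gprs       = gpr_counts[i];
        int wavefronts = regs_per_simd / (wavefront_size * gprs);
        printf("%2d GPRs -> at most %2d wavefronts (%4d threads) in flight per SIMD\n",
               gprs, wavefronts, wavefronts * wavefront_size);
    }
    return 0;
}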

                                                                    • Stream KernelAnalyzer is now available!
                                                                      ryta1203
                                                                      Originally posted by: bpurnomo

                                                                      SKA hasn't taken account the number of GPRs used by the kernel in its computation.    This is something that we might add in the future.


                                                                      Basically, if your kernel uses a lot of GPRs, your performance will suffer.   This is because the number of GPRs directly relates to the number of possible threads in flight (more GPRs per kernel = less threads).  Having only a few threads in flight will impact performance as GPUs rely on having many threads in flight to hide memory latency.


                                                                      Yes, I kind of figured that GPR usage played an important role. So none of the SKA measurables take GPR usage into account?

                                                                      It might be helpful to add this in the future, because without it the SKA is mostly useless as a tool to gauge the performance of a kernel. Isn't that the point of the SKA, or am I missing the point? Maybe I misunderstood the intended use of the SKA?

                                                                      EDIT: It's also important to note that my GPR usage has gone down with another example (the ALU:Fetch ratio going from 0.88 to 1.07 and the GPR count going from 13 to 9), and this increases the runtime of the program. This is what is confusing to me.
                                                                        • Stream KernelAnalyzer is now available!
                                                                          ryta1203
                                                                          Originally posted by: ryta1203



                                                                          EDIT: It's also important to note that my GPR usage has gone down with the another example (ALU:Fetch going from .88 to 1.07 and GPR going from 13 to 9) and this increases the runtime of the program. This is what is confusing to me.


                                                                          So, my point here is just that there are obviously multiple things that can affect performance, but it would be great to have a single measurable (along with all the above things) to tell exactly how close to full occupancy you are.

                                                                          Also, Micah is under the impression that a ONE ALU:Fetch ratio is not optimal and it depends on the texture fetch times, etc. Is this true?

                                                                            • Stream KernelAnalyzer is now available!
                                                                              bpurnomo

                                                                               

                                                                              So, my point here is just that there are obviously multiple things that can effect performance but that it would be great to have a single measurable (along with all the above things) to tell exactly how close to full occupancy you are.


                                                                              Exactly.   However, how close you are to full occupancy is not the measure of the final run-time of your kernel.

                                                                              Why?  Consider the following example:

                                                                              Let's say we have a hypothetical GPU with 1 ALU unit and 1 fetch unit.   Consider the following two kernels, A and B.

                                                                              Kernel A generates 100 ALU instructions and 100 Fetch instructions.  Thus, its ALU:Fetch ratio is 1.

                                                                              Kernel B generates 1 ALU instruction and 2 fetch instructions.  Thus, its ALU:Fetch ratio is 0.5.

                                                                              While kernel A is more optimal in terms of using all the GPU resources (thus it is running at full occupancy), I think we can tell that kernel B's run-time will be much better.
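                                                                              Here is that arithmetic as a small C sketch, using the hypothetical 1 ALU unit / 1 fetch unit GPU above; the max() cost model is only an illustration of the argument, not SKA's actual cycle estimator:

#include <stdio.h>

/* Hypothetical GPU from the example above: 1 ALU unit and 1 fetch unit,
 * each retiring one instruction per cycle.  Per-thread cost is modeled as
 * the busier of the two units; this illustrates the argument, it is not
 * SKA's actual cycle estimator. */
static void compare(const char *name, int alu_ops, int fetch_ops)
{
    double ratio  = (double)alu_ops / fetch_ops;
    int    cycles = alu_ops > fetch_ops ? alu_ops : fetch_ops;

    printf("kernel %s: ALU:Fetch = %.2f, ~%d cycles per thread\n",
           name, ratio, cycles);
}

int main(void)
{
    compare("A", 100, 100);  /* perfectly balanced, but 100 cycles of work   */
    compare("B",   1,   2);  /* "only" 0.5, yet ~2 cycles per thread         */
    return 0;
}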

                                                                               

                                                                              Also, Micah is under the impression that a ONE ALU:Fetch ratio is not optimal and it depends on the texture fetch times, etc. Is this true?


                                                                              It depends.   Is it optimal for balancing the ALU and fetch resources? Yes, you can't get better than ONE.   Is it optimal for the performance of the system? That depends on the number of threads in flight (which is used to hide the latency of texture fetches), the total length of the instruction streams, etc.  Please see the example above.

                                                                               

                                                                                • Stream KernelAnalyzer is now available!
                                                                                  ryta1203
                                                                                  bpurnomo,

                                                                                  Thanks for the posts, much help!

                                                                                  1. Once again, I would like to plug my request for a run-time profiler. I think this could go a long way in promoting AMD Stream Computing. As it stands right now it's 1) easier to code for CUDA and 2) easier to improve performance for CUDA cards. The CUDA profiler gives great info to help developers achieve full occupancy, which brings me to my next point. ISA programming is needed to really gain performance from AMD cards.

                                                                                  2. I think we have different definitions of "occupancy". Occupancy to me means that all the ALUs are being used all the time. In the compute world all I really care about is the ALUs, if the ALUs are being fully utilized then that's great, since I use the GPU for computing. If I can make performance increases that's great, but I want to make sure that all the ALUs are being used all the time, that's the goal.

                                                                                  3. My only real question: What about measurables in the SKA for wavefront size and/or threads in flight?

                                                                                  4. Thanks for the posts, great insight into SKA!
                                                                                    • thanks
                                                                                      tanja1

                                                                                      bpurnomo,

                                                                                      Thanks for the posts, much help!

                                                                                      • Stream KernelAnalyzer is now available!
                                                                                        bpurnomo

                                                                                         

                                                                                        1. Once again, I would like to plug my request for a run-time profiler. I think this could go a long way in promoting AMD Stream Computing. As it stands right now it's 1) easier to code for CUDA and 2) easier to improve performance for CUDA cards. The CUDA profiler gives great info to help developers achieve full occupancy, which brings me to my next point. ISA programming is needed to really gain performance from AMD cards.


                                                                                        Thank you for the suggestion.  I'll pass this request to the team.

                                                                                         

                                                                                        2. I think we have different definitions of "occupancy". Occupancy to me means that all the ALUs are being used all the time. In the compute world all I really care about is the ALUs, if the ALUs are being fully utilized then that's great, since I use the GPU for computing. If I can make performance increases that's great, but I want to make sure that all the ALUs are being used all the time, that's the goal.


                                                                                        I agree that I'm using the term occupancy a bit differently than you are.  I apologize for the confusion.   In my mind, the occupancy described in the previous post is the theoretical occupancy (not the actual occupancy in the GPU), which means we are not taking into account GPRs and other resources.

                                                                                        Because the number of GPRs has a direct effect on the number of threads in flight (used to hide memory latency), if you have a kernel that uses a high number of GPRs, you would want your ALU:Fetch ratio to be much higher than 1.0 (to offset the memory latency due to the lower number of threads in flight).
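                                                                                        As a back-of-the-envelope illustration of why the target ratio climbs as the number of threads in flight drops, here is a small C sketch; the 100-cycle fetch latency and the wavefront counts are illustrative assumptions:

#include <stdio.h>

/* Rough latency-hiding arithmetic: while one wavefront waits on a fetch,
 * the others execute ALU work.  The fetch latency and wavefront counts are
 * illustrative assumptions, not SKA figures. */
int main(void)
{
    const double fetch_latency = 100.0;  /* assumed cycles per fetch */

    int wavefronts_in_flight[] = { 4, 16, 64 };

    for (int i = 0; i < 3; ++i) {
        int n = wavefronts_in_flight[i];
        /* ALU cycles needed per fetch so the other (n-1) wavefronts can
         * cover the latency of the one that is waiting. */
        double alu_per_fetch = fetch_latency / (n > 1 ? n - 1 : 1);
        printf("%2d wavefronts in flight -> need roughly %5.1f ALU cycles "
               "per fetch to hide a %.0f-cycle fetch\n",
               n, alu_per_fetch, fetch_latency);
    }
    return 0;
}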

                                                                                         

                                                                                        3. My only real question: What about measurables in the KSA for wavefront size and/or threads in flight? 4. Thanks for the posts, great insight into KSA!!


                                                                                        This is not currently exposed/calculated.  Please keep the good suggestions coming though as we are continually trying to improve this tool.

                                                                                         

                                                                                      • Stream KernelAnalyzer is now available!
                                                                                        FangQ

                                                                                        I am wondering if anyone wants to explain the other columns: what these metrics mean (is larger better or smaller better?) and what the target values are. The descriptions of each column in the Readme file really do not give much information.

                                                                                        Here is the list of the columns:

                                                                                        Name  -- apparent
                                                                                        Code  -- understood
                                                                                        Alu Instructions
                                                                                        Texture Instructions
                                                                                        Global Read Instructions
                                                                                        Interpolator Instructions
                                                                                        Control Flow Instructions
                                                                                        Global Write Instructions

                                                                                         

                                                                                        Texture Dependancy Levels

                                                                                         

                                                                                        General Purpose Registers

                                                                                         

                                                                                        Min Cycles
                                                                                        Max Cycles
                                                                                        Avg Cycles
                                                                                        Estimated Cycles
                                                                                        Estimated Cycles(Bilinear)
                                                                                        Estimated Cycles(Trilinear)
                                                                                        Estimated Cycles(Aniso)

                                                                                         

                                                                                        ALU:Fetch Ratio             --- understood
                                                                                        ALU:Fetch Ratio(Bilinear)
                                                                                        ALU:Fetch Ratio(Trilinear)
                                                                                        ALU:Fetch Ratio(Aniso)

                                                                                        Bottleneck   -- how is this determined?
                                                                                        Bottleneck(Bilinear)
                                                                                        Bottleneck(Trilinear)
                                                                                        Bottleneck(Aniso)

                                                                                         

                                                                                        Avg Peak Throughput
                                                                                        Avg Peak Throughput(Bilinear)
                                                                                        Avg Peak Throughput(Trilinear)
                                                                                        Avg Peak Throughput(Aniso)
                                                                                        Avg Throughput Per Clock
                                                                                        Avg Throughput Per Clock(Bilinear)
                                                                                        Avg Throughput Per Clock(Trilinear)
                                                                                        Avg Throughput Per Clock(Aniso)

                                                                                         

                                                                                        Max Scratch Registers

                                                                                        Edit: I meant to reply to this post, but accidentally edited it instead.

                                                                                          • Stream KernelAnalyzer is now available!
                                                                                            ryta1203

                                                                                             

                                                                                            Originally posted by: FangQ
                                                                                            While kernel A is more optimal in the term of using all the GPU resources (thus it is running at full occupancy), I think we can tell that kernel B's run-time will be much better.


                                                                                            I think for beginers like me, this type of comments will be very useful to understand ALU:Fetch

                                                                                            I actually find the statement quite confusing for a few reasons:

                                                                                            1) Full Occupancy, in CUDA terms, means 100% ALU Utilization, and that is what it should mean.

                                                                                            2) Why should it mean that? Because no one cares about fetching, we only care about computing. Computing is done by ALUs and hence if we can get 100% ALU Utilization we don't really care what the fetch units are doing. So I find the term occupancy, in the way AMD is using it, quite wrong and confusing.

                                                                                            3) Sadly, the ALU:Fetch ratio tells you nothing about the percentage of ALU Utilization.

                                                                                            4) If, supposedly, ONE is the ratio we are going for and it turns out it's not the best ratio for performance then why are we going for that to begin with, since all we care about is performance? This makes little sense.

                                                                                            The most useful, and really the only useful, thing about the SKA is that it gives you the ISA. That's it. All those measurables ("columns") seem to be somewhat meaningless and misleading, considering they don't take GPR usage into account and therefore can't accurately predict the overall system performance, only the performance of one thread.

                                                                                              • Stream KernelAnalyzer is now available!
                                                                                                FangQ

                                                                                                 

                                                                                                I actually find the statement quite confusing for a few reasons:


                                                                                                I think we really need someone who has experience with GPU profiling to clear things up. Otherwise, I just feel awkward reading all these numbers without knowing what they can tell me.

                                                                                                • Stream KernelAnalyzer is now available!
                                                                                                  FangQ

                                                                                                   

                                                                                                  I actually find the statement quite confusing for a few reasons:


                                                                                                  I just meant that the comments seemed to give me more info than the literal word expansion in the Release Notes.

                                                                                                  Definitely, explaining the meaning of each item in the help file will be useful; it would be even more useful, as emphasized by your comment, to give guidance on how to interpret and use these metrics in code optimization.

                                                                                                  • Stream KernelAnalyzer is now available!
                                                                                                    bpurnomo

                                                                                                     

                                                                                                    1) Full Occupancy, in CUDA terms, means 100% ALU Utilization, and that is what it should mean.


                                                                                                    I don't agree.  Full occupancy should be different from 100% ALU utilization, unless the GPU consists only of ALU units.

                                                                                                     

                                                                                                     

                                                                                                    2) Why should it mean that? Because no one cares about fetching, we only care about computing. Computing is done by ALUs and hence if we can get 100% ALU Utilization we don't really care what the fetch units are doing. So I find the term occupancy, in the way AMD is using it, quite wrong and confusing.


                                                                                                    This is incorrect.  Fetch/memory operations are as important as ALU operations.  If your kernel is not performing any memory operations at all, then its performance might not be optimal.  Some of the ALU operations can be replaced by a table/memory lookup instead, and you might end up with better overall performance.  This is a standard optimization technique in the graphics world (for example, replacing long ALU computations, such as sqrt, with a table lookup).
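                                                                                                    As a host-side C sketch of the lookup-table idea (on the GPU the table would live in a texture or gather stream; the table size and input range here are arbitrary assumptions):

#include <math.h>
#include <stdio.h>

/* Minimal sketch of trading ALU work for a memory lookup: precompute
 * sqrt(x) for x in [0, 1) into a table, then replace sqrtf() calls with an
 * indexed read.  On the GPU the table would live in a texture/gather
 * stream; the table size and input range are arbitrary assumptions. */
#define TABLE_SIZE 1024

static float sqrt_table[TABLE_SIZE];

static void build_table(void)
{
    for (int i = 0; i < TABLE_SIZE; ++i)
        sqrt_table[i] = sqrtf((float)i / TABLE_SIZE);
}

/* One multiply and one fetch instead of a long sqrt instruction sequence. */
static float table_sqrt(float x)   /* expects 0 <= x < 1 */
{
    return sqrt_table[(int)(x * TABLE_SIZE)];
}

int main(void)
{
    build_table();
    printf("sqrtf(0.25) = %f, table = %f\n", sqrtf(0.25f), table_sqrt(0.25f));
    printf("sqrtf(0.81) = %f, table = %f\n", sqrtf(0.81f), table_sqrt(0.81f));
    return 0;
}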

                                                                                                     

                                                                                                     

                                                                                                    3) Sadly, the ALU:Fetch ratio tells you nothing about the percentage of ALU Utilization.


                                                                                                    ALU:Fetch ratio is not ALU utilization.  They are two different terms.

                                                                                                     

                                                                                                     

                                                                                                    4) If, supposedly, ONE is the ratio we are going for and it turns out it's not the best ratio for performance then why are we going for that to begin with, since all we care about is performance? This makes little sense.


                                                                                                     

                                                                                                     I should clarify that ONE is the best ratio if you don't take fetch latency into account (that is, if you can hide those latencies by having many threads in flight; this is a point I have made several times in this thread).  However, in practice, the more complex your kernel (many ALU ops, many fetch ops, and a lot of GPRs, and thus few threads in flight), the higher the ALU:Fetch ratio you should be shooting for.

                                                                                                     

                                                                                                      • Stream KernelAnalyzer is now available!
                                                                                                        bpurnomo

                                                                                                         

                                                                                                        I am wondering if anyone want to explain for other columns - what these metrics mean (large means better or small means better) and what are the target values? the descriptions of each column in the Readme file really does not givel much information.

                                                                                                        here is the list of the columns

                                                                                                        Name  -- appearent
                                                                                                        Code  -- understood
                                                                                                        Alu Instructions
                                                                                                        Texture Instructions
                                                                                                        Global Read Instructions
                                                                                                        Interpolator Instructions
                                                                                                        Control Flow Instructions
                                                                                                        Global Write Instructions



                                                                                                        The ALU, Texture, Global Read, Interpolator, Control Flow, and Global Write columns give you the count of each type of operation.  Thus, a smaller number means less work to be done by your kernel.


                                                                                                        Texture Dependancy Levels


                                                                                                        Smaller is better for this number. It counts how deep your texture/fetch dependency chain is (i.e., how many levels of dependency exist among your fetch operations).  For example, a fetch/memory operation might depend on the result of another fetch/memory operation, which in turn depends on yet another, etc. (long dependency chains should usually be avoided).


                                                                                                        General Purpose Registers


                                                                                                        The number of registers used by your kernel. A smaller number is better.  This one is important for gauging your performance.  This number has a direct relationship with the number of possible threads in flight at a time (a smaller number equals more threads in flight).   If your kernel contains memory operations, then more threads in flight means more performance, since memory latency can be hidden by many threads (i.e., if a thread is blocked in the GPU because it has to wait on a memory fetch, another thread can be scheduled to run instead).

                                                                                                         

                                                                                                        Min Cycles
                                                                                                        Max Cycles
                                                                                                        Avg Cycles
                                                                                                        Estimated Cycles
                                                                                                        Estimated Cycles(Bilinear)
                                                                                                        Estimated Cycles(Trilinear)
                                                                                                        Estimated Cycles(Aniso)


This is an estimated value based on a magic formula (it does not take GPRs into account, so it is highly inaccurate if you use many GPRs and have many fetch/memory ops).

Bilinear, trilinear, and aniso come from the graphics world and refer to how the memory fetch is performed (each fetch operation can perform more than one fetch to also retrieve adjacent memory locations for averaging/filtering calculations).

                                                                                                         


                                                                                                        ALU:Fetch Ratio             --- understood
                                                                                                        ALU:Fetch Ratio(Bilinear)
                                                                                                        ALU:Fetch Ratio(Trilinear)
                                                                                                        ALU:Fetch Ratio(Aniso)

                                                                                                        Bottleneck   -- how is this determined?
                                                                                                        Bottleneck(Bilinear)
                                                                                                        Bottleneck(Trilinear)
                                                                                                        Bottleneck(Aniso)



Bottleneck is computed based on the number of ALU, Fetch, Control Flow, and Interpolator instructions. As with the estimated cycle computation, this can be inaccurate.
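As a rough mental model only (not SKA's actual formula; the per-unit throughput numbers below are made-up placeholders), the bottleneck can be thought of as whichever instruction type needs the most issue time:

    # Hypothetical sketch: the unit type that needs the most cycles limits the kernel.
    def estimate_bottleneck(instruction_counts, units_per_simd):
        # issue cycles each unit type would need if it worked alone
        cycles = {kind: instruction_counts.get(kind, 0) / units_per_simd[kind]
                  for kind in units_per_simd}
        return max(cycles, key=cycles.get)

    counts = {"ALU": 32, "Fetch": 4, "ControlFlow": 2, "GlobalWrite": 1}
    units  = {"ALU": 4, "Fetch": 1, "ControlFlow": 1, "GlobalWrite": 1}  # placeholder throughputs
    print(estimate_bottleneck(counts, units))   # -> "ALU" with these made-up numbers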

                                                                                                         

                                                                                                        Avg Peak Throughput
                                                                                                        Avg Peak Throughput(Bilinear)
                                                                                                        Avg Peak Throughput(Trilinear)
                                                                                                        Avg Peak Throughput(Aniso)
                                                                                                        Avg Throughput Per Clock
                                                                                                        Avg Throughput Per Clock(Bilinear)
                                                                                                        Avg Throughput Per Clock(Trilinear)
                                                                                                        Avg Throughput Per Clock(Aniso)


Typically, a higher number is better.


                                                                                                        Max Scratch Registers


A smaller number is better.

                                                                                                         

Also, remember the first rule of optimizing your system: find the bottleneck first, then improve the metric related to that bottleneck. If you improve a metric that is unrelated to your bottleneck, it will not improve the performance of the system. For example, if the bottleneck for your system is too many memory/fetch operations, reducing the number of ALU operations won't improve the performance of your system.

                                                                                                          • Stream KernelAnalyzer is now available!
                                                                                                            FangQ

                                                                                                            thank you so much bpurnomo, this is very helpful.

As you said, I am trying to find the bottleneck of a Monte-Carlo code I wrote recently. The code appeared to be slightly slower than the CPU (Intel Q6700) on a 4650 card, which is unexpected.

The code contains 3 kernels: the first two simulate the movement of a particle, and the 3rd distributes the values of the particle to a 3D grid (using a scatter output; see my post here )

First of all, SKA gave me two options for each kernel, one with "_addr" and one without. What are the differences?

                                                                                                            Using the one without _addr, here are the stats of my 3 kernels:

                                                                                                            kernel1: ALU:46,TEX:5,CF:13,GlobalWrite:1,GPR:10,ALU_Fetch:1.63,Avg:8.13,Thread\Clock:1.97

                                                                                                            kernel2: ALU:68,TEX:3,CF:11,GlobalWrite:1,GPR:10,ALU_Fetch:2.19,Avg:3.28,Thread\Clock:2.44

                                                                                                            kernel3: ALU:8,TEX:1,CF:7,GlobalWrite:2,GPR:5,ALU_Fetch:4,Avg:2,Thread\Clock:4.0

                                                                                                            In your opinion, what do you think is the key cause for the low performance? (ALU and CF are too high?)

(I also have a CUDA version of this code and have achieved >100x acceleration on an 8800GT card. Given that the 8800GT only has 112 stream processors and the 4650 has 320 stream processors, I am expecting an even greater speed-up ratio. Is this a reasonable expectation?)

                                                                                                          • Stream KernelAnalyzer is now available!
                                                                                                            ryta1203

                                                                                                             

                                                                                                            Originally posted by: bpurnomo

                                                                                                            I don't agree.  Full occupancy should be different than 100% ALU Utilization unless the GPU only consists of ALU units.



You are correct actually, I apologize for the confusion. It does in fact NOT mean this, only that the max number of warps is in flight, I believe, but I could be wrong.

                                                                                                             

This is incorrect.  Fetch/memory operations are as important as ALU operations.  If your kernel is not performing any memory operations at all, then its performance might not be optimal.  Some of the ALU operations can be replaced by a table/memory lookup instead and you might end up with better overall performance.  This is a standard optimization technique in the graphics world (for example, replacing long ALU computations, such as sqrt, with a table lookup instead).

                                                                                                            Wouldn't this only be the case if the ALU computation time exceeds the fetch?

                                                                                                             

                                                                                                            ALU:Fetch ratio is not ALU utilization.  They are two different terms.

                                                                                                            Yes, this is why I said it.

                                                                                                             

                                                                                                             

I should clarify that ONE is the best ratio if you don't take fetch latency into account (if you can hide those latencies by having many threads in flight; this is a point I have made several times in this thread).  However, in practice, the more complex your kernel (with many ALU ops, many fetch ops, and a lot of GPRs, and thus few threads in flight), the higher the ALU:Fetch ratio you should be shooting for.

                                                                                                            But can't you also have too many threads in flight?

                                                                                                              • Stream KernelAnalyzer is now available!
                                                                                                                ryta1203

Would it be possible to have the SKA generate FULL ISA? For example, the SKA generates the IL header file, which can essentially be copied and pasted; is it possible to have something like this for ISA in the SKA?

                                                                                                                  • Stream KernelAnalyzer is now available!
                                                                                                                    ryta1203

                                                                                                                    Also, in that same mind set, would it be possible to have an IL to ISA compiler in the SKA?

                                                                                                                    • Stream KernelAnalyzer is now available!
                                                                                                                      bpurnomo

                                                                                                                       

                                                                                                                      Originally posted by: ryta1203 Would it be possible to have the SKA generate FULL ISA? For example, the SKA generates the IL header file, which can essentially be copy and pasted, is it possible to have something like this for ISA in the SKA?

                                                                                                                       

                                                                                                                      I'll add the request to our bug tracking system.

                                                                                                                       

                                                                                                                        • Stream KernelAnalyzer is now available!
                                                                                                                          ryta1203

                                                                                                                           

                                                                                                                          Originally posted by: bpurnomo
                                                                                                                          Originally posted by: ryta1203 Would it be possible to have the SKA generate FULL ISA? For example, the SKA generates the IL header file, which can essentially be copy and pasted, is it possible to have something like this for ISA in the SKA?

                                                                                                                           

                                                                                                                          I'll add the request to our bug tracking system.

                                                                                                                           

                                                                                                                           

                                                                                                                          Also, can you request that they add the "\n" after every line to the IL.h compilation?

I tried to simply copy and paste the IL and it wouldn't compile without the "\n" after every line, so I had to manually add "\n" to EVERY line myself. This should be VERY easy to do and could save us a lot of time, thanks.
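In the meantime, a throwaway post-processing script can patch the generated header. This is only a hypothetical workaround; the file name and the assumption that each IL statement is emitted as its own quoted string are guesses about the generated IL.h layout, so adjust to match your output.

    # Hypothetical workaround: append a literal \n inside each quoted IL line of the
    # generated header.  Assumes one quoted IL statement per line, e.g.  "mul r4, r1, r0"
    lines_out = []
    with open("kernel_il.h") as f:              # assumed file name for the generated header
        for line in f:
            stripped = line.rstrip("\n")
            body = stripped.strip()
            if body.startswith('"') and body.endswith('"') and not body.endswith('\\n"'):
                # insert \n just before the closing quote
                stripped = stripped[: stripped.rfind('"')] + '\\n"'
            lines_out.append(stripped)

    with open("kernel_il.h", "w") as f:
        f.write("\n".join(lines_out) + "\n")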

                                                                                                                            • Stream KernelAnalyzer is now available!
                                                                                                                              bpurnomo

                                                                                                                               



                                                                                                                               

                                                                                                                              Also, can you request that they add the "\n" after every line to the IL.h compilation?

                                                                                                                               

I tried to simply copy and paste the IL and it wouldn't compile without the "\n" after every line, so I had to manually add "\n" to EVERY line myself. This should be VERY easy to do and could save us a lot of time, thanks.

                                                                                                                               

                                                                                                                              This should be fixed in the next version.

                                                                                                                               

                                                                                                                                • Stream KernelAnalyzer is now available!
                                                                                                                                  ryta1203

                                                                                                                                  Where does the extra GPR come from in the SKA?

                                                                                                                                  For example, if the ISA only uses R0 and R1 then the SKA reports GPR=3.

                                                                                                                                  If the ISA only uses R0 then the SKA reports GPR=2.

                                                                                                                                  It seems that if there are n registers used (including Tx registers) then the SKA reports n+1 GPR.

                                                                                                                                  I'm just wondering where the other GPR comes from.

                                                                                                                                    • Stream KernelAnalyzer is now available!
                                                                                                                                      bpurnomo

                                                                                                                                      Hi Ryta1203,

This is a bug in SKA.  It used to be that the number of GPRs reported by the ISA was off by one, but it seems this has now been fixed on the ISA side.

                                                                                                                                       

                                                                                                                                        • Stream KernelAnalyzer is now available!
                                                                                                                                          ryta1203

                                                                                                                                          bpurnomo, so it should be just n, not n-1

                                                                                                                                          For example, if the registers being used are R0 through R5 without any T registers, then it should report 6 GPR?

Also, just to clarify, it does count the T registers, right, since they are GPRs?

                                                                                                                                            • Stream KernelAnalyzer is now available!
                                                                                                                                              bpurnomo

                                                                                                                                               

                                                                                                                                              Originally posted by: ryta1203 bpurnomo, so it should be just n, not n-1

                                                                                                                                               

                                                                                                                                              For example, if the registers being used are R0 through R5 without any T registers, then it should report 6 GPR?

                                                                                                                                               

                                                                                                                                              Also, just to clarify, it does count the T registers right, since they are a GPR?

                                                                                                                                               

                                                                                                                                              Yeah it should just be n (according to your definition).  This should be fixed in the next version of SKA.

                                                                                                                                              T registers are not part of the GPR calculation.  They are clause temporaries (they don't span across clauses) and they have their own dedicated pool.

                                                                                                                                               

                                                                                                                                                • Stream KernelAnalyzer is now available!
                                                                                                                                                  ryta1203

                                                                                                                                                  bpurnomo,

I'd like to know if this is a bug. It appears that the ALU:Fetch ratio reported by SKA is 1.00 IF the ALU-to-TEX instruction ratio is 4:1, at least for larger input sizes such as 4 and 8. For an input size of 2 I get an ALU:Fetch ratio of 1.25 even though the ALU count is 8 and the TEX count is 2 (4:1). Why does this seem inconsistent? I understand why the "Bottleneck" might be different, but it seems to me that the ALU:Fetch ratio should follow the same formula (4:1 ALU:TEX instructions).

                                                                                                                                                   Any ideas?

                                                                                                                                                  EDIT: ALL this info assumes RV770, sorry if this wasn't mentioned. ALSO, for the R600 it holds true EVEN for an input size of 2.

                                                                                                                                                    • Stream KernelAnalyzer is now available!
                                                                                                                                                      bpurnomo

                                                                                                                                                      Yes they should use the same formula.

                                                                                                                                                      Can you please post the two kernels where you are seeing the discrepancies in the ALU:Fetch ratio so that I can better understand the problem?  Thank you.

                                                                                                                                                        • Stream KernelAnalyzer is now available!
                                                                                                                                                          ryta1203

                                                                                                                                                          4 inputs:

                                                                                                                                                          il_ps_2_0
                                                                                                                                                          dcl_input_position_interp(linear_noperspective) v0.x
                                                                                                                                                          dcl_output_generic o0
                                                                                                                                                          dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                                                                                                                          dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                                                                                                                          dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                                                                                                                          dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
                                                                                                                                                          sample_resource(0)_sampler(0) r0, v0.x
                                                                                                                                                          sample_resource(1)_sampler(1) r1, v0.x
                                                                                                                                                          sample_resource(2)_sampler(2) r2, v0.x
                                                                                                                                                          sample_resource(3)_sampler(3) r3, v0.x
                                                                                                                                                          mul r4, r1, r0
                                                                                                                                                          mul r5, r4, r2
                                                                                                                                                          mul r6, r5, r3
                                                                                                                                                          mul r7, r6, r5
                                                                                                                                                          mul r8, r7, r6
                                                                                                                                                          mul r9, r8, r7
                                                                                                                                                          mul r10, r9, r8
                                                                                                                                                          mul r11, r10, r9
                                                                                                                                                          mul r12, r11, r10
                                                                                                                                                          mul r13, r12, r11
                                                                                                                                                          mul r14, r13, r12
                                                                                                                                                          mul r15, r14, r13
                                                                                                                                                          mul r16, r15, r14
                                                                                                                                                          mul r17, r16, r15
                                                                                                                                                          mul r18, r17, r16
                                                                                                                                                          mul r19, r18, r17
                                                                                                                                                          mov o0, r19
                                                                                                                                                          ret_dyn
                                                                                                                                                          end

                                                                                                                                                           

                                                                                                                                                          2 inputs:

                                                                                                                                                           

il_ps_2_0
dcl_input_position_interp(linear_noperspective) v0.x
dcl_output_generic o0
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
sample_resource(0)_sampler(0) r0, v0.x
sample_resource(1)_sampler(1) r1, v0.x
mul r2, r1, r0
mul r3, r2, r1
mul r4, r3, r2
mul r5, r4, r3
mul r6, r5, r4
mul r7, r6, r5
mul r8, r7, r6
mul r9, r8, r7
mov o0, r9
ret_dyn
end

                                                                                                                                                           

The ALU:Fetch INSTRUCTION ratio for both of these is 4:1, so they both should (if my limited understanding is correct) be getting 1.00 ALU:Fetch in the SKA; however, the first kernel does and the second kernel does not. The second kernel gives 1.25 and the Bottleneck goes from ALU Ops to Global Write (though I didn't think this should affect the ALU:Fetch ratio number). The second kernel reports a 1.00 ALU:Fetch ratio for R600 but a 1.25 ALU:Fetch ratio for RV770 even though the ISA is EXACTLY the same.

Are there other factors that affect the ALU:Fetch ratio? I'm also curious because if I keep the number of inputs constant and vary the outputs, R600 and RV770 end up with different ALU:Fetch ratios, with the RV770 ALU:Fetch again not matching the expected 4:1 value.



                                                                                                                                                            • Stream KernelAnalyzer is now available!
                                                                                                                                                              bpurnomo

I had a chance to look at this earlier.  Apparently, the current implementation of the ALU:Fetch ratio is the ratio between the longest-running non-fetch instruction type and the fetch instructions.  So in the case where Global Write is the bottleneck, the ALU:Fetch ratio reports the ratio of Global Write instructions to Fetch instructions.  That is what happened in your second kernel above.

We can probably change it so that it always reports only ALU instructions vs. Fetch instructions, if that is a more useful number.

Also, in the case where ALU is the bottleneck, the ALU:Fetch ratio also depends on the ratio of ALU units to fetch units in the hardware.  That is why you might get different results on different hardware.  R600 and RV770 actually have the same fetch-to-ALU unit ratio (1 to 4); however, RV770 has 2.5x the number of ALU units compared to the R600.
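Reading that description literally, the ALU-bound case could be sketched like this. This is illustrative only, not SKA's actual formula; the 4-to-1 unit ratio is the one mentioned above, everything else is a placeholder.

    # Rough reading of the description above, not SKA's real formula.
    ALU_UNITS_PER_FETCH_UNIT = 4   # hardware ratio mentioned for R600/RV770

    def alu_fetch_ratio(alu_instructions, fetch_instructions):
        # Normalise by the hardware: roughly 4 ALU instructions issue in the time of 1 fetch.
        return (alu_instructions / ALU_UNITS_PER_FETCH_UNIT) / fetch_instructions

    print(alu_fetch_ratio(16, 4))   # 4:1 instruction mix -> 1.0 (balanced)
    print(alu_fetch_ratio(20, 2))   # ALU-heavy kernel    -> 2.5 (ALU bound)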

                                                                                                                                                               

                                                                                                                                                                • Stream KernelAnalyzer is now available!
                                                                                                                                                                  ryta1203

                                                                                                                                                                  bpurnomo,

Thank you. The current way makes sense to me; however, it makes it much more difficult to predict the ALU:Fetch ratio with any sort of accuracy.

Which approach is more useful is really for you guys to decide, since you know much more than we do.

I understand they are different for different hardware; however, from a performance standpoint based on ALU:Fetch it won't matter, since the ratio is abstracted from the hardware. For example, whether using the R600, RV770 or Rxxxx, if the ALU:Fetch for a given problem performs best at 1.0, that is what I am looking at.

                                                                                                                                                                    • Stream KernelAnalyzer is now available!
                                                                                                                                                                      ryta1203

                                                                                                                                                                      I have a question regarding the reporting of Estimated Cycles and the printed ISA.

                                                                                                                                                                      In the printed ISA, it appears that each instruction is 1 cycle, correct? Where can I find cycle count in the documentation?

                                                                                                                                                                      If I have 9 ISA instructions how can the estimated cycle count be below 9?

                                                                                                                                                                       

                                                                                                                                                                        • Stream KernelAnalyzer is now available!
                                                                                                                                                                          bpurnomo

The estimated cycles value is the effective cycles per thread (taking throughput into account).  Assuming we only count the ALU instructions (in actuality it also accounts for other instruction types), on a Radeon 4870 with 10 SIMDs, a shader with 9 ALU instructions will have an effective cost of 9/10 cycles.
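A simplified sketch of that arithmetic (ALU work only, assuming the 10 SIMDs of the Radeon 4870 example above):

    # Simplified sketch of the "effective cycles per thread" idea described above.
    NUM_SIMDS = 10   # Radeon 4870, as in the example

    def effective_cycles_per_thread(alu_instructions, num_simds=NUM_SIMDS):
        # With enough threads in flight, num_simds threads progress in parallel,
        # so the per-thread cost is amortised across the SIMDs.
        return alu_instructions / num_simds

    print(effective_cycles_per_thread(9))    # 0.9 -- why 9 instructions can report < 9 cycles
    print(effective_cycles_per_thread(48))   # 4.8 -- matches the follow-up question below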

                                                                                                                                                                           

                                                                                                                                                                            • Stream KernelAnalyzer is now available!
                                                                                                                                                                              ryta1203

                                                                                                                                                                              So I have a kernel with 48 ALU instructions and the estimated cycle count is 4.8?

Also, it looks like the SKA only takes into account the ALU instructions, is this correct? It doesn't take into account texture instructions? Even when the ALU clause is dependent on data from the TEX clause (this would add cycles since the ALU clause would stall, correct)?

                                                                                                                                                                                • Stream KernelAnalyzer is now available!
                                                                                                                                                                                  bpurnomo

                                                                                                                                                                                   

                                                                                                                                                                                  Originally posted by: ryta1203 So I have a kernel with 48 ALU instructions and the estimated cycle count is 4.8?


Yes, if the ALU instructions are the bottleneck.

                                                                                                                                                                                   

Also, it looks like the SKA only takes into account the ALU instructions, is this correct? It doesn't take into account texture instructions? Even when the ALU clause is dependent on data from the TEX clause (this would add cycles since the ALU clause would stall, correct)?

                                                                                                                                                                                   

                                                                                                                                                                                  No.  SKA takes into account all instruction types (you can look at the Bottleneck field). 

That is why I mentioned that the estimated cycles are the effective cycles.  Instructions can depend on other instructions within a single thread, but we can execute 10 threads in parallel because they do not depend on the results of instructions in other threads.

                                                                                                                                                                                  For example, 1 thread (with 100 ALU instructions) might take 100 cycles, but 10 threads will also still take 100 cycles (since we have 10 ALU units), 100 threads will take 1000 cycles, etc.  So the effective cycles per thread is 10 cycles.

                                                                                                                                                                                  Note that if there are many threads (or wavefronts) in flight, then the fetch latency will be hidden (as when one thread stalls, another thread will take its place).

                                                                                                                                                                                   

                                                                                                                                                                                    • Stream KernelAnalyzer is now available!
                                                                                                                                                                                      ryta1203

                                                                                                                                                                                      Ah, I understand, you don't give the actual kernel cycle count.

There might be some latency hiding, but you are still going to have some overhead from dependencies while the first wavefronts are being run.

Yes, many wavefronts in flight MIGHT improve performance, but this is not necessarily the case all the time. In fact, I have seen many examples where the GPR count was reduced significantly (theoretically allowing more wavefronts to run) but the performance was also reduced significantly. I only mention this as a warning for others reading the thread.

                                                                                                                                                                                        • Stream KernelAnalyzer is now available!
                                                                                                                                                                                          ryta1203

                                                                                                                                                                                          bpurnomo,

Also, if I may ask and you happen to know: if an ALU clause is dependent on data from a TEX clause, does the ALU execute once the data is available, or once the TEX clause has completed? Is this documented somewhere?

                                                                                                                                                                                          Sorry, this is technically unrelated to the GPU Tools, I just thought you might know.

                                                                                                                                                                                            • Stream KernelAnalyzer is now available!
                                                                                                                                                                                              bpurnomo

                                                                                                                                                                                               

                                                                                                                                                                                              Originally posted by: ryta1203 bpurnomo,

                                                                                                                                                                                               

Also, if I may ask and you happen to know: if an ALU clause is dependent on data from a TEX clause, does the ALU execute once the data is available, or once the TEX clause has completed? Is this documented somewhere?

                                                                                                                                                                                               

                                                                                                                                                                                              Sorry, this is technically unrelated to the GPU Tools, I just thought you might know.

                                                                                                                                                                                               

I believe it is the latter (but I am not 100% sure).  The two different methods should only affect performance when you are GPR-bound.  Otherwise, other threads will be scheduled to fill in the gap (and then, effectively, the two methods will have the same performance).

                                                                                                                                                                                                • Stream KernelAnalyzer is now available!
                                                                                                                                                                                                  ryta1203

                                                                                                                                                                                                  bpurnomo,

Not so much 1 indicator as fewer than 100.  And not even that so much as documentation that can point developers in the right direction.

4 things I think can be improved: poor docs, no profiler (still), no compiler optimization levels, no working assembler. These are the big things for me. I don't mind having to figure out 100 things, but please give me the support to do it (i.e., the above 4 things I've mentioned).

                                                                                                                                                                                                    I appreciate your time, as always, thanks!

                                                                                                                                                                                              • Stream KernelAnalyzer is now available!
                                                                                                                                                                                                bpurnomo

                                                                                                                                                                                                 

                                                                                                                                                                                                Originally posted by: ryta1203 Ah, I understand, you don't give the actual kernel cycle count

Correct, since you won't be running only a single thread in your application (if you are, then you are using the GPU incorrectly).

                                                                                                                                                                                                 

Yes, many wavefronts in flight MIGHT improve performance, but this is not necessarily the case all the time. In fact, I have seen many examples where the GPR count was reduced significantly (theoretically allowing more wavefronts to run) but the performance was also reduced significantly. I only mention this as a warning for others reading the thread.


Yes.  Many factors affect performance.  I understand that you have been asking for a while for a single indicator to predict performance; unfortunately, we won't be able to provide this.  This is why there is still so much research on optimizing performance for GPGPU applications.  The number of instructions matters, and the type also matters.  Not only that, to achieve high performance your application has to utilize the cache (LDS) in each SIMD efficiently.  You should also minimize dependencies, etc.  Then there is the compiler factor: some settings it detects might cause your kernel to run several times slower than it should (because it defaults to a conservative slow path), etc.

                                                                                                                                                                                                 

                                                                                                                                                                            • Stream KernelAnalyzer is now available!
                                                                                                                                                                              ryta1203

                                                                                                                                                                               

                                                                                                                                                                              Originally posted by: bpurnomo
                                                                                                                                                                              Originally posted by: ryta1203 bpurnomo, so it should be just n, not n-1

                                                                                                                                                                               

                                                                                                                                                                              For example, if the registers being used are R0 through R5, without any T registers, then it should report 6 GPRs?

                                                                                                                                                                               

                                                                                                                                                                              Also, just to clarify: it does count the T registers, right, since they are GPRs?

                                                                                                                                                                               

                                                                                                                                                                              Yeah it should just be n (according to your definition).  This should be fixed in the next version of SKA.

                                                                                                                                                                              T registers are not part of the GPR calculation.  They are clause temporaries (they don't span across clauses) and they have their own dedicated pool.

                                                                                                                                                                               

                                                                                                                                                                              bpurnomo,

                                                                                                                                                                                 According to the ISA docs they do reduce the available GPRs a thread can use, so this would be important when calculating WFs used, yes?

                                                                                                                                                                              Since they reduce the GPRs available to a thread, it would be nice to have them included in the total number of GPRs reported for the kernel (and it would be helpful if this were done for both sides of the T registers, odd and even). This will help when a developer wants to look at the total number of WFs running in parallel, since the T registers affect this.
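                                                                                                                                                                              For reference, the rough arithmetic I have in mind is sketched below.  This is only my own back-of-the-envelope reading of the ISA docs, not anything SKA reports: the 16384-GPR register file and 64-thread wavefront are assumptions for R700-class SIMDs, and the real limit also involves hardware caps and the clause temporaries themselves.

                                                                                                                                                                              /* Rough occupancy estimate, NOT an SKA feature: with an assumed register file of
                                                                                                                                                                               * 16384 GPRs per SIMD and 64 threads per wavefront, the number of wavefronts a
                                                                                                                                                                               * SIMD can hold is bounded by how many GPRs each thread needs. */
                                                                                                                                                                              int max_wavefronts_per_simd(int gprs_per_thread)
                                                                                                                                                                              {
                                                                                                                                                                                  const int gprs_per_simd    = 16384;  /* assumed R700-class register file */
                                                                                                                                                                                  const int threads_per_wave = 64;     /* assumed wavefront width          */

                                                                                                                                                                                  if (gprs_per_thread <= 0)
                                                                                                                                                                                      return 0;
                                                                                                                                                                                  return gprs_per_simd / (threads_per_wave * gprs_per_thread);
                                                                                                                                                                              }

                                                                                                                                                                              With those assumptions, going from 9 GPRs per thread to 13 drops the bound from 28 wavefronts to 19, which is why I would like the reported count to include everything that actually eats into that budget.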

                                                                                                                                                                              Thanks again.

                                                                                                                                                  • Stream KernelAnalyzer is now available!
                                                                                                                                                    bpurnomo

                                                                                                                                                     

                                                                                                                                                    Originally posted by: ryta1203 Yes, I kind of figured that GPR usage played an important role. So none of the KSA measurables take GPR usage into account? It might be helpful to add this in the future, because without it the KSA is mostly useless as a tool to gauge the performance of a kernel, and isn't that the point of the KSA? Or am I missing the point? Maybe I misunderstood the use of the KSA?


                                                                                                                                                    Using SKA basically gives you access to the ATI compiler.  It uses the Brook+ compiler to compile a Brook+ source file to IL.  Then it calls the ATI Shader Compiler to compile the IL down to hardware disassembly for various ASICs and under various Catalyst drivers.  While you can use the Brook+ compiler directly instead of SKA, you don't have access to the ATI Shader Compiler except through SKA or the ATI driver.  In addition, SKA exposes some statistics generated by the Shader Compiler, such as the number of GPRs, ALU instructions, fetch instructions, etc.  We also provide some heuristics to compute the estimated cycle times for your kernel.  The heuristics are not perfect, as there are many factors that affect total performance.  Please also keep in mind that SKA is a static analysis tool, not a run-time profiler, and thus has its own limitations.

                                                                                                                                                    How is all of the above helpful to you as a game/stream developer?

                                                                                                                                                    1. You can tweak your kernel to achieve better performance by looking at the statistics generated by SKA.  You should look at all the statistics instead of focusing on just one particular item.  The ALU:Fetch ratio gives a hint of the balance of your system.  You should also try to minimize the number of GPRs used (see the sketch after this list).  Finally, the estimated cycle time should also be a low number.  Some developers also like to look at the hardware disassembly to gain a better understanding of how to tweak their IL kernel.

                                                                                                                                                    2. If you want to know how your kernel performs on a particular graphics card, you can use SKA to gauge the performance on that particular graphics card even without having access to the hardware. 

                                                                                                                                                    3.  Similarly, without having to install a new Catalyst driver, you will be able to tell whether a shader bug has been fixed (or introduced) in the new driver, or, even better, whether there are performance improvements for your kernel/shader.
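                                                                                                                                                    As a small illustration of the GPR point in item 1, here is a sketch with two hypothetical kernels.  For something this trivial the compiler may well allocate both versions identically, but it shows the kind of source-level restructuring that is worth trying when the reported GPR count looks high:

                                                                                                                                                    // Hypothetical kernels for illustration only.  accum_tmp keeps four named
                                                                                                                                                    // temporaries live until the final sum, while accum_fold consumes each product
                                                                                                                                                    // immediately, giving the register allocator less to keep alive at any one time.
                                                                                                                                                    kernel void accum_tmp(float4 a<>, float4 b<>, out float c<>)
                                                                                                                                                    {
                                                                                                                                                        float s0 = a.x * b.x;
                                                                                                                                                        float s1 = a.y * b.y;
                                                                                                                                                        float s2 = a.z * b.z;
                                                                                                                                                        float s3 = a.w * b.w;
                                                                                                                                                        c = s0 + s1 + s2 + s3;
                                                                                                                                                    }

                                                                                                                                                    kernel void accum_fold(float4 a<>, float4 b<>, out float c<>)
                                                                                                                                                    {
                                                                                                                                                        c = a.x * b.x + a.y * b.y + a.z * b.z + a.w * b.w;
                                                                                                                                                    }

                                                                                                                                                    After a change like this, the GPR column in SKA is the place to check whether it actually helped.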

                                                                                                                                                    I hope this helps.

                                                                                                                                                     

                                                                                                                                                    Originally posted by: ryta1203 EDIT: It's also important to note that my GPR usage has gone down with another example (ALU:Fetch going from 0.88 to 1.07 and GPR count going from 13 to 9), and yet this increases the runtime of the program. This is what is confusing to me.


                                                                                                                                                    Is the estimated cycle time higher in the second kernel?  You can also post both kernels so we will be able to get a better idea of the problem.

                                                                                                                        • Stream KernelAnalyzer is now available!
                                                                                                                          dukeleto
                                                                                                                          Hello,
                                                                                                                        could I put in a request for a linux version of SKA, or at least for some attention to be paid to making the brcc+SKA combination work properly with wine under linux?
                                                                                                                        Currently both programs more or less install, but I cannot manage to get SKA to find the (windows version of) brcc.
                                                                                                                          Thanks!
                                                                                                                            • Stream KernelAnalyzer is now available!
                                                                                                                              ryta1203
                                                                                                                            Can we get printing fixed in the next release of KSA? It would be great to be able to print out the information. Right now, printing freezes KSA: nothing prints and it ends with KSA not responding.

                                                                                                                              Also, a Find/Replace function would be very nice.
                                                                                                                                • Stream KernelAnalyzer is now available!
                                                                                                                                  bpurnomo

                                                                                                                                   

                                                                                                                                Originally posted by: dukeleto Hello, could I put in a request for a linux version of SKA, or at least for some attention to be paid to making the brcc+SKA combination work properly with wine under linux? Currently both programs more or less install, but I cannot manage to get SKA to find the (windows version of) brcc. Thanks!


                                                                                                                                   

                                                                                                                                Originally posted by: ryta1203 Can we get printing fixed in the next release of KSA? It would be great to be able to print out the information. Right now, printing freezes KSA: nothing prints and it ends with KSA not responding. Also, a Find/Replace function would be very nice.


                                                                                                                                I'll add all these requests to our bug tracking system so they can be prioritized for future releases.

                                                                                                                                  Cheers.

                                                                                                                                    • Stream KernelAnalyzer is now available!
                                                                                                                                      ryta1203
                                                                                                                                      bpurnomo,

                                                                                                                                    Sounds great. I seem to have another example that might help convince anyone who doubts that the KSA is limited: these two kernels give the same info in the KSA (with only slightly different ISA):

                                                                                                                                    kernel void step1(float4 a<>, float4 b<>, out float c<>, out float d<>)
                                                                                                                                    {
                                                                                                                                        c = a.x + a.y + a.z + a.w;
                                                                                                                                        d = b.x + b.y + b.z + b.w;
                                                                                                                                    }

                                                                                                                                    kernel void step2(float4 a<>, float4 b<>, out float4 out1<>, out float4 out2<>)
                                                                                                                                    {
                                                                                                                                        //float4 temp;
                                                                                                                                        out1.x = a.x + a.y + a.z + a.w;
                                                                                                                                        out2.x = b.x + b.y + b.z + b.w;
                                                                                                                                        //out1 = temp;
                                                                                                                                        //out2 = temp;
                                                                                                                                    }


                                                                                                                                    YET, the 1st kernel runs twice as fast as the second kernel. Other than exposing the ISA for me to examine, the KSA gives NO clue as to why this is happening. The ISA is very similar for both kernels.
                                                                                                                                        • Stream KernelAnalyzer is now available!
                                                                                                                                          bpurnomo

                                                                                                                                           

                                                                                                                                        Originally posted by: ryta1203 YET, the 1st kernel runs twice as fast as the second kernel. Other than exposing the ISA for me to examine, the KSA gives NO clue as to why this is happening. The ISA is very similar for both kernels.


                                                                                                                                          We hear you.  We do believe that a run-time profiler would be a nice thing to have.  I'm actually on your side.

                                                                                                                                        However, it is not true that SKA gives NO clue at all for those two kernels.  Without SKA, developers would have no idea why one is faster than the other.  After all, the ISA is exposed by SKA.

                                                                                                                                           

                                                                                                                                            • Stream KernelAnalyzer is now available!
                                                                                                                                              ryta1203
                                                                                                                                              Looking at the KSA and the ISA, I can't tell why the one is faster than the other.

                                                                                                                                            The ISA is very similar. Either way, the developer needs to know the ISA in order to optimize these kernels (very simple kernels at that). Wouldn't it be easier to just write kernels like this in ISA? If so, then there is no need for higher-level languages or for the KSA at all.

                                                                                                                                            I'm glad to hear a run-time profiler is being discussed; I think it will be very useful, depending on the type of information it profiles.
                                                                                                                                    • Stream KernelAnalyzer is now available!
                                                                                                                                      naughtykid

                                                                                                                                    When will there be a Linux version?