
bpurnomo
Staff

Stream KernelAnalyzer is now available!

The GPU Developer Tools Team is pleased to announce the release of a new tool: Stream KernelAnalyzer.

This is a tool for analyzing the performance of stream kernels on ATI graphics cards/stream processors (AMD Stream SDK 1.3 is required).  It was derived from the GPU ShaderAnalyzer tool to specifically target the stream community. 

Features of the new tool:

  • Support for Brook+ and IL kernels.
  • Support for IL compute kernels on ATI Radeon 4800 series graphics cards.
  • Support for ATI Stream SDK 1.3.
  • Support for AMD FireStream 9170, 9250, and 9270 stream processors.
  • Support for ATI Radeon 2000, 3000, and 4000 series graphics cards.

Please do not hesitate to post on the forum if you have any questions.

Sincerely,

GPU Developer Tools Team

0 Likes
74 Replies

Hi Ryta1203,

   This is a bug in SKA.  It used to be that the number of GPRs reported by the ISA was off by one, but it seems this has now been fixed on the ISA side.

 

0 Likes

bpurnomo, so it should be just n, not n-1?

For example, if the registers being used are R0 through R5 without any T registers, then it should report 6 GPR?

Also, just to clarify, it does count the T registers, right, since they are GPRs?

0 Likes

Originally posted by: ryta1203 bpurnomo, so it should be just n, not n-1

 

For example, if the registers being used are R0 through R5 without any T registers, then it should report 6 GPR?

 

Also, just to clarify, it does count the T registers right, since they are a GPR?

 

Yeah it should just be n (according to your definition).  This should be fixed in the next version of SKA.

T registers are not part of the GPR calculation.  They are clause temporaries (they don't span across clauses) and they have their own dedicated pool.

 

0 Likes

bpurnomo,

 I'd like to know if this is a bug. It appears that the ALU:Fetch ratio reported by SKA is 1.00 if the ALU-to-TEX instruction ratio is 4:1, at least for larger input sizes such as 4 and 8. For an input size of 2 I get an ALU:Fetch ratio of 1.25 even though the ALU count is 8 and the TEX count is 2 (4:1). Why does this seem inconsistent? I understand why the "Bottleneck" might be different, but it seems to me that the ALU:Fetch ratio should follow the same formula (4:1 ALU:TEX instructions).

 Any ideas?

EDIT: ALL this info assumes RV770, sorry if this wasn't mentioned. ALSO, for the R600 it holds true EVEN for an input size of 2.

0 Likes

Yes they should use the same formula.

Can you please post the two kernels where you are seeing the discrepancies in the ALU:Fetch ratio so that I can better understand the problem?  Thank you.

0 Likes

4 inputs:

il_ps_2_0
dcl_input_position_interp(linear_noperspective) v0.x
dcl_output_generic o0
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(2)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(3)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
sample_resource(0)_sampler(0) r0, v0.x
sample_resource(1)_sampler(1) r1, v0.x
sample_resource(2)_sampler(2) r2, v0.x
sample_resource(3)_sampler(3) r3, v0.x
mul r4, r1, r0
mul r5, r4, r2
mul r6, r5, r3
mul r7, r6, r5
mul r8, r7, r6
mul r9, r8, r7
mul r10, r9, r8
mul r11, r10, r9
mul r12, r11, r10
mul r13, r12, r11
mul r14, r13, r12
mul r15, r14, r13
mul r16, r15, r14
mul r17, r16, r15
mul r18, r17, r16
mul r19, r18, r17
mov o0, r19
ret_dyn
end

 

2 inputs:

il_ps_2_0
dcl_input_position_interp(linear_noperspective) v0.x
dcl_output_generic o0
dcl_resource_id(0)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
dcl_resource_id(1)_type(2d,unnorm)_fmtx(float)_fmty(float)_fmtz(float)_fmtw(float)
sample_resource(0)_sampler(0) r0, v0.x
sample_resource(1)_sampler(1) r1, v0.x
mul r2, r1, r0
mul r3, r2, r1
mul r4, r3, r2
mul r5, r4, r3
mul r6, r5, r4
mul r7, r6, r5
mul r8, r7, r6
mul r9, r8, r7
mov o0, r9
ret_dyn
end

 

The ALU:Fetch INSTRUCTION ratio for both of these is 4:1, so they both should (if my limited understanding is correct) get a 1.00 ALU:Fetch in SKA; however, the first kernel does and the second kernel does not. The second kernel gives 1.25, and the Bottleneck goes from ALU Ops to Global Write (though I didn't think this should affect the ALU:Fetch ratio number). The second kernel reports a 1.00 ALU:Fetch ratio for R600 but a 1.25 ALU:Fetch ratio for RV770 even though the ISA is EXACTLY the same.

Are there other factors that affect the ALU:Fetch ratio? I'm also curious because if I keep the number of inputs constant and vary the outputs, R600 and RV770 end up with different ALU:Fetch ratios, with the RV770 ALU:Fetch again not matching the expected 4:1 value.



0 Likes

I had a chance to look at this earlier.  Right now, the ALU:Fetch ratio is implemented as the ratio between the longest-running non-fetch instruction type and the fetch instructions.  So in the case where Global Write is the bottleneck, the ALU:Fetch ratio reports the ratio of Global Write instructions to Fetch instructions.  This is what happened in your second kernel above.

We can probably change it so that it always reports only the ALU instructions vs. Fetch instructions, if that is a more useful number.

Also, in the case where the ALU is the bottleneck, the ALU:Fetch ratio also depends on the ratio of ALU to Fetch units in the hardware.  That is why you might get different results with different hardware.  R600 and RV770 actually have the same 4:1 ratio of ALU units to Fetch units; however, RV770 has 2.5x the number of ALU units compared to the R600.
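
In rough Python, the behaviour described above looks something like the sketch below.  The unit counts here are illustrative placeholders, not SKA's internal values; the point is only that the reported number compares the longest non-fetch path against the fetch path, so when Global Write dominates it is the write path you are seeing.

def reported_alu_fetch_ratio(alu_instrs, fetch_instrs, write_instrs,
                             alu_units=4, fetch_units=1, write_units=1):
    # Estimated time in each instruction type, normalized by how many
    # units of that type the hardware provides (placeholder figures).
    alu_time = alu_instrs / alu_units
    fetch_time = fetch_instrs / fetch_units
    write_time = write_instrs / write_units
    # The reported ratio uses the longest non-fetch path, so the bottleneck
    # (ALU or Global Write) drives the number.
    return max(alu_time, write_time) / fetch_time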

 

0 Likes

bpurnomo,

   Thank you. The current way makes sense to me; however, it becomes much more difficult to predict the ALU:Fetch ratio with any sort of accuracy.

   Which one is more useful is really for you guys to decide, since you know much more than we do.

   I understand they are different for different hardware; however, looking at it from a performance standpoint based on ALU:Fetch, it won't matter, since the ratio is abstracted from the hardware. For example, whether I'm using the R600, RV770, or any other Rxxx, if a given problem performs best at an ALU:Fetch ratio of 1.0, that is the number I am looking at.

0 Likes

I have a question regarding the reporting of Estimated Cycles and the printed ISA.

In the printed ISA, it appears that each instruction is 1 cycle, correct? Where can I find cycle count in the documentation?

If I have 9 ISA instructions how can the estimated cycle count be below 9?

 

0 Likes

The estimated cycles figure is the effective cycles per thread (taking throughput into account).  Assuming we only consider the ALU instructions (in actuality it also accounts for other instruction types), for a Radeon 4870 with 10 SIMDs, a shader with 9 ALU instructions will have an effective cycle count of 9/10 of a cycle.
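
As a minimal sketch (not SKA's actual implementation, which also weighs other instruction types), the idea boils down to dividing the instruction count by the number of parallel units:

def effective_cycles_per_thread(alu_instructions, num_simds):
    # With enough threads in flight, each SIMD works on a different thread,
    # so the per-thread cost is amortized across the SIMDs.
    return alu_instructions / num_simds

print(effective_cycles_per_thread(9, 10))   # 0.9, the Radeon 4870 example above
print(effective_cycles_per_thread(48, 10))  # 4.8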

 

0 Likes

So I have a kernel with 48 ALU instructions and the estimated cycle count is 4.8?

Also, it looks like the SKA only takes into account the ALU instructions, is this correct? It doesn't take into account texture instructions? Even when the ALU clause is dependent on data from the TEX clause (this would add cycles since the ALU clause would stall correct?)?

0 Likes

Originally posted by: ryta1203 So I have a kernel with 48 ALU instructions and the estimated cycle count is 4.8?


Yes, if the ALU instructions are the bottleneck.

 

Also, it looks like the SKA only takes into account the ALU instructions, is this correct? It doesn't take into account texture instructions? Even when the ALU clause is dependent on data from the TEX clause (this would add cycles since the ALU clause would stall correct?)?

 

No.  SKA takes into account all instruction types (you can look at the Bottleneck field). 

That is why I mentioned that the estimated cycles are effective cycles.  Instructions can depend on other instructions within a single thread.  But we can execute 10 threads in parallel that are not dependent on the results of instructions in other threads.

For example, 1 thread (with 100 ALU instructions) might take 100 cycles, but 10 threads will also still take 100 cycles (since we have 10 ALU units), 100 threads will take 1000 cycles, etc.  So the effective cycles per thread is 10 cycles.

Note that if there are many threads (or wavefronts) in flight, then the fetch latency will be hidden (as when one thread stalls, another thread will take its place).
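
Put as a small sketch (illustrative only; it ignores fetch latency and other instruction types), using the figures from the example above:

import math

ALU_INSTRUCTIONS = 100   # from the example above
ALU_UNITS = 10

def total_cycles(num_threads):
    # Each batch of ALU_UNITS threads runs concurrently, one ALU instruction per cycle.
    batches = math.ceil(num_threads / ALU_UNITS)
    return batches * ALU_INSTRUCTIONS

for n in (1, 10, 100):
    print(n, total_cycles(n), total_cycles(n) / n)
# 1 thread    -> 100 cycles (100 per thread)
# 10 threads  -> 100 cycles (10 per thread)
# 100 threads -> 1000 cycles (10 effective cycles per thread)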

 

0 Likes

Ah, I understand, you don't give the actual kernel cycle count.

There might be some latency hiding, but you are still going to have some overhead from dependencies while the first wavefronts are being run.

Yes, many wavefronts in flight MIGHT improve performance, but this is not necessarily the case all the time; in fact, I have seen many examples where the GPR count was reduced significantly (theoretically allowing more wavefronts to run) but the performance was also reduced significantly. I only mention this as a warning for others reading the thread.

0 Likes

bpurnomo,

 Also, if I may ask and you happen to know: if an ALU clause is dependent on data from a TEX clause, does the ALU execute once the data is available, or once the TEX clause has completed? Is this documented somewhere?

Sorry, this is technically unrelated to the GPU Tools, I just thought you might know.

0 Likes

Originally posted by: ryta1203 bpurnomo,

 

 Also, if I may and you know: If an ALU clause is dependent on data from a TEX clause does the ALU execute once the data is available or once the TEX clause is completed? Is this documented somewhere?

 

Sorry, this is technically unrelated to the GPU Tools, I just thought you might know.

 

I believe it is the latter (but I'm not 100% sure).  The two different methods should only affect performance when you are GPR-bound.  Otherwise, other threads will be scheduled to fill in the gap (so effectively the two methods will have the same performance).

0 Likes

bpurnomo,

  Not so much 1 indicator as less than 100.   And not even that so much as documentation that can point developers in the right direction.

  4 things I think can be improved: poor docs, no profiler (still), no compiler optimization levels, no working assembler. These are the big things for me. I don't mind having to figure out 100 things, but please give me the support to do it (i.e., the four things I've mentioned above).

  I appreciate your time, as always, thanks!

0 Likes

Originally posted by: ryta1203 Ah, I understand, you don't give the actual kernel cycle count

Correct, since you won't be running only a single thread in your application (if you are, then you are using the GPU incorrectly).

Yes, many wavefronts in flight MIGHT improve performance, but this is not necessarily the case all the time; in fact, I have seen many examples where the GPR count was reduced significantly (theoretically allowing more wavefronts to run) but the performance was also reduced significantly. I only mention this as a warning for others reading the thread.


Yes.  Many factors affect performance.  I understand that you have been asking for a while for a single indicator to predict performance; unfortunately, we won't be able to provide this.  There are many factors that affect performance, which is why there is still much research into optimizing performance for GPGPU applications. The number of instructions matters, and the instruction types matter.  Not only that: to achieve high performance, your application has to utilize the cache (LDS) in each SIMD efficiently.  You should also minimize dependencies, etc.  Then there is the compiler factor: some "magic" settings detected by the compiler might cause your kernel to run several times slower than it should (because it defaults to a slow path to be conservative), etc.

 

0 Likes

Originally posted by: bpurnomo
Originally posted by: ryta1203 bpurnomo, so it should be just n, not n-1

 

For example, if the registers being used are R0 through R5 without any T registers, then it should report 6 GPR?

 

Also, just to clarify, it does count the T registers right, since they are a GPR?

 

Yeah it should just be n (according to your definition).  This should be fixed in the next version of SKA.

T registers are not part of the GPR calculation.  They are clause temporaries (they don't span across clauses) and they have their own dedicated pool.

 

bpurnomo,

   According to the ISA docs they do reduce the available GPRs a thread can use, so this would be important when calculating WFs used, yes?

  Since they reduce the GPRs available to a thread, it would be nice to have them calculated into the total number of GPRs used by the kernel (also, it would be helpful if it did this for both sides of the T registers, odd and even). This will help when a developer wants to look at the total number of WFs running in parallel, since the T registers affect this.

Thanks again.

0 Likes

Yes. There are dedicated pools for the T registers; however, when they are not used, that space can be utilized for GPRs.

I'll put in your request to report the number of T registers used by a kernel.
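
For anyone following along, here is a hypothetical sketch of why GPR usage (and any T-register use that eats into the same space) matters for occupancy.  The pool size below is a placeholder for illustration, not a figure from SKA or the ISA documents; check the documentation for your ASIC for the real numbers.

def wavefronts_per_simd(gprs_per_thread, register_pool=256):
    # Fewer GPRs per thread leaves room for more wavefronts in flight,
    # which is what helps hide fetch latency (see the earlier discussion).
    return register_pool // max(gprs_per_thread, 1)

print(wavefronts_per_simd(13))  # 19 with this placeholder pool size
print(wavefronts_per_simd(9))   # 28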

 

0 Likes

Oh, ok.

I actually thought there for a while that you guys weren't supporting the SKA anymore.... actually, I had hoped that you might be moving on to getting a profiler working. Aw, one can dream.

0 Likes

Originally posted by: bpurnomo

I'll put in your request to report the number of T registers used by a kernel.

 

bpurnomo, can I also put in a request that SKA use the installed CAL compiler... for example, right now SKA is using 9.4 (I believe), and since I want to make sure everything is correct, I can't update my drivers to the newest version until SKA does....

It would be great if the SKA would just use whatever drivers I have installed, that way I don't have to worry about this problem.

0 Likes

Hi Ryta, Unfortunately your request probably won't be fulfilled.  The reason is that SKA includes its own custom CAL compiler (it came from the CAL compiler in the driver, with our own modifications to work with SKA).  For security reasons, the CAL compiler team will probably object to the extra interfaces we added to the compiler.

Hopefully we will update SKA soon.

 

0 Likes

When can we expect OpenCL support from the SKA? Are we ever going to get a profiler by any chance?

0 Likes

The new SKA still does not show the full assembly.... It would be very nice to have this implemented. By full assembly I mean the header and footer for the ISA, so we can code in IL or Brook+ but "tweak" in assembly and use that assembly image in our program.

0 Likes

Hi Ryta,

    We haven't updated SKA to add your request yet.  Most of the improvements in SKA are under the hood, so you might not see them until the next month or so.

 

0 Likes

Yes, I kind of figured that GPR usage played an important role. So none of the KSA measurables take GPR usage into account? It might be helpful to add this in the future, because without it the KSA is mostly useless as a tool to gauge the performance of a kernel. Isn't that the point of the KSA, or am I missing the point? Maybe I misunderstood the use of the KSA?


Using SKA basically provides you access to the ATI compiler.   It uses the Brook+ compiler to compile Brook+ source files to IL.  Then, it calls the ATI Shader Compiler to compile the IL down to hardware disassembly for various ASICs and under various Catalyst drivers.  While you can use the Brook+ compiler directly instead of SKA, you don't have access to the ATI Shader Compiler except through SKA or the ATI driver.  In addition, SKA exposes some statistics generated by the Shader Compiler, such as the number of GPRs, ALU instructions, fetch instructions, etc.  Also, we provide some heuristics to compute the estimated cycle times for your kernel.  The heuristics are not perfect, as there are many factors that affect total performance.  Please also keep in mind that SKA is a static analysis tool and thus has its own limitations, since it is not a run-time profiler.

How is all of the above helpful to you as a game/stream developer?

1. You can tweak your kernel to achieve better performance by looking at the statistics generated by SKA.  You should look at all the statistics instead of just focusing on one particular item.  The ALU:Fetch ratio gives a hint of the balance of your system.  You should also try to minimize the number of GPRs used.  Finally, the estimated cycle time should also be a low number.  Some developers also like to look at the hardware disassembly to gain a better understanding of how to tweak their IL kernels.

2. If you want to know how your kernel performs on a particular graphics card, you can use SKA to gauge the performance on that particular graphics card even without having access to the hardware. 

3.  Similarly, without having to install a new Catalyst driver, you will be able to know whether a shader bug has been fixed/introduced in the new driver, or, even better, whether there are some performance improvements for your kernel/shader.

I hope this helps.

EDIT: It's also important to note that my GPR usage has gone down with another example (ALU:Fetch going from .88 to 1.07 and GPR going from 13 to 9), and this increases the runtime of the program. This is what is confusing to me.


Is the estimated cycle time higher in the second kernel?  You can also post both kernels so we will be able to get a better idea of the problem.

0 Likes
dukeleto
Adept I

Hello,
could I put in a request for a Linux version of SKA, or at least that some attention be paid to allowing the combination of brcc + SKA to work properly with Wine under Linux?
Currently both programs install (more or less), but I cannot manage to get SKA to find the (Windows version of) brcc.
Thanks!
0 Likes

Can we get the printing fixed in the next release of KSA? It would be great to be able to print out the information. Right now, printing freezes KSA, doesn't print, and ends with KSA not responding.

Also, a Find/Replace function would be very nice.
0 Likes

Originally posted by: dukeleto Hello, could I put in a request for a linux version of SKA, or at least that attention be payed to allowing the combination brcc+SKA to work properly with wine under linux? Currently both programs install approximately but I cannot manage to get the SKA to find the (windows version of) brcc. Thanks!


Originally posted by: ryta1203 Can we get the printing fixed in the next release of KSA? It would be great to be able to print out the information. Right now, the printing freezes KSA, doesn't print and ends in KSA not responding. Also, a Find/Replace function would be very nice.


I'll add all these requests into our bug tracking system so it can be prioritized for our future releases.

Cheers.

0 Likes

bpurnomo,

Sounds great. I seem to have another example that might help in convincing anyone that the KSA is limited. These two kernels give the same info from the KSA (with only slightly different ISA):

kernel void step1(float4 a<>, float4 b<>, out float c<>, out float d<>)
{
c = a.x + a.y + a.z + a.w;
d = b.x + b.y + b.z + b.w;
}

kernel void step2(float4 a<>, float4 b<>, out float4 out1<>, out float4 out2<>)
{
//float4 temp;
out1.x = a.x + a.y + a.z + a.w;
out2.x = b.x + b.y + b.z + b.w;
//out1 = temp;
//out2 = temp;
}


YET, the 1st kernel runs twice as fast as the second kernel. The KSA gives NO clue, other than examining the ISA, as to the reason this is happening. The ISA is very similar for both kernels.
0 Likes

Originally posted by: ryta1203 YET, the 1st kernel runs twice as fast as the second kernel. The KSA gives NO clue, other than examining the ISA, as to the reason this is happening. The ISA is very similar for both kernels.


We hear you.  We do believe that a run-time profiler would be a nice thing to have.  I'm actually on your side.

However, it is not true that SKA gives NO clue at all for those two kernels.  Without SKA, developers would have no idea why one is faster than the other.  After all, the ISA is exposed by SKA.

 

0 Likes

Looking at the KSA and the ISA, I can't tell why the one is faster than the other.

The ISA is very similar. Either way, the developer needs to know the ISA in order to optimize these kernels (very simple kernels at that). Wouldn't it be easier just to write kernels like this in ISA? If so, then there is no need for higher-level languages, or for the KSA at all.

I'm glad to hear a run-time profiler is being discussed; I think this will be very good, depending on the type of information it profiles.
0 Likes
naughtykid
Journeyman III

When will there be a Linux version?

0 Likes

Originally posted by: naughtykid when there will be a linux version?

 

Currently, we do not have plans to offer a Linux version of the tool.

However, we are continually evaluating our priority list based on customer feedback, so the plan might change in the future.

 

0 Likes

A Linux version of both the KSA and the (hopefully) upcoming profiler would be great.
0 Likes