
corry
Adept III

Accuracy of estimated Throughput in KernelAnalyzer

I suppose I'll find out soon enough, but I was curious to get an initial idea: how accurate is that number? If it says I will get 1B threads/sec, can I expect that number? 1/2? 1/4? I don't trust the number on some architectures, since on the older ones it says I'd use 2 GP registers and see 3-4x the performance of a Cayman. I find that kind of hard to believe. I am targeting more modern GPUs and perhaps using instructions not present on the older ones; do those just get "optimized out" on the older architectures, showing me huge performance numbers?

 

genaganna
Journeyman III

Could you please post your experimental code here so we can analyze the issue? Also, which older card are you using?


Sorry I never got back here... I don't think it's a bug, which is why it didn't bother me enough to come back.

I was asking whether the throughput numbers given by the tool are accurate. I stated my reason for doubt, and I'll try to restate it. I'm just looking for someone to say whether, when it reports 1200M Threads/sec, that is really what can be expected.

I don't have any older cards, but by default Kernel Analyzer outputs throughput, GPRs used, etc. for each card to the information window. Originally I just sorted by highest throughput, as I figured that would put the 6970 (the fastest one in the list) on top, but to my surprise the older cards were showing up to 2x as fast and showed they were only using 2 GPRs. Right now I have a lot of debug output and therefore several UAVs, so it shows NA for the other cards. Once I rip all that back out, I'll post a screenshot of the output.


Ok, now I *KNOW* this is optimistic....very, very, very optimistic....Ok fine, 100% unrealistic, and a bug 🙂  17G Threads/Sec would sure be nice for even a simple kernel!

 


A little more help on this...

I found that the offending code was adding a large loop.  I guess to understand it, I have to explain a little...

my il_cs_2_0 (main) function basically just calls 5 helper functions.  1 of them involves a pretty heavy section of processing, and one other is also fairly hefty on the ALU, but not as bad as the first one.  The rest are setup, packing/unpacking/byte order, etc.  Nothing too big.

Without the loop, I am currently getting ~300M Threads/sec in the Throughput tab of Kernel Analyzer. The loop counter, to test performance, is set to 262144, or 0x40000, a nice power of 2 that will take more than 2 seconds to run and thus should provide a good average time for results. When I enable that loop, the throughput number jumps. At first I thought perhaps the M was just supposed to be a K. This is not the case though, as 300M/262144 is about 1144, not 17000-something... so I'm at a loss. I can at least confirm that its GPR usage number seems to be correct: when I optimized for fewer GPRs and the count in the window went up, my kernel ran slower. I should also note that I'm not seeing the throughput given in the analyzer; I'm only seeing about 77% of that...
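For anyone wanting to redo that sanity check, here is a minimal sketch of the arithmetic in Python. It assumes the loop body dominates the kernel's cost, so throughput should fall roughly in proportion to the iteration count; that scaling model and all function names are illustrative assumptions, not anything Kernel Analyzer exposes. The 231M measured figure in the usage lines is a made-up value chosen only to illustrate a 77% ratio.

```python
def expected_looped_throughput(base_threads_per_sec, loop_iterations):
    """Rough estimate: if each thread repeats the same work
    loop_iterations times, throughput drops by that factor.
    (Assumes the loop body dominates the kernel's run time.)"""
    return base_threads_per_sec / loop_iterations


def threads_per_second(total_threads, elapsed_seconds):
    """Measured throughput from a timed run: threads launched / wall time."""
    return total_threads / elapsed_seconds


def measured_fraction(measured_threads_per_sec, estimated_threads_per_sec):
    """Fraction of the tool's estimate actually achieved on hardware."""
    return measured_threads_per_sec / estimated_threads_per_sec


base = 300e6            # ~300M Threads/sec reported without the loop
iterations = 0x40000    # 262144 loop iterations

expected = expected_looped_throughput(base, iterations)
print(f"expected with loop: ~{expected:.0f} threads/sec")

# Hypothetical measured value, picked only to illustrate a 77% ratio.
fraction = measured_fraction(231e6, base)
print(f"achieved {fraction:.0%} of the estimate")
```

An estimate that goes up when the loop is enabled is inconsistent with this simple model, which is the reason for suspecting the tool rather than the kernel.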


Could you please file a ticket with your kernel at http://developer.amd.com/support/KnowledgeBase/pages/HelpdeskTicketForm.aspx?Category=8?

 


The only thing I could file would be a test case, and I really don't have time for that. Sorry. I'd like to see the tools get better, but somehow I get the feeling that rather than getting better, they are just going to stop supporting the interfaces I need.
