
bpurnomo
Staff

New! ATI Stream Profiler version 2.0 is now available

64-bit and DirectCompute support

We are pleased to announce the release of a new version of ATI Stream Profiler, version 2.0.

ATI Stream Profiler is a Microsoft® Visual Studio® integrated runtime profiler that gathers performance data from the GPU as your OpenCL™ or DirectCompute application runs. This information can then be used by developers to discover where the bottlenecks are in the application and find ways to optimize their application's performance.

New updates in this version include:

  • Support for profiling DirectCompute (DirectX 11) applications.
  • Support for profiling 64-bit OpenCL™ applications.
  • Reduced installation time for the plugin.
0 Likes
12 Replies
ryta1203
Journeyman III

I would like to see at least two counters added:

1. ALU stalled by Fetch

2. Write Busy

0 Likes

Also, why am I getting the same values for FetchBusy and FetchStalled? It almost looks like the value is simply being copied over but I'm not sure which is being copied to which so I can't tell if it's the Busy or the Stalled.

The other thing is that this number is significantly different now. I can't tell if it was the Profiler 2.0 or Catalyst 10.11.

0 Likes

Originally posted by: ryta1203 Also, why am I getting the same values for FetchBusy and FetchStalled? It almost looks like the value is simply being copied over but I'm not sure which is being copied to which so I can't tell if it's the Busy or the Stalled.

 

The other thing is that this number is significantly different now. I can't tell if it was the Profiler 2.0 or Catalyst 10.11.

 

Please see my response here.

0 Likes

Originally posted by: bpurnomo
Originally posted by: ryta1203 Also, why am I getting the same values for FetchBusy and FetchStalled? It almost looks like the value is simply being copied over but I'm not sure which is being copied to which so I can't tell if it's the Busy or the Stalled.

 

The other thing is that this number is significantly different now. I can't tell if it was the Profiler 2.0 or Catalyst 10.11.

 

Please see my response here.

Please see my response here: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=143181

0 Likes

Originally posted by: bpurnomo
Originally posted by: ryta1203 Also, why am I getting the same values for FetchBusy and FetchStalled? It almost looks like the value is simply being copied over but I'm not sure which is being copied to which so I can't tell if it's the Busy or the Stalled.

 

The other thing is that this number is significantly different now. I can't tell if it was the Profiler 2.0 or Catalyst 10.11.

 

Please see my response here.

bpurnomo,

 As part of the tools team you might also want to check this thread out:

http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=142757&enterthread=y

Seems that the CacheHit counter might also be reporting the wrong values. Is this true? If not, have you looked into this at all?

0 Likes

Originally posted by: ryta1203 I would like to see at least two counters added:

 

1. ALU stalled by Fetch

 

2. Write Busy

 

Thank you for your feedback.

Could you please clarify what you are looking for in point 1 above?  The ALU units and Fetch units can process instructions in parallel: the hardware just switches to process another wavefront if the results from the Fetch units are not yet available.

For your second request, this is not possible with the current hardware; however, we will consider it for future generations.  Thanks!

0 Likes

Originally posted by: bpurnomo
Originally posted by: ryta1203 I would like to see at least two counters added:

 

1. ALU stalled by Fetch

 

2. Write Busy

 

Thank you for your feedback.

Could you please clarify what you are looking for in point 1 above?  The ALU units and Fetch units can process instructions in parallel: the hardware just switches to process another wavefront if the results from the Fetch units are not yet available.

For your second request, this is not possible with the current hardware; however, we will consider it for future generations.  Thanks!

1. Yes, I understand this, but is there really no time in any application when the ALU units are starved? If you can tell me no, then I will accept that.

2. Understood.

Thank you.

0 Likes

1. Yes, I understand this, but is there really no time in any application when the ALU units are starved? If you can tell me no, then I will accept that.

ALU units can be starved, but because of the way the hardware operates, there may not be a direct relationship to the Fetch units (it may be too complicated to construct one).  I am asking for clarification to understand what exactly you are trying to achieve; perhaps we can address it in some other way.

0 Likes

bpurnomo,

  I guess it's not necessary for the profiler to give that much detail.

  Yes, essentially, what you said was too complicated is what I'd like to know. I suppose I can extract this from the ISA and the number of wavefronts running, given the latency of context switching and memory fetches, but even then the cache has to be taken into account.

  For now, I'll just use 100 - ALU Busy. At least this gives me some indication, but it doesn't really tell me why the ALU is stalled.

0 Likes

bpurnomo,

  It would also be great to see the number of simultaneous wavefronts. For example, you have "Wavefronts", but it's actually just Total Wavefronts, which is easy to calculate: Total Threads / Wavefront Size.

  What is slightly harder to calculate is the number of simultaneous wavefronts (which might be a good indication of how much latency is being hidden, or of whether we have enough wavefronts to hide latency). I know this can be calculated, but it's not as easy to calculate as Total Wavefronts, and you guys have included that, so...
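  To illustrate the easy half, the Total Wavefronts arithmetic above can be sketched as follows. This is only a sketch, not how the profiler computes its counter: the function name is hypothetical, the wavefront size of 64 is an assumption (it varies by GPU), and rounding up for a partially filled last wavefront is also an assumption.

```python
def total_wavefronts(total_threads: int, wavefront_size: int = 64) -> int:
    """Total Threads / Wavefront Size, rounded up.

    wavefront_size=64 is an assumed default; check the value for your GPU.
    Rounding up (a partial wavefront still counts as one) is an assumption.
    """
    if total_threads < 0 or wavefront_size <= 0:
        raise ValueError("thread count must be >= 0 and wavefront size > 0")
    # Ceiling division using integers only.
    return (total_threads + wavefront_size - 1) // wavefront_size


# A 4k x 4k global size, taken here to mean 4096 x 4096 threads.
print(total_wavefronts(4096 * 4096))
```

  The number of simultaneous wavefronts is not covered by this sketch, since it depends on per-kernel resource usage that the thread count alone does not give.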

0 Likes

bpurnomo,

  Please read this thread: http://forums.amd.com/devforum/messageview.cfm?catid=390&threadid=143399&enterthread=y

Any feedback you can give would be great.

0 Likes

For the Matrix Transpose example, there is no indication given by the profiler as to why the kernel takes so long to run.

For instance, given a thread size of 4k*4k with a block size of 16x16, the run time is ~43.02.

Actual ALU busy time: ~1.05

Actual Fetch busy time: ~0.249

Est Write Time: ~.4429 + Write Stall Time of ~5.175306

Est LDS Time: ~2.467 with no stalls

My request is that, for problems of this nature, we get some counters that are more useful in telling where most of the time is being spent. It's difficult to tell here, since these numbers don't even come close to adding up to ~43.02.
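To make the gap concrete, the figures above can be tallied as follows. This is a rough sketch only: it assumes every listed time is in the same unit as the run time and that none of them overlap, which if anything overstates the accounted time, since ALU, fetch, and write work can proceed in parallel. The variable names are illustrative.

```python
# Figures quoted above for the Matrix Transpose run (units as reported).
run_time = 43.02
components = {
    "alu_busy": 1.05,
    "fetch_busy": 0.249,
    "write_est": 0.4429,
    "write_stall": 5.175306,
    "lds_est": 2.467,
}

# Assumes no overlap between components, so this is an upper bound on
# the time the listed counters can explain.
accounted = sum(components.values())
unaccounted = run_time - accounted

print(f"accounted:   {accounted:.3f}")
print(f"unaccounted: {unaccounted:.3f} ({unaccounted / run_time:.1%} of run time)")
```

Even under that generous assumption, the counters explain roughly 9.4 of the ~43.02, leaving about 78% of the run time unattributed.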

Thanks.

0 Likes