12 Replies Latest reply on Dec 13, 2010 5:29 PM by ryta1203

    New! ATI Stream Profiler version 2.0 is now available

    bpurnomo
      64-bit and DirectCompute support

      We are pleased to announce the release of a new version of ATI Stream Profiler, version 2.0.

      ATI Stream Profiler is a Microsoft® Visual Studio® integrated runtime profiler that gathers performance data from the GPU as your OpenCL™ or DirectCompute application runs. Developers can use this information to find the bottlenecks in an application and optimize its performance.

      New in this version:

      • Support for profiling DirectCompute (DirectX 11) applications.
      • Support for profiling 64-bit OpenCL™ applications.
      • Reduced installation time for the plugin.
        • New! ATI Stream Profiler version 2.0 is now available
          ryta1203

          I would like to see at least two counters added:

          1. ALU stalled by Fetch

          2. Write Busy

            • New! ATI Stream Profiler version 2.0 is now available
              ryta1203

              Also, why am I getting the same values for FetchBusy and FetchStalled? It almost looks like one value is simply being copied over the other, but I can't tell which, so I don't know whether the number I'm seeing is the Busy or the Stalled value.

              The other thing is that this number is significantly different now. I can't tell whether the change came from Profiler 2.0 or Catalyst 10.11.

              • New! ATI Stream Profiler version 2.0 is now available
                bpurnomo

                 

                Originally posted by: ryta1203

                I would like to see at least two counters added:

                1. ALU stalled by Fetch

                2. Write Busy

                Thank you for your feedback.

                Could you please clarify what you are looking for in point 1 above?  The ALU units and Fetch units can process instructions in parallel: the hardware just switches to process another wavefront if the results from the Fetch units are not yet available.

                Your second request is not possible with the current hardware; however, we will consider it for future generations.  Thanks!
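The wavefront switching described above can be illustrated with a toy model. This is only a rough sketch under strong simplifying assumptions, not a description of the actual hardware scheduler: every wavefront is assumed to alternate a fixed ALU burst with a fixed fetch latency, and the function name and its parameters are hypothetical.

```python
def alu_utilization(wavefronts: int, alu_cycles: int, fetch_latency: int) -> float:
    """Estimate the fraction of time the ALU is busy in a toy latency-hiding model.

    Assumption (not from the thread): each wavefront repeats the pattern
    "alu_cycles of ALU work, then fetch_latency cycles waiting on a fetch",
    and the ALU switches to any other ready wavefront at no cost.
    """
    if wavefronts < 1 or alu_cycles < 1 or fetch_latency < 0:
        raise ValueError("expected wavefronts >= 1, alu_cycles >= 1, fetch_latency >= 0")

    # One full cycle of a single wavefront: ALU burst plus fetch wait.
    period = alu_cycles + fetch_latency
    # ALU work that all wavefronts together can offer during one such period.
    demand = wavefronts * alu_cycles
    # The ALU cannot be more than 100% busy.
    return min(1.0, demand / period)


# One wavefront cannot hide its own fetch latency; four can in this example.
print(alu_utilization(1, alu_cycles=4, fetch_latency=12))
print(alu_utilization(4, alu_cycles=4, fetch_latency=12))
```

In this model the ALU sits idle part of the time whenever too few wavefronts are in flight to cover the fetch latency, and stays fully busy once there are enough of them.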

                  • New! ATI Stream Profiler version 2.0 is now available
                    ryta1203

                     

                    Originally posted by: bpurnomo

                    Originally posted by: ryta1203

                    I would like to see at least two counters added:

                    1. ALU stalled by Fetch

                    2. Write Busy

                    Thank you for your feedback.

                    Could you please clarify what you are looking for in point 1 above?  The ALU units and Fetch units can process instructions in parallel: the hardware just switches to process another wavefront if the results from the Fetch units are not yet available.

                    Your second request is not possible with the current hardware; however, we will consider it for future generations.  Thanks!

                    1. Yes, I understand this, but is there really no time in any application when the ALU units are starved? If you can tell me there is not, then I will accept that.

                    2. Understood.

                    Thank you.

                    • New! ATI Stream Profiler version 2.0 is now available
                      ryta1203

                      For the Matrix Transpose example, there is no indication given by the profiler as to why the kernel takes so long to run.

                      For instance, given a thread size of 4k x 4k with a block size of 16x16, the run time is ~43.02.

                      Actual ALU busy time: ~1.05

                      Actual Fetch busy time: ~0.249

                      Est Write Time: ~0.4429 + Write Stall Time of ~5.175306

                      Est LDS Time: ~2.467 with no stalls

                      My request is that, for problems of this nature, we get some counters that are more useful in telling where most of the time is being spent. It's difficult to tell here, since these don't even come close to ~43.02.

                      Thanks.
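To make the gap in the figures above concrete, here is a small arithmetic sketch. It assumes that all of the quoted values are in the same time unit (the post does not state one) and that simply adding them is a fair upper bound on the time they explain; since ALU, fetch, write and LDS activity can overlap, the real accounted time could be lower. The labels and variable names are illustrative only.

```python
# Values quoted in the post; the unit is not stated there.
kernel_time = 43.02

counters = {
    "ALU busy": 1.05,
    "Fetch busy": 0.249,
    "Write (estimated)": 0.4429,
    "Write stall": 5.175306,
    "LDS (estimated)": 2.467,
}

# Upper bound on the time these counters explain, assuming no overlap at all.
accounted = sum(counters.values())
unaccounted = kernel_time - accounted
share = accounted / kernel_time

print(f"accounted for:   {accounted:.3f}")
print(f"unaccounted for: {unaccounted:.3f}")
print(f"share explained: {share:.1%}")
```

Even without any overlap, the listed counters add up to roughly 9.38, which is about 22% of the ~43.02 run time, so most of the kernel time is not attributed to any of them.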