21 Replies Latest reply on Sep 18, 2010 7:27 PM by Raistmer

    New! ATI Stream Profiler version 1.4 is now available

    bpurnomo

      We are pleased to announce the release of a new version of ATI Stream Profiler, version 1.4.

      ATI Stream Profiler is a Microsoft® Visual Studio® integrated runtime profiler that gathers performance data from the GPU as your OpenCL™ application runs. This information can then be used by developers to discover where the bottlenecks are in their OpenCL™ application and find ways to optimize their application's performance.


      New updates in this version include

      • Support for Stream SDK v2.2.
      • Support for OpenCL™ 1.1.
      • Support for Microsoft® Visual Studio® 2010.
      • Support for a command-line interface.
      • Added a check for whether the installed version is up to date.
      • Fixed data transfer size for image objects.
      • Updated counter names and descriptions.

       

      Please post your feedback here.

        • New! ATI Stream Profiler version 1.3 is now available
          tomhammo

          Thanks! The additional performance counters are useful.

          However, there is one performance counter I think a few of us would love to see in v1.4 of the profiler: the number of concurrent workgroups per SIMD.

          For example, say each workgroup uses 8 KB of local memory - ideally four would run in parallel at the same time (there is not enough local memory for 5 or more). At the moment there is no way to verify that this has actually occurred, other than in roundabout ways such as measuring execution time.

          Whilst it is possible to determine this by checking resource usage of each work item (#GPRS, amount of local memory) and comparing to the max available per SIMD... it would be great to have a performance counter verifying the exact amount of parallelism that ends up being exploited.

          For example, right now I am pretty sure that a kernel I am working on only interleaves two workgroups per SIMD at a time, even though I have enqueued more than enough workgroups and adjusted resource usage so that at least four workgroups should be running in parallel per SIMD. But performance says otherwise. The performance counter would save me a lot of time tracking this issue down.
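The manual check described above (compare each workgroup's resource usage with what a SIMD offers) can be sketched as a small calculation. This is only an estimate under stated assumptions, not something the profiler reports: the struct and function names are hypothetical, the 32 KB local-memory budget is inferred from the 8 KB / four-workgroup example in the post, and the register-file size and workgroup cap used in the note below are placeholder values that would have to be replaced with figures for the actual device.

```cpp
#include <algorithm>
#include <cstdint>

// Per-SIMD resource budget. All values are supplied by the caller; nothing
// here is queried from the hardware or taken from the profiler.
struct SimdBudget {
    std::uint32_t local_mem_bytes;  // local memory available per SIMD
    std::uint32_t registers;        // total per-work-item registers per SIMD
    std::uint32_t max_workgroups;   // hard cap on resident workgroups (assumed)
};

// Resource usage of one workgroup of the kernel under study.
struct WorkgroupUsage {
    std::uint32_t local_mem_bytes;     // local memory used per workgroup
    std::uint32_t registers_per_item;  // GPRs used by each work item
    std::uint32_t items;               // work items per workgroup
};

// Upper bound on the workgroups that can be resident on one SIMD at once:
// the tightest of the local-memory, register, and hard-cap limits.
inline std::uint32_t max_resident_workgroups(const SimdBudget& simd,
                                             const WorkgroupUsage& wg) {
    std::uint32_t limit = simd.max_workgroups;
    if (wg.local_mem_bytes > 0) {
        limit = std::min(limit, simd.local_mem_bytes / wg.local_mem_bytes);
    }
    const std::uint32_t regs_per_group = wg.registers_per_item * wg.items;
    if (regs_per_group > 0) {
        limit = std::min(limit, simd.registers / regs_per_group);
    }
    return limit;
}
```

With `SimdBudget{32768, 16384, 8}` and a workgroup that uses 8 KB of local memory, 16 registers per work item, and 64 work items, local memory is the tightest limit and the bound is four workgroups. Raising register usage to 128 per work item makes registers the limit and drops the bound to two. This is a simplified model; the real scheduler may apply further constraints.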

          regards,

          - Tom Hammond

            • New! ATI Stream Profiler version 1.3 is now available
              bpurnomo

              Thanks for the feedback, this is a great suggestion.  This performance counter is not possible with our current hardware architecture, however we'll consider it for future generations.

                • New! ATI Stream Profiler version 1.3 is now available
                  ryta1203

                  I have an issue with the profiler.

                  It's profiling my kernel fine at smaller dimensions (e.g. 256^2 or 1024^2), but at larger dimensions (2048^2) I'm getting no output from the profiler, even though the code is running fine (checked against a CPU reference version).

                  Any ideas as to why this might be happening? Could it have something to do with the kernel size?

                  • New! ATI Stream Profiler version 1.3 is now available
                    ryta1203

                    bpurnomo,

                      I also have a question as to why the timing of the kernels via the profiler varies so much?

                      For example, sometimes I get 14ms for a run and other times 18ms.

                    Currently, I am just taking the mean over 10 or so runs to get a more stable timing, but is this fluctuation normal?
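Averaging over several runs, as described above, can be complemented by the minimum, maximum, and standard deviation, which make it easier to tell a steady drift from occasional outliers. The following is a minimal sketch; `TimingSummary` and `summarize` are hypothetical names, and the timings are assumed to have been copied by hand from the profiler output, in milliseconds.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Summary statistics for a set of kernel timings, in milliseconds.
struct TimingSummary {
    double mean_ms;
    double min_ms;
    double max_ms;
    double stddev_ms;  // sample standard deviation (n - 1 in the denominator)
};

// Returns an all-zero summary for an empty input.
inline TimingSummary summarize(const std::vector<double>& runs_ms) {
    TimingSummary s{0.0, 0.0, 0.0, 0.0};
    if (runs_ms.empty()) {
        return s;
    }
    double sum = 0.0;
    for (double t : runs_ms) {
        sum += t;
    }
    s.mean_ms = sum / runs_ms.size();
    s.min_ms = *std::min_element(runs_ms.begin(), runs_ms.end());
    s.max_ms = *std::max_element(runs_ms.begin(), runs_ms.end());
    if (runs_ms.size() > 1) {
        double squared = 0.0;
        for (double t : runs_ms) {
            squared += (t - s.mean_ms) * (t - s.mean_ms);
        }
        s.stddev_ms = std::sqrt(squared / (runs_ms.size() - 1));
    }
    return s;
}
```

The minimum is often the most repeatable figure when comparing kernel variants, because interference from other activity can only add time to a run.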

                      • New! ATI Stream Profiler version 1.3 is now available
                        ngaloppo

                        Using the new version (1.4).

                        I am experiencing inconsistency between my CPU timings and what ATI Stream is reporting. In pseudocode: 

                        timer.start();
                        clEnqueueMap*();
                        etc...
                        clEnqueueNDRangeKernel();
                        clFinish();
                        timer.stop();

                         

                        Total GPU time as reported by ATI Stream profiler 1.4: ~ 38 ms

                        Total CPU time as reported by timer class (simply using QueryPerformanceCounter): ~ 110 ms

                         

                        Is there any reason for this inconsistency that I'm not understanding? Thanks!
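To make the size of the discrepancy explicit, the wall-clock figure can be compared with the sum of the per-command GPU times. This is a minimal sketch: `TimeGap` and `compare_wall_to_gpu` are hypothetical names, and the per-command split used in the note below is invented for illustration, since the post gives only the ~38 ms total.

```cpp
#include <numeric>
#include <vector>

// Result of comparing host wall-clock time with summed GPU command times.
struct TimeGap {
    double gpu_ms;                // sum of the per-command GPU times
    double unaccounted_ms;        // wall-clock time not covered by GPU time
    double unaccounted_fraction;  // unaccounted_ms / wall_ms (0 if wall_ms <= 0)
};

inline TimeGap compare_wall_to_gpu(double wall_ms,
                                   const std::vector<double>& gpu_ms_per_command) {
    const double gpu = std::accumulate(gpu_ms_per_command.begin(),
                                       gpu_ms_per_command.end(), 0.0);
    const double gap = wall_ms - gpu;
    const double fraction = wall_ms > 0.0 ? gap / wall_ms : 0.0;
    return TimeGap{gpu, gap, fraction};
}
```

For a 110 ms wall-clock measurement and GPU times that sum to 38 ms (for example a hypothetical split of 10, 20, and 8 ms), about 72 ms, or roughly 65% of the measured interval, is spent outside the commands the profiler times.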


                          • New! ATI Stream Profiler version 1.3 is now available
                            ryta1203

                            Just a guess, but I believe the profiler times just the transfers and the kernel. Are you adding these all together?

                            Plus, I'm sure there is some overhead associated with the OpenCL API calls that's probably not included in the profiler timings.

                              • New! ATI Stream Profiler version 1.3 is now available
                                ngaloppo

                                 

                                Originally posted by: ryta1203 Just a guess, but I believe the profiler times just the transfers and the kernel. Are you adding these all together?

                                 

                                Plus, I'm sure there is some overhead associated with the OpenCL API calls that's probably not included in the profiler timings.

                                 

                                Yes, I'm adding the Map & Kernel times together. Adds up to about 38ms. I'm not eager to move back to SDK 2.1 because it doesn't support OpenCL 1.1.

                                 

                                  • New! ATI Stream Profiler version 1.4 is now available
                                    bpurnomo

                                    Naturally, the GPU time reported in the profiler doesn't include the run-time or driver overhead (the difference between the submit and start timestamps).

                                    We may report the run-time/driver overheads in some form in a future version of the tool.
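The overhead mentioned above can be separated from execution time when the four per-command timestamps are available. In OpenCL these are exposed through `clGetEventProfilingInfo` (`CL_PROFILING_COMMAND_QUEUED`, `_SUBMIT`, `_START`, `_END`, in nanoseconds) on a queue created with `CL_QUEUE_PROFILING_ENABLE`. The sketch below deliberately does not call OpenCL; it only shows the arithmetic on values already read from an event. `CommandTimestamps`, `CommandBreakdown`, and `break_down` are hypothetical names, and whether the profiler itself computes its figures this way is an assumption.

```cpp
#include <cstdint>

// The four profiling timestamps of one command, in nanoseconds.
// Assumes queued <= submit <= start <= end.
struct CommandTimestamps {
    std::uint64_t queued;
    std::uint64_t submit;
    std::uint64_t start;
    std::uint64_t end;
};

// Time spent in each phase of the command's life, in milliseconds.
struct CommandBreakdown {
    double queue_wait_ms;       // queued -> submit: waiting in the host queue
    double launch_overhead_ms;  // submit -> start: run-time/driver overhead
    double execution_ms;        // start -> end: time on the device
};

inline CommandBreakdown break_down(const CommandTimestamps& t) {
    const double ns_to_ms = 1e-6;
    return CommandBreakdown{
        (t.submit - t.queued) * ns_to_ms,
        (t.start - t.submit) * ns_to_ms,
        (t.end - t.start) * ns_to_ms,
    };
}
```

Summing `queue_wait_ms` and `launch_overhead_ms` over all commands gives a rough idea of how much of a host-side measurement falls outside the device execution time.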

                                    ryta1203,

                                    My guess is that the timing fluctuations are caused by memory transfers running concurrently with the kernel execution.

                                      • New! ATI Stream Profiler version 1.4 is now available
                                        ryta1203

                                         

                                        Originally posted by: bpurnomo

                                        ryta1203,

                                        My guess is that the timing fluctuations are caused by memory transfers running concurrently with the kernel execution.

                                        If this were true, one would expect it to be consistent across different versions of the SDK; however, I'm only seeing this with SDK 2.2 and not SDK 2.1.

                                        Does SDK 2.1 not support concurrent memory transfers and kernel execution?

                                        Also, I'm using clFinish() in between each action.

                                        • New! ATI Stream Profiler version 1.4 is now available
                                          ngaloppo

                                           

                                          Originally posted by: bpurnomo Naturally, the GPU time reported in the profiler doesn't include the run-time or driver overhead (the difference between the submit and start timestamps).

                                          We may report the run-time/driver overheads in some form in a future version of the tool.

                                          I would be very surprised if the overhead of 3 API calls (Map, Enqueue, Map) were costing 70 ms.

                                          • New! ATI Stream Profiler version 1.4 is now available
                                            ryta1203

                                             

                                            Originally posted by: bpurnomo Naturally, the GPU time reported in the profiler doesn't include the run-time or driver overhead (the difference between the submit and start timestamps).

                                            We may report the run-time/driver overheads in some form in a future version of the tool.

                                            ryta1203,

                                            My guess is that the timing fluctuations are caused by memory transfers running concurrently with the kernel execution.

                                            Why are there concurrent memory transfers?

                                            Each call is waiting on some event, and I have clFinish(commandQueue) wrapped around the kernel. Unless I am misunderstanding that function call's purpose?

                                            • New! ATI Stream Profiler version 1.4 is now available
                                              ryta1203

                                               

                                              Originally posted by: bpurnomo Naturally, the GPU time reported in the profiler doesn't include the run-time or driver overhead (the difference between the submit and start timestamps).

                                              We may report the run-time/driver overheads in some form in a future version of the tool.

                                              ryta1203,

                                              My guess is that the timing fluctuations are caused by memory transfers running concurrently with the kernel execution.

                                              So can you please tell me how to get accurate timings without the memory transfers?

                                              This happens even when I use blocking writes and reads.

                                                • New! ATI Stream Profiler version 1.4 is now available
                                                  bpurnomo

                                                  When you use the profiler, do you see one or more CreateBuffer (or CreateImage) API calls with N/A timings?

                                                    • New! ATI Stream Profiler version 1.4 is now available
                                                      ryta1203

                                                       

                                                      Originally posted by: bpurnomo When you use the profiler, do you see one or more CreateBuffer (or CreateImage) API calls with N/A timings?

                                                      It depends. If I use clCreateBuffer and copy the pointer, then yes.

                                                      If I don't do this and use a non-blocking clEnqueueWriteBuffer, then no; I see WriteBufferAsynch.

                                                      If I use a blocking clEnqueueWriteBuffer, I see WriteBuffer.

                                                      All three of these show the variations with the samples at the 2k x 2k problem size: DCT (6.x ms up to 11.x ms), Mersenne Twister (14.x ms up to 18.x ms), Black Scholes (6.x ms up to 11.x ms).

                                                      For the DCT, for example, if I go up to a 4k x 4k problem size, the timings become much more stable and I no longer see this fluctuation (variances occur in the .0x ms range, not in the x.xx ms range).

                                          • New! ATI Stream Profiler version 1.3 is now available
                                            ryta1203

                                             

                                            Originally posted by: ryta1203 bpurnomo,

                                              I also have a question as to why the timing of the kernels via the profiler varies so much?

                                              For example, sometimes I get 14ms for a run and other times 18ms.

                                            Currently, I am just taking the mean over 10 or so runs to get a more stable timing, but is this fluctuation normal?

                                            After switching back to SDK 2.1, I'm not having this problem anymore.

                                      • New! ATI Stream Profiler version 1.4 is now available
                                        helloworld922

                                        I'm having issues with file associations with Visual Studio 2008 after I installed the stream profiler.

                                        When I try to open any file that is associated with visual studio, I get a pop-up saying:

                                         

                                        "There was a problem sending the command to the program"

                                         

                                        The only other information provided was the path of the file, shown in the message box displaying the error. The file is not opened once the message box is closed.

                                        Interestingly enough, this only occurs when opening the file requires opening visual studio. So if I start visual studio first and then open the file either from the visual studio open dialog, or open the file from windows explorer there is no problem.

                                        I've only had this problem with non-project/solution files.

                                        Is there a way around this without just uninstalling the stream profiler (perhaps a setting somewhere)?

                                        Specs:

                                        ATI 5650HD, Windows 7 64-bit Home Premium, Intel Core i5 mobile processor, ATI Stream SDK 2.2 with Stream Profiler 1.4

                                        • New! ATI Stream Profiler version 1.4 is now available
                                          Raistmer
                                          Originally posted by: bpurnomo
                                          Please post your feedback here.



                                          1) VS integration is broken for non-VC compilers.
                                          2) The command-line version leads to unrestricted memory consumption and finally to a crash of the profiling application.