
bpurnomo
Staff

New! ATI Stream Profiler version 1.4 is now available

We are pleased to announce the release of a new version of ATI Stream Profiler, version 1.4.

ATI Stream Profiler is a Microsoft® Visual Studio® integrated runtime profiler that gathers performance data from the GPU as your OpenCL™ application runs. This information can then be used by developers to discover where the bottlenecks are in their OpenCL™ application and find ways to optimize their application's performance.


New updates in this version include:

  • Support for Stream SDK v2.2.
  • Support for OpenCL™ 1.1.
  • Support for Microsoft® Visual Studio® 2010.
  • Support for a command-line interface.
  • Added a check for whether the current version is up to date.
  • Fixed the data transfer size reported for image objects.
  • Updated counter names and descriptions.

 

Please post your feedback here.

21 Replies
tomhammo
Journeyman III

Thanks! The additional performance counters are useful.

However, there is one performance counter that I think a few of us would love to see in the profiler: the number of concurrent workgroups per SIMD.

For example, say each workgroup uses 8 KB of local memory - ideally four would then run in parallel at the same time (there is not enough local memory for a fifth). At the moment there is no way to verify that this has actually occurred, other than in roundabout ways - for example, by measuring execution time.

Whilst it is possible to estimate this by checking the resource usage of each work item (#GPRs, amount of local memory) and comparing it to the maximum available per SIMD... it would be great to have a performance counter verifying the exact amount of parallelism that actually ends up being exploited.

For example - right now I am pretty sure that a kernel I am working on only interleaves two workgroups per SIMD at a time, even though I have enqueued more than enough workgroups and adjusted resource usage so that at least four workgroups should be running in parallel per SIMD - but performance says otherwise. Such a performance counter would save me a lot of time tracking this issue down.

regards,

- Tom Hammond


Thanks for the feedback - this is a great suggestion. This performance counter is not possible with our current hardware architecture; however, we'll consider it for future generations.


I have an issue with the profiler.

It's profiling my kernel fine at smaller dimensions (e.g., 256^2 or 1024^2), but at larger dimensions (2048^2) I'm getting no output from the profiler, even though the code is running fine (checked against a CPU reference version).

Any ideas as to why this might be happening? Could it have something to do with the kernel size?


Would you be able to send us a test case so we can reproduce it in house?  Please send it to gputools.support@amd.com.


Originally posted by: bpurnomo Would you be able to send us a test case so we can reproduce it in house?  Please send it to gputools.support@amd.com.

It now runs fine. Side note: my GPR usage has increased significantly without my changing any code - odd.


Originally posted by: bpurnomo Would you be able to send us a test case so we can reproduce it in house?  Please send it to gputools.support@amd.com.

bpurnomo,

   I'm now able to profile 2048x2048 with the new SDK, but when I go to 3072x3072 I get the same problem as before (i.e., no profiling information even though the code verifies fine against the CPU reference).


bpurnomo,

  I also have a question: why does the timing of the kernels reported by the profiler vary so much?

  For example, sometimes I get 14ms for a run and other times 18ms.

Currently, I am just taking the mean over 10 or so runs to get a more stable timing, but is this fluctuation normal?


Using the new version (1.4).

I am experiencing an inconsistency between my CPU timings and what the ATI Stream Profiler is reporting. In pseudocode:

timer.start();
clEnqueueMap*();
etc...
clEnqueueNDRangeKernel();
clFinish();
timer.stop();

 

Total GPU time as reported by ATI Stream profiler 1.4: ~ 38 ms

Total CPU time as reported by timer class (simply using QueryPerformanceCounter): ~ 110 ms

 

Is there any reason for this inconsistency that I'm not understanding? Thanks!

Just a guess, but I believe the profiler times just the transfers and the kernel. Are you adding these all together?

Plus, I'm sure there is some overhead associated with the OpenCL API calls that's probably not included in the profiler timings.


Originally posted by: ryta1203 Just a guess, but I believe the profiler times just the transfers and the kernel. Are you adding these all together?

Plus, I'm sure there is some overhead associated with the OpenCL API calls that's probably not included in the profiler timings.

Yes, I'm adding the Map and Kernel times together; it adds up to about 38 ms. I'm not eager to move back to SDK 2.1 because it doesn't support OpenCL 1.1.

 


Naturally, the GPU time reported in the profiler doesn't include run-time or driver overhead (the difference between the submit and start timestamps).

We may report run-time/driver overheads in some form in a future version of the tool.

ryta1203,

My guess is that the timing fluctuations are caused by memory transfers running concurrently with kernel execution.


Originally posted by: bpurnomo

ryta1203,

My guess is that the timing fluctuations are caused by memory transfers running concurrently with kernel execution.

If this were true, then one would think it would be consistent across SDK versions; however, I'm only seeing this with SDK 2.2 and not with SDK 2.1.

Does SDK 2.1 not support concurrent memory transfers/kernel execution?

Also, I'm using clFinish() in between each action.


Originally posted by: bpurnomo Naturally, the GPU time reported in the profiler doesn't include run-time or driver overhead (the difference between the submit and start timestamps).

We may report run-time/driver overheads in some form in a future version of the tool.

I would be very surprised if the overhead of 3 API calls (Map, Enqueue, Map) cost 70 ms.


Originally posted by: bpurnomo Naturally, the GPU time reported in the profiler doesn't include run-time or driver overhead (the difference between the submit and start timestamps).

We may report run-time/driver overheads in some form in a future version of the tool.

ryta1203,

My guess is that the timing fluctuations are caused by memory transfers running concurrently with kernel execution.

Why are there concurrent memory transfers?

Each call waits on some event, and I have clFinish(commandQueue) wrapped around the kernel - unless I am misunderstanding that function call's purpose?


Originally posted by: bpurnomo Naturally, the GPU time reported in the profiler doesn't include run-time or driver overhead (the difference between the submit and start timestamps).

We may report run-time/driver overheads in some form in a future version of the tool.

ryta1203,

My guess is that the timing fluctuations are caused by memory transfers running concurrently with kernel execution.

So can you please tell me how to get accurate timings without the memory transfers?

This happens even when I use blocking writes and reads.


When you use the profiler, do you see one or more CreateBuffer (or CreateImage) API calls with N/A timings?


Originally posted by: bpurnomo When you use the profiler, do you see one or more CreateBuffer (or CreateImage) API calls with N/A timings?

It depends: if I use clCreateBuffer and copy the pointer, then yes.

If I don't do that and use a non-blocking clEnqueueWriteBuffer, then no - I see WriteBufferAsynch.

If I use a blocking clEnqueueWriteBuffer, I see WriteBuffer.

All three of these show the variation with the samples at the 2k x 2k problem size: DCT (6.x up to 11.x ms), Mersenne Twister (14.x up to 18.x ms), Black-Scholes (6.x up to 11.x ms).

For the DCT, for example, if I go up to a 4k x 4k problem size the timings become much more stable and I no longer see this fluctuation (variances are in the 0.0x ms range, not the x.xx ms range).


Originally posted by: ryta1203 bpurnomo,

  I also have a question: why does the timing of the kernels reported by the profiler vary so much?

  For example, sometimes I get 14ms for a run and other times 18ms.

Currently, I am just taking the mean over 10 or so runs to get a more stable timing, but is this fluctuation normal?

After switching back to SDK 2.1, I'm not having this problem anymore.


I'm having issues with file associations in Visual Studio 2008 after I installed the Stream Profiler.

When I try to open any file that is associated with Visual Studio, I get a pop-up saying:

 

There was a problem sending the command to the program

 

The only other information provided was the path of the file in the message box displaying the error. The file is not opened once the message box is closed.

Interestingly enough, this only occurs when opening the file requires starting Visual Studio. If I start Visual Studio first and then open the file - either from the Visual Studio open dialog or from Windows Explorer - there is no problem.

I've only had this problem with non-project/solution files.

Is there a way around this other than uninstalling the Stream Profiler (perhaps a setting somewhere)?

Specs:

ATI Radeon HD 5650, Windows 7 64-bit Home Premium, Intel Core i5 mobile processor, ATI Stream SDK 2.2 with Stream Profiler 1.4


Thank you for the report.  We will investigate this problem.

Raistmer
Adept II

Originally posted by: bpurnomo
Please post your feedback here.



1) VS integration is broken for non-VC compilers.
2) The command-line version leads to unrestricted memory consumption and finally to a crash of the application being profiled.
