
What does the Time counter in AMD APP Profiler list?
The AMD APP Profiler documentation states that the Time counter is

"For a kernel dispatch operation: time spent executing the kernel in milliseconds (does not include the kernel setup time). For a buffer or image object operation, time spent transferring data in milliseconds."

which makes me wonder whether "time spent executing the kernel" is wall clock time or CPU (GPU) time.

It seems to be the CPU (GPU) time, the time during which kernel instructions are actively being processed, excluding any wait times on memory operations etc.

Let me explain why I think this is the case:

I profiled two different versions of my application; let's call them FAST and SLOW. Both have the same domain size and the same number of work items; they differ only in their memory access patterns and buffer entry sizes (float2 vs. float3).
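As an aside on the float2 vs. float3 entry sizes: in OpenCL C, a 3-component vector stored in a buffer takes the size and alignment of a 4-component one, so (unless packed floats with vload3/vstore3 are used) a float3 buffer moves twice as many bytes per entry as a float2 buffer. A minimal sketch of that arithmetic (the domain size here is a made-up figure, not from my runs):

```python
# sizeof(float) in OpenCL C is 4 bytes
SIZEOF_FLOAT = 4

def buffer_bytes(n_entries, floats_per_entry, stored_as=None):
    """Bytes the fetch unit must move for one full pass over the buffer.

    stored_as overrides the per-entry float count when the type is padded;
    a float3 in a buffer occupies the space of a float4.
    """
    stride = (stored_as if stored_as is not None else floats_per_entry) * SIZEOF_FLOAT
    return n_entries * stride

n = 1 << 20                                      # hypothetical domain: ~1M work items
float2_total = buffer_bytes(n, 2)                # 8-byte stride per entry
float3_total = buffer_bytes(n, 3, stored_as=4)   # padded to a 16-byte stride
# float3_total is exactly twice float2_total: twice the fetch traffic
```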

SLOW had a low Time counter and a high FetchUnitStalled percentage.
FAST had a high Time counter and a low FetchUnitStalled percentage.

Still, FAST executed visibly faster than SLOW. This was also verified by measuring the wall clock time in the client application (including glFinish before and afterwards).
If Time indeed lists the CPU (GPU) time, this would make sense: even though that time is larger in FAST, other wavefronts can be executed while some wait on memory operations, so the whole kernel finishes faster. In SLOW, on the other hand, the compiler/runtime/driver might not switch wavefronts and instead sit through all those tiny waits, taking more time in total because the memory access stalls are not hidden.
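For reference, the wall-clock measurement I used follows a simple pattern; here is a language-agnostic sketch of it (the run_kernel and finish callables stand in for the enqueue call and for glFinish; both names are mine, not from any API):

```python
import time

def wall_clock_ms(run_kernel, finish):
    """Measure wall-clock kernel time; finish() must block until the GPU is idle."""
    finish()                      # drain any previously queued GPU work first
    t0 = time.perf_counter()
    run_kernel()                  # enqueue the kernel (asynchronous)
    finish()                      # wait until the kernel has actually completed
    return (time.perf_counter() - t0) * 1e3
```

Without the second finish() the measurement would only cover the (asynchronous) enqueue call, not the kernel itself.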

Does Time really list CPU (GPU) time?

And what time does the GPUTime in the description of FetchUnitStalled
"The percentage of GPUTime the Fetch unit is stalled."
refer to, then?


4 Replies


Nice investigation. But let's go by the definitions mentioned in the programming guide.

Time: the GPU time from the instant the kernel was launched to the instant it finished. So it should include all the fetch, ALU, and write times; otherwise the other definitions would have no significance at all.

ALUBusy: The percentage of GPUTime ALU instructions are processed. Value range: 0% (bad) to 100% (optimal).

FetchUnitBusy: The percentage of GPUTime the Fetch unit is active. The result includes the stall time (FetchUnitStalled). This is measured with all extra fetches and any cache or memory effects taken into account. Value range: 0% to 100% (fetch-bound).

Similarly for WriteUnitBusy.

Some of these operations may happen in parallel (for example, one wavefront stalled on a write while execution continues on another), which reasonably explains the choice of expressing the definitions as percentages of GPUTime. So we can estimate how much the writes and fetches affect the ALU and how much of that cost is hidden.
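If that reading is correct, the percentage counters can be converted into absolute, possibly overlapping, times. A rough sketch with made-up numbers (none of these figures come from the thread):

```python
def counter_times(gpu_time_ms, alu_busy_pct, fetch_busy_pct, fetch_stalled_pct):
    """Turn AMD APP Profiler percentage counters into absolute times.

    All percentages are relative to GPUTime, so the absolute times may
    overlap and their sum can exceed gpu_time_ms; that overlap is exactly
    the latency hiding discussed above.
    """
    alu_ms = gpu_time_ms * alu_busy_pct / 100.0
    fetch_ms = gpu_time_ms * fetch_busy_pct / 100.0
    stall_ms = gpu_time_ms * fetch_stalled_pct / 100.0
    return alu_ms, fetch_ms, stall_ms

# Hypothetical dispatch: GPUTime 10 ms, ALU busy 60%, fetch busy 80% (30% stalled)
alu, fetch, stall = counter_times(10.0, 60.0, 80.0, 30.0)
# alu + fetch = 14 ms > 10 ms of GPUTime, so at least 4 ms of fetch work
# must have been hidden behind ALU execution.
```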

Please post the code with which you obtained your results. Also mention your system details.



These are my personal ideas and may not be 100% correct, although I try to give the best information I can.


You are right, it wouldn't make much sense otherwise.

Also, I just realized that I was misled by my results. I believe the timings I acquired are worthless because I used the GPU_MAX_HEAP_SIZE=100 environment variable.

Yesterday I observed huge performance differences between application runs when using this variable. But obviously I failed to connect these findings to my previous weird timing results mentioned in the original post.

Now everything seems clear: Time is indeed the wall clock time. The driver (or something) just reports wrong results because there may be things going on that it does not know about. GPU_MAX_HEAP_SIZE is completely unsupported and we have been warned not to use it, so this is my own fault.

Thank you for your help himanshu.gautam, it made me realize that there must be a fault on my side.


I experimented with GPU_MAX_HEAP_SIZE, and I can say that buffers beyond the original 512 MB limit end up in main memory rather than on the device, e.g. you get only about 5 GB/s of memory bandwidth.
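To put that bandwidth figure into perspective, a quick back-of-the-envelope sketch (the 5 GB/s host-path and 100 GB/s on-device numbers are illustrative assumptions, not measurements from this thread):

```python
def transfer_ms(bytes_moved, gb_per_s):
    """Milliseconds to stream bytes_moved at gb_per_s (1 GB = 1e9 bytes)."""
    return bytes_moved / (gb_per_s * 1e9) * 1e3

buf = 512 * 1024 * 1024                # a 512 MB buffer
host_ms = transfer_ms(buf, 5.0)        # buffer spilled to main memory
device_ms = transfer_ms(buf, 100.0)    # assumed on-device bandwidth
# Streaming the same buffer takes roughly 20x longer over the host path,
# which would easily dominate any kernel-side optimization.
```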