states that the Time counter is
"For a kernel dispatch operation: time spent executing the kernel in milliseconds (does not include the kernel setup time). For a buffer or image object operation, time spent transferring data in milliseconds."
which makes me wonder whether "time spent executing the kernel" means wall-clock time or CPU (GPU) time.
It seems to be the CPU (GPU) time: the time during which kernel instructions are actively being processed, excluding any waits on memory operations, etc.
Let me explain why I think this is the case:
I profiled two versions of my application; let's call them FAST and SLOW. Both have the same domain size and the same number of work items; they differ in memory access patterns and buffer entry sizes (float2 vs. float3).
SLOW had a low Time counter and a high FetchUnitStalled percentage.
FAST had a high Time counter and a low FetchUnitStalled percentage.
Still, FAST executed visibly faster than SLOW. This was also verified by measuring wall-clock time in the client application (with glFinish called before and after).
If Time indeed lists the CPU (GPU) time, this would make sense: even though that time is larger for FAST, other wavefronts can be executed while some wait on memory operations, so the kernel as a whole finishes faster. In SLOW, on the other hand, the compiler/runtime/driver might not switch wavefronts and instead perform all those tiny waits, taking more time in total because the memory access stalls are not hidden.
Does Time really list CPU (GPU) time?
And what time does GPUTime refer to in the description of FetchUnitStalled:
"The percentage of GPUTime the Fetch unit is stalled."?