Strange Profiling Results for Asynchronous Commands

Discussion created by robert.dietrich on Mar 18, 2010
Latest reply on Jan 26, 2012 by robert.dietrich

I want to get correct profiling information for asynchronous commands in an in-order command queue. The OpenCL spec says the following for in-order command queues: "A prior command on the queue completes before the following command begins. "

Therefore I modified the NVIDIA SDK 3.0 oclDotProduct example, enabled the queue for profiling and gave events to the clEnqueueWriteBuffer and clEnqueueNDRangeKernel calls. After a blocking clEnqueueReadBuffer I printed out the CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END for each call.

The call sequence is: 2 non-blocking clEnqueueWriteBuffer -- 1 clEnqueueNDRangeKernel -- 1 blocking clEnqueueReadBuffer.

According to the OpenCL spec the commands should be executed one after another. The results are correct for both CUDA and ATI-Stream GPUs, but the profiling information for ATI-Stream are strange. The three non-blocking commands start with different OpenCL timestamps and end with the same. In my test case:

clStartTime: 159722421037346; clStopTime: 159722471600728
clStartTime: 159722422219277; clStopTime: 159722471600728
clStartTime: 159722466764412; clStopTime: 159722471600728

The duration of the calls is measured correctly, but they are overlapping. If flushes are introduced between the calls, the OpenCL timestamps are perfect one after another. Even clEnqueueBarrier does not help to serialize the non-blocking commands in this in-order queue.

Has anyone experienced the same and knows a solution or workaround?

I am using: SUSE SLED 11, ATI-Stream 2.01, AMD 5770, CUDA SDK 3.0 OpenCL