Archives Discussions

robert_dietrich · ‎03-18-2010

I want to get correct profiling information for asynchronous commands in an in-order command queue. The OpenCL spec says the following about in-order command queues: "A prior command on the queue completes before the following command begins."

Therefore I modified the NVIDIA SDK 3.0 oclDotProduct example, enabled profiling on the queue, and passed events to the clEnqueueWriteBuffer and clEnqueueNDRangeKernel calls. After a blocking clEnqueueReadBuffer I printed the CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END values for each event.

The call sequence is: 2 non-blocking clEnqueueWriteBuffer -- 1 clEnqueueNDRangeKernel -- 1 blocking clEnqueueReadBuffer.

According to the OpenCL spec, the commands should be executed one after another. The computed results are correct for both CUDA and ATI-Stream GPUs, but the profiling information for ATI-Stream is strange: the three non-blocking commands start at different OpenCL timestamps but all end at the same one. In my test case:

clStartTime: 159722421037346; clStopTime: 159722471600728
clStartTime: 159722422219277; clStopTime: 159722471600728
clStartTime: 159722466764412; clStopTime: 159722471600728

The duration of each call is measured correctly, but the intervals overlap. If flushes are inserted between the calls, the OpenCL timestamps follow one another as expected. Even clEnqueueBarrier does not help to serialize the non-blocking commands in this in-order queue.
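The overlap can be confirmed directly from the three timestamp pairs above. The following is a minimal sketch in plain C with no OpenCL calls. The names `prof_interval`, `duration_ns`, `count_overlaps`, `report` and `reported` are hypothetical helpers introduced only for illustration, and it assumes the values are nanosecond counters as returned by clGetEventProfilingInfo.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper type: one CL_PROFILING_COMMAND_START /
   CL_PROFILING_COMMAND_END pair, assumed to be in nanoseconds. */
typedef struct {
    uint64_t start;
    uint64_t end;
} prof_interval;

/* Duration of one command in nanoseconds. */
static uint64_t duration_ns(prof_interval iv)
{
    assert(iv.end >= iv.start);
    return iv.end - iv.start;
}

/* Number of adjacent pairs in which command i+1 starts before
   command i has ended. An in-order queue should give 0. */
static int count_overlaps(const prof_interval *iv, int n)
{
    int overlaps = 0;
    for (int i = 0; i + 1 < n; i++) {
        if (iv[i + 1].start < iv[i].end)
            overlaps++;
    }
    return overlaps;
}

/* Timestamps copied from the post: two writes, then the kernel. */
static const prof_interval reported[3] = {
    {159722421037346ULL, 159722471600728ULL},
    {159722422219277ULL, 159722471600728ULL},
    {159722466764412ULL, 159722471600728ULL},
};

/* Print each duration and the number of overlapping neighbours. */
static void report(const prof_interval *iv, int n)
{
    for (int i = 0; i < n; i++)
        printf("command %d: %" PRIu64 " ns\n", i, duration_ns(iv[i]));
    printf("overlapping pairs: %d\n", count_overlaps(iv, n));
}
```

With the values from the post, `count_overlaps(reported, 3)` returns 2, and the three durations are 50563382, 49381451 and 4836316 ns. Each later command is reported as starting while its predecessor is still running, which is the behaviour an in-order queue should not show.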

Has anyone experienced the same and knows a solution or workaround?

I am using: SUSE SLED 11, ATI-Stream 2.01, AMD 5770, CUDA SDK 3.0 OpenCL

omkaranathan · ‎03-19-2010

robert,

Could you post the modified code?

robert_dietrich · ‎03-23-2010

The CUDA SDK 3.0 oclDotProduct is just an example. I have attached my adaptation of oclDotProduct.cpp.

    // --------------------------------------------------------
    // Core sequence... copy input data to GPU, compute, copy results back

    /* Profiling test */
    cl_event evts[3];

    // Asynchronous write of data to GPU device
    shrLog("clEnqueueWriteBuffer (SrcA and SrcB)...\n");
    ciErrNum = clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcA, CL_FALSE, 0, sizeof(cl_float) * szGlobalWorkSize * 4, srcA, 0, NULL, &evts[0]);
    //clFlush(cqCommandQueue);
    ciErrNum |= clEnqueueWriteBuffer(cqCommandQueue, cmDevSrcB, CL_FALSE, 0, sizeof(cl_float) * szGlobalWorkSize * 4, srcB, 0, NULL, &evts[1]);
    shrCheckErrorEX (ciErrNum, CL_SUCCESS, pCleanup);
    //clFlush(cqCommandQueue);

    // Launch kernel
    shrLog("clEnqueueNDRangeKernel (DotProduct)...\n");
    ciErrNum = clEnqueueNDRangeKernel(cqCommandQueue, ckKernel, 1, NULL, &szGlobalWorkSize, &szLocalWorkSize, 0, NULL, &evts[2]);
    shrCheckErrorEX (ciErrNum, CL_SUCCESS, pCleanup);

    // Read back results and check accumulated errors
    shrLog("clEnqueueReadBuffer (Dst)...\n\n");
    ciErrNum = clEnqueueReadBuffer(cqCommandQueue, cmDevDst, CL_TRUE, 0, sizeof(cl_float) * szGlobalWorkSize, dst, 0, NULL, NULL);
    shrCheckErrorEX (ciErrNum, CL_SUCCESS, pCleanup);

    /* Profiling test: query the timestamps of each event (evts[i], not the array itself) */
    for(int i = 0; i < 3; i++){
        cl_ulong ocl_strttime, ocl_stoptime, ocl_submittime, ocl_queuedTime;
        clGetEventProfilingInfo(evts[i], CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &ocl_strttime, NULL);
        clGetEventProfilingInfo(evts[i], CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &ocl_stoptime, NULL);
        clGetEventProfilingInfo(evts[i], CL_PROFILING_COMMAND_SUBMIT, sizeof(cl_ulong), &ocl_submittime, NULL);
        clGetEventProfilingInfo(evts[i], CL_PROFILING_COMMAND_QUEUED, sizeof(cl_ulong), &ocl_queuedTime, NULL);
        // Cast so the arguments match %llu on every platform
        printf("CL_strtTime: %llu; CL_stopTime: %llu; CL_submitTime: %llu; CL_queuedTime: %llu\n",
               (unsigned long long)ocl_strttime, (unsigned long long)ocl_stoptime,
               (unsigned long long)ocl_submittime, (unsigned long long)ocl_queuedTime);
    }

omkaranathan · ‎04-06-2010

Developers are looking into the issue, thanks for reporting.

robert_dietrich · ‎05-11-2010

I just installed ATI Stream SDK 2.1 and linux drivers 10.4. The problem seems to be solved now!

robert_dietrich · ‎05-17-2010

Actually, the issue is not solved for ATI/AMD GPUs! I had just used the wrong device. For CPUs, the profiling information is correct.

omkaranathan · ‎05-17-2010

This is being looked into by the developers; you can expect it to be fixed in an upcoming release.

robert_dietrich · ‎08-23-2010

I just tried the new Stream SDK 2.2 with ATI Catalyst 10.7, but the profiling information is still incorrect for my ATI Radeon HD 5770. In contrast, my NVIDIA Tesla C2050 card and the CPU return correct profiling information with the current ATI Stream SDK.

genaganna · ‎08-23-2010

Originally posted by: robert.dietrich I just tried the new Stream SDK 2.2 with ATI Catalyst 10.7, but profiling information are still incorrect for my ATI Radeon HD 5770. In contrast my NVIDIA Tesla C2050 card and the CPU return correct profiling information with the current ATI Stream SDK.

This is not fixed yet. You can expect a fix in an upcoming release.

robert_dietrich · ‎01-26-2012

With Stream SDK 2.6, Catalyst 11.12 and AMD Radeon 6950 the profiling results for asynchronous commands are correct.

Strange Profiling Results for Asynchronous Commands