Archives Discussions

perhaad · ‎11-28-2010

Kernel calls without flush and finish between them lead to same end time

Hello All,

For the OpenCL profiliing calls to clGetEventProfilingInfo() the end times of multiple kernels enqueud after each other seems to be the same.

Basically for multiple kernels enqueued on a device

<--K0--><--K1--><--K2-->

K0-K2 all report the same end time (I record a list of events and query at the end)

I remember that this was discussed in the forums some time back but I couldn't find the thread, to check for an update I wanted to check if this still exists even with the newer drivers like 10.10 and with the OpenCL 2.2 SDK implementation.

I still see this issue on Linux 64 bit Ubuntu and 10.10 Catalyst with 2.2 SDK

d_a_a_ · ‎11-29-2010

It seems that this isn't wrong. If I understand correctly, without explicit call to clFinish (or explicit event dependency) the implementation is free to start and finish kernels when it thinks is more appropriate.

perhaad · ‎11-29-2010

Originally posted by: d.a.a. It seems that this isn't wrong. If I understand correctly, without explicit call to clFinish (or explicit event dependency) the implementation is free to start and finish kernels when it thinks is more appropriate.

I dont know if that is the case. The implementation should only be able to do that if the CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE flag is provided, otherwise the queue is supposed to execute in order.

Either way, multiple kernel executions shouldn't have the same end time because that would *appear* like we are executing multiple kernels at a time which isn't the case

nou · ‎11-29-2010

yes i read specification carefully and without out-of-order flag implies in order execution. and every command in queue is implicit synchronized with previous command that it do not execute before previous end.

perhaad · ‎11-30-2010

I understand about the commands on a queue occurring serially. But shouldnt the end times of each command be distinct from each other ?

himanshu_gautam · ‎11-30-2010

I checked the situation with internal SDK by enqueueing 25 clEnqueueWriteBuffer commands. They all execute sequentially and no two commands end at the same time.

It would be nice if you share your code.

ravikeshri · ‎11-30-2010

perhaad, We have experienced the same scenario. The reason is explained in the ATI_Stream_SDK_OpenCL_Programming_Guide section 4.3.1. It says that "To reduce the launch overhead, the AMD OpenCL runtime combines several command submissions into a batch. Commands submitted as batch report similar start times and the same end time." Your commands are run as batch so the end times are same.

I don't understand why the start time is different though. My trials also showed a few commands enqueued later started before other commnds but they all ended at the same time.

ravikeshri · ‎11-30-2010

@Himanshu, Did you use the SDK 2.2? We could observe this scenario in SDK 2.2.

perhaad · ‎12-01-2010

@Himanshu I see it with EnqueNDRange Kernels, I dont do any device IO / synchronization in between. I dont know if that matters ?

I have a pretty bulky app that depends on opencv. I will try and put together a simpler test case

@Ravikeshri, Even with the batching I would have thought that the counters might output separate values for OpenCL which would be for timing a kernel. I dont know if information about each Enqueue or its individual timing is lost once the *batching* is done. I am not super-familiar with things at the CAL-level

On a related thing, it is fine on Windows.

For linux I do a hacky work around: Enqueue a simple do nothing kernel at the end of each real kernel, so end time of real kernel ~ start of dummy kernel

himanshu_gautam · ‎12-01-2010

ravikeshri,

my apologies. the clenqueuewrirebuffer was calling a clfinish after every time.I modified my code and checked with some simple kernel. This time i get very similar end times for all the kernels as expected according to openCL programming guide.

I think you can write a test querying the profiling info of each ndrange call after enqueueing a large number of such calls and deduce the batch size after analyzing the profiling info of each kernel.

perhaad,

I think to query the correct profiling of individual kernels we can run them individually by calling a clFinish after every kernel.After performance optimizations you can club them together by removing the clFinish().

ravikeshri · ‎12-02-2010

Himanshu, I did a few tests and found that the batch size seems to be different for different GPUs. By enqueuing enough kernels I am able to minimize the kernel launch delay. Thanks.

Archives Discussions

OpenCL Profiler End times