9 Replies Latest reply on Jun 3, 2014 8:51 AM by fv123

    Large delay time before executing gpu kernels

    fv123

      Hello,

       

      I have a program that needs to execute several kernels in successive order about 10000 times.

      Now the execution time of the kernels seems fine to me, but when I started profiling the events, it showed a large delay before the kernels are even executed.

      Normally this wouldn't bother me too much, but since I need to do this many times, these "queue" times add up and make the runtime explode. (The "wasted" time actually shows up as system time.)

       

      I attached an example code that shows how I handle the Enqueue calls and the profile routine I use.

      Additionally, an output file is attached that shows the (start - queued) and (end - start) times in ms for my kernels.

      As you can see, the average "queued" time is longer than the execution time itself.

       

      Note: OpenMP parameters aren't activated right now (NUM_THREAD_ID=0), so only one queue etc. is used.

      2nd Note: The clWaitForEvents() calls are only there to ensure the kernels have finished before the profile is read. Removing the events and waits doesn't improve the wall time.
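      For reference, reading the timestamps from an event looks roughly like this (a sketch in the spirit of my get_event_profile() routine; the helper name and output format here are illustrative, and the queue must have been created with CL_QUEUE_PROFILING_ENABLE, with the event already complete, e.g. after clWaitForEvents()):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Illustrative helper: print the (start - queued) and (end - start)
   intervals of a completed event, converted from nanoseconds to ms. */
static void print_event_times(cl_event ev)
{
    cl_ulong queued, start, end; /* device timestamps in nanoseconds */
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED, sizeof(queued), &queued, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,  sizeof(start),  &start,  NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,    sizeof(end),    &end,    NULL);
    printf("queued time: %.3f ms, execution time: %.3f ms\n",
           (double)(start - queued) * 1e-6, (double)(end - start) * 1e-6);
}
```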

       

      Is there anything I can do to the Enqueue calls or something else to improve these "queue" times effectively?

        • Re: Large delay time before executing gpu kernels
          dipak

          Hi,

          Looking at your output, it seems that your kernel launch time is much higher than the actual kernel execution time. From your post alone it is difficult to tell the actual reason, but I would like to suggest a few points to improve performance.

          I don't know why you have to launch so many kernels, but launching a kernel on a device has some time overhead. Because of that, if possible, instead of launching many very small kernels (compared to their actual computation), launch fewer kernels with a larger work size and more computation each.

          Another thing I want to point out is the clWaitForEvents() inside the loop. When you wait for each kernel to complete within the loop, things become more serial. If your algorithm permits, I would recommend removing the wait from inside the loop and instead calling clFinish() at the end of the loop to ensure that all enqueued kernels have completed. That way many clEnqueueNDRangeKernel() commands are batched together and sent to the device, minimizing the kernel launch delay.
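          A minimal sketch of this batched pattern (queue, kernel, Global_Work_Size, and Local_Work_Size are assumed to come from your existing setup):

```c
/* Batched pattern: no per-kernel wait inside the loop,
   one blocking clFinish() after all kernels are enqueued. */
for (int i = 0; i < 100000; i++) {
    cl_int status = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                                           Global_Work_Size, Local_Work_Size,
                                           0, NULL, NULL);
    if (status != CL_SUCCESS) { printf("Execution error (Enqueue)"); exit(-1); }
}
clFinish(queue); /* wait once, after the whole batch */
```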

           

          Thanks,

          Dipak

            • Re: Large delay time before executing gpu kernels
              fv123

              Thanks for your advice,

               

              I already tried to merge some of the kernels, but their structures are way too different to do this efficiently; merging resulted in a much higher runtime for those kernels. Some of them run with different work sizes and differ in the number of kernel arguments. Does this have an effect on the launch time?

              My next step will be to move parts of the loop into the kernels to better utilize the GPU, but I already know that I can't move every part. So the problem stays the same: the overhead time adds up and increases the wall time.

              Maybe there is a more general way to improve the launch time.

               

              To your second point: I only put in these clWaitForEvents() calls because otherwise the event profiling gives an error message. The program runs fine without any profiling or events/WaitForEvents, but removing them doesn't affect the wall time.

               

              If you need any more specific information about my program, just say so.

            • Re: Large delay time before executing gpu kernels
              fv123

              OK, I ran some test cases and it seems my algorithm has nothing to do with these launch times.

               

              My test cases consist of either matrix addition (MA) or matrix multiplication (MM) with either 100 or 1000 elements.

              I took the average times of the launch overhead over 100000 clEnqueueNDRangeKernel() calls.

              The calculations have been done on three different machines each with a Radeon HD 7970.

               

              The execution times are roughly the same on all devices.

               

              Here is a table with the data (times in ms):

               

                               Machine 1    Machine 2    Machine 3    Execution time

              MA 100           0.738        0.645        0.580            0.02
              MA 1000          0.868        1.011        1.028            0.08
              MM 100           0.866        0.966        0.938            0.1
              MM 1000          0.996        1.141        1.149            4.9

               

              I have run these programs several times and they seem to be pretty consistent.

               

              Are these typical launch times for a clEnqueueNDRangeKernel(), and where do the differences between e.g. MA 100 and MA 1000 come from (mostly on machines 2 and 3)?

               

              I attached my code and kernels.

              Note: change ELEMENTS to adjust matrix size and NAME and KERNELNAME to switch between MA and MM.

              Note: My init routine currently uses platform 0 and device 0. You may need to adjust to your system.

                • Re: Large delay time before executing gpu kernels
                  dipak

                  Hi,

                   

                   Thanks for sharing the code. We'll run this program in our lab and compare it with your stats.

                   

                  Regards,

                  • Re: Large delay time before executing gpu kernels
                    dipak

                    Hi,

                     

                     I was able to run your sample code and have observed a similar pattern of output (not exactly the same, as my lab machine is different from yours). I'm experimenting with a few points to understand the behavior. Meanwhile, can you run the following tests and share your observations:

                     

                     Note: Modified/new lines are marked in bold.

                     

                    1.

                    ....

                    T1 = GET_TIME(); // cpu time function

                    // Enqueue 100000 times

                        for (int i=0;i<100000;i++){

                            status = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, Global_Work_Size, Local_Work_Size, 0, NULL, &event_k);

                     

                            status |= clWaitForEvents(1,&event_k);

                            if (status != CL_SUCCESS){printf("Execution error (Enqueue)");exit(-1);}

                     

                           // get_event_profile(event_k,"execute", avg, avg2); // BLOCK THIS LINE TO IGNORE TIME SPENT FOR CONSOLE PRINTING AND CALCULATIONS

                        }

                     

                    T2 = GET_TIME(); // cpu time function

                    T_WAIT = T2 - T1;

                    ...


                    What is the value of T_WAIT?


                    2.

                    T1 = GET_TIME(); // cpu time function

                    // Enqueue 100000 times

                        for (int i=0;i<100000;i++){

                            status = clEnqueueNDRangeKernel(queue, kernel, 2, NULL, Global_Work_Size, Local_Work_Size, 0, NULL, &event_k);

                     

                           // status |= clWaitForEvents(1,&event_k);  // BLOCK THIS LINE

                            if (status != CL_SUCCESS){printf("Execution error (Enqueue)");exit(-1);}

                     

                           // get_event_profile(event_k,"execute", avg, avg2); // BLOCK THIS LINE TO IGNORE PRINT AND CALCULATIONS

                        }

                    clFinish(queue); // WAIT FOR THE KERNELS TO COMPLETE

                     

                    T2 = GET_TIME(); // cpu time function

                    T_BATCH = T2 - T1;


                    What is the value of T_BATCH?


                    Regards,


                     

                      • Re: Re: Large delay time before executing gpu kernels
                        fv123

                        Thanks for your effort,

                         

                        I have done the tests.

                         

                        First, my results for T_WAIT (the times differed across runs and I haven't calculated a proper average; these are more or less estimated times):

                         

                        in seconds

                        MA 100          53          53          42

                        MA 1000        97          97          98

                        MM 100         95          97          95

                        MM 1000       600        617         617  

                         

                        T_BATCH is roughly 0.0625 seconds for all machines and program versions.

                         

                        I did compare the run time of the programs though.

                         

                        All versions except for MM 1000 would be completed in less than 3-5 seconds overall.

                        MM 1000 itself had a runtime of 497 seconds. With the average time of 4.9 ms to run one kernel it adds up correctly over 100000 calls.

                        A very simple approach would be to say that one clWaitForEvents() equals about 1 ms of queue time, which would explain the differences between T_WAIT and the program time.

                         

                        It just seems like I greatly underestimated the effect of the clWaitForEvents() in this case.

                         

                        What are the results of your tests/experiments?

                          • Re: Large delay time before executing gpu kernels
                            dipak

                            Hi,
                            Thanks for your reply.

                            Your numbers are a bit confusing. But, similar to you, my observation is that T_BATCH is several times faster than T_WAIT, especially for MA/MM 100, since those kernels themselves have a low execution time. Typically, in your case, you can say:


                            Total time taken to complete all kernels = Avg. kernel execution time * Number of times called + Total kernel launch overhead time.


                            That's why, for a fixed number of kernel launches, the perceived kernel launch overhead decreases as the average kernel execution time increases.


                            Individual kernel launch overhead is typically in the range of a few microseconds, but the actual number depends on the device and driver (which is why you cannot do much about it directly). However, there are some general ways to reduce the total kernel launch overhead when launching many small kernels:
                            1. Avoid any clWaitForEvents() between kernel launches so that several kernels can be batched together and sent to the device, minimizing the kernel launch delay.
                            2. If not required, don't enable profiling (i.e. don't set the CL_QUEUE_PROFILING_ENABLE flag) when creating a command queue, because profiling adds some extra overhead.
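                            For point 2, the difference is only in the properties argument when the queue is created (a sketch; context, device, and err come from your existing init routine):

```c
/* Profiling enabled -- needed for clGetEventProfilingInfo(): */
cl_command_queue q_prof = clCreateCommandQueue(context, device,
                                               CL_QUEUE_PROFILING_ENABLE, &err);

/* Profiling disabled -- slightly less overhead per command: */
cl_command_queue q_fast = clCreateCommandQueue(context, device, 0, &err);
```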

                             

                            But the most important point is: always try to avoid launching a large number of kernels. Instead, if possible, modify the algorithm, merge small kernels, and launch fewer kernels.

                             

                            Thanks and Regards,