
endoerner
Adept I

Huge difference between the kernels execution time and program total execution time

Hi to everyone,

I am facing a weird problem with an OpenCL code that I am developing.

In short, I have a certain number of kernels that are called inside a loop. The problem I am facing is that the total run time of the program is two or three times the execution time of the kernels as measured with event profiling. For example, on an AMD FX8350 CPU + AMD Radeon HD7970 GPU (OpenCL running on the GPU) I obtained:

total time : 3275117 ms

kernel time : 1415446 ms

I use events for profiling, and when I remove the clWaitForEvents() function I obtain:

total time : 1857250 ms

I also tried removing the clSetKernelArg() calls from inside the loop, but the time gain was minimal. The amount of data written to/read from the device is also minimal, so it should not be the source of this problem (I have tested it). Anyway, this seems quite weird to me, as I have never seen such overhead from clWaitForEvents(). Moreover, if I run the program on an Intel CPU I do not see such a difference.

Any clue about this behavior? Thanks for your help in advance!

ravkum
Staff

Hi,

Could you please paste the code snippet where you measure the total time and kernel execution times?

Also, you may find this thread useful: OpenCL Profiler End times

Regards,

Ravi


Hi ravkum,

First, I am sorry for this delayed answer, but I was on holiday (it is summer here in Chile).

The total time is measured using the Unix "time" console command. I originally used a C timing procedure to obtain the total time, but for some reason the time it gave was completely different from the real execution time. Anyway, here is the expression formerly used in the program:

#include <time.h>

clock_t begin, end;

begin = clock();

/* ... enqueue and run the kernels ... */

end = clock();

/* clock() returns ticks; divide by CLOCKS_PER_SEC to get seconds */
rk4_timers.total_time = (double)(end - begin) / CLOCKS_PER_SEC;

printf("total execution time = %f ms\n", rk4_timers.total_time * 1e3);

That code gave me a completely wrong value, so in the end I stuck with the Unix "time" command. If you know what the error in the above code is, please tell me...

In order to measure the kernel execution times I used the clGetEventProfilingInfo() function inside a "get_time" function, defined as:

double get_time(cl_event *event){
    cl_ulong start, end;

    /* device timestamps are reported in nanoseconds */
    clGetEventProfilingInfo(*event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
    clGetEventProfilingInfo(*event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);

    /* convert nanoseconds to seconds */
    return (end - start) * 1.0e-9;
}

So, I use the clWaitForEvents() function to ensure that the kernel has finished executing, and then I accumulate the execution time in a variable:

err = clEnqueueNDRangeKernel(queue, rk4_kernels->derivs, 1, NULL, &global_size, &local_size, 0, NULL, &derivs_event[0]);
clWaitForEvents(1, &derivs_event[0]);
rk4_timers.derivs_time += get_time(&derivs_event[0]);

I will check the suggested reading, thanks for your post!


Hi,

Summer vacation in Feb sounds really good. Hope you had a nice time.

The Unix time command measures the whole program execution time, including the time it takes for the system to load your binary and all its libraries, and the time it takes to clean everything up once your program has finished. It will also include the time the OpenCL compiler takes to compile the .cl files. So it won't give you the correct results in this scenario.

Please try using gettimeofday() instead of clock(). clock() measures the CPU time consumed by your process rather than wall-clock time, which is likely why it gave you such a different number. Use gettimeofday() around the code where you enqueue the kernels and wait for them to finish execution. There will still be a small difference there because the commands are batched and dispatched together. Please read more about it in the Programming Guide.
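A minimal sketch of what I mean, reusing the names from your snippet (derivs_wall_time is just a hypothetical new accumulator for the host-side measurement):

#include <sys/time.h>

/* wall-clock time in seconds */
double wall_time(void){
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return (double)tv.tv_sec + (double)tv.tv_usec * 1.0e-6;
}

/* ... */
double t0 = wall_time();
err = clEnqueueNDRangeKernel(queue, rk4_kernels->derivs, 1, NULL, &global_size, &local_size, 0, NULL, &derivs_event[0]);
clWaitForEvents(1, &derivs_event[0]);
rk4_timers.derivs_wall_time += wall_time() - t0;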

Regards,

Ravi


Hi, I followed your suggestion and performed the time measurements with the gettimeofday() function. I also used two platforms (Intel and AMD) and two devices (2 GPUs) and obtained the following results:

Platform 1: AMD FX8350 + AMD HD7970 (Ubuntu 12.04)

With clWaitForEvents() profiling:

     loop time (where the kernels are enqueued and executed) : 847.34 s

     kernel time (measured with event profiling) : 735.79 s

Without clWaitForEvents():

     loop time : 738.01 s

Platform 2: Intel Core i5 + Intel HD4000 (Windows 8.1)

With clWaitForEvents():

     loop time : 3677.00 s

     kernel time : 3474.02 s

Without clWaitForEvents():

     loop time : 3481.46 s

Well, I also executed the programs on the CPUs, but the times with and without clWaitForEvents() were very similar; the AMD platform is the one more affected by this issue. I must clarify that in this case I used different initial conditions than in the first post in order to better occupy the GPUs, so the problem does not seem so big now: the time difference is around 12% for the AMD platform and 1.4% for the Intel platform. Well, I just used a DEBUG flag to deactivate the clWaitForEvents() use, and I could live with that, but it would be nice to know the reason for this behavior. Thanks for your help!
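For reference, the switch is roughly the following (simplified, with DEBUG enabling the per-kernel wait; the real loop enqueues several kernels per iteration):

#ifdef DEBUG
/* profiling build: wait for each kernel and accumulate its execution time */
err = clEnqueueNDRangeKernel(queue, rk4_kernels->derivs, 1, NULL, &global_size, &local_size, 0, NULL, &derivs_event[0]);
clWaitForEvents(1, &derivs_event[0]);
rk4_timers.derivs_time += get_time(&derivs_event[0]);
#else
/* normal build: enqueue without an event so the runtime can batch the dispatches */
err = clEnqueueNDRangeKernel(queue, rk4_kernels->derivs, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
#endif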


Hi,

Good to know that helped.

Do you have the clWaitForEvents() call within the loop? That would explain why you get better results when you don't call it. Without the wait, many clEnqueueNDRangeKernel commands are batched together and sent to the device, minimizing the kernel launch delay. When you wait for each kernel to complete within the loop, execution becomes much more serial.

If your algorithm permits, I would recommend removing the wait on the event from inside the loop. You can call clFinish() after the loop to make sure all the kernels have finished execution.
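A rough sketch of what I mean, using the names from your snippet (n_steps is a placeholder for your loop count, and I am assuming a single in-order queue):

for (int step = 0; step < n_steps; step++) {
    /* no event and no wait: the runtime can batch these dispatches */
    err = clEnqueueNDRangeKernel(queue, rk4_kernels->derivs, 1, NULL, &global_size, &local_size, 0, NULL, NULL);
    /* ... enqueue the other kernels of the iteration ... */
}
/* block once, after the loop, until every enqueued kernel has completed */
clFinish(queue);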

Things will be much better with HSA, where the queues are in user mode.

Regards,

Ravi


Yes, they are called inside the loop...

Well, that explains everything... and yes, I can remove the clWaitForEvents() calls without affecting the algorithm...

Thanks for your help!
