6 Replies Latest reply on Mar 6, 2014 12:31 PM by endoerner

    Huge difference between the kernels execution time and program total execution time

    endoerner

      Hi to everyone,

       

      I am facing a weird problem with an OpenCL code that I am developing.

       

      In short words, I have a certain number of kernels that are called inside a loop. The problem that I am facing is that the total time of the program is like two o three times the execution time of the kernels, measured with events profiling. For example, for a AMD FX8350 CPU + AMD Radeon HD7970 GPU (OpenCL running on the GPU) I obtained:

       

      total time : 3275117 ms

      kernel time : 1415446 ms

       

      I use events for profiling, and when I remove the clWaitForEvents() function I obtain:

       

      total time : 1857250 ms

       

      I tried also removing from inside the loop the clSetKernelArg() functions, but the time gain was minimal. Also, there write/read  to/from the device is minimal, so it should not be the source of this problem (I have tested it) Anyways, that seems quite weird for me, as I have never seen such overhead from clWaitForEvents(). Moreover, if I run the program with an Intel CPU I obtain not such difference.

       

      Any clue about this behavior? Thanks for your help in advance!

        • Re: Huge difference between the kernels execution time and program total execution time
          ravkum

          Hi,

           

          Could you please paste the code snippet where you measure the total time and kernel execution times?

           

          Also, you may find this thread useful: OpenCL Profiler End times

           

          Regards,

          Ravi

            • Re: Huge difference between the kernels execution time and program total execution time
              endoerner

              Hi ravkum,

               

              first I am sorry for this delayed answer, but I was on holidays (here in Chile is summer).

               

              The total time is measured using the Unix "time" console command, I originally used a C scoring procedure to obtain the total time, but for some reason the obtained time was absolutely different to the real execution time, anyway I put the expression formerly used in the program:

               

              clock_t begin, end;

              begin = clock();

               

              /* ... */

               

              end = clock();

              rk4_timers.total_time = (double)(end - begin) / CLOCKS_PER_SEC;

              printf("total execution time = %f ms\n", rk4_timers.total_time * 1e3);

               

              That code game me an absolutely wrong value, so finally I stick to the Unix "time" command. If you know the error of the above code please tell me...

               

              in order to measure the kernel execution times I used the clGetEventProfilingInfo() function inside a "get_time" function defined as:

               

              double get_time(cl_event *event){

                  cl_ulong start, end;

                  clGetEventProfilingInfo(*event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);

                  clGetEventProfilingInfo(*event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);

                  return (end - start) * 1.0e-9;

              }

               

              so, I use the clWaitForEvents() function to assure that the kernel is executed and then I accumulate the execution time in a variable:

               

              err = clEnqueueNDRangeKernel(queue, rk4_kernels->derivs, 1, NULL, &global_size, &local_size, 0, NULL, &derivs_event[0]);

              clWaitForEvents(1,&derivs_event[0]);

              rk4_timers.derivs_time += get_time(&derivs_event[0]);

               

              I will check the suggested reading, thanks for your post!

                • Re: Re: Huge difference between the kernels execution time and program total execution time
                  ravkum

                  Hi,

                   

                  Summer vacation in Feb sounds really good. Hope you had a nice time.

                   

                  The Unix time command measures the whole program execution time, including the time it takes for the system to load your binary and all its libraries, and the time it takes to clean up everything once your program is finished. It also will include the time OpenCL compiler takes to compile the cl files. So it won't give you the correct results in this scenario.

                   

                  Please try using gettimeofday instead of clock(). Use gettimeofday around the code where you enqueue the kernel and wait for it to finish execution. There will be small difference there because the commands are batched and dispatched together. Please read more about it in the Programming guide.

                   

                  Regards,

                  Ravi

                    • Re: Re: Huge difference between the kernels execution time and program total execution time
                      endoerner

                      Hi, I followed your suggestion and realized the time measurements with gettimeoftheday() function. I also used two platforms (Intel and AMD) and two devices (2 GPU) and obtained the following results:

                       

                      Platform 1: AMD FX8350 + AMD HD7970 (Ubuntu 12.04)

                      With clWaitForEvents() profiler:

                           loop time (where kernels are enqueue and executed) : 847.34 s

                           kernel time (measured with clWaitForEvents()) : 735.79 s

                      Without clWaitForEvents():

                           loop time : 738.01 s

                       

                      Platform 2: Intel Core i5 + Intel HD4000 (Windows 8.1)

                      With clWaitForEvents():

                           loop time : 3677.00 s

                           kernel time : 3474.02 s

                      Without clWaitForEvents():

                           loop time : 3481.46 s

                       

                      Well, I also executed the programs with the CPUs, but the times with and without clWaitForEvents() were very similar. The AMD platform is more affected with this issue. I must clarify that I used in this case different initial conditions as in the fist post in order to better occupy the GPUs, therefore it seems that the problem is not so big now. The time difference for the AMD platform is like 12% and for the Intel platform is 1.4%. Well, I just used a DEBUG flag in order to deactivate the clWaitForEvents() use, and I could live with that, but it would be nice to know the reason of this behavior. Thanks for your help