2 Replies Latest reply on Jun 18, 2014 2:43 AM by pinform

    strange delays appear in CPU kernel execution when adding GPU


      I'm working on a multi-threaded C++ application that uses OpenCL to solve differential equations. The application is intended to utilize both the CPU and GPU OpenCL devices when available. This dual-device, dual-platform use case on my target machine is experiencing a strange slowdown relative to expected performance. The CPU is a 16-core AMD Opteron 6274 (AMD APP SDK 2.9) and the GPU is an NVIDIA K20X.


      I've been using the OpenCL event profiling functionality to troubleshoot the performance. The problematic behavior I am asking about today is a mysterious 40 ms delay that starts appearing on the CPU after the first couple of kernel ND range executions.  The GPU performance is as expected.

      When executing on the CPU device only, performance is as expected: the expected time per unit of computation is observed.


      When adding the GPU to the simulation, performance for the first few kernel ND range executions on the CPU behaves as expected, matching the same per-unit performance as the CPU-only case.


      After the sixth CPU kernel ND range execution, every subsequent CPU kernel ND range event shows an extra 40 ms relative to the baseline events.  The 40 ms appears either between the event's 'CL_PROFILING_COMMAND_START' and 'CL_PROFILING_COMMAND_END' timestamps or between 'CL_PROFILING_COMMAND_SUBMIT' and 'CL_PROFILING_COMMAND_START'.


      There are no task dependencies that would explain a delay. Later cycles simply repeat the earliest one, so I don't understand why the later cycles would show the anomalous 40 ms delay.


      This is best illustrated graphically, so I plotted the first few cycles of my application as an SVG.

      >>>> CPU+GPU Event Profile Visualization <<<<

      Each event has several properties:


      Horizontal position is time in milliseconds along the x-axis.

      Line color distinguishes the different devices and command queues: the purple shades are the two CPU queues; the yellow shades are the four GPU queues.

      Each event has four markers:

      CL_PROFILING_COMMAND_QUEUED - vertical bar

      CL_PROFILING_COMMAND_SUBMIT - light blue diamond

      CL_PROFILING_COMMAND_START  - green arrow

      CL_PROFILING_COMMAND_END    - red square


      If you are viewing the SVG in a compatible viewer (e.g. Chrome or Firefox), hovering the mouse over a symbol reveals a link that is simply a short text description of the event.

      If you see no horizontal or vertical lines, your SVG viewer has a bug (e.g. the Safari browser).


      For the CPU+GPU case, in each cycle the CPU computes three ND ranges of the same kernel with the same kernel arguments.  The first two ND ranges are significantly smaller -- about 1% of the final range.


      Referring to the SVG image, the first three CPU kernel events, between 0 and 250 ms, look just as I expect.  The same goes for the next three, between 910 ms and 1150 ms.  The three CPU kernel events queued at 1150 ms, and all following sets of three CPU kernel events, demonstrate the ~40 ms delays.


      What's going on?

      I'm not sure what else to do for troubleshooting.

      Any advice would be appreciated!

        • Re: strange delays appear in CPU kernel execution when adding GPU


          It is very difficult to reach any conclusion merely by looking at the graph. Keep in mind, however, that on the CPU side there are many processes besides yours vying for CPU time, and the OS-level scheduler decides which process gets each CPU time slot. One conjecture, then, is that the delay is due to OS-level scheduling.


          It can be observed from the graph that the pattern for the GPU remains the same. Another conjecture is that OpenCL requires some CPU time to schedule work on the GPU; in the first two cycles this cost is hidden because the OpenCL runtime has already scheduled the GPU jobs before it starts executing the CPU work.

          The following could be done to probe this further:

          1. Observe the pattern over a long run. Is the 40 ms constant over time? What happens if the OS starts other processes while this process is still running?

          2. Toggle kernel execution on the GPU off and on (since the CPU and GPU workloads are independent, this should not be a problem). What kind of pattern is observed?