I'm working on a multi-threaded C++ application that uses OpenCL to solve differential equations. The application is intended to use both the CPU and GPU OpenCL devices when available. This dual-device, dual-platform use case shows a strange slowdown on my target machine. The CPU is a 16-core AMD Opteron 6274 (AMD APP SDK 2.9) and the GPU is an NVIDIA K20X.
I've been using the OpenCL event profiling functionality to troubleshoot the performance. The problematic behavior I am asking about today is a mysterious 40 ms delay that starts appearing on the CPU after the first few kernel ND range executions. The GPU performance is as expected.
When executing on the CPU device only, performance is as expected: the expected time per unit of computation is observed.
When adding the GPU to the simulation, performance for the first few kernel ND range executions on the CPU behaves as expected, matching the same per-unit performance as for CPU only.
After the sixth CPU kernel ND range execution, all subsequent CPU kernel ND range events take an extra 40 ms relative to the baseline events. The extra 40 ms appears either between the kernel event's 'CL_PROFILING_COMMAND_START' and 'CL_PROFILING_COMMAND_END', or between 'CL_PROFILING_COMMAND_SUBMIT' and 'CL_PROFILING_COMMAND_START'.
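To make the intervals concrete, here is a minimal sketch of how the per-event phase durations can be derived from the four profiling timestamps. It is not taken from my application: `EventTimes`, `EventPhases`, `elapsed_ms` and `phases` are hypothetical names, and the sketch assumes the timestamps have already been read with `clGetEventProfilingInfo`, which reports them in nanoseconds (the end-of-execution timestamp is `CL_PROFILING_COMMAND_END` in the OpenCL headers).

```cpp
#include <cstdint>

// Hypothetical container for the four timestamps of one event, in
// nanoseconds, as reported by clGetEventProfilingInfo for
// CL_PROFILING_COMMAND_QUEUED, _SUBMIT, _START and _END.
struct EventTimes {
    std::uint64_t queued_ns;
    std::uint64_t submit_ns;
    std::uint64_t start_ns;
    std::uint64_t end_ns;
};

// Durations of the three phases of one event, in milliseconds.
struct EventPhases {
    double queue_to_submit_ms;
    double submit_to_start_ms;
    double start_to_end_ms;
};

// Difference of two nanosecond timestamps in milliseconds.
// Assumes to_ns >= from_ns (unsigned subtraction would wrap otherwise).
inline double elapsed_ms(std::uint64_t from_ns, std::uint64_t to_ns) {
    return static_cast<double>(to_ns - from_ns) / 1.0e6;
}

// Split one event into its three phases.
inline EventPhases phases(const EventTimes& t) {
    return { elapsed_ms(t.queued_ns, t.submit_ns),
             elapsed_ms(t.submit_ns, t.start_ns),
             elapsed_ms(t.start_ns, t.end_ns) };
}
```

With this split, the anomaly shows up as roughly 40 ms of extra time in either `submit_to_start_ms` or `start_to_end_ms` for the affected CPU events.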
There are no task dependencies that would explain a delay. Later cycles just repeat the earliest one anyway, so I don't understand why later behavior would have the anomalous 40ms delay.
This is best illustrated graphically, so I plotted the first few cycles of my application as an SVG.
>>>> CPU +GPU Event Profile Visualization <<<<
Each event has several properties:
Horizontal position is time in milliseconds along the x-axis.
Line color distinguishes different devices and command queues: The purple shades are the two CPU queues, the yellow varieties are the four GPU queues.
Each event has four markers:
CL_PROFILING_COMMAND_QUEUED - vertical bar
CL_PROFILING_COMMAND_SUBMIT - light blue diamond
CL_PROFILING_COMMAND_START - green arrow
CL_PROFILING_COMMAND_END - red square
If you are viewing the SVG in a compatible viewer (Chrome and Firefox), hovering the mouse over the symbols should reveal a link that is actually just a short text description of the event.
If you see no horizontal or vertical lines, then your SVG viewer has a bug (e.g. the Safari browser).
For the CPU+GPU case, in each cycle the CPU computes three ND ranges of the same kernel with the same kernel arguments. The first two ND ranges are significantly smaller, about 1% of the final range.
Referring to the SVG image, the first three CPU process kernel events between 0 and 250 ms look just as I expect. Same goes for the next three between 910 ms and 1150 ms. The next three CPU kernel events queued at 1150 ms and all following sets of three CPU kernel events demonstrate the ~40 ms delays.
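The comparison against the baseline cycle can be expressed as a small check. This is only a sketch under my own assumptions: `first_delayed_cycle` is a hypothetical helper, `latency_ms[c][s]` is assumed to hold the SUBMIT-to-end latency in milliseconds of the s-th CPU kernel event in cycle c, and cycle 0 is taken as the baseline.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Returns the index of the first cycle in which any event slot is slower
// than the same slot in cycle 0 by at least threshold_ms, or -1 if no
// cycle is. Slots are compared position by position because the ND ranges
// within a cycle differ in size.
int first_delayed_cycle(const std::vector<std::vector<double>>& latency_ms,
                        double threshold_ms) {
    if (latency_ms.empty()) return -1;
    const std::vector<double>& baseline = latency_ms.front();
    for (std::size_t c = 1; c < latency_ms.size(); ++c) {
        // Guard against cycles with a different number of events.
        const std::size_t n = std::min(baseline.size(), latency_ms[c].size());
        for (std::size_t s = 0; s < n; ++s) {
            if (latency_ms[c][s] - baseline[s] >= threshold_ms)
                return static_cast<int>(c);
        }
    }
    return -1;
}
```

Applied to the profile described above with a threshold somewhat below 40 ms, I would expect this to point at the third cycle (index 2), the one queued at about 1150 ms.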
What's going on?
I'm not sure what else to do for troubleshooting.
Any advice would be appreciated!