OpenCL

smato2018 · 02-28-2019

I have been running some benchmarks recently and wonder whether my host-device
latencies are limited by my older hardware or would be similar on newer systems.

OS: Ubuntu 18.04 x86-64
Device: AMD Radeon HD 7750

OpenCL GPU kernel calls (each terminated with clFinish), 1 million threads, no memory buffer transfer, empty kernel:

~8K calls per second

OpenCL GPU kernel calls (each terminated with clFinish), 1 million threads, with an 8 KB memory write and a 4 KB memory read transfer, empty kernel:

~3K calls per second
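
To make the two rates easier to compare, they can be converted into an average time per call. This is only a rough sketch: it assumes the rounded rates of 8,000 and 3,000 calls per second, and that the clFinish after each call fully serializes the calls. The function name is made up for illustration.

```python
def per_call_latency_us(calls_per_second):
    """Average wall-clock time per call, in microseconds."""
    return 1_000_000 / calls_per_second

# Rounded rates from the benchmark above (assumed to be 8,000 and 3,000).
empty_us = per_call_latency_us(8_000)      # launch + clFinish only
transfer_us = per_call_latency_us(3_000)   # plus 8 KB write and 4 KB read
extra_us = transfer_us - empty_us          # time attributable to the transfers

print(f"empty kernel:   {empty_us:.0f} us per call")
print(f"with transfers: {transfer_us:.0f} us per call")
print(f"transfer cost:  {extra_us:.0f} us per call")
```

With these inputs the empty kernel costs about 125 us per call, and the two small transfers add roughly another 208 us.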

Note that my machine is a bit outdated:

- PCIe via Northbridge
- PCIe 2.0
- only 8 lanes per slot

Maybe on newer systems the latencies do not hurt at all?

Thanks in advance,
Srdja

smato2018 · 03-01-2019

I got this answer on the NVIDIA developer forum;
maybe it is of interest to others:

I have no idea what you are measuring, and I have had zero exposure to OpenCL. Under CUDA, the minimal observed kernel launch time is 5 microseconds for null kernels, meaning that there can be at most 200,000 kernel invocations per second. That minimal launch overhead has basically not changed much in about a decade, and the limiter appears to be the basic latency of the PCIe link. It is generally a good idea to design for minimal kernel execution time > 1 millisecond.

PCIe version and width impact primarily PCIe throughput, with little impact on PCIe latency. For minimum software overhead in the host-side driver stack, a CPU with high single-thread performance is recommended. At this time I would recommend a CPU with > 3.5 GHz base frequency as optimal.

https://devtalk.nvidia.com/default/topic/1047965/cuda-programming-and-performance/host-device-latenc...
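
The arithmetic behind the quoted numbers can be sketched as follows. It assumes a simplified model in which launches are strictly serialized and each one costs a fixed overhead plus the kernel's execution time. The function names are hypothetical, and the 5 microsecond figure is the one quoted above for CUDA, not a measurement of my own.

```python
def max_launches_per_second(launch_overhead_us):
    """Upper bound on launches per second for null kernels."""
    return 1_000_000 / launch_overhead_us

def overhead_fraction(launch_overhead_us, kernel_time_us):
    """Share of total time spent on launch overhead (serialized model)."""
    return launch_overhead_us / (launch_overhead_us + kernel_time_us)

print(max_launches_per_second(5))            # 200000.0
print(f"{overhead_fraction(5, 1000):.2%}")   # kernel runs for 1 ms
print(f"{overhead_fraction(5, 5):.2%}")      # kernel as short as the launch
```

Under this model a 1 ms kernel loses about half a percent to launch overhead, which is presumably the reasoning behind the "> 1 millisecond" advice, while a kernel as short as the launch itself loses half of its time.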

--

Srdja

dipak · 03-01-2019

Just to add some more information in this regard, here is what the AMD OpenCL optimization guide says about kernel launch overhead:

Section "Measuring Execution Time" in OPENCL Optimization — ReadTheDocs-Breathe 1.0.0 documentation
---------------------------------------------------------------------------------------------------------------------------
Another interesting metric to track is the kernel launch time (Start – Queue). The kernel launch time includes both the time spent in the user application (after enqueuing the command, but before it is submitted to the device), as well as the time spent in the runtime to launch the kernel. For CPU devices, the kernel launch time is fast (tens of μs), but for discrete GPU devices it can be several hundred μs. Enabling profiling on a command queue adds approximately 10 μs to 40 μs overhead to all clEnqueue calls. Much of the profiling overhead affects the start time; thus, it is visible in the launch time. Be careful when interpreting this metric. To reduce the launch overhead, the AMD OpenCL runtime combines several command submissions into a batch. Commands submitted as batch report similar start times and the same end time.
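
As a small illustration of the (Start – Queue) metric described above: OpenCL reports event profiling timestamps in nanoseconds through clGetEventProfilingInfo (CL_PROFILING_COMMAND_QUEUED, CL_PROFILING_COMMAND_SUBMIT, CL_PROFILING_COMMAND_START, CL_PROFILING_COMMAND_END). The sketch below only does the arithmetic on such timestamps. The helper names and the sample values are hypothetical and were not measured on any device.

```python
def launch_time_us(queued_ns, start_ns):
    """Kernel launch time (Start - Queue), in microseconds."""
    return (start_ns - queued_ns) / 1000.0

def execution_time_us(start_ns, end_ns):
    """Kernel execution time (End - Start), in microseconds."""
    return (end_ns - start_ns) / 1000.0

# Hypothetical timestamps in nanoseconds, as the four profiling queries
# would return them for a single kernel event.
sample = {
    "queued": 1_000_000,
    "submit": 1_150_000,
    "start": 1_300_000,
    "end": 1_320_000,
}

print(launch_time_us(sample["queued"], sample["start"]))    # 300.0
print(execution_time_us(sample["start"], sample["end"]))    # 20.0
```

Keep in mind the caveats from the guide: enabling profiling itself adds overhead to the launch time, and batched commands report similar start times.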

Thanks.

host-device latencies?