I'm queuing kernels that modify a buffer over and over again and am wondering if there's a more efficient way to do what I'm doing.
Here's pseudocode:
for (int q = 0; q < iterations; q++)
{
clEnqueueNDRangeKernel(cq, kernelA, args...);
clEnqueueNDRangeKernel(cq, kernelB, args...);
}
In my case, kernelB must wait for kernelA, however, no arguments for them change and no communication with the host needs to happen until the for loop completes. The problem is that for my data set the for loop needs to iterate thousands of times...and it seems that clEnqueuNDRangeKernel has a non-trivial cost when called enough times, so a lot of time is spent queuing the kernels themselves, when it seems like it would be easier to just somehow tell OpenCL to re-run them N times? Is that possible?
My global_work_size is in the millions and B must wait for A, so I don't think it's possible to do something clever like put iteration loop inside the kernel or something.
If OpenCL 2.0 is supported by the device, you may try device-side enqueue (enqueue_kernel) feature to launch another kernel (code represented by Block syntax) to the same device, without any host interaction. In general, device-side enqueue is a low cost operation compared to host-side enqueue. It can reduce the number of clEnqueueNDRangeKernel calls, thus may improve the overall performance.
Another point to note. When there are multiple commands to launch, try to avoid calling clFlush or clFinish after each clEnqueue<> call. Rather use event objects to set the dependency and enqueue them as batch of commands. For example:
Replace this:
clEnqueueNDRangeKernel(q, kernelA,...);
clFinish(q) / clFlush(q);
clEnqueueNDRangeKernel(q, kernelB);
clFinish(q) / clFlush(q);
clEnqueueNDRangeKernel(q, kernelC);
clFinish(q) / clFlush(q);
...
By:
clEnqueueNDRangeKernel(q, kernelA, ..); // set events as required
clEnqueueNDRangeKernel(q, kernelB, ...); // set events as required
clEnqueueNDRangeKernel(q, kernelC, ..); // set events as required
...
clFinish(q) / clFlush(q); // enqueue as batch of commands
Hi dipak, thank you for the quick response. I have a few more questions regarding your advice:
Is device-side enqueue functionally equivalent to clEnqueueNDRangeKernel? I was under the impression that clEnqueueNDRangeKernel could submit work to multiple work groups, but enqueue_kernel will only submit work to the work group of the thread executing the kernel from which it was called. Is that assumption incorrect? Also, for large groups of work items (let's say a global_work_size of 1 million), clEnqueueNDRangeKernel will choose the optimal local work size if local_work_size is set to NULL. But I don't see any information in the enqueue_kernel API about the device making a similar estimation on its own. Can I simply pass values about those sizes from the parent kernel to the ndrange_t argument? In other words, can I construct ndrange_t as (get_global_size(0), get_local_size(0)) for a 1-dimensional task?
Basically, instead of submitting host-side kernel calls to the queue, can I launch a single kernel call with clEnqueueTask, and then do my enqueue loop in that kernel with enqueue_kernel, with the expectation that if my parameters are correct, I will get identical overall performance (with the added benefit of increase queuing performance as well)? Or is it not that simple?
Furthermore, how do I synchronize the completion of those device-side queues, since it seems they must be queued to an out of order queue?
Also you mentioned using events instead of calls to clFlush/clFinish. Does that have a performance benefit or is it moreso a stylistic choice? It's worth noting that right now I don't call either clFlush/clFinish until my main iteration loop has completed, since I use an in-order host-side queue which entails no kernel can execute until it's first in the queue....so I'm already not calling those after each call to clEnqueueNDRangeKernel....only after every single queue entry has been added.
Sorry for all the questions...any help is appreciated!
Is device-side enqueue functionally equivalent to clEnqueueNDRangeKernel?...
The device-side enqueue does provide similar functionalities as clEnqueueNDRangeKernel, except it enqueues new workload to the device where the parent kernel is working on. The enqueue_kernel built-in can be called from any work-item within the parent nd-range. Even, multiple device queues can be used to launch the workloads to the device.
clEnqueueNDRangeKernel will choose the optimal local work size if local_work_size is set to NULL. But ...
There are various built-in functions to build the ndrange_t object with required global and local work size. For example:
ndrange_t ndrange_1D (size_t global_work_size); ----> builds 1-D nd-range without any explicit local work-group size
ndrange_t ndrange_1D ( size_t global_work_size, size_t local_work_size); ---> builds 1-D nd-range with explicit local work-group size
Basically, instead of submitting host-side kernel calls to the queue, can I launch a single kernel call with clEnqueueTask...
Theoretically, yes. However, it is always better to do some experiments and bench-marking before finalizing any design.
By the way, clEnqueueTask was deprecated in OpenCL 2.0. Instead, clEnqueueNDRangeKernel can be used with global work size = 1.
how do I synchronize the completion of those device-side queues, since it seems they must be queued to an out of order queue?
Similar to host-side code, events and markers can be used inside the parent kernel to synchronize the child tasks/kernels. Also, kernel_enqueue_flags_t argument can be used to specify when the child kernel begins execution w.r.t its parent.
To know about more about the device-side enqueue, please refer the section "6.13.17 Enqueuing Kernels " in opencl-2.0-openclc.pdf
Also you mentioned using events instead of calls to clFlush/clFinish. Does that have a performance benefit or is it moreso a stylistic choice?
Yes, there is a performance benefit when a batch of commands is submitted at once compared to submitting them individually.
Thanks.