I am utilizing OpenCL’s enqueue_kernel() function to enqueue kernels dynamically from the GPU to reduce unnecessary host interactions. Here is a simplified example of what I am trying to do in the kernels:
kernel void kernelA(args)
{
//This kernel is the one that is enqueued from the host, with only one work item. This kernel
//could be considered the "master" kernel that controls the logic of when to enqueue tasks
//First, it checks if a condition is met, then it enqueues kernelBif (some condition)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelB(args);});
}
else
{
//do other things
}
}kernel void kernelB(args)
{
//Do some stuff//Only enqueue the next kernel with the first work item. I do this because the things
//occurring in kernelC rely on the things that kernelB does, so it must take place after kernelB is completed,
//hence, the CLK_ENQUEUE_FLAGS_WAIT_KERNEL
if (get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
}
}kernel void kernelC(args)
{
//Do some stuff. This one in particular is one step in a sorting algorithm//This kernel will enqueue kernelD if a condition is met, otherwise it will
//return to kernelA
if (get_global_id(0) == 0 && other requirements)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelD(args);});
}
else if (get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
}kernel void kernelD(args)
{//Do some stuff
//Finally, if some condition is met, enqueue kernelC again. What this will do is it will
//bounce back and forth between kernelC and kernelD until the condition is
//no longer met. If it isn't met, go back to kernelA
if (some condition)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
}
else
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
}
So that is the general flow of the program, and it works perfectly and does exactly as I intended it to do, in the exact order I intended it to do it in, except for one issue. In certain cases when the workload is very high, a random one of the enqueue_kernel()s will fail to enqueue and halt the program. This happens because the device queue is full, and it cannot fit another task into it. But I cannot for the life of me figure out why this is, even after extensive research.
I thought that once a task in the queue (a kernel for instance) is finished, it would free up that spot in the queue. I am only enqueuing a single kernel at a time, so my queue should really only reach a max of like 1 or 2 tasks at a time. But this program will literally fill up the entire 262,144 byte size of the device command queue, and stop functioning. Is there some way to clear the queue or something?
I would greatly appreciate some potential insight as to why this is happening if anyone has any ideas. I am sort of stuck and cannot continue until I get past this issue.
Thank you in advance!
(BTW I am running on a Radeon RX 590 card, and am using the AMD APP SDK 3.0 to use with OpenCL 2.0)
Thank you for reporting it. We will look into this and get back to you.
Also, I have whitelisted you and moved this post to OpenCL forum.
Thanks.
From the above example, it's difficult to say why this is happening, however I would like to point out a few things here if it helps.
I thought that once a task in the queue (a kernel for instance) is finished, it would free up that spot in the queue.
A parent kernel will be considered to be complete when it and all its child kernels have finished execution. The OpenCL spec says that:
Determining when a parent kernel has finished execution
A parent kernel’s execution status is considered to be complete when it and all its child kernels
have finished execution. The execution status of a parent kernel will be CL_COMPLETE if this kernel
and all its child kernels finish execution successfully. The execution status of the kernel will be an
error code (given by a negative integer value) if it or any of its child kernels encounter an error, or
are abnormally terminated.
For example, assume that the host enqueues a kernel k for execution on a device. Kernel k when
executing on the device enqueues kernels A and B to a device queue(s). The enqueue_kernel call to
enqueue kernel B specifies the event associated with kernel A in the event_wait_list argument, i.e.
wait for kernel A to finish execution before kernel B can begin execution. Let’s assume kernel A
enqueues kernels X, Y and Z. Kernel A is considered to have finished execution, i.e. its execution
status is CL_COMPLETE, only after A and the kernels A enqueued (and any kernels these enqueued
kernels enqueue and so on) have finished execution.
Also, it looks like some kernels (e.g. kernelA and kernelC) are recursively called. The recursive kernel calling might generate a long chain of parent-child kernels. Please check if this could be one of the reasons behind the above issue.
Thanks.
This is the answer!
I must have overlooked that section of the specification, whoops. I did not realize that even the children of the children of the children (...etc.) had to be completed for the parent kernel to be considered complete. So of course the queue would not clear an incomplete kernel. I ended up having to restructure the algorithm a bit to one in which the host interacts every so often to start a new enqueue_kernel() chain, and it works very well.
Thanks so much!
Another point is, the default size of the device queue is usually set to a preferred value (CL_DEVICE_QUEUE_ON_ DEVICE_PREFERRED_SIZE or "Queue on device preferred size" in clinfo) which is lower than its max. limit (CL_DEVICE_QUEUE_ON_ DEVICE_MAX_SIZE or "Queue on device max size" in clinfo). Please try to increase the queue size to see if it resolves the above issue.
When creating the device queue using clCreateCommandQueueWithProperties(), you can specify the size of the device queue by param "CL_QUEUE_SIZE.
For example,
int maxQueueSize = NEW_Q_SIZE ; // where default size < NEW_Q_SIZE <= CL_DEVICE_QUEUE_ON_ DEVICE_MAX_SIZE )
cl_queue_properties prop[] = {
CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE|CL_QUEUE_ON_DEVICE_DEFAULT,
CL_QUEUE_SIZE, maxQueueSize, 0 };
clCreateCommandQueueWithProperties(..,prop,..)
Thanks.