I am utilizing OpenCL’s enqueue_kernel() function to enqueue kernels dynamically from the GPU to reduce unnecessary host interactions. Here is a simplified example of what I am trying to do in the kernels:
kernel void kernelA(args)
{
//This kernel is the one that is enqueued from the host, with only one work item. This kernel
//could be considered the "master" kernel that controls the logic of when to enqueue tasks
//First, it checks if a condition is met, then it enqueues kernelB
if (some condition)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelB(args);});
}
else
{
//do other things
}
}
kernel void kernelB(args)
{
//Do some stuff
//Only enqueue the next kernel with the first work item. I do this because the things
//occurring in kernelC rely on the things that kernelB does, so it must take place after kernelB is completed,
//hence, the CLK_ENQUEUE_FLAGS_WAIT_KERNEL
if (get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
}
}
kernel void kernelC(args)
{
//Do some stuff. This one in particular is one step in a sorting algorithm
//This kernel will enqueue kernelD if a condition is met, otherwise it will
//return to kernelA
if (get_global_id(0) == 0 && other requirements)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelD(args);});
}
else if (get_global_id(0) == 0)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
}
kernel void kernelD(args)
{
//Do some stuff
//Finally, if some condition is met, enqueue kernelC again. What this will do is it will
//bounce back and forth between kernelC and kernelD until the condition is
//no longer met. If it isn't met, go back to kernelA
if (some condition)
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(some amount, 256), ^{kernelC(args);});
}
else
{
enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_WAIT_KERNEL, ndrange_1D(1, 1), ^{kernelA(args);});
}
}
So that is the general flow of the program, and it works perfectly and does exactly as I intended it to do, in the exact order I intended it to do it in, except for one issue. In certain cases when the workload is very high, a random one of the enqueue_kernel()s will fail to enqueue and halt the program. This happens because the device queue is full, and it cannot fit another task into it. But I cannot for the life of me figure out why this is, even after extensive research.
I thought that once a task in the queue (a kernel for instance) is finished, it would free up that spot in the queue. I am only enqueuing a single kernel at a time, so my queue should really only reach a max of like 1 or 2 tasks at a time. But this program will literally fill up the entire 262,144 byte size of the device command queue, and stop functioning. Is there some way to clear the queue or something?
I would greatly appreciate some potential insight as to why this is happening if anyone has any ideas. I am sort of stuck and cannot continue until I get past this issue.
Thank you in advance!
(BTW I am running on a Radeon RX 590 card, and am using the AMD APP SDK 3.0 to use with OpenCL 2.0)