The device queue does not block when it is full. Some child kernels don't run. Neither get_default_queue nor enqueue_kernel returns an error.

Discussion created by tugrul_512bit on Jul 2, 2017
Latest reply on Jul 15, 2017 by tugrul_512bit

Here is a simple algorithm that makes a sphere's surface wavy:




__kernel void test(__global float *xyz, __global float *xyzn,
                   __global float *xyzo, __global float *arguments)
{
    int threadId = get_global_id(0);
    float dx = xyz[threadId * 3]     - arguments[2];
    float dy = xyz[threadId * 3 + 1] - arguments[3];
    float t   = arguments[1];
    float ctr = arguments[0];
    float wave = 0.02f * ctr * sin(40.0f * t + 100.0f * sqrt(dx * dx + dy * dy));
    // wave equation applied to all surface vertices
    xyzo[threadId * 3]     = xyz[threadId * 3]     + xyzn[threadId * 3]     * wave;
    xyzo[threadId * 3 + 1] = xyz[threadId * 3 + 1] + xyzn[threadId * 3 + 1] * wave;
    xyzo[threadId * 3 + 2] = xyz[threadId * 3 + 2] + xyzn[threadId * 3 + 2] * wave;
}

__kernel void waveEquation(__global float *xyz, __global float *xyzn,
                           __global float *xyzo, __global float *arguments)
{
    int threadId = get_global_id(0);
    queue_t q = get_default_queue();
    // offset = threadId, 1 global work-item, local size 1
    ndrange_t ndrange = ndrange_1D(threadId, 1, 1);
    void (^my_block_A)(void) = ^{ test(xyz, xyzn, xyzo, arguments); };
    enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange, my_block_A);
}
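For completeness, here is roughly how the enqueue's return code can be captured on the device side; the `errs` buffer and the `waveEquationChecked` name are my own additions for illustration. (With the `-g` build option the OpenCL 2.0 kernel language may return more specific codes than plain `CLK_ENQUEUE_FAILURE`.)

```
// Sketch: record enqueue_kernel's return code so the host can read it back.
// The "errs" buffer is hypothetical, added only for illustration.
__kernel void waveEquationChecked(__global float *xyz, __global float *xyzn,
                                  __global float *xyzo, __global float *arguments,
                                  __global int *errs)
{
    int threadId = get_global_id(0);
    queue_t q = get_default_queue();
    ndrange_t ndrange = ndrange_1D(threadId, 1, 1);
    int err = enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange,
                             ^{ test(xyz, xyzn, xyzo, arguments); });
    // With -g, possible codes include CLK_SUCCESS, CLK_DEVICE_QUEUE_FULL,
    // CLK_INVALID_QUEUE, CLK_INVALID_NDRANGE, CLK_ENQUEUE_FAILURE.
    errs[threadId] = err;
}
```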







As you can see, each parent work-item spawns only one child kernel, with one work-item and a local size of 1.


The sphere has 224*256 points (the same as the number of parent kernel work-items), each computing a wave equation, but only about half of them are updated.


If I spawn 2 work-items (in one child kernel) per parent work-item (but with only 50% of the total parent work-items this time), 90% of the sphere is rendered correctly, and the edge points of that 90% region look like garbage.


If I spawn 4 child work-items per parent work-item (again with one child kernel, but with 25% of the parent work-items, making 224*256 points in total again), then the sphere is fully rendered correctly.


Isn't the device-side queue (OpenCL 2.0 dynamic parallelism) supposed to be a dynamic thing? Why does it randomly drop child kernels when the queue is full?


RX550 (driver 17.6.2, latest for now), OpenCL 2.0, Windows 10, 64-bit build. The clBuildProgram options include the OpenCL 2.0 flag, and the -g option is included for more detailed errors; no error is returned. enqueue_kernel does not return any error code either (I omitted the lengthy error checking here for readability).


The device side queue is created with:




cl_uint qsMax  = device.getInfo<CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE>();
cl_uint qsPref = device.getInfo<CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE>();

cl_queue_properties qprop[] = {
    CL_QUEUE_SIZE, (qsMax + qsPref) / 3,
    CL_QUEUE_PROPERTIES, (cl_command_queue_properties)(
        CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE |
        CL_QUEUE_ON_DEVICE_DEFAULT | CL_QUEUE_PROFILING_ENABLE),
    0 };

device_queue = clCreateCommandQueueWithProperties(context.get(), device.get(), qprop, &err);



As you can see, I'm using the C++ bindings from the Khronos site, which points to a GitHub project from which I downloaded cl2.hpp.


Just to "fix" my sphere rendering issue, I pick the queue size as

(qsMax + qsPref) / 2

or

(2 * (qsMax + qsPref)) / 3

but if I choose qsMax alone, it returns an invalid-queue error and sometimes a resource error, so I pick a value between the preferred and maximum values. Choosing a value very close to the maximum makes it very slow, roughly 1/50 of the speed without dynamic parallelism.
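To make that fallback explicit, here is a sketch (reusing the cl2.hpp names from my code above) that tries queue sizes from large to small until creation succeeds; the candidate list and retry loop are my own illustration, not anything the spec recommends.

```cpp
// Sketch: try queue sizes from large to small until creation succeeds.
// The candidate fractions are arbitrary; only qsPref..qsMax is portable.
cl_uint qsMax  = device.getInfo<CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE>();
cl_uint qsPref = device.getInfo<CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE>();
cl_uint candidates[] = { (2 * (qsMax + qsPref)) / 3, (qsMax + qsPref) / 2,
                         (qsMax + qsPref) / 3, qsPref };
cl_command_queue device_queue = nullptr;
for (cl_uint qs : candidates) {
    cl_queue_properties qprop[] = {
        CL_QUEUE_SIZE, qs,
        CL_QUEUE_PROPERTIES, (cl_command_queue_properties)(
            CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE |
            CL_QUEUE_ON_DEVICE_DEFAULT | CL_QUEUE_PROFILING_ENABLE),
        0 };
    cl_int err = CL_SUCCESS;
    device_queue = clCreateCommandQueueWithProperties(context.get(),
                                                      device.get(), qprop, &err);
    if (err == CL_SUCCESS) break; // otherwise retry with a smaller size
}
```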


Could you reproduce this, please? If not, could you post a working code snippet?


Are 60k child kernels too many for dynamic parallelism, or for the RX550? Should I expect higher limits on bigger GPUs? Since no error is returned, could this be a synchronization issue? How can I wait, from the host side, for the device queue to finish its work? (Maybe the device-side queue always syncs child kernels with their parents, but the documentation doesn't say.) A clinfo query via the AMD APP SDK reports a queue size of 8.3 MB. I also don't know how many bytes the queue allocates per child kernel with N parameters, M work-groups, and O bytes of kernel source.
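Regarding the host-side wait: as far as I understand the spec, a parent kernel's event does not complete until all of its child kernels complete, so waiting on the parent from the host should cover the children too. A minimal sketch with the C++ bindings (the kernel and queue names are assumptions from my setup above):

```cpp
// Sketch: wait for the parent kernel and, per the OpenCL 2.0 execution
// model, all of its child kernels, from the host side.
cl::Event parentDone;
hostQueue.enqueueNDRangeKernel(waveEquationKernel, cl::NullRange,
                               cl::NDRange(224 * 256), cl::NDRange(256),
                               nullptr, &parentDone);
parentDone.wait();   // returns only after the children also complete
// or simply drain the whole host queue:
hostQueue.finish();
```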


More importantly, what if I add ray tracing to the program? I can't know in advance how many times a ray will be refracted (spawning a child kernel) or reflected (spawning another child kernel) per work-item. Another example is a Mandelbrot generator: I can't know beforehand which regions need extra parallelism/detail (without checking borders first, as in the examples, which adds too much complexity). It would be great if the GPU could block enqueue_kernel requests when the queued commands equal the queue size. If there were a deadlock, the kernel wouldn't finish, right? It always finishes, except when I add events inside the kernel (release_event(evt) and the evented version of enqueue_kernel). But waiting for the child kernel inside the parent work-item would make it slow, and with a maximum of only 1024 events (per clinfo), wouldn't that defeat the point of dynamic parallelism?
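For reference, this is roughly the evented variant I mean; the event name is mine:

```
// Sketch: evented enqueue; the parent receives an event it must release.
clk_event_t evt;
int err = enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_NO_WAIT,
                         ndrange, 0, NULL, &evt,
                         ^{ test(xyz, xyzn, xyzo, arguments); });
if (err == CLK_SUCCESS)
    release_event(evt);  // avoid leaking one of the ~1024 available events
```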



Could there be another issue, like: the queue becomes full, but to drain it the runtime needs to enqueue a kernel that needs resources, and no resources are left on the GPU, so it silently swallows the problem (no error) while waiting in a (live?) lock? How can I determine the exact maximum queue size that stays clear of such resource-related locks? What happens if I just set the queue size to 100? Would it randomly error on another type of GPU? All I can query from the GPU are the preferred and maximum queue sizes.