13 Replies Latest reply on Jul 15, 2017 7:32 AM by tugrul_512bit

    Device-queue is not blocking if it is full. Some child kernels don't work. get_default_queue does not return error nor does enqueue_kernel.

    tugrul_512bit

      Here is a simple algorithm that makes a sphere's surface wavy:

       

      [code]
      __kernel void test(__global float *xyz, __global float *xyzn, __global float *xyzo, __global float *arguments)
      {
          int threadId = get_global_id(0);
          float dx  = xyz[threadId*3]   - arguments[2];
          float dy  = xyz[threadId*3+1] - arguments[3];
          float t   = arguments[1];
          float ctr = arguments[0];
          float wave = 0.02f*ctr*sin(40.0f*t + 100.0f*sqrt(dx*dx + dy*dy));
          xyzo[threadId*3]   = xyz[threadId*3]   + xyzn[threadId*3]  *wave; // wave equation for all surface vertices
          xyzo[threadId*3+1] = xyz[threadId*3+1] + xyzn[threadId*3+1]*wave;
          xyzo[threadId*3+2] = xyz[threadId*3+2] + xyzn[threadId*3+2]*wave;
      }

      __kernel void waveEquation(__global float *xyz, __global float *xyzn, __global float *xyzo, __global float *arguments)
      {
          int threadId = get_global_id(0);
          if(threadId < arguments[4])
          {
              queue_t q = get_default_queue();
              ndrange_t ndrange = ndrange_1D(threadId, 1, 1);
              void (^my_block_A)(void) = ^{ test(xyz, xyzn, xyzo, arguments); };
              enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange, my_block_A);
          }
      }
      [/code]

       

      As you can see, each parent work-item spawns just one child kernel with a single work-item and a local size of 1.

       

      The sphere has 224*256 points (the same as the number of parent-kernel work-items), and the wave equation should be computed for each of them, but only about half of the points actually get updated.

       

      If I instead spawn 2 work-items (in one child kernel) per parent work-item, but with only 50% of the total parent work-items this time, then about 90% of the sphere is rendered correctly, and the edge points of that 90% region look like garbage.

       

      If I spawn 4 child work-items per parent work-item (again one child kernel each, but with only 25% of the parent work-items, making 224*256 points in total again), then the sphere is rendered fully and correctly.

       

      Isn't the device-side queue (OpenCL 2.0 dynamic parallelism) supposed to be a dynamic thing? Why does it randomly drop child kernels when the queue is full?

       

      This is on an RX 550 (driver 17.6.2, the latest as of now), OpenCL 2.0, Windows 10, 64-bit build. The clBuildProgram compile options include the OpenCL 2.0 flag, and the -g option is also included for more detailed error codes; no build error is returned. enqueue_kernel does not return any error code either (I didn't include the long error-checking here, for readability).
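
      For reference, the build step looks roughly like this (variable names are placeholders; the options string is the part that matters):

      [code]
      // Build with the OpenCL 2.0 flag plus -g for the more fine-grained enqueue_kernel error codes
      cl::Program program(context, kernelSource);
      cl_int buildErr = program.build("-cl-std=CL2.0 -g");
      // buildErr comes back as CL_SUCCESS, so the kernels themselves compile cleanly
      [/code]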

       

      The device side queue is created with:

       

      [code]
      cl_uint qsMax  = device.getInfo<CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE>();
      cl_uint qsPref = device.getInfo<CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE>();

      cl_queue_properties qprop[] = {
          CL_QUEUE_SIZE, (qsMax + qsPref) / 3,
          CL_QUEUE_PROPERTIES, (cl_command_queue_properties)(CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                                                             CL_QUEUE_ON_DEVICE |
                                                             CL_QUEUE_ON_DEVICE_DEFAULT |
                                                             CL_QUEUE_PROFILING_ENABLE),
          0 };
      device_queue = clCreateCommandQueueWithProperties(context.get(), device.get(), qprop, &err);
      [/code]

       

      As you can see, I'm using the C++ bindings from the Khronos site, which points to a GitHub project from which I downloaded cl2.hpp.
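
      The bindings are pulled in with something like this (the version macros are, as far as I understand, what exposes the OpenCL 2.0 entry points):

      [code]
      // cl2.hpp from the OpenCL-CLHPP GitHub project
      #define CL_HPP_MINIMUM_OPENCL_VERSION 200
      #define CL_HPP_TARGET_OPENCL_VERSION 200   // make the OpenCL 2.0 API visible
      #include <CL/cl2.hpp>
      [/code]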

       

      Just to "fix" my sphere rendering issue, I pick the queue size as

      (qsMax + qsPref) / 2

      or

      (2 * (qsMax + qsPref)) / 3

      but if I choose qsMax on its own, it returns an invalid-queue error and sometimes a resource error, so I pick something between the preferred and the max value. Choosing a value very close to the max makes it very slow, roughly 1/50 of the speed without dynamic parallelism.

       

      Could you try to reproduce this, please? If not, could you post a working piece of code instead?

       

      Are 60k child kernels too many for dynamic parallelism, or for the RX 550? Should I expect higher limits on bigger GPUs? If no error is returned, could this be a synchronization issue? How can I wait, from the host side, for the device queue to finish its work? (Maybe the device-side queue always synchronizes child kernels with their parents, but the documentation doesn't say.) A clinfo query via the AMD APP SDK reports a queue size of 8.3 MB. I also don't know how many bytes are allocated in the queue per child kernel with N parameters, M work-groups and O bytes of kernel code.
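
      For what it's worth, on the host side I currently just call finish() on the queue that launched the parent kernel, roughly like this (kernel, queue and buffer names are placeholders); my assumption is that the parent kernel only reports completion once its child kernels have finished:

      [code]
      cl::Kernel waveKernel(program, "waveEquation");
      waveKernel.setArg(0, bufXyz);
      waveKernel.setArg(1, bufXyzn);
      waveKernel.setArg(2, bufXyzo);
      waveKernel.setArg(3, bufArguments);
      queue.enqueueNDRangeKernel(waveKernel, cl::NullRange,
                                 cl::NDRange(224 * 256),   // one parent work-item per sphere point
                                 cl::NDRange(256));        // parent local size
      queue.finish();   // waits for the parent kernel; I assume the children are covered by this too
      [/code]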

       

      More importantly, what happens if I add ray tracing to the program? I can't know in advance how many times a ray will be refracted (spawning a child kernel) or reflected (spawning another child kernel) per work-item. Another example would be a Mandelbrot generator: I can't know beforehand which regions need extra parallelism/detail (without first scanning borders as in the examples, which increases complexity too much). It would be great if the GPU could block enqueue_kernel requests while the number of queued commands equals the queue size. If there were a deadlock, the kernel wouldn't be able to finish, right? It always finishes, except when I add events in the kernel (release_event(evt) and the evented version of enqueue_kernel). But waiting for the child kernel inside the parent work-item would make it slow, and with only 1024 events maximum (from a clinfo query) that would defeat the point of dynamic parallelism, wouldn't it?
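
      One workaround I'm considering (just a sketch of the idea, not what the posted code does) is to check the return value of enqueue_kernel inside waveEquation's if(threadId < arguments[4]) block and, if the enqueue is rejected, do the child's work inline in the parent work-item:

      [code]
      // Sketch only: fall back to inline computation when the device queue rejects the enqueue
      int err = enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_NO_WAIT,
                               ndrange_1D(threadId, 1, 1),
                               ^{ test(xyz, xyzn, xyzo, arguments); });
      if(err != CLK_SUCCESS)   // e.g. CLK_DEVICE_QUEUE_FULL when built with -g
      {
          // same wave equation as in test(), computed by the parent work-item itself
          float dx = xyz[threadId*3]   - arguments[2];
          float dy = xyz[threadId*3+1] - arguments[3];
          float wave = 0.02f*arguments[0]*sin(40.0f*arguments[1] + 100.0f*sqrt(dx*dx + dy*dy));
          xyzo[threadId*3]   = xyz[threadId*3]   + xyzn[threadId*3]  *wave;
          xyzo[threadId*3+1] = xyz[threadId*3+1] + xyzn[threadId*3+1]*wave;
          xyzo[threadId*3+2] = xyz[threadId*3+2] + xyzn[threadId*3+2]*wave;
      }
      [/code]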

       

       

      Could there be another issue, something like: the queue becomes full, but to drain it the device needs to launch a kernel, which needs resources, and there are no resources left on the GPU, so the problem is silently swallowed with no error while everything waits in a (live?) lock? How can I find out the exact maximum queue size so I can stay away from such resource-related locks? What happens if I simply set the queue size to 100? Would it fail unpredictably on another type of GPU? All I can query from the GPU are the preferred and maximum queue sizes.

       

      Regards.

        • Re: Device-queue is not blocking if it is full. Some child kernels don't work. get_default_queue does not return error nor does enqueue_kernel.
          dipak

          Thanks for reporting this. I'm not sure whether there is any such limitation on the maximum number of device kernels that can be enqueued. I'll check with the related team and get back to you.

           

          Regards,

            • Re: Device-queue is not blocking if it is full. Some child kernels don't work. get_default_queue does not return error nor does enqueue_kernel.
              tugrul_512bit

              Thank you very much. If there is any way to query the queue state (or its remaining bytes) inside the kernel, that would help a lot.

               

              Regards.

                • Re: Device-queue is not blocking if it is full. Some child kernels don't work. get_default_queue does not return error nor does enqueue_kernel.
                  dipak

                  In the above scenario, where the device queue is full, enqueue_kernel() should return an error. The program behaviour you mentioned above is not expected. It would be helpful for the team if you could share a repro case.

                   

                  Regards,

                    • Re: Device-queue is not blocking if it is full. Some child kernels don't work. get_default_queue does not return error nor does enqueue_kernel.
                      tugrul_512bit

                      This is the exact scenario:

                       

                      Local size of parent = 256

                      Global size of parent = 256*224

                       

                      arguments[4] is 224*256, just to make sure nothing overflows.

                       

                      [code]
                      __kernel void test(__global float *xyz, __global float *xyzn, __global float *xyzo, __global float *arguments)
                      {
                        int threadId = get_global_id(0);
                        float dx  = xyz[threadId*3]   - arguments[2];
                        float dy  = xyz[threadId*3+1] - arguments[3];
                        float t   = arguments[1];
                        float ctr = arguments[0];
                        float wave = 0.02f*ctr*sin(40.0f*t + 100.0f*sqrt(dx*dx + dy*dy));
                        xyzo[threadId*3]   = xyz[threadId*3]   + xyzn[threadId*3]  *wave; // wave equation for all surface vertices
                        xyzo[threadId*3+1] = xyz[threadId*3+1] + xyzn[threadId*3+1]*wave;
                        xyzo[threadId*3+2] = xyz[threadId*3+2] + xyzn[threadId*3+2]*wave;
                      }

                      __kernel void waveEquation(__global float *xyz, __global float *xyzn, __global float *xyzo, __global float *arguments)
                      {
                        int threadId = get_global_id(0);
                        if(threadId < arguments[4])
                        {
                          queue_t q = get_default_queue();
                          ndrange_t ndrange = ndrange_1D(threadId, 1, 1);
                          void (^my_block_A)(void) = ^{ test(xyz, xyzn, xyzo, arguments); };
                          enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange, my_block_A);
                        }
                      }
                      [/code]

                       

                      but I tested against errors (with the -g build option, of course):

                       

                      [code]
                      queue_t q = get_default_queue();
                      ndrange_t ndrange = ndrange_1D(threadId, 1, 1);
                      void (^my_block_A)(void) = ^{ test(xyz, xyzn, xyzo, arguments); };
                      int ctr = 0;
                      while((enqueue_kernel(q, CLK_ENQUEUE_FLAGS_NO_WAIT, ndrange, my_block_A) &
                             ( CLK_DEVICE_QUEUE_FULL |
                               CLK_EVENT_ALLOCATION_FAILURE |
                               CLK_OUT_OF_RESOURCES |
                               CLK_INVALID_NDRANGE |
                               CLK_INVALID_QUEUE |
                               CLK_INVALID_EVENT_WAIT_LIST |
                               CLK_INVALID_ARG_SIZE )) > 0)
                      {
                      }
                      [/code]

                       

                      and I also tested for the return value simply not being CL_SUCCESS, but it doesn't return an error and it doesn't get stuck in an infinite loop either. It silently swallows the enqueued kernels, or the enqueued kernels overflow something that I don't understand. If it returned an error, it should get stuck in the while loop forever, but it doesn't.

                       

                      I also tried calling get_default_queue() inside the loop, in case that was the part failing, but it doesn't return an error either.

                       

                      If it helps, I'm constructing the default command queue as:

                       

                      [code]
                      cl_uint qs = device.getInfo<CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE>();
                      cl_queue_properties qprop[] = {
                        CL_QUEUE_SIZE, qs,
                        CL_QUEUE_PROPERTIES, (cl_command_queue_properties)(CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE |
                                                                           CL_QUEUE_ON_DEVICE |
                                                                           CL_QUEUE_ON_DEVICE_DEFAULT |
                                                                           CL_QUEUE_PROFILING_ENABLE),
                        0 };
                      device_queue = clCreateCommandQueueWithProperties(context.get(), device.get(), qprop, &err);
                      [/code]

                       

                      Lastly, it runs perfectly whenever the number of child kernels is less than the queue size; that's why I choose a value between the preferred and max sizes. Whether the child kernels have a different local range than the parent's local range doesn't matter: it runs erroneously whenever the queue size is not big enough. (I guess the queue is not dynamic, unlike a producer-consumer setup. Or should I drop the while loop, cancel that work-item altogether, write "failed" into some shared array, and check that array from another work-item to re-try?)
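
                      For debugging, one thing I could do (sketch only; statusBuf would be an extra __global int buffer added to the kernel arguments) is record each parent work-item's enqueue_kernel return code so the host can see which enqueues actually fail:

                      [code]
                      // Sketch: record the return code per parent work-item so the host can inspect it afterwards
                      __kernel void waveEquation(__global float *xyz, __global float *xyzn, __global float *xyzo,
                                                 __global float *arguments, __global int *statusBuf)
                      {
                        int threadId = get_global_id(0);
                        if(threadId < arguments[4])
                        {
                          ndrange_t ndrange = ndrange_1D(threadId, 1, 1);
                          statusBuf[threadId] = enqueue_kernel(get_default_queue(), CLK_ENQUEUE_FLAGS_NO_WAIT,
                                                               ndrange, ^{ test(xyz, xyzn, xyzo, arguments); });
                          // host side: read statusBuf after the parent finishes and count entries != CLK_SUCCESS (0)
                        }
                      }
                      [/code]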

                       

                      Regards,