8 Replies Latest reply on Oct 14, 2015 12:41 PM by dipak

    Enqueuing to device side queue in a loop issue

    doqtor

      In my code I have kernelA and kernelB. kernelB depends on kernelA results. I am iterating over this kernels thousand of times and each iteration depends on the results from the previous iteration.

       

       

      Below is the minimal version reproducing the issue - enqueue_kernel is either returning -1 or hanging when size parameter from kernelLauncher is set to >= 513. Everything seems to be OK when size is set to <= 512. Is this code OK? Am I hitting some hardware limit here? (1025th kernel enqueue is failing).

       

      __kernel  __attribute__((reqd_work_group_size(256, 1, 1)))
      void kernelA(int index)
      {}
      
      __kernel  __attribute__((reqd_work_group_size(256, 1, 1)))
      void kernelB(int index)
      {}
      
      __kernel  __attribute__((reqd_work_group_size(1, 1, 1)))
      __kernel void kernelLauncher(int size,  __global int *err)
      {
          queue_t default_queue = get_default_queue();
          clk_event_t ev1, ev2;
          int ret;
      
          for (int index = 0; index < size; ++index)
          {
              void(^fnKernelA)(void) = ^{ kernelA(index); };
      
              if (index == 0)
              {
                  ret = enqueue_kernel(default_queue,
                      CLK_ENQUEUE_FLAGS_NO_WAIT,
                      ndrange_1D(3*256, 256),
                      0, NULL, &ev1,
                      fnKernelA);
              }
              else
              {
                  ret = enqueue_kernel(default_queue,
                      CLK_ENQUEUE_FLAGS_NO_WAIT,
                      ndrange_1D(3 * 256, 256),
                      1, &ev2, &ev1,
                      fnKernelA);
              }
      
              if (ret != CLK_SUCCESS)
              {
                  *err = ret;
                  return;
              }
      
              void(^fnKernelB)(void) = ^{ kernelB(index); };
      
              ret = enqueue_kernel(default_queue,
                  CLK_ENQUEUE_FLAGS_NO_WAIT,
                  ndrange_1D(256, 256),
                  1, &ev1, &ev2,
                  fnKernelB);
      
              if (ret != CLK_SUCCESS)
              {
                  *err = ret;
                  return;
              }
          }
      }
      

       

      My set up:

      Ubuntu 14.04, R9 295, fglrx 15.20.3, AMD APP SDK 3.0

        • Re: Enqueuing to device side queue in a loop issue
          dipak
          Am I hitting some hardware limit here?

          May be. There is a size limit for the device queue which can be queried by clGetDeviceInfo  with param CL_DEVICE_QUEUE_ON_ DEVICE_MAX_SIZE.

          [Same can be found from clinfo marked by "Queue on device max size"]

          Usually, the queue size is set to a lower value (or preferred value)  than the max. limit ( CL_DEVICE_QUEUE_ON_ DEVICE_PREFERRED_SIZE or see "Queue on device preferred size" in clinfo). During the device queue creation, one can modify the size value using clCreateCommandQueueWithProperties  with param CL_QUEUE_SIZE.

           

          For example,

                  cl_queue_properties prop[] = {

            CL_QUEUE_PROPERTIES, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE | CL_QUEUE_ON_DEVICE|CL_QUEUE_ON_DEVICE_DEFAULT,

            CL_QUEUE_SIZE, maxQueueSize, 0 };

           

          You may try this above.

           

          Regards,

           

           

            • Re: Enqueuing to device side queue in a loop issue
              doqtor

              For R9 295 that is:

              CL_DEVICE_QUEUE_ON_DEVICE_PREFERRED_SIZE: 262,144

              CL_DEVICE_QUEUE_ON_DEVICE_MAX_SIZE: 524,288

               

              I set the max size but that didn't change anything, it's still failing on 1025th enqueue. I tried calling release_event - no difference.

              Note that this code is working fine on Intel CPU.

                • Re: Enqueuing to device side queue in a loop issue
                  dipak

                  Thanks for sharing your observation.

                  From your observation, it seems that the queue size has no effect on this number. I'll try the code at my end and let you know my findings.

                   

                  Regards

                  • Re: Enqueuing to device side queue in a loop issue
                    dipak

                     

                    it's still failing on 1025th enqueue. I tried calling release_event - no difference.
                    Note that this code is working fine on Intel CPU.

                    I guess, in case of R9 295, the value of CL_DEVICE_MAX_ON_DEVICE_EVENTS or "Max on device events" (clinfo) is  1024. So, falling after that value seems logical. However as you said, you tried to release the events but still not working. Could you please share that code?

                    Also, please check the value of CL_DEVICE_MAX_ON_DEVICE_EVENTS for the Intel CPU and then try to exceed that value.

                      • Re: Enqueuing to device side queue in a loop issue
                        doqtor

                        On Intel CPU the limit is

                        CL_DEVICE_MAX_ON_DEVICE_EVENTS: 4,294,967,295

                        and I don't really want to try to go over that limit

                         

                        On R9 295 that is:

                        CL_DEVICE_MAX_ON_DEVICE_EVENTS:        1,024

                         

                        It's hard to try to figure out this because frequently it hangs the kernel and I have to go for reboot.

                         

                        Code with release_event:

                         

                        __kernel  __attribute__((reqd_work_group_size(1, 1, 1)))
                        __kernel void kernelLauncher(
                            int size,
                            __global int *err
                            )
                        {
                            queue_t default_queue = get_default_queue();
                            clk_event_t ev1, ev2;
                            int ret;
                        
                        
                            for (int index = 0; index < size; ++index)
                            {
                                void(^fnKernelA)(void) = ^{ kernelA(
                                    index
                                    ); };
                        
                        
                                if (index == 0)
                                {
                                    ret = enqueue_kernel(default_queue,
                                        CLK_ENQUEUE_FLAGS_NO_WAIT,
                                        ndrange_1D(3*256, 256),
                                        0, NULL, &ev1,
                                        fnKernelA);
                                }
                                else
                                {
                                    ret = enqueue_kernel(default_queue,
                                        CLK_ENQUEUE_FLAGS_NO_WAIT,
                                        ndrange_1D(3 * 256, 256),
                                        1, &ev2, &ev1,
                                        fnKernelA);
                                }
                        
                        
                                if (ret != CLK_SUCCESS)
                                {
                                    *err = ret;
                                    return;
                                }
                                if(is_valid_event(ev2))
                                    release_event(ev2);
                        
                        
                                void(^fnKernelB)(void) = ^{ kernelB(
                                    index
                                    ); };
                        
                        
                                ret = enqueue_kernel(default_queue,
                                    CLK_ENQUEUE_FLAGS_NO_WAIT,
                                    ndrange_1D(256, 256),
                                    1, &ev1, &ev2,
                                    fnKernelB);
                        
                        
                                if (ret != CLK_SUCCESS)
                                {
                                    *err = ret;
                                    return;
                                }
                                if(is_valid_event(ev1))
                                    release_event(ev1);
                            }
                        }
                        

                         

                        Another different attempt:

                         

                        __kernel  __attribute__((reqd_work_group_size(1, 1, 1)))
                        __kernel void kernelLauncher(
                            int size,
                            __global int *err
                            )
                        {
                            queue_t default_queue = get_default_queue();
                            clk_event_t ev1, ev2;
                            int ret;
                        
                        
                            for (int index = 0; index < size; ++index)
                            {
                                void(^fnKernelA)(void) = ^{ kernelA(
                                    index
                                    ); };
                        
                        
                                if (index == 0)
                                {
                                    ret = enqueue_kernel(default_queue,
                                        CLK_ENQUEUE_FLAGS_NO_WAIT,
                                        ndrange_1D(3*256, 256),
                                        0, NULL, &ev1,
                                        fnKernelA);
                                }
                                else
                                {
                                    ret = enqueue_kernel(default_queue,
                                        CLK_ENQUEUE_FLAGS_NO_WAIT,
                                        ndrange_1D(3 * 256, 256),
                                        1, &ev2, &ev1,
                                        fnKernelA);
                                    if(ret == CLK_SUCCESS)
                                        release_event(ev2);
                                }
                        
                        
                                if (ret != CLK_SUCCESS)
                                {
                                    *err = index*2;
                                    return;
                                }
                        
                        
                                void(^fnKernelB)(void) = ^{ kernelB(
                                    index
                                    ); };
                        
                        
                                ret = enqueue_kernel(default_queue,
                                    CLK_ENQUEUE_FLAGS_NO_WAIT,
                                    ndrange_1D(256, 256),
                                    1, &ev1, &ev2,
                                    fnKernelB);
                        
                        
                                if (ret != CLK_SUCCESS)
                                {
                                    *err = index*2+1;
                                    return;
                                }
                                else
                                    release_event(ev1);
                            }
                        }
                        
                          • Re: Enqueuing to device side queue in a loop issue
                            dipak

                            Hi,

                            I was doing some experiments with the above code and got few interesting observations.

                            Firstly, it seems that the limit is not linked with events, instead number of kernels that can be enqueued at most. These limits depends on the size of the queue. On a Hawaii, the max. value seems 1024 when queue size is set to maximum.

                            In your case, as I guess, it reaches the limit when value of "size" greater than 512. After that the device stop responding. This can be realize if enqueue_marker  is used at a certain point. For example, the modified the code shown below was working fine for a large value of "size" (and even without the calling of release event):

                             

                            1. ret = enqueue_kernel(default_queue, 
                            2.             CLK_ENQUEUE_FLAGS_NO_WAIT, 
                            3.             ndrange_1D(256, 256), 
                            4.             1, &ev1, &ev2, 
                            5.             fnKernelB); 
                            6.  
                            7.         if (ret != CLK_SUCCESS) 
                            8.         { 
                            9.             *err = ret; 
                            10.             return
                            11.         } 
                            12. if(index == SYNPOINT) {  // SYNPOINT = any suitable value within limit e.g. 128 or 256
                            13.         enqueue_marker(default_queue, 1, &ev2, 0);
                            14. }

                             

                            Could you please check and share your findings?

                             

                            Regards,

                              • Re: Enqueuing to device side queue in a loop issue
                                doqtor

                                Hi dipak,

                                 

                                I tried your suggestion and when I set synpoint to index % 128 == 0 then there were first 1020 enqueue_kernel and 4 enqueue_marker successful and after that the 1021th enqueue_kernel failed which was the 1025th submission to the queue so yes it looks like that is a problem with the queue being full but marker is not releasing it. Also kernel doesn't hang anymore when the marker was added. I also tried releasing events but nothing changed.

                                 

                                Does that mean that there is a bug in the driver I'm using (fglrx 15.20.3)? Any possible workarounds?

                                 

                                Regards,