4 Replies Latest reply on Feb 14, 2017 1:17 PM by patrickchew1234

    OpenCL hangs in clfinish after queueing up more then 12 kernels

    patrickchew1234

      I am a OpenCL newbie, so if I am doing something which is just apparently wrong, do correct me.

       

      I have a multi-window app, and I am using one OpenCL context/commandqueue/kernel per window. Things works fine right when I open up to like 12 windows, but if I open like 16 windows, it will eventually crash within 5mins - few hours.  What happens is that the kernel "seems" to get stuck  and clfinish never returns. Eventually I see a DX11 device reset, and all my OpenCL windows turn black..  I am running Win10 Aniversary, and 16.15 version of AMD driver. (The system is not hung, I can kill the process, and redo the test again). I had followed the suggestions to disable TDRreset but it doesn't fix the problem.

       

      I am not sure how I get to debugging this. Any help/suggestsions would be appreciated  I have a very simple kernel.

       

      const char* pKernelSrc = "__kernel void imgProcess(__global uint * pSrcData, __write_only image2d_t pData, int2 imgDim)          \n \

                                   {                                                                \n \

                                    int posX = get_global_id(0); \n \

                                       int posY = get_global_id(1); \n \

                                       float fBlue0, fGreen0, fRed0; \n \

                                       uint    ulData = pSrcData[posX + (posY * imgDim.x) ]; \n \

                                       fBlue0  = (float) (ulData & 0x1f) ; \n \

                                       fGreen0 = (float) (ulData & 0x7e0) ; \n \

                                       fRed0   = (float) (ulData & 0xf800) ; \n \

                                       fBlue0  = fBlue0/ 32.0f; \n \

                                       fGreen0 = fGreen0/ 2048.0f; \n \

                                       fRed0  = fRed0/ 65768.0f; \n \

                                    write_imagef(pData, (int2) (posX, posY), (float4)(fRed0, fGreen0, fBlue0, 1.0f)); \n \

                                   } ";

       

      The crash doesn't seem associated with the amount of source/destination data. If I open 12 BIG windows (larger textures) it works. The issue seems more related to the amount of kernels I queue up ...  so if I open 16 windows, it will eventually crash after 5mins- few hours. If I just  use a dummy kernel or just do write_image on a static color, and remove anything dealing with the input, then things work fine for larger amount of windows,  As soon as I make use of pSrcData, (which is just a clmem surface derived from a DirectGMA surface), then  I see the problem when > 12 windows are opened.

       

              I get no error messages when making all the Opencl-calls.

       

       

             I am wondering whether it's due to the multiple instances of kernels/programs, and  I was going to try to see if I only have a single enqueue a single kernel which can operate on N ClMem locations, but I don't think it's valid to pass in an array of CLMems which is created across multiple different ClContexts. ??

       

       

      //// function which gets called N times, where N = number of windows opened

       

      int            nStatus;
      static unsigned int   ulFrameCnt = 0;
      cl_int2     vDim = { (long)pDGMAObj->m_ulGLTextureWidth * 2, (long)pDGMAObj->m_ulGLTextureHeight };
      size_t      uiGlobalWorkSize[2] ;
      size_t      uiLocalWorkSize[2] = { 16, 16 };


      uiGlobalWorkSize[0] = (pDGMAObj->m_ulGLTextureWidth  / 32) * 32;
      uiGlobalWorkSize[1] = (pDGMAObj->m_ulGLTextureHeight / 16) * 16;


      nStatus = clEnqueueAcquireGLObjects(pDGMAObj->m_clCmdQueue, 1, &pDGMAObj->m_clBindedImage, 0, 0, 0); // locks GL buffer so CL can use it
      if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clEnqueueAcquireGLObjects");

       

      nStatus = clSetKernelArg(pDGMAObj->m_clKernel, 0, sizeof(cl_mem), (void*)&pDGMAObj->m_pBuffer);
      if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail ARG0");

       

      nStatus = clSetKernelArg(pDGMAObj->m_clKernel, 1, sizeof(cl_mem), (void*)&pDGMAObj->m_clBindedImage);
      if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail ARG1");


      // Argument 2: Dimension of buffer
      nStatus = clSetKernelArg(pDGMAObj->m_clKernel, 2, sizeof(cl_int2), (cl_int2*)&vDim);
      if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail ARG2");


      nStatus = clEnqueueNDRangeKernel(pDGMAObj->m_clCmdQueue,       //  Cmd Queue
            pDGMAObj->m_clKernel,       //  kernel
            2,                //  Work dimention >0  but less then 3
            NULL,             //  Global Work offset
            uiGlobalWorkSize, //  global work size
            uiLocalWorkSize,  //  Local work size
            0,                //  Num events in wait list
            NULL,             //  event in wait list
            NULL);             // events

      if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clEnqueueNDRangeKernel");

       

      nStatus = clEnqueueReleaseGLObjects(pDGMAObj->m_clCmdQueue, 1, &pDGMAObj->m_clBindedImage, 0, 0, 0);
      if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clEnqueueReleaseGLObjects");

       

      nStatus = clFlush(pDGMAObj->m_clCmdQueue); // start this work ASAP
      if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clFlush");

       

      nStatus = clFinish(pDGMAObj->m_clCmdQueue); //

      if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clFinish");

      ///////

        • Re: OpenCL hangs in clfinish after queueing up more then 12 kernels
          dipak

          Hi Patrick,

          Sorry for this delayed reply.

          I would suggest you to write a simple test-case to enqueue more than 12 kernels and see whether you still observe the same issue or not. If reproducible, please share the repro with us.

           

          Regards,

            • Re: OpenCL hangs in clfinish after queueing up more then 12 kernels
              patrickchew1234

              Did many different tests, and found one that solution that works. But here's my data....

               

              Note: The kernels/programs  themselves are duplicate kernels in the above code.  copy which I created per "window" to associate with their respective gl-renderContext. All the stuff I mentioned in GPU related, and I am not testing it on CPU.

               

              (1) If I create 8 windows (8kernels/8commandQs) and loop then 4x (so 32 enqueues one after another)  it works fine. So the issue doesn't seem related to the number of back-to-back enqueues.

               

              (2) The exact number of failure seems to be 16+ enqueues. I tried 14, and it seems to work fine. 16 enqueues of different pairs pf CommandQ's/kernels back to back doesn't work. basically lasting from 5 mins to several hours, before clfinish doesn't return, and I see DX11 resetting the device

               

              (3) If the kernel is "devoid" of the source .. i.e. just making blue/red/green variables point to PosX and PosY instead of reading from a source  it works fine.

               

              (4) The solution which I am leaning with (seems to last over 3+ days of testing) is when doing OpenCl Processing, I just use a singleCommandQueue/kernel, and use the cl-mem associated with OTHER clContexts for the enqueue the works into this single CommandQ. These different clContexts are all belonging to the same clDevice. I am not sure whether this is valid OpenCl usage, but it seems to work. Tested with 32 windows, and all of them are still working fine over  3+ days. Assuming that it's "valid" to share OpenCL's cl_mems created from different clContexts (of the same clDevice), then this actually seems to be a better and cleaner solution for me

               

              If some of the things above "seems/sounds" strange, just note that I am a OpenCL newbie, so my understanding/description of stuff might be a little off..

                • Re: OpenCL hangs in clfinish after queueing up more then 12 kernels
                  dipak

                  It's good that you've found a solution. However, I don't understand the benefit of multiple contexts here as a workaround. Also, as per the OpenCL spec, it is expected that objects such as command-queues, memory, program and kernel objects must belong to the same context. Actually, OpenCL runtime uses the context as a container to manage all these objects.

                   

                  Regards,

                    • Re: OpenCL hangs in clfinish after queueing up more then 12 kernels
                      patrickchew1234

                      The code I have is mostly derived" from the AMD sample code, which was when a window is created, an OpenGL rendering context is created, and an OpenCL context is then associated with it. I just extended this for multiple windows so each window's having its own hwnd, own GL-renderContext and own OpenCL rendercontext. My app basically dynamically creates/ destroys windows based on user input dynamically, so the original idea of keeping each window having it's separate context/command/etc seems to be a clean way to do it,  but it has performance implications on the OpenGL side due to context switches,

                       

                      For performance, it would be better if I can use a single OpenGL/CL context for  N windows, but my knowledge of stuff is too shallow. I don't know how to dynamically combine/bind different windows hwnds into a single openGL/CL Context. The main issue being these windows are created and destroyed "dynamically", and ideally a window is opened/destroyed without "affecting" the other windows.

                       

                      I'll need to revisit this again, since you mentioned the way I am doing is not-quite-correct for OpenCL. Testing on this is tedious ... because it sometimes can take many hours before the problem manifests itsel, and I get no error message besides noticing clFinish getting stuck, and eventually I get a  DX11 device reset. Maybe I need to first spend some time looking up how to get more feedback from the GPU, maybe by enabling profiling or something ?? There's some debugging tool from AMD, but I am not sure if it will help me out in my issue.

                       

                      -----

                      Another datapoint. If the GPU outputs are 1080p60 (monitor resolution is 1080p).

                      If the GPU outputs are set to 4K (monitor resolution 4K@60/4K@30) then the issue manifests itself.

                       

                      The test application itself doesn't care about monitor resolution, as it by default just opens a 960x540 window ... (which user can then later "resize" if needed). But all my testing are done identically with N windows of size (960x540).

                       

                       

                      ------