
OpenCL hangs in clFinish after queueing up more than 12 kernels

Question asked by patrickchew1234 on Feb 7, 2017
Latest reply on Feb 14, 2017 by patrickchew1234

I am an OpenCL newbie, so if I am doing something obviously wrong, please correct me.


I have a multi-window app that uses one OpenCL context, command queue, and kernel per window. Everything works fine with up to about 12 windows open, but with 16 windows it eventually fails, anywhere from 5 minutes to a few hours in. The kernel appears to get stuck and clFinish never returns. Eventually I see a DX11 device reset, and all my OpenCL windows turn black. I am running the Windows 10 Anniversary Update and version 16.15 of the AMD driver. (The system itself is not hung: I can kill the process and rerun the test.) I followed the suggestions to disable the TDR reset, but that did not fix the problem.


I am not sure how to go about debugging this. Any help or suggestions would be appreciated. I have a very simple kernel:


const char* pKernelSrc = "__kernel void imgProcess(__global uint * pSrcData, __write_only image2d_t pData, int2 imgDim) \n \
{ \n \
    int posX = get_global_id(0); \n \
    int posY = get_global_id(1); \n \
    float fBlue0, fGreen0, fRed0; \n \
    uint ulData = pSrcData[posX + (posY * imgDim.x)]; \n \
    /* Mask out the 5-6-5 channels in place (no shift) */ \n \
    fBlue0  = (float) (ulData & 0x1f); \n \
    fGreen0 = (float) (ulData & 0x7e0); \n \
    fRed0   = (float) (ulData & 0xf800); \n \
    /* Normalize each channel to [0, 1) */ \n \
    fBlue0  = fBlue0 / 32.0f; \n \
    fGreen0 = fGreen0 / 2048.0f; \n \
    fRed0   = fRed0 / 65536.0f; /* 2^16 */ \n \
    write_imagef(pData, (int2) (posX, posY), (float4)(fRed0, fGreen0, fBlue0, 1.0f)); \n \
} ";
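To sanity-check the kernel's channel math without involving the GPU, the same masking and scaling can be mirrored on the host. This is only a minimal sketch: the names `Rgb565Float` and `unpack_rgb565` are hypothetical, and it assumes the intended red divisor is 65536 (2^16), so that each channel is divided by the power of two just above its mask.

```c
#include <stdint.h>

/* Hypothetical helper type: one pixel as normalized floats. */
typedef struct {
    float r;
    float g;
    float b;
} Rgb565Float;

/* Mirrors the kernel: mask each channel in place (no shift), then
 * divide by the power of two just above the mask's top bit.
 * The 65536 divisor for red is an assumption (2^16). */
static Rgb565Float unpack_rgb565(uint32_t packed)
{
    Rgb565Float out;
    out.b = (float)(packed & 0x001fu) / 32.0f;    /* 5 bits, max 31/32       */
    out.g = (float)(packed & 0x07e0u) / 2048.0f;  /* 6 bits, max 2016/2048   */
    out.r = (float)(packed & 0xf800u) / 65536.0f; /* 5 bits, max 63488/65536 */
    return out;
}
```

Feeding a few known values (0x0000, 0xF800, 0xFFFF) through this and comparing them with what ends up in the texture is a quick way to rule out the arithmetic as a factor.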


The crash does not seem to be tied to the amount of source/destination data: if I open 12 BIG windows (larger textures), it works. The issue seems more related to the number of kernels I queue up, so with 16 windows open it eventually crashes after 5 minutes to a few hours. If I use a dummy kernel, or just call write_imagef with a static color and remove everything that touches the input, things work fine with a larger number of windows. As soon as I make use of pSrcData (which is just a cl_mem surface derived from a DirectGMA surface), I see the problem when more than 12 windows are open.


I get no error messages from any of the OpenCL calls.
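The host code further down only logs a fixed string per call, so if a call ever did fail, the actual status code would be lost. A minimal sketch of a helper that keeps the code and a readable name follows. The function names are hypothetical, and the numeric values are copied from the standard Khronos CL/cl.h header only so that the sketch compiles without the OpenCL headers; in the real app the CL_* macros would be used directly.

```c
#include <stdio.h>

/* Maps a few OpenCL status codes to their names. The list is
 * deliberately partial; values match the Khronos CL/cl.h header. */
static const char* cl_status_name(int status)
{
    switch (status) {
    case 0:   return "CL_SUCCESS";
    case -4:  return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -34: return "CL_INVALID_CONTEXT";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -38: return "CL_INVALID_MEM_OBJECT";
    case -52: return "CL_INVALID_KERNEL_ARGS";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    case -60: return "CL_INVALID_GL_OBJECT";
    default:  return "unknown OpenCL status";
    }
}

/* Hypothetical helper: writes "<call> failed: <name> (<code>)" into buf.
 * Returns what snprintf returns. */
static int format_cl_failure(char* buf, size_t len, const char* call, int status)
{
    return snprintf(buf, len, "%s failed: %s (%d)",
                    call, cl_status_name(status), status);
}
```

The resulting buffer can be passed to OutputDebugStringA in place of the fixed "Fail ..." strings.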



I am wondering whether this is due to the multiple instances of kernels/programs. I was going to try enqueueing a single kernel that operates on N cl_mem locations, but I don't think it is valid to pass in an array of cl_mem objects created across multiple different OpenCL contexts?



// Function that gets called N times, where N = number of windows opened


int            nStatus;
static unsigned int   ulFrameCnt = 0;
cl_int2     vDim = { (cl_int)pDGMAObj->m_ulGLTextureWidth * 2, (cl_int)pDGMAObj->m_ulGLTextureHeight }; // cl_int2 holds cl_int, not long
size_t      uiGlobalWorkSize[2] ;
size_t      uiLocalWorkSize[2] = { 16, 16 };

uiGlobalWorkSize[0] = (pDGMAObj->m_ulGLTextureWidth  / 32) * 32;
uiGlobalWorkSize[1] = (pDGMAObj->m_ulGLTextureHeight / 16) * 16;

nStatus = clEnqueueAcquireGLObjects(pDGMAObj->m_clCmdQueue, 1, &pDGMAObj->m_clBindedImage, 0, 0, 0); // locks GL buffer so CL can use it
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clEnqueueAcquireGLObjects");


nStatus = clSetKernelArg(pDGMAObj->m_clKernel, 0, sizeof(cl_mem), (void*)&pDGMAObj->m_pBuffer);
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail ARG0");


nStatus = clSetKernelArg(pDGMAObj->m_clKernel, 1, sizeof(cl_mem), (void*)&pDGMAObj->m_clBindedImage);
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail ARG1");

// Argument 2: Dimension of buffer
nStatus = clSetKernelArg(pDGMAObj->m_clKernel, 2, sizeof(cl_int2), (cl_int2*)&vDim);
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail ARG2");

nStatus = clEnqueueNDRangeKernel(pDGMAObj->m_clCmdQueue,       //  Cmd queue
      pDGMAObj->m_clKernel,       //  Kernel
      2,                //  Work dimension (must be 1, 2, or 3)
      NULL,             //  Global work offset
      uiGlobalWorkSize, //  Global work size
      uiLocalWorkSize,  //  Local work size
      0,                //  Num events in wait list
      NULL,             //  Event wait list
      NULL);            //  Returned event (unused)

if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clEnqueueNDRangeKernel");


nStatus = clEnqueueReleaseGLObjects(pDGMAObj->m_clCmdQueue, 1, &pDGMAObj->m_clBindedImage, 0, 0, 0);
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clEnqueueReleaseGLObjects");


nStatus = clFlush(pDGMAObj->m_clCmdQueue); // start this work ASAP
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clFlush");


nStatus = clFinish(pDGMAObj->m_clCmdQueue); // block until all queued work completes (this is the call that never returns)

if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clFinish");
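One thing worth noting about the listing above: the global work size is rounded down to a multiple of 32 in X and 16 in Y, so any columns or rows past the last full multiple are never processed. A minimal sketch of the two rounding options is below. The helper names are hypothetical, and rounding up is only safe if the kernel skips work-items whose posX/posY fall outside the image, which the kernel as posted does not do.

```c
#include <stddef.h>

/* Largest multiple of `multiple` that is <= value.
 * This is what the host code above does. */
static size_t round_down_to_multiple(size_t value, size_t multiple)
{
    return (value / multiple) * multiple;
}

/* Smallest multiple of `multiple` that is >= value.
 * Covers the whole image, but the kernel then needs a bounds check. */
static size_t round_up_to_multiple(size_t value, size_t multiple)
{
    return ((value + multiple - 1) / multiple) * multiple;
}
```

For a texture height of 1080 and a multiple of 16, for example, rounding down gives 1072 rows of work-items and rounding up gives 1088.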