cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

OpenCL hangs in clfinish after queueing up more then 12 kernels

I am a OpenCL newbie, so if I am doing something which is just apparently wrong, do correct me.

I have a multi-window app, and I am using one OpenCL context/commandqueue/kernel per window. Things works fine right when I open up to like 12 windows, but if I open like 16 windows, it will eventually crash within 5mins - few hours.  What happens is that the kernel "seems" to get stuck  and clfinish never returns. Eventually I see a DX11 device reset, and all my OpenCL windows turn black..  I am running Win10 Aniversary, and 16.15 version of AMD driver. (The system is not hung, I can kill the process, and redo the test again). I had followed the suggestions to disable TDRreset but it doesn't fix the problem.

I am not sure how I get to debugging this. Any help/suggestsions would be appreciated  I have a very simple kernel.

const char* pKernelSrc = "__kernel void imgProcess(__global uint * pSrcData, __write_only image2d_t pData, int2 imgDim)          \n \

                             {                                                                \n \

                              int posX = get_global_id(0); \n \

                                 int posY = get_global_id(1); \n \

                                 float fBlue0, fGreen0, fRed0; \n \

                                 uint    ulData = pSrcData[posX + (posY * imgDim.x) ]; \n \

                                 fBlue0  = (float) (ulData & 0x1f) ; \n \

                                 fGreen0 = (float) (ulData & 0x7e0) ; \n \

                                 fRed0   = (float) (ulData & 0xf800) ; \n \

                                 fBlue0  = fBlue0/ 32.0f; \n \

                                 fGreen0 = fGreen0/ 2048.0f; \n \

                                 fRed0  = fRed0/ 65768.0f; \n \

                              write_imagef(pData, (int2) (posX, posY), (float4)(fRed0, fGreen0, fBlue0, 1.0f)); \n \

                             } ";

The crash doesn't seem associated with the amount of source/destination data. If I open 12 BIG windows (larger textures) it works. The issue seems more related to the amount of kernels I queue up ...  so if I open 16 windows, it will eventually crash after 5mins- few hours. If I just  use a dummy kernel or just do write_image on a static color, and remove anything dealing with the input, then things work fine for larger amount of windows,  As soon as I make use of pSrcData, (which is just a clmem surface derived from a DirectGMA surface), then  I see the problem when > 12 windows are opened.

        I get no error messages when making all the Opencl-calls.

       I am wondering whether it's due to the multiple instances of kernels/programs, and  I was going to try to see if I only have a single enqueue a single kernel which can operate on N ClMem locations, but I don't think it's valid to pass in an array of CLMems which is created across multiple different ClContexts. ??

//// function which gets called N times, where N = number of windows opened

int            nStatus;
static unsigned int   ulFrameCnt = 0;
cl_int2     vDim = { (long)pDGMAObj->m_ulGLTextureWidth * 2, (long)pDGMAObj->m_ulGLTextureHeight };
size_t      uiGlobalWorkSize[2] ;
size_t      uiLocalWorkSize[2] = { 16, 16 };


uiGlobalWorkSize[0] = (pDGMAObj->m_ulGLTextureWidth  / 32) * 32;
uiGlobalWorkSize[1] = (pDGMAObj->m_ulGLTextureHeight / 16) * 16;


nStatus = clEnqueueAcquireGLObjects(pDGMAObj->m_clCmdQueue, 1, &pDGMAObj->m_clBindedImage, 0, 0, 0); // locks GL buffer so CL can use it
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clEnqueueAcquireGLObjects");

nStatus = clSetKernelArg(pDGMAObj->m_clKernel, 0, sizeof(cl_mem), (void*)&pDGMAObj->m_pBuffer);
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail ARG0");

nStatus = clSetKernelArg(pDGMAObj->m_clKernel, 1, sizeof(cl_mem), (void*)&pDGMAObj->m_clBindedImage);
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail ARG1");


// Argument 2: Dimension of buffer
nStatus = clSetKernelArg(pDGMAObj->m_clKernel, 2, sizeof(cl_int2), (cl_int2*)&vDim);
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail ARG2");


nStatus = clEnqueueNDRangeKernel(pDGMAObj->m_clCmdQueue,       //  Cmd Queue
      pDGMAObj->m_clKernel,       //  kernel
      2,                //  Work dimention >0  but less then 3
      NULL,             //  Global Work offset
      uiGlobalWorkSize, //  global work size
      uiLocalWorkSize,  //  Local work size
      0,                //  Num events in wait list
      NULL,             //  event in wait list
      NULL);             // events

if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clEnqueueNDRangeKernel");

nStatus = clEnqueueReleaseGLObjects(pDGMAObj->m_clCmdQueue, 1, &pDGMAObj->m_clBindedImage, 0, 0, 0);
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clEnqueueReleaseGLObjects");

nStatus = clFlush(pDGMAObj->m_clCmdQueue); // start this work ASAP
if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clFlush");

nStatus = clFinish(pDGMAObj->m_clCmdQueue); //

if (nStatus != CL_SUCCESS) OutputDebugStringA("Fail clFinish");

///////

0 Likes
4 Replies
dipak
Big Boss

Hi Patrick,

Sorry for this delayed reply.

I would suggest you to write a simple test-case to enqueue more than 12 kernels and see whether you still observe the same issue or not. If reproducible, please share the repro with us.

Regards,

0 Likes

Did many different tests, and found one that solution that works. But here's my data....

Note: The kernels/programs  themselves are duplicate kernels in the above code.  copy which I created per "window" to associate with their respective gl-renderContext. All the stuff I mentioned in GPU related, and I am not testing it on CPU.

(1) If I create 8 windows (8kernels/8commandQs) and loop then 4x (so 32 enqueues one after another)  it works fine. So the issue doesn't seem related to the number of back-to-back enqueues.

(2) The exact number of failure seems to be 16+ enqueues. I tried 14, and it seems to work fine. 16 enqueues of different pairs pf CommandQ's/kernels back to back doesn't work. basically lasting from 5 mins to several hours, before clfinish doesn't return, and I see DX11 resetting the device

(3) If the kernel is "devoid" of the source .. i.e. just making blue/red/green variables point to PosX and PosY instead of reading from a source  it works fine.

(4) The solution which I am leaning with (seems to last over 3+ days of testing) is when doing OpenCl Processing, I just use a singleCommandQueue/kernel, and use the cl-mem associated with OTHER clContexts for the enqueue the works into this single CommandQ. These different clContexts are all belonging to the same clDevice. I am not sure whether this is valid OpenCl usage, but it seems to work. Tested with 32 windows, and all of them are still working fine over  3+ days. Assuming that it's "valid" to share OpenCL's cl_mems created from different clContexts (of the same clDevice), then this actually seems to be a better and cleaner solution for me

If some of the things above "seems/sounds" strange, just note that I am a OpenCL newbie, so my understanding/description of stuff might be a little off..

0 Likes

It's good that you've found a solution. However, I don't understand the benefit of multiple contexts here as a workaround. Also, as per the OpenCL spec, it is expected that objects such as command-queues, memory, program and kernel objects must belong to the same context. Actually, OpenCL runtime uses the context as a container to manage all these objects.

Regards,

0 Likes

The code I have is mostly derived" from the AMD sample code, which was when a window is created, an OpenGL rendering context is created, and an OpenCL context is then associated with it. I just extended this for multiple windows so each window's having its own hwnd, own GL-renderContext and own OpenCL rendercontext. My app basically dynamically creates/ destroys windows based on user input dynamically, so the original idea of keeping each window having it's separate context/command/etc seems to be a clean way to do it,  but it has performance implications on the OpenGL side due to context switches,

For performance, it would be better if I can use a single OpenGL/CL context for  N windows, but my knowledge of stuff is too shallow. I don't know how to dynamically combine/bind different windows hwnds into a single openGL/CL Context. The main issue being these windows are created and destroyed "dynamically", and ideally a window is opened/destroyed without "affecting" the other windows.

I'll need to revisit this again, since you mentioned the way I am doing is not-quite-correct for OpenCL. Testing on this is tedious ... because it sometimes can take many hours before the problem manifests itsel, and I get no error message besides noticing clFinish getting stuck, and eventually I get a  DX11 device reset. Maybe I need to first spend some time looking up how to get more feedback from the GPU, maybe by enabling profiling or something ?? There's some debugging tool from AMD, but I am not sure if it will help me out in my issue.

-----

Another datapoint. If the GPU outputs are 1080p60 (monitor resolution is 1080p).

If the GPU outputs are set to 4K (monitor resolution 4K@60/4K@30) then the issue manifests itself.

The test application itself doesn't care about monitor resolution, as it by default just opens a 960x540 window ... (which user can then later "resize" if needed). But all my testing are done identically with N windows of size (960x540).

------

0 Likes