3 Replies Latest reply on Jun 18, 2014 2:08 AM by dipak

    possible bug found: *single* threaded race condition with clSetKernelArg + clEnqueueNDRangeKernel

    madshi

      First of all let me say a *BIG* thanks for finally adding support for the OpenCL 1.2 Direct3D9 interop extension. I've been waiting for this for years! And it seems to work just fine. Finally I can use OpenCL in my DirectShow video renderer. So I'm pretty happy...

       

      Found one potential AMD OpenCL bug. Consider the following code:

       

      [code]  bool success = (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane0) == CL_SUCCESS) &&
                     (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &YSurfaceCL) == CL_SUCCESS) &&
                     (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsY, NULL, 0, NULL, NULL) == CL_SUCCESS) &&
                     (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane1) == CL_SUCCESS) &&
                     (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &CbCrSurfaceCL) == CL_SUCCESS) &&
                     (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsCbCr, NULL, 0, NULL, NULL) == CL_SUCCESS);[/code]

       

      I'm using this to copy the Y and CbCr channels of an NV12 D3D9 DXVA surface to separate D3D9 textures/surfaces for further processing. This code works fine most of the time. However, once in a while it produces corrupted image output. After some digging I found that adding a clFinish() call seems to fix the issue:

       

      [code]  bool success = (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane0) == CL_SUCCESS) &&
                     (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &YSurfaceCL) == CL_SUCCESS) &&
                     (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsY, NULL, 0, NULL, NULL) == CL_SUCCESS) &&
                     (clFinish(ClDxvaQueue) == CL_SUCCESS) &&
                     (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane1) == CL_SUCCESS) &&
                     (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &CbCrSurfaceCL) == CL_SUCCESS) &&
                     (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsCbCr, NULL, 0, NULL, NULL) == CL_SUCCESS);[/code]

       

      It seems that sometimes the clSetKernelArg() calls for the second kernel launch are applied before the first kernel has finished running, which corrupts the output. I can understand how that could happen, but the OpenCL documentation doesn't mention this anywhere. It only warns about a race condition with clSetKernelArg() when multiple host threads share a kernel object. So this looks like a bug?

       

      Thoughts?

        • Re: possible bug found: *single* threaded race condition with clSetKernelArg + clEnqueueNDRangeKernel
          fenglanghai

          I ran into the same issue. I tested the following code: if the Finish() call is moved outside the loop, the image comes out wrong, even though in the profiler I can see the kernels executing in order.

          The per-iteration Finish() calls take too long, though.

           

              cl_mem _big_ = NULL, _small_ = NULL;
              cl_kernel kernel = mKernel->getKernel("Gauss", NULL);
              _big_ = _small_ = mMemory->GetImage2D(Img, 0, OCL_MEMORY_SOUR);
              size_t gg[NUM_LK_pyramid][2] = {0};
              int ii = 0;
              for(size_t i = 1; i < NUM_LK_pyramid; i++){
                  _big_ = _small_;
                  ii = 0;
                  err = flSetKernelArg(kernel, ii, sizeof(cl_mem), &_big_);
                  if(CL_SUCCESS != err){
                      goto LEAVE;
                  }
                  ii++;
                  _small_ = mMemory->GetImage2D(Img, i, OCL_MEMORY_SOUR);
                  err = flSetKernelArg(kernel, ii, sizeof(cl_mem), &_small_);
          LEAVE:
                  if(CL_SUCCESS != err){
                      Debug_Err("set kernel arg err\ni: %d error code: 0x%x, index: %d", i, err, ii);
                      return -1;
                  }
                  gg[i][0] = (size_t)(IWid / pow(2.0f, (int)i));
                  gg[i][1] = (size_t)(IHei / pow(2.0f, (int)i));
                  err = flEnqueueNDRangeKernel(mCommandQueue->Get(0), kernel, 2, 0, gg[i], 0, 0, NULL, NULL);
                  if(CL_SUCCESS != err){
                      Debug_Err("enqueue nd range kernel error: %d\n", err);
                      return -2;
                  }
                  Finish();
              }

          • Re: possible bug found: *single* threaded race condition with clSetKernelArg + clEnqueueNDRangeKernel
            ravkum

            Hi,

             

            This should work without a clFinish between the two clEnqueueNDRangeKernel calls.  I have code written this way too and I haven't seen any issues.

            This could be a bug in our runtime.  Another possibility is that there’s a synchronization bug in your program and the clFinish somehow fixes that.


            Is it possible for you to give us a test case?


            Regards,