First of all let me say a *BIG* thanks for finally adding support for the OpenCL 1.2 Direct3D9 interop extension. I've been waiting for this for years! And it seems to work just fine. Finally I can use OpenCL in my DirectShow video renderer. So I'm pretty happy...
Found one potential AMD OpenCL bug. Consider the following code:
bool success = (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane0) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &YSurfaceCL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsY, NULL, 0, NULL, NULL) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane1) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &CbCrSurfaceCL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsCbCr, NULL, 0, NULL, NULL) == CL_SUCCESS);
I'm using this to copy the Y and CbCr channels of an NV12 D3D9 DXVA surface to separate D3D9 textures/surfaces for further processing. This code works fine most of the time, but once in a while it produces corrupted image output. After some digging I found that adding a call to "clFinish()" between the two kernel launches seems to fix the issue:
bool success = (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane0) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &YSurfaceCL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsY, NULL, 0, NULL, NULL) == CL_SUCCESS) &&
               (clFinish(ClDxvaQueue) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane1) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &CbCrSurfaceCL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsCbCr, NULL, 0, NULL, NULL) == CL_SUCCESS);
It seems that sometimes the clSetKernelArg() calls for the second kernel launch take effect before the first kernel has finished running, which messes everything up. I can understand how this could happen, but the OpenCL documentation doesn't mention it anywhere. It only warns about a race condition with clSetKernelArg() when multiple host threads share a kernel object; for a single thread, the argument values are supposed to be captured at enqueue time. So this looks like a bug?
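In the meantime, one possible workaround (assuming the problem really is the argument overwrite) is to use a separate cl_kernel object per launch, so each enqueue owns its own argument state and nothing is reused while a kernel may still be in flight. A rough sketch, reusing the surface/queue names from the snippet above; `program` and the kernel name "SplitNv12" are placeholders for however the kernel is actually built:

```c
/* Sketch: two kernel objects created from the same program, so setting
   arguments for the second launch can never disturb the first one.
   All handles (program, ClDxvaQueue, surfaces, work sizes) are assumed
   to exist as in the original code. */
cl_int errA, errB;
cl_kernel splitY    = clCreateKernel(program, "SplitNv12", &errA);
cl_kernel splitCbCr = clCreateKernel(program, "SplitNv12", &errB);

bool success = (errA == CL_SUCCESS) && (errB == CL_SUCCESS) &&
               (clSetKernelArg(splitY,    0, sizeof(cl_mem), &nv12plane0)    == CL_SUCCESS) &&
               (clSetKernelArg(splitY,    1, sizeof(cl_mem), &YSurfaceCL)    == CL_SUCCESS) &&
               (clSetKernelArg(splitCbCr, 0, sizeof(cl_mem), &nv12plane1)    == CL_SUCCESS) &&
               (clSetKernelArg(splitCbCr, 1, sizeof(cl_mem), &CbCrSurfaceCL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, splitY,    2, NULL, globalWsY,    NULL, 0, NULL, NULL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, splitCbCr, 2, NULL, globalWsCbCr, NULL, 0, NULL, NULL) == CL_SUCCESS);
```

This avoids the mid-frame clFinish() entirely; the two kernel objects can of course be created once at startup and reused every frame.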
Thoughts?
I ran into the same problem. I tested the following code: if the Finish() is moved out of the loop, the image comes out wrong, even though the profiler shows the kernels running in the right order.
The Finish() calls take too long.
cl_mem _big_ = NULL, _small_ = NULL;
cl_int err = CL_SUCCESS;
cl_kernel kernel = mKernel->getKernel("Gauss", NULL);
_big_ = _small_ = mMemory->GetImage2D(Img, 0, OCL_MEMORY_SOUR);
size_t gg[2] = {0};
int ii = 0;
for (size_t i = 1; i < NUM_LK_pyramid; i++) {
    _big_ = _small_;
    ii = 0;
    err = clSetKernelArg(kernel, ii, sizeof(cl_mem), &_big_);
    if (CL_SUCCESS != err) {
        goto LEAVE;
    }
    ii++;
    _small_ = mMemory->GetImage2D(Img, i, OCL_MEMORY_SOUR);
    err = clSetKernelArg(kernel, ii, sizeof(cl_mem), &_small_);
LEAVE:
    if (CL_SUCCESS != err) {
        Debug_Err("set kernel arg err\ni: %d error code: 0x%x, index: %d", i, err, ii);
        return -1;
    }
    gg[0] = (size_t)(IWid / pow(2.0f, (int)i));
    gg[1] = (size_t)(IHei / pow(2.0f, (int)i));
    err = clEnqueueNDRangeKernel(mCommandQueue->Get(0), kernel, 2, NULL, gg, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err) {
        Debug_Err("enqueue nd range kernel error: %d\n", err);
        return -2;
    }
    Finish();
}
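If the per-iteration Finish() is too slow, one way around it (assuming the argument-overwrite behavior described above) is to give each pyramid level its own kernel object, so the arguments set for level i can never clobber a still-pending enqueue for level i-1, and only a single sync point at the end is needed. A sketch, reusing the wrapper names from the code above and assuming getKernel("Gauss", ...) hands back a fresh cl_kernel each call (otherwise create one per level with clCreateKernel from the same program):

```c
/* Sketch: one cl_kernel per pyramid level instead of one shared kernel.
   Because no kernel object has its arguments rewritten while earlier
   enqueues may still be running, the in-loop Finish() can be dropped. */
cl_mem big = mMemory->GetImage2D(Img, 0, OCL_MEMORY_SOUR);
size_t gg[2];
cl_int err;
for (size_t i = 1; i < NUM_LK_pyramid; i++) {
    cl_kernel k = mKernel->getKernel("Gauss", NULL);  /* assumed fresh per call */
    cl_mem small = mMemory->GetImage2D(Img, i, OCL_MEMORY_SOUR);
    err = clSetKernelArg(k, 0, sizeof(cl_mem), &big);
    if (CL_SUCCESS == err)
        err = clSetKernelArg(k, 1, sizeof(cl_mem), &small);
    if (CL_SUCCESS != err)
        return -1;
    gg[0] = (size_t)(IWid / pow(2.0f, (int)i));
    gg[1] = (size_t)(IHei / pow(2.0f, (int)i));
    err = clEnqueueNDRangeKernel(mCommandQueue->Get(0), k, 2, NULL, gg, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err)
        return -2;
    big = small;
}
Finish();  /* single sync point after all levels are queued */
```

The kernel objects can be created once up front and reused across frames, so the extra clCreateKernel cost is a one-time setup cost rather than per-frame.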
Hi,
This should work without a clFinish between the two clEnqueueNDRangeKernel calls. I have code written this way too, and I haven't seen any issue.
This could be a bug in our runtime. Another possibility is that there’s a synchronization bug in your program and the clFinish somehow fixes that.
Is it possible for you to give us a test case?
Regards,
Can you confirm whether the problem still exists, and provide us with a test case if it does?