First of all let me say a *BIG* thanks for finally adding support for the OpenCL 1.2 Direct3D9 interop extension. I've been waiting for this for years! And it seems to work just fine. Finally I can use OpenCL in my DirectShow video renderer. So I'm pretty happy...
Found one potential AMD OpenCL bug. Consider the following code:
bool success = (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane0) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &YSurfaceCL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsY, NULL, 0, NULL, NULL) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane1) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &CbCrSurfaceCL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsCbCr, NULL, 0, NULL, NULL) == CL_SUCCESS);
I'm using this to copy the Y and CbCr channels of an NV12 D3D9 DXVA surface to separate D3D9 textures/surfaces for further processing. This code works fine most of the time, but once in a while it produces corrupted image output. After some digging I found that adding a call to "clFinish()" between the two kernel launches seems to fix the issue:
bool success = (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane0) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &YSurfaceCL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsY, NULL, 0, NULL, NULL) == CL_SUCCESS) &&
               (clFinish(ClDxvaQueue) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 0, sizeof(cl_mem), &nv12plane1) == CL_SUCCESS) &&
               (clSetKernelArg(KernelSplitNv12, 1, sizeof(cl_mem), &CbCrSurfaceCL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, KernelSplitNv12, 2, NULL, globalWsCbCr, NULL, 0, NULL, NULL) == CL_SUCCESS);
It seems that sometimes the clSetKernelArg() calls for the second kernel launch take effect before the first kernel has finished running, which messes everything up. I can understand how this could happen, but the OpenCL documentation doesn't mention it anywhere. It only warns about a race condition with clSetKernelArg() when multiple host threads share a kernel object; for a single thread, the argument values are supposed to be captured at enqueue time. So this looks like a bug?
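In the meantime, one possible workaround (assuming the problem really is the argument overwrite) is to use a separate cl_kernel object per launch, so each enqueue owns its own argument state and nothing is reused while a kernel may still be in flight. A rough sketch, reusing the surface/queue names from the snippet above; `program` and the kernel name "SplitNv12" are placeholders for however the kernel is actually built:

```c
/* Sketch: two kernel objects created from the same program, so setting
   arguments for the second launch can never disturb the first one.
   All handles (program, ClDxvaQueue, surfaces, work sizes) are assumed
   to exist as in the original code. */
cl_int errA, errB;
cl_kernel splitY    = clCreateKernel(program, "SplitNv12", &errA);
cl_kernel splitCbCr = clCreateKernel(program, "SplitNv12", &errB);

bool success = (errA == CL_SUCCESS) && (errB == CL_SUCCESS) &&
               (clSetKernelArg(splitY,    0, sizeof(cl_mem), &nv12plane0)    == CL_SUCCESS) &&
               (clSetKernelArg(splitY,    1, sizeof(cl_mem), &YSurfaceCL)    == CL_SUCCESS) &&
               (clSetKernelArg(splitCbCr, 0, sizeof(cl_mem), &nv12plane1)    == CL_SUCCESS) &&
               (clSetKernelArg(splitCbCr, 1, sizeof(cl_mem), &CbCrSurfaceCL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, splitY,    2, NULL, globalWsY,    NULL, 0, NULL, NULL) == CL_SUCCESS) &&
               (clEnqueueNDRangeKernel(ClDxvaQueue, splitCbCr, 2, NULL, globalWsCbCr, NULL, 0, NULL, NULL) == CL_SUCCESS);
```

This avoids the mid-frame clFinish() entirely; the two kernel objects can of course be created once at startup and reused every frame.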
Thoughts?
I ran into the same problem. I tested the following code: if the Finish() is moved out of the loop, the image comes out wrong, even though the profiler shows the kernels running in the right order.
The Finish() calls take too long.
cl_mem _big_ = NULL, _small_ = NULL;
cl_int err = CL_SUCCESS;
cl_kernel kernel = mKernel->getKernel("Gauss", NULL);
_big_ = _small_ = mMemory->GetImage2D(Img, 0, OCL_MEMORY_SOUR);
size_t gg[2] = {0};
int ii = 0;
for (size_t i = 1; i < NUM_LK_pyramid; i++) {
    _big_ = _small_;
    ii = 0;
    err = clSetKernelArg(kernel, ii, sizeof(cl_mem), &_big_);
    if (CL_SUCCESS != err) {
        goto LEAVE;
    }
    ii++;
    _small_ = mMemory->GetImage2D(Img, i, OCL_MEMORY_SOUR);
    err = clSetKernelArg(kernel, ii, sizeof(cl_mem), &_small_);
LEAVE:
    if (CL_SUCCESS != err) {
        Debug_Err("set kernel arg err\ni: %d error code: 0x%x, index: %d", i, err, ii);
        return -1;
    }
    gg[0] = (size_t)(IWid / pow(2.0f, (int)i));
    gg[1] = (size_t)(IHei / pow(2.0f, (int)i));
    err = clEnqueueNDRangeKernel(mCommandQueue->Get(0), kernel, 2, NULL, gg, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err) {
        Debug_Err("enqueue nd range kernel error: %d\n", err);
        return -2;
    }
    Finish();
}
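If the per-iteration Finish() is too slow, one way around it (assuming the argument-overwrite behavior described above) is to give each pyramid level its own kernel object, so the arguments set for level i can never clobber a still-pending enqueue for level i-1, and only a single sync point at the end is needed. A sketch, reusing the wrapper names from the code above and assuming getKernel("Gauss", ...) hands back a fresh cl_kernel each call (otherwise create one per level with clCreateKernel from the same program):

```c
/* Sketch: one cl_kernel per pyramid level instead of one shared kernel.
   Because no kernel object has its arguments rewritten while earlier
   enqueues may still be running, the in-loop Finish() can be dropped. */
cl_mem big = mMemory->GetImage2D(Img, 0, OCL_MEMORY_SOUR);
size_t gg[2];
cl_int err;
for (size_t i = 1; i < NUM_LK_pyramid; i++) {
    cl_kernel k = mKernel->getKernel("Gauss", NULL);  /* assumed fresh per call */
    cl_mem small = mMemory->GetImage2D(Img, i, OCL_MEMORY_SOUR);
    err = clSetKernelArg(k, 0, sizeof(cl_mem), &big);
    if (CL_SUCCESS == err)
        err = clSetKernelArg(k, 1, sizeof(cl_mem), &small);
    if (CL_SUCCESS != err)
        return -1;
    gg[0] = (size_t)(IWid / pow(2.0f, (int)i));
    gg[1] = (size_t)(IHei / pow(2.0f, (int)i));
    err = clEnqueueNDRangeKernel(mCommandQueue->Get(0), k, 2, NULL, gg, NULL, 0, NULL, NULL);
    if (CL_SUCCESS != err)
        return -2;
    big = small;
}
Finish();  /* single sync point after all levels are queued */
```

The kernel objects can be created once up front and reused across frames, so the extra clCreateKernel cost is a one-time setup cost rather than per-frame.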
Hi,
This should work without a clFinish between the two clEnqueueNDRangeKernel calls. I have code written this way too, and I haven't seen any issue.
This could be a bug in our runtime. Another possibility is that there’s a synchronization bug in your program and the clFinish somehow fixes that.
Is it possible for you to give us a test case?
Regards,
Can you confirm whether the problem still exists, and provide us with a test case if it does?