I am trying to figure out why the debug and release versions of my app, written in C++ using the OpenCL C++ API, act differently. Basically, what my app does is:
1. create a few kernels
2. create a set of buffers
3. set kernel arguments
then execute these kernels in a loop by repeatedly calling clEnqueueNDRangeKernel on a single in-order command queue (oddly enough, if I use a new command queue for each command, the outcome changes; the commands are properly synchronized, of course). There are some read/write commands spread through the queue.
I understand that C++ debug and release builds can behave differently, but the code on the C++ side seems pretty simple. I think I may have missed something in the host code...
Could anyone give me some hints please? Thanks in advance.
From your description, it's difficult to suggest anything without knowing what kind of difference you're observing. It would be great if you could share your code (host + kernel) and point out the exact problem you are seeing. Please also specify your setup details (OS, GPU, driver, etc.).
Thanks for the reply. Below is some system info:
OS: Windows 10, 64-bit
GPU: AMD Radeon R5 230
SDK: AMD APP SDK 2.9
There are 4 kernels; the simple ones, like initializing a buffer, work fine. The code of the 2 main kernels is pretty lengthy, adapted from C code. I have carefully checked them to make sure there are no pointers in structs (I will double-check them again later)...
I think the difference between the debug and release versions is in the host code, which should have nothing to do with the kernel code, since the latter is compiled by the OpenCL runtime. Therefore, I suspect hidden bugs in my host code are causing the problem. Basically, in the debug version I see an approximately correct image (though not exactly what I expected), while in the release version I only see some dots on a blank background. One kernel runs many, many iterations depending on the input (from thousands to millions). The kernel code is too long to post here; below is the host code that executes the problematic kernel. It first writes to a buffer to update its content for the kernel to process, then executes the kernel, which updates data in another buffer.
err = pClObj->cmdQ.enqueueWriteBuffer(pClObj->clBuffer, CL_TRUE, 0, numOfItems * sizeof(CL_DATA_INFO), pBufferData);
// error check, etc.
err = pClObj->cmdQ.enqueueNDRangeKernel(Kernel, cl::NullRange, cl::NDRange(numOfItems));
// error check, etc.
err = pClObj->cmdQ.finish();
// error check, etc.
The above host code may be executed in a loop thousands or millions of times. Each time, the content of clBuffer is updated first, then the kernel is executed. pClObj is a class that compiles the kernel source, creates the kernels, sets kernel arguments, etc. numOfItems is a configurable number that determines how many data items are passed to the GPU for processing at a time; it is set to 256.
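For clarity, here is the overall shape of that loop body (the OpenCL calls are stubbed out as hypothetical helpers so the sketch stands alone; in the real code they are the cmdQ.enqueueWriteBuffer / enqueueNDRangeKernel / finish() calls above):

```cpp
#include <cstddef>
#include <vector>

// Stand-ins for the real calls (hypothetical names, NOT the OpenCL API):
// in the app these are cmdQ.enqueueWriteBuffer, enqueueNDRangeKernel, finish().
static int enqueueWrite(const std::vector<float>& host) { (void)host; return 0; }
static int enqueueKernel(std::size_t numOfItems) { (void)numOfItems; return 0; }
static int finishQueue() { return 0; }

// One pass of the pattern described above: refresh the input buffer,
// run the kernel over numOfItems work-items, then block until done.
// Returns 0 on success, or the first non-zero error code.
int runIteration(std::vector<float>& hostData, std::size_t numOfItems) {
    if (int err = enqueueWrite(hostData)) return err;    // update clBuffer
    if (int err = enqueueKernel(numOfItems)) return err; // launch kernel
    return finishQueue();                                // wait for completion
}
```

The outer loop just calls runIteration repeatedly after updating hostData.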
I use finish() to wait for execution completion just to update a progress bar; removing it doesn't change anything. I am wondering if one command queue can be used like this: put some commands in it and call finish() to wait until the queued commands finish, then put similar commands in it again and wait, and repeat thousands or millions of times? Is there a limit on the number of commands that can be queued in the same command queue?
Put some commands in it and call finish() to wait until the queued commands finish, then put similar commands in it again and wait, and repeat thousands or millions of times?
Yes, you can. However, in that case the host thread will be blocked by clFinish() until all the previous commands have completed, so commands for the next iteration cannot be enqueued until the commands from the previous iteration have finished:
for(int i = 0; i < N; i++) { /* enqueue commands, then clFinish(q); */ }
In this case, the host thread is blocked at the end of each iteration, so commands for the 2nd iteration cannot be enqueued until the commands from the 1st iteration have completed.
Instead, you may use event objects to enqueue the next series of commands without blocking the host thread. You have to describe the dependencies accordingly.
[As you know, commands are executed in order if the command queue itself is an in-order queue.]
for(int i = 0; i < N; i++)
{
    enqueue(q, command1, event[i - 1], &event[i]); // waits on the event from the previous iteration, signals event[i]
    // may call clFlush() at this point, or after a certain number of commands
}
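A compilable sketch of this event-chaining pattern (with tiny stand-in types instead of the real cl::Event and command queue, so it runs without an OpenCL runtime; it only illustrates the shape of the dependency chain):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for cl::Event: records which iteration signaled it.
struct Event { int iteration = -1; };

// Stand-in for a non-blocking enqueue: in OpenCL the call returns
// immediately and the runtime orders execution via the event wait list.
void enqueue(int iteration, const Event* dep, Event* out,
             std::vector<int>& issued) {
    if (dep) assert(dep->iteration == iteration - 1); // wait-list dependency
    issued.push_back(iteration); // host issues the command and moves on
    out->iteration = iteration;  // event the next iteration will wait on
}

// Issue N chained commands without ever blocking the host thread.
std::vector<int> issueChained(int N) {
    std::vector<Event> events(static_cast<std::size_t>(N));
    std::vector<int> issued;
    for (int i = 0; i < N; i++) {
        const Event* dep = (i > 0) ? &events[i - 1] : nullptr;
        enqueue(i, dep, &events[i], issued);
        // a real app might clFlush() here, or every k iterations
    }
    return issued; // no clFinish() needed inside the loop
}
```

In the real code, the event returned by one enqueue call goes into the wait list of the next one.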
Is there a limit on the number of commands that can be queued in the same command queue?
Yes, there is a runtime limit. It depends on the runtime's command queue size and the type of the command (i.e. the command packet size).
Now, coming to your mismatch problem: since you suspect a host-side bug, it would be good if you could narrow down the problematic region or the suspected APIs. At this point, it's not clear whether the issue is in the OpenCL APIs or in your own code, or maybe a C/C++ build issue completely unrelated to OpenCL. Until that information is available, it's difficult to suggest anything. You may try different host-side compilers (say, GCC, VC++) to cross-check the mismatch.
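One cheap way to narrow it down is to make sure no error code is silently dropped in either build, for example with a small helper like this (a sketch: CL_SUCCESS is 0 in the OpenCL headers, and a plain int is used here so the snippet stands alone without CL/cl.h):

```cpp
#include <cstdio>

// Report any non-success code with its location, so a failure that is
// silently ignored in one build (e.g. release) shows up immediately.
// In real code the argument would be a cl_int returned by an API call.
inline bool clCheck(int err, const char* file, int line) {
    if (err != 0 /* CL_SUCCESS */) {
        std::fprintf(stderr, "OpenCL error %d at %s:%d\n", err, file, line);
        return false;
    }
    return true;
}
#define CHECK_CL(err) clCheck((err), __FILE__, __LINE__)
```

For example, `if (!CHECK_CL(err)) return;` after every enqueueWriteBuffer, enqueueNDRangeKernel, and finish() call would immediately reveal a call that fails only in the release build.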
Note: SDK 2.9 is quite old. Please use the latest SDK and driver compatible with your setup.
Many thanks for the detailed clarifications. I am updating the SDK to the latest version....
Both the debug and release versions of the Windows executable compiled with VS2010 generate a completely empty image on an NVIDIA graphics card (a MacBook Pro with a GT 650M). The original app was a Windows service. Since CodeXL won't attach to a running process for debugging, the service was temporarily rewritten as a normal executable so CodeXL could load it and debug its kernels. So far, no luck with CodeXL: when trying to step into the suspected kernel, CodeXL pops up a message box saying "IL instruction uav_read_cmp_xchg is not yet supported", but stepping into the other kernels is fine...
Thanks again for your hints and I will post here when I solve it....