Standard CL 1.x buffers give no guarantee that the host sees their current contents, except immediately after a sync point.
You likely need to force a sync through clFinish or Map/Copy.
For CL2 you can use SVM to have the buffers "updated coherently" as processing goes on... but I would be careful with that functionality.
clEnqueueNDRangeKernel will most likely NOT cause anything at all to run but just add a command to a command stream. In my experience, the command stream will be flushed much later, unless you force a sync.
I've tried clFinish, and a user event too. OpenCL does not get stuck, but the values CodeXL shows are still random (and it is not the address space qualifier).
Could it still be a sync problem? (My implementation is OpenCL 1.2.)
CodeXL is a debugger, and so is Visual Studio. Try not to run both at the same time. Yeah, I agree with maxdz8: you have a sync problem in your host code. I also use OpenCL 1.2. Bear in mind that clFlush and clFinish carry a performance cost.
You still seem to have some sync errors. Remember that clFlush doesn't sync anything, it only submits the queued commands, and clFinish only covers what has already been enqueued on that queue. To be sure, it is best to use events if you work asynchronously, or blocking (synchronous) calls. Maps in async mode are a handful to sync correctly without major delays :-(
Hi guys, thanks for answering. I solved that problem by mapping the data going into the kernel, and now everything's fine.
Unfortunately, I now have another problem. I do all the operations inside the kernel using the input buffers. When I check with CodeXL, the final result is correct, but only inside the kernel: once I download the data (eight cl_ushort4 values) to the host, I don't get what I expect. I've tried downloading the data with:
1) a zero-copy device buffer (CL_MEM_USE_PERSISTENT_MEM_AMD): non-blocking map (CL_FALSE), kernel execution, clFlush, then copying the data to a host buffer (CL_MEM_ALLOC_HOST_PTR) and using memcpy for direct access (one of the options suggested in the optimization guide). Finally the unmap step and clFinish, then using the downloaded data (but the values are incorrect)
2) mapping with CL_TRUE
3) clFinish instead of clFlush after kernel execution
4) unmapping the memory not immediately (after the memcpy step) but after other commands
5) declaring the host buffer with CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR and using clEnqueueReadBuffer instead of mapping
Every kernel command is scheduled with events to establish the correct execution order. Still, none of the options above works, and I'm wondering why. At least the clEnqueueReadBuffer option, even if inconvenient in terms of time, should work. Is it still a sync problem? I was also thinking of trying async_work_group_copy inside the kernel (global size = 128*8, local size = 128), but the data to download is just 8 cl_ushort4 values, so very little.
Do you have any suggestions about the problem? Or am I just getting lost on a straight path?
Thanks in advance
I have pretty much the same setup, ocl1.2 with map/unmap to the kernel. It would be useful to check the optimization guide, if you haven't done so already.
With zero-copy you shouldn't use memcpy. With zero-copy, the kernel writes its results directly to your host output buffer.
You should map the output just before clEnqueueNDRangeKernel (waiting for the map to complete). You keep the map in place during kernel execution, and when the kernel is done (event or clFinish) you just use your host buffer and then unmap. In this case the kernel writes directly to your host buffer. Therefore, once the kernel is done, all the data is in place and no further synchronization is needed.
If you pass your own host memory to the buffer, you should use CL_MEM_USE_HOST_PTR. If you want the runtime to allocate the buffer from host memory, use CL_MEM_ALLOC_HOST_PTR. If you want it allocated from device memory, use CL_MEM_USE_PERSISTENT_MEM_AMD. In your case, it seems that you need CL_MEM_USE_HOST_PTR.
For me it works without any problems. Make sure that if you are reading the results as cl_uint4, the kernel writes them as cl_uint4.
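The type-mismatch point can be reproduced on the host alone, without any OpenCL. A minimal sketch in plain C, where ushort4_t and uint4_t are hypothetical stand-ins for cl_ushort4 and cl_uint4, and the "kernel" is just a function that fills raw bytes:

```c
#include <stdint.h>
#include <string.h>

/* Stand-ins for cl_ushort4 / cl_uint4 (hypothetical names, plain structs). */
typedef struct { uint16_t s[4]; } ushort4_t;
typedef struct { uint32_t s[4]; } uint4_t;

/* Plays the role of the kernel: writes n ushort4 results into raw bytes.
 * Component c of element i gets the value i*4 + c + 1. */
void write_as_ushort4(void *raw, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        ushort4_t v;
        for (int c = 0; c < 4; ++c)
            v.s[c] = (uint16_t)(i * 4 + c + 1);
        memcpy((unsigned char *)raw + i * sizeof v, &v, sizeof v);
    }
}

/* Matching host read: same element type as the writer. */
uint16_t read_as_ushort4(const void *raw, size_t i, int c)
{
    ushort4_t v;
    memcpy(&v, (const unsigned char *)raw + i * sizeof v, sizeof v);
    return v.s[c];
}

/* Mismatched host read: interprets the same bytes as uint4, so each
 * 32-bit component fuses two adjacent 16-bit results. */
uint32_t read_as_uint4(const void *raw, size_t i, int c)
{
    uint4_t v;
    memcpy(&v, (const unsigned char *)raw + i * sizeof v, sizeof v);
    return v.s[c];
}
```

For eight ushort4 values (64 bytes), the matching read returns what was written, while the uint4 view sees only four elements whose components look like garbage even though the transfer itself was fine.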