Archives Discussions

boxerab · ‎11-04-2015

I have a pipeline of kernels:

1) kernel A writes data into buffer X

2) buffer X is copied to host via clEnqueueReadBuffer

3) host data is processed, in callback triggered by clEnqueueReadBuffer

repeat above

Buffer X is created with the following flags :

CL_MEM_USE_HOST_PTR | CL_MEM_READ_WRITE | CL_MEM_HOST_READ_ONLY

My question: once clEnqueueReadBuffer is complete (I have an event triggered by CL_COMPLETE), is it safe for kernel A to run again without overwriting data being processed on the host?

Or should I process the data on the host before I allow kernel A to run again?

Because I am seeing a bug in my code indicating that it is not safe for kernel A to run until I process the data on the host.

Thanks!

nibal · ‎11-04-2015

I'm not sure this is what you intend. You use CL_MEM_USE_HOST_PTR together with clEnqueueReadBuffer. Kernel A goes to all the trouble of writing into your host-allocated memory, and then you read the buffer from host memory to host memory. Yes, it will be safe once the read completes, but you don't need the read at all. You can process the data directly from your host-allocated memory, or use map/unmap. You can always memcpy from that pointer for safekeeping before processing. I don't see your pipeline, where is it?

The error you see probably depends on your ReadBuffer call.
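The memcpy suggestion above could look roughly like the sketch below. It is plain C with no OpenCL calls, and it assumes the contents at the host pointer are already up to date (for example after a completed read or map). The name snapshot_results is hypothetical, not part of any API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: copy `size` bytes of results out of the buffer's
   host allocation so the copy can be processed while the original
   memory is reused. Returns NULL if allocation fails; the caller owns
   the returned block and must free() it. */
void *snapshot_results(const void *host_ptr, size_t size)
{
    assert(host_ptr != NULL);

    void *copy = malloc(size);
    if (copy != NULL)
        memcpy(copy, host_ptr, size);
    return copy;
}
```

The host would then process the returned copy and free it afterwards, so later writes to the original allocation cannot change the data being processed.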

boxerab · ‎11-04-2015

@nibal I believe I still need to call clEnqueueReadBuffer, even if the buffer is created with CL_MEM_USE_HOST_PTR. I don't think it is safe to skip the read.

nibal · ‎11-04-2015

boxerab wrote:
@nibal I believe I still need to call clEnqueueReadBuffer, even if the buffer is created with CL_MEM_USE_HOST_PTR. I don't think it is safe to skip the read.

You can try reading from the host pointer given to buffer X. You will find your results there. Using map/unmap will ensure your data is synchronized.

If you insist on calling ReadBuffer, maybe using CL_MEM_USE_PERSISTENT_MEM_AMD would be better than CL_MEM_USE_HOST_PTR. This way the kernel does a fast write to the GPU's global memory, and then you can copy to host memory. Where is your pipeline?

boxerab · ‎11-04-2015

Thanks. I will look into CL_MEM_USE_PERSISTENT_MEM_AMD.

nibal · ‎11-04-2015

boxerab wrote:
Thanks. I will look into CL_MEM_USE_PERSISTENT_MEM_AMD.

Actually, in terms of performance, the optimization guide recommends using CL_MEM_USE_HOST_PTR with clEnqueueMapBuffer/Unmap. With a recent Catalyst driver this results in zero-copy. You save time by having the kernel write directly to host memory and skipping the write to GPU memory.

boxerab · ‎11-05-2015

Thanks, nibal. I tried this; no difference in perf on my HD7700, but I will leave the code in and test when I upgrade to a newer card.

nibal · ‎11-05-2015

boxerab wrote:
Thanks, nibal. I tried this; no difference in perf on my HD7700, but I will leave the code in and test when I upgrade to a newer card.

You can read about it in the Optimization guide, 1.3.3 "Memory Allocation", p. 1-9. There is a little trick to it. You start the map before the NDRangeKernel. You keep the mapping in place while the kernel runs, and the kernel gradually fills your host buffer. When the kernel is done (event), you process the data on the host (no sync needed), and when you are done you unmap.

Granted, the time saved is only a very fast write to GPU global memory, but it is time saved. Sometimes the performance gain is difficult to notice. If your host code runs for 30 seconds (as in a typical frequency scan) and your OpenCL code runs 300 times for 0.4 seconds in total on a video card that is also updating your display, you are mostly observing random timings. The CodeXL profiler is a better option.

You should know that you can use kernel pipelines in OpenCL 2.0, where kernel A sends output to kernel B while A is still running. In the beginning I thought you were asking for help with that.

boxerab · ‎11-06-2015

Thanks for this, nibal. I tried this trick, but it didn't seem to work correctly for my application. That is fine, because I am pretty happy with performance with my current design.

dipak · ‎11-04-2015

Yes, CL_MEM_USE_HOST_PTR does not by itself guarantee that the host will always have the updated data. Per the spec, OpenCL implementations are allowed to cache the buffer contents in device memory. This cached copy can be used when kernels are executed on a device. That is why, in order to access the latest contents on the host, it is recommended to use either a clEnqueueReadBuffer or a clEnqueueMapBuffer command.

Regards,

boxerab · ‎11-04-2015

Thanks, Dipak. Is there any performance difference in this case between clEnqueueReadBuffer and clEnqueueMapBuffer ?

dipak · ‎11-04-2015

Once the read is complete, you can start the kernel and also process the data on the host, as long as the data pointers point to different memory regions. The only restriction applies when the host pointer you pass to clEnqueueReadBuffer is the same one that was used to create the buffer. As the clEnqueueReadBuffer documentation states:

Calling clEnqueueReadBuffer to read a region of the buffer object with the ptr argument value set to host_ptr + offset, where host_ptr is a pointer to the memory region specified when the buffer object being read is created with CL_MEM_USE_HOST_PTR, must meet the following requirements in order to avoid undefined behavior:
All commands that use this buffer object or a memory object (buffer or image) created from this buffer object have finished execution before the read command begins execution.
The buffer object or memory objects created from this buffer object are not mapped.
The buffer object or memory objects created from this buffer object are not used by any command-queue until the read command has finished execution.
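
The requirements quoted above only come into play when the ptr argument falls inside the region that was given as host_ptr. A minimal sketch of a debug-time check for that case follows. It is plain C with no OpenCL calls, and regions_overlap and checked_read_dst are made-up helper names, not part of the OpenCL API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if the byte ranges [a, a + a_len) and
   [b, b + b_len) share at least one byte. Empty ranges never overlap. */
bool regions_overlap(const void *a, size_t a_len,
                     const void *b, size_t b_len)
{
    uintptr_t a0 = (uintptr_t)a;
    uintptr_t b0 = (uintptr_t)b;

    if (a_len == 0 || b_len == 0)
        return false;
    if (a0 < b0)
        return b0 - a0 < a_len;   /* b starts inside a */
    return a0 - b0 < b_len;       /* a starts inside b (or at b) */
}

/* Hypothetical wrapper: returns dst unchanged, but trips an assert in
   debug builds when the read destination aliases the memory that was
   passed as host_ptr at buffer creation. */
void *checked_read_dst(void *dst, size_t len,
                       const void *host_ptr, size_t buf_size)
{
    assert(!regions_overlap(dst, len, host_ptr, buf_size));
    return dst;
}
```

If the destination passed to the read goes through checked_read_dst first, a read into a separate staging allocation passes silently, while a read that targets the buffer's own host memory is flagged early instead of surfacing later as corrupted data.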

Regards,

Kernel pipeline