Not sure whether this is still required in APP SDK 2.7, but I tried setting GPU_ASYNC_MEM_COPY to 2 with no effect. Maybe I'm using it wrong. What are the other conditions for this to work (besides having transfer and compute in different queues)? E.g., does the host memory that data is transferred from have to be pinned? Does a specific transfer method have to be used?
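For reference, here is roughly what I'd expect the pinning path to look like, if pinning is indeed required. This is just a sketch of my understanding, not something the docs confirm: `ctx` and `queue` are assumed to be a valid context and queue, and error handling is omitted. My assumption is that the runtime pre-pins host memory behind buffers created with CL_MEM_ALLOC_HOST_PTR:

```cpp
// Sketch (assumption): allocate a staging buffer the runtime can pre-pin
// via CL_MEM_ALLOC_HOST_PTR, then map it to obtain the host pointer.
cl_int err;
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                               bytes, NULL, &err);

// Blocking map: the returned pointer is backed by the (pinned) allocation.
void* host = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                0, bytes, 0, NULL, NULL, &err);
// ... fill `host`, unmap, then use `pinned` as the source of DMA copies ...
```

Is that the intended usage, or does CL_MEM_USE_HOST_PTR with an aligned pointer (as in my example below) already qualify?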
I'm also confused about overlapping DMA with a GPU compute kernel. The SDK sample only overlaps memcpy, which is not what I want: I want to overlap clEnqueueRead/WriteBufferRect with a GPU kernel.
Another update: after figuring out that profiling must be disabled to allow DMA transfers (which means I can't use GPU events to collect performance data), I tried the following simple example:
1. Create the buffers:
hostBuffer1 -- CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR; the host pointer passed to clCreateBuffer is 4K-aligned, and the length is 4K-aligned as well
hostBuffer2 -- same parameters as hostBuffer1
src1, dst1, src2, dst2 -- device buffers, CL_MEM_READ_WRITE
2. Queue #1:
clEnqueueCopyBuffer (hostBuffer1 --> src1)
clEnqueueNDRangeKernel (read src1, write to dst1)
clEnqueueNDRangeKernel (read dst1, write to src1)
3. Queue #2:
clEnqueueCopyBuffer (hostBuffer2 --> src2)
clEnqueueNDRangeKernel (read src2, write to dst2)
clEnqueueNDRangeKernel (read dst2, write to src2)
4. Flush Queue #1 and Queue #2
5. Call clFinish for both queues
6. Measure CPU time for steps 2 to 5
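In host code, steps 2 to 5 look roughly like this (a sketch from the attached sample, not the full code: `queue1`/`queue2` are two in-order queues on the same device, `bytes` and `gws` are the buffer size and global work size, and kernel argument setup and error checks are omitted):

```cpp
// Queue #1: host-to-device copy, then two dependent kernels
clEnqueueCopyBuffer(queue1, hostBuffer1, src1, 0, 0, bytes, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue1, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL); // src1 -> dst1
clEnqueueNDRangeKernel(queue1, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL); // dst1 -> src1

// Queue #2: same pattern with the second set of buffers
clEnqueueCopyBuffer(queue2, hostBuffer2, src2, 0, 0, bytes, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue2, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL); // src2 -> dst2
clEnqueueNDRangeKernel(queue2, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL); // dst2 -> src2

// Submit both batches to the device, then wait for completion
clFlush(queue1);
clFlush(queue2);
clFinish(queue1);
clFinish(queue2);
```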
I expected that the execution of kernels in Queue #1 would overlap with transferring data in Queue #2, but it looks like this is not the case.
The example above takes 63 ms on my computer (HD 5850, 32M buffers). The time does not change if I set GPU_ASYNC_MEM_COPY to 2.
If I issue all commands to a single queue (for example, to Queue #1), the execution time is 55 ms.
What am I missing? Why isn't it working as expected? The full sample code is attached.
opencl_overlap.cpp.zip 2.9 KB