
timchist
Elite

Transfer from host to GPU via DMA parallel to kernel execution

My application runs the same sequence of kernels several times for different chunks of data.

I'm trying to overlap transfers with kernel execution: specifically, I want to transfer the next piece of data while the current one is being processed.

The input data is stored in 4K-aligned host memory for which clCreateBuffer was called with the CL_MEM_USE_HOST_PTR flag (i.e. in pinned memory). The transfers to the GPU are done using clEnqueueCopyBuffer calls, which are indeed faster than calling clEnqueueWriteBuffer for regular host memory blocks.
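
For illustration, the allocation and transfer path I'm describing looks roughly like this (a sketch only; ctx, queue and devBuf are assumed to exist elsewhere, error checks omitted):

#include <CL/cl.h>
#include <stdlib.h>

#define CHUNK_BYTES (32 * 1024 * 1024)                   /* 4K-aligned size */

void *hostPtr = NULL;
posix_memalign(&hostPtr, 4096, CHUNK_BYTES);             /* 4K-aligned host block */
/* ... fill hostPtr with input data ... */

cl_int err;
cl_mem hostBuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                CHUNK_BYTES, hostPtr, &err);   /* pre-pinned wrapper */

/* copy from the pinned host buffer into an ordinary device buffer */
clEnqueueCopyBuffer(queue, hostBuf, devBuf, 0, 0, CHUNK_BYTES, 0, NULL, NULL);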

In order to overlap transfer with compute, I'm trying to use two different queues (in the same manner as streams are used in CUDA). However, this results in sequential execution. The only sample in the SDK I found on the subject is TransferOverlap, but using CL_MEM_USE_PERSISTENT_MEM_AMD does not seem to be a viable option in my case: there is no way to get the input data into such buffers straight away, so I would have to copy to that memory on the host first. This results in a CPU load spike and doubles memory consumption; in addition, the host->GPU transfer rate will be suboptimal.
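
For reference, the pattern I'm trying is roughly the following (a sketch with made-up names ctx, device, hostBuf0/1, devSrc0/1, kernel, globalSize, CHUNK_BYTES; kernel-argument setup and error checks omitted):

cl_int err;
cl_command_queue copyQueue    = clCreateCommandQueue(ctx, device, 0, &err);
cl_command_queue computeQueue = clCreateCommandQueue(ctx, device, 0, &err);

cl_event copy0;
/* copy chunk 0, then process it on computeQueue while chunk 1 is copied on copyQueue */
clEnqueueCopyBuffer(copyQueue, hostBuf0, devSrc0, 0, 0, CHUNK_BYTES, 0, NULL, &copy0);
clEnqueueNDRangeKernel(computeQueue, kernel, 1, NULL, &globalSize, NULL, 1, &copy0, NULL);
clEnqueueCopyBuffer(copyQueue, hostBuf1, devSrc1, 0, 0, CHUNK_BYTES, 0, NULL, NULL);
clFlush(copyQueue);
clFlush(computeQueue);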

On CUDA the same technique works nicely: devices with compute capability 1.1-3.0 have two hardware queues, one of which executes transfers to and from the host while the other executes kernels. Commands from the two queues can run in parallel (and this does work in practice).

Are AMD GPUs capable of transferring data to the GPU in parallel with executing kernels when the input data is not stored in CL_MEM_USE_PERSISTENT_MEM_AMD buffers?

0 Likes
3 Replies
timchist
Elite

Not sure whether this is still required in APP SDK 2.7, but I tried setting GPU_ASYNC_MEM_COPY to 2 with no effect. Maybe I'm using it wrong. What are the other conditions for this to work (apart from having transfer and compute in different queues)? For example, is it required to pin the host memory from which data is transferred? Should any specific method of transfer be used?
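
For concreteness, by pinning I mean something like the following (a sketch; whether this is actually the right way to obtain pinned memory here is part of my question, and ctx, queue, copyQueue, devSrc and CHUNK_BYTES are made-up names):

cl_int err;
cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE,
                               CHUNK_BYTES, NULL, &err);
/* map once to get a host pointer backed by the runtime's pinned allocation */
void *pinnedPtr = clEnqueueMapBuffer(queue, pinned, CL_TRUE, CL_MAP_WRITE,
                                     0, CHUNK_BYTES, 0, NULL, NULL, &err);
/* ... fill pinnedPtr with input data ... */
/* non-blocking write from the pinned pointer, issued on the copy queue */
clEnqueueWriteBuffer(copyQueue, devSrc, CL_FALSE, 0, CHUNK_BYTES,
                     pinnedPtr, 0, NULL, NULL);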

0 Likes

I'm also confused about overlapping DMA with a GPU compute kernel. The SDK sample only overlaps a memcpy, and that is not what I want: I want to overlap clEnqueueReadBufferRect/clEnqueueWriteBufferRect with a GPU kernel.
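
Roughly what I want is the following (a sketch only, with made-up names copyQueue, computeQueue, devBuf, hostPtr, kernel, globalSize and placeholder sizes; error checks omitted): a non-blocking rect transfer on one queue while a kernel runs on another.

size_t bufOrigin[3]  = {0, 0, 0};
size_t hostOrigin[3] = {0, 0, 0};
size_t region[3]     = {ROW_BYTES, ROW_COUNT, 1};      /* x extent is in bytes */

clEnqueueWriteBufferRect(copyQueue, devBuf, CL_FALSE,
                         bufOrigin, hostOrigin, region,
                         0, 0,                          /* buffer row/slice pitch (0 = tightly packed) */
                         HOST_ROW_PITCH, 0,             /* host row/slice pitch */
                         hostPtr, 0, NULL, NULL);
clEnqueueNDRangeKernel(computeQueue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
clFlush(copyQueue);
clFlush(computeQueue);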

0 Likes
timchist
Elite

Another update: after figuring out that profiling needs to be disabled in order to allow DMA transfers (which means I can't use GPU events to collect performance data), I tried the following simple example (a code sketch of these steps follows the list):

1. Allocation:

hostBuffer1 (CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR); the memory pointer passed to clCreateBuffer is 4K-aligned, and the length is 4K-aligned as well

hostBuffer2 -- same parameters as for hostBuffer1

src1, dst1, src2, dst2 -- device buffers, CL_MEM_READ_WRITE

2. Queue #1:

clEnqueueCopyBuffer (hostBuffer1 --> src1)

clEnqueueNDRangeKernel (read src1, write to dst1)

clEnqueueNDRangeKernel (read dst1, write to src1)

3. Queue #2:

clEnqueueCopyBuffer (hostBuffer2 --> src2)

clEnqueueNDRangeKernel (read src2, write to dst2)

clEnqueueNDRangeKernel (read dst2, write to src2)

4. Flush Queue #1 and Queue #2

5. Call clFinish for both queues

6. Measure CPU time for steps 2 to 5
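
In code, steps 1-6 look roughly like this (a condensed sketch of the attached sample; kernel-argument setup, data initialisation, timing and error checks are omitted, and kernelFwd1/kernelBwd1/kernelFwd2/kernelBwd2 are placeholder names):

/* 1. Allocation: hostPtr1/hostPtr2 and size are 4K-aligned */
cl_mem hostBuffer1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, hostPtr1, &err);
cl_mem hostBuffer2 = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, size, hostPtr2, &err);
cl_mem src1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
cl_mem dst1 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
cl_mem src2 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);
cl_mem dst2 = clCreateBuffer(ctx, CL_MEM_READ_WRITE, size, NULL, &err);

/* 2. Queue #1: transfer, then two kernels over the transferred data */
clEnqueueCopyBuffer(queue1, hostBuffer1, src1, 0, 0, size, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue1, kernelFwd1, 1, NULL, &gsz, NULL, 0, NULL, NULL); /* src1 -> dst1 */
clEnqueueNDRangeKernel(queue1, kernelBwd1, 1, NULL, &gsz, NULL, 0, NULL, NULL); /* dst1 -> src1 */

/* 3. Queue #2: same sequence for the second chunk */
clEnqueueCopyBuffer(queue2, hostBuffer2, src2, 0, 0, size, 0, NULL, NULL);
clEnqueueNDRangeKernel(queue2, kernelFwd2, 1, NULL, &gsz, NULL, 0, NULL, NULL); /* src2 -> dst2 */
clEnqueueNDRangeKernel(queue2, kernelBwd2, 1, NULL, &gsz, NULL, 0, NULL, NULL); /* dst2 -> src2 */

/* 4. Flush both queues so they start executing */
clFlush(queue1);
clFlush(queue2);

/* 5. Wait for completion */
clFinish(queue1);
clFinish(queue2);

/* 6. CPU time is measured around steps 2-5 (GPU profiling events are disabled) */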

I expected that the execution of kernels in Queue #1 would overlap with transferring data in Queue #2, but it looks like this is not the case.

The example above takes 63 ms on my computer (HD 5850, 32 MB buffer). The time does not change if I set GPU_ASYNC_MEM_COPY to 2.

If I issue all commands to a single queue (for example, to Queue #1), the execution time is 55 ms.

What am I missing? Why isn't it working as expected? The full sample code is attached.

0 Likes