
timchist
Elite

Any improvements in compute-transfer overlap in APP SDK 2.8?

In APP SDK 2.6, a preview of async copy was implemented (enabled by setting GPU_ASYNC_MEM_COPY to 2), but I failed to make it work (there are no clear examples or documentation for the feature; I'm probably doing something wrong, but I can't figure out what exactly).

No changes related to async copy were announced in either the SDK 2.7 or the SDK 2.8 release notes.

Could anyone from AMD please comment on the status of this feature? Is it possible to overlap a DMA transfer with the execution of a compute kernel (without involving the CPU, i.e., not as demonstrated in the TransferOverlap SDK sample)? If it is possible, which hardware supports it (Evergreen? Northern Islands? Southern Islands?), and what are the exact steps to make it work? Is it possible to see the overlap in APP Profiler, and what is the best way to test and debug it?

13 Replies
binying
Challenger

Re: Any improvements in compute-transfer overlap in APP SDK 2.8?

http://devgurus.amd.com/thread/159452

This thread might be helpful for you.

timchist
Elite

Re: Any improvements in compute-transfer overlap in APP SDK 2.8?

Thanks binying, I have already read that thread. It's not really about overlapping with the DMA engine, though; it's about CPU-GPU overlap.

himanshu_gautam
Grandmaster

Re: Any improvements in compute-transfer overlap in APP SDK 2.8?

I checked the APP programming guide section on DMA transfers.

"
Direct Memory Access (DMA) memory transfers can be executed separately from the command queue using the DMA engine on the GPU compute device. DMA calls are executed immediately; and the order of DMA calls and command queue flushes is guaranteed.

DMA transfers can occur asynchronously. This means that a DMA transfer is executed concurrently with other system or GPU compute operations when there are no dependencies. However, data is not guaranteed to be ready until the DMA engine signals that the event or transfer is completed. The application can query the hardware for DMA event completion. If used carefully, DMA transfers are another source of parallelization.
"

So a DMA transfer from an ALLOC_HOST_PTR buffer (through the clEnqueueCopyBuffer API), followed by a kernel execution that does not depend on that transfer, can execute simultaneously - even on the same command queue.

Alternatively, from what I hear from AMD engineers, DMA and kernel execution can also overlap across multiple command queues.
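To illustrate the same-queue case, here is a minimal sketch. It is only a sketch: the trivial doubling kernel, the 16 MB buffer size, and the near-total lack of error checking are my own assumptions, and whether the copy and the kernel actually overlap is entirely up to the driver and hardware.

```c
/* Sketch: enqueue a copy from a prepinned ALLOC_HOST_PTR buffer and an
 * independent kernel on the same in-order queue. The kernel touches only
 * `other`, so the runtime is free to run it concurrently with the DMA. */
#include <stdio.h>
#include <CL/cl.h>

#define N (16 * 1024 * 1024 / sizeof(float))   /* 16 MB of floats */

static const char *src =
    "__kernel void process(__global float *a) {"
    "    size_t i = get_global_id(0);"
    "    a[i] = a[i] * 2.0f + 1.0f;"
    "}";

int main(void) {
    cl_platform_id plat; cl_device_id dev; cl_int err;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_GPU, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel process = clCreateKernel(prog, "process", &err);

    size_t bytes = N * sizeof(float);
    /* Prepinned host staging buffer: the DMA engine can read it directly. */
    cl_mem host_buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
    cl_mem dev_buf  = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    cl_mem other    = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);

    /* DMA copy host_buf -> dev_buf ... */
    clEnqueueCopyBuffer(q, host_buf, dev_buf, 0, 0, bytes, 0, NULL, NULL);

    /* ... and a kernel with no dependency on that copy. */
    size_t gws = N;
    clSetKernelArg(process, 0, sizeof(cl_mem), &other);
    clEnqueueNDRangeKernel(q, process, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clFinish(q);
    printf("done\n");
    return 0;
}
```

Whether the two commands overlapped cannot be seen from this program alone; you would compare its wall time against a version where the kernel is made dependent on the copy.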

timchist
Elite

Re: Any improvements in compute-transfer overlap in APP SDK 2.8?

Thanks Himanshu, I'm aware of this passage. My own experiments, however, show that double-buffering using two different queues (sending data for the next chunk while the previous one is being processed) performs worse than the seemingly sequential code. Without profiler support for asynchronous operations it is extremely difficult to tell what is happening inside in either case.

himanshu_gautam
Grandmaster

Re: Any improvements in compute-transfer overlap in APP SDK 2.8?

http://devgurus.amd.com/message/1286737#1286737 - this thread has info on how to disable ASYNC DMA.

You can probably use this to test your application. But the developer has also warned about possible issues with this setting.

For your convenience, I am reproducing the answer here.

<german andryeyev>

There is a possibility to disable the accelerated DMA transfers with DRMDMA/SDMA engines. However that code path may have other issues and in general lower performance. You can try it with "set CAL_ENABLE_ASYNC_DMA=0".

</german andryeyev>

himanshu_gautam
Grandmaster

Re: Any improvements in compute-transfer overlap in APP SDK 2.8?

Please use caution with this environment variable. It disables a lot of other things as well, so it may lead to other issues - you never know. Just be aware.

btw,

I made a sample yesterday, and it looks like DMA transfers involving ALLOC_HOST_PTR can be overlapped with independent kernel executions. Did you use ALLOC_HOST_PTR in your samples? You may want to check that out.

But then, the transfer time and the kernel execution time should be reasonably comparable. Otherwise, there is no point in doing this exercise.

timchist
Elite

Re: Any improvements in compute-transfer overlap in APP SDK 2.8?

"I made a sample yesterday, and it looks like DMA transfers involving ALLOC_HOST_PTR can be overlapped with independent kernel executions. Did you use ALLOC_HOST_PTR in your samples? You may want to check that out."

How did you find out that DMA is overlapped with execution?

I use CL_MEM_USE_HOST_PTR. ALLOC_HOST_PTR is not suitable for my purposes.

german
Staff

Re: Any improvements in compute-transfer overlap in APP SDK 2.8?

  • The feature is enabled by default. You don't have to use the environment variable.
  • Yes, it's possible to overlap a DMA transfer with a kernel execution.
  • All AMD ASICs (Evergreen, NI, SI) support async transfers. On top of that, SI has 2 DMA engines, so you may be able to have simultaneous bidirectional transfers over the PCIe bus - and hence double the bandwidth. The app will need 3 OCL queues for that (kernel execution, read, and write).
  • I can provide the basic concepts. However, the real behavior may depend on the application's logic and GPU/CPU speed, since the performance bottlenecks can migrate from transfers to kernels.

           Here are the tips on how to get started:

    • Allocate at least 2 command queues.
    • Use the prepinned mechanism for the buffers in host memory. You can use UHP/AHP (USE_HOST_PTR/ALLOC_HOST_PTR) buffer allocations. Remember: UHP is always cacheable; AHP can be USWC or cacheable and can be controlled with CL_MEM_HOST_WRITE_ONLY and CL_MEM_HOST_READ_ONLY.
    • Submit transfers to one queue and kernel executions to another queue.
    • Start with independent commands in both queues. You can add dependencies between commands in the 2 queues later with OCL events.
    • Measure the performance of your test with CPU counters. Don't use OCL profiling. To find out whether the application really runs asynchronously, build a dependent execution with OCL events. That's a "generic" solution, but there is an exception where you can enable profiling and still have overlapped transfers (I will mention it below).
    • Unfortunately, DRMDMA engines don't support timestamps ("GPU counters"). In order to get OCL profiling data, the runtime has to synchronize the main command processor (CP) with the DMA engine, and that disables overlap. However, SI has 2 independent main CPs, and the runtime pairs them with the DMA engines. So the application can still execute kernels on one CP while the other is synced with a DRMDMA engine for profiling, and you should be able to profile it with APP or OCL profiling.

Basically, if you know the execution time of every command, then overlap is working when time_with_overlap < sum_time_of_each_command.
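The tips above can be sketched roughly as follows. Hedged: this is a fragment, not a full program - it assumes `ctx`, `dev`, a built kernel `process`, the constants `CHUNK` (bytes) and `NUM_CHUNKS`, and a global work size `gws` were set up earlier, with `<stdio.h>`, `<time.h>`, and `<CL/cl.h>` included; event handles from earlier iterations should also be released in real code.

```c
/* Double-buffering sketch: transfers on q_copy, kernels on q_exec.
 * OCL events express only the per-chunk dependencies, so chunk c's
 * kernel can overlap chunk c+1's DMA transfer. */
cl_int err;
cl_command_queue q_copy = clCreateCommandQueue(ctx, dev, 0, &err);
cl_command_queue q_exec = clCreateCommandQueue(ctx, dev, 0, &err);

cl_mem pinned[2], devbuf[2];
for (int i = 0; i < 2; ++i) {
    /* Prepinned AHP staging buffers (host-write-only allows USWC). */
    pinned[i] = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR | CL_MEM_HOST_WRITE_ONLY,
                               CHUNK, NULL, &err);
    devbuf[i] = clCreateBuffer(ctx, CL_MEM_READ_WRITE, CHUNK, NULL, &err);
}

/* Measure with CPU counters, not OCL profiling. */
struct timespec t0, t1;
clock_gettime(CLOCK_MONOTONIC, &t0);

cl_event copied[2] = { NULL, NULL }, kdone[2] = { NULL, NULL };
for (int c = 0; c < NUM_CHUNKS; ++c) {
    int b = c & 1;                        /* ping-pong between two buffers */
    /* The copy for chunk c must not overwrite devbuf[b] before the
     * kernel from chunk c-2 (which read it) has finished. */
    err = clEnqueueCopyBuffer(q_copy, pinned[b], devbuf[b], 0, 0, CHUNK,
                              kdone[b] ? 1 : 0,
                              kdone[b] ? &kdone[b] : NULL, &copied[b]);
    /* The kernel for chunk c waits only on its own copy, so it can run
     * while the next chunk's DMA is in flight on the other queue. */
    clSetKernelArg(process, 0, sizeof(cl_mem), &devbuf[b]);
    err = clEnqueueNDRangeKernel(q_exec, process, 1, NULL, &gws, NULL,
                                 1, &copied[b], &kdone[b]);
    clFlush(q_copy);                      /* kick both engines early */
    clFlush(q_exec);
}
clFinish(q_copy);
clFinish(q_exec);

clock_gettime(CLOCK_MONOTONIC, &t1);
double ms = (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6;
printf("wall time with overlap: %.2f ms\n", ms);
```

Comparing this wall time against the sum of each command's standalone time is then the test for whether overlap is actually happening.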

himanshu_gautam
Grandmaster

Re: Any improvements in compute-transfer overlap in APP SDK 2.8?

Hi German,

Thanks for the details.

Regarding UHP, the developer really cannot control or predict when the runtime will transfer the data.

That's where the confusion starts.

So I may have to avoid using such a UHP buffer as a kernel argument.

Instead, I will have to use it only for transfers - with clEnqueueCopyBuffer() or clEnqueueWriteBuffer().

I hope this is what you are also trying to say.

Experiments show that UHP can be a lot slower.

I will post sample code for this.

I was transferring 16 MB of data, and it took quite a lot of time.

This is quite understandable because UHP memory is pinned but physically discontiguous (correct me here).

So the driver has to use either scatter-gather lists or a series of DMA operations (correct me here).

Thanks for all other details. They are very useful.

Please continue to share these tips with us so that we get to understand things better.
