In APP SDK 2.6, a preview of async copy was implemented (enabled by setting GPU_ASYNC_MEM_COPY to 2), but I failed to make it work. There are no clear examples or documentation for the feature, so I'm probably doing something wrong but can't figure out what exactly.
No changes related to async copy were announced in either the SDK 2.7 or the SDK 2.8 release notes.
Could anyone from AMD please comment on the status of the feature? Is it possible to overlap a DMA transfer with the execution of a compute kernel (without using the CPU as well, i.e. not the way it is demonstrated in the TransferOverlap SDK sample)? If it is possible, which hardware supports it (Evergreen? Northern Islands? Southern Islands?), and what are the exact instructions to make it work? Is it possible to see the overlap in APP Profiler, and what's the best way to test and debug it?