APP SDK 2.6 introduced a preview of async copy (enabled by setting GPU_ASYNC_MEM_COPY to 2), but I could not make it work (there are no clear examples or documentation for the feature; I'm probably doing something wrong, but I can't figure out what exactly).
No changes related to async copy were announced in either the SDK 2.7 or the SDK 2.8 release notes.
Could anyone from AMD please comment on the status of the feature? Is it possible to overlap a DMA transfer with the execution of a compute kernel (without using the CPU, i.e. not the way it is demonstrated in the TransferOverlap SDK sample)? If it is possible, which hardware supports it (Evergreen? Northern Islands? Southern Islands?), and what are the exact steps to make it work? Is it possible to see the overlap in APP Profiler, and what is the best way to test and debug it?
http://devgurus.amd.com/thread/159452
This thread might be helpful for you.
Thanks binying, I have already read that thread. It's not really about overlapping with the DMA engine, it's about CPU-GPU overlap.
I checked the APP programming guide section on DMA transfers:
"
Direct Memory Access (DMA) memory transfers can be executed separately from
the command queue using the DMA engine on the GPU compute device. DMA
calls are executed immediately; and the order of DMA calls and command queue
flushes is guaranteed.
DMA transfers can occur asynchronously. This means that a DMA transfer is
executed concurrently with other system or GPU compute operations when there
are no dependencies. However, data is not guaranteed to be ready until the DMA
engine signals that the event or transfer is completed. The application can query
the hardware for DMA event completion. If used carefully, DMA transfers are
another source of parallelization.
"
So, a DMA copy from an ALLOC_HOST_PTR buffer (through the clEnqueueCopyBuffer API) followed by a kernel execution that does not depend on that copy can execute simultaneously. This is on the same command queue.
Alternatively, from what I hear from AMD engineers, DMA and kernel execution can also overlap across multiple command queues.
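To make the single-queue pattern concrete, here is a rough sketch of what I mean (the kernel name, buffer names, and sizes are hypothetical, and all error checking is omitted for brevity; this is an illustration of the idea, not tested production code):

```c
#include <CL/cl.h>

/* Sketch: enqueue a DMA copy from a pinned (ALLOC_HOST_PTR) buffer,
 * then an INDEPENDENT kernel, on the same in-order queue. Because
 * kernelA does not touch devSrc, the runtime is free to run the
 * copy on the DMA engine concurrently with the kernel. */
void enqueue_overlapped(cl_context ctx, cl_command_queue queue,
                        cl_kernel kernelA, size_t size)
{
    cl_int err;
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);
    cl_mem devSrc = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  size, NULL, &err);
    cl_mem devDst = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, size, NULL, &err);

    /* DMA transfer: pinned host-visible buffer -> device buffer. */
    clEnqueueCopyBuffer(queue, pinned, devSrc, 0, 0, size, 0, NULL, NULL);

    /* kernelA reads/writes only devDst, so it has no dependency
     * on the copy above. */
    clSetKernelArg(kernelA, 0, sizeof(cl_mem), &devDst);
    size_t gws = size / sizeof(float);
    clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &gws, NULL, 0, NULL, NULL);

    clFinish(queue);
}
```

Whether the runtime actually overlaps the two commands is of course up to the driver and hardware; the point is only that nothing in the code forces them to serialize.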
Thanks Himanshu, I'm aware of this passage. My own experiments, however, show that double-buffering using two different queues (sending data for the next chunk while the previous one is being processed) performs worse than seemingly sequential code. Without profiler support for asynchronous operations it is extremely difficult to tell what is happening inside in either case.
http://devgurus.amd.com/message/1286737#1286737 - This thread has info on how to disable ASYNC DMA.
You can probably use this to test your application, but the developer has also warned about possible issues with this setting.
For your convenience, I am reproducing the answer here.
German Andryeyev wrote:
There is a possibility to disable the accelerated DMA transfers with DRMDMA/SDMA engines. However that code path may have other issues and in general lower performance. You can try it with "set CAL_ENABLE_ASYNC_DMA=0".
Please use caution with this environment variable. It disables a lot of other things as well, so it may lead to other issues. Just be aware.
BTW, I made a sample yesterday, and it looks like DMAs involving ALLOC_HOST_PTR can be overlapped with independent kernel executions. Did you use ALLOC_HOST_PTR in your samples? You may want to check.
But then, the overlap time and the kernel execution time should match reasonably well. Otherwise, there is no point in doing this exercise.
I made a sample yesterday, and it looks like DMAs involving ALLOC_HOST_PTR can be overlapped with independent kernel executions. Did you use ALLOC_HOST_PTR in your samples? You may want to check.
How did you find out that DMA is overlapped with execution?
I use CL_MEM_USE_HOST_PTR. ALLOC_HOST_PTR is not suitable for my purposes.
Here are some tips on how to get started:
Basically if you know the execution time of every command, then time_with_overlap < sum_time_of_each_command.
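That check is trivial to express in host code. A minimal helper (the function name and the numbers in the usage note are mine, made up for illustration):

```c
#include <assert.h>

/* Returns 1 if the measured wall time indicates overlap: the whole
 * batch finished faster than the sum of its individual commands
 * would take sequentially. */
static int overlap_detected(double time_with_overlap,
                            const double *cmd_times, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += cmd_times[i];
    return time_with_overlap < sum;
}
```

For example, with a 50 ms transfer and a 100 ms kernel, the sequential total is 150 ms; if the whole batch is measured at around 100 ms, the commands must have overlapped.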
Hi German,
Thanks for the details.
Regarding UHP, the developer really cannot control or predict when the runtime will transfer the data.
That's where the confusion starts.
So, I may have to avoid using such a UHP buffer as a kernel argument.
Instead, I will have to use it only for transfers, e.g. via clEnqueueCopyBuffer() or clEnqueueWriteBuffer().
I hope this is what you are also trying to say.
Experiments show that UHP can be a lot slower.
I will post sample code for this.
I was transferring 16 MB of data, and it takes quite a lot of time.
This is quite understandable, because UHP memory is pinned but physically discontiguous (correct me here).
So, the driver has to use either scatter-gather lists or a series of DMA operations (correct me here).
Thanks for all other details. They are very useful.
Please continue to share these tips with us so that we get to understand things better.
Experiments show that UHP can be a lot slower.
UHP can't be slower than AHP. Something is wrong in your test. Check the pointer's alignment. I believe that in the publicly available drivers the runtime has an alignment restriction of at least 256 bytes. In the latest runtime the alignment requirement was relaxed to the data type size.
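A quick way to verify and enforce that alignment on the host side before passing a pointer with CL_MEM_USE_HOST_PTR (helper names are mine; the 256-byte figure follows the comment above, and posix_memalign is the POSIX call; on Windows one would use _aligned_malloc instead):

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if ptr is aligned to 'alignment' bytes
 * (alignment must be a power of two). */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Allocate a 256-byte-aligned host block intended for use
 * as a CL_MEM_USE_HOST_PTR backing store. Returns NULL on
 * failure; free with free(). */
static void *alloc_uhp(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 256, bytes) != 0)
        return NULL;
    return p;
}
```

Checking is_aligned(ptr, 256) before creating the buffer would rule out the misalignment case German describes.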
German Andryeyev wrote:
I can provide the basic concepts. However the real behavior may depend on the application's logic and GPU/CPU speed, since the performance bottlenecks can migrate from transfers to kernels.
Thanks German. I was experimenting with 2 queues in my production code (which is too complex to be posted here), but the execution time only got worse. As it is currently not possible to look inside the GPU and find out whether overlap actually takes place, and where things go wrong if it does not, I just had to undo all my changes.
I basically had the following classic double-buffering scenario:
Queue 1:
- enqueue transfer of data chunk 1 for a kernel execution
Queue 2:
- enqueue transfer of data chunk 2 for a kernel execution
Queue 1:
- enqueue execution of a kernel on data chunk 1 (hoped that it will overlap with data transfer 2)
- enqueue transfer of data chunk 3 for a kernel execution
Queue 2:
- enqueue execution of a kernel on data chunk 2 (hoped that it will overlap with data transfer 3)
- enqueue transfer of data chunk 4 for a kernel execution
...
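The loop above can be sketched roughly as follows (buffer array, kernel, sizes and NUM_CHUNKS are all hypothetical, and error checking is omitted; this only shows the enqueue order, not a tested implementation):

```c
#include <CL/cl.h>

/* Sketch of the double-buffering scheme: chunks alternate between
 * two in-order queues, hoping that the kernel on one queue overlaps
 * with the transfer already queued on the other. */
void double_buffer(cl_command_queue queue1, cl_command_queue queue2,
                   cl_kernel kernel, cl_mem chunkBuf[2],
                   const char *hostData, size_t chunkSize, int numChunks)
{
    size_t gws = chunkSize / sizeof(float);
    for (int i = 0; i < numChunks; ++i) {
        cl_command_queue q = (i % 2 == 0) ? queue1 : queue2;

        /* Non-blocking transfer of chunk i on its queue... */
        clEnqueueWriteBuffer(q, chunkBuf[i % 2], CL_FALSE, 0,
                             chunkSize, hostData + (size_t)i * chunkSize,
                             0, NULL, NULL);

        /* ...then the kernel on the SAME queue, so the in-order queue
         * itself enforces the transfer->execute dependency for this
         * chunk, while the other queue's work can overlap. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &chunkBuf[i % 2]);
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL,
                               0, NULL, NULL);
    }
    clFinish(queue1);
    clFinish(queue2);
}
```

Reusing each buffer only on its own in-order queue avoids the need for cross-queue events in this particular layout.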
I will re-attempt my experiments soon and will try to start with something simple and gradually increase complexity.
German Andryeyev wrote:
- However SI has 2 independent main CPs and runtime pairs them with DMA engines. So the application can still execute kernels on one CP, while another will be synced with a DRM engine for profiling and you should be able to profile it with APP or OCL profiling.
I did not know that. Please correct me if I am wrong: if I dedicate one queue exclusively to transfers, the other one only to executions and use OCL events so that execution only starts when corresponding data transfers are complete, then I can enable profiling for both queues and actually see overlaps in APP.
timchist wrote:
I did not know that. Please correct me if I am wrong: if I dedicate one queue exclusively to transfers, the other one only to executions and use OCL events so that execution only starts when corresponding data transfers are complete, then I can enable profiling for both queues and actually see overlaps in APP.
That's correct, but it's not a requirement; it just simplifies the logic. Also, as the first step of your experiments, don't sync the 2 queues, and measure the time of each command. Let's say: transfer - 50 ms, kernel - 100 ms, so total = 150 ms, and the expected time with overlap = 100 ms.
Basically you can submit kernels to the second queue as well. Everything will be running asynchronously on SI and you should be able to see it in APP. Even your original test should work fine on SI under APP and you should see overlaps. Make sure UHP pointers are aligned and you have the latest driver.
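For reference, the dedicated-queue variant discussed above (one queue for transfers, one for kernels, synchronized with an OpenCL event) might look roughly like this per chunk; names are hypothetical and error checking is omitted:

```c
#include <CL/cl.h>

/* Sketch: transferQueue only moves data, execQueue only runs kernels.
 * The event makes the kernel wait for exactly its own chunk's copy,
 * without serializing the two queues as a whole. */
void process_chunk(cl_command_queue transferQueue,
                   cl_command_queue execQueue,
                   cl_kernel kernel, cl_mem devBuf,
                   const void *hostPtr, size_t chunkSize)
{
    cl_event xferDone;
    clEnqueueWriteBuffer(transferQueue, devBuf, CL_FALSE, 0, chunkSize,
                         hostPtr, 0, NULL, &xferDone);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &devBuf);
    size_t gws = chunkSize / sizeof(float);
    clEnqueueNDRangeKernel(execQueue, kernel, 1, NULL, &gws, NULL,
                           1, &xferDone, NULL);  /* waits on the copy */
    clReleaseEvent(xferDone);
}
```

With profiling enabled on both queues, the timestamps of the write and kernel events should then show whether the copy for chunk N+1 really ran under the kernel for chunk N.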