APP SDK 2.6 introduced a preview of async copy (enabled by setting GPU_ASYNC_MEM_COPY to 2), but I could not make it work (there are no clear examples or documentation for the feature; I'm probably doing something wrong, but I can't figure out what exactly).
No changes related to async copy were announced in either the SDK 2.7 or the SDK 2.8 release notes.
Could anyone from AMD please comment on the status of the feature? Is it possible to overlap a DMA transfer with the execution of a compute kernel (without using the CPU, i.e. not the way it is demonstrated in the TransferOverlap SDK sample)? If it is possible, which hardware supports it (Evergreen? Northern Islands? Southern Islands?), and what are the exact steps to make it work? Is it possible to see the overlap in APP Profiler, and what is the best way to test and debug it?
http://devgurus.amd.com/thread/159452
This thread might be helpful for you.
Thanks binying, I have already read that thread. It's not really about overlapping with the DMA engine, it's about CPU-GPU overlap.
I checked the APP programming guide section on DMA transfers:
"
Direct Memory Access (DMA) memory transfers can be executed separately from
the command queue using the DMA engine on the GPU compute device. DMA
calls are executed immediately; and the order of DMA calls and command queue
flushes is guaranteed.
DMA transfers can occur asynchronously. This means that a DMA transfer is
executed concurrently with other system or GPU compute operations when there
are no dependencies. However, data is not guaranteed to be ready until the DMA
engine signals that the event or transfer is completed. The application can query
the hardware for DMA event completion. If used carefully, DMA transfers are
another source of parallelization.
"
So, a DMA copy from an ALLOC_HOST_PTR buffer (through the clEnqueueCopyBuffer API) followed by a kernel execution that does not depend on that copy can execute simultaneously. This is on the same command queue.
Alternatively, from what I hear from AMD engineers, DMA and kernel execution can also overlap across multiple command queues.
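To make the single-queue pattern concrete, here is a rough sketch of what I mean (the kernel name, buffer names, and sizes are hypothetical, and all error checking is omitted for brevity; this is an illustration of the idea, not tested production code):

```c
#include <CL/cl.h>

/* Sketch: enqueue a DMA copy from a pinned (ALLOC_HOST_PTR) buffer,
 * then an INDEPENDENT kernel, on the same in-order queue. Because
 * kernelA does not touch devSrc, the runtime is free to run the
 * copy on the DMA engine concurrently with the kernel. */
void enqueue_overlapped(cl_context ctx, cl_command_queue queue,
                        cl_kernel kernelA, size_t size)
{
    cl_int err;
    cl_mem pinned = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);
    cl_mem devSrc = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  size, NULL, &err);
    cl_mem devDst = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, size, NULL, &err);

    /* DMA transfer: pinned host-visible buffer -> device buffer. */
    clEnqueueCopyBuffer(queue, pinned, devSrc, 0, 0, size, 0, NULL, NULL);

    /* kernelA reads/writes only devDst, so it has no dependency
     * on the copy above. */
    clSetKernelArg(kernelA, 0, sizeof(cl_mem), &devDst);
    size_t gws = size / sizeof(float);
    clEnqueueNDRangeKernel(queue, kernelA, 1, NULL, &gws, NULL, 0, NULL, NULL);

    clFinish(queue);
}
```

Whether the runtime actually overlaps the two commands is of course up to the driver and hardware; the point is only that nothing in the code forces them to serialize.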
Thanks Himanshu, I'm aware of this passage. My own experiments, however, show that double-buffering using two different queues (sending data for the next chunk while the previous one is being processed) performs worse than seemingly sequential code. Without profiler support for asynchronous operations it is extremely difficult to tell what is happening inside in either case.
http://devgurus.amd.com/message/1286737#1286737 - This thread has info on how to disable ASYNC DMA.
You can probably use this to test your application, but the developer has also warned about possible issues with this setting.
For your convenience, I am reproducing the answer here.
German Andryeyev wrote:
There is a possibility to disable the accelerated DMA transfers with DRMDMA/SDMA engines. However that code path may have other issues and in general lower performance. You can try it with "set CAL_ENABLE_ASYNC_DMA=0".
Please use caution with this environment variable. It disables a lot of other things as well, so it may lead to other issues. Just be aware.
BTW, I made a sample yesterday, and it looks like DMAs involving ALLOC_HOST_PTR can be overlapped with independent kernel executions. Did you use ALLOC_HOST_PTR in your samples? You may want to check.
But then, the overlap time and the kernel execution time should match reasonably well. Otherwise, there is no point in doing this exercise.
I made a sample yesterday, and it looks like DMAs involving ALLOC_HOST_PTR can be overlapped with independent kernel executions. Did you use ALLOC_HOST_PTR in your samples? You may want to check.
How did you find out that DMA is overlapped with execution?
I use CL_MEM_USE_HOST_PTR. ALLOC_HOST_PTR is not suitable for my purposes.
Here are some tips on how to get started:
Basically if you know the execution time of every command, then time_with_overlap < sum_time_of_each_command.
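That check is trivial to express in host code. A minimal helper (the function name and the numbers in the usage note are mine, made up for illustration):

```c
#include <assert.h>

/* Returns 1 if the measured wall time indicates overlap: the whole
 * batch finished faster than the sum of its individual commands
 * would take sequentially. */
static int overlap_detected(double time_with_overlap,
                            const double *cmd_times, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i)
        sum += cmd_times[i];
    return time_with_overlap < sum;
}
```

For example, with a 50 ms transfer and a 100 ms kernel, the sequential total is 150 ms; if the whole batch is measured at around 100 ms, the commands must have overlapped.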
Hi German,
Thanks for the details.
Regarding UHP, the developer really cannot control or predict when the runtime will transfer the data.
That's where the confusion starts.
So, I may have to avoid using such a UHP buffer as a kernel argument.
Instead, I will have to use it only for transfers, e.g. via clEnqueueCopyBuffer() or clEnqueueWriteBuffer().
I hope this is what you are also trying to say.
Experiments show that UHP can be a lot slower.
I will post sample code for this.
I was transferring 16 MB of data, and it takes quite a lot of time.
This is quite understandable, because UHP memory is pinned but physically discontiguous (correct me here).
So, the driver has to use either scatter-gather lists or a series of DMA operations (correct me here).
Thanks for all other details. They are very useful.
Please continue to share these tips with us so that we get to understand things better.
Experiments show that UHP can be a lot slower.
UHP can't be slower than AHP. Something is wrong in your test. Check the pointer's alignment. I believe that in the publicly available drivers the runtime has an alignment restriction of at least 256 bytes. In the latest runtime the alignment requirement was relaxed to the data type size.
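A quick way to verify and enforce that alignment on the host side before passing a pointer with CL_MEM_USE_HOST_PTR (helper names are mine; the 256-byte figure follows the comment above, and posix_memalign is the POSIX call; on Windows one would use _aligned_malloc instead):

```c
#include <stdint.h>
#include <stdlib.h>

/* Returns 1 if ptr is aligned to 'alignment' bytes
 * (alignment must be a power of two). */
static int is_aligned(const void *ptr, size_t alignment)
{
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Allocate a 256-byte-aligned host block intended for use
 * as a CL_MEM_USE_HOST_PTR backing store. Returns NULL on
 * failure; free with free(). */
static void *alloc_uhp(size_t bytes)
{
    void *p = NULL;
    if (posix_memalign(&p, 256, bytes) != 0)
        return NULL;
    return p;
}
```

Checking is_aligned(ptr, 256) before creating the buffer would rule out the misalignment case German describes.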
German Andryeyev wrote:
I can provide the basic concepts. However the real behavior may depend on the application's logic and GPU/CPU speed, since the performance bottlenecks can migrate from transfers to kernels.
Thanks German. I was experimenting with 2 queues in my production code (which is too complex to be posted here), but the execution time only got worse. As it is currently not possible to look inside the GPU and find out whether overlap actually takes place, and where things go wrong if it does not, I just had to undo all my changes.
I basically had the following classic double-buffering scenario:
Queue 1:
- enqueue transfer of data chunk 1 for a kernel execution
Queue 2:
- enqueue transfer of data chunk 2 for a kernel execution
Queue 1:
- enqueue execution of a kernel on data chunk 1 (hoped that it will overlap with data transfer 2)
- enqueue transfer of data chunk 3 for a kernel execution
Queue 2:
- enqueue execution of a kernel on data chunk 2 (hoped that it will overlap with data transfer 3)
- enqueue transfer of data chunk 4 for a kernel execution
...
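The loop above can be sketched roughly as follows (buffer array, kernel, sizes and NUM_CHUNKS are all hypothetical, and error checking is omitted; this only shows the enqueue order, not a tested implementation):

```c
#include <CL/cl.h>

/* Sketch of the double-buffering scheme: chunks alternate between
 * two in-order queues, hoping that the kernel on one queue overlaps
 * with the transfer already queued on the other. */
void double_buffer(cl_command_queue queue1, cl_command_queue queue2,
                   cl_kernel kernel, cl_mem chunkBuf[2],
                   const char *hostData, size_t chunkSize, int numChunks)
{
    size_t gws = chunkSize / sizeof(float);
    for (int i = 0; i < numChunks; ++i) {
        cl_command_queue q = (i % 2 == 0) ? queue1 : queue2;

        /* Non-blocking transfer of chunk i on its queue... */
        clEnqueueWriteBuffer(q, chunkBuf[i % 2], CL_FALSE, 0,
                             chunkSize, hostData + (size_t)i * chunkSize,
                             0, NULL, NULL);

        /* ...then the kernel on the SAME queue, so the in-order queue
         * itself enforces the transfer->execute dependency for this
         * chunk, while the other queue's work can overlap. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &chunkBuf[i % 2]);
        clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL,
                               0, NULL, NULL);
    }
    clFinish(queue1);
    clFinish(queue2);
}
```

Reusing each buffer only on its own in-order queue avoids the need for cross-queue events in this particular layout.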
I will re-attempt my experiments soon and will try to start with something simple and gradually increase complexity.
German Andryeyev wrote:
- However SI has 2 independent main CPs and runtime pairs them with DMA engines. So the application can still execute kernels on one CP, while another will be synced with a DRM engine for profiling and you should be able to profile it with APP or OCL profiling.
I did not know that. Please correct me if I am wrong: if I dedicate one queue exclusively to transfers, the other one only to executions and use OCL events so that execution only starts when corresponding data transfers are complete, then I can enable profiling for both queues and actually see overlaps in APP.
timchist wrote:
I did not know that. Please correct me if I am wrong: if I dedicate one queue exclusively to transfers, the other one only to executions and use OCL events so that execution only starts when corresponding data transfers are complete, then I can enable profiling for both queues and actually see overlaps in APP.
That's correct, but it's not a requirement; it just simplifies the logic. Also, as the first step of your experiments, don't sync the 2 queues, and measure the time of each command. Let's say: transfer - 50 ms, kernel - 100 ms, so total = 150 ms, and the expected time with overlap = 100 ms.
Basically you can submit kernels to the second queue as well. Everything will be running asynchronously on SI and you should be able to see it in APP. Even your original test should work fine on SI under APP and you should see overlaps. Make sure UHP pointers are aligned and you have the latest driver.
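For reference, the dedicated-queue variant discussed above (one queue for transfers, one for kernels, synchronized with an OpenCL event) might look roughly like this per chunk; names are hypothetical and error checking is omitted:

```c
#include <CL/cl.h>

/* Sketch: transferQueue only moves data, execQueue only runs kernels.
 * The event makes the kernel wait for exactly its own chunk's copy,
 * without serializing the two queues as a whole. */
void process_chunk(cl_command_queue transferQueue,
                   cl_command_queue execQueue,
                   cl_kernel kernel, cl_mem devBuf,
                   const void *hostPtr, size_t chunkSize)
{
    cl_event xferDone;
    clEnqueueWriteBuffer(transferQueue, devBuf, CL_FALSE, 0, chunkSize,
                         hostPtr, 0, NULL, &xferDone);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &devBuf);
    size_t gws = chunkSize / sizeof(float);
    clEnqueueNDRangeKernel(execQueue, kernel, 1, NULL, &gws, NULL,
                           1, &xferDone, NULL);  /* waits on the copy */
    clReleaseEvent(xferDone);
}
```

With profiling enabled on both queues, the timestamps of the write and kernel events should then show whether the copy for chunk N+1 really ran under the kernel for chunk N.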