
reynmorris
Journeyman III

DMA with AMD A10 APU?

I have a functioning OpenCL application right now that uses two command queues so that I can run a kernel and DMA-transfer data concurrently. It works on multiple systems that use discrete GPUs (NVidia and AMD).
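Roughly, the structure looks like this (a stripped-down sketch rather than my real code; error checking is omitted, and the context, device, kernel, and buffers are assumed to be created elsewhere):

```c
#include <CL/cl.h>

/* Sketch only: two queues on one context/device, kernel on one queue,
 * non-blocking write on the other, with no dependency between them. */
void run_overlapped(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                    cl_mem dev_buf, const void *host_src, size_t bytes)
{
    cl_int err;
    cl_command_queue exec_q = clCreateCommandQueue(ctx, dev, 0, &err);
    cl_command_queue dma_q  = clCreateCommandQueue(ctx, dev, 0, &err);

    size_t gws = 1024 * 1024;                 /* illustrative work size */
    clEnqueueNDRangeKernel(exec_q, kernel, 1, NULL, &gws, NULL,
                           0, NULL, NULL);
    clEnqueueWriteBuffer(dma_q, dev_buf, CL_FALSE, 0, bytes, host_src,
                         0, NULL, NULL);      /* intended to overlap */

    clFinish(exec_q);
    clFinish(dma_q);
    clReleaseCommandQueue(exec_q);
    clReleaseCommandQueue(dma_q);
}
```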


However, when I try to run it on my system with an AMD A10 APU, the kernel locks up and freezes. Is this just not possible with this architecture, or is there some kind of exception I need to use?

I can provide an example program privately if an AMD developer can help.

Thanks!

12 Replies
himanshu_gautam
Grandmaster

Please attach your test case here and I will try to reproduce it at my end. I can also forward it to the relevant AMD engineering team if the bug turns out to be valid.

I would also suggest going through the TransferOverlap SDK sample for some direction.

himanshu_gautam
Grandmaster

If I recall correctly, CUDA requires multiple streams (within a CUDA context) to overlap DMA with kernel execution.

However, I think that on AMD you really don't need multiple command queues. Just make sure that the kernel and the buffer copy are enqueued one after another, that they don't have a dependency on each other, and that the buffer uses pinned memory. This should suffice.
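Something along these lines (an untested sketch; error checking is omitted, and the queue, kernel, and buffers are assumed to exist, with pinned_src pointing into pinned memory):

```c
#include <CL/cl.h>

/* Sketch: kernel and copy enqueued back-to-back on one queue, with no
 * event dependency between them. Whether they actually overlap is up
 * to the runtime, but this is the pattern to aim for. */
void enqueue_back_to_back(cl_command_queue q, cl_kernel kernel,
                          cl_mem dev_buf, const void *pinned_src,
                          size_t bytes)
{
    size_t gws = 1024 * 1024;                 /* illustrative work size */
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &gws, NULL, 0, NULL, NULL);
    clEnqueueWriteBuffer(q, dev_buf, CL_FALSE, 0, bytes, pinned_src,
                         0, NULL, NULL);      /* independent of the kernel */
    clFinish(q);
}
```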

Please give me some time while I experiment with the same, and I will let you know.


So do I take it correctly that DMA is only used when pinned memory is used? Do I remember correctly that pinned memory is used only when a buffer is smaller than 32 MB and is moved by clEnqueueMapBuffer? I recall reading about this a while back, and if I remember correctly, mapping buffers returns pointers to pinned memory if they are small enough. I only ask because I'm writing a prototype of a GPU-cluster-capable physics simulation with MPI, and CUDA has RDMA implemented (most likely not ported to OpenCL), so my best chance with AMD is using pinned buffers.

Plus, does AMD plan on implementing something similar on the Red side of the force? (namely RDMA, either over InfiniBand or simply within a host)


I really hope there isn't a hard cap that small on the size of pinned memory; I haven't checked. I'm also curious about whether there are plans for RDMA in OpenCL, but I'm not very hopeful, as that is probably an architecture-specific thing that nVidia is doing (it only appears to be available on newer Tesla models).

Himanshu - Sorry I haven't responded to the main replies here; I've had to move forward with an alternate approach, but I am still curious whether this can be done on APU hardware (concurrent DMA and kernel execution). If you come up with a very simple example that works on Trinity hardware, I'd be very appreciative to see it. Thanks for your time.


I think the 32 MB limit comes from Table 4.2 in the AMD APP Programming Guide. That limit applies to normal, regular buffers (which are not pinned and usually live on the device), and the guide is describing the behaviour of clEnqueueMapBuffer.

But if you want to use DMA, you have to pin the buffer. Pinning usually happens when you use CL_MEM_USE_HOST_PTR. Either the host application's pages are pinned directly, or they are copied to a temporary pinned buffer for a one-shot transfer, or they are transferred chunk by chunk using DMA and double buffering. The runtime decides when to transfer (mostly depending on first-time usage). Until you map that buffer, the OpenCL runtime owns your host pointer. When you map it, you own it and can write to it. When you unmap it, control returns to the OpenCL runtime.
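To illustrate the ownership rules (a minimal sketch, assuming a context and queue already exist; error checking omitted):

```c
#include <CL/cl.h>
#include <stdlib.h>
#include <string.h>

/* Sketch of CL_MEM_USE_HOST_PTR ownership: the runtime owns the host
 * pages until you map; you own them between map and unmap. */
void use_host_ptr_example(cl_context ctx, cl_command_queue q, size_t bytes)
{
    cl_int err;
    void *host = malloc(bytes);
    cl_mem buf = clCreateBuffer(ctx,
        CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, bytes, host, &err);

    /* Map: ownership transfers back to the application. */
    void *p = clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, bytes, 0, NULL, NULL, &err);
    memset(p, 0, bytes);                      /* safe to write now */

    /* Unmap: control returns to the OpenCL runtime. */
    clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);
    clFinish(q);

    clReleaseMemObject(buf);
    free(host);
}
```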

When you use CL_MEM_ALLOC_HOST_PTR and zero-copy is supported, pinned memory is allocated. The kernel can read this data directly through a pointer, so data transfer and kernel execution happen together -- which is not a great way to overlap the two (the GPU is too fast and will often stall waiting for data to arrive from system memory across PCIe).
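For example (a sketch, assuming zero-copy is supported on the platform; error checking omitted):

```c
#include <CL/cl.h>

/* Sketch: the kernel reads the CL_MEM_ALLOC_HOST_PTR buffer in place;
 * no explicit clEnqueueWriteBuffer is issued, so the "transfer" happens
 * implicitly while the kernel runs. */
void zero_copy_example(cl_context ctx, cl_command_queue q,
                       cl_kernel kernel, size_t n_floats)
{
    cl_int err;
    size_t bytes = n_floats * sizeof(float);
    cl_mem zc = clCreateBuffer(ctx,
        CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);

    /* Initialize through a mapping, then unmap before kernel use. */
    float *p = (float *)clEnqueueMapBuffer(q, zc, CL_TRUE, CL_MAP_WRITE,
                                           0, bytes, 0, NULL, NULL, &err);
    for (size_t i = 0; i < n_floats; ++i)
        p[i] = (float)i;
    clEnqueueUnmapMemObject(q, zc, p, 0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &zc);
    clEnqueueNDRangeKernel(q, kernel, 1, NULL, &n_floats, NULL,
                           0, NULL, NULL);
    clFinish(q);
    clReleaseMemObject(zc);
}
```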

When you use the CL_MEM_USE_PERSISTENT_MEM_AMD flag, the buffer is allocated inside the GPU and the CPU gets a pointer (which reads/writes across the PCIe bus). In this case, memcpy and kernel execution can happen together, but the memcpy is PIO and cannot be called DMA.
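In code, that case looks something like this (a sketch; CL_MEM_USE_PERSISTENT_MEM_AMD comes from AMD's cl_ext.h extension header, and error checking is omitted):

```c
#include <CL/cl.h>
#include <CL/cl_ext.h>
#include <string.h>

/* Sketch: the buffer lives in GPU memory and the mapped pointer writes
 * across PCIe, so the memcpy below is PIO, not DMA. */
void persistent_mem_example(cl_context ctx, cl_command_queue q,
                            const void *host_src, size_t bytes)
{
    cl_int err;
    cl_mem gpu_buf = clCreateBuffer(ctx,
        CL_MEM_READ_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD,
        bytes, NULL, &err);

    void *p = clEnqueueMapBuffer(q, gpu_buf, CL_TRUE, CL_MAP_WRITE,
                                 0, bytes, 0, NULL, NULL, &err);
    memcpy(p, host_src, bytes);   /* PIO copy; can run alongside a kernel */
    clEnqueueUnmapMemObject(q, gpu_buf, p, 0, NULL, NULL);
    clFinish(q);
    clReleaseMemObject(gpu_buf);
}
```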

The best way to overlap a transfer with kernel execution is to first allocate a pinned buffer (using CL_MEM_ALLOC_HOST_PTR), map it to get a pointer, and write something into it. Allocate another normal buffer (which sits on the GPU). Now do a clEnqueueWriteBuffer from the pinned buffer to the normal buffer. That write is pure DMA.
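Putting those steps together (a sketch of the pattern just described; error checking omitted):

```c
#include <CL/cl.h>

/* Sketch: pinned staging buffer -> clEnqueueWriteBuffer into a normal
 * device-resident buffer; the write itself is the DMA transfer. */
void pinned_dma_example(cl_context ctx, cl_command_queue q, size_t bytes)
{
    cl_int err;
    cl_mem pinned = clCreateBuffer(ctx,
        CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, bytes, NULL, &err);
    cl_mem dev    = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                   bytes, NULL, &err);

    /* Map the pinned buffer to get a host pointer and fill it. */
    unsigned char *p = (unsigned char *)clEnqueueMapBuffer(
        q, pinned, CL_TRUE, CL_MAP_WRITE, 0, bytes, 0, NULL, NULL, &err);
    for (size_t i = 0; i < bytes; ++i)
        p[i] = (unsigned char)i;

    /* DMA from the pinned host pages into the device buffer. Using the
     * still-mapped pointer as the source is AMD's prepinned path; for
     * strict spec portability you would unmap first. */
    clEnqueueWriteBuffer(q, dev, CL_FALSE, 0, bytes, p, 0, NULL, NULL);
    clFinish(q);

    clEnqueueUnmapMemObject(q, pinned, p, 0, NULL, NULL);
    clFinish(q);
    clReleaseMemObject(dev);
    clReleaseMemObject(pinned);
}
```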

It is this DMA that I would like to overlap with kernel execution. I am still investigating whether this is possible.

Will post an update next week.


I am still working on creating a sample and will get it across at the earliest. Apologies for the delay.

Thanks,

himanshu_gautam
Grandmaster

Hi,

Your e-mail is private, so I am not sure how I can contact you to get the repro-case sources.

I have sent you a friend request; please accept it. Let us see if that opens the door for some private message exchange.

We will work with you to resolve your issue.

Thanks,

himanshu_gautam
Grandmaster

We are working on enabling private message communications. Hopefully, once this is fixed, you can feel free to send in your code. We will test it out and see why it is crashing.


Hi,

I am afraid private messaging may not be possible at the moment.

Can you confirm whether you are still having the issue?

We would appreciate it if you could post a simple test case that shows the crash/hang.

Thanks,


Here is sample code to showcase asynchronous DMA using AMD GPUs. It should compile for both Windows and Linux.

Looked at the sample. Great work! I'll make sure to make use of this in my next app.


Thanks for the response.
