
Zero copy optimization on AMD A-Series APU


I am wondering if there is any known method for utilizing zero copy on AMD A-Series APUs (e.g. A6-3650) without incurring the performance penalty of either (a) having the GPU use the CPU's cache coherency protocol to access data or (b) having the CPU access the data in an uncached manner.

Consider the following example.  I currently have a buffer called data that is allocated using the CL_MEM_USE_PERSISTENT_MEM_AMD flag.  My algorithm has two steps - call them stepA and stepB.  I am trying to use data partitioning to distribute the work of each step across the CPU and the GPU.  Note that the data partitioning is not necessarily the same for the two steps.  I have therefore created sub-buffers out of data that are used by each device for each step: dataCpuStepA, dataCpuStepB, dataGpuStepA, and dataGpuStepB.  These sub-buffers were declared using the same CL_MEM_USE_PERSISTENT_MEM_AMD flag.  The sub-buffers for the same step do not overlap.
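The setup described above can be sketched roughly as follows (a minimal, illustrative sketch: `ctx`, `N`, and the step-A split point `cpuBytesA` are placeholder names, and in practice sub-buffer origins must respect each device's CL_DEVICE_MEM_BASE_ADDR_ALIGN):

```c
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* CL_MEM_USE_PERSISTENT_MEM_AMD */

void create_step_a_subbuffers(cl_context ctx, size_t N, size_t cpuBytesA,
                              cl_mem *dataCpuStepA, cl_mem *dataGpuStepA)
{
    cl_int err;

    /* Parent buffer placed in the "persistent" zero-copy pool. */
    cl_mem data = clCreateBuffer(ctx, CL_MEM_USE_PERSISTENT_MEM_AMD,
                                 N, NULL, &err);

    /* Non-overlapping regions for step A. */
    cl_buffer_region cpuRegionA = { 0,         cpuBytesA     };
    cl_buffer_region gpuRegionA = { cpuBytesA, N - cpuBytesA };

    *dataCpuStepA = clCreateSubBuffer(data, CL_MEM_USE_PERSISTENT_MEM_AMD,
                                      CL_BUFFER_CREATE_TYPE_REGION,
                                      &cpuRegionA, &err);
    *dataGpuStepA = clCreateSubBuffer(data, CL_MEM_USE_PERSISTENT_MEM_AMD,
                                      CL_BUFFER_CREATE_TYPE_REGION,
                                      &gpuRegionA, &err);
    /* dataCpuStepB / dataGpuStepB follow the same pattern with the
       (possibly different) step-B split point. */
}
```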

I have noticed an odd phenomenon when I try to use these sub-buffers in my algorithm:

  • If I use the CPU and GPU as OpenCL devices, I wind up with incoherent memory.  That is to say, if I make sure both devices have completed stepA before moving on to stepB and set up the proper event dependencies, the GPU does not get an updated copy of the portion of data that the CPU wrote to sub-buffer dataCpuStepA.  Similarly, the CPU does not get an updated copy of the portion of data that the GPU wrote to sub-buffer dataGpuStepA.  This is especially problematic if the data partitioning in stepA is not the same data partitioning as stepB.
  • I can get around this by mapping/unmapping the parent buffer to the host using one of the devices (I picked the CPU, for example).  When this mapping and unmapping takes place, a great deal of data copying occurs.  No zero copy functionality is taking place in spite of the fact that the CL_MEM_USE_PERSISTENT_MEM_AMD flag was used when declaring all buffers and sub-buffers (contrary to what Table 4.2 of the AMD APP OpenCL Programming Guide would have me believe).
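The map/unmap workaround from the second bullet can be sketched like this, assuming events `cpuStepADone` and `gpuStepADone` mark the completion of step A on each device (all names are illustrative):

```c
#include <CL/cl.h>

/* Map and immediately unmap the parent buffer on one queue (the CPU's
   here) once both devices have finished step A; step-B work then waits
   on the returned event. */
cl_event flush_step_a(cl_command_queue cpuQueue, cl_mem data, size_t N,
                      cl_event cpuStepADone, cl_event gpuStepADone)
{
    cl_int err;
    cl_event waitList[2] = { cpuStepADone, gpuStepADone };
    cl_event mapDone;

    void *p = clEnqueueMapBuffer(cpuQueue, data, CL_TRUE,
                                 CL_MAP_READ | CL_MAP_WRITE,
                                 0, N, 2, waitList, NULL, &err);
    /* This map/unmap pair is where the unexpected copying shows up,
       despite the CL_MEM_USE_PERSISTENT_MEM_AMD flag. */
    clEnqueueUnmapMemObject(cpuQueue, data, p, 0, NULL, &mapDone);
    return mapDone;
}
```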

Alternatively, if I implement the CPU portion in host code and access the sub-buffers by mapping them to the host, I incur the expected performance penalty because the CPU is then accessing the sub-buffers uncached rather than through its cache.

My questions are therefore:

  1. Am I incurring this coherence problem because zero copy isn't (well) defined if there are multiple OpenCL devices?  Section 4.4.2 of the AMD APP OpenCL Programming Guide says "multiple GPU devices are not supported".  Should this be revised to "multiple OpenCL devices are not supported in combination with zero copy"?
  2. Buffers are defined by the programmer within a context, not on a per-device basis.  When buffers are defined with the CL_MEM_USE_PERSISTENT_MEM_AMD flag on a fusion system where there is a CPU and GPU device, does the GPU's device buffer point to the same physical memory as the CPU's device buffer?  If not, that would help explain why copying takes place when mapping occurs.
  3. Is there a way to avoid this copying of data when using the CPU and GPU as OpenCL devices?  It would be convenient if this flag essentially "turned off" the cache if the GPU is writing to the sub-buffer, but enabled it for writes from the CPU.  This would give the best of both worlds - zero copying of data as well as optimized access to memory for each device.  As long as the programmer sets up their sub-buffers properly (i.e. making sure sub-buffers don't overlap or split a cache line), things would work.
  4. I think I already know the answer to this one, but is it possible to avoid the performance penalty of having the CPU access uncached data or forcing the GPU to use the CPU's cache coherency protocol when using the CPU in the context of host code?  It seems that the current Fusion architecture facilitates inexpensive copying of data to/from the GPU, however the performance penalty the CPU incurs by working with uncached data still doesn't make it very attractive to use the CPU and GPU simultaneously.



3 Replies

If you are testing on Linux, then you can't bind zero-copy buffers to kernels; you can only use them as staging buffers for copies.  Persistent buffers don't have any meaning there either.  SI-family GPUs, such as the HD7970, do have zero-copy support under Linux.

That said, it seems like a bug if the buffers aren't staying coherent when you mix CPU and GPU devices.  Can you provide a small test case to demonstrate the issue?

There currently isn't a solution to question 4 other than pipelining data transfers and kernel execution to hide the cost of moving data back to the CPU side.
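That pipelining might look roughly like this, assuming the work divides into independent chunks and two in-order queues on the GPU device (all names and the double-buffering scheme are illustrative):

```c
#include <CL/cl.h>

/* Double-buffered pipeline: while the kernel processes chunk i in one
   buffer, the next chunk is transferred into the other, overlapping
   transfer cost with execution. */
void run_pipelined(cl_command_queue xferQueue, cl_command_queue execQueue,
                   cl_kernel kernel, cl_mem devBuf[2],
                   char *hostIn, char *hostOut,
                   int chunks, size_t chunkBytes, size_t chunkItems)
{
    for (int i = 0; i < chunks; ++i) {
        cl_mem buf = devBuf[i % 2];
        cl_event written, ran;

        clEnqueueWriteBuffer(xferQueue, buf, CL_FALSE, 0, chunkBytes,
                             hostIn + (size_t)i * chunkBytes,
                             0, NULL, &written);
        /* Kernel arguments are captured at enqueue time. */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(execQueue, kernel, 1, NULL,
                               &chunkItems, NULL, 1, &written, &ran);
        clEnqueueReadBuffer(xferQueue, buf, CL_FALSE, 0, chunkBytes,
                            hostOut + (size_t)i * chunkBytes,
                            1, &ran, NULL);
    }
    clFinish(xferQueue);
    clFinish(execQueue);
}
```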



Hi Jeff,

Thanks for your response.  I'm developing for Windows using the A6-3650 APU.

I'll work on trying to provide a small test case of the memory incoherence I'm experiencing when the GPU and CPU are used as OpenCL devices.

Yeah, that's what I figured for question 4.  It seems that the current zero copy implementation provides an easy method for pipelining the movement and processing of large amounts of data from the CPU to the GPU, but little else in terms of sharing large amounts of data for collaborative processing between the two devices.

Does anyone have any insight as to how/where zero copy buffers get allocated when using both the CPU and GPU as an OpenCL device?  Any general advice on how to use zero copy when both the GPU and CPU are target devices in the same context would be useful/appreciated as well.




The way OpenCL is written makes it difficult to share buffers between devices because only one device can "own" a buffer at a time.  This means the developer is unable to partition buffers for use by multiple devices simultaneously, which is limiting.

When resources are created, we defer allocation until first use and then determine the "optimal" placement based on that.  There are some exceptions, such as when USE_HOST_PTR is specified.  Also, if COPY_HOST_PTR is specified, we must create a temporary buffer for the contents of the user pointer.  Future additions to the API will allow the developer to more easily control which device gets preferential placement of allocated buffers in contexts with multiple devices.

On APUs, I recommend using different buffers for the CPU and GPU devices so each can get its preferred memory pool.  The GPU device prefers device memory or read-only host buffers.  The CPU device prefers device memory (i.e. system memory) or read-write host buffers.
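Following this advice might look roughly like so: separate per-device buffers instead of sub-buffers of one zero-copy allocation, with the shared region exchanged explicitly between steps (all sizes, names, and the single shared region are illustrative):

```c
#include <CL/cl.h>

/* Give each device its own buffer in its preferred pool, then hand
   results across with an explicit copy between algorithm steps. */
void make_per_device_buffers(cl_context ctx, cl_command_queue gpuQueue,
                             size_t gpuBytes, size_t cpuBytes,
                             size_t sharedBytes)
{
    cl_int err;

    /* GPU device: plain device-memory buffer. */
    cl_mem gpuBuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                   gpuBytes, NULL, &err);
    /* CPU device: host-allocated buffer the CPU can cache normally. */
    cl_mem cpuBuf = clCreateBuffer(ctx, CL_MEM_READ_WRITE |
                                   CL_MEM_ALLOC_HOST_PTR,
                                   cpuBytes, NULL, &err);

    /* Hand the GPU's step-A output to the CPU side. */
    clEnqueueCopyBuffer(gpuQueue, gpuBuf, cpuBuf,
                        0, 0, sharedBytes, 0, NULL, NULL);
}
```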

Hope this helps,