I am wondering if there is any known method for utilizing zero copy on AMD A-Series APUs (e.g. A6-3650) without incurring the performance penalty of either (a) having the GPU use the CPU's cache coherency protocol to access data or (b) having the CPU access the data in an uncached manner.
Consider the following example. I currently have a buffer called data that is allocated using the CL_MEM_USE_PERSISTENT_MEM_AMD flag. My algorithm has two steps - call them stepA and stepB. I am trying to use data partitioning to distribute the work of each step across the CPU and the GPU. Note that the data partitioning is not necessarily the same for the two steps. I have therefore created sub-buffers out of data that are used by each device for each step: dataCpuStepA, dataCpuStepB, dataGpuStepA, and dataGpuStepB. These sub-buffers were declared using the same CL_MEM_USE_PERSISTENT_MEM_AMD flag. The sub-buffers for the same step do not overlap.
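For concreteness, the allocation looks roughly like this (a sketch, not my exact code: context, totalSize, and the cpuSizeA split point are placeholders, error checking is omitted, and CL_MEM_USE_PERSISTENT_MEM_AMD comes from the AMD extension header):

    #include <CL/cl.h>
    #include <CL/cl_ext.h>  /* defines CL_MEM_USE_PERSISTENT_MEM_AMD (AMD APP SDK) */

    cl_int err;
    cl_mem data = clCreateBuffer(context,
            CL_MEM_READ_WRITE | CL_MEM_USE_PERSISTENT_MEM_AMD,
            totalSize, NULL, &err);

    /* Non-overlapping partitions for step A; step B is built the same way
     * with its own split point. Sub-buffer origins must respect the
     * device's CL_DEVICE_MEM_BASE_ADDR_ALIGN. */
    cl_buffer_region cpuRegionA = { 0,        cpuSizeA };
    cl_buffer_region gpuRegionA = { cpuSizeA, totalSize - cpuSizeA };

    cl_mem dataCpuStepA = clCreateSubBuffer(data,
            CL_MEM_READ_WRITE | CL_MEM_USE_PERSISTENT_MEM_AMD,
            CL_BUFFER_CREATE_TYPE_REGION, &cpuRegionA, &err);
    cl_mem dataGpuStepA = clCreateSubBuffer(data,
            CL_MEM_READ_WRITE | CL_MEM_USE_PERSISTENT_MEM_AMD,
            CL_BUFFER_CREATE_TYPE_REGION, &gpuRegionA, &err);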
I have noticed an odd phenomenon when I try to use these sub-buffers in my algorithm: if I run the CPU portion as OpenCL kernels on the CPU device, the sub-buffers do not appear to stay coherent between the CPU and GPU devices.
Alternatively, if I implement the CPU portion in host code and access the sub-buffers by mapping them to the host, I incur the expected performance penalty, since the CPU then accesses the sub-buffers uncached rather than through its cache.
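For reference, the mapping path looks roughly like this (a sketch; cpuQueue is a placeholder, and dataCpuStepA, cpuSizeA, and err are from the snippet above):

    float *ptr = (float *)clEnqueueMapBuffer(cpuQueue, dataCpuStepA, CL_TRUE,
            CL_MAP_READ | CL_MAP_WRITE, 0, cpuSizeA, 0, NULL, NULL, &err);
    /* ... do the CPU portion of the step on ptr; with a persistent buffer
     * these accesses bypass the CPU cache, hence the slowdown ... */
    clEnqueueUnmapMemObject(cpuQueue, dataCpuStepA, ptr, 0, NULL, NULL);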
My questions are therefore:
If you are testing on Linux, then you can't bind zero copy buffers to kernels; you can only use them as staging buffers for copies. Persistent buffers don't have any meaning there either. SI family GPUs, such as the HD 7970, do have zero copy support under Linux.
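In other words, under Linux the usable pattern is roughly this (a sketch; ctx, queue, kernel, size, and err are placeholders):

    /* The zero copy buffer is host-visible and cheap to fill, but only
     * the ordinary device buffer may be bound to a kernel. */
    cl_mem staging = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);
    cl_mem devBuf  = clCreateBuffer(ctx, CL_MEM_READ_WRITE,     size, NULL, &err);

    /* Fill staging on the host via map/unmap, then copy it over. */
    clEnqueueCopyBuffer(queue, staging, devBuf, 0, 0, size, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &devBuf);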
That said, it seems like a bug if the buffers aren't staying coherent when you mix CPU and GPU devices. Can you provide a small test case to demonstrate the issue?
There currently isn't a solution to 4 other than to pipeline data transfers and kernel execution to hide the cost of moving data back to the CPU side.
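A sketch of that pipelining, assuming two in-order queues on the GPU device and hypothetical placeholders devBuf[2], chunkItems, chunkBytes, and hostResults:

    /* Overlap kernel execution with read-back: while chunk i's results
     * travel to the host, chunk i+1's kernel can already run. */
    cl_event kernelDone, readDone[2] = { NULL, NULL };
    unsigned char *dst = hostResults;

    for (int i = 0; i < numChunks; ++i) {
        int slot = i % 2;
        /* Don't reuse a slot until its previous read-back has finished. */
        if (readDone[slot]) {
            clWaitForEvents(1, &readDone[slot]);
            clReleaseEvent(readDone[slot]);
        }
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &devBuf[slot]);
        clEnqueueNDRangeKernel(gpuQueue, kernel, 1, NULL, &chunkItems, NULL,
                               0, NULL, &kernelDone);
        clEnqueueReadBuffer(copyQueue, devBuf[slot], CL_FALSE, 0, chunkBytes,
                            dst + (size_t)i * chunkBytes,
                            1, &kernelDone, &readDone[slot]);
        clReleaseEvent(kernelDone);
    }
    clFinish(copyQueue);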
Thanks for your response. I'm developing for Windows using the A6-3650 APU.
I'll work on trying to provide a small test case of the memory incoherence I'm experiencing when the GPU and CPU are used as OpenCL devices.
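The skeleton I am starting from looks roughly like this (a sketch with placeholder names and no error checks; fillKernel just writes a distinct constant into every element of its argument, and n/half are the matching work size and byte count):

    cl_mem data = clCreateBuffer(ctx,
            CL_MEM_READ_WRITE | CL_MEM_USE_PERSISTENT_MEM_AMD,
            2 * half, NULL, &err);
    cl_buffer_region lo = { 0, half }, hi = { half, half };
    cl_mem cpuPart = clCreateSubBuffer(data, CL_MEM_READ_WRITE,
            CL_BUFFER_CREATE_TYPE_REGION, &lo, &err);
    cl_mem gpuPart = clCreateSubBuffer(data, CL_MEM_READ_WRITE,
            CL_BUFFER_CREATE_TYPE_REGION, &hi, &err);

    /* Each device fills its own, non-overlapping half. */
    clSetKernelArg(fillKernel, 0, sizeof(cl_mem), &cpuPart);
    clEnqueueNDRangeKernel(cpuQueue, fillKernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clSetKernelArg(fillKernel, 0, sizeof(cl_mem), &gpuPart);
    clEnqueueNDRangeKernel(gpuQueue, fillKernel, 1, NULL, &n, NULL, 0, NULL, NULL);
    clFinish(cpuQueue);
    clFinish(gpuQueue);

    /* Map the parent buffer and verify that both halves hold the values
     * their devices wrote. */
    unsigned char *p = clEnqueueMapBuffer(cpuQueue, data, CL_TRUE, CL_MAP_READ,
            0, 2 * half, 0, NULL, NULL, &err);
    /* ... compare p[0 .. half) and p[half .. 2*half) against expectations ... */
    clEnqueueUnmapMemObject(cpuQueue, data, p, 0, NULL, NULL);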
Yeah, that's what I figured for 4. It seems that the current zero copy implementation provides an easy method for pipelining the movement and processing of large amounts of data from the CPU to the GPU, but little else in terms of sharing large amounts of data for collaborative processing between the two devices.
Does anyone have any insight into how/where zero copy buffers get allocated when using both the CPU and GPU as OpenCL devices? Any general advice on how to use zero copy when both the GPU and CPU are target devices in the same context would also be appreciated.
The way OpenCL is written makes it difficult to share buffers between devices because only one device can "own" a buffer at a time. This means the developer is unable to partition buffers for use by multiple devices simultaneously, which is limiting.
When resources are created, we defer allocation until first use and then determine the "optimal" placement based on that. There are some exceptions, such as when USE_HOST_PTR is specified. Also, if COPY_HOST_PTR is specified, we must create a temp buffer for the contents of the user pointer. Future additions to the API will let the developer more easily control which device gets preferential placement of allocated buffers in contexts with multiple devices.
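A sketch of those two exceptions, with hypothetical names (hostData is an application allocation of bytes bytes):

    /* USE_HOST_PTR: the runtime must keep honoring the application's
     * pointer, so placement is tied to it rather than deferred. */
    cl_mem used = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR,
                                 bytes, hostData, &err);

    /* COPY_HOST_PTR: the runtime snapshots the contents into its own
     * (temp) buffer at creation and may later place that as it prefers. */
    cl_mem copied = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                   bytes, hostData, &err);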
On APUs, I recommend using different buffers for the CPU and GPU devices so each can get its preferred memory pool. The GPU device prefers device memory or read-only host buffers; the CPU device prefers device memory (i.e. system memory) or read-write host buffers.
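For example (a sketch; the flag choices below are my reading of those preferences, with placeholder names):

    /* GPU device: default placement ends up in device memory ... */
    cl_mem gpuWork = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
    /* ... or a read-only host buffer for data the GPU only reads. */
    cl_mem gpuIn   = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                                    bytes, NULL, &err);
    /* CPU device: its "device memory" is system memory, so a read-write
     * host buffer keeps its accesses in ordinary cacheable RAM. */
    cl_mem cpuWork = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                                    bytes, NULL, &err);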
Hope this helps,