I am wondering if there is any known method for utilizing zero copy on AMD A-Series APUs (e.g. A6-3650) without incurring the performance penalty of either (a) having the GPU use the CPU's cache coherency protocol to access data or (b) having the CPU access the data in an uncached manner.
Consider the following example. I currently have a buffer called data that is allocated using the CL_MEM_USE_PERSISTENT_MEM_AMD flag. My algorithm has two steps - call them stepA and stepB. I am trying to use data partitioning to distribute the work of each step across the CPU and the GPU. Note that the data partitioning is not necessarily the same for the two steps. I have therefore created sub-buffers out of data that are used by each device for each step: dataCpuStepA, dataCpuStepB, dataGpuStepA, and dataGpuStepB. These sub-buffers were declared using the same CL_MEM_USE_PERSISTENT_MEM_AMD flag. The sub-buffers for the same step do not overlap.
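For concreteness, the setup described above looks roughly like the following sketch. The context, sizes, and the split point `cpuSizeA` are hypothetical placeholders, error handling is omitted, and the stepB sub-buffers would be created the same way with their own (possibly different) offsets:

```c
#include <CL/cl.h>
#include <CL/cl_ext.h>  /* CL_MEM_USE_PERSISTENT_MEM_AMD */

/* Sketch: create the parent buffer and the non-overlapping stepA
 * sub-buffers. cpuSizeA should respect CL_DEVICE_MEM_BASE_ADDR_ALIGN
 * so that the origin of the GPU sub-buffer is legal. */
void create_buffers(cl_context context, size_t totalSize, size_t cpuSizeA)
{
    cl_int err;

    cl_mem data = clCreateBuffer(context,
        CL_MEM_READ_WRITE | CL_MEM_USE_PERSISTENT_MEM_AMD,
        totalSize, NULL, &err);

    /* Non-overlapping partitions of the parent buffer for stepA. */
    cl_buffer_region cpuRegionA = { 0,        cpuSizeA };
    cl_buffer_region gpuRegionA = { cpuSizeA, totalSize - cpuSizeA };

    cl_mem dataCpuStepA = clCreateSubBuffer(data,
        CL_MEM_USE_PERSISTENT_MEM_AMD,
        CL_BUFFER_CREATE_TYPE_REGION, &cpuRegionA, &err);
    cl_mem dataGpuStepA = clCreateSubBuffer(data,
        CL_MEM_USE_PERSISTENT_MEM_AMD,
        CL_BUFFER_CREATE_TYPE_REGION, &gpuRegionA, &err);

    /* ... enqueue stepA kernels on each device's command queue ... */
    (void)dataCpuStepA; (void)dataGpuStepA;
}
```

(Whether `clCreateSubBuffer` honors the `CL_MEM_USE_PERSISTENT_MEM_AMD` flag on sub-buffers at all is part of what I'm unsure about; I'm simply passing it as described above.)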
I have noticed an odd phenomenon when I try to use these sub-buffers in my algorithm:
- If I use the CPU and GPU as OpenCL devices, I wind up with incoherent memory. That is, even when I set up the proper event dependencies so that both devices have completed stepA before either moves on to stepB, the GPU does not see an updated copy of the portion of data that the CPU wrote through sub-buffer dataCpuStepA. Similarly, the CPU does not see an updated copy of the portion of data that the GPU wrote through sub-buffer dataGpuStepA. This is especially problematic when the data partitioning for stepA differs from the data partitioning for stepB.
- I can get around this by mapping and then unmapping the parent buffer to the host on one of the devices (I picked the CPU, for example). When this map/unmap takes place, a great deal of data copying occurs. No zero-copy behavior is observed, despite the fact that the CL_MEM_USE_PERSISTENT_MEM_AMD flag was used when creating all buffers and sub-buffers (contrary to what Table 4.2 of the AMD APP OpenCL Programming Guide would have me believe).
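The workaround amounts to a map/unmap round trip on the parent buffer with no host-side work in between, something like the sketch below (the queue name is hypothetical and error handling is omitted):

```c
#include <CL/cl.h>

/* Sketch: force coherence between stepA and stepB by mapping and
 * immediately unmapping the parent buffer on one device's queue.
 * This is where the unexpected data copying is observed. */
void sync_parent_buffer(cl_command_queue cpuQueue, cl_mem data,
                        size_t totalSize)
{
    cl_int err;
    void *ptr = clEnqueueMapBuffer(cpuQueue, data, CL_TRUE /* blocking */,
        CL_MAP_READ | CL_MAP_WRITE, 0, totalSize,
        0, NULL, NULL, &err);
    /* No host-side access is needed; the round trip alone synchronizes. */
    clEnqueueUnmapMemObject(cpuQueue, data, ptr, 0, NULL, NULL);
    clFinish(cpuQueue);
}
```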
Alternatively, if I implement the CPU portion in host code and access the sub-buffers by mapping them to the host, I incur the expected performance penalty because the CPU is no longer accessing the sub-buffers through its cache.
My questions are therefore:
- Am I incurring this coherence problem because zero copy isn't (well) defined if there are multiple OpenCL devices? Section 4.4.2 of the AMD APP OpenCL Programming Guide says "multiple GPU devices are not supported". Should this be revised to "multiple OpenCL devices are not supported in combination with zero copy"?
- Buffers are defined by the programmer within a context, not on a per-device basis. When buffers are defined with the CL_MEM_USE_PERSISTENT_MEM_AMD flag on a Fusion system with both a CPU and a GPU device, does the GPU's device buffer point to the same physical memory as the CPU's device buffer? If not, that would help explain why copying takes place when mapping occurs.
- Is there a way to avoid this copying of data when using the CPU and GPU as OpenCL devices? It would be convenient if this flag essentially "turned off" the cache if the GPU is writing to the sub-buffer, but enabled it for writes from the CPU. This would give the best of both worlds - zero copying of data as well as optimized access to memory for each device. As long as the programmer sets up their sub-buffers properly (i.e. making sure sub-buffers don't overlap or split a cache line), things would work.
- I think I already know the answer to this one, but is it possible to avoid the performance penalty of having the CPU access uncached data, or of forcing the GPU to use the CPU's cache coherency protocol, when using the CPU from host code? It seems that the current Fusion architecture facilitates inexpensive copying of data to/from the GPU; however, the performance penalty the CPU incurs by working with uncached data still doesn't make it very attractive to use the CPU and GPU simultaneously.