I have a question about the memory model on AMD Fusion devices:
As far as I understand, on Llano the CPU and GPU work on separate areas of the same physical memory. So a copy is still needed if the CPU works on a buffer first and the GPU then works on the same buffer. Is that right?
Has this situation changed with AMD Trinity? On Intel's Ivy Bridge platform it seems that copying data is no longer necessary: as far as I know, you can simply create a buffer in the shared CPU-GPU context, which can then be accessed by both the CPU and the GPU device without being copied. Is it similar on Trinity?
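To make the question concrete, this is the kind of allocation I mean (a sketch only; the context, sizes, and flag choice are placeholders, and whether this is actually zero-copy on these chips is exactly what I'm asking):

```c
#include <CL/cl.h>

/* Sketch: allocate one buffer in a context shared by the CPU and GPU
 * devices, hoping both can access it without an explicit copy.
 * CL_MEM_ALLOC_HOST_PTR asks the runtime to place the buffer in
 * host-visible memory, which on APUs may enable zero-copy access. */
cl_mem create_shared_buffer(cl_context ctx, size_t bytes, cl_int *err)
{
    return clCreateBuffer(ctx,
                          CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR,
                          bytes, NULL, err);
}
```

(No runnable test attached since this needs an actual OpenCL platform and device to execute.)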
This feature is relevant to all platforms supporting zero copy: APUs and discrete GPUs.
Discrete GPU access to host memory is slower than on APUs, since it goes through the PCIe bus.
What about the performance on the APUs? I thought on Llano there's still some overhead when the buffer is not explicitly copied to GPU memory. Is that right? If so, has that situation changed on Trinity?
This is the state of Llano found on page 35 of the presentation Memory System on Fusion APUs:
| Llano Memory State | Local | Uncached | Cacheable |
| --- | --- | --- | --- |
| GPU Read | 17 GB/s | 6-12 GB/s | 4.5 GB/s |
| GPU Write | 12 GB/s | 6-12 GB/s | 5.5 GB/s |
| CPU Read | < 1 GB/s | < 1 GB/s | 8-13 GB/s |
| CPU Write | 8 GB/s | 8-13 GB/s | 8-13 GB/s |
What a programmer would like is a memory state with full bi-directional bandwidth for both the CPU and the GPU, so that they aren't constantly worrying about this performance table and about which flags to use when allocating buffers.
I don't think dominik_g's question was fully answered. Dominik_g is correct that on Llano there is a penalty if data is not copied from CPU Cacheable memory to GPU Local memory.
Does Trinity perform differently? Is there another chart like the one above?
In the (attached) presentation "Assessing the relevance of APU for high performance scientific computing" from AFDS12, all of the benchmarks listed for Trinity still use the same memory system found on Llano.
The highest-performance option for the benchmarks is to explicitly copy input data from "CPU memory" to "GPU memory" and then copy the output data from "GPU memory" back to "CPU memory". This appears to be no different from what is done with discrete GPUs.
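Concretely, that explicit-copy strategy is the usual discrete-GPU pattern, roughly like this (a sketch; the context, queue, kernel, and sizes are placeholders, not code from the presentation):

```c
#include <CL/cl.h>

/* Sketch of the explicit-copy pattern the benchmarks favor:
 * stage input into device-local memory, run the kernel,
 * then copy the results back to host memory. */
void run_with_explicit_copies(cl_context ctx, cl_command_queue queue,
                              cl_kernel kernel, const float *host_in,
                              float *host_out, size_t bytes, size_t n)
{
    cl_int err;
    /* No host-pointer flags: buffers live in device-local "GPU memory". */
    cl_mem d_in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, &err);
    cl_mem d_out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

    /* Explicit copy: host -> device. */
    clEnqueueWriteBuffer(queue, d_in, CL_TRUE, 0, bytes, host_in,
                         0, NULL, NULL);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_out);
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &n, NULL, 0, NULL, NULL);

    /* Explicit copy: device -> host. */
    clEnqueueReadBuffer(queue, d_out, CL_TRUE, 0, bytes, host_out,
                        0, NULL, NULL);

    clReleaseMemObject(d_in);
    clReleaseMemObject(d_out);
}
```

(Not independently runnable without an OpenCL device, so no test is attached.)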
So it appears that nothing has changed between Llano and Trinity. What a disappointment.
I'm actually performing some benchmarks (matrix addition, multiplication, reduction, convolution) using an A8 APU and a 5870 GPU.
I measure the completion time while varying the device and the allocation strategy (ALLOC_HOST, USE_PERSISTENT_MEM, no flags, ...), where the completion time includes allocating and initializing the input, executing the kernel, and retrieving the output.
I found that for both the discrete and the integrated GPU, zero-copy input allocation on the host (ALLOC_HOST | READ_ONLY) or zero-copy allocation in host-visible device memory leads to the best performance.
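For reference, the two zero-copy variants I compared were allocated roughly like this (a sketch; `ctx`, `bytes`, and `err` are placeholders, and `CL_MEM_USE_PERSISTENT_MEM_AMD` is an AMD-specific extension flag from `cl_ext.h`, not part of core OpenCL):

```c
#include <CL/cl.h>
#include <CL/cl_ext.h>   /* defines CL_MEM_USE_PERSISTENT_MEM_AMD on AMD platforms */

/* Zero-copy input in host memory: the GPU reads it in place. */
cl_mem in = clCreateBuffer(ctx,
                           CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                           bytes, NULL, &err);

/* Zero-copy buffer in host-visible device memory (AMD extension):
 * the host writes/reads it in place over the shared path. */
cl_mem out = clCreateBuffer(ctx,
                            CL_MEM_WRITE_ONLY | CL_MEM_USE_PERSISTENT_MEM_AMD,
                            bytes, NULL, &err);
```

(Requires an AMD OpenCL platform to run, so no test is attached.)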
I think the best allocation/data-transfer strategy depends strictly on the memory access patterns of the kernel and of the host.
Resources created with CL_MEM_ALLOC_HOST_PTR are accessed directly by the GPU and the CPU, with no copy in between.
When you say CPU, is that limited to accesses from an OpenCL kernel running on the CPU device, or is it also zero-copy when accessing the same memory from C/C++ host code after a clEnqueueMapBuffer call?
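To be clear, the host-side access pattern I'm asking about is map/unmap, something like this (a sketch; `queue`, `buf`, and `bytes` are placeholders):

```c
#include <CL/cl.h>

/* Sketch: host C/C++ access to a CL_MEM_ALLOC_HOST_PTR buffer via
 * map/unmap. On an APU the map may return a pointer straight into the
 * shared allocation (true zero-copy); on a discrete GPU the runtime
 * may perform a copy behind the scenes. */
void touch_from_host(cl_command_queue queue, cl_mem buf, size_t bytes)
{
    cl_int err;
    float *ptr = (float *)clEnqueueMapBuffer(queue, buf, CL_TRUE,
                                             CL_MAP_READ | CL_MAP_WRITE,
                                             0, bytes, 0, NULL, NULL, &err);
    /* ... plain C/C++ reads and writes through ptr ... */
    clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
}
```

(Needs a live OpenCL queue and buffer to execute, so no test is attached.)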