I want to use the direct memory access capability of the on-chip GPU in the Fusion APU to read CPU-resident memory directly, without first copying it to the GPU memory partition. I have gone through the slides at http://amddevcentral.com/afds/assets/presentations/1004_final.pdf.
I am running Linux on a Trinity APU. I have a CPU-resident buffer (32 MB in my experiments) that I want the GPU to access directly, without first copying it to the GPU's own global memory, so I created an OpenCL buffer for it with CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR. From what I understand from the OpenCL manual, this should pin the host-resident buffer, and the GPU should then be able to access it directly without a copy into its RAM partition (= device global memory).
But from my experiments, I don't believe I am seeing direct-access / zero-copy behaviour. This is what I do: I create a host-resident buffer, then create an OpenCL buffer from it as above. My OpenCL kernel simply copies the contents of this buffer into a second buffer, which I then explicitly read back and compare against the host-resident buffer to see what the GPU saw. The first kernel call runs fine and the contents match. I then change the contents of the host-resident buffer and immediately make a second kernel call, but the buffer I read back after this second call does not match the modified host buffer. If I make a map/unmap call before the second kernel call, the two buffers match, but I suspect that map/unmap triggers a copy from the CPU memory partition to the GPU memory partition.
I have experimented with three different ways of creating the host-resident buffer, and I observe different behaviour with each:
1. I malloc a buffer, create an OpenCL buffer from it and make the kernel call. The buffer read back from the GPU matches the host buffer. I change the buffer contents and make another kernel call. The GPU does not see the updated buffer unless I wrap the host buffer update between map/unmap calls.
2. I create a shared memory segment in process #1 and attach to it in process #2, which creates an OpenCL buffer from it and then makes the kernel call. I change the buffer contents in process #1 and make another kernel call in process #2, without map/unmap and reusing the same OpenCL buffer as in the first kernel call. The GPU sees the updated buffer!
3. I create a memory-mapped file in process #1 and map the same file in process #2, which creates an OpenCL buffer from it and then makes the kernel call. I change the buffer (file) contents in process #1 and make another kernel call in process #2, reusing the same OpenCL buffer as in the first kernel call. The GPU does not see the updated buffer unless I make a map and unmap call before the second kernel call. The behaviour is the same irrespective of the mmap flavour: mmapping /dev/mem, a disk-resident file, or a ramfs-resident file.
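For reference, the CPU side of configurations 2 and 3 can be sketched roughly as below. This is only a minimal sketch under several assumptions: it uses a System V segment (shmget/shmat) for configuration 2, although a POSIX segment would be equally plausible; it stands in for the two processes with two separate attachments or mappings inside one process; and the helper names (write_then_check, sysv_segment_write_visible, shared_file_write_visible) are hypothetical. It only confirms that a write through one view is visible through the other on the CPU, which is the baseline the GPU result is compared against.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fill the writer view, then check the reader view. */
static int write_then_check(unsigned char *writer,
                            const unsigned char *reader, size_t len)
{
    memset(writer, 0xAB, len);
    return reader[0] == 0xAB && reader[len - 1] == 0xAB;
}

/* Configuration 2, CPU side only (assumes a System V segment).
 * Returns 1 if the write is visible, 0 if not, -1 on setup failure. */
int sysv_segment_write_visible(size_t len)
{
    assert(len > 0);
    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    /* Two attachments stand in for process #1 and process #2. */
    void *writer = shmat(id, NULL, 0);
    void *reader = shmat(id, NULL, SHM_RDONLY);
    shmctl(id, IPC_RMID, NULL); /* segment goes away after the last detach */

    int result = -1;
    if (writer != (void *)-1 && reader != (void *)-1)
        result = write_then_check(writer, reader, len);
    if (writer != (void *)-1)
        shmdt(writer);
    if (reader != (void *)-1)
        shmdt(reader);
    return result;
}

/* Configuration 3, CPU side only: two MAP_SHARED mappings of one file.
 * Returns 1 if the write is visible, 0 if not, -1 on setup failure. */
int shared_file_write_visible(const char *path, size_t len)
{
    assert(len > 0);
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }

    /* Two mappings stand in for process #1 and process #2. */
    void *writer = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    void *reader = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);    /* the mappings remain valid after close */
    unlink(path); /* remove the scratch file once it is mapped */

    int result = -1;
    if (writer != MAP_FAILED && reader != MAP_FAILED)
        result = write_then_check(writer, reader, len);
    if (writer != MAP_FAILED)
        munmap(writer, len);
    if (reader != MAP_FAILED)
        munmap(reader, len);
    return result;
}
```

Neither helper touches OpenCL. In the experiments above, the pointer obtained in process #2 is what gets wrapped in the CL_MEM_USE_HOST_PTR buffer, so the difference between configurations 2 and 3 shows up only on the GPU side.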
So map/unmap does have the effect I want, but I think it causes a CPU -> GPU copy underneath rather than letting the GPU access the host-resident buffer directly. How do I ascertain this? Timing information collected from CodeXL shows the clFinish() call after the unmap call taking longer than an explicit write transfer to the GPU memory partition would take.