I want to be able to use the direct memory access capability of the on-chip GPU inside the Fusion APU to access CPU resident memory directly without having to copy it over to the GPU memory partition first. I have gone through the slides on: http://amddevcentral.com/afds/assets/presentations/1004_final.pdf.
I am running Linux on my Trinity APU. I have a CPU-memory-resident buffer (32 MB in my experiments) that I want the GPU to access directly, without copying it over to its own global memory first, and I have created a CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR OpenCL buffer for this. From what I understand from the OpenCL manual, this should pin the host-resident buffer, and the GPU should then be able to access it directly without copying it over to its RAM partition (= device global memory).
But from my experiments, I don't believe I am seeing direct-access / zero-copy behaviour. This is what I do: I create a host-resident buffer, then create an OpenCL buffer out of it as above. My OpenCL kernel simply copies this buffer's contents into another buffer, which I then explicitly read back and compare against the host-resident buffer to see what the GPU saw. The first kernel call runs fine, and I am able to verify the buffer contents. I then immediately follow that up with another kernel call after changing the host-resident buffer contents, but the buffer I read back after this second kernel call does not match my modified buffer. If I make a map/unmap call before making the second kernel call, the two buffers do match, but I suspect a copy is happening from the CPU to the GPU memory partition during map/unmap.
I have experimented with three different configurations for creating the host-resident buffer, and observed different behaviours:
1. I malloc a buffer, create an OpenCL buffer out of it, and make the kernel call. The buffer read back from the GPU matches the host buffer. I change the buffer contents and make another kernel call. The GPU does not see the updated buffer unless I wrap the host buffer update between map/unmap calls.
2. I create a shared memory segment in process #1 and attach to it in process #2, which creates an OpenCL buffer out of it and then makes the kernel call. I change the buffer contents in process #1 and make another kernel call in process #2, without map/unmap, reusing the same OpenCL buffer as in the first kernel call. The GPU sees the updated buffer!
3. I create a memory-mapped file in process #1 and map the same file in process #2, which creates an OpenCL buffer out of it and then makes the kernel call. I change the buffer (file) contents in process #1 and make another kernel call in process #2, reusing the same OpenCL buffer as in the first kernel call. The GPU does not see the updated buffer unless I make a map and unmap call before the second kernel call. The behaviour is the same irrespective of the mmap flavour: mmapping /dev/mem, a disk-resident file, or a ramfs-resident file.
So map/unmap does have the effect I desire, but I think it causes a CPU->GPU copy underneath rather than direct GPU access to the host-resident buffer. How do I ascertain this? Timing information collected from CodeXL shows the clFinish() call after the unmap taking more time than an explicit write transfer to the GPU memory partition would take.
First of all, I would like you to test your code with CL_MEM_ALLOC_HOST_PTR and CL_MEM_COPY_HOST_PTR. Let us know the result. If you have already tried these options and are still getting the same result, please do share the test case. "If CL_MEM_COPY_HOST_PTR is specified in the memory access qualifier values associated with buffer it does not imply any additional copies when the sub-buffer is created from buffer." Please check the OpenCL 1.2 spec.
I originally wanted the GPU to operate on an already existing CPU memory buffer (say X), which is why I was experimenting with CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR. With that use case in mind, I experimented with CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR, but it does not suffice. A new buffer (say Y) ends up being created as a result of these flags, and irrespective of any subsequent maps/unmaps, this buffer is distinct from the one used to create it (buffer X). So any changes to X are invisible to Y.
But relaxing the original CPU-memory-resident-buffer requirement for a while: if I create an OpenCL buffer directly with CL_MEM_ALLOC_HOST_PTR only and immediately map/unmap it, I obtain a (CPU-memory-resident?) buffer (Z) that seems to be directly accessible by both the CPU and the GPU. That is, any modifications I make to this buffer Z on the host side are reflected on the GPU side without any further maps/unmaps. CodeXL also doesn't show any extra copies or longer kernel execution or clFinish times that could suggest a copy underneath. So this might work if I instead use this mapped buffer Z as my CPU-memory-resident buffer. But the problem here is that I am limited by the size restrictions of the GPU memory partition on my system: even though I have 12 GB of memory, the GPU only has about 200 MB of global memory and about a 134 MB max memory allocation size. Exporting GPU_MAX_HEAP_SIZE=100 and GPU_MAX_ALLOC_PERCENT=100 has no effect, and my BIOS has no option to increase the GPU memory partition size.
So, either I have to get direct-access / zero-copy CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR working, or I need to be able to increase my GPU memory partition to fit my needs.
The GPU memory partition for APUs can be increased using BIOS settings. It is generally named VRAM memory. But it is a BIOS feature, and may or may not be present. As I understand it, your problem is that you need to process a very large buffer on the GPU, and creating copies of it is very expensive. In that case I suggest you use the CL_MEM_USE_HOST_PTR flag only. After this, you can try creating sub-buffers out of this big buffer. Then:
1. You can try copying these sub-buffers to device buffers and processing them there: probably good if the buffer is READ_ONLY.
2. You can pass them directly as kernel arguments: probably better for READ_WRITE buffers, to avoid extra copies.
Note the bandwidth of accessing cacheable memory from the GPU in the presentation linked above. It is not a recommended path, but it may be the only feasible one for you 🙂 All the best in your efforts. We would love to hear your experiences.
Thanks for your suggestions.
I got a new motherboard with the required BIOS support and was able to increase the VRAM size to 2 GB. I am able to observe the desired zero-copy / direct GPU memory access behaviour with the CL_MEM_USE_HOST_PTR and CL_MEM_ALLOC_HOST_PTR flags (separately), but only with the Catalyst beta driver and with neither of the stable Linux drivers. The driver (being beta, I guess) crashes on certain kinds of host-resident buffers but works for the three configurations I mentioned in my original post. The error is a kernel oops or BUG inside put_page+0x9/0x40, called from fglrx code that deals with lock_memory, lockPageableMemory, and unlockUserPages.
I am also trying to get zero copy working on Linux on an APU (A10-6800K).
I am having trouble achieving the desired results; if you could post your code, that would be very helpful.
I am particularly interested in a buffer in CPU memory, that the GPU can read from and modify.
Thank you, I was not mapping the pointer correctly (I was mapping it after kernel execution, when I should have been doing it before).
Now everything works well: I can both read and write, from both the CPU and the GPU, to a shared buffer allocated in the host's cacheable memory.