I have a Trinity APU (A10-5800K) and would like to use both the CPU and GPU in one context.
When I use the GPU in a single-device context, I can create buffers with CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR and the buffer is placed in uncached host memory. This gives the GPU fast memory access (around 25 GB/s) and allows buffer sizes beyond the GPU's dedicated memory. With the other flag combinations that include CL_MEM_ALLOC_HOST_PTR I see 6 GB/s, which is typical for cached host memory. (All good so far.)
When I have the CPU and GPU together in one context, I cannot get the uncached memory: I see around 6 GB/s memory bandwidth for all flag combinations that use host memory. There is no combination that gives me the fast uncached host memory.
Is there a way to use uncached host memory in a shared context or is this not meant to be?
Thanks a lot,
That is a very interesting observation.
Would it be possible for you to share your code here? I can forward it to the relevant AMD engineers.
On the other hand, just speculating: if you have both the CPU and GPU in the same context, the runtime might assume you intend to run OpenCL kernels on both devices. Placing the buffer in USWC memory would be very slow for the CPU, so the runtime is probably applying an optimization.
I thought it might be an optimization too.
But with a GPU-only context, I can create a USWC buffer and map it with zero-copy to a pointer that the host program can access. That access is awfully slow, but it is possible, so why not allow the same access from OpenCL on the CPU? It would be beneficial when I have a large kernel with lots of memory accesses on the GPU and only a small kernel on the CPU.
I will try to build a small example on the weekend.
Here comes the code. I have only tested it on Windows 7 so far.
I have a write kernel, which does just that: data[get_global_id(0)] = get_global_id(0);
for 8388608 integers = 32 MB.
I test 7 different buffer flags (flag combinations), and you can clearly see the two different bus systems at work.
The interesting row is the last one in each group.
---> Testing GPU in its own Context
CL_MEM_READ_WRITE = 15.2134 GB/s
CL_MEM_WRITE_ONLY = 15.5883 GB/s
CL_MEM_READ_ONLY = 15.4508 GB/s
CL_MEM_ALLOC_HOST_PTR = 6.02002 GB/s
CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE = 5.94496 GB/s
CL_MEM_ALLOC_HOST_PTR | CL_MEM_WRITE_ONLY = 6.01704 GB/s
CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY = 14.3714 GB/s
---> Testing GPU in shared Context
CL_MEM_READ_WRITE = 15.2083 GB/s
CL_MEM_WRITE_ONLY = 15.7109 GB/s
CL_MEM_READ_ONLY = 15.4362 GB/s
CL_MEM_ALLOC_HOST_PTR = 6.03106 GB/s
CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_WRITE = 6.01752 GB/s
CL_MEM_ALLOC_HOST_PTR | CL_MEM_WRITE_ONLY = 6.00192 GB/s
CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY = 6.03233 GB/s
I have filed a bug report; it is still pending resolution.
At this point, I cannot comment on whether this is a bug or not; I don't have first-hand information.
I am checking. If I learn anything, I will let you know.