I'm developing some simple algorithms on an APU (on a Windows machine) and I'd like the CPU and the integrated GPU to operate in parallel exploiting zero-copy.
As a sample, we can consider matrix multiplication: C = A * B, where the CPU takes the first N/2 rows of A while the GPU operates on the other N/2.
My question is: what is the best strategy for context/buffer creation to avoid (or at least minimize) copies of data and to speed up the computation?
I can think of a few solutions/tricks, such as:
1) Create sub-buffers from A for the two halves, used respectively by the CPU and the GPU: do sub-buffers imply copies?
2) Create a shared context for A and B, passing an offset to each device since A is entirely shared. What is the best way to allocate the buffers for A and B so that they are shared between the CPU device and the GPU one?
Moreover, are there any constraints on shared-buffer allocation/initialization (e.g. mem flags, mapping vs. enqueue read/write)?
3) Use the device fission extension to reserve a CPU core for GPU scheduling: has anybody measured the impact of dedicating a CPU core to scheduling the GPU, instead of having it work alongside the other cores on part of the computation?
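For point 1), my understanding is that sub-buffers created with `clCreateSubBuffer` alias the parent buffer's storage rather than copying it, but the region origin must satisfy each device's alignment requirement. A minimal sketch, assuming a row-major N×K matrix A already allocated as `bufA` in a context shared by both devices (all names here are placeholders, not from the question):

```c
/* Split bufA (N x K floats, row-major) into two sub-buffers:
 * the first N/2 rows for the CPU, the rest for the GPU.
 * Sub-buffers reference the parent's memory -- no copy -- but
 * "origin" must be a multiple of each device's
 * CL_DEVICE_MEM_BASE_ADDR_ALIGN (queried in bits), otherwise
 * clCreateSubBuffer returns CL_MISALIGNED_SUB_BUFFER_OFFSET. */
cl_buffer_region top = {
    .origin = 0,
    .size   = (N / 2) * K * sizeof(float)
};
cl_buffer_region bottom = {
    .origin = (N / 2) * K * sizeof(float),
    .size   = (N / 2) * K * sizeof(float)
};
cl_int err;
cl_mem aCpu = clCreateSubBuffer(bufA, CL_MEM_READ_ONLY,
                                CL_BUFFER_CREATE_TYPE_REGION, &top, &err);
cl_mem aGpu = clCreateSubBuffer(bufA, CL_MEM_READ_ONLY,
                                CL_BUFFER_CREATE_TYPE_REGION, &bottom, &err);
```

If the N/2 split point doesn't meet the GPU's base-address alignment, you may have to round the boundary or fall back to passing a row offset as a kernel argument instead.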
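For point 2), the usual route to zero-copy on an APU is to create one context containing both devices and allocate the buffers with `CL_MEM_ALLOC_HOST_PTR`, then initialize them via mapping rather than `clEnqueueWriteBuffer` (an enqueued write generally implies a copy). A hedged sketch, assuming `ctx` is a context over both the CPU and GPU devices and `queue` is a command queue on one of them:

```c
/* Allocate A in host-visible memory so both devices can access it
 * without copies (on AMD APUs, CL_MEM_ALLOC_HOST_PTR is the typical
 * zero-copy path; whether it truly avoids copies is driver-dependent). */
size_t bytes = (size_t)N * K * sizeof(float);
cl_int err;
cl_mem bufA = clCreateBuffer(ctx,
                             CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                             bytes, NULL, &err);

/* Initialize through a map/unmap pair instead of an enqueued write. */
float *hostA = (float *)clEnqueueMapBuffer(queue, bufA, CL_TRUE,
                                           CL_MAP_WRITE, 0, bytes,
                                           0, NULL, NULL, &err);
/* ... fill hostA with the matrix data ... */
err = clEnqueueUnmapMemObject(queue, bufA, hostA, 0, NULL, NULL);
```

One known constraint: both devices must belong to the same platform to share a context, and the mem flags chosen at creation (e.g. `CL_MEM_USE_HOST_PTR` vs. `CL_MEM_ALLOC_HOST_PTR`) determine whether the runtime can place the buffer in shared memory at all.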
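For point 3), the partitioning itself can be done either with the `cl_ext_device_fission` extension (`clCreateSubDevicesEXT`, OpenCL 1.1) or with the core `clCreateSubDevices` API from OpenCL 1.2. A sketch of the 1.2 variant, assuming `cpuDev` is the CPU `cl_device_id` and `numCores` its `CL_DEVICE_MAX_COMPUTE_UNITS` (I can't speak to the performance impact, which is what the question asks):

```c
/* Carve a (numCores - 1)-core sub-device out of the CPU, leaving one
 * core free for the host thread that feeds the GPU queue. The context
 * would then be built over { cpuSubDev, gpuDev } instead of the full CPU. */
cl_device_partition_property props[] = {
    CL_DEVICE_PARTITION_BY_COUNTS,
    (cl_device_partition_property)(numCores - 1),
    CL_DEVICE_PARTITION_BY_COUNTS_LIST_END,
    0
};
cl_device_id cpuSubDev;
cl_uint numReturned;
cl_int err = clCreateSubDevices(cpuDev, props, 1, &cpuSubDev, &numReturned);
```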
Thank you very much!
P.S. I know that the CPU cannot contribute significantly to speeding up a matrix multiplication, since its completion time is orders of magnitude higher than the GPU's. This is just for testing purposes (and will later be applied to more "heterogeneous-oriented" algorithms).