
Best way to write heterogeneous matrix multiplication on Fusion

Question asked by cadorino on Mar 5, 2012


I'm developing some simple algorithms on an APU (on a Windows machine), and I'd like the CPU and the integrated GPU to work in parallel while exploiting zero-copy.

As an example, consider matrix multiplication: C = A * B, where the CPU takes the first N/2 rows of A while the GPU operates on the other N/2.
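To make the split concrete, here is a minimal host-side sketch in plain C, with no OpenCL calls, assuming square row-major N x N matrices. The names `matmul_rows` and `matmul_split` are hypothetical and only illustrate the partitioning: each device would run the same loop over its own row range, and the two calls in `matmul_split` stand in for the CPU and GPU shares.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: computes rows [row_begin, row_end) of C = A * B
   for square, row-major n x n matrices. */
void matmul_rows(const float *A, const float *B, float *C,
                 size_t n, size_t row_begin, size_t row_end)
{
    assert(row_begin <= row_end && row_end <= n);
    for (size_t i = row_begin; i < row_end; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}

/* Stand-in for the two devices: the "CPU" takes the first n/2 rows,
   the "GPU" takes the rest (one row more than the CPU when n is odd). */
void matmul_split(const float *A, const float *B, float *C, size_t n)
{
    size_t half = n / 2;
    matmul_rows(A, B, C, n, 0, half);   /* CPU share */
    matmul_rows(A, B, C, n, half, n);   /* GPU share */
}
```

Note that each half reads only its own rows of A but all of B, and writes only its own rows of C, so the two halves never write to the same memory.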


My question is: what is the best strategy for context/buffer creation to avoid (or at least minimize) copies of data and to speed up the computation?

I can think of some solutions/tricks, such as:


1) Create sub-buffers of A for the two halves used by the CPU and the GPU respectively. Do sub-buffers imply copies?

2) Create a shared context holding A and B, and pass an offset to each device, since A is entirely shared. What is the best way to allocate the buffers for A and B so that they are shared between the CPU device and the GPU device?

Moreover, are there any constraints on shared buffer allocation/initialization (e.g. mem flags, mapping, enqueue read/write)?

3) Use the device fission extension to reserve a CPU core for GPU scheduling. Has anybody checked the impact of dedicating a CPU core to scheduling work on the GPU, instead of having it work together with the other cores on part of the computation?
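For option 1, the byte regions of the two halves could be computed as in the sketch below. This is plain C with no OpenCL calls: `region_t` is a hypothetical stand-in that mirrors the `origin` and `size` byte fields of `cl_buffer_region`, and `align_bits` stands for whatever value a device reports for `CL_DEVICE_MEM_BASE_ADDR_ALIGN`, which is expressed in bits. As far as I understand, `clCreateSubBuffer` fails with `CL_MISALIGNED_SUB_BUFFER_OFFSET` when the origin is not aligned to that value, so it seems worth checking the second half's offset before creating the sub-buffers.

```c
#include <stddef.h>

/* Hypothetical stand-in for cl_buffer_region: both fields are in bytes. */
typedef struct {
    size_t origin;
    size_t size;
} region_t;

/* Region covering rows [row_begin, row_end) of a row-major float matrix
   with n columns. */
region_t rows_region(size_t n, size_t row_begin, size_t row_end)
{
    region_t r;
    r.origin = row_begin * n * sizeof(float);
    r.size   = (row_end - row_begin) * n * sizeof(float);
    return r;
}

/* Returns 1 if the region origin is a multiple of the alignment, which is
   given in bits (the unit used by CL_DEVICE_MEM_BASE_ADDR_ALIGN). */
int origin_is_aligned(region_t r, size_t align_bits)
{
    size_t align_bytes = align_bits / 8;
    return align_bytes == 0 || r.origin % align_bytes == 0;
}
```

For example, with N = 1024 the GPU half would be `rows_region(1024, 512, 1024)`, whose origin is a multiple of any power-of-two alignment up to 2 MiB. Small or odd matrix sizes can produce a misaligned origin, in which case passing a row offset to the kernel (option 2) may be the simpler route.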


Thank you very much!


P.S. I know that the CPU cannot contribute significantly to the speed-up of a matrix multiplication, since its completion time is orders of magnitude higher than the GPU's. This is just for testing purposes (and the approach will be applied to more "heterogeneous-oriented" algorithms).