0 Replies Latest reply on Mar 5, 2012 3:29 AM by cadorino

    Best way to write heterogeneous matrix multiplication on Fusion



      I'm developing some simple algorithms on an APU (on a Windows machine) and I'd like the CPU and the integrated GPU to operate in parallel, exploiting zero-copy.

      As a sample, we can consider matrix multiplication: C = A * B, where the CPU takes the first N/2 rows of A while the GPU operates on the other N/2.


      My question is: what is the best strategy for context/buffer creation to avoid (or at least minimize) copies of data and to speed up the computation?

      I can think of a few solutions/tricks, such as:


      1) Create sub-buffers from A for the two halves used respectively by the CPU and the GPU: do sub-buffers imply copies?

      2) Create a shared context for A and B, passing an offset to each device since A is entirely shared. What is the best way to allocate the buffers for A and B so that they are shared between the CPU device and the GPU one?

      Moreover, are there any constraints on shared buffer allocation/initialization (e.g. mem flags, mapping vs. enqueue read/write)?

      3) Use the device fission extension to reserve a CPU core for GPU scheduling: has anybody checked the impact of dedicating a CPU core to scheduling work on the GPU instead of having it join the other cores in performing part of the computation?


      Thank you very much!


      P.S. I know that the CPU cannot contribute significantly to the speed-up of a matrix multiplication, since its completion time is orders of magnitude higher than the GPU's. This is just for testing purposes (and will be applied to more "heterogeneous-oriented" algorithms).