I want to multiply two matrices, doing C = A * B.
I have three OpenCL devices in my context (an AMD CPU and two AMD GPUs, one integrated and one discrete), and I'd like to split the computation and the data among all of them.
The goal is for each device to use the whole matrix B and a slice of A (some of its rows) to compute the corresponding rows of C.
While writing the code, I discovered that clCreateBuffer creates a buffer that is shared among all the devices in the context. That leaves me with two questions I have not been able to answer.
1) How is allocation on a particular device related to the call to clCreateBuffer? Does the allocation happen as soon as the function is called, when the buffer is initialized, or when the content of the buffer is first used by a kernel? Since I'm using a discrete GPU and an integrated one, it would be nice to allocate the buffer for B in pinned host memory, so that a copy is (or may be) allocated on the discrete card but no allocation is made for the integrated GPU (it would be better to rely on memory sharing to access B without a copy). Do current drivers implement smart strategies like this, even in a context that mixes integrated and discrete GPUs?
2) How does buffer initialization behave? Do I have to call clEnqueueWriteBuffer (or clEnqueueMapBuffer followed by a write) once for each device, or is it enough to do it once, so that all the devices see the content written to the buffer?
Thank you very much!