I'm testing some stencil code on heterogeneous architectures by using 2 GPUs.
To keep the data on the two GPUs in sync, I tried using the functions clEnqueueWriteBufferRect and clEnqueueReadBufferRect to transfer 1000 bytes of data from table A on GPU_1 to table A' on GPU_2.
Then I noticed this: the data transfer overhead increases linearly with the size of table A, even though we always transfer only 1000 bytes from it. Has anyone else noticed this? Is there a solution?
Could you please explain your program flow in detail (e.g. how the context and memory objects have been created on the multiple devices) and state your goal and observations more explicitly? Sample code that reproduces your problem would be greatly appreciated.
Also, please let us know your system setup: hardware (CPU, GPU), Catalyst driver and APP SDK versions, OS, etc.
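While preparing the sample, it may help to check how the 1000 bytes are laid out inside table A. The post does not say, so the following is only a sketch under the assumption that the region is strided (for example a column of the table). It is plain host-side C, not OpenCL code: it emulates the addressing that the Rect functions define, where the start offset is origin[2] * slice_pitch + origin[1] * row_pitch + origin[0]. The helper names rect_offset, rect_span_2d and copy_rect_2d are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of a rect origin, using the formula from the OpenCL spec:
   origin[2] * slice_pitch + origin[1] * row_pitch + origin[0]. */
static size_t rect_offset(const size_t origin[3], size_t row_pitch,
                          size_t slice_pitch)
{
    return origin[2] * slice_pitch + origin[1] * row_pitch + origin[0];
}

/* Distance in bytes from the first to the last byte touched by a 2D region
   of `rows` rows, each `width_bytes` wide, in a buffer with `row_pitch`. */
static size_t rect_span_2d(size_t width_bytes, size_t rows, size_t row_pitch)
{
    assert(rows > 0 && width_bytes <= row_pitch);
    return (rows - 1) * row_pitch + width_bytes;
}

/* Host-side emulation of a 2D rect read: gathers the strided region from
   `src` into a tightly packed `dst`. Returns the payload size in bytes.
   Assumes origin[2] == 0 (a single slice). */
static size_t copy_rect_2d(unsigned char *dst, const unsigned char *src,
                           const size_t origin[3], size_t width_bytes,
                           size_t rows, size_t row_pitch)
{
    size_t base = rect_offset(origin, row_pitch, 0);
    for (size_t r = 0; r < rows; ++r)
        memcpy(dst + r * width_bytes, src + base + r * row_pitch, width_bytes);
    return rows * width_bytes;
}
```

With these helpers, a column of 250 rows of 4 bytes is a 1000-byte payload, but with a 4000-byte row pitch rect_span_2d(4, 250, 4000) gives 996004 bytes, and the span grows in proportion to the row pitch. Whether this has anything to do with the overhead you measured depends on the driver and on your actual layout, which is why the sample code and setup details are needed.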