I have the following situation: a machine with 2 AMD GPUs and 1 CPU, all of which are presented as OpenCL devices. I have created queues for all 3 devices in "out-of-order" mode in the same Context. I have 3 kernels (2 OpenCL, one Native) and 2 memory buffers; the first is created using CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR to create the buffer and copy the data in one step.
I pass the first buffer as input to the 2 GPU kernels, and I queue the 2 OpenCL kernels, one to each GPU, with the event from the first passed as a wait event to the second. I then queue the Native kernel onto the CPU queue with the event from the second GPU kernel as its wait event, and I pass both memory buffers to the Native kernel.
My understanding was that the Context manages the movement of memory buffers between devices as required, and that since I have used events to synchronise the 3 kernels I should get the correct result. However, although the kernels run one after the other (I have looked at the start and end times for the 3 kernel events; they do not overlap and are in the correct order), the data written by the second kernel is not propagated to the third kernel before it executes, so it uses out-of-date data (seemingly the data as it was after just the first kernel).
If I put a readBuffer command before the 3rd kernel launch, so that the data is forced back to the host, then all is well, but I thought that OpenCL did this for me without my having to manage it. Have I misunderstood this?