Normally, you can do this serially:
enqueueWriteBuffer() on GPU1 buffer
enqueueNDRangeKernel(kernel1) on GPU1
enqueueReadBuffer() on GPU1 buffer so result1 is now back in RAM
enqueueWriteBuffer() on GPU2 buffer so result1 is now at GPU2 as input of kernel2
enqueueNDRangeKernel(kernel2) on GPU2
enqueueReadBuffer() on GPU2 buffer so result2 is now in RAM
and run multiple instances of this software to crunch multiple independent data inputs; the drivers should then do the necessary overlapping of buffer copies and kernel computes across instances. For example, in an image-processing task you could process multiple image folders by giving each folder its own instance of the software. But with a single software instance and pipelining, you use fewer contexts per device and have explicit control over timings. Some pro cards can even shorten the path between two GPUs, as in OpenCL - GPU to GPU transfer
In a pipeline, you can also duplicate the input and output buffers (double buffering) so they can be used for two things at the same time: while one buffer is being computed on, the other is being copied.