I have only one device, so I haven't run into this problem and unfortunately can't answer your first question.
But your second question is very relevant to me. I posted my own question about an IL output array without a global buffer on 15.06.2009, but nobody has answered yet.
I think the best way to transfer data is to use DMA, or maybe a copy shader, to copy data from a host resource to a local resource and vice versa. Both DMA and a copy shader can be asynchronous.
Still, in this case you need to map the host resource to copy data from a CPU pointer, but I think the map-unmap time would be much shorter.
Well, map-unmap time isn't that important for me, as there is only 16-128 KB of data to copy while the kernel runs for 200 ms minimum. The real problem is the calResMap sync issues.
Without resorting to multiple processes, it's possible to do this:
1. map input data #1
2. map input data #2
3. start kernel on gpu #1
4. start kernel on gpu #2
5. wait for both of them to complete
6. map output data #1
7. map output data #2
(The funny thing is that you'll need two threads to wait for the kernels, because calCtxIsEventDone() blocks the calling thread until the kernel has finished, while the documentation says exactly the opposite. I already wrote about this in another unanswered topic. Anyway.)
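To illustrate the "one waiting thread per GPU" idea, here is a minimal Python sketch. It does not call CAL at all: `run_kernel_blocking` is a hypothetical stand-in that just sleeps, imitating a wait call that blocks its thread until the kernel is done. The point is only that with one thread per device the two waits overlap instead of running back to back.

```python
import threading
import time

def run_kernel_blocking(device_id, seconds):
    """Hypothetical stand-in for 'start kernel, then wait for it'.

    The sleep imitates a wait call that blocks the calling thread
    until the kernel on this device has finished.
    """
    time.sleep(seconds)
    return f"gpu{device_id} done"

def run_on_all(durations):
    """Run one blocking wait per device, each in its own thread."""
    results = [None] * len(durations)

    def worker(i):
        results[i] = run_kernel_blocking(i, durations[i])

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(len(durations))]
    start = time.monotonic()
    for t in threads:
        t.start()
    for t in threads:      # step 5: wait for both of them to complete
        t.join()
    elapsed = time.monotonic() - start
    return results, elapsed

if __name__ == "__main__":
    results, elapsed = run_on_all([0.1, 0.2])
    print(results, f"{elapsed:.2f}s")
```

The total time is roughly that of the slowest kernel rather than the sum of both, which is why a second thread is needed if the wait call really blocks.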
It works, but there are two more problems:
1. While inputs/outputs are being mapped, ALL GPUs sit idle doing nothing. That's not much of a problem for small inputs/outputs, but it's a really big problem for large data sets.
2. Execution speed is limited by the slowest GPU. That's no problem if there are only identical GPUs in the system (say, one 4870x2), but it is a problem when mixing different ones (like the 4770 + 4850 I have at the moment). In theory, it can be resolved by carefully selecting the input data size so that each kernel runs for (almost) the same time even on different GPUs. But in practice it's way too complex.
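The "carefully selecting input data size" idea can be sketched as a proportional split. This is only an illustration under assumptions: `split_work` is a hypothetical helper, and the relative speeds are made-up numbers that would have to be measured per GPU (for example by timing a small test kernel). It also assumes kernel time scales linearly with input size, which is exactly the part that tends to break in practice.

```python
def split_work(total_items, relative_speeds):
    """Split total_items across devices in proportion to their speed.

    relative_speeds are assumed to be measured beforehand; only their
    ratios matter. Rounding leftovers go to the fastest device.
    """
    total_speed = sum(relative_speeds)
    shares = [int(total_items * s / total_speed) for s in relative_speeds]
    remainder = total_items - sum(shares)
    fastest = max(range(len(relative_speeds)),
                  key=lambda i: relative_speeds[i])
    shares[fastest] += remainder
    return shares

if __name__ == "__main__":
    # Made-up ratio: the second GPU is 1.5x as fast as the first.
    print(split_work(1000, [2, 3]))   # [400, 600]
    print(split_work(10, [1, 2]))     # [3, 7]
```

With a split like this both kernels should finish at about the same time, so neither GPU waits long in step 5.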
Anyway, the calResMap behaviour is just the result of poor design by the ATI team, and it could be improved with some work on ATI's side, not ours. Unfortunately, I can see a clear message from ATI: "We simply don't care".