Some hours of fiddling later, I have a program that creates a context on each device and launches the same kernels on the same data on every device simultaneously. I was expecting the overall execution time to stay about the same, plus a bit of overhead. Instead, the execution time has doubled.
Presumably the kernels are not running concurrently as they should be. Do I need two monitors attached to the card or something? (I seem to remember something like this being mentioned before.)
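
A minimal sketch of how the serial-versus-concurrent comparison could be timed, assuming one host thread per device. `run_on_device` is a hypothetical stand-in (here it only sleeps); the real version would create the context on that device, launch the kernel, and block until it finishes. None of the names below come from the actual program.

```python
import threading
import time


def run_on_device(device_index, results):
    """Hypothetical placeholder for the per-device work.

    Replace the sleep with: create a context on device_index, launch the
    kernel, and wait for it to complete.
    """
    start = time.perf_counter()
    time.sleep(0.2)  # stand-in for kernel launch + synchronise
    results[device_index] = time.perf_counter() - start


def time_serial(num_devices):
    """Run the per-device work one device after another; return wall time."""
    results = {}
    start = time.perf_counter()
    for i in range(num_devices):
        run_on_device(i, results)
    return time.perf_counter() - start, results


def time_concurrent(num_devices):
    """Run the per-device work from one host thread per device."""
    results = {}
    threads = [
        threading.Thread(target=run_on_device, args=(i, results))
        for i in range(num_devices)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start, results


if __name__ == "__main__":
    serial_total, _ = time_serial(2)
    concurrent_total, per_device = time_concurrent(2)
    print(f"serial:     {serial_total:.3f} s")
    print(f"concurrent: {concurrent_total:.3f} s")
    for dev, t in sorted(per_device.items()):
        print(f"  device {dev}: {t:.3f} s")
```

If the concurrent wall-clock time with the real kernels substituted in comes out close to the serial time, the launches are probably being serialised somewhere, for example by a blocking call issued from a single host thread.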