I was trying to overlap data transfer with kernel execution on my Radeon HD 5970 to hide the transfer overhead. To do this, I created two separate queues (one for data transfer and one for kernel execution) and used events to synchronize them.
However, when I looked at the event profiling information, I did not see any overlap: the transfer and the kernel still appear to run one after the other.
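To be concrete about what I mean by "overlap", here is a minimal sketch of the check. It assumes the queues are OpenCL command queues created with `CL_QUEUE_PROFILING_ENABLE`, and that the start and end timestamps of the transfer event and the kernel event were read with `clGetEventProfilingInfo` (`CL_PROFILING_COMMAND_START` / `CL_PROFILING_COMMAND_END`). The `event_window` type and the `overlap_ns` function are hypothetical names used only for illustration; they are not part of OpenCL.

```c
#include <stdint.h>

/* One command's profiling window in nanoseconds.
   Hypothetical helper type, not part of OpenCL. In a real host program the
   fields would be filled from clGetEventProfilingInfo using
   CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END. */
typedef struct {
    uint64_t start_ns;
    uint64_t end_ns;
} event_window;

/* Returns how many nanoseconds both commands were executing at the same
   time, or 0 if their windows do not intersect. */
uint64_t overlap_ns(event_window a, event_window b)
{
    uint64_t latest_start = a.start_ns > b.start_ns ? a.start_ns : b.start_ns;
    uint64_t earliest_end = a.end_ns < b.end_ns ? a.end_ns : b.end_ns;
    return earliest_end > latest_start ? earliest_end - latest_start : 0;
}
```

With made-up timestamps, a transfer window of 1000 to 5000 ns and a kernel window of 3000 to 9000 ns would give 2000 ns of overlap. In my runs the kernel window only starts once the transfer window has ended, so the result is always 0. This comparison is only meaningful if both events report timestamps from the same device clock, which I am assuming is the case for two queues on the same device.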
According to the ATI Stream Programming Guide, it should be possible to run data transfers and GPU computation in parallel. Has anyone actually managed to achieve this?