Both can be used, but I think the second one would be better, as it would help hide the setup time required by the Brook+ streamRead.
First one. The kernel launch waits for the streamReads on all the streams passed as arguments to complete, and only then will it return.
However, kernel launch setup time usually also includes kernel compilation, so it might be a good idea to sleep for a shorter period after the stream reads.
A.read(buf1);
B.read(buf2);
C.read(buf3);
Sleep(t); // t < 15
ABC_kernel(A,B,C,dest);
Sleep(20);
Only on first kernel call.
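
To illustrate the ordering, here is a minimal sketch in plain C++. It is not Brook+ code: the asynchronous stream reads are modeled with std::async, and the kernel launch is a function that blocks until all of its input reads have finished, as described above. The names async_read, launch_kernel and run_pipeline, and the 10 ms transfer time, are assumptions made up for this illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical stand-in for an asynchronous stream read: copies the host
// buffer on a background thread and takes a simulated 10 ms to do so.
std::future<std::vector<float>> async_read(const std::vector<float>& host_buf) {
    return std::async(std::launch::async, [host_buf] {
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
        return host_buf;
    });
}

// Hypothetical kernel launch: waits for every pending read passed to it,
// then adds the three inputs element-wise.
std::vector<float> launch_kernel(std::future<std::vector<float>> a,
                                 std::future<std::vector<float>> b,
                                 std::future<std::vector<float>> c) {
    const std::vector<float> va = a.get();  // blocks until read A is done
    const std::vector<float> vb = b.get();  // blocks until read B is done
    const std::vector<float> vc = c.get();  // blocks until read C is done
    assert(va.size() == vb.size() && vb.size() == vc.size());

    std::vector<float> dest(va.size());
    for (std::size_t i = 0; i < va.size(); ++i) {
        dest[i] = va[i] + vb[i] + vc[i];
    }
    return dest;
}

// Same order as the snippet above: start three reads, wait briefly on the
// host while they run in the background, then launch the kernel.
std::vector<float> run_pipeline(const std::vector<float>& buf1,
                                const std::vector<float>& buf2,
                                const std::vector<float>& buf3,
                                std::chrono::milliseconds host_wait) {
    auto A = async_read(buf1);
    auto B = async_read(buf2);
    auto C = async_read(buf3);
    std::this_thread::sleep_for(host_wait);  // plays the role of Sleep(t)
    return launch_kernel(std::move(A), std::move(B), std::move(C));
}
```

Because the launch waits on its inputs, the result is correct whatever value is passed for host_wait. The sleep only changes how much of the read time overlaps with the host-side wait.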