Hi Micael
Thanks, I am starting to see things more clearly now.
Regarding question number 4, I meant the variables that carry intermediate results during multi-step calculations within the same kernel.
But from your reply, it seems I can just pass some dummy streams to the first kernel and keep using them in the other kernels. In all cases, though, will these dummy streams always have to appear in the argument list of each kernel?
I see from your reply that the actual physical data transfer happens during streamRead & streamWrite.
This differs from my previous understanding, which was that streamRead & streamWrite are a sort of "malloc" for the GPU plus pointer setup, and that the actual data transfer takes place when I invoke the kernel. It seems I was wrong.
New Question:
When I call a kernel several times within my program, does the kernel's GPU-executable code get loaded to the GPU, DURING RUNTIME, every time I call it (the way an interpreter works), or
do all kernels get loaded to the GPU at the beginning of my C/C++ program, even if I will not use some of them?
Best Regards
Amr