1) When you call streamRead(), that kicks off the memory transfer from system memory to GPU memory. When the transfer completes (not when streamRead() returns, since it is an async call), the data resides in GPU memory.
2) Streams are not flushed from GPU memory after each kernel call. The data still resides in the GPU memory.
3) When you talk about the real time consumer, are you asking which takes most of the time? Both are async function calls, so both schedule the action but do not wait. How much time each takes depends on how much data you are transferring and how much computation you are doing. streamWrite() calls, however, do block (since they must guarantee the data is available in your C array when they return).
4) If you are passing data between kernels, keeping the streams in GPU memory (not writing them back to the system memory) is the best idea for now.
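To make the flow concrete, here is a rough Brook+-style sketch (the kernel names, stream sizes, and constants are hypothetical, purely for illustration) of chaining two kernels through an intermediate stream that never leaves GPU memory:

```
// kernels.br -- hypothetical kernels, compiled with brcc
kernel void scale(float a<>, float factor, out float b<>) {
    b = a * factor;
}
kernel void offset(float a<>, float delta, out float b<>) {
    b = a + delta;
}

// host side
float input[1024], result[1024];
float s<1024>, tmp<1024>, out<1024>;

streamRead(s, input);      // async: schedules CPU -> GPU transfer
scale(s, 2.0f, tmp);       // async: result tmp stays in GPU memory
offset(tmp, 1.0f, out);    // async: reads tmp directly from GPU memory
streamWrite(out, result);  // blocks until kernels finish and data is copied back
```

Note that tmp is never read back to system memory; it only exists as a hand-off between the two kernel calls.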
Thanks, I'm starting to see things more clearly now.
Regarding question 4, I meant the variables used to carry intermediate results during multi-step calculations within the same kernel.
From your reply, it seems I can just pass some dummy streams to the first kernel and keep reusing them in the other kernels. But in all cases, will these dummy streams always have to appear in the argument list of each kernel?
I see from your reply that the actual physical data transfer is during streamRead & streamWrite.
This is different from my previous understanding, which was that streamRead & streamWrite were some sort of "malloc" for the GPU plus pointer setup, and that the actual data transfer took place when I invoked the kernel. It seems I was wrong.
When I call a kernel several times within my program, does the kernel's GPU-executable code get loaded to the GPU, at runtime, every time I call it (the way an interpreter works), or
do all kernels get loaded to the GPU at the beginning of my C/C++ program, even if I will not use some of them?
Another question on the same topic: at what point does the CPU wait for the GPU calculations to finish, when I call streamWrite() or when I call the next kernel? To phrase the question differently, is there something like a queue for kernels?
Only streamWrite() blocks. streamRead() and kernel calls are asynchronous (though a kernel call does wait for its inputs to finish transferring before it executes).
This does lead to some confusion when doing timing, since you will tend to incorrectly believe streamRead() and the kernel call are super-fast and streamWrite() is super-slow. In reality, streamWrite() is likely just getting dinged for most of the calculation, since it has to wait for the kernel execution to finish.
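A quick timing sketch shows the pitfall (now_ms() is a hypothetical wall-clock helper, and myKernel is a placeholder kernel name):

```
float s<1024>, out<1024>;
float input[1024], result[1024];

double t0 = now_ms();
streamRead(s, input);       // returns almost immediately (async)
double t1 = now_ms();
myKernel(s, out);           // returns almost immediately (async)
double t2 = now_ms();
streamWrite(out, result);   // blocks: absorbs the transfer and kernel time
double t3 = now_ms();
// t1-t0 and t2-t1 look tiny; t3-t2 contains nearly all the real work.
```

If you want meaningful per-stage numbers, time the whole read/compute/write sequence as a unit rather than trusting the individual call durations.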