4 Replies Latest reply on May 21, 2008 6:33 AM by michael.chu

    some performance tuning questions

    bayoumi
      Hi,
      I have some questions regarding i/o optimization:
1- When we do a streamRead, is the stream saved in the system's (host) memory or in the GPU card's local memory?
2- Do the streams get flushed from the GPU after each kernel finishes?
3- Which is the real time consumer when it comes to passing streams: the streamRead or the kernel call?
4- What is the best way to use temp variables for intermediate storage? The only solution I found is to pass dummy streams without streamRead/streamWrite. Brook does not currently allow temp local storage (in either array or stream format). My goal is to minimize unnecessary I/O operations for these temp variables.

      Thanks
      Amr
        • some performance tuning questions
          michael.chu
          Hi Amr,

1) When you call streamRead(), that kicks off the memory transfer from system memory to GPU memory. When the transfer completes (not when streamRead() returns, since it is an async call), the data resides in GPU memory.

          2) Streams are not flushed from GPU memory after each kernel call. The data still resides in the GPU memory.

3) When you talk about the real time consumer, do you mean what takes most of the time? Both are async function calls, so both schedule the action but do not wait. How much time each takes depends on how much data you are transferring and how much computation you are doing. streamWrites, however, do block (since they must guarantee the data is available in your C array when they return).

          4) If you are passing data between kernels, keeping the streams in GPU memory (not writing them back to the system memory) is the best idea for now.
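As a hedged illustration of this point, a two-kernel pipeline can keep its intermediate stream entirely in GPU memory. This is a minimal Brook+-style sketch; the kernel names (scale, accumulate), the size 1024, and the host array names are all made up for illustration and are not from the original post:

```
// Hypothetical two-step pipeline. The intermediate stream 'tmp' is
// never touched by streamRead() or streamWrite(), so it lives only
// in GPU memory between the two kernel calls.
kernel void scale(float a<>, out float tmp<>) {
    tmp = a * 2.0f;
}

kernel void accumulate(float tmp<>, float b<>, out float c<>) {
    c = tmp + b;
}

int main() {
    float input[1024], bias[1024], result[1024];
    float a<1024>, b<1024>, tmp<1024>, c<1024>;

    streamRead(a, input);      // async host -> GPU copy
    streamRead(b, bias);

    scale(a, tmp);             // tmp is produced on the GPU...
    accumulate(tmp, b, c);     // ...and consumed there directly

    streamWrite(c, result);    // blocking GPU -> host copy
    return 0;
}
```

The point is that only the true inputs and the final output cross the PCIe bus; the intermediate data never pays the transfer cost.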

          Michael.
          • some performance tuning questions
            bayoumi
Hi Michael,
Thanks, I'm starting to see things more clearly now.
Regarding question number 4, I meant the variables that carry intermediate results during multi-step calculations within the same kernel.

But from your reply, it seems I can just send some dummy streams to the first kernel and keep using them in other kernels. But in all cases, will these dummy streams always have to be in the argument list of each kernel?

I see from your reply that the actual physical data transfer happens during streamRead & streamWrite.
This is different from my previous understanding that streamRead & streamWrite were a sort of "malloc" for the GPU plus pointer setting, and that the actual data transfer took place when I invoked the kernel. It seems I was wrong.

New question:
When I call a kernel several times within my program, does the kernel's GPU-executable code get loaded to the GPU, DURING RUNTIME, every time I call it (the way an interpreter works), or
do all kernels get loaded to the GPU at the beginning of my C/C++ program, even if I will not use some of them?
            Best Regards
            Amr
            • some performance tuning questions
              nberger
              Hi Michael,
another question on the same topic: at what point does the CPU wait for GPU calculations to finish, when I call streamWrite() or when I call the next kernel? Or to phrase the question differently, is there something like a queue for kernels?

              Thanks

              Nik
                • some performance tuning questions
                  michael.chu
                  Hi Nik,

Only streamWrite() blocks. streamRead() and kernel calls are asynchronous (the kernel call does wait for its inputs to be transferred and ready before executing).

This can lead to some confusion when doing timing, since you will tend to incorrectly believe that streamRead() and the kernel call are super-fast and that streamWrite() is super-slow. streamWrite() is likely just getting dinged for most of the calculation, since it has to wait for the kernel execution to finish.
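To make the timing pitfall concrete, here is a hedged host-side sketch. The timer helper now() and the kernel name my_kernel are assumptions for illustration, not part of Brook+; the streamRead/streamWrite/kernel-call behavior is as described above:

```
// Naive wall-clock timing of Brook+ calls. Because streamRead() and
// the kernel call are asynchronous, t1 - t0 and t2 - t1 measure
// almost nothing; the kernel's real cost shows up in t3 - t2.
double t0 = now();
streamRead(a, input);    // returns immediately (transfer in flight)
double t1 = now();       // t1 - t0: tiny

my_kernel(a, c);         // returns immediately (kernel just queued)
double t2 = now();       // t2 - t1: tiny

streamWrite(c, result);  // blocks until the transfer AND the kernel finish
double t3 = now();       // t3 - t2: includes the kernel execution time
```

So to time the whole pipeline meaningfully, measure from before the first streamRead() to after streamWrite() returns, rather than attributing t3 - t2 to the write alone.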

                  Michael.