4 Replies Latest reply on Jun 6, 2009 7:30 PM by Raistmer

    How costly is a kernel call?

    Raistmer
      How many CPU cycles does it take to prepare a kernel launch?

      I use fairly simple kernels but call them many times in the program.

      The Brook version with the CPU backend performs worse than the pure CPU version, but the CAL backend performs even worse!

      Performance degrades many-fold when running on the GPU (both elapsed and CPU time).

      I use an HD4870 for benchmarking, which is not the slowest card, so this result is pretty discouraging.

      When I added RDTSC-based counters to see which kernel took the longest, all counters returned approximately the same mean tick count no matter which kernel was running.

      This suggests that the actual running time of my simple kernels is very low and completely hidden by kernel launch preparation, which takes the vast majority of the running time.

      So the question is: is there any information available on how much CPU time a very simple kernel call (for example, stream A + stream B) takes?

      What is the recommended kernel length for a kernel to be useful (to decrease the app's running time instead of increasing it)?
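One way to check the suspicion above is to time a call that does no work next to a call that does a little work. The following is only a minimal sketch in plain Python, not Brook+ or CAL code: `empty_kernel` and `tiny_kernel` are hypothetical stand-ins for kernel launches, and `mean_call_overhead_ns` is an illustrative helper name. If the two means come out nearly equal, the per-call cost dominates the work itself.

```python
import time

def mean_call_overhead_ns(fn, calls=100_000):
    """Mean wall-clock cost of one call to fn, in nanoseconds.

    The measurement includes the cost of the timing loop itself,
    so it is only meaningful when comparing one callable to another.
    """
    start = time.perf_counter_ns()
    for _ in range(calls):
        fn()
    elapsed = time.perf_counter_ns() - start
    return elapsed / calls

def empty_kernel():
    # Hypothetical stand-in for a launch that does no real work.
    pass

def tiny_kernel():
    # Hypothetical stand-in for a very short "A + B" style kernel.
    return 1.0 + 2.0

if __name__ == "__main__":
    print("empty:", mean_call_overhead_ns(empty_kernel), "ns per call")
    print("tiny: ", mean_call_overhead_ns(tiny_kernel), "ns per call")
```

With a real GPU runtime the same comparison would also need a synchronization point before the clock is read, since launches may return before the kernel has finished.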

       

        • How costly is a kernel call?
          sambucuself

          I think that calling a kernel, with all the steps necessary to perform that operation, uses CPU time very inefficiently if the kernel is "too short" or the field of execution (the domain) is too small.

          You should try working with relatively large streams and performing as much calculation as you can with as few memory operations as possible, so that you avoid bottlenecks.

           

          I'm working on some technical calculations involving stream kernel programming, and those are my conclusions.

            • How costly is a kernel call?
              Raistmer

               

              Originally posted by: sambucuself I think that calling a kernel, with all the steps necessary to perform that operation, uses CPU time very inefficiently if the kernel is "too short" or the field of execution (the domain) is too small.

              You should try working with relatively large streams and performing as much calculation as you can with as few memory operations as possible, so that you avoid bottlenecks.

               

              I'm working on some technical calculations involving stream kernel programming, and those are my conclusions.

              Yes, but are there any numerical estimates?

              The stream (domain) size is restricted by the size of the data array being processed, and sometimes it is pretty small... I will try to enlarge the kernel itself.

            • How costly is a kernel call?
              Gipsel

              Originally posted by: Raistmer

              I use fairly simple kernels but call them many times in the program.


              That's inherently bad. Simple kernels are never compute bound and the calling overhead will kill the performance.

              Originally posted by: Raistmer

              When I added RDTSC-based counters to see which kernel took the longest, all counters returned approximately the same mean tick count no matter which kernel was running.

              This suggests that the actual running time of my simple kernels is very low and completely hidden by kernel launch preparation, which takes the vast majority of the running time.

              So the question is: is there any information available on how much CPU time a very simple kernel call (for example, stream A + stream B) takes?

              What is the recommended kernel length for a kernel to be useful (to decrease the app's running time instead of increasing it)?



              I've seen a figure of about 20 µs overhead per kernel call somewhere, but I guess it was for the CAL interface; I never measured it myself. The Brook+ layer will add a bit on top of that. I try to have kernels that run for a few milliseconds (or several tens of ms). Copying a lot of data to the GPU before a kernel and back afterwards also causes a major slowdown. It's better to leave all results in GPU memory if possible.

              Such a simple kernel that adds two arrays is only useful as an intermediate step between complex kernels (copying the arrays to the GPU just to add them there is definitely slower than doing it on the CPU). If possible, one should integrate such operations into the kernel before or after it.
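To put the rough 20 µs figure into perspective, here is a small back-of-the-envelope sketch in Python. The 20 µs default is the unmeasured number quoted above, not a benchmark, and `overhead_fraction` is a hypothetical helper name. It computes what share of each call goes to launch overhead for a given kernel run time.

```python
def overhead_fraction(kernel_time_us, launch_overhead_us=20.0):
    """Share of one call's total time that is spent on launch overhead.

    launch_overhead_us defaults to the rough, unmeasured 20 us figure
    from the post; replace it with a measured value if one is available.
    """
    return launch_overhead_us / (launch_overhead_us + kernel_time_us)

if __name__ == "__main__":
    # Kernel run times from a couple of microseconds up to tens of milliseconds.
    for kernel_time_us in (2.0, 20.0, 2_000.0, 20_000.0):
        share = overhead_fraction(kernel_time_us)
        print(f"{kernel_time_us:>8.0f} us kernel: {share:.1%} overhead")
```

Under this assumption, a kernel that runs for 20 µs spends half of each call on overhead, while a kernel that runs for 2 ms loses about 1%, which matches the advice to aim for kernels in the millisecond range.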

                • How costly is a kernel call?
                  Raistmer

                  Actually, all the data already resides on the GPU; there are just many data transfers inside GPU memory. Enlarging one of the kernels (putting the loop inside the kernel instead of calling the kernel in a loop) already gave a big performance boost. It seems that's the way to go.
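The effect of moving the loop into the kernel can be sketched with a toy model. This is plain Python, not Brook+ code: `FakeDevice` is a hypothetical stand-in that only counts launches, `step` is an arbitrary cheap per-element update, and the sketch assumes each element's iterations do not depend on other elements.

```python
class FakeDevice:
    """Hypothetical stand-in for a GPU runtime; it only counts launches."""

    def __init__(self):
        self.launches = 0

    def launch(self, kernel, data, *args):
        self.launches += 1
        return kernel(data, *args)

def step(data):
    # One iteration of a cheap per-element update.
    return [x * 0.5 + 1.0 for x in data]

def step_looped(data, iterations):
    # The same update with the loop moved inside the "kernel".
    for _ in range(iterations):
        data = step(data)
    return data

def run_loop_outside(device, data, iterations):
    # One launch per iteration: the per-launch cost is paid every time.
    for _ in range(iterations):
        data = device.launch(step, data)
    return data

def run_loop_inside(device, data, iterations):
    # A single launch that performs all iterations.
    return device.launch(step_looped, data, iterations)

if __name__ == "__main__":
    outside, inside = FakeDevice(), FakeDevice()
    a = run_loop_outside(outside, [0.0, 4.0, 8.0], 1000)
    b = run_loop_inside(inside, [0.0, 4.0, 8.0], 1000)
    print(a == b, outside.launches, inside.launches)
```

Both variants compute the same values, but the first pays the per-launch cost once per iteration and the second pays it once in total.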