5 Replies Latest reply on Jun 30, 2008 3:38 PM by ryta1203

    kernel calls are slow

    josopait

      Consider the following test program:

       

      kernel void copy(float a<>, out float b<>)
      {
        b = a;
      }

      int main()
      {
        float a<10>;
        float b<10>;
        int t;

        for (t=0; t<100000; ++t)
          {
            copy(a, b);
          }
      }


      It simply copies stream a to stream b, 100000 times. This seems like an easy task to do, but it takes 6.5 seconds to run on my computer. That's 65 microseconds for each call to the copy kernel.

      Why is it so slow? What happens during a kernel call? Isn't it simply a matter of pushing a couple of instructions to the GPU? Can this be sped up?

      Thanks for any help,

      Ingo

       

        • kernel calls are slow
          Nexis

          Using CAL instead of Brook+ can speed this up quite a bit. Calling the same kernel with CAL takes about 30 us.

          However, I still think this is very slow. Is there a way to speed this up? I know CAL is built on top of CTM, so perhaps we could save some latency by going to a lower level, but CTM is nowhere to be found on AMD's website. Does anyone know if it's still possible to use CTM?

          • kernel calls are slow
            MicahVillmow
            Josopait,
            There are a few issues with your code.
            1) The main issue is that you are using a 1D stream on a device that is optimized for 2D memory accesses, which brings up a few minor issues.
            2) The size of your stream is so small that you are actually testing the runtime more than the kernel performance. The runtime has to do things like set up state on the graphics card and copy the kernel over, which isn't free.
            3) The architecture is best utilized when using a 2D stream that has a width and height that are multiples of 2. This is because of how the streaming cores are set up and how they process pixels in quads. If you only use a 1D stream that is not address translated, then you are basically throwing away half of your streaming cores, as they are not active when you run the kernel.
            4) If a 1D stream IS address translated, then there is ALU overhead involved in translating between 1D and 2D coordinates in a generic way, which is not the most efficient.
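To make point 4 concrete, here is a minimal sketch of a generic 1D-to-2D address translation for a row-major layout. It is not the code Brook+ actually generates; the names `coord2d`, `to_2d` and `to_1d` are hypothetical. It only illustrates that each element costs an extra integer divide and modulo before any useful work is done.

```c
#include <assert.h>

/* 2D coordinate of an element in a row-major layout (hypothetical type). */
typedef struct {
    unsigned x; /* column */
    unsigned y; /* row */
} coord2d;

/* Map a linear index to (x, y) for a 2D layout of the given width. */
coord2d to_2d(unsigned index, unsigned width)
{
    coord2d c;
    assert(width > 0);
    c.x = index % width; /* one integer modulo per element */
    c.y = index / width; /* one integer divide per element */
    return c;
}

/* Inverse mapping: (x, y) back to the linear index. */
unsigned to_1d(coord2d c, unsigned width)
{
    assert(width > 0 && c.x < width);
    return c.y * width + c.x;
}
```

For example, with a width of 4, linear index 10 maps to column 2, row 2, and `to_1d` maps that coordinate back to 10.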

            If you run the throughput example from the CAL SDK with a width of 10, a height of 1 and 1000000 iterations, you can see that you get a measly 0.02 GB/s, which is nowhere close to the peak performance of the card.

            Hope this information helps.
              • kernel calls are slow
                josopait

                Well, using 2D streams increases the speed by a factor of two. It now takes 32 us for every kernel call. But this is still way too slow for the purpose I had in mind.

                Isn't there any way to make kernel calls faster? Does the kernel have to be copied over in every loop cycle?

                Ingo

                 

                  • kernel calls are slow
                    Remotion

                    @josopait:

                    Just take a look at the C++ code that Brook+ generates.

                    If I understand this correctly, then every time (100000 times) the CPU calls the copy() function, which does a lot of work before it even begins copying just 10 values.

                    To test performance, you should probably modify the code to something like this:

                    [code]

                        float a<65536>;
                        float b<65536>;
                        int t;

                        for (t=0; t<16; ++t)
                        {
                            copy(a, b);
                        }

                    [/code]
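A rough way to see why the larger stream helps: if every call pays a fixed overhead, that overhead alone caps the achievable throughput. The sketch below is plain C, not Brook+, and the function name `overhead_bound_gbps` is hypothetical. It assumes the 32 us per call reported earlier in this thread also applies to larger streams, and it ignores actual transfer and compute time, so the result is only an upper bound.

```c
#include <assert.h>

/* Upper bound on throughput in GB/s (1 GB = 1e9 bytes) when every call
   pays a fixed overhead. Transfer and compute time are ignored. */
double overhead_bound_gbps(long elements, long bytes_per_element,
                           double overhead_us)
{
    double bytes;
    double seconds;
    assert(elements > 0 && bytes_per_element > 0 && overhead_us > 0.0);
    bytes = (double)elements * (double)bytes_per_element;
    seconds = overhead_us * 1e-6; /* microseconds to seconds */
    return bytes / seconds / 1e9;
}
```

Counting 8 bytes per element for a float copy (4 read, 4 written), `overhead_bound_gbps(10, 8, 32.0)` gives about 0.0025 GB/s, while `overhead_bound_gbps(65536, 8, 32.0)` gives about 16.4 GB/s. With 10 elements the call overhead dominates completely; with 65536 elements the hardware at least has a chance to become the limit.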


                    What exactly do you want to do? It may be easier and faster to do this on the CPU and let the GPU do other work.


                    Remo

                     

                      • kernel calls are slow
                        ryta1203
                        Parallelism is not the only requirement for good GPGPU. Arithmetic intensity should also be high. This application is probably better suited for the CPU.

                        Arithmetic Intensity:

                        arithmetic intensity = operations / words transferred.
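As a small worked example of the formula above, here is a sketch in plain C. The function name `arithmetic_intensity` is hypothetical, and the operation counts are per-element assumptions, not measurements.

```c
#include <assert.h>

/* arithmetic intensity = operations / words transferred */
double arithmetic_intensity(long operations, long words_transferred)
{
    assert(operations >= 0 && words_transferred > 0);
    return (double)operations / (double)words_transferred;
}
```

For the copy kernel in this thread, each element involves no arithmetic and two words transferred (one read, one written), so `arithmetic_intensity(0, 2)` is 0. A hypothetical kernel doing 100 operations per element on the same two words would give `arithmetic_intensity(100, 2)`, which is 50.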