

josopait
Journeyman III

kernel calls are slow

Consider the following test program:

 

kernel void copy(float a<>, out float b<>)
{
  b = a;
}

int main()
{
  float a<10>;
  float b<10>;
  int t;

  for (t=0; t<100000; ++t)
    {
      copy(a, b);
    }
}


It simply copies stream a to stream b, 100000 times. This seems like an easy task to do, but it takes 6.5 seconds to run on my computer. That's 65 microseconds for each call to the copy kernel.
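For reference, the per-call figure above is just total time divided by the number of calls. The sketch below spells that out in plain C and adds a small timing helper; the names `per_call_us` and `time_per_call_us` are made up for illustration and are not part of Brook+. It also assumes CPU time from `clock()` is an acceptable proxy, which may undercount if the CPU sleeps while waiting for the GPU.

```c
#include <assert.h>
#include <time.h>

/* Average cost of one call in microseconds, given the total run time in seconds. */
double per_call_us(double total_seconds, long calls)
{
    assert(calls > 0);
    return total_seconds * 1e6 / (double)calls;
}

/* Hypothetical helper: invoke fn `calls` times and return the average cost per call
   in microseconds. clock() reports processor time, not wall-clock time. */
double time_per_call_us(void (*fn)(void), long calls)
{
    assert(fn != 0 && calls > 0);
    clock_t start = clock();
    for (long i = 0; i < calls; ++i)
        fn();
    clock_t end = clock();
    return per_call_us((double)(end - start) / CLOCKS_PER_SEC, calls);
}
```

With the numbers from the post, `per_call_us(6.5, 100000)` gives 65.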

Why is it so slow? What happens during a kernel call? Isn't it simply a matter of pushing a couple of instructions to the GPU? Can this be sped up?

Thanks for any help,

Ingo

 

0 Likes
5 Replies
Nexis
Journeyman III

Using CAL instead of Brook+ can speed this up quite a bit. Calling the same kernel with CAL takes about 30 µs.

However, I still think this is very slow. Is there a way to speed this up further? I know CAL is built on top of CTM, so perhaps we could save some latency by going to a lower level, but CTM is nowhere to be found on AMD's website. Does anyone know if it's still possible to use CTM?

0 Likes

Josopait,
There are a few issues with your code.
1) The main issue is that you are using a 1D stream on a device that is optimized for 2D memory accesses, which brings up a few minor issues.
2) Your stream is so small that you are testing the runtime more than the kernel performance. The runtime has to do things like set up state on the graphics card and copy the kernel over, which isn't free.
3) The architecture is best utilized with a 2D stream whose width and height are both multiples of 2. This is because of how the streaming cores are set up and how they process pixels in quads. If you use a 1D stream that is not address translated, you are basically throwing away half of your streaming cores, as they are not active when you run the kernel.
4) If a 1D stream IS address translated, there is ALU overhead involved in translating between 1D and 2D coordinates in a generic way, which is not the most efficient.
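To make point 4 concrete, here is a rough sketch in plain C of what a generic 1D-to-2D translation involves. This is not the code Brook+ actually generates; the `coord2d` type, the function names and the row-major layout are assumptions chosen for illustration. The point is only that each element pays for an integer divide and modulo (or a multiply-add going the other way) on top of the real work.

```c
#include <assert.h>

/* Hypothetical 2D coordinate type for illustration. */
typedef struct { int x; int y; } coord2d;

/* Map a linear index to (x, y), assuming a row-major layout of the given width. */
coord2d to_2d(int i, int width)
{
    assert(width > 0 && i >= 0);
    coord2d c = { i % width, i / width };
    return c;
}

/* Inverse mapping: (x, y) back to the linear index. */
int to_1d(coord2d c, int width)
{
    assert(width > 0);
    return c.y * width + c.x;
}
```

For a copy kernel that does no arithmetic of its own, this translation would be the only ALU work in the kernel.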

If you run the throughput example from the CAL SDK with a width of 10, a height of 1 and 1000000 iterations, you can see that you get a measly 0.02 GB/s, which is nowhere close to the peak performance of the card.
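A simple way to see why tiny streams give such poor throughput is to model each call as a fixed overhead plus the time to move the data. The sketch below is only that model, not a reproduction of the CAL throughput sample: `effective_gbps` is a made-up name, and the overhead and peak bandwidth you plug in are assumptions.

```c
#include <assert.h>

/* Effective throughput in GB/s when every call pays a fixed overhead (in microseconds)
   before moving bytes_per_call at peak_gbps. Uses 1 GB/s = 1e3 bytes per microsecond. */
double effective_gbps(double bytes_per_call, double overhead_us, double peak_gbps)
{
    assert(bytes_per_call > 0 && overhead_us >= 0 && peak_gbps > 0);
    double transfer_us = bytes_per_call / (peak_gbps * 1e3);
    return bytes_per_call / ((overhead_us + transfer_us) * 1e3);
}
```

For example, 10 floats read plus 10 floats written is 80 bytes per call. With a 30 µs overhead and a placeholder peak of 50 GB/s, the model gives well under 0.01 GB/s, because the transfer itself takes almost no time compared to the overhead. With zero overhead it returns the peak value.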

Hope this information helps.
0 Likes

Well, using 2D streams increases the speed by a factor of two. It now takes 32 µs for every kernel call. But this is still way too slow for what I had in mind.

Isn't there any way to make kernel calls faster? Does the kernel have to be copied over in every loop cycle?

Ingo

 

0 Likes

@josopait:

Just try looking at the C++ code that Brook+ generates.

If I understand this correctly, then on every pass through the loop (100000 times) the CPU calls the copy() function, which does a lot of work before it even begins copying just 10 values.

To test performance, you should probably modify the code to something like this:

    float a<65536>;
    float b<65536>;
    int t;

    for (t=0; t<16; ++t)
    {
        copy(a, b);
    }


What exactly do you want to do? It may be easier and faster to do this on the CPU and let the GPU do other work.
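To put numbers on the suggested change: both versions move about a million floats, but the second one needs only 16 kernel calls instead of 100000. The plain C sketch below just does that arithmetic. The function names are made up, and it assumes the per-call overhead stays at roughly the 65 µs measured in the first post regardless of stream size, which nobody in this thread has verified.

```c
#include <assert.h>

/* Total number of elements processed over all calls. */
long total_elements(long stream_len, long calls)
{
    assert(stream_len > 0 && calls > 0);
    return stream_len * calls;
}

/* Time spent purely on call overhead, in seconds, assuming a constant cost per call. */
double launch_overhead_s(long calls, double per_call_us)
{
    assert(calls > 0 && per_call_us >= 0);
    return (double)calls * per_call_us * 1e-6;
}
```

Under that assumption, the original loop spends about 6.5 s on call overhead for 1000000 elements, while the modified loop spends about 1 ms for 1048576 elements.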


Remo

 

0 Likes

Parallelism is not the only requirement for good GPGPU. Arithmetic intensity should also be high. This application is probably better suited for the CPU.

Arithmetic Intensity:

arithmetic intensity = operations / words transferred.
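As a quick illustration of the formula, here it is as a plain C function. The name `arithmetic_intensity` and the per-element operation counts in the examples are my own assumptions for illustration.

```c
#include <assert.h>

/* Arithmetic intensity as defined above: operations per word transferred. */
double arithmetic_intensity(double operations, double words_transferred)
{
    assert(operations >= 0 && words_transferred > 0);
    return operations / words_transferred;
}
```

The copy kernel reads one word and writes one word per element and performs no arithmetic, so `arithmetic_intensity(0, 2)` is 0. For comparison, a kernel computing y = a*x + y per element does 2 operations on 3 words (read x, read y, write y), about 0.67.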
0 Likes