Consider the following test program:
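/* Element-wise copy: each element of b receives the matching element of a. */
kernel void copy(float a<>, out float b<>)
{
    b = a;
}

int main()
{
    float a<10>;
    float b<10>;
    int t;
    /* Launch the kernel 100000 times on 10-element streams. */
    for (t = 0; t < 100000; ++t)
    {
        copy(a, b);
    }
}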
It simply copies stream a to stream b 100000 times. This seems like an easy task, but it takes 6.5 seconds to run on my computer. That works out to 65 microseconds for each call to the copy kernel.
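For anyone who wants to reproduce the per-call figure, here is a minimal sketch in plain C (host side only, not Brook+). The names per_call_us, time_calls and dummy_call are hypothetical helpers made up for illustration; dummy_call is just a stand-in for whatever call you want to time, and timespec_get assumes a C11 compiler.

```c
#include <assert.h>
#include <time.h>

/* Average cost of one call in microseconds, given the total run time. */
double per_call_us(double total_seconds, long calls)
{
    assert(calls > 0);
    return total_seconds * 1e6 / (double)calls;
}

/* Hypothetical stand-in for the call being measured (e.g. a kernel launch). */
void dummy_call(void)
{
}

/* Wall-clock seconds spent in `calls` invocations of fn (C11 timespec_get). */
double time_calls(void (*fn)(void), long calls)
{
    struct timespec start, end;
    long i;

    timespec_get(&start, TIME_UTC);
    for (i = 0; i < calls; ++i)
        fn();
    timespec_get(&end, TIME_UTC);

    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}
```

With the numbers above, per_call_us(6.5, 100000) gives 65. To measure a real loop, something like per_call_us(time_calls(dummy_call, 100000), 100000) would do, with dummy_call replaced by a wrapper around the actual call.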
Why is it so slow? What happens during a kernel call? Isn't it simply a matter of pushing a couple of instructions to the GPU? Can one speed this up?
Thanks for any help,
Ingo
Using CAL instead of Brook+ can speed this up quite a bit. Calling the same kernel with CAL takes about 30 us.
However, I still think this is very slow. Is there a way to speed this up? I know CAL is built on top of CTM, so perhaps we could save some latency by going to a lower level, but CTM is nowhere to be found on AMD's website. Does anyone know if it's still possible to use CTM?
Well, using 2D streams increases the speed by a factor of two. It now takes 32 us for every kernel call. But this is still far too slow for the purpose I had in mind.
Isn't there any way to make kernel calls faster? Does the kernel have to be copied over in every loop cycle?
Ingo
@josopait:
Just try looking at the C++ code that Brook+ generates.
If I understand this correctly, then every time (100000 times)
the CPU calls the copy() function, which does a lot of work before it even begins copying just 10 values.
To test performance, you should probably modify the code to something like this:
float a<65536>;
float b<65536>;
int t;
for (t = 0; t < 16; ++t)
{
    copy(a, b);
}
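To put rough numbers on this, here is a small sketch in plain C (not Brook+). It assumes, purely for illustration, that the 65 us measured above is a fixed cost per call and does not depend on the stream length; that has not been verified. The function names are hypothetical.

```c
#include <assert.h>

/* Fixed per-call overhead spread over the n elements handled by one call. */
double overhead_per_element_us(double call_overhead_us, unsigned long n)
{
    assert(n > 0);
    return call_overhead_us / (double)n;
}

/* Total elements copied by `calls` kernel invocations on streams of length n. */
unsigned long long total_elements(unsigned long long calls, unsigned long long n)
{
    return calls * n;
}
```

Both versions move about the same amount of data: total_elements(100000, 10) is 1000000 and total_elements(16, 65536) is 1048576. Under the assumption above, the overhead per element drops from 6.5 us with 10-element streams to under 0.001 us with 65536-element streams, so the second loop measures the copy itself much more than the call overhead.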
What exactly do you want to do? Maybe it is easier and faster to do this on the CPU and let the GPU do other work.
Remo