Consider the following test program:
kernel void copy(float a<>, out float b<>) {
    b = a;
}

for (t=0; t<100000; ++t)
    copy(a, b);
It simply copies stream a to stream b, 100000 times. This seems like an easy task to do, but it takes 6.5 seconds to run on my computer. That's 65 microseconds for each call to the copy kernel.
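A per-call figure like this can be measured roughly as sketched below in plain C. This is only an illustration under assumptions: launch_fn, time_launches and per_call_microseconds are made-up names, and the function passed in is a stand-in for whatever issues one call to the copy kernel. It uses the C11 timespec_get wall-clock timer.

```c
#include <assert.h>
#include <time.h>

/* Hypothetical stand-in for "issue one kernel call"; not part of any GPU API. */
typedef void (*launch_fn)(void);

/* Average cost of one call in microseconds, given the total wall time. */
double per_call_microseconds(double total_seconds, long calls)
{
    assert(calls > 0);
    return total_seconds * 1e6 / (double)calls;
}

/* Current wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Runs `launch` `calls` times and returns the average time per call
   in microseconds. */
double time_launches(launch_fn launch, long calls)
{
    double start = now_seconds();
    for (long t = 0; t < calls; ++t)
        launch();
    double end = now_seconds();
    return per_call_microseconds(end - start, calls);
}
```

With the numbers above, per_call_microseconds(6.5, 100000) gives 65. If the runtime queues work asynchronously, the timed function should also force completion (for example by reading the result back once after the loop), otherwise only the submission cost is measured.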
Why is it so slow? What happens during a kernel call? Isn't it simply a matter of pushing a couple of instructions to the GPU? Can one speed this up?
Thanks for any help,