I use fairly simple kernels, but call them many times in my program.
The Brook version on the CPU backend performs worse than the pure CPU version, and the CAL backend performs even worse!
Performance degrades severalfold when running on the GPU (both elapsed time and CPU time).
I use an HD4870 for benchmarking, which is not the slowest card, so this result is pretty discouraging.
When I added RDTSC-based counters to see which kernel took the longest, all counters returned approximately the same mean tick count, no matter which kernel was running.
This suggests that the actual running time of my simple kernels is very low and is completely hidden by the kernel launch preparation, which takes the vast majority of the running time.
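One way to check that conclusion is to time the same call at several stream sizes: if the mean time per call barely changes with size, fixed launch overhead dominates. Below is a minimal sketch of such a harness, not my actual benchmark code. It uses std::chrono instead of RDTSC for portability, and `add_streams`, `sweep`, and `mean_ns_per_call` are hypothetical names; `add_streams` is a plain CPU loop standing in for the real kernel call.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Mean wall-clock time per call, in nanoseconds, over `calls` invocations.
template <typename F>
double mean_ns_per_call(F&& f, int calls) {
    assert(calls > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < calls; ++i) {
        f();
    }
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(stop - start).count() / calls;
}

// Hypothetical stand-in for a kernel launch: c = a + b computed on the CPU.
// In a real test, replace the body with the actual kernel call plus whatever
// forces completion (for example, reading the output stream back).
void add_streams(const std::vector<float>& a,
                 const std::vector<float>& b,
                 std::vector<float>& c) {
    for (std::size_t i = 0; i < c.size(); ++i) {
        c[i] = a[i] + b[i];
    }
}

// Time the same call at each size in `sizes`; returns one mean per size.
std::vector<double> sweep(const std::vector<std::size_t>& sizes, int calls) {
    std::vector<double> means;
    for (std::size_t n : sizes) {
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
        means.push_back(mean_ns_per_call([&] { add_streams(a, b, c); }, calls));
    }
    return means;
}
```

Calling `sweep({16, 1024, 65536}, 1000)` gives three means. If they come out nearly equal with the real kernel plugged in, the per-call cost is almost entirely overhead.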
So the question is: is there any information available on how much CPU time a very simple kernel call (for example, stream A + stream B) takes?
And what is the recommended minimum kernel length for it to be useful (that is, to decrease the application's running time instead of increasing it)?
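To make the second question concrete, here is a rough sketch of the break-even point I have in mind. It assumes a simple linear cost model that I have not verified for Brook+: a GPU call costs a fixed overhead plus a per-element cost, and the CPU version costs only a per-element cost. The function name and all parameters are hypothetical.

```cpp
#include <cassert>
#include <limits>

// Assumed cost model (not measured):
//   GPU time for n elements: overhead_ns + n * gpu_ns_per_elem
//   CPU time for n elements: n * cpu_ns_per_elem
// Returns the number of elements per call at which both times are equal.
// Above that count the GPU call is faster under this model; if the GPU is
// not faster per element, it never catches up and the result is infinity.
double break_even_elements(double overhead_ns,
                           double cpu_ns_per_elem,
                           double gpu_ns_per_elem) {
    assert(overhead_ns >= 0.0);
    assert(cpu_ns_per_elem > 0.0);
    assert(gpu_ns_per_elem >= 0.0);
    if (gpu_ns_per_elem >= cpu_ns_per_elem) {
        return std::numeric_limits<double>::infinity();
    }
    return overhead_ns / (cpu_ns_per_elem - gpu_ns_per_elem);
}
```

With made-up numbers (not measurements), a 100000 ns launch overhead, 1.0 ns per element on the CPU and 0.5 ns per element on the GPU give a break-even of 200000 elements per call. What I am missing is the real overhead figure to plug in.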