Hi, I'm getting excellent performance from my application on the Brook platform, but the very first Brook kernel call takes ~4000 ms.
The kernel function itself has around 25 streams, and the amount of data transferred on each call is around 1Mb. The second and later calls are fast and perfect (~90 ms).
Right now I have to measure the performance of my application, so I have to make a first "dummy" call and benchmark the second one :-S
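The "dummy call first, then benchmark" approach described above can be sketched roughly like this. It is only a minimal sketch: `bench_after_warmup` is a made-up helper name, not part of Brook+, and it assumes the callable you pass in blocks until the kernel has really finished (for example, by including the read-back of the output stream).

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper (not a Brook+ API): runs `kernel_call` once untimed,
// then returns the average wall-clock time of `runs` further calls in ms.
double bench_after_warmup(const std::function<void()>& kernel_call, int runs) {
    assert(runs > 0);

    // Warm-up call: pays any one-time cost (compilation, cache setup).
    kernel_call();

    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < runs; ++i) {
        kernel_call();
    }
    const std::chrono::duration<double, std::milli> elapsed =
        clock::now() - start;

    return elapsed.count() / runs;
}
```

You would wrap the real kernel invocation in a lambda and pass it in, e.g. `bench_after_warmup([&] { /* kernel call + read-back */ }, 10)`. Averaging over several runs instead of timing only the second call should also smooth out noise.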
The Brook+ kernel call implementation uses various caches. That's why you see a speed-up from the second kernel call onward. There is no way to avoid the slow first kernel call.
Those 4000 ms are probably just eaten by the calclCompile routine. For large kernels, compile speed becomes a real problem. Try grabbing your kernel code from the Brook+ *.cpp code and compiling it on its own to find out.
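To picture why only the first call is slow, here is a toy compile-once cache. This is not the actual Brook+ implementation: `KernelCache`, `CompiledKernel` and the fake `compile` step are all hypothetical stand-ins, used only to show the general pattern of compiling on first use and reusing the result afterwards.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-in for whatever the runtime stores after compiling.
struct CompiledKernel {
    std::string binary;
};

// Toy cache: compiles a kernel the first time it is requested and
// returns the stored result on every later request.
class KernelCache {
public:
    const CompiledKernel& get(const std::string& source) {
        assert(!source.empty());
        auto it = cache_.find(source);
        if (it == cache_.end()) {
            // Cache miss: this is where the expensive compile would happen.
            ++compile_count_;
            it = cache_.emplace(source, compile(source)).first;
        }
        return it->second;
    }

    // Number of times the (fake) compiler actually ran.
    int compile_count() const { return compile_count_; }

private:
    // Placeholder for the real, slow compile step.
    static CompiledKernel compile(const std::string& source) {
        return CompiledKernel{"compiled:" + source};
    }

    std::map<std::string, CompiledKernel> cache_;
    int compile_count_ = 0;
};
```

With a cache like this, the first `get` for a given kernel pays the full compile cost and every later `get` is just a map lookup, which would match the ~4000 ms versus ~90 ms pattern reported above if compilation is indeed the bottleneck.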
I ran the Black-Scholes example in the brook directory. It seems to indicate that the GPU calculations are slower than the ones on the CPU. Has anyone seen this, or is something misconfigured on my box (openSUSE AMD64/ATI 4850 HD)?
Originally posted by: BarsMonster You're right, I have quite a huge kernel. This kinda sucks.
The CUDA port gets executed almost instantly :-S
Thanks for your replies.
CUDA has offline kernel compilation. I'm not sure why no one included it in the Brook+ compilation step; it must be because of forward compatibility.