Hi, I'm getting excellent performance from my application on the Brook platform, but the very first Brook kernel call takes ~4000 ms.
The kernel function itself has around 25 streams, and the amount of data transferred per call is around 1 MB. The second and later calls are fast and perfect (~90 ms).
Any clues?
Right now, to measure my application's performance, I have to make a first "dummy" call and benchmark the second one :-S
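For anyone benchmarking the same way, the "dummy call, then measure" approach can be wrapped in a small helper. This is only a sketch in plain C++: `mean_ms_after_warmup` and its parameters are made-up names, not part of Brook+, and it assumes the callable you pass blocks until the kernel's results are actually available (for example because it reads the output stream back).

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` `warmup` times without timing it, so that one-time costs
// (for example runtime kernel compilation) are kept out of the measurement,
// then returns the mean wall-clock time of `runs` timed calls in milliseconds.
// Assumption: `work` is synchronous, i.e. it returns only when results are ready.
double mean_ms_after_warmup(const std::function<void()>& work, int warmup, int runs) {
    assert(warmup >= 0 && runs > 0);

    // Untimed warm-up calls.
    for (int i = 0; i < warmup; ++i) {
        work();
    }

    // Timed calls; steady_clock is monotonic, so it is safe for intervals.
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return elapsed.count() / runs;
}
```

You would pass a lambda that invokes your own kernel, e.g. `mean_ms_after_warmup([&] { /* call kernel, read result back */ }, 1, 10)`. Averaging over several timed runs also smooths out noise from a single measurement.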
The Brook+ kernel call implementation uses various caches, which is why you see a speed-up from the second kernel call onward. There is no way to avoid the slow first kernel call.
Those 4000 ms are probably just eaten by the calclCompile routine. For large kernels, compile time becomes a real problem. Try extracting your kernel code from the Brook+-generated *.cpp file and compiling it on its own to find out.
You're right, I have quite a huge kernel. This kinda sucks.
The CUDA port gets executed almost instantly :-S
Thanks for your replies.
I ran the Black-Scholes example in the brook directory. It seems to indicate that the GPU calculations are slower than the ones on the CPU. Has anyone seen this, or is something misconfigured on my box (openSUSE AMD64 / ATI 4850 HD)?
Thanks,
-Greg
Use a larger input. I've seen an improvement with 1 million up to 3 million input samples for Black-Scholes.
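One way to picture why a larger input helps: if each GPU run pays a fixed cost (setup, transfers) before it starts winning per sample, small inputs never amortize it. The sketch below is a toy linear cost model in plain C++, not measured Brook+ behaviour; `break_even_items` and all three parameters are hypothetical names, and the real costs on your machine have to be measured.

```cpp
#include <cassert>
#include <cmath>

// Toy cost model (an assumption for illustration only):
//   cpu_time(n) = n * cpu_ms_per_item
//   gpu_time(n) = fixed_overhead_ms + n * gpu_ms_per_item
// Returns the smallest item count n for which the GPU is strictly faster,
// or -1 if the GPU never catches up under this model.
long long break_even_items(double fixed_overhead_ms,
                           double cpu_ms_per_item,
                           double gpu_ms_per_item) {
    assert(fixed_overhead_ms >= 0.0);

    // If the GPU is not faster per item, the overhead is never paid back.
    if (cpu_ms_per_item <= gpu_ms_per_item) {
        return -1;
    }

    // GPU wins when n * cpu > overhead + n * gpu,
    // i.e. n > overhead / (cpu - gpu).
    const double n = fixed_overhead_ms / (cpu_ms_per_item - gpu_ms_per_item);
    return static_cast<long long>(std::floor(n)) + 1;
}
```

Plugging in your own measured per-sample times and fixed overhead gives a rough idea of how many samples a run needs before the GPU version comes out ahead.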
Yes, I do see the speed-up now. Thank you. -Greg
I ran the example up to 200k replications. I will try what you suggest.
Thank you,
-Greg
Originally posted by: BarsMonster
You right, I have quite huge kernel. This kinda sucks.
CUDA port gets executed almost instantly :-S
Thanks for your replies.
CUDA has offline kernel compilation. I'm not sure why no one included it in the Brook+ compilation step; it must be for the sake of forward compatibility.