
BarsMonster
Journeyman III

First Brook call is sloooooooowww

Hi, I'm getting excellent performance from my application on the Brook platform, but the very first Brook kernel call takes ~4000 ms.

The kernel function itself has around 25 streams, and the amount of data transferred per call is around 1Mb. The second and later calls are fast and perfect (~90 ms).

Any clues?

Right now I need to measure the performance of my application, so I have to make a first "dummy" call and benchmark the second one :-S
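The "dummy call, then bench" approach described above can be sketched roughly as follows. This is only a minimal illustration in plain C++: `time_ms`, `BenchResult` and `bench_with_warmup` are hypothetical names, not part of Brook+, and the callable is assumed to include whatever makes the result actually available (for example, reading the output stream back), so that the timing covers the whole call.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper: wall-clock time of one invocation, in milliseconds.
static double time_ms(const std::function<void()>& call) {
    auto start = std::chrono::steady_clock::now();
    call();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical result type: the first (warm-up) call is reported separately
// from the average of the later calls.
struct BenchResult {
    double first_ms;
    double steady_ms;
};

// Runs `call` once as a warm-up, then `runs` more times, and averages
// only the later runs.
static BenchResult bench_with_warmup(const std::function<void()>& call, int runs) {
    BenchResult result;
    result.first_ms = time_ms(call);   // includes any one-time setup cost
    double total = 0.0;
    for (int i = 0; i < runs; ++i) {
        total += time_ms(call);        // steady-state calls only
    }
    result.steady_ms = runs > 0 ? total / runs : 0.0;
    return result;
}
```

Reporting both numbers, rather than discarding the first one, keeps the one-time startup cost visible next to the steady-state figure.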

0 Likes
8 Replies
gaurav_garg
Adept I

The Brook+ kernel call implementation uses various caches. That's why you see a speed-up from the second kernel call onward. There is no way to avoid the slow first kernel call.

0 Likes

Those 4000 ms are probably just eaten by the calclCompile routine. For large kernels, compile time becomes a real problem. Try grabbing your kernel code from the Brook+ *.cpp code and compiling it on its own to find out.
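If the compile step really is the culprit, the behaviour in this thread (slow first call, fast later calls) is what a compile-once cache would produce. The sketch below is only an illustration of that idea in plain C++; `CompiledKernel` and `KernelCache` are hypothetical names, and nothing here is taken from the actual Brook+ or CAL sources.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical stand-in for a compiled kernel; not a Brook+ or CAL type.
struct CompiledKernel {
    std::string source;
};

// Compiles each distinct kernel source at most once and reuses the result.
class KernelCache {
public:
    explicit KernelCache(std::function<CompiledKernel(const std::string&)> compile)
        : compile_(std::move(compile)) {}

    const CompiledKernel& get(const std::string& source) {
        auto it = cache_.find(source);
        if (it == cache_.end()) {
            // First request for this kernel: pay the (possibly large) compile cost.
            it = cache_.emplace(source, compile_(source)).first;
        }
        // Later requests return the cached result without recompiling.
        return it->second;
    }

private:
    std::function<CompiledKernel(const std::string&)> compile_;
    std::unordered_map<std::string, CompiledKernel> cache_;
};
```

With a cache like this, only the first `get` for a given kernel pays the compile time, which would match a ~4000 ms first call followed by ~90 ms calls.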

0 Likes

You're right, I have quite a huge kernel. This kinda sucks.

The CUDA port gets executed almost instantly :-S

Thanks for your replies.

0 Likes

I ran the Black-Scholes example in the Brook directory. It seems to indicate that the GPU calculations are slower than the ones on the CPU. Has anyone seen this, or is something misconfigured on my box (openSUSE AMD64/ATI 4850 HD)?

Thanks,

-Greg

0 Likes

Use a larger input. I've seen an improvement with 1 million up to 3 million input samples for Black-Scholes.
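One way to see why a larger input can help is a simple cost model, assumed here purely for illustration: the GPU pays a fixed cost per call (setup, transfers) plus a small cost per sample, while the CPU pays only a per-sample cost. The function name and all three parameters below are hypothetical, and no real measurements are implied.

```cpp
#include <cmath>

// Assumed cost model (not measured Brook+ behaviour):
//   cpu_time(n) = n * cpu_per_item
//   gpu_time(n) = gpu_fixed + n * gpu_per_item
// Returns the smallest n at which the GPU is no slower than the CPU,
// or -1 if the GPU never catches up under this model.
long long break_even_items(double gpu_fixed, double gpu_per_item, double cpu_per_item) {
    if (cpu_per_item <= gpu_per_item) {
        return -1;  // the per-item cost on the GPU is not lower, so no break-even
    }
    return static_cast<long long>(std::ceil(gpu_fixed / (cpu_per_item - gpu_per_item)));
}
```

Below the break-even size the fixed cost dominates and the CPU wins; above it the GPU pulls ahead.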

0 Likes

Yes, I do see the speed up now. Thank you. -Greg

0 Likes

I ran the example up to 200k replications. I will try what you suggest.

Thank you,

-Greg

0 Likes

Originally posted by: BarsMonster

You're right, I have quite a huge kernel. This kinda sucks.

The CUDA port gets executed almost instantly :-S

Thanks for your replies.

CUDA has offline kernel compilation. I'm not sure why no one included it in the Brook+ compilation step; it must be because of forward compatibility.

0 Likes