I looped the Brook+ optimized_matmult 10 times and the CAL simple_matmult 10 times for 6400x6400 matrices.
Why does CAL perform much better over 10 iterations, when in a single run the Brook+ optimized_matmult performs as well as or better than CAL?
What is actually happening?
As I said earlier, system time reported by CAL sample doesn't include a lot of stuff. You should review the code and change the timer placement similar to Brook+.
Thanks gaurav,
I thought those actions were excluded after the first kernel call in Brook+.
Whoops... Double post...
My connection to this site has been bad lately.
Originally posted by: gaurav.garg As I said earlier, system time reported by CAL sample doesn't include a lot of stuff. You should review the code and change the timer placement similar to Brook+.
Hm... After some thought: I ask because the average running time over 10 iterations is about 50 percent of the time for a single iteration, and in some cases about 35 percent. Does CAL have some sort of caching algorithm too?
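One possible reading of these numbers, offered as an assumption rather than a confirmed explanation: suppose the timed region contains a one-time cost $S$ that is paid only on the first iteration, plus a per-iteration kernel cost $K$. Both symbols are hypothetical and are not taken from the CAL sample. A single run then takes $T_1 = S + K$, and the average over $N$ iterations is

\[
\bar{T}_N = \frac{S + N K}{N}, \qquad \frac{\bar{T}_{10}}{T_1} = \frac{S + 10K}{10\,(S + K)} .
\]

Setting the ratio to $0.5$ gives $S + 10K = 5S + 5K$, so $S = 1.25\,K$. Setting it to $0.35$ gives $S + 10K = 3.5S + 3.5K$, so $S = 2.6\,K$. Under this assumption, a one-time cost of roughly one to three kernel times would be enough to produce the observed averages without any caching of results.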
Bump. I really need help fast.