I benchmarked these: http://img515.imageshack.us/i/benchmark.jpg/
Why do some dimensions fail, why is it slow for small inputs in some cases, and why is CAL faster for small matrices while Brook+ is faster for the rest?
Where can I find more detailed information about this? About online kernel compilation, etc.?
I am not sure if you changed the timing calculation in the Brook+ and CAL samples.
The CAL sample timers don't include a lot of things in the timing calculation, like IL compilation, resource allocation and symbol binding. Including those in the time, you would see better performance with Brook+.
The reason you see better performance is the data transfer optimizations done in Brook+. The CAL samples have a naive data transfer implementation that could be improved to get much better results.
For smaller matrices, the data transfer optimizations might not be able to overcome the other overheads of IL compilation and the kernel call.
About online kernel compilation, etc.?
Online compilation means compiling IL kernels at runtime. It can be changed to offline compilation (using calImageRead/calclImageWrite) for better performance.
Thank you gaurav,
I didn't change anything at all; one run, one iteration is what I used.
Originally posted by: gaurav.garg I am not sure if you changed the timing calculation in the Brook+ and CAL samples.
So it is very biased then. What should I do to make it bias-free? Where exactly should I place the timers?
The CAL sample timers don't include a lot of things in the timing calculation, like IL compilation, resource allocation and symbol binding. Including those in the time, you would see better performance with Brook+.
Do the CAL samples use offline or online compilation? Brook+ uses offline, right?
The reason you see better performance is the data transfer optimizations done in Brook+. The CAL samples have a naive data transfer implementation that could be improved to get much better results.
For smaller matrices, the data transfer optimizations might not be able to overcome the other overheads of IL compilation and the kernel call.
About online kernel compilation, etc.?
Online compilation means compiling IL kernels at runtime. It can be changed to offline compilation (using calImageRead/calclImageWrite) for better performance.
Oh yeah, I wonder why simple_matmult in Brook+ only works up to 4096x4096; larger than that, the program crashes and stops responding, whereas optimized_matmult gets up to 6400x6400, which is on the edge of the memory boundary.
Also, are there restrictions explaining why compute_matmult won't run at 4096x4096 and larger, memexport_matmult won't run at 2048x2048 and larger, and memimport_matmult only runs between 256x256 and 1024x1024?
Both Brook+ and CAL samples use online compilation.
The Brook+ simple_matmult sample is a double-precision MMM. That's why it requires twice the memory of optimized_matmult.
Thank you gaurav,
How about the last question:
Why won't compute_matmult run at 4096x4096 and larger? Why won't memexport_matmult run at 2048x2048 and larger? And why does memimport_matmult only run between 256x256 and 1024x1024?
I can think of a reason why memimport_matmult only runs for smaller dimensions: with memimport, a single global buffer is used for both inputs. The maximum dimension in the memimport sample is exactly half of the memexport sample's.
I think there is a bug on the CAL side that prevents global buffer allocations larger than 2048x2048.
Oh I see, so compute_matmult, memexport_matmult and memimport_matmult are using the global buffer too.
Thank you gaurav
gaurav, what exactly is the memimport example? I can see that memexport is the one that uses the global buffer, but in memimport, what is written is also memexport; there is no memimport whatsoever.
So what is it?
Just now I've seen the double_matmult example in CAL, but it only runs up to 1024x1024.
Why?
memimport uses global memory as input to the kernel, whereas memexport uses it as output.
And what about the double_matmult example?