
riza_guntur
Journeyman III

Need explanation about variety of matrix multiplication

I benchmarked these: http://img515.imageshack.us/i/benchmark.jpg/

Why do some dimensions fail, why is it slow for small inputs in some cases, and why is CAL faster for small matrices while Brook+ is faster for the rest?

Where can I find more detailed information about these? About online kernel compilation, etc.?

0 Likes
10 Replies
gaurav_garg
Adept I

I am not sure if you changed the timing calculation in the Brook+ and CAL samples.

The CAL sample timers don't include a lot of work in the timing calculation, such as IL compilation, resource allocation, and symbol binding. Including that in the time, you would see better performance with Brook+.

The reason you see better performance is the data transfer optimizations done in Brook+. The CAL samples have a naive data transfer implementation that can be improved to get much better results.

For smaller matrices, the data transfer optimizations might not be able to overcome the other overhead from IL compilation and the kernel call.

About online kernel compilation, etc.?


Online compilation means compiling the IL kernels at runtime. It can be changed to offline compilation (using calImageRead/calclImageWrite) for better performance.

0 Likes

Thank you gaurav,

Originally posted by: gaurav.garg I am not sure if you changed the timing calculation in the Brook+ and CAL samples.
I didn't change anything at all; one run, one iteration is what I used.

The CAL sample timers don't include a lot of work in the timing calculation, such as IL compilation, resource allocation, and symbol binding. Including that in the time, you would see better performance with Brook+.
So it is very biased then. What should I do to make it bias-free? Where exactly should I place the timer?

The reason you see better performance is the data transfer optimizations done in Brook+. The CAL samples have a naive data transfer implementation that can be improved to get much better results.

For smaller matrices, the data transfer optimizations might not be able to overcome the other overhead from IL compilation and the kernel call.

About online kernel compilation, etc.?


Online compilation means compiling the IL kernels at runtime. It can be changed to offline compilation (using calImageRead/calclImageWrite) for better performance.

Do the CAL samples use offline or online compilation? Brook+ uses offline, right?

Oh yeah, I wonder why simple_matmult in Brook+ only works up to 4096x4096; larger than that, the program crashes and stops responding, whereas optimized_matmult gets up to 6400x6400, which is right at the edge of the memory boundary?

Also, what restrictions explain why compute_matmult won't run at 4096x4096 and larger, memexport_matmult won't run at 2048x2048 and larger, and memimport_matmult only runs between 256x256 and 1024x1024?

0 Likes

Both Brook+ and CAL samples use online compilation.

The Brook+ simple_matmult sample is a double-precision MMM. That's why it requires twice the memory of optimized_matmult.

0 Likes

Thank you gaurav,

How about the last:

Why won't compute_matmult run at 4096x4096 and larger? Why won't memexport_matmult run at 2048x2048 and larger? Why does memimport_matmult only run between 256x256 and 1024x1024?

0 Likes

I can think of a reason why memimport_matmult only runs for smaller dimensions: with memimport, a single global buffer is used for both inputs. The maximum dimension in the memimport sample is exactly half of the memexport sample.

I think there is a bug on the CAL side that does not allow global buffer allocations larger than 2048x2048.

0 Likes

Oh I see, so compute_matmult, memexport_matmult, and memimport_matmult use a global buffer too.

Thank you gaurav

0 Likes

gaurav, what exactly is the memimport example? I can see that memexport is the one that uses a global buffer, but in the memimport sample what is written is also memexport; there is no memimport whatsoever.

So what is it?

0 Likes

Just now I saw the double_matmult example in CAL, but it only runs up to 1024x1024.

Why?

0 Likes

memimport uses global memory as input to the kernel, whereas memexport uses it as output.

0 Likes

And what about the double_matmult example?

0 Likes