**optimized_matmult** and **double_precision_optimized_matmult** (*A × B = C*, where *A* and *B* are n-by-k and k-by-m matrices, respectively, and *C* is an n-by-m matrix) divide *A* and *B* into smaller sub-matrices, multiply the sub-matrices, and combine the product sub-matrices to get *C*, allowing for fewer implicit calls to the *kernel* routine ("optimized_matmult").
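To make the sub-matrix strategy concrete, here is a minimal CPU-side sketch of blocked matrix multiplication in plain Python. It is not the Brook+ sample code: the function names `matmul_naive` and `matmul_blocked` and the `block` parameter are hypothetical, chosen only for illustration, and the sketch says nothing about how the GPU kernel actually tiles or schedules the work.

```python
def matmul_naive(A, B):
    """Reference product: A is n-by-k, B is k-by-m, result is n-by-m."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_blocked(A, B, block=2):
    """Blocked product: split A and B into block-by-block sub-matrices,
    multiply the sub-matrices, and accumulate the partial products into C.
    `block` is an illustrative tile size, not a value taken from the samples.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):          # row tiles of A and C
        for j0 in range(0, m, block):      # column tiles of B and C
            for p0 in range(0, k, block):  # tiles along the shared dimension
                # Multiply sub-matrix A[i0.., p0..] by sub-matrix B[p0.., j0..]
                # and add the result into the matching tile of C.
                for i in range(i0, min(i0 + block, n)):
                    for j in range(j0, min(j0 + block, m)):
                        acc = 0.0
                        for p in range(p0, min(p0 + block, k)):
                            acc += A[i][p] * B[p][j]
                        C[i][j] += acc
    return C
```

Both functions return the same product for any valid inputs. The blocked version only changes the order in which the partial sums are formed, which is what lets each sub-matrix product be handled as one unit of work.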

Both appear to use a better strategy for loading (i.e., copying) *A* and *B* into video RAM (i.e., GDDR): the sub-matrices are declared as float4 and double2 in **optimized_matmult** and **double_precision_optimized_matmult**, respectively. I assume **streamRead**(...) is optimized when loading into float4 or double2 streams?

I've also noticed that by using this strategy I can multiply larger matrices, even though I'm loading both *A* and *B* into GDDR. Is this related to how the data is stored in memory?

