**Optimized_matmult** and **double_precision_optimized_matmult** (*A x B = C*, where *A* and *B* are n by k and k by m matrices, respectively, and *C* is an n by m matrix) divide *A* and *B* into smaller sub-matrices, multiply the sub-matrices, and combine the product sub-matrices to get *C*, allowing for fewer implicit calls to the *kernel* routine ("optimized_matmult").

Both appear to use a more efficient strategy for loading (i.e., copying) *A* and *B* into video RAM (i.e., GDDR): the sub-matrices are declared as float4 and double2 in **optimized_matmult** and **double_precision_optimized_matmult**, respectively. I assume **streamRead**(...) is optimized when loading into float4 or double2 streams?

I've also noticed that with this strategy I can multiply larger matrices, even though I'm loading both *A* and *B* into GDDR. Is this related to how the data is stored in memory?

---jski