Is streamRead optimized when loading into float4 or double2 streams?

Discussion created by jski on Sep 17, 2008

The optimized_matmult and double_precision_optimized_matmult routines compute A x B = C, where A and B are n by k and k by m matrices, respectively, and C is an n by m matrix. They divide A and B into smaller sub-matrices, multiply the sub-matrices, and combine the product sub-matrices to get C, allowing for fewer implicit calls to the kernel routine ("optimized_matmult").
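To make the decomposition concrete, here is a minimal sketch in plain Python (not Brook+, and not the actual kernel code). The names matmul and blocked_matmul and the tile size bs are hypothetical, chosen only for illustration. It shows that multiplying sub-matrices and accumulating the partial products gives the same C as the direct product.

```python
def matmul(A, B):
    """Plain triple-loop product of two matrices stored as lists of rows."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def blocked_matmul(A, B, bs):
    """Compute C = A x B tile by tile.

    bs is a hypothetical tile size (an assumption for this sketch);
    tiles on the right and bottom edges may be smaller than bs.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):          # tile row of C
        for j0 in range(0, m, bs):      # tile column of C
            for p0 in range(0, k, bs):  # walk along the shared dimension k
                # Cut out one sub-matrix of A and one of B.
                a = [row[p0:p0 + bs] for row in A[i0:i0 + bs]]
                b = [row[j0:j0 + bs] for row in B[p0:p0 + bs]]
                # Multiply the sub-matrices and accumulate into C's tile.
                c = matmul(a, b)
                for di, crow in enumerate(c):
                    for dj, v in enumerate(crow):
                        C[i0 + di][j0 + dj] += v
    return C


A = [[1, 2, 3], [4, 5, 6]]          # 2 by 3
B = [[7, 8], [9, 10], [11, 12]]     # 3 by 2
print(blocked_matmul(A, B, 2) == matmul(A, B))
```

Each tile product here stands in for one unit of work; how that maps onto kernel calls in the real routines is not something this sketch tries to reproduce.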

Both appear to use a better strategy for loading (i.e., copying) A and B into video RAM (i.e., GDDR): the sub-matrices are declared as float4 in optimized_matmult and as double2 in double_precision_optimized_matmult. I assume streamRead(...) is optimized when loading into float4 or double2 streams; is that correct?
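For reference, a minimal Python sketch of what packing scalars into 4-wide elements looks like, assuming a float4 stream element simply holds four consecutive floats from a row. The helper pack_float4 is hypothetical; it only illustrates the layout and says nothing about what streamRead(...) does internally.

```python
def pack_float4(row):
    """Group a row of scalars into 4-tuples, zero-padding the tail.

    Illustrative only: assumes one float4 element holds four
    consecutive values from the row.
    """
    padded = row + [0.0] * (-len(row) % 4)  # pad length up to a multiple of 4
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]


row = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
print(pack_float4(row))
print(len(pack_float4([0.0] * 1024)))  # number of 4-wide elements for 1024 floats
```

Under this assumption, a row of 1024 floats becomes 256 elements, so the same data is addressed with a quarter as many stream elements (half as many for a 2-wide type such as double2).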

I've also noticed that by using this strategy I can multiply larger matrices, even though I'm loading both A and B into GDDR. Is this related to how the data is stored in memory?