0 Replies Latest reply on Sep 17, 2008 5:14 AM by jski

    Optimized when loading into float4 or double2 streams?


      Optimized_matmult and double_precision_optimized_matmult compute A x B = C, where A is an n by k matrix, B is a k by m matrix, and C is an n by m matrix. Both divide A and B into smaller sub-matrices, multiply the sub-matrices, and combine the product sub-matrices to get C, which allows for fewer implicit calls to the kernel routine ("optimized_matmult").
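
      For reference, the sub-matrix decomposition described above can be sketched on the CPU in plain C. This is only an illustration of the general blocking idea, not the sample's actual Brook+ kernel: the names blocked_matmul and min_size, the row-major layout, and the block size parameter bs are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: smaller of two sizes, used to clip edge blocks. */
static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/*
 * C (n x m) = A (n x k) * B (k x m), all stored row-major.
 * The product is accumulated one sub-matrix (block) at a time.
 * bs is an assumed block edge length; it need not divide n, k, or m.
 */
static void blocked_matmul(const float *A, const float *B, float *C,
                           size_t n, size_t k, size_t m, size_t bs)
{
    assert(bs > 0);

    /* Start from a zeroed result so block products can be summed into it. */
    for (size_t i = 0; i < n * m; ++i)
        C[i] = 0.0f;

    for (size_t i0 = 0; i0 < n; i0 += bs) {
        for (size_t j0 = 0; j0 < m; j0 += bs) {
            for (size_t p0 = 0; p0 < k; p0 += bs) {
                size_t i1 = min_size(i0 + bs, n);
                size_t j1 = min_size(j0 + bs, m);
                size_t p1 = min_size(p0 + bs, k);

                /* Multiply block A[i0..i1, p0..p1] by block B[p0..p1, j0..j1]
                   and add the result into block C[i0..i1, j0..j1]. */
                for (size_t i = i0; i < i1; ++i) {
                    for (size_t p = p0; p < p1; ++p) {
                        float a = A[i * k + p];
                        for (size_t j = j0; j < j1; ++j)
                            C[i * m + j] += a * B[p * m + j];
                    }
                }
            }
        }
    }
}
```

      Calling blocked_matmul(A, B, C, n, k, m, bs) with any positive bs should give the same C as a plain triple loop; only the order in which partial products are accumulated changes.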

      Both appear to use a more efficient strategy for loading (i.e., copying) A and B into video RAM (i.e., GDDR): the sub-matrices are declared as float4 in optimized_matmult and as double2 in double_precision_optimized_matmult. Is streamRead(...) optimized when loading into float4 or double2 streams?

      I've also noticed that with this strategy I can multiply larger matrices, even though I'm loading both A and B into GDDR. Is this related to how data is stored in memory?
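
      To make the storage question concrete, here is a small C sketch of the arithmetic behind packing. It rests on an assumption that is not confirmed anywhere in this post: that a stream's size is counted in elements, so a float4 element holds 4 scalars and a double2 element holds 2. The name packed_width is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Number of stream elements needed along one row of `width` scalars
 * when `lanes` scalars are packed into each element (assumed: 4 for
 * float4, 2 for double2). Rounds up so a partial last element counts.
 */
static size_t packed_width(size_t width, size_t lanes)
{
    assert(lanes > 0);
    return (width + lanes - 1) / lanes;
}
```

      Under that assumption, a row of 4096 floats occupies 1024 float4 elements, so a fixed per-dimension element limit would admit a matrix row four times as wide in scalars. Whether this is what actually lets the larger matrices through is exactly what I'm asking.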