To facilitate parallelism, the optimized matmult sample splits the input matrices into sub-matrices, multiplies them, and finally combines the results into the output matrix. The logic is clear, since each sub-matrix product is built from the same row-by-column dot products as the product of the original matrices.
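To make the algebra concrete, here is a minimal sketch in plain Python (not Brook+, and not the sample's actual code) of multiplying by sub-matrices and comparing against the direct product. The function names and the assumption of square matrices whose size is a multiple of the block size are mine, chosen only for illustration.

```python
def matmul(a, b):
    """Direct row-by-column product of two matrices stored as lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]

def block(mat, r0, c0, size):
    """Extract the size x size sub-matrix whose top-left corner is (r0, c0)."""
    return [row[c0:c0 + size] for row in mat[r0:r0 + size]]

def blocked_matmul(a, b, size):
    """Multiply square matrices block by block (hypothetical helper).

    Assumes len(a) is a multiple of `size`. Each output block is the sum
    of products of a block row of `a` with a block column of `b`.
    """
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for bi in range(0, n, size):
        for bj in range(0, n, size):
            for bk in range(0, n, size):
                part = matmul(block(a, bi, bk, size), block(b, bk, bj, size))
                for i in range(size):
                    for j in range(size):
                        c[bi + i][bj + j] += part[i][j]
    return c

a = [[4 * i + j for j in range(4)] for i in range(4)]
b = [[(i + 1) * (j + 2) for j in range(4)] for i in range(4)]
print(blocked_matmul(a, b, 2) == matmul(a, b))
```

With exact integer arithmetic the two results agree for any block size that divides the matrix size, and no row swapping is needed.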

However, the sample swaps rows in the input matrices when the sub-matrices are built. Without swapping the rows (but retaining all the rest of the logic that splits the original input matrices into sub-matrices), the result does not match the direct matrix multiplication done on the CPU. From an algebraic point of view this action (swapping the rows) is meaningless, but it seems to be critical from a stream computing point of view!

If I populate the input matrices with small integers, the CPU result matches the sub-matrix multiplication result, but if I use the LINEAR_INT fill mode (or RANDOM), the results do not match.
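One effect that could be ruled out here is single-precision rounding, which depends on the order of the additions and only shows up once values grow large. The sketch below is plain Python and rests on an assumption not confirmed by the sample source: that the matrices hold single-precision floats. The helper names are hypothetical.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def f32_sum(values):
    """Accumulate left to right, rounding to float32 after every addition."""
    total = 0.0
    for v in values:
        total = to_f32(total + to_f32(v))
    return total

# 2**24 = 16777216: above this, float32 can no longer represent every integer.
big = float(2 ** 24)
forward = f32_sum([big, 1.0, 1.0])
backward = f32_sum([1.0, 1.0, big])
print(forward, backward, forward == backward)
```

The same three numbers sum to 16777216.0 in one order and 16777218.0 in the other. Whether this applies to the mismatch described here depends on how large the dot products actually get with LINEAR_INT or RANDOM data.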

What also confuses me is that this does not directly concern the GPU, as even BRT_RUNTIME=cpu leads to the same behavior. It is hard to believe I hit some overflow: that should not be related to the algorithm in general, and the numbers are still pretty small.

In the Stream Computing User Guide, however (3.5.3 Optimized Implementation), there is a note: “to get the correct result, the input data must be preprocessed so that each four-component element in the input matrixes contain a 2x2 micro-tile of data values from the original matrix”. However, I do not see how it is connected with the results I get (partly because I get the same results in CPU mode).
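For reference, this is how I read the quoted preprocessing step, as a minimal plain-Python sketch. The component order (top-left, top-right, bottom-left, bottom-right) and the function name are assumptions for illustration; the guide's actual layout may differ.

```python
def pack_micro_tiles(mat):
    """Pack a matrix into four-component elements, one per 2x2 micro-tile.

    Assumes an even number of rows and columns. Each tuple holds
    (top-left, top-right, bottom-left, bottom-right); that order is an
    assumption, not taken from the guide.
    """
    rows, cols = len(mat), len(mat[0])
    return [[(mat[r][c], mat[r][c + 1], mat[r + 1][c], mat[r + 1][c + 1])
             for c in range(0, cols, 2)]
            for r in range(0, rows, 2)]

m = [[4 * i + j for j in range(4)] for i in range(4)]
for tile_row in pack_micro_tiles(m):
    print(tile_row)
```

Under this reading, every four-component element carries values from two adjacent rows of the original matrix, so the packed data is no longer laid out in plain row order.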

Any help will be greatly appreciated.