3 Replies Latest reply on Oct 27, 2008 8:59 AM by foxx1337

    How does this code perform matrix-matrix multiplication?


      kernel void
      simple_matmult(float Width, float A[][], float B[][], out float result<>
          // vPos - Position of the output matrix i.e. (x,y)
       float2 vPos = indexof(result).xy;
          // index - coordinates of A & B from where the values are fetched
       float4 index = float4(vPos.x, 0.0f, 0.0f, vPos.y);
          // step - represents the step by which index is incremented
       float4 step = float4(0.0f, 1.0f, 1.0f, 0.0f);
          // accumulator - Accumulates the result of intermediate calculation
          // between A & B
       float accumulator = 0.0f;
          // Running a  loop which starts from
          // (0,vPos.y) in A and (vPos.x,0) in B
          // and increments the 'y' value of A and the 'x' value of B
          // which basically implies that we're fetching values from
          // the 'vPos.y'th row of A and 'vPox.x'th column of B
          float i0 = Width;
          while(i0 > 0)
              // A[k] * B[k][j]
              accumulator += A[index.zw]*B[index.xy];
              index += step;  
              i0 = i0 - 1.0f;

          // Writing the result back to the buffer
          result = accumulator;

      accumulator += A[index.zw]*B[index.xy];
      I am confused by this statement.

      According to the aboving statement, it should be A's column * B's row.

      I really don't understand this statement.

      Thanks for explaination.

        • How does this code perform matrix-matrix multiplication?

          hello there,

          in cpu code you'd write:

          for (x = 0; x < height; ++x)
              for (y = 0; y < width; ++y)

          which means that y varies the fastest, and x the slowest.

          on the gpu side, the order is reversed.

          for example, if you fill up a cube on the cpu with:

          for (i = 0; i < planes; ++i)
              for (j = 0; j < height; ++j)
                  for (k = 0; k < width; ++k)
                      cube[ i][j][k] = i * planes * height + j * height + width;

          then streamRead(c, cube) and call a kernel for c, kernel with an argument c[width][height][planes]

          you'll want to loop through it to see its elements in the above order with:

          for (k = 0; k < width; ++k)
              for (j = 0; j < height; ++j)
                  for (i = 0; i < planes; ++i) {
                      int3 pos = int3(i, j, k); // notice that the first arg of int3 is the fastest changing one
                      ... c[pos];


          hope i didn't mess up my indices.


          for int4 / float4, the 4 subcomponents are .x, .y, .z and .w respectively, x the fastest changing, w the slowest in a 4D universe.

          here's how i wrote my matmult kernel, in order to help me explain it easier to my colleagues:

          kernel void GpuMul(int k, float A[][], float B[][], out float C<>
              int2 position = indexof(C).yx;
              int4 i = int4(position.x, 0, 0, position.y);    // reversed layout for indexof(C).xy
              int4 delta = int4(0, 1, 1, 0);
              float a = 0.0f;
              int len;
              for (len = 0; len < k; ++len) { // multiply whole row from A with whole column from B
                  a += A[i.yx] * B[i.wz];
                  i += delta;
              C = a;