kernel void

simple_matmult(float Width, float A[][], float B[][], out float result<>

{

// vPos - Position of the output matrix i.e. (x,y)

float2 vPos = indexof(result).xy;

// index - coordinates of A & B from where the values are fetched

float4 index = float4(vPos.x, 0.0f, 0.0f, vPos.y);

// step - represents the step by which index is incremented

float4 step = float4(0.0f, 1.0f, 1.0f, 0.0f);

// accumulator - Accumulates the result of intermediate calculation

// between A & B

float accumulator = 0.0f;

// Running a loop which starts from

// (0,vPos.y) in A and (vPos.x,0) in B

// and increments the 'y' value of A and the 'x' value of B

// which basically implies that we're fetching values from

// the 'vPos.y'th row of A and 'vPox.x'th column of B

float i0 = Width;

while(i0 > 0)

{

// A*[k] * B[k][j] accumulator += A[index.zw]*B[index.xy]; index += step; i0 = i0 - 1.0f; }*

* // Writing the result back to the buffer result = accumulator;}*

**accumulator += A[index.zw]*B[index.xy];**

I am confused by this statement.

*According to the aboving statement, it should be A's column * B's row.*

*I really don't understand this statement.*

*Thanks for explaination.*

hello there,

in cpu code you'd write:

for (x = 0; x < height; ++x)

for (y = 0; y < width; ++y)

....

which means that y varies the fastest, and x the slowest.

on the gpu side, the order is reversed.

for example, if you fill up a cube on the cpu with:

for (i = 0; i < planes; ++i)

for (j = 0; j < height; ++j)

for (k = 0; k < width; ++k)

cube[ i][j][k] = i * planes * height + j * height + width;

then streamRead(c, cube) and call a kernel for c, kernel with an argument c[width][height][planes]

you'll want to loop through it to see its elements in the above order with:

for (k = 0; k < width; ++k)

for (j = 0; j < height; ++j)

for (i = 0; i < planes; ++i) {

int3 pos = int3(i, j, k); // notice that the first arg of int3 is the fastest changing one

... c[pos];

}

hope i didn't mess up my indices.

for int4 / float4, the 4 subcomponents are .x, .y, .z and .w respectively, x the fastest changing, w the slowest in a 4D universe.

here's how i wrote my matmult kernel, in order to help me explain it easier to my colleagues:

kernel void GpuMul(int k, float A[][], float B[][], out float C<>

{

int2 position = indexof(C).yx;

int4 i = int4(position.x, 0, 0, position.y); // reversed layout for indexof(C).xy

int4 delta = int4(0, 1, 1, 0);

float a = 0.0f;

int len;

for (len = 0; len < k; ++len) { // multiply whole row from A with whole column from B

a += A[i.yx] * B[i.wz];

i += delta;

}

C = a;

}