
gpgpu
Journeyman III

How does this code perform matrix-matrix multiplication?

kernel void
simple_matmult(float Width, float A[][], float B[][], out float result<>)
{
    // vPos - position of the output matrix element, i.e. (x, y)
    float2 vPos = indexof(result).xy;

    // index - coordinates of A & B from where the values are fetched
    float4 index = float4(vPos.x, 0.0f, 0.0f, vPos.y);

    // step - represents the step by which index is incremented
    float4 step = float4(0.0f, 1.0f, 1.0f, 0.0f);

    // accumulator - accumulates the results of the intermediate calculations
    // between A & B
    float accumulator = 0.0f;

    // Running a loop which starts from
    // (0, vPos.y) in A and (vPos.x, 0) in B
    // and increments the 'y' value of A and the 'x' value of B,
    // which basically implies that we're fetching values from
    // the 'vPos.y'th row of A and the 'vPos.x'th column of B
    float i0 = Width;
    while (i0 > 0)
    {
        // A * B
        accumulator += A[index.zw]*B[index.xy];
        index += step;
        i0 = i0 - 1.0f;
    }

    // Writing the result back to the buffer
    result = accumulator;
}

accumulator += A[index.zw]*B[index.xy];
I am confused by this statement.

According to the above statement, it seems to be multiplying A's column by B's row.

I really don't understand this statement.
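
For reference, this is the plain C computation I expect each output element to correspond to (my own sketch, assuming square Width x Width row-major matrices, not code from the SDK):

void matmult_ref(int width, const float *A, const float *B, float *result)
{
    for (int y = 0; y < width; ++y)           // vPos.y - output row
        for (int x = 0; x < width; ++x) {     // vPos.x - output column
            float acc = 0.0f;
            for (int i = 0; i < width; ++i)   // the kernel's while loop
                acc += A[y * width + i] * B[i * width + x];   // row of A dotted with column of B
            result[y * width + x] = acc;
        }
}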

Thanks for any explanation.

foxx1337
Adept I

hello there,

in cpu code you'd write:

for (x = 0; x < height; ++x)
    for (y = 0; y < width; ++y)
        ....

which means that y varies the fastest, and x the slowest.

on the gpu side, the order is reversed.

for example, if you fill up a cube on the cpu with:

for (i = 0; i < planes; ++i)
    for (j = 0; j < height; ++j)
        for (k = 0; k < width; ++k)
            cube[i][j][k] = i * height * width + j * width + k;

then you streamRead(c, cube) and call a kernel on c, with the kernel taking an argument declared as c[width][height][planes].

you'll want to loop through it to see its elements in the above order with:

for (k = 0; k < width; ++k)
    for (j = 0; j < height; ++j)
        for (i = 0; i < planes; ++i) {
            int3 pos = int3(i, j, k); // notice that the first arg of int3 is the fastest changing one
            ... c[pos];
        }

 

hope i didn't mess up my indices.
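
if it helps, here's a small plain-C check of that layout (just my own sketch, not SDK code): after the fill above, each element holds its own linear offset, which makes the "width varies fastest" ordering visible in memory.

#include <stdio.h>

int main(void)
{
    const int planes = 2, height = 3, width = 4;
    int cube[2][3][4];

    // same fill as above: k (width) varies fastest
    for (int i = 0; i < planes; ++i)
        for (int j = 0; j < height; ++j)
            for (int k = 0; k < width; ++k)
                cube[i][j][k] = i * height * width + j * width + k;

    // the flat view of the array counts 0, 1, 2, ... - that linear order
    // is what gets handed to streamRead
    const int *flat = &cube[0][0][0];
    for (int n = 0; n < planes * height * width; ++n)
        printf("%d ", flat[n]);
    printf("\n");
    return 0;
}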

 

for int4 / float4, the 4 subcomponents are .x, .y, .z and .w respectively, x the fastest changing, w the slowest in a 4D universe.

here's how i wrote my matmult kernel, to make it easier to explain to my colleagues:

kernel void GpuMul(int k, float A[][], float B[][], out float C<>)
{
    int2 position = indexof(C).yx;
    int4 i = int4(position.x, 0, 0, position.y);    // reversed layout for indexof(C).xy
    int4 delta = int4(0, 1, 1, 0);
    float a = 0.0f;
    int len;
    for (len = 0; len < k; ++len) { // multiply whole row from A with whole column from B
        a += A[i.yx] * B[i.wz];
        i += delta;
    }
    C = a;
}
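
roughly, the host side would look something like this (a sketch, not my exact code; sizes and names are just an example, double-check against the SDK samples):

int main(void)
{
    float a[64][64], b[64][64], c[64][64];

    float A<64, 64>;
    float B<64, 64>;
    float C<64, 64>;

    // ... fill a and b ...

    streamRead(A, a);       // copy the input matrices into the streams
    streamRead(B, b);
    GpuMul(64, A, B, C);    // the runtime invokes the kernel once per element of C
    streamWrite(C, c);      // copy the product back to host memory

    return 0;
}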


Thanks for your nice and detailed illustration.

Could you post your optimized matmult kernel?

Are there any better algorithms for matrix-matrix multiplication?

Thanks in advance.


Don't have one better than AMD's, but by simply applying a transposition on B, one gets a 100% speedup due to data locality, as stated in the AMD Stream SDK User Guide.
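
On the CPU side the same idea looks roughly like this (my own minimal sketch, assuming square n x n row-major matrices): transposing B up front lets the innermost loop read both operands contiguously.

#include <stdlib.h>

// C = A * B, but reading B through its transpose Bt so that the
// innermost loop walks both operands row-wise (contiguously)
void matmult_bt(int n, const float *A, const float *B, float *C)
{
    float *Bt = malloc((size_t)n * n * sizeof *Bt);

    // Bt[j][i] = B[i][j]
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j)
            Bt[j * n + i] = B[i * n + j];

    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < n; ++k)
                acc += A[i * n + k] * Bt[j * n + k];  // both read sequentially
            C[i * n + j] = acc;
        }

    free(Bt);
}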
