While walking through the "simple_matmult" sample code below, I noticed that "vPos" is declared and initialized and then used only once, in the initialization of "index". Is this use of "indexof(result).xy" an example of "think of the kernel body as being executed on every element in the output stream"?
The kernel code declares the arguments "float A[][], float B[][]" as doubly indexed, but they appear to be used with a single index, "A[index.zw]*B[index.xy]". I assume this is an example of "gather stream" arguments being indexed with a float2 vector?
kernel void
simple_matmult(float Width, float A[][], float B[][], out float result<>)
{
    // vPos - Position of the output matrix i.e. (x,y)
    float2 vPos = indexof(result).xy;

    // index - coordinates of A & B from where the values are fetched
    //         (.xy is used to index B, .zw is used to index A)
    float4 index = float4(vPos.x, 0.0f, 0.0f, vPos.y);

    // step - represents the step by which index is incremented
    float4 step = float4(0.0f, 1.0f, 1.0f, 0.0f);

    // accumulator - Accumulates the result of intermediate calculation
    // between A & B
    float accumulator = 0.0f;

    // Running a loop which starts from
    // (0,vPos.y) in A and (vPos.x,0) in B
    // and increments the 'x' value of A and the 'y' value of B
    // which basically implies that we're fetching values from
    // the 'vPos.y'th row of A and 'vPos.x'th column of B
    float i0 = Width;
    while(i0 > 0)
    {
        // A * B
        accumulator += A[index.zw]*B[index.xy];
        index += step;
        i0 = i0 - 1.0f;
    }

    // Writing the result back to the buffer
    result = accumulator;
}
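To check my reading of both points, here is a rough CPU equivalent of the kernel above in plain C. It is only a sketch under my own assumptions, not anything taken from the SDK: the function name simple_matmult_cpu and its size parameters are made up, the matrices are assumed to be row-major, and a float2 index (x, y) is assumed to mean column x of row y. The two outer loops stand in for the implicit "executed on every element in the output stream" behaviour, and the pairs (ax, ay) and (bx, by) stand in for index.zw and index.xy.

```c
/* Hypothetical CPU reference for simple_matmult (not part of the SDK).
 * Assumed layout: A is out_height x width, B is width x out_width,
 * result is out_height x out_width, all stored row-major. */
void simple_matmult_cpu(int width, int out_width, int out_height,
                        const float *A, const float *B, float *result)
{
    /* These loops replace the implicit per-element kernel invocation;
     * (x, y) plays the role of indexof(result).xy, i.e. vPos. */
    for (int y = 0; y < out_height; ++y) {
        for (int x = 0; x < out_width; ++x) {
            /* index = (vPos.x, 0, 0, vPos.y):
             * .xy addresses B, .zw addresses A */
            int bx = x, by = 0;
            int ax = 0, ay = y;
            float accumulator = 0.0f;

            for (int i = 0; i < width; ++i) {
                /* A[index.zw] * B[index.xy] */
                accumulator += A[ay * width + ax] * B[by * out_width + bx];

                /* index += step, with step = (0, 1, 1, 0):
                 * move one row down in B and one column right in A */
                by += 1;
                ax += 1;
            }

            /* result = accumulator, written at output position (x, y) */
            result[y * out_width + x] = accumulator;
        }
    }
}
```

If this reading is right, calling it with width = 2 on A = {1, 2, 3, 4} and B = {5, 6, 7, 8} (both 2x2) should fill result with {19, 22, 43, 50}, which is the ordinary matrix product.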
---jski