hello there,
in cpu code you'd write:
for (x = 0; x < height; ++x)
    for (y = 0; y < width; ++y)
        ...
which means that y varies the fastest, and x the slowest.
on the gpu side, the order is reversed.
for example, if you fill up a cube on the cpu with:
for (i = 0; i < planes; ++i)
    for (j = 0; j < height; ++j)
        for (k = 0; k < width; ++k)
            cube[i][j][k] = i * height * width + j * width + k;
then do streamRead(c, cube) and call a kernel on c, declaring the argument as c[width][height][planes] (note the reversed dimension order).
to visit its elements in that same linear order, loop with the nesting reversed:
for (k = 0; k < width; ++k)
    for (j = 0; j < height; ++j)
        for (i = 0; i < planes; ++i) {
            int3 pos = int3(i, j, k); // notice that the first arg of int3 is the fastest changing one
            ... c[pos];
        }
hope i didn't mess up my indices.
for int4 / float4, the four components are .x, .y, .z and .w, with .x the fastest changing and .w the slowest, in a 4D universe.
here's how i wrote my matmult kernel, to make it easier to explain to my colleagues:
kernel void GpuMul(int k, float A[][], float B[][], out float C<>)
{
    int2 position = indexof(C).yx;
    int4 i = int4(position.x, 0, 0, position.y); // reversed layout for indexof(C).xy
    int4 delta = int4(0, 1, 1, 0);
    float a = 0.0f;
    int len;
    for (len = 0; len < k; ++len) { // multiply a whole row of A with a whole column of B
        a += A[i.yx] * B[i.wz];
        i += delta;
    }
    C = a;
}