I'm trying to figure out how matrixA/B are stored? Assuming the matrices that are passed in (the arguments) are in **row-major order**(?), then are matrixA/B simply stored as every 4 floats of the matrices passed in are an entry in matrixA/B ?

In particular, I'm interested in:

float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];

float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];

float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];

float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];

If widthA/B is 64, what is the range of pos.y? This seems like things are **column-major ordering**?

In a C routine to multiply matrices, I assume the loop in the attached OpenCL code correspnds to the innermost loop (of the simply C code):

for (int k = 0; k < SIZE; k++) C[\i][j] = A[\i][k] * B[k][j]; Correct?

Nornally if I'm indexing into a matrix stored as a linear sequence of values, I would map M(i,j) to (float *)(M + i*SIZE + j). This seems the be the opposite of: matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]. This seems more like mapping M(i,j) to (float *)(M + i + j*SIZE) ?

Am I missing something?

---John

* *

#define TILEY_SHIFT 2 /* Output tile size : 4x4 = Each thread computes 16 float values*/ /* Required global threads = (widthC / 4, heightC / 4) */ /* This kernel runs on 7xx and CPU as they don't have hardware local memory */ __kernel void mmmKernel(__global float4 *matrixA, __global float4 *matrixB, __global float4* matrixC, uint widthA, uint widthB) { int2 pos = (int2)(get_global_id(0), get_global_id(1)); float4 sum0 = (float4)(0); float4 sum1 = (float4)(0); float4 sum2 = (float4)(0); float4 sum3 = (float4)(0); /* Vectorization of input Matrices reduces their width by a factor of 4 */ widthB /= 4; for(int k = 0; k < widthA; k+=4) { float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]; float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)]; float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)]; float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)]; //Matrix B is not transposed float4 tempB0 = matrixB[pos.x + k * widthB]; float4 tempB1 = matrixB[pos.x + (k + 1) * widthB]; float4 tempB2 = matrixB[pos.x + (k + 2) * widthB]; float4 tempB3 = matrixB[pos.x + (k + 3) * widthB]; sum0.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x + tempA0.w * tempB3.x; sum0.y += tempA0.x * tempB0.y + tempA0.y * tempB1.y + tempA0.z * tempB2.y + tempA0.w * tempB3.y; sum0.z += tempA0.x * tempB0.z + tempA0.y * tempB1.z + tempA0.z * tempB2.z + tempA0.w * tempB3.z; sum0.w += tempA0.x * tempB0.w + tempA0.y * tempB1.w + tempA0.z * tempB2.w + tempA0.w * tempB3.w; sum1.x += tempA1.x * tempB0.x + tempA1.y * tempB1.x + tempA1.z * tempB2.x + tempA1.w * tempB3.x; sum1.y += tempA1.x * tempB0.y + tempA1.y * tempB1.y + tempA1.z * tempB2.y + tempA1.w * tempB3.y; sum1.z += tempA1.x * tempB0.z + tempA1.y * tempB1.z + tempA1.z * tempB2.z + tempA1.w * tempB3.z; sum1.w += tempA1.x * tempB0.w + tempA1.y * tempB1.w + tempA1.z * tempB2.w + tempA1.w * tempB3.w; sum2.x += tempA2.x * tempB0.x + tempA2.y * tempB1.x + tempA2.z * tempB2.x + tempA2.w * tempB3.x; sum2.y += tempA2.x * tempB0.y + tempA2.y * tempB1.y + tempA2.z * tempB2.y + tempA2.w * tempB3.y; sum2.z += tempA2.x * tempB0.z + tempA2.y * tempB1.z + tempA2.z * tempB2.z + tempA2.w * tempB3.z; sum2.w += tempA2.x * tempB0.w + tempA2.y * tempB1.w + tempA2.z * tempB2.w + tempA2.w * tempB3.w; sum3.x += tempA3.x * tempB0.x + tempA3.y * tempB1.x + tempA3.z * tempB2.x + tempA3.w * tempB3.x; sum3.y += tempA3.x * tempB0.y + tempA3.y * tempB1.y + tempA3.z * tempB2.y + tempA3.w * tempB3.y; sum3.z += tempA3.x * tempB0.z + tempA3.y * tempB1.z + tempA3.z * tempB2.z + tempA3.w * tempB3.z; sum3.w += tempA3.x * tempB0.w + tempA3.y * tempB1.w + tempA3.z * tempB2.w + tempA3.w * tempB3.w; } matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 0) * widthB] = sum0; matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 1) * widthB] = sum1; matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 2) * widthB] = sum2; matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 3) * widthB] = sum3; }

> Nornally if I'm indexing into a matrix stored as a linear sequence of values, I would map M(i,j) to (float *)(M + i*SIZE + j).

You definitely can, but _this_ version would be column-major ordering. Ex. matrix:

[1 2]

[3 4]

would be stored as 1,3,2,4. Just try to compute it.

> This seems the be the opposite of: matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]. This seems more like mapping M(i,j) to (float *)(M + i + j*SIZE) ?

Yes, the mapping is such. And this is row-major

ADDED:

> If widthA/B is 64, what is the range of pos.y?

I suppose it's 0..15 (as well as pos.x)