
jski
Journeyman III

MatrixMultiplication sample question?

 

I'm trying to figure out how matrixA/B are stored. Assuming the matrices passed in (the arguments) are in row-major order, are matrixA/B simply stored so that every 4 floats of the input matrices form one float4 entry of matrixA/B?

In particular, I'm interested in:

float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];
float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];
float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];
float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];

If widthA/B is 64, what is the range of pos.y? This seems like things are column-major ordering?

In a C routine to multiply matrices, I assume the loop in the attached OpenCL code corresponds to the innermost loop of the simple C code:

for (int k = 0; k < SIZE; k++) C[i][j] += A[i][k] * B[k][j]; Correct?

Normally, if I'm indexing into a matrix stored as a linear sequence of values, I would map M(i,j) to (float *)(M + i*SIZE + j). This seems to be the opposite of: matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)], which looks more like mapping M(i,j) to (float *)(M + i + j*SIZE)?

Am I missing something?

 

---John

#define TILEY_SHIFT 2

/* Output tile size : 4x4 = Each thread computes 16 float values */
/* Required global threads = (widthC / 4, heightC / 4) */
/* This kernel runs on 7xx and CPU as they don't have hardware local memory */
__kernel void mmmKernel(__global float4 *matrixA,
                        __global float4 *matrixB,
                        __global float4 *matrixC,
                        uint widthA, uint widthB)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));

    float4 sum0 = (float4)(0);
    float4 sum1 = (float4)(0);
    float4 sum2 = (float4)(0);
    float4 sum3 = (float4)(0);

    /* Vectorization of input Matrices reduces their width by a factor of 4 */
    widthB /= 4;

    for(int k = 0; k < widthA; k += 4)
    {
        float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];
        float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];
        float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];
        float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];

        /* Matrix B is not transposed */
        float4 tempB0 = matrixB[pos.x + k * widthB];
        float4 tempB1 = matrixB[pos.x + (k + 1) * widthB];
        float4 tempB2 = matrixB[pos.x + (k + 2) * widthB];
        float4 tempB3 = matrixB[pos.x + (k + 3) * widthB];

        sum0.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x + tempA0.w * tempB3.x;
        sum0.y += tempA0.x * tempB0.y + tempA0.y * tempB1.y + tempA0.z * tempB2.y + tempA0.w * tempB3.y;
        sum0.z += tempA0.x * tempB0.z + tempA0.y * tempB1.z + tempA0.z * tempB2.z + tempA0.w * tempB3.z;
        sum0.w += tempA0.x * tempB0.w + tempA0.y * tempB1.w + tempA0.z * tempB2.w + tempA0.w * tempB3.w;

        sum1.x += tempA1.x * tempB0.x + tempA1.y * tempB1.x + tempA1.z * tempB2.x + tempA1.w * tempB3.x;
        sum1.y += tempA1.x * tempB0.y + tempA1.y * tempB1.y + tempA1.z * tempB2.y + tempA1.w * tempB3.y;
        sum1.z += tempA1.x * tempB0.z + tempA1.y * tempB1.z + tempA1.z * tempB2.z + tempA1.w * tempB3.z;
        sum1.w += tempA1.x * tempB0.w + tempA1.y * tempB1.w + tempA1.z * tempB2.w + tempA1.w * tempB3.w;

        sum2.x += tempA2.x * tempB0.x + tempA2.y * tempB1.x + tempA2.z * tempB2.x + tempA2.w * tempB3.x;
        sum2.y += tempA2.x * tempB0.y + tempA2.y * tempB1.y + tempA2.z * tempB2.y + tempA2.w * tempB3.y;
        sum2.z += tempA2.x * tempB0.z + tempA2.y * tempB1.z + tempA2.z * tempB2.z + tempA2.w * tempB3.z;
        sum2.w += tempA2.x * tempB0.w + tempA2.y * tempB1.w + tempA2.z * tempB2.w + tempA2.w * tempB3.w;

        sum3.x += tempA3.x * tempB0.x + tempA3.y * tempB1.x + tempA3.z * tempB2.x + tempA3.w * tempB3.x;
        sum3.y += tempA3.x * tempB0.y + tempA3.y * tempB1.y + tempA3.z * tempB2.y + tempA3.w * tempB3.y;
        sum3.z += tempA3.x * tempB0.z + tempA3.y * tempB1.z + tempA3.z * tempB2.z + tempA3.w * tempB3.z;
        sum3.w += tempA3.x * tempB0.w + tempA3.y * tempB1.w + tempA3.z * tempB2.w + tempA3.w * tempB3.w;
    }

    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 0) * widthB] = sum0;
    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 1) * widthB] = sum1;
    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 2) * widthB] = sum2;
    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 3) * widthB] = sum3;
}

0 Likes
18 Replies
pulec
Journeyman III

> Normally if I'm indexing into a matrix stored as a linear sequence of values, I would map M(i,j) to (float *)(M + i*SIZE + j).

You definitely can, but _this_ version would be column-major ordering. For example, the matrix:

[1 2]

[3 4]

would be stored as 1,3,2,4. Just try to compute it.

> This seems to be the opposite of: matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]. This seems more like mapping M(i,j) to (float *)(M + i + j*SIZE)?

Yes, the mapping is like that, and this is row-major.

ADDED:

> If widthA/B is 64, what is the range of pos.y?

I suppose it's 0..15 (as well as pos.x)

0 Likes
jski
Journeyman III

1) If the matrices are stored in column-major ordering, then I assume the code:

inputBuffer1 = clCreateBuffer( context, CL_MEM_READ_ONLY, sizeof(cl_float) * width1 * height1, 0, &status);

status = clEnqueueWriteBuffer(commandQueue, inputBuffer1, 1, 0, sizeof(cl_float) * width1 * height1, input1, 0, 0, 0);

does this???

2) Are the ranges of:

get_global_id(0), get_global_id(1), get_local_id(0), and get_local_id(1)

established by:

size_t globalThreads[2]= {width1 / 4, height0/ 4};
size_t localThreads[2] = {blockSize, blockSize};

status = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, globalThreads, localThreads, 0, NULL, &events[0]);  ???

 

---John

 

0 Likes
pulec
Journeyman III

ad 1) I am afraid I have not understood your question

ad 2) yes
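
To make that concrete (illustrative numbers of my own, not taken from the sample): if width1 = height0 = 64 and blockSize = 4, then

globalThreads = {64 / 4, 64 / 4} = {16, 16}  ->  get_global_id(0) and get_global_id(1) each range over 0..15
localThreads  = {4, 4}                       ->  get_local_id(0) and get_local_id(1) each range over 0..3 within a work-group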

0 Likes
jski
Journeyman III

1) Is there some transformation of the matrix from the way it's stored on the C side vs. how it's stored on the OpenCL (kernel) side? Or is there an assumption that the matrix passed as a parameter to clEnqueueWriteBuffer(...) is in column-major order?

2) I'm trying to figure out the code in the MatrixMultiplication sample:

float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];
float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];
float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];
float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];

Is matrixA just a linear sequence of floats, and by typing matrixA as float4 * does it simply access the next 4 floats as a single entry? Or is there some OpenCL kernel format I'm unaware of?

When I index into matrixA[ 2 ], is this the third set of 4 floats? Or is it simply the third float in the matrix, with the next 4 floats cast as a float4?

 

---John

0 Likes
pulec
Journeyman III

1) I think you haven't understood me correctly (sorry for my English) - I meant that the sample (in the SDK) computes in row-major format, not column-major.

There is basically no assumption about the format of the data - a buffer is a simple linear (1D) data structure. You can represent a matrix in any format you want (even sparse), as long as the kernel computes with respect to that format.
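
For example, a minimal host-side sketch (my own illustration, reusing the variable names from the sample; the helper a(i, j) is hypothetical and just produces the value of element (i, j)): the host fills the array in row-major order, and the buffer copy does not reorder anything, so the kernel sees exactly that linear layout.

/* Element (i,j) lands at linear index i*width1 + j of the plain float array input1. */
for (int i = 0; i < height1; i++)          /* row    */
    for (int j = 0; j < width1; j++)       /* column */
        input1[i * width1 + j] = a(i, j);

/* Byte-for-byte copy of that array into the device buffer. */
status = clEnqueueWriteBuffer(commandQueue, inputBuffer1, CL_TRUE, 0,
                              sizeof(cl_float) * width1 * height1,
                              input1, 0, NULL, NULL);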

2)

> Is matrixA just a linear sequence of floats and by casting (typing) matrixA as *float4 it simply accesses the next 4 floats as a single entry?

Yes, that's correct. There is nothing special about how the kernel accesses the data (in the case of buffers). It is just standard C99 code. Data parallelism is achieved by having different threads process different chunks of data - they typically obtain the information about which data to process from their "position" (determined by the get_global_id/get_local_id calls).

EDIT:

> When I index into matrixA[ 2 ], is this the third set of 4 floats? Or is it simply the third float in the matrix and the next 4 floats are cast as float4 ?

It is the third element, i.e. the third tuple (a float4 in this case); pointer arithmetic works as expected in OpenCL (as in standard C).
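
In other words (a small illustration of my own, assuming the buffer holds at least 12 floats):

__global float4 *matrixA = ...;   /* views the same memory as a __global float *a           */
float4 v = matrixA[2];            /* reads the scalar elements a[8], a[9], a[10], a[11]     */
                                  /* v.x == a[8], v.y == a[9], v.z == a[10], v.w == a[11]   */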

EDIT2:

Oh, I have just realized what you asked before. No, clCreateBuffer just allocates memory on the card, and clEnqueueWriteBuffer copies the data from main memory to the card's memory (if the compute device is a graphics card). That's all.

0 Likes
jski
Journeyman III

If it's in row-major order, then why does the example use:

 float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]; ???

This seems to be equivalent to: matrixA[ i + SIZE*j ], which is column-major order?

 

---John

0 Likes
jski
Journeyman III

One more question: if the indexing into matrixA is for float4 entries, why do they multiply pos.y by 4 (pos.y << TILEY_SHIFT)?

0 Likes
pulec
Journeyman III

Originally posted by: jski One more question: if the indexing into matrixA is for float4 entries, why do they multiply pos.y by 4 (pos.y << TILEY_SHIFT)?

 

It is because of the vectorization - each thread computes a 4x4 block of elements. The vertical count of threads is (height_of_matrix / 4), so every thread needs to access 4 rows: the first thread accesses rows 0-3, the second rows 4-7, and so on. That's why it is multiplied.
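
As a worked illustration (my own numbers): with widthA = 64, the float4-typed matrixA has widthA / 4 = 16 entries per row, so the thread with pos.y = 3 (and TILEY_SHIFT = 2, hence pos.y << TILEY_SHIFT = 12) reads rows 12..15 of A:

tempA0 -> row 12: matrixA[k/4 + 12 * 16]
tempA1 -> row 13: matrixA[k/4 + 13 * 16]
tempA2 -> row 14: matrixA[k/4 + 14 * 16]
tempA3 -> row 15: matrixA[k/4 + 15 * 16]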

0 Likes
pulec
Journeyman III

Originally posted by: jski

 

This seems to be equivalent to: matrixA[ i + SIZE*j ], which is column-major order?

No, it is row-major. Think of the matrix:

[0 1]

[2 3]

In row-major format it is stored as [0, 1, 2, 3]. If you need to access element A[1][0], you multiply the row index (j) by SIZE and add the index i, so 1*2 + 0 gives linear index 2, where element 2 lies. Realize that in row-major format the elements of one row are "near" each other. On the other hand, elements from the same column but different rows are "far" away, because the row index is multiplied by the width of the matrix.

In column-major format it would be the opposite.
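
A tiny C check of that offset arithmetic (my own sketch, just verifying the 2x2 example above):

#include <stdio.h>

int main(void)
{
    float A[4] = { 0, 1, 2, 3 };   /* the matrix [0 1; 2 3] in row-major order   */
    int SIZE = 2;                  /* width of the matrix                        */
    int row = 1, col = 0;          /* we want element A[1][0], i.e. the value 2  */

    /* Row-major offset: row * SIZE + col = 1*2 + 0 = 2. */
    printf("%f\n", A[row * SIZE + col]);   /* prints 2.000000 */
    return 0;
}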

0 Likes
jski
Journeyman III

I agree: row-major order is equivalent to matrixA[ i*SIZE + j ], but in the example they're multiplying pos.y by SIZE (or widthA/4), which is equivalent to matrixA[ i + j*SIZE ].

0 Likes
pulec
Journeyman III

Originally posted by: jski I agree. row-major order is equivalent to: matrixA[ i*SIZE + j ] but in the example they're multiplying pos.y by SIZE (or widthA/4), which is equivalent to matrixA[ i + j*SIZE ].

 

Just no - row-major is matrixA[i + j*WIDTH].

EDIT: see wiki: http://en.wikipedia.org/wiki/Row_major

0 Likes
jski
Journeyman III

I do believe we've arrived at the crux (according to Wikipedia):

Row major: offset = row*NUMCOLS + column (or offset = i*SIZE + j)

Column major:  offset = row + column*NUMROWS (or offset = i + j*SIZE),

which is what I believe I've been saying.

But you're saying (whatever we call it) that the OpenCL code in the example is using: offset = i + j*SIZE ???

 

---John

0 Likes
pulec
Journeyman III

I have to apologize for making a bit of a mess here. I have just verified that I swapped the meaning of the letters i and j. (I didn't remember it from school, and I thought that i,j correspond to x,y with x being the vertical coordinate, but it is the opposite.)
So you were right that row-major is:
offset = i*SIZE + j
> But you're saying (whatever we call it) that the OpenCL code in the example is using: offset = i + j*SIZE ???
I'd better not use i and j.
In the sample it is therefore:
offset = row_index * width + column_index (supposing that i == row index, j == column index)
So row-major.
In the sample, pos.y is equivalent to the letter i.
Again, sorry for making a mess.
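
Applying that to the kernel's A index (a worked check of my own, using widthA = 64 as an example): the float4-typed row width is widthA / 4 = 16, the row index is pos.y << TILEY_SHIFT (plus 0..3 for the four rows of the tile), and the column index is k/4, so

matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]
    == matrixA[column_index + row_index * row_width]

which is offset = row_index * width + column_index, i.e. row-major.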
0 Likes
jski
Journeyman III

Pulec, first THANKS!

So we've arrived at the conclusion: in this example, AMD has inverted the usual interpretation of get_global_id(0) and get_global_id(1). Here, get_global_id(0) corresponds to the Y-axis and get_global_id(1) corresponds to the X-axis?

0 Likes
pulec
Journeyman III

> So we've arrived at the conclusion
I am afraid we didn't.

The meaning of X,Y is the usual one: get_global_id(0) corresponds to the X-axis / the j matrix coordinate / the horizontal axis, and get_global_id(1) is the Y-axis, i.e. the row matrix coordinate.
Why do you think it is inverted?

The sample is basically this (non-vectorized version):

int2 pos = (int2)(get_global_id(0), get_global_id(1));

matrixC[pos.x + pos.y * widthB] = 0;
for(int k = 0; k < widthA; ++k)
{
matrixC[pos.x + pos.y * widthB] += matrixA[k + pos.y * widthA] * matrixB[pos.x + k * widthB];
}
I think it is perfectly straightforward - for C[i][j] we take a row from matrixA multiplied by a column from matrixB (pos.x == j; pos.y == i).
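
For comparison, a plain C reference version of the same computation (my own sketch, assuming row-major float arrays A, B and C, with A of size heightA x widthA and B of size widthA x widthB):

/* C = A * B, all matrices stored row-major.                            */
/* C[i][j] lives at C[j + i*widthB], which matches the kernel's         */
/* matrixC[pos.x + pos.y * widthB] with pos.y == i and pos.x == j.      */
for (int i = 0; i < heightA; ++i)
    for (int j = 0; j < widthB; ++j)
    {
        float acc = 0.0f;
        for (int k = 0; k < widthA; ++k)
            acc += A[k + i * widthA] * B[j + k * widthB];
        C[j + i * widthB] = acc;
    }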
0 Likes
jski
Journeyman III

Thanks.

My misunderstanding was that I had assumed the opposite of pos.x == j; pos.y == i. ---John

0 Likes
pulec
Journeyman III

Originally posted by: jski

Thanks.

My misunderstanding was: pos.x == j; pos.y == i.  I had assumed the opposite.  ---John

Yes, it's natural; I had also swapped the mapping of those letters, as you have probably noticed.
0 Likes
jski
Journeyman III

Thanks for your patience!

0 Likes