
# Archives Discussions

Journeyman III

## MatrixMultiplication sample question?

I'm trying to figure out how matrixA/B are stored. Assuming the matrices that are passed in (the arguments) are in row-major order, are matrixA/B simply stored so that every 4 floats of the passed-in matrices form one float4 entry?

In particular, I'm interested in:

float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];
float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];
float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];
float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];

If widthA/B is 64, what is the range of pos.y? This looks like column-major ordering to me.

In a C routine to multiply matrices, I assume the loop in the attached OpenCL code corresponds to the innermost loop of the simple C version:

for (int k = 0; k < SIZE; k++) C[i][j] += A[i][k] * B[k][j]; Correct?

Normally, if I'm indexing into a matrix stored as a linear sequence of values, I would map M(i,j) to (float *)(M + i*SIZE + j). This seems to be the opposite of: matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)], which looks more like mapping M(i,j) to (float *)(M + i + j*SIZE)?

Am I missing something?

---John

```c
#define TILEY_SHIFT 2

/* Output tile size : 4x4 = Each thread computes 16 float values */
/* Required global threads = (widthC / 4, heightC / 4) */
/* This kernel runs on 7xx and CPU as they don't have hardware local memory */
__kernel void mmmKernel(__global float4 *matrixA,
                        __global float4 *matrixB,
                        __global float4 *matrixC,
                        uint widthA,
                        uint widthB)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));

    float4 sum0 = (float4)(0);
    float4 sum1 = (float4)(0);
    float4 sum2 = (float4)(0);
    float4 sum3 = (float4)(0);

    /* Vectorization of input Matrices reduces their width by a factor of 4 */
    widthB /= 4;

    for(int k = 0; k < widthA; k += 4)
    {
        float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];
        float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];
        float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];
        float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];

        /* Matrix B is not transposed */
        float4 tempB0 = matrixB[pos.x + k * widthB];
        float4 tempB1 = matrixB[pos.x + (k + 1) * widthB];
        float4 tempB2 = matrixB[pos.x + (k + 2) * widthB];
        float4 tempB3 = matrixB[pos.x + (k + 3) * widthB];

        sum0.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x + tempA0.w * tempB3.x;
        sum0.y += tempA0.x * tempB0.y + tempA0.y * tempB1.y + tempA0.z * tempB2.y + tempA0.w * tempB3.y;
        sum0.z += tempA0.x * tempB0.z + tempA0.y * tempB1.z + tempA0.z * tempB2.z + tempA0.w * tempB3.z;
        sum0.w += tempA0.x * tempB0.w + tempA0.y * tempB1.w + tempA0.z * tempB2.w + tempA0.w * tempB3.w;

        sum1.x += tempA1.x * tempB0.x + tempA1.y * tempB1.x + tempA1.z * tempB2.x + tempA1.w * tempB3.x;
        sum1.y += tempA1.x * tempB0.y + tempA1.y * tempB1.y + tempA1.z * tempB2.y + tempA1.w * tempB3.y;
        sum1.z += tempA1.x * tempB0.z + tempA1.y * tempB1.z + tempA1.z * tempB2.z + tempA1.w * tempB3.z;
        sum1.w += tempA1.x * tempB0.w + tempA1.y * tempB1.w + tempA1.z * tempB2.w + tempA1.w * tempB3.w;

        sum2.x += tempA2.x * tempB0.x + tempA2.y * tempB1.x + tempA2.z * tempB2.x + tempA2.w * tempB3.x;
        sum2.y += tempA2.x * tempB0.y + tempA2.y * tempB1.y + tempA2.z * tempB2.y + tempA2.w * tempB3.y;
        sum2.z += tempA2.x * tempB0.z + tempA2.y * tempB1.z + tempA2.z * tempB2.z + tempA2.w * tempB3.z;
        sum2.w += tempA2.x * tempB0.w + tempA2.y * tempB1.w + tempA2.z * tempB2.w + tempA2.w * tempB3.w;

        sum3.x += tempA3.x * tempB0.x + tempA3.y * tempB1.x + tempA3.z * tempB2.x + tempA3.w * tempB3.x;
        sum3.y += tempA3.x * tempB0.y + tempA3.y * tempB1.y + tempA3.z * tempB2.y + tempA3.w * tempB3.y;
        sum3.z += tempA3.x * tempB0.z + tempA3.y * tempB1.z + tempA3.z * tempB2.z + tempA3.w * tempB3.z;
        sum3.w += tempA3.x * tempB0.w + tempA3.y * tempB1.w + tempA3.z * tempB2.w + tempA3.w * tempB3.w;
    }

    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 0) * widthB] = sum0;
    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 1) * widthB] = sum1;
    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 2) * widthB] = sum2;
    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 3) * widthB] = sum3;
}
```
18 Replies
Journeyman III

## MatrixMultiplication sample question?

> Normally if I'm indexing into a matrix stored as a linear sequence of values, I would map M(i,j) to (float *)(M + i*SIZE + j).

You definitely can, but _this_ version would be column-major ordering. For example, the matrix:

[1 2]

[3 4]

would be stored as 1,3,2,4. Just try to compute it.

> This seems to be the opposite of: matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]. This seems more like mapping M(i,j) to (float *)(M + i + j*SIZE)?

Yes, the mapping is exactly that, and it is row-major.

> If widthA/B is 64, what is the range of pos.y?

I suppose it's 0..15 (and the same for pos.x).

Journeyman III

## MatrixMultiplication sample question?

1) If the matrices are stored in column-major ordering, then I assume the code:

inputBuffer1 = clCreateBuffer(context, CL_MEM_READ_ONLY, sizeof(cl_float) * width1 * height1, 0, &status);

status = clEnqueueWriteBuffer(commandQueue, inputBuffer1, 1, 0, sizeof(cl_float) * width1 * height1, input1, 0, 0, 0);

does this?

2) Are the ranges of:

get_global_id(0), get_global_id(1), get_local_id(0), and get_local_id(1)

established by:

size_t globalThreads[2] = {width1 / 4, height0 / 4};

status = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, globalThreads, localThreads, 0, NULL, &events[0]);

---John

Journeyman III

## MatrixMultiplication sample question?

1) Is there some transformation of the matrix between the way it's stored on the C side and how it's stored on the OpenCL (kernel) side? Or is there an assumption that the matrix passed as a parameter to clEnqueueWriteBuffer(...) is in column-major order?

2) I'm trying to figure out the code in the MatrixMultiplication sample:

float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];
float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];
float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];
float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];

Is matrixA just a linear sequence of floats, where typing matrixA as float4 * simply makes each index access the next 4 floats as a single entry? Or is there some OpenCL kernel format I'm unaware of?

When I index matrixA[2], is this the third set of 4 floats? Or is it simply the third float in the matrix, with the next 4 floats cast as a float4?

---John

Journeyman III

## MatrixMultiplication sample question?

1) I think you haven't understood me correctly (sorry for my English) - I meant that the sample (in the SDK) computes in row-major format, not column-major.

There is basically no assumption about the format of the data - a buffer is a simple linear (1D) data structure. You can represent a matrix in any format you want (even sparse, and so on), as long as the kernel computes with respect to that format.

2)

> Is matrixA just a linear sequence of floats and by casting (typing) matrixA as *float4 it simply accesses the next 4 floats as a single entry?

Yes, that's correct. There is nothing special about how the kernel accesses the data (in the case of buffers); it is just standard C99 code. Data parallelism is achieved by having different threads process different chunks of data - they typically determine which data to process from their "position" (obtained via the get_global_id/get_local_id calls).

EDIT:

> When I index into matrixA[ 2 ], is this the third set of 4 floats? Or is it simply the third float in the matrix and the next 4 floats are cast as float4 ?

It is the third element, i.e. the third tuple (a float4 in this case). Pointer arithmetic works as expected in OpenCL (as in standard C).

EDIT2:

Oh, I have just realized what you asked before. No - clCreateBuffer just allocates memory on the card, and clEnqueueWriteBuffer copies the data from main memory to the card's memory (if the computing device is a graphics card). That's all.

Journeyman III

## MatrixMultiplication sample question?

If it's in row-major order, then why does the example use:

float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]; ?

This seems to be equivalent to matrixA[i + SIZE*j], which is column-major order?

---John

Journeyman III

## MatrixMultiplication sample question?

One more question: if the indexing into matrixA is in units of float4 entries, why do they multiply pos.y by 4 (pos.y << TILEY_SHIFT)?

Journeyman III

## MatrixMultiplication sample question?

> Originally posted by jski: This seems to be equivalent to matrixA[i + SIZE*j], which is column-major order?

No, it is row-major. Think of the matrix:

[0 1]

[2 3]

In the row-major format it is stored as [0, 1, 2, 3]. To access element A[1][0], you multiply the row index (j) by SIZE and add the column index (i), so 1*2 + 0 gives linear index 2, which is where the element 2 lies. Notice that in row-major format the elements of one row lie "near" each other, while elements from the same column but different rows are "far" apart, because the row index is multiplied by the width of the matrix.

In column-major format it would be the opposite.

Journeyman III

## MatrixMultiplication sample question?

> Originally posted by jski: One more question: if the indexing into matrixA is for float4 entries, why do they multiply pos.y by 4 (pos.y << TILEY_SHIFT)?

It is because of the vectorization: each thread computes a 4x4 block of elements. The vertical count of threads is (height_of_matrix / 4), so every thread needs to access 4 rows: the first thread accesses rows 0-3, the second rows 4-7, and so on. That's why it is multiplied.