MatrixMultiplication sample question?


jski

Journeyman III


12-25-2010
09:36 PM

MatrixMultiplication sample question?

I'm trying to figure out how matrixA/B are stored. Assuming the matrices that are passed in (the arguments) are in **row-major order**, are matrixA/B simply stored so that every 4 floats of the passed-in matrices form one entry of matrixA/B?

In particular, I'm interested in:

float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];

float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];

float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];

float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];

If widthA/B is 64, what is the range of pos.y? This seems like things are **column-major ordering**?

In a C routine to multiply matrices, I assume the loop in the attached OpenCL code corresponds to the innermost loop (of the simple C code):

for (int k = 0; k < SIZE; k++) C[i][j] += A[i][k] * B[k][j];

Normally, if I'm indexing into a matrix stored as a linear sequence of values, I would map M(i,j) to (float *)(M + i*SIZE + j). This seems to be the opposite of: matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]. This seems more like mapping M(i,j) to (float *)(M + i + j*SIZE) ?
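The two mappings in question can be sketched in plain C (a minimal sketch; `SIZE`, `map_row_major`, and `map_swapped` are illustrative stand-ins, not names from the sample):

```c
#define SIZE 64 /* stand-in matrix width */

/* M(i,j) -> M + i*SIZE + j : consecutive j values land on adjacent
   addresses, so each row of SIZE elements is stored contiguously. */
int map_row_major(int i, int j) { return i * SIZE + j; }

/* M(i,j) -> M + i + j*SIZE : consecutive i values are adjacent instead,
   i.e. the roles of the two indices are swapped. */
int map_swapped(int i, int j) { return i + j * SIZE; }
```

Element (1,0) illustrates the difference: the first mapping puts it one full row into the array, while the second puts it right after element (0,0).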

Am I missing something?

---John

```c
#define TILEY_SHIFT 2

/* Output tile size : 4x4 = Each thread computes 16 float values */
/* Required global threads = (widthC / 4, heightC / 4) */
/* This kernel runs on 7xx and CPU as they don't have hardware local memory */
__kernel void mmmKernel(__global float4 *matrixA,
                        __global float4 *matrixB,
                        __global float4 *matrixC,
                        uint widthA, uint widthB)
{
    int2 pos = (int2)(get_global_id(0), get_global_id(1));

    float4 sum0 = (float4)(0);
    float4 sum1 = (float4)(0);
    float4 sum2 = (float4)(0);
    float4 sum3 = (float4)(0);

    /* Vectorization of input Matrices reduces their width by a factor of 4 */
    widthB /= 4;

    for (int k = 0; k < widthA; k += 4)
    {
        float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];
        float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];
        float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];
        float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];

        /* Matrix B is not transposed */
        float4 tempB0 = matrixB[pos.x + k * widthB];
        float4 tempB1 = matrixB[pos.x + (k + 1) * widthB];
        float4 tempB2 = matrixB[pos.x + (k + 2) * widthB];
        float4 tempB3 = matrixB[pos.x + (k + 3) * widthB];

        sum0.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x + tempA0.w * tempB3.x;
        sum0.y += tempA0.x * tempB0.y + tempA0.y * tempB1.y + tempA0.z * tempB2.y + tempA0.w * tempB3.y;
        sum0.z += tempA0.x * tempB0.z + tempA0.y * tempB1.z + tempA0.z * tempB2.z + tempA0.w * tempB3.z;
        sum0.w += tempA0.x * tempB0.w + tempA0.y * tempB1.w + tempA0.z * tempB2.w + tempA0.w * tempB3.w;

        sum1.x += tempA1.x * tempB0.x + tempA1.y * tempB1.x + tempA1.z * tempB2.x + tempA1.w * tempB3.x;
        sum1.y += tempA1.x * tempB0.y + tempA1.y * tempB1.y + tempA1.z * tempB2.y + tempA1.w * tempB3.y;
        sum1.z += tempA1.x * tempB0.z + tempA1.y * tempB1.z + tempA1.z * tempB2.z + tempA1.w * tempB3.z;
        sum1.w += tempA1.x * tempB0.w + tempA1.y * tempB1.w + tempA1.z * tempB2.w + tempA1.w * tempB3.w;

        sum2.x += tempA2.x * tempB0.x + tempA2.y * tempB1.x + tempA2.z * tempB2.x + tempA2.w * tempB3.x;
        sum2.y += tempA2.x * tempB0.y + tempA2.y * tempB1.y + tempA2.z * tempB2.y + tempA2.w * tempB3.y;
        sum2.z += tempA2.x * tempB0.z + tempA2.y * tempB1.z + tempA2.z * tempB2.z + tempA2.w * tempB3.z;
        sum2.w += tempA2.x * tempB0.w + tempA2.y * tempB1.w + tempA2.z * tempB2.w + tempA2.w * tempB3.w;

        sum3.x += tempA3.x * tempB0.x + tempA3.y * tempB1.x + tempA3.z * tempB2.x + tempA3.w * tempB3.x;
        sum3.y += tempA3.x * tempB0.y + tempA3.y * tempB1.y + tempA3.z * tempB2.y + tempA3.w * tempB3.y;
        sum3.z += tempA3.x * tempB0.z + tempA3.y * tempB1.z + tempA3.z * tempB2.z + tempA3.w * tempB3.z;
        sum3.w += tempA3.x * tempB0.w + tempA3.y * tempB1.w + tempA3.z * tempB2.w + tempA3.w * tempB3.w;
    }

    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 0) * widthB] = sum0;
    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 1) * widthB] = sum1;
    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 2) * widthB] = sum2;
    matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 3) * widthB] = sum3;
}
```
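For reference, the scalar float position behind one of the kernel's `matrixA` loads can be worked out on the host (a sketch assuming `widthA` is a multiple of 4, as the sample requires; `scalar_index` is an illustrative helper, not part of the SDK code):

```c
#define TILEY_SHIFT 2

/* Scalar float index behind the kernel's load
   matrixA[k/4 + ((pos_y << TILEY_SHIFT) + r) * (widthA / 4)],
   for lane 0..3 of the float4. */
int scalar_index(int pos_y, int r, int k, int lane, int widthA) {
    int row = (pos_y << TILEY_SHIFT) + r;  /* matrix row being read */
    int vec = k / 4 + row * (widthA / 4);  /* float4 index used by the kernel */
    return 4 * vec + lane;                 /* position as a plain float */
}
/* For k a multiple of 4 this equals row * widthA + k + lane,
   i.e. ordinary row-major addressing of a widthA-wide matrix. */
```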

18 Replies

pulec

Journeyman III


12-26-2010
02:10 PM


> Normally, if I'm indexing into a matrix stored as a linear sequence of values, I would map M(i,j) to (float *)(M + i*SIZE + j).

You definitely can, but _this_ version would be column-major ordering (taking i as the column index and j as the row index). For example, the matrix:

[1 2]

[3 4]

would be stored as 1,3,2,4. Just try to compute it.
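The 2x2 example can be checked directly in C (a minimal sketch; the arrays and `elem_rm`/`elem_cm` helpers are illustrative):

```c
/* The 2x2 matrix [1 2; 3 4] in both storage orders. */
static const float row_major_store[4] = {1, 2, 3, 4}; /* rows contiguous */
static const float col_major_store[4] = {1, 3, 2, 4}; /* columns contiguous */

/* element (row, col) of a width-2 matrix under each convention */
float elem_rm(const float *m, int row, int col) { return m[row * 2 + col]; }
float elem_cm(const float *m, int row, int col) { return m[row + col * 2]; }
```

Both helpers return 3 for element (1,0) of their respective layouts, even though the 3 sits at linear index 2 in one array and index 1 in the other.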

> This seems to be the opposite of: matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]. This seems more like mapping M(i,j) to (float *)(M + i + j*SIZE) ?

Yes, the mapping is exactly that, and it is row-major.

ADDED:

> If widthA/B is 64, what is the range of pos.y?

I suppose it's 0..15 (as well as pos.x)

jski

Journeyman III


12-26-2010
04:11 PM


**1)** If the matrices are stored with:

**inputBuffer1 = clCreateBuffer( context, CL_MEM_READ_ONLY, sizeof(cl_float) * width1 * height1, 0, &status);**

**status = clEnqueueWriteBuffer(commandQueue, inputBuffer1, 1, 0, sizeof(cl_float) * width1 * height1, input1, 0, 0, 0);**

what exactly does this do with them?

**2)** Are the ranges of:

**get_global_id(0)**, **get_global_id(1)**, **get_local_id(0)**, and **get_local_id(1)**

established by:

**size_t globalThreads[2] = {width1 / 4, height0 / 4};**

**size_t localThreads[2] = {blockSize, blockSize};**

**status = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, globalThreads, localThreads, 0, NULL, &events[0]);** ???

---John

pulec

Journeyman III


12-26-2010
04:21 PM


ad 1) I'm afraid I haven't understood your question

ad 2) yes

jski

Journeyman III


12-26-2010
04:50 PM


1) Is there some transformation of the matrix between the way it's stored on the C side and how it's stored on the OpenCL (kernel) side? Or is there an assumption that the matrix passed as a parameter to **clEnqueueWriteBuffer**(...) is in **column-major order**?

2) I'm trying to figure out the code in the MatrixMultiplication sample:

**float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];**

**float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];**

**float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];**

**float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];**

Is matrixA just a linear sequence of floats, and by typing matrixA as **float4 *** does it simply access the next 4 floats as a single entry? Or is there some OpenCL kernel format I'm unaware of?

When I index into matrixA[ 2 ], is this the third set of 4 floats? Or is it simply the third float in the matrix, with the next 4 floats cast as **float4** ?

---John

pulec

Journeyman III


12-26-2010
05:05 PM


1) I think you haven't understood me correctly (sorry for my English) - I meant that the sample (in the SDK) computes in row-major format, not column-major.

There is basically no assumption about the format of the data - a buffer is a simple linear (1D) data structure. You can represent a matrix in any format you want (even sparse), as long as the kernel computes with respect to that format.

2)

> Is matrixA just a linear sequence of floats and by casting (typing) matrixA as *float4 it simply accesses the next 4 floats as a single entry?

Yes, that's correct. There is nothing special about how the kernel accesses the data (in the case of buffers); it is just standard C99 code. Data parallelism is achieved by having different threads process different chunks of data - they typically determine which data to process from their "position" (obtained via the get_global_id/get_local_id calls).

EDIT:

> When I index into matrixA[ 2 ], is this the third set of 4 floats? Or is it simply the third float in the matrix and the next 4 floats are cast as **float4** ?

It is the third element, i.e. the third tuple (a float4 in this case); pointer arithmetic works in OpenCL as it does in standard C.
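This can be mimicked in plain C with a 4-float struct standing in for OpenCL's `float4` (a sketch; `float4_t` and `third_tuple_x` are illustrative names, and the code assumes the struct has no padding, which holds on common compilers):

```c
/* stand-in for OpenCL's float4: four named lanes, 16 bytes (no padding) */
typedef struct { float x, y, z, w; } float4_t;

/* v[2] is the THIRD 4-float tuple, so v[2].x is the float at raw index 8 */
float third_tuple_x(const float4_t *v) { return v[2].x; }
```

So with 12 floats 0..11 viewed as three float4s, `third_tuple_x` yields 8 (the start of the third tuple), not 2 (the third raw float).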

EDIT2:

Oh, I have just realized what you asked before. No: clCreateBuffer just allocates memory on the card, and clEnqueueWriteBuffer copies the data from main memory to the card's memory (if the computing device is a graphics card). That's all.

jski

Journeyman III


12-26-2010
05:16 PM


If it's in row-major-order, then why in the example do they use:

**float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];** ???

This seems to be equivalent to: **matrixA[ i + SIZE*j ]**, which is column-major order?

---John

jski

Journeyman III


12-26-2010
05:28 PM


One more question: if the indexing into **matrixA** is for **float4** entries, why do they multiply **pos.y** by 4 (**pos.y << TILEY_SHIFT**)?

pulec

Journeyman III


12-26-2010
05:36 PM


Originally posted by: jski

> This seems to be equivalent to: matrixA[ i + SIZE*j ], which is column-major order?

No, it is row-major. Think of the matrix:

[0 1]

[2 3]

In row-major format it is stored as [0, 1, 2, 3]. To access element A[1][0], you multiply the row index (j) by SIZE and add the column index (i): 1*2 + 0 gives linear index 2, where the element 2 lies. Note that in row-major format the elements of one row lie near each other, while elements from the same column but different rows are far apart, because the row index is multiplied by the width of the matrix.

In column-major format it is the opposite.
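The distances described above can be checked numerically (a minimal sketch; `lin` is an illustrative helper):

```c
/* row-major linear index in a matrix of the given width */
int lin(int row, int col, int width) { return row * width + col; }
/* Neighbours within a row are 1 apart; the same column one row
   down is a full width away. */
```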

pulec

Journeyman III


12-26-2010
05:41 PM


Originally posted by: jski

> One more question: if the indexing into matrixA is for float4 entries, why do they multiply pos.y by 4 (pos.y << TILEY_SHIFT)?

It is because of the vectorization - each thread computes 4x4 elements. The vertical count of threads is (height_of_matrix / 4), so every thread needs to access 4 rows: the first thread accesses rows 0-3, the second rows 4-7, and so on. That's why it is multiplied.
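That row range per work-item can be sketched as follows (`first_row`/`last_row` are illustrative helpers, not names from the sample):

```c
#define TILEY_SHIFT 2 /* each work-item covers a 4-row output tile */

/* first and last matrix row handled by the work-item at vertical id pos_y */
int first_row(int pos_y) { return pos_y << TILEY_SHIFT; }
int last_row(int pos_y)  { return (pos_y << TILEY_SHIFT) + 3; }
```

Work-item 0 covers rows 0-3, work-item 1 covers rows 4-7, and so on, which is exactly why the kernel shifts `pos.y` left by TILEY_SHIFT.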