18 Replies Latest reply on Dec 27, 2010 2:11 PM by jski

    MatrixMultiplication sample question?

    jski

       

      I'm trying to figure out how matrixA/B are stored?  Assuming the matrices that are passed in (the arguments) are in row-major order(?), then are matrixA/B simply stored as every 4 floats of the matrices passed in are an entry in matrixA/B ?

      In particular, I'm interested in:

      float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];
      float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];
      float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];
      float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];

      If widthA/B is 64, what is the range of pos.y? This seems like things are column-major ordering?

      In a C routine to multiply matrices, I assume the loop in the attached OpenCL code correspnds to the innermost loop (of the simply C code):

      for (int k = 0; k < SIZE; k++) C[\i][j] = A[\i][k] * B[k][j]; Correct?

      Nornally if I'm indexing into a matrix stored as a linear sequence of values, I would map M(i,j) to (float *)(M + i*SIZE + j).  This seems the be the opposite of: matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]. This seems more like mapping M(i,j) to (float *)(M + i + j*SIZE) ?

      Am I missing something?

       

      ---John

       

       

       

       



       

       

       

       

       

       



       

       

       

       

      #define TILEY_SHIFT 2 /* Output tile size : 4x4 = Each thread computes 16 float values*/ /* Required global threads = (widthC / 4, heightC / 4) */ /* This kernel runs on 7xx and CPU as they don't have hardware local memory */ __kernel void mmmKernel(__global float4 *matrixA, __global float4 *matrixB, __global float4* matrixC, uint widthA, uint widthB) { int2 pos = (int2)(get_global_id(0), get_global_id(1)); float4 sum0 = (float4)(0); float4 sum1 = (float4)(0); float4 sum2 = (float4)(0); float4 sum3 = (float4)(0); /* Vectorization of input Matrices reduces their width by a factor of 4 */ widthB /= 4; for(int k = 0; k < widthA; k+=4) { float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]; float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)]; float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)]; float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)]; //Matrix B is not transposed float4 tempB0 = matrixB[pos.x + k * widthB]; float4 tempB1 = matrixB[pos.x + (k + 1) * widthB]; float4 tempB2 = matrixB[pos.x + (k + 2) * widthB]; float4 tempB3 = matrixB[pos.x + (k + 3) * widthB]; sum0.x += tempA0.x * tempB0.x + tempA0.y * tempB1.x + tempA0.z * tempB2.x + tempA0.w * tempB3.x; sum0.y += tempA0.x * tempB0.y + tempA0.y * tempB1.y + tempA0.z * tempB2.y + tempA0.w * tempB3.y; sum0.z += tempA0.x * tempB0.z + tempA0.y * tempB1.z + tempA0.z * tempB2.z + tempA0.w * tempB3.z; sum0.w += tempA0.x * tempB0.w + tempA0.y * tempB1.w + tempA0.z * tempB2.w + tempA0.w * tempB3.w; sum1.x += tempA1.x * tempB0.x + tempA1.y * tempB1.x + tempA1.z * tempB2.x + tempA1.w * tempB3.x; sum1.y += tempA1.x * tempB0.y + tempA1.y * tempB1.y + tempA1.z * tempB2.y + tempA1.w * tempB3.y; sum1.z += tempA1.x * tempB0.z + tempA1.y * tempB1.z + tempA1.z * tempB2.z + tempA1.w * tempB3.z; sum1.w += tempA1.x * tempB0.w + tempA1.y * tempB1.w + tempA1.z * tempB2.w + tempA1.w * tempB3.w; sum2.x += tempA2.x * tempB0.x + tempA2.y * tempB1.x + tempA2.z * tempB2.x + tempA2.w * tempB3.x; sum2.y += tempA2.x * tempB0.y + tempA2.y * tempB1.y + tempA2.z * tempB2.y + tempA2.w * tempB3.y; sum2.z += tempA2.x * tempB0.z + tempA2.y * tempB1.z + tempA2.z * tempB2.z + tempA2.w * tempB3.z; sum2.w += tempA2.x * tempB0.w + tempA2.y * tempB1.w + tempA2.z * tempB2.w + tempA2.w * tempB3.w; sum3.x += tempA3.x * tempB0.x + tempA3.y * tempB1.x + tempA3.z * tempB2.x + tempA3.w * tempB3.x; sum3.y += tempA3.x * tempB0.y + tempA3.y * tempB1.y + tempA3.z * tempB2.y + tempA3.w * tempB3.y; sum3.z += tempA3.x * tempB0.z + tempA3.y * tempB1.z + tempA3.z * tempB2.z + tempA3.w * tempB3.z; sum3.w += tempA3.x * tempB0.w + tempA3.y * tempB1.w + tempA3.z * tempB2.w + tempA3.w * tempB3.w; } matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 0) * widthB] = sum0; matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 1) * widthB] = sum1; matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 2) * widthB] = sum2; matrixC[pos.x + ((pos.y << TILEY_SHIFT) + 3) * widthB] = sum3; }

        • MatrixMultiplication sample question?
          pulec

          > Nornally if I'm indexing into a matrix stored as a linear sequence of values, I would map M(i,j) to (float *)(M + i*SIZE + j).

          You definitely can, but _this_ version would be column-major ordering. Ex. matrix:

          [1 2]

          [3 4]

          would be stored as 1,3,2,4. Just try to compute it.

          > This seems the be the opposite of: matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]. This seems more like mapping M(i,j) to (float *)(M + i + j*SIZE) ?

          Yes, the mapping is such. And this is row-major

          ADDED:

          > If widthA/B is 64, what is the range of pos.y?

          I suppose it's 0..15 (as well as pos.x)

            • MatrixMultiplication sample question?
              jski

              1) If the matrices store in column-major ordering, then I assume the code:

              inputBuffer1 = clCreateBuffer( context, CL_MEM_READ_ONLY, sizeof(cl_float) * width1 * height1, 0, &status);

              status = clEnqueueWriteBuffer(commandQueue, inputBuffer1, 1, 0, sizeof(cl_float) * width1 * height1, input1, 0, 0, 0);

              does this???

              2) Are the ranges of:

              get_global_id(0), get_global_id(1), get_local_id(0), and get_local_id(1)

              established by:

              size_t globalThreads[2]= {width1 / 4, height0/ 4};
              size_t localThreads[2] = {blockSize, blockSize};

              status = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, globalThreads, localThreads, 0, NULL, &events[0]);  ???

               

              ---John

               

                • MatrixMultiplication sample question?
                  pulec

                  ad 1) I am afraid I have not understood your question

                  ad 2) yes

                    • MatrixMultiplication sample question?
                      jski

                      1) Is there some transformation of matrix from the way it's store on the C-side vs. how it's stored on the OpenCL (kernel)-side?  Or is there an assumption that the matrix that's passed in as a parameter to clEnqueueWriteBuffer(...) is in column-major order?

                      2) I trying to firure out the code in the MatrixMultiplication sample:

                      float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)];
                      float4 tempA1 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 1) * (widthA / 4)];
                      float4 tempA2 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 2) * (widthA / 4)];
                      float4 tempA3 = matrixA[k/4 + ((pos.y << TILEY_SHIFT) + 3) * (widthA / 4)];

                      Is matrixA just a linear sequence of floats and by casting (typing) matrixA as *float4 it simply accesses the next 4 floats as a single entry?  Or is there some OpenCL kernel format I'm unaware of?

                      When I index into matrixA[ 2 ], is this the third set of 4 floats? Or is it simply the third float in the matrix and the next 4 floats are cast as float4 ?

                       

                      ---John

                        • MatrixMultiplication sample question?
                          pulec

                          1) I think you haven't understood me correctly (sorry for my English) - I meant that the sample (in SDK) is (computes) in row-major format, not column-major.

                          There is basically no assumption about format of the data - buffer is simple linear (1D) data structure. You can represent a matrix in any format you want (even sparse or so), if you compute in kernel with respect to the format.

                          2)

                          > Is matrixA just a linear sequence of floats and by casting (typing) matrixA as *float4 it simply accesses the next 4 floats as a single entry?

                          Yes, thats correct. There is nothing special about how kernel acceses the data (in case of buffers). It is just standard C99 code. Data paralellism is achieved so that different threads process different chunks of data - they typically obtain the information about which data to process from their "position" (determined by get_global_id/get_local_id calls)

                          EDIT:

                          > When I index into matrixA[ 2 ], is this the third set of 4 floats? Or is it simply the third float in the matrix and the next 4 floats are cast as float4 ?

                          It is third element, so third tuple (float4 in this case), pointer arithmetics functions as expected in OpenCL (as in standard C)

                          EDIT2:

                          Oh, I have just realized what you asked to before. No, clCreateBuffer just allocates memory on the card, clEnqueueWriteBuffer copies the data from main memory to the card's. (If the computing device is graphic card.) That's all

                            • MatrixMultiplication sample question?
                              jski

                              If it's in row-major-order, then why in the example do they use:

                               float4 tempA0 = matrixA[k/4 + (pos.y << TILEY_SHIFT) * (widthA / 4)]; ???

                              This seems to be equivalent to: matrixA[ i + SIZE*j ], which is column-major-order?

                               

                              ---John

                                • MatrixMultiplication sample question?
                                  jski

                                  One more question: if the indexing into matrixA is for float4 entries, why do they multiple pos.y by 4 (pos.y << TILEY_SHIFT)  ???

                                    • MatrixMultiplication sample question?
                                      pulec

                                       

                                      Originally posted by: jski One more question: if the indexing into matrixA is for float4 entries, why do they multiple pos.y by 4 (pos.y << TILEY_SHIFT)  ???

                                       

                                      It is because of the vectorization - each thread computes 4x4 elements. The vertical count of threads is (height_of_matrix / 4), so every thread needs to access 4 rows : first thread accesses row 0-3, second 4-7 and so on. That's why it is multiplied.

                                    • MatrixMultiplication sample question?
                                      pulec

                                       

                                      Originally posted by: jski

                                       

                                      This seems to be equivalent to: matrixA[ i + SIZE*j ], which is column-major-order?

                                       

                                       

                                      No, it is row-major: Think of matrix:

                                      [0 1]

                                      [2 3]

                                      In the row-major format it stored as [0, 1, 2, 3]. If you need to access element A[1][0], so you must multiply the row index(j) by SIZE and add index i, so 1*2 + 0 gives linear index 2, where lies element 2. Realize that in row-m. format there are elements from one row "near" each other. On the other hand elements from same column but different rows are "far" away, because of the row index multiplied by the width of the matrix.

                                      By column matrix format it would be the opposite.

                                        • MatrixMultiplication sample question?
                                          jski

                                          I agree. row-major order is equivalent to: matrixA[ i*SIZE + j ] but in the example they're multiplying pos.y by SIZE (or widthA/4), which is equivalent to matrixA[ i + j*SIZE ].

                                            • MatrixMultiplication sample question?
                                              pulec

                                               

                                              Originally posted by: jski I agree. row-major order is equivalent to: matrixA[ i*SIZE + j ] but in the example they're multiplying pos.y by SIZE (or widthA/4), which is equivalent to matrixA[ i + j*SIZE ].

                                               

                                              Just not, row-major is matrixA[i + j *WIDTH].

                                              EDIT: see wiki: http://en.wikipedia.org/wiki/Row_major

                                                • MatrixMultiplication sample question?
                                                  jski

                                                  I do believe we've arrive at the crux (according to Wikipedia):

                                                  Row major: offset = row*NUMCOLS + column (or offset = i*SIZE + j)

                                                  Column major:  offset = row + column*NUMROWS (or offset = i + j*SIZE),

                                                  which is what I believe I've been saying.

                                                  But you're saying (whatever we call it) that the OpenCL code in the example is using: offset = i + j*SIZE ???

                                                   

                                                  ---John

                                                    • MatrixMultiplication sample question?
                                                      pulec
                                                      I have to sorry for making a bit mess here. I have just verified, that I have swapped the meaning of the letters i and j. (I didn't remember it from school and I meant that i,j is similar to x,y where x is vertical coordinate, but it is the opposite)
                                                      So you were right that row-major is:
                                                      offset = i*SIZE + j
                                                      > But you're saying (whatever we call it) that the OpenCL code in the example is using: offset = i + j*SIZE ???
                                                      I'd better not use i and j
                                                      In sample it is therefore:
                                                      offset = row_index * width + column_index (supposing that i == row index, j == column index)
                                                      So row-major.
                                                      In the sample is pos.y equivalent to the letter i.
                                                      Again, sorry for making a mess