9 Replies Latest reply on Oct 13, 2010 1:26 PM by rolandman99

    Matrix multiplication example on SDK

    rolandman99

      Hi,

      In the matrix multiplication example provided by the SDK, there is a method called runKernelsCL(void):

      int MatrixMultiplication::runKernels(void)

      {

         ...

         size_t globalThreads[2] = {width1/4, height0/4};

         size_t localThreads[2] = {blockSize, blockSize};

         ...

      }

      To my understanding, the program multiplies two matrices (64x64 each), so globalThreads[0] and globalThreads[1] are both 16. The blockSize is 8, so localThreads[0] and localThreads[1] are both 8.

      Can someone explain what localThreads and globalThreads mean here?

      Sorry to bother you all. I have tried hard to understand the Execution Model in the OpenCL 1.0 rev. 48 spec (pages 20-22), but I still do not have a clear understanding of how it works. Maybe someone here can explain it in simple words using the matrix addition or multiplication example.

       

      Thank you

        • Matrix multiplication example on SDK
          himanshu.gautam

          hi rolandman,

          The matrix multiplication sample in the SDK is a vectorized one, so we multiply the 64x64 float matrices using a 16x16 float4 arrangement. This technique is used to get highly coalesced and aligned global memory reads, which are generally the bottleneck in this algorithm.

          Your understanding of the values is right.

          With regard to the OpenCL spec, can you please ask specific questions?

            • Matrix multiplication example on SDK
              rolandman99

              Hi Himanshu,

              Thank you for your reply.

              So, globalThreads[0] * globalThreads[1] is the 16x16 matrix? Does it mean that it has 64 work items? What do localThreads[0] and localThreads[1] (8x8) mean here? What is the relation between global threads, local threads, and work items?

              What I am actually doing with the example is modifying it to do matrix addition (not multiplication), but I get lost when modifying it.

              One more question: should we use local memory? I ask because I see there is a part of the code that checks local memory. Is this local memory the same as what is termed "Local Data Store"?

              Thank you.

               

               

                • Matrix multiplication example on SDK
                  himanshu.gautam

                  Refer to \\Documents\ATI Stream\samples\opencl\cl\app\MatrixMultiplication\docs for clarification regarding the multiplication algorithm.

                  The hierarchy is as follows. We have an NDRange, which is equal to the total number of threads (work-items) needed by the algorithm. This is also called the global work size, and globalThreads specifies it in the x, y, and z dimensions. A compute unit can only support a limited number of work-items at one time. That limit has a fixed maximum for each device, but it may be lower for a particular kernel depending on the resource requirements of its work-items. The work-items that run together on a compute unit form a work-group, and localThreads specifies the work-group dimensions. Although we can specify any localThreads size within the permissible dimensions, the maximum work-group size for a kernel can be queried at run time with the clGetKernelWorkGroupInfo function. I hope that clears things up a bit.
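To make the hierarchy concrete, here is a minimal sketch in plain C (no OpenCL runtime needed) that works out the launch geometry for the sizes quoted in this thread. The struct LaunchGeometry and the function describe_launch are hypothetical names made up for illustration, and the division by 4 simply mirrors the width1/4 and height0/4 expressions quoted from the sample. It also assumes the OpenCL 1.x rule that each global size must be evenly divisible by the matching local size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: describes a 2D launch using the thread's terms. */
typedef struct {
    size_t global[2];       /* work-items per dimension (globalThreads)      */
    size_t local[2];        /* work-items per work-group (localThreads)      */
    size_t groups[2];       /* work-groups per dimension                     */
    size_t total_items;     /* all work-items in the NDRange                 */
    size_t items_per_group; /* work-items in one work-group                  */
} LaunchGeometry;

/* Fills *out for a width x height matrix and a square work-group of
 * block_size x block_size. Returns 0 on success, or -1 if block_size is 0
 * or does not divide a global size evenly (assumed OpenCL 1.x rule). */
int describe_launch(size_t width, size_t height, size_t block_size,
                    LaunchGeometry *out)
{
    assert(out != NULL);
    if (block_size == 0) {
        return -1;
    }

    /* Mirrors the quoted sample: each dimension is divided by 4. */
    out->global[0] = width / 4;
    out->global[1] = height / 4;
    out->local[0] = block_size;
    out->local[1] = block_size;

    for (int d = 0; d < 2; ++d) {
        if (out->global[d] % out->local[d] != 0) {
            return -1;
        }
        out->groups[d] = out->global[d] / out->local[d];
    }

    out->total_items = out->global[0] * out->global[1];
    out->items_per_group = out->local[0] * out->local[1];
    return 0;
}
```

For 64x64 inputs and blockSize = 8, this gives a 16x16 NDRange (256 work-items) split into 2x2 work-groups of 64 work-items each.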

                  Regarding the matrix addition problem, I do not recommend using local memory (also called shared memory or LDS). The reason is that addition needs each element only once, i.e. Sum[1,2] = A[1,2] + B[1,2]. It is better to read global memory directly, since one global fetch per element is required anyway.

                  One important thing to keep in mind when using global fetches is the access pattern. You can refer to the global memory optimization section of the OpenCL Programming Guide for more details.

                  • Matrix multiplication example on SDK
                    rolandman99

                    Thank you for the explanation. I will read the documentation first, and will get back if I have another question.

                    Thank you very much for your help.

                      • Matrix multiplication example on SDK
                        rolandman99

                        Hi Himanshu,

                        My program runs, but I think there is something wrong in my kernel program.

                        As in the matrix multiplication example:

                        clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, globalThreads, localThreads, 0, NULL, &events[0]);

                        and in the kernel code, I saw

                        int2 pos = (int2) (get_global_id(0), get_global_id(1));

                         

                        I also looked at the Template example:

                        globalThreads[0] = width; // width = 256

                        localThreads[0] = 1;

                        clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &events[0]);

                        In the kernel code:

                        uint tid = get_global_id(0);

                         

                        My question:

                        I know the first one (the Matrix multiplication sample) uses a 2D NDRange and its inputs are two 2D arrays (matrices), while the second one (the Template sample) uses a 1D NDRange and its input is a 1D array. The kernel multiplies this 1D array by a constant multiplier.

                        Can you explain what get_global_id(0) and get_global_id(1) mean? Are there also get_global_id(2), get_global_id(3), etc.?

                        For now I think the problem is in my kernel code. I don't have a clear understanding of how to access the elements of the 2D arrays.

                        Thank you for your help

                         

                         

                          • Matrix multiplication example on SDK
                            himanshu.gautam

                            get_global_id() is a standard OpenCL function that identifies the current thread (work-item). OpenCL supports only up to three dimensions, so:

                            get_global_id(0) gives the thread ID in the x dimension

                            get_global_id(1) gives the thread ID in the y dimension

                            get_global_id(2) gives the thread ID in the z dimension

                            Refer to the OpenCL spec for more information about these. You should also learn get_local_id(), which is used very often.
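As a rough illustration of how these IDs relate, the plain C sketch below emulates the arithmetic for one dimension and then turns an (x, y) pair into an array offset. The function names global_id_from and linear_index are hypothetical. The sketch assumes no global offset (the calls quoted in this thread pass NULL for it) and a row-major layout in which x selects the column and y selects the row, which is a convention chosen here rather than something OpenCL enforces.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates, for one dimension, the relation
 *   get_global_id(d) == get_group_id(d) * get_local_size(d) + get_local_id(d)
 * which holds when no global offset is used. */
size_t global_id_from(size_t group_id, size_t local_size, size_t local_id)
{
    assert(local_id < local_size); /* a local ID is always inside its group */
    return group_id * local_size + local_id;
}

/* Row-major offset of element (x, y) in a matrix with `width` columns,
 * assuming x is the column and y is the row. */
size_t linear_index(size_t x, size_t y, size_t width)
{
    assert(x < width);
    return y * width + x;
}
```

With 8x8 work-groups, the work-item with local ID 3 in group 1 has global ID 11 in that dimension, and element (x = 3, y = 2) of a 16-column matrix sits at offset 35.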

                              • Matrix multiplication example on SDK
                                rolandman99

                                Hi,

                                I think there is something wrong with my kernel. Unfortunately, I don't know how to debug it. I read the ATI OpenCL programming guide, and it uses gdb. I'm working with the VS 2008 Professional 90-day trial edition.

                                The error is a runtime error (I think). Here is my kernel code:

                                __kernel void add(__global float *output, __global float *input0, __global float *input1, int height, int width)

                                {

                                   int i, j;

                                   int x = get_global_id(0);

                                   int y = get_global_id(1);

                                   for(i=0; i<x; i++)

                                    for(j=0;j<y;j++)

                                   {

                                       int index = i * width + j;

                                       output[index] = input0[index] + input1[index];

                                   }

                                }

                                The input0 and input1 matrices are both 16x16. Again, I just want to try a simple matrix addition.

                                 

                                One more thing,

                                here is the clEnqueueNDRangeKernel call:

                                size_t globalThreads[2] = {16, 16};

                                size_t localThreads[2] = {1, 1};

                                error = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, globalThreads, localThreads, 0, NULL, NULL);

                                 

                                Is anything wrong with my kernel code? Thank you for pointing out my mistake.

                                Sorry to bother you again with my silly question.

                                Thank you

                                 

                                  • Matrix multiplication example on SDK
                                    himanshu.gautam

                                    1. Do you get wrong results (I don't think so)? Or maybe there is something wrong with your host code and you are getting an error.

                                    2. To be frank, you got it wrong. You are trying to compute the entire matrix addition within a single thread, but you actually have 16*16 = 256 work-items in all. So just add one element of A to the corresponding element of B in each work-item (eliminate the for loops).

                                    3. If the code is not running, put an error check after each CL call and post the exact statement where you get the error, together with the error code.

                                    4. Use very large matrix sizes, typically 512*512 or higher, to actually see the performance of GPUs.

                                    5. I can give you the code, but I think you can figure it out yourself. Please post your progress.
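The error check suggested in point 3 could look something like the sketch below. It is plain C so that it runs without an OpenCL runtime: CHECK_STATUS, check_status, and fake_cl_call are hypothetical names, and fake_cl_call only stands in for a real OpenCL call that returns a status code. The one fact it relies on is that OpenCL status codes are integers with 0 (CL_SUCCESS) meaning success.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical macro: evaluates a call that yields an OpenCL-style status
 * code and reports the call text, file, and line if it is not 0. */
#define CHECK_STATUS(call) \
    check_status((call), #call, __FILE__, __LINE__)

/* Prints a diagnostic for any non-zero status and passes the status through
 * so the caller can still decide how to react. */
int check_status(int status, const char *what, const char *file, int line)
{
    assert(what != NULL && file != NULL);
    if (status != 0) {
        fprintf(stderr, "%s:%d: %s failed with error code %d\n",
                file, line, what, status);
    }
    return status;
}

/* Hypothetical stand-in for an OpenCL call; it just returns the status it
 * is given, so the sketch can be exercised without a device. */
int fake_cl_call(int status_to_return)
{
    return status_to_return;
}
```

In real host code the same wrapper would go around calls such as clSetKernelArg and clEnqueueNDRangeKernel, for example `if (CHECK_STATUS(fake_cl_call(-54)) != 0) return 1;`, so the first failing statement and its code show up immediately.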

                                      • Matrix multiplication example on SDK
                                        rolandman99

                                        Hi,

                                        Finally, it works

                                        I removed the for loops from the kernel. You're right, I also made a mistake in the host-side code. Here is my mistake (when setting the kernel argument):

                                        error = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)inputBuffer);

                                        It should be

                                        error = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&inputBuffer);

                                        I missed the "&" before inputBuffer.

                                         

                                        Why did the compiler not generate an error or warning for this? What's wrong with the compiler?
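A likely reason for the missing diagnostic: clSetKernelArg declares its last parameter as const void *, and cl_mem is itself an opaque pointer type, so both inputBuffer and &inputBuffer are acceptable pointer arguments as far as the C compiler can tell (the explicit (void *) cast would silence a warning in any case). The difference only shows up at run time, because the call copies arg_size bytes from wherever the pointer points. The sketch below imitates that behaviour in plain C; fake_mem, fake_mem_handle, fake_set_kernel_arg, and stored_handle are hypothetical stand-ins, not part of OpenCL.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins: a buffer object and an opaque handle to it,
 * shaped like cl_mem (which is also a pointer to a hidden struct). */
struct fake_mem { unsigned char payload[64]; };
typedef struct fake_mem *fake_mem_handle;

/* Storage for the one argument this sketch keeps track of. */
static unsigned char stored_arg[sizeof(fake_mem_handle)];

/* Same parameter style as clSetKernelArg: copies arg_size bytes starting
 * at arg_value. Returns 0 on success, -1 for a NULL pointer. */
int fake_set_kernel_arg(size_t arg_size, const void *arg_value)
{
    assert(arg_size == sizeof stored_arg); /* sketch handles handles only */
    if (arg_value == NULL) {
        return -1;
    }
    memcpy(stored_arg, arg_value, arg_size);
    return 0;
}

/* Reads back whatever bytes the setter stored, interpreted as a handle. */
fake_mem_handle stored_handle(void)
{
    fake_mem_handle h;
    memcpy(&h, stored_arg, sizeof h);
    return h;
}
```

Passing &handle stores the handle itself. Passing the handle stores the first few bytes of the object it points to, which compiles equally well but hands the kernel a meaningless value.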

                                         

                                        Himanshu, thank you for continuing to reply to my posts.

                                        I'm getting interested in AMD/ATI Stream technology and OpenCL. I will consider using it in my school project.
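For readers following along, the loop-free addition described in this thread can be sketched in plain C as below. This is not the poster's final kernel, which was not shown. The names add_one_element and emulate_ndrange_add are hypothetical, and the host-side double loop only emulates what a 2D NDRange does on a device, where each (gx, gy) pair would come from get_global_id(0) and get_global_id(1). Treating gx as the column and gy as the row is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Work done by ONE work-item: add a single pair of elements.
 * gx and gy play the role of get_global_id(0) and get_global_id(1). */
void add_one_element(float *output, const float *input0, const float *input1,
                     size_t width, size_t height, size_t gx, size_t gy)
{
    /* Guard in case the NDRange is larger than the matrix. */
    if (gx >= width || gy >= height) {
        return;
    }
    size_t index = gy * width + gx; /* row-major offset */
    output[index] = input0[index] + input1[index];
}

/* Host-side emulation of a width x height NDRange: the body is invoked once
 * per (gx, gy). On a device these invocations are independent work-items,
 * so the kernel itself contains no loops. */
void emulate_ndrange_add(float *output, const float *input0,
                         const float *input1, size_t width, size_t height)
{
    assert(output != NULL && input0 != NULL && input1 != NULL);
    for (size_t gy = 0; gy < height; ++gy) {
        for (size_t gx = 0; gx < width; ++gx) {
            add_one_element(output, input0, input1, width, height, gx, gy);
        }
    }
}
```

With globalThreads = {16, 16}, this corresponds to 256 work-items that each write exactly one element of the output.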