
Archives Discussions

Journeyman III

Matrix multiplication example on SDK


In the matrix multiplication example provided by the SDK, there is a method called runKernelsCL(void):

int MatrixMultiplication::runKernels(void)



   size_t globalThreads[2] = {width1 / 4, height0 / 4};

   size_t localThreads[2] = {blockSize, blockSize};



To my understanding, the program multiplies two matrices (64x64 each), so globalThreads[0] and globalThreads[1] are both 16. The blockSize is 8, so localThreads[0] and localThreads[1] are both 8.

Can someone help explain what localThreads and globalThreads mean here?

Very sorry to bother you all here. I tried hard to understand the Execution Model in the OpenCL 1.0 rev. 48 spec (pages 20-22), but I still don't have a clear understanding of how it works. Maybe someone here can explain it in simple words using the matrix addition or multiplication example.


Thank you

9 Replies

Hi rolandman,

The matrix multiplication sample in the SDK is a vectorized one, so we multiply the 64x64 float matrices as 16x16 float4 matrices. This technique is used to get highly coalesced and aligned global memory reads, which are generally the bottleneck in this algorithm.

You understood them correctly.

With regard to the OpenCL spec, can you please ask specific questions?


Hi Himanshu,

Thank you for the reply.

So, globalThreads[0] * globalThreads[1] is the 16x16 matrix? Does that mean it has 256 work-items? What do localThreads[0] and localThreads[1] (8x8) mean here? What is the relation between global threads, local threads, and work-items?

Actually, what I am doing with the example is modifying it to do matrix addition (not multiplication), but I got 'lost' while modifying it.

Another question: should we use local memory? I see there is a part of the code that checks local memory. Is this local memory the same as what is termed the "Local Data Store"?

Thank you.




Refer to \\Documents\ATI Stream\samples\opencl\cl\app\MatrixMultiplication\docs for clarification regarding the multiplication algorithm.

The hierarchy is as follows: we have an NDRange, which is equal to the total number of threads needed by the algorithm. This is also called the global work size, and globalThreads specifies the work size in the x, y, and z dimensions. But a compute unit can only support a fixed number of work-items at one time. This number has a fixed maximum for each device, but the usable number might be less than that depending on the resource requirements of the work-items. Such a group of work-items is called a work-group, and localThreads specifies the work-group dimensions. Although we can request any local size within the permissible dimensions, the supported work-group size can be queried dynamically using the clGetKernelWorkGroupInfo function. I hope that clears things up a bit.

Regarding the matrix addition problem, I do not recommend using local memory (also called shared memory or LDS). The reason is that in addition you need any particular element only once, i.e. Sum[1][2] = A[1][2] + B[1][2]. So it is better to use global memory directly, as we have to do one global fetch anyway.

One important thing to keep in mind while using global fetches is the access pattern. You can refer to the global memory optimization section of the OpenCL Programming Guide for more details about that.


Thank you for the explanation. I will read the documentation first, and will get back if I have another question.

Thank you very much for help


Hi Himanshu,

My program runs, but I think there is something wrong in my kernel code.

As in the matrix multiplication example:

clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, globalThreads, localThreads, 0, NULL, &events[0]);

and in the kernel code, I saw

int2 pos = (int2) (get_global_id(0), get_global_id(1));


I also look at the Template example

globalThreads[0] = width; // width = 256

localThreads[0] = 1;

clEnqueueNDRangeKernel(commandQueue, kernel, 1, NULL, globalThreads, localThreads, 0, NULL, &events[0]);

In the kernel code:

uint tid = get_global_id(0);


My question:

I know the first one (the matrix multiplication sample) uses a 2D NDRange and the inputs are two 2D arrays (matrices), while the second one (the Template sample) uses a 1D NDRange and the input is a 1D array. That kernel multiplies the 1D array by a constant multiplier.

Can you explain what get_global_id(0) and get_global_id(1) mean? Is there a get_global_id(2), get_global_id(3), etc.?

For now I think the problem is in my kernel code. I don't have a clear understanding of how to access an element of the 2D arrays.

Thank you for help




get_global_id() is a standard OpenCL function to identify the current thread. OpenCL only supports up to three dimensions, so:

get_global_id(0) gives thread id in x dimension

get_global_id(1) gives thread id in y dimension

get_global_id(2) gives thread id in z dimension

Refer to the OpenCL spec for more info about these. You can also look at get_local_id(), which is also used very often.



I think there is something wrong with my kernel. Unfortunately, I don't know how to debug it. I read the ATI OpenCL Programming Guide and they use gdb. I'm working with the VS 2008 Professional 90-day trial edition.

The error is a runtime error (I think). Here is my kernel code:

__kernel void add(__global float *output, __global float *input0, __global float *input1, int height, int width)

{

   int i, j;

   int x = get_global_id(0);

   int y = get_global_id(1);

   for(i = 0; i < x; i++)

   {

       for(j = 0; j < y; j++)

       {

           int index = i * width + j;

           output[index] = input0[index] + input1[index];

       }

   }

}

The size of the input0 and input1 matrices is 16x16. Again, I just want to try a simple matrix addition.


One more thing,

here is the clEnqueueNDRangeKernel call:

size_t globalThreads[2] = {16, 16};

size_t localThreads[2] = {1, 1};

error = clEnqueueNDRangeKernel(commandQueue, kernel, 2, NULL, globalThreads, localThreads, 0, NULL, NULL);


Is anything wrong with my kernel code? Thank you for pointing out my mistake.

Sorry to bother you again with my silly question.

Thank you



1. Do you get wrong results (I don't think so)? Or maybe there is something wrong with your host code and you are getting some error.

2. But to be frank, you got it wrong. You are trying to calculate the matrix addition completely in one thread, but you actually have 16*16 = 256 work-items in all. So just add one element of A to the corresponding element of B in each work-item (eliminate the for loops).

3. In case the code is not running, put an error check after each CL statement and post the exact instruction where you get the error, along with the error code.

4. Use very large matrix sizes, typically 512*512 or higher, to actually see the performance of GPUs.

5. I could give you the code, but I think you can figure it out yourself. Please post your progress.



Finally, it works!

I removed the for loops from the kernel. You're right, I made a mistake in the host-side code. Here is my mistake (in the call that sets the kernel argument):

error = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)inputBuffer);

It should be

error = clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&inputBuffer);

I missed the "&" before inputBuffer.


Why did the compiler not generate an error or warning for this? What's wrong with the compiler?


Himanshu, thank you for continuing to reply to my posts.

I'm getting interested in AMD/ATI Stream technology and OpenCL. I will consider using it in my school project.