
galmok
Journeyman III

local memory in function body versus function header

one works, the other doesn't

I have a kernel that uses local memory. At first I declared a local array of doubles inside the function body, and the resulting calculations from that kernel turned out to be wrong.

Each work-item would copy one value to the array, after which I had a barrier to synchronize. After the synchronization I checked the content of the local array and discovered it was corrupt.

Then I supplied the local array as an argument in the function header, which solved the problem: the content of the local array was now correct.

Why does my first attempt at using a local array fail while the second is OK? In the first situation, is the local array not shared between all work-items?
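To make it concrete, the two variants look roughly like this (a simplified sketch, not my actual kernel; the names and the work-group size of 64 are just placeholders):

#pragma OPENCL EXTENSION cl_amd_fp64 : enable

// Variant 1: local array declared in the kernel body -- this is the one that gave me corrupt data
__kernel void copy_via_local_body(__global const double *src, __global double *dst)
{
    __local double tile[64];              // assumes a work-group size of 64
    int lid = get_local_id(0);

    tile[lid] = src[get_global_id(0)];    // each work-item copies one value
    barrier(CLK_LOCAL_MEM_FENCE);         // synchronize before reading the array

    dst[get_global_id(0)] = tile[lid];    // content of tile is corrupt here in my case
}

// Variant 2: local array passed in the function header -- this one works for me
// (the host sets the argument with clSetKernelArg(kernel, 2, 64 * sizeof(cl_double), NULL))
__kernel void copy_via_local_arg(__global const double *src, __global double *dst,
                                 __local double *tile)
{
    int lid = get_local_id(0);

    tile[lid] = src[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);

    dst[get_global_id(0)] = tile[lid];    // content of tile is correct here
}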

0 Likes
19 Replies

Please provide a test case that shows the issue; there should be no problem declaring the array in a kernel function body.
0 Likes

I have now tried to isolate the issue a bit but I am still falling short. It seems my issue occurs only if the kernel is called from another kernel. If I call the kernel directly from the host, there is no issue.

So my question, before cleaning up all the code, is this:

If kernel A calls kernel B, may kernel B allocate local memory, or must the memory be allocated in kernel A and passed to B (or allocated outside OpenCL)?

I have tried some situations:

Allocate local memory in A and pass to B: Small errors in local memory.

Allocate local memory outside OpenCL: Works without problems.

Allocate local memory in B (with the function defined as a kernel) and call A (which calls B): Large errors in local memory.

Call B directly: Works without problems.

 

It seems that simply calling B from A causes local memory to stop working correctly.
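To make the first arrangement concrete, the failing case looks roughly like this (a simplified sketch with placeholder names, not my real code):

#pragma OPENCL EXTENSION cl_amd_fp64 : enable

// B does the actual work and receives the local buffer as an argument
__kernel void B(__global const double *src, __global double *dst, __local double *tile)
{
    int lid = get_local_id(0);
    tile[lid] = src[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);
    dst[get_global_id(0)] = tile[lid];
}

// A is the entry point enqueued from the host; it allocates the local buffer and calls B,
// so B is effectively compiled as an ordinary function in this path
__kernel void A(__global const double *src, __global double *dst)
{
    __local double tile[64];    // assumes a work-group size of 64
    B(src, dst, tile);          // this is the arrangement that gives small errors for me
}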

It would be best if I could mail the code to you if you would like it...

0 Likes

galmok,
A kernel calling another kernel turns the called kernel into a function. Therefore it is illegal to have local memory defined in the called kernel, since it is no longer a kernel but a function.
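In other words, something along these lines (hypothetical names) is the pattern that is not allowed:

#pragma OPENCL EXTENSION cl_amd_fp64 : enable

__kernel void inner(__global double *dst)
{
    __local double tile[64];     // local memory defined inside the called kernel
    int lid = get_local_id(0);
    tile[lid] = (double)lid;
    barrier(CLK_LOCAL_MEM_FENCE);
    dst[get_global_id(0)] = tile[lid];
}

__kernel void outer(__global double *dst)
{
    inner(dst);                  // inner is compiled as a plain function here,
                                 // so its __local definition is not allowed
}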
0 Likes

Ok. I didn't know that. But even if I define the local memory in the kernel and pass the pointers to the function, it doesn't work (local memory not updated correctly).

Should the compiler catch the problem with local memory being defined in a function if the function is actually declared as a kernel?

According to the OpenCL 1.1 specification, pointers to the local address space are allowed as arguments to functions and kernels. So the situation where I allocate local space in the kernel and pass the pointers to a function should work fine (but doesn't).

0 Likes

galmok,
You also have to take into account section 6.7.1 of the OpenCL spec where it states that
"It is just a regular function call if a __kernel function is called by another kernel
function."
and
"The behavior of calling kernel functions with variables declared inside the function with the
__local or local qualifier from other kernel functions is implementation-defined."

Because of the first statement we view this as not being allowed in our implementation at this time, and the second statement allows us to take that approach.
0 Likes

I think you misunderstand me.

I know that kernel functions, when called from other kernel functions, are to be seen as normal functions which cannot allocate local memory (well, it is implementation-specific and I won't go there).

The error I see:

The kernel (the one called from the host) allocates two local arrays and calls a function, passing the local arrays as arguments. The function, however, doesn't work as expected. Moving the allocation into the function and calling it directly from the host makes it work as expected.

 

0 Likes

You're right, I did misunderstand. If you can supply a test case that shows your issue, we can make sure it is fixed in the next release.
0 Likes

I have attached the kernel/function. The host calls the bottom kernel, which allocates local space and calls the topmost function. It is a simple matrix multiplication as found on the web (written for CUDA; I translated it a bit but haven't optimised _anything_ for AMD yet).

I would have enabled the printf stuck inside the for-loop, but when it is enabled, everything is calculated correctly. Without the printf, a slight error creeps into the result (you can verify this by supplying any matrix larger than 32 and a multiple of 8, BLOCK_SIZE; the result is correct for matrices of size 8, 16, 24, and 32).

The result is also correct if you call the topmost function directly from the host (and remember to allocate the local memory in that kernel directly, or supply it from the host using clSetKernelArg).

Only the local space As is not working correctly. Bs is fine regardless.

#pragma OPENCL EXTENSION cl_amd_fp64: enable
//#pragma OPENCL EXTENSION cl_amd_printf : enable

#define BLOCK_SIZE 8

// http://gpgpu-computing4.blogspot.com/2009/09/matrix-multiplication-3.html
// A[m x k] * B[k x n] = C[m x n]
__kernel void MatrixMul_NN_opt2(global double *C, global double *A, global double *B,
                                uint m, uint n, uint k, uint lda, uint ldb, uint ldc,
                                double alpha, double beta,
                                local double *As, local double *Bs)
{
    // for 128x128 m,n,k,lda,ldb,ldc are all 128. Alpha=1.0 and beta 0.0.
    int wA = k;
    int wB = n;
    double Csub;

    // Block index
    int bx = get_group_id(0); //blockIdx.x;
    int by = get_group_id(1); //blockIdx.y;

    // Thread index
    int tx = get_local_id(0); //threadIdx.x;
    int ty = get_local_id(1); //threadIdx.y;

    // Index of the first sub-matrix of A processed
    // by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed
    // by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the
    // sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed
    // by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the
    // sub-matrices of B
    int bStep = BLOCK_SIZE * wB;

    // Loop over all the sub-matrices of A and B
    // required to compute the block sub-matrix

    // Declaration of the shared memory array As
    // used to store the sub-matrix of A
    // local double As[BLOCK_SIZE*BLOCK_SIZE];

    // Declaration of the shared memory array Bs
    // used to store the sub-matrix of B
    // local double Bs[BLOCK_SIZE*BLOCK_SIZE];

    Csub = 0.0;

    for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep)
    {
        // Load the matrices from global memory
        // to shared memory; each thread loads
        // one element of each matrix
        As[ty*BLOCK_SIZE + tx] = A[a + wA * ty + tx];
        Bs[ty*BLOCK_SIZE + tx] = B[b + wB * ty + tx];

        // Synchronize to make sure the matrices
        // are loaded
        barrier(CLK_LOCAL_MEM_FENCE);

        // if printf is inserted, calculation is correct(!!??)
        //if (As[ty*BLOCK_SIZE + tx] != A[a + wA * ty + tx])
        //    printf("ERROR As != A %i %i\n", ty*BLOCK_SIZE + tx, a + wA * ty + tx);

        // Multiply the two matrices together;
        // each thread computes one element
        // of the block sub-matrix
        for (int k = 0; k < BLOCK_SIZE; ++k)
            //Csub += As[ty*BLOCK_SIZE + k] * Bs[k*BLOCK_SIZE + tx]; // doesn't work
            //Csub += A[a + wA*ty + k] * B[b + wB * k + tx];         // works
            //Csub += A[a + wA*ty + k] * Bs[k*BLOCK_SIZE + tx];      // works
            Csub += As[ty*BLOCK_SIZE + k] * B[b + wB * k + tx];      // doesn't work
        // conclusion: local memory As isn't getting copied correctly while local memory Bs is.

        // Synchronize to make sure that the preceding
        // computation is done before loading two new
        // sub-matrices of A and B in the next iteration
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    // Write the block sub-matrix to device memory;
    // each thread writes one element
    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;
    C[c + wB * ty + tx] = Csub;
}

__kernel void MatrixMul_NN_opt(global double *C, global double *A, global double *B,
                               uint m, uint n, uint k, uint lda, uint ldb, uint ldc,
                               double alpha, double beta
                               /*, local double *As, local double *Bs*/)
{
    // Declaration of the shared memory array As
    // used to store the sub-matrix of A
    local double As[BLOCK_SIZE*BLOCK_SIZE];

    // Declaration of the shared memory array Bs
    // used to store the sub-matrix of B
    local double Bs[BLOCK_SIZE*BLOCK_SIZE];

    MatrixMul_NN_opt2(C, A, B, m, n, k, lda, ldb, ldc, alpha, beta, As, Bs);
}
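For completeness, the host side roughly does the following when calling MatrixMul_NN_opt2 directly (a simplified sketch, not my actual harness; the names queue, mmKernel, dA/dB/dC and N, the use of square N x N matrices, and the argument indices matching the signature above are assumptions, and error checking is omitted):

#define BLOCK_SIZE 8   // must match the kernel's BLOCK_SIZE
// Assumed to exist already: cl_command_queue queue; cl_kernel mmKernel built from
// MatrixMul_NN_opt2; cl_mem dA, dB, dC holding N x N doubles; N a multiple of BLOCK_SIZE.

size_t lws[2] = { BLOCK_SIZE, BLOCK_SIZE };   // 8x8 work-groups
size_t gws[2] = { N, N };                     // one work-item per element of C

cl_uint m = N, n = N, k = N, lda = N, ldb = N, ldc = N;
cl_double alpha = 1.0, beta = 0.0;

clSetKernelArg(mmKernel,  0, sizeof(cl_mem), &dC);
clSetKernelArg(mmKernel,  1, sizeof(cl_mem), &dA);
clSetKernelArg(mmKernel,  2, sizeof(cl_mem), &dB);
clSetKernelArg(mmKernel,  3, sizeof(cl_uint), &m);
clSetKernelArg(mmKernel,  4, sizeof(cl_uint), &n);
clSetKernelArg(mmKernel,  5, sizeof(cl_uint), &k);
clSetKernelArg(mmKernel,  6, sizeof(cl_uint), &lda);
clSetKernelArg(mmKernel,  7, sizeof(cl_uint), &ldb);
clSetKernelArg(mmKernel,  8, sizeof(cl_uint), &ldc);
clSetKernelArg(mmKernel,  9, sizeof(cl_double), &alpha);
clSetKernelArg(mmKernel, 10, sizeof(cl_double), &beta);
// The two __local buffers: pass only a size and a NULL pointer.
// (These two calls are left out when launching MatrixMul_NN_opt instead,
//  since that kernel declares As and Bs itself.)
clSetKernelArg(mmKernel, 11, BLOCK_SIZE * BLOCK_SIZE * sizeof(cl_double), NULL);
clSetKernelArg(mmKernel, 12, BLOCK_SIZE * BLOCK_SIZE * sizeof(cl_double), NULL);

clEnqueueNDRangeKernel(queue, mmKernel, 2, NULL, gws, lws, 0, NULL, NULL);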

0 Likes

As Micah specified, section 6.7.1 of the spec says:

The behavior of calling kernel functions with variables declared inside the function with the
__local or local qualifier from other kernel functions is implementation-defined

I see the same thing happening in your test case. The variables As and Bs are local in nature and are passed to the called function as arguments. The behaviour in this case is implementation-defined.

0 Likes

Originally posted by: himanshu.gautam As Micah specified, section 6.7.1 of the spec says:

The behavior of calling kernel functions with variables declared inside the function with the __local or local qualifier from other kernel functions is implementation-defined

I see the same thing happening in your test case. The variables As and Bs are local in nature and are passed to the called function as arguments. The behaviour in this case is implementation-defined.

I don't think you read the spec correctly. In my kernel (MatrixMul_NN_opt) I allocate As and Bs as local memory and pass pointers to these arrays to the function (MatrixMul_NN_opt2; declared as a kernel, but treated as a function). Nowhere inside MatrixMul_NN_opt2 do I allocate local memory. The allocation of local memory only happens in the kernel MatrixMul_NN_opt, and that is OK according to the spec.

0 Likes

hi galmok,

Yeah, I see something odd here. Would it be possible for you to send in the complete test case? It would help us quickly verify the issue at our end.

0 Likes

Can I email it or something like that? That would make it easier for me (it is an MSVC10 project). I don't have a single C file to support the kernel, but I could possibly create one.

0 Likes

Yes, sure. You can try filing an IT help ticket and attaching the code there.

CC: Himanshu.Gautam@amd.com 

0 Likes

I don't understand why you are using:

__kernel void MatrixMul_NN_opt2(...)

instead of:

void MatrixMul_NN_opt2(...)

 

0 Likes

Originally posted by: Jawed I don't understand why you are using:

 

__kernel void MatrixMul_NN_opt2(...)

 

instead of:

 

void MatrixMul_NN_opt2(...)

It makes it easier to switch the code around (calling MatrixMul_NN_opt2 directly). "__kernel" is ignored when the function is called indirectly (i.e. it acts as a function).
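For what it's worth, the plain-function form would just be (a sketch; the body is unchanged):

void MatrixMul_NN_opt2(global double *C, global double *A, global double *B,
                       uint m, uint n, uint k, uint lda, uint ldb, uint ldc,
                       double alpha, double beta,
                       local double *As, local double *Bs)
{
    /* ... same body as in the listing above ... */
}

but then it could no longer be enqueued from the host at all, since only __kernel functions can be launched with clEnqueueNDRangeKernel.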

0 Likes

Originally posted by: galmok I have attached the kernel/function. The host calls the bottom kernel, which allocates local space and calls the topmost function. It is a simple matrix multiplication as found on the web (written for CUDA; I translated it a bit but haven't optimised _anything_ for AMD yet).

I would have enabled the printf stuck inside the for-loop, but when it is enabled, everything is calculated correctly. Without the printf, a slight error creeps into the result (you can verify this by supplying any matrix larger than 32 and a multiple of 8, BLOCK_SIZE; the result is correct for matrices of size 8, 16, 24, and 32).

The result is also correct if you call the topmost function directly from the host (and remember to allocate the local memory in that kernel directly, or supply it from the host using clSetKernelArg).

Only the local space As is not working correctly. Bs is fine regardless.

galmok,

Could you please try to run on the CPU and see whether you are getting correct results?

Could you please give your system details (OS, CPU, GPU, SDK version and driver version)?

0 Likes

Originally posted by: genaganna

galmok,

Could you please try to run on the CPU and see whether you are getting correct results?

Could you please give your system details (OS, CPU, GPU, SDK version and driver version)?

Running the code on the CPU gives correct results.

System details: Windows 7 64-bit, Core i7 930, SDK v2.3, Catalyst 11.1, MSVC10.

0 Likes

Originally posted by: galmok
Originally posted by: genaganna

galmok,

Could you please try to run on the CPU and see whether you are getting correct results?

Could you please give your system details (OS, CPU, GPU, SDK version and driver version)?

Running the code on the CPU gives correct results.

System details: Windows 7 64-bit, Core i7 930, SDK v2.3, Catalyst 11.1, MSVC10.

Please also copy your runtime code here.

0 Likes

Originally posted by: genaganna

Please also copy your runtime code here.

 

I have already mailed the code to himanshu.gautam at AMD. It included source code for a matrix library used to verify correctness and some messy code to glue it all together. I'd rather not post it here as well.

If himanshu didn't receive the code, please let me know.

0 Likes