19 Replies Latest reply on Feb 15, 2011 11:59 AM by galmok

    local memory in function body versus function header

    galmok
      one works, the other doesn't

      I have a kernel that uses local memory and at first I declared a local array of doubles inside the function body. The resulting calculations of that kernel turned out be to wrong.

      Each work-item would copy 1 value to the array after which I had a barrier to syncronize. Efter the synchronization I checked the content of the local array and discovered it was corrupt.

      The I supplied the local array as part of the function header which solved the problem. The content of the local array was correct now.

      Why is it that my first attempt of using local array fails while the second is ok? In the first situation, is the local array not shared between all work-items?

        • local memory in function body versus function header
          MicahVillmow
          Please provide a test case that shows the issue, there should be no problem for declaring the array in a kernel function body.
            • local memory in function body versus function header
              galmok

              I have now tried to isolate the issue a bit but I am still falling short. It seems my issue occurs only if the kernel is called from another kernel. If I call the kernel directly from the host, there is no issue.

              So my question, before cleaning up all the code, is this:

              If a kernel A is calling kernel B, may kernel B allocate local memory or must the memory be allocated in kernel A and passed to B (or allocated outside opencl)?

              I have tried some situations:

              Allocate local memory in A and pass to B: Small errors in local memory.

              Allocate local memory outside opencl: Works without problems.

              Allocate local memory in B (with function defined as kernel) and call A: Large errors in local memory.

              Call B directly: Works without problems.

               

              It seems simply calling function B from A causes local memory to become non-working.

              It would be best if I could mail the code to you if you would like it...

            • local memory in function body versus function header
              MicahVillmow
              galmok,
              A kernel calling another kernels turns the called kernel into a function. Therefor it is illegal to have local memory defined in the called kernel since it is no longer a kernel, but a funciton.
                • local memory in function body versus function header
                  galmok

                  Ok. I didn't know that. But even if I define the local memory in the kernel and pass the pointers to the function, it doesn't work (local memory not updated correctly).

                  Should the compiler catch the problem with local memory being defined in a function if the function is actually declared as a kernel?

                  According to the OpenCL 1.1 specifications, pointers to local address space are allowed as arguments to functions and kernels. So the situation where I allocate local space in the kernel and pass the pointers to a function should work fine (but doesn't).

                • local memory in function body versus function header
                  MicahVillmow
                  galmok,
                  You also have to take into account section 6.7.1 of the OpenCL spec where it states that
                  "It is just a regular function call if a __kernel function is called by another kernel
                  function."
                  and
                  "The behavior of calling kernel functions with variables declared inside the function with the
                  __local or local qualifier from other kernel functions is implementation-defined."

                  We view it as not being allowed at this time with our implementation because of the first statement and the second statement allows this approach.
                    • local memory in function body versus function header
                      galmok

                      I think you misunderstand me.

                      I know kernel functions, when called from other kernel functions, are to be seen as normal function which cannot allocate local memory (well, it is implementation specific and I wont go there).

                      The error I see:

                      The kernel (that is called from the host) allocates 2 local arrays and calls a function, passing the local arrays as arguments. The function however doesn't work as expected. Moving the allocation to the function and calling it directly from the host makes it work as expected.

                       

                    • local memory in function body versus function header
                      MicahVillmow
                      Your right, I did misunderstand. If you can supply a test case that shows your issue, we can make sure it is fixed in the next release.
                        • local memory in function body versus function header
                          galmok

                          I have attached the kernel/function. The host calls the bottom kernel which allocated local space and calls the topmost function. It is a simple matrix multiplication as found on the web (for CUDA, I translated a bit but haven't optimised _anything_ for AMD yet).

                          I would have enabled the printf stuck inside the for-loop but when enabling it, everything is calculated correctly. Without the printf, a slight error in the result crops in (you can verify by supplying any matrix larger than 32 (multiple of 8, BLOCK_SIZE) as the result is correct when supplying matrices of size 8, 16, 24, and 32).

                          The result is also correct if you call the top-most function directly from the host (and remember to allocate the local memory in that kernel directly or supply it from the host using clSetKernelArg).

                          Only the local space As is not working correctly. Bs is fine regardless.

                          #pragma OPENCL EXTENSION cl_amd_fp64: enable //#pragma OPENCL EXTENSION cl_amd_printf : enable #define BLOCK_SIZE 8 // http://gpgpu-computing4.blogspot.com/2009/09/matrix-multiplication-3.html // A[m x k] * B[k x n] = C[m x n] __kernel void MatrixMul_NN_opt2(global double *C, global double *A, global double *B, uint m, uint n, uint k, uint lda, uint ldb, uint ldc, double alpha, double beta, local double *As, local double *Bs) { // for 128x128 m,n,k,lda,ldb,ldc are all 128. Alpha=1.0 and beta 0.0. int wA = k; int wB = n; double Csub; // Block index int bx = get_group_id(0); //blockIdx.x; int by = get_group_id(1); //blockIdx.y; // Thread index int tx = get_local_id(0); //threadIdx.x; int ty = get_local_id(1); //threadIdx.y; // Index of the first sub-matrix of A processed // by the block int aBegin = wA * BLOCK_SIZE * by; // Index of the last sub-matrix of A processed // by the block int aEnd = aBegin + wA - 1; // Step size used to iterate through the // sub-matrices of A int aStep = BLOCK_SIZE; // Index of the first sub-matrix of B processed // by the block int bBegin = BLOCK_SIZE * bx; // Step size used to iterate through the // sub-matrices of B int bStep = BLOCK_SIZE * wB; // Loop over all the sub-matrices of A and B // required to compute the block sub-matrix // Declaration of the shared memory array As // used to store the sub-matrix of A // local double As[BLOCK_SIZE*BLOCK_SIZE]; // Declaration of the shared memory array Bs // used to store the sub-matrix of B // local double Bs[BLOCK_SIZE*BLOCK_SIZE]; Csub=0.0; for (int a = aBegin, b = bBegin; a <= aEnd; a += aStep, b += bStep) { // Load the matrices from global memory // to shared memory; each thread loads // one element of each matrix As[ty*BLOCK_SIZE + tx] = A[a + wA * ty + tx]; Bs[ty*BLOCK_SIZE + tx] = B[b + wB * ty + tx]; // Synchronize to make sure the matrices // are loaded barrier(CLK_LOCAL_MEM_FENCE); // if printf is inserted, calculation is correct(!!??) //if (As[ty*BLOCK_SIZE + tx] != A[a + wA * ty + tx]) //printf("ERROR As != A %i %i\n",ty*BLOCK_SIZE + tx, a + wA * ty + tx); // Multiply the two matrices together; // each thread computes one element // of the block sub-matrix for (int k = 0; k < BLOCK_SIZE; ++k) //Csub += As[ty*BLOCK_SIZE + k] * Bs[k*BLOCK_SIZE + tx]; // doesn't work //Csub += A[a + wA*ty + k] * B[b + wB * k + tx]; // works //Csub += A[a + wA*ty + k] * Bs[k*BLOCK_SIZE + tx]; // works Csub += As[ty*BLOCK_SIZE + k] * B[b + wB * k + tx]; // doesn't work // conclusion: local memory As isn't getting copied correctly while local memory Bs is. // Synchronize to make sure that the preceding // computation is done before loading two new // sub-matrices of A and B in the next iteration barrier(CLK_LOCAL_MEM_FENCE); } // Write the block sub-matrix to device memory; // each thread writes one element int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx; C[c + wB * ty + tx] = Csub; } __kernel void MatrixMul_NN_opt(global double *C, global double *A, global double *B, uint m, uint n, uint k, uint lda, uint ldb, uint ldc, double alpha, double beta/*, local double *As, local double *Bs*/) { // Declaration of the shared memory array As // used to store the sub-matrix of A local double As[BLOCK_SIZE*BLOCK_SIZE]; // Declaration of the shared memory array Bs // used to store the sub-matrix of B local double Bs[BLOCK_SIZE*BLOCK_SIZE]; MatrixMul_NN_opt2(C, A, B, m, n, k, lda, ldb, ldc, alpha, beta, As, Bs); }

                            • local memory in function body versus function header
                              himanshu.gautam

                              As specified by micah in the 6.7 section of spec

                              The behavior of calling kernel functions with variables declared inside the function with the
                              __local or local qualifier from other kernel functions is implementation-defined

                              I find tha same thing happening in your test case. The variables As and Bs are local in nature and are passed to the called function as arguments. Behaviour in this case is implementaion defined.

                              • local memory in function body versus function header
                                genaganna

                                 

                                Originally posted by: galmok I have attached the kernel/function. The host calls the bottom kernel which allocated local space and calls the topmost function. It is a simple matrix multiplication as found on the web (for CUDA, I translated a bit but haven't optimised _anything_ for AMD yet).

                                 

                                I would have enabled the printf stuck inside the for-loop but when enabling it, everything is calculated correctly. Without the printf, a slight error in the result crops in (you can verify by supplying any matrix larger than 32 (multiple of 8, BLOCK_SIZE) as the result is correct when supplying matrices of size 8, 16, 24, and 32).

                                 

                                The result is also correct if you call the top-most function directly from the host (and remember to allocate the local memory in that kernel directly or supply it from the host using clSetKernelArg).

                                 

                                Only the local space As is not working correctly. Bs is fine regardless.

                                 

                                galmok,

                                            Could you please try to run on CPU and see whether you are getting correct results?

                                             Could you please give your system details(OS, CPU, GPU, SDK version and Driver version)

                                  • local memory in function body versus function header
                                    galmok

                                     

                                    Originally posted by: genaganna

                                     

                                    galmok,

                                     

                                                Could you please try to run on CPU and see whether you are getting correct results?

                                     

                                                 Could you please give your system details(OS, CPU, GPU, SDK version and Driver version)

                                     

                                    Running the code on the CPU causes correct calculations.

                                    System details: Windows 7 64bit, Core i7 930, SDK v2.3, Catalyst 11.1, MSVC10.

                                      • local memory in function body versus function header
                                        genaganna

                                         

                                        Originally posted by: galmok
                                        Originally posted by: genaganna

                                         

                                         galmok,

                                         

                                                     Could you please try to run on CPU and see whether you are getting correct results?

                                         

                                                      Could you please give your system details(OS, CPU, GPU, SDK version and Driver version)

                                         

                                         



                                         

                                        Running the code on the CPU causes correct calculations.

                                         

                                        System details: Windows 7 64bit, Core i7 930, SDK v2.3, Catalyst 11.1, MSVC10.

                                         

                                        Please copy your runtime code also here.

                                          • local memory in function body versus function header
                                            galmok

                                             

                                            Originally posted by: genaganna

                                             

                                            Please copy your runtime code also here.

                                             

                                            I have already mailed the code to himanshu.gautam at AMD. It included source code for a matrix library used to verify correctness and some messy code to glue it all together. I'd rather not post it here as well.

                                            If himanshu didn't receive the code, please let me know.