7 Replies Latest reply on Sep 30, 2013 12:16 AM by himanshu.gautam

    Performance issues with LDS memory


      Hello, I have a performance problem when using LDS memory.


      I have two kernels as parts of a N-body simulation in two spatial dimensions, with which I'm trying to learn OpenCL, with a special type of potential (to make things more fun than with gravity, but not important now) and I am using the Verlet integration algorithm (to have more precise calculation, not important now). The problematic kernel is:


      #define UNROLL_FACTOR 8
      //Vernet velocity part kernel
      __kernel void kernel_velocity(const float deltaTime,
                                   global const float4 *pos1,
                                 __global float4 *vel,
                                 __global float4 *accel,
                                 __local float4 *pblock)
          const int gid = get_global_id(0); //global id of work item
          const int id = get_local_id(0); //local id of work item within work group
          const int s_wg = get_local_size(0); //work group size
          const int n_wg = get_num_groups(0); //number of work groups
          const float4 myPos = pos1[gid];
          const float4 myVel = vel[gid];
          const float4 dt = (float4)(deltaTime, deltaTime, 0.0f, 0.0f);
          float4 acc = (float4)0.0f;
          for (int jw = 0; jw < n_wg; ++jw)
              pblock[id] = pos1[jw * s_wg + id]; //cache a particle position; position in array: workgroup no. * size of workgroup + local id
              barrier (CLK_LOCAL_MEM_FENCE); //wait for others in the work group
              for (int i = 0; i < s_wg; )
                  #pragma unroll UNROLL_FACTOR
                  for (int j = 0; j < UNROLL_FACTOR; ++j, ++i)
                      float4 r = myPos - pblock[i];
                      float rSizeSquareInv = native_recip (r.x*r.x + r.y*r.y + 0.0001f);
                      float rSizeSquareInvDouble = rSizeSquareInv * rSizeSquareInv;
                      float rSizeSquareInvQuadr = rSizeSquareInvDouble * rSizeSquareInvDouble;
                      float rSizeSquareInvHept = rSizeSquareInvQuadr * rSizeSquareInvDouble * rSizeSquareInv;
                      acc += r * (2.0f * rSizeSquareInvHept - rSizeSquareInvQuadr);
          acc *= 24.0f / myPos.w;
          //update velocity only
          float4 newVel = myVel + 0.5f * dt * (accel[gid] + acc);
          //write to global memory
          vel[gid] = newVel;
          accel[gid] = acc;


      The kernels are called sequentially, i.e. the first then the second then the first again etc.

      The problem is that the posted kernel execution time is much larger when the LDS memory read operation is present (line 30) than when there is no read operation at all. So e.g. replacing the corresponding line with

      float4 r = myPos; // - pblock[i];

      will make the kernel run much faster (and produce nonsense, but to demonstrate the point), like five times. The performance of the original code is slower even when there is a global memory read instead like this:

      float4 r = myPos - pos1[jw * s_wg + i];

      This will now produce a correct simulation, but it will be about 30 % faster than with the LDS memory read. I would expect that using LDS memory would produce a faster code and not the other way round. Also I've seen that when the calculation chain is cut somewhere in the middle in the sense that the value from the LDS does not propagate into the global memory write at the end through the various calculations, the kernel execution is again faster (by 5 times again).


      So can someone, please, tell me what is going on here? Am I doing something wrong? In the simulation there are 4608 work units with work group size of 192 (other sizes yield either same or worse performance). I have Radeon HD 6850, AMD APP SDK v2.8.1 and Catalyst 13.8 beta. I use Ubuntu 12.04, but saw exactly the same behaviour with the code in Windows 7 with my GPU.

        • Re: Performance issues with LDS memory

          Well, apart the fact that the inner (unrolled) loop indexed by j is totally useless with your code and that the local memory accesses in this loop could be placed in the upper loop, there is not much to say.

          I have read in this forum, that the cache is faster than shared memory. So it may be the reason.

            • Re: Performance issues with LDS memory

              Depending on how much LDS memory you allocated per workgroup, it may become a limiting factor for GPU occupancy.  Try profiling it with CodeXL and see if that's the case.

                • Re: Performance issues with LDS memory

                  Thank you for the suggestion, but I already know the result, because I've tried before.

                  The LDS memory allocated per workgroup is simply 192 (workgroup size) * 4 (float4) * 4 (float size) = 3072 bytes, whereas the device limit is 32768. According to CodeXL the limiting factor for the occupancy is the number of VGPRs. However, I also get strangely large number of LDS fetch instructions (almost 11k), also there is 0.23 % of LDS bank conflict. I don't know where these numbers are coming from. When the global memory read is used, the fetch instructions is 4611 and bank conflict 0 %, which looks reasonable with respect to the simulation parameter values (work unit number).

                • Re: Performance issues with LDS memory

                  Can you tell me in what way the inner unrolled loop is useless? If the local memory access is just moved into the upper loop, the program will not work since i is incremented in the unrolled loop. Also if I put out the unrolled loop completely, which is the same as defining UNROLL_FACTOR to 1, the kernel execution time will be slower than now... So I don't really understand what you mean :-)

                • Re: Performance issues with LDS memory


                  I would like to share some basic stuffs here..


                  You are having a loop which loops through all the work groups...I think its not required. Because all the work groups work parallely. Since because of this loop each workgroup execute the block inside for "num of work groups" time.

                  And the same with loca size. Even in this case all the threads work parallely. you can use of local id if you wanted to do any thing.

                  I dont know really you wanted to loop it or what ur intesion behind.


                  May be because these loops you are getting low performance. Please check once whether you really need those looping.