
Performance issues with LDS memory

Question asked by prometheus.nox on Sep 20, 2013
Latest reply on Sep 30, 2013 by himanshu.gautam

Hello, I have a performance problem when using LDS memory.

 

I have two kernels that form an N-body simulation in two spatial dimensions, which I'm using to learn OpenCL. The simulation uses a special type of potential (to make things more fun than gravity, but that's not important here) and the Verlet integration algorithm (for more precise calculation, also not important here). The problematic kernel is:

 

#define UNROLL_FACTOR 8
//Verlet velocity part kernel
__kernel void kernel_velocity(const float deltaTime,
                              __global const float4 *pos1,
                              __global float4 *vel,
                              __global float4 *accel,
                              __local float4 *pblock)
{
    const int gid = get_global_id(0); //global id of work item
    const int id = get_local_id(0); //local id of work item within work group
    
    const int s_wg = get_local_size(0); //work group size
    const int n_wg = get_num_groups(0); //number of work groups
    
    const float4 myPos = pos1[gid];
    const float4 myVel = vel[gid];
    const float4 dt = (float4)(deltaTime, deltaTime, 0.0f, 0.0f);
    float4 acc = (float4)0.0f;
    
    for (int jw = 0; jw < n_wg; ++jw)
    {
        pblock[id] = pos1[jw * s_wg + id]; //cache a particle position; position in array: workgroup no. * size of workgroup + local id
        barrier (CLK_LOCAL_MEM_FENCE); //wait for others in the work group
        
        for (int i = 0; i < s_wg; )
        {
            #pragma unroll UNROLL_FACTOR
            for (int j = 0; j < UNROLL_FACTOR; ++j, ++i)
            {
                float4 r = myPos - pblock[i];
                        
                float rSizeSquareInv = native_recip (r.x*r.x + r.y*r.y + 0.0001f);
                float rSizeSquareInvDouble = rSizeSquareInv * rSizeSquareInv;
                float rSizeSquareInvQuadr = rSizeSquareInvDouble * rSizeSquareInvDouble;
                float rSizeSquareInvHept = rSizeSquareInvQuadr * rSizeSquareInvDouble * rSizeSquareInv;
                acc += r * (2.0f * rSizeSquareInvHept - rSizeSquareInvQuadr);
            }
        }    
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    acc *= 24.0f / myPos.w;

    //update velocity only
    float4 newVel = myVel + 0.5f * dt * (accel[gid] + acc);

    //write to global memory
    vel[gid] = newVel;
    accel[gid] = acc;
}

 

The two kernels are enqueued alternately, i.e. the first, then the second, then the first again, and so on.
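
For completeness, the host side is basically a loop that alternates the two enqueues each time step, roughly like this (a simplified sketch; the variable names are illustrative and error checking is omitted):

//host-side sketch of the alternating kernel calls (illustrative names, no error checks)
for (int step = 0; step < nSteps; ++step)
{
    clEnqueueNDRangeKernel(queue, kernelPosition, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);
    clEnqueueNDRangeKernel(queue, kernelVelocity, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);
}
clFinish(queue);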

The problem is that the posted kernel's execution time is much larger when the LDS read (the pblock[i] access in the inner loop) is present than when there is no read at all. For example, replacing that line with

float4 r = myPos; // - pblock[i];

makes the kernel run about five times faster (and of course produces nonsense, but it demonstrates the point). The original code is even slower than a version that reads directly from global memory instead, like this:

float4 r = myPos - pos1[jw * s_wg + i];

This version produces a correct simulation and is still about 30% faster than the variant with the LDS read. I would expect using LDS memory to give faster code, not the other way round. I've also noticed that when the calculation chain is cut somewhere in the middle, so that the value read from LDS never propagates through the arithmetic into the global memory write at the end, the kernel is again about five times faster.
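
For example, one way to cut the chain (just for testing, obviously not a fix):

float4 r = myPos - pblock[i];
r = myPos; //overwrite r, so the value read from LDS never reaches acc or the final global write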

 

So can someone please tell me what is going on here? Am I doing something wrong? The simulation uses 4608 work-items with a work-group size of 192 (other sizes yield the same or worse performance). I have a Radeon HD 6850, AMD APP SDK v2.8.1 and Catalyst 13.8 beta. I use Ubuntu 12.04, but I saw exactly the same behaviour on Windows 7 with the same GPU.
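
In case it matters, the local buffer is passed from the host as a size-only argument and the work sizes are set roughly like this (again a simplified sketch with illustrative names):

//__local pblock: one float4 per work item in the group, no host pointer
clSetKernelArg(kernelVelocity, 4, 192 * sizeof(cl_float4), NULL);

size_t globalSize = 4608;
size_t localSize = 192;
clEnqueueNDRangeKernel(queue, kernelVelocity, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);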
