AnsweredAssumed Answered

Scan kernel optimization

Question asked by 0xfeedface on Mar 23, 2012
Latest reply on Mar 26, 2012 by MicahVillmow

Hi, I am trying to port the optimized CUDA scan kernel from the Modern GPU tutorial to OpenCL/AMD. It performs an intra-wavefront scan, i.e. no explicit synchronization is used. The plain kernel works fine and is the standard way to implement scan. The optimized version, however, does not work on my Radeon HD 6490M without synchronization in between scan steps.

 

Here is my plain scan kernel which works fine:

__kernel
void scan_wavefront(__global value_t * input)
{
    const uint globx = get_global_id(0);
    const uint locx  = get_local_id(0);
    const uint lane  = locx & (WAVEFRONT_SIZE - 1);

    __local value_t buffer[WAVEFRONT_SIZE];
    buffer[locx] = input[globx];

    #pragma unroll
    for (uint offset = 1; offset < WAVEFRONT_SIZE; offset <<= 0x1) {
        buffer[locx] += (lane >= offset) ? buffer[locx - offset] : 0;
    }

    input[globx] = buffer[locx];
}

 

The optimization idea is to us a local buffer which has place for an additional half-wavefront. This first half wavefront is filled with zeros and the initial index is 32. This way, one can safe the conditionals, since those lanes for which the conditionals would be false will just add zero to their values.

Here is the my optimized kernel. Instead of juggling pointers as in the tutorial, I just create a second index.

__kernel
void scan_wavefront_optimized(__global value_t * input)
{
    const uint globx = get_global_id(0);
    const uint locx  = get_local_id(0);
    const uint idx   = locx + HALF_WAVEFRONT_SIZE;
    const uint lane  = locx & (WAVEFRONT_SIZE - 1);

    __local value_t buffer[WAVEFRONT_SIZE + HALF_WAVEFRONT_SIZE];
    buffer[locx] = 0;
    buffer[idx] = input[globx];

    #pragma unroll
    for (uint offset = 1; offset < WAVEFRONT_SIZE; offset <<= 0x1) {
        barrier(CLK_LOCAL_MEM_FENCE); // this is required but shouldn't
        buffer[idx] += buffer[idx - offset];
    }

    input[globx] = buffer[idx];
}

 

Does anybody have any insights as to why the barrier is required in the second version? How can it be that a workgroup of 64 elements is not processed by a single wavefront?

 

Thanks,

Norman

Outcomes