cancel
Showing results for 
Search instead for 
Did you mean: 

Archives Discussions

0xfeedface
Journeyman III

Scan kernel optimization

Hi, I am trying to port the optimized CUDA scan kernel from the Modern GPU tutorial to OpenCL/AMD. It performs an intra-wavefront scan, i.e. no explicit synchronization is used. The plain kernel works fine and is the standard way to implement scan. The optimized version, however, does not work on my Radeon HD 6490M without synchronization in between scan steps.

Here is my plain scan kernel which works fine:

__kernel
void scan_wavefront(__global value_t * input)
{
    const uint globx = get_global_id(0);
    const uint locx  = get_local_id(0);
    const uint lane  = locx & (WAVEFRONT_SIZE - 1);

    __local value_t buffer[WAVEFRONT_SIZE];
    buffer[locx] = input[globx];

    #pragma unroll
    for (uint offset = 1; offset < WAVEFRONT_SIZE; offset <<= 0x1) {
        buffer[locx] += (lane >= offset) ? buffer[locx - offset] : 0;
    }

    input[globx] = buffer[locx];
}

The optimization idea is to us a local buffer which has place for an additional half-wavefront. This first half wavefront is filled with zeros and the initial index is 32. This way, one can safe the conditionals, since those lanes for which the conditionals would be false will just add zero to their values.

Here is the my optimized kernel. Instead of juggling pointers as in the tutorial, I just create a second index.

__kernel
void scan_wavefront_optimized(__global value_t * input)
{
    const uint globx = get_global_id(0);
    const uint locx  = get_local_id(0);
    const uint idx   = locx + HALF_WAVEFRONT_SIZE;
    const uint lane  = locx & (WAVEFRONT_SIZE - 1);

    __local value_t buffer[WAVEFRONT_SIZE + HALF_WAVEFRONT_SIZE];
    buffer[locx] = 0;
    buffer[idx] = input[globx];

    #pragma unroll
    for (uint offset = 1; offset < WAVEFRONT_SIZE; offset <<= 0x1) {
        barrier(CLK_LOCAL_MEM_FENCE); // this is required but shouldn't
        buffer[idx] += buffer[idx - offset];
    }

    input[globx] = buffer[idx];
}

Does anybody have any insights as to why the barrier is required in the second version? How can it be that a workgroup of 64 elements is not processed by a single wavefront?

Thanks,

Norman

0 Likes
1 Solution
notzed
Challenger

If you set the reqd_work_group_size (page 4-41 of amd app programming guide 1.3f), and it's within the hardware capability, then any barriers are compiled out.  This is a clean and safe way to get the same optimisation where the hardware supports it (and valid code otherwise), although unfortunately nvidia doesn't support it (last i looked).

Maybe try a barrier before the loop, but in reality the relaxed memory model of opencl (spec 1.1, section 3.3.1) means reads could be done anywhere - including before the loop even starts, and that writes never leave registers ...

e.g. with no explicit barriers/fences, i reckon it would be entirely valid to compile into this (someone correct me if i'm wrong here):

v = input[globx];

r0 = buffer[idx-1];

r1 = buffer[idx-2];

r2 = buffer[idx-4];

...

v += r0;

v += r1;

etc.

input[globx] = v;

This relaxed memory model is very important for allowing the compiler a lot of freedom for optimisations that would not otherwise be possible (i know nothing of cuda's, but i'd be surprised if it was different for this reason).  But obviously you have to deal with it too.

Perhaps the reason the first one works is that it has to make an in-loop run-time decision on either a constant or a local memory access, and it is choosing to optimise this by making the local access conditional.  So maybe it only works by quirk.

(one could always check the asm to see what's going on).

fwiw i've had a lot of problems using #pragma unroll with AMD's compiler, everything from very very slow code (10x slower, probably register spills?) to invalid results - it broke so much code at the time I just deleted all unroll pragmas and never even consider it anymore (in most cases it was pretty insignificant anyway).  If the loop parameters are compile-time constant like this it will unroll by itself.

View solution in original post

0 Likes
8 Replies
nou
Exemplar

i am not sure but IIRC low end cards have 32 workitems wide wavefront.

0 Likes

The AMD APP Guide states that HD 5400 series had a wavefront size of 32 and HD 5800 (and probably everything beyond) a size of 64. I don't know which of these the 6490M is derived from. But the problem occurs also when I assume a wavefront size of 32.

0 Likes
LeeHowes
Staff

You may just need to insert a fence in the loop. While it's true that hardware won't reorder across the wavefront on current chips the compiler is quite another story. A local fence will ensure it does not. Off the top of my head I'm not sure what the vector width of the 6490M is.

0 Likes
notzed
Challenger

If you set the reqd_work_group_size (page 4-41 of amd app programming guide 1.3f), and it's within the hardware capability, then any barriers are compiled out.  This is a clean and safe way to get the same optimisation where the hardware supports it (and valid code otherwise), although unfortunately nvidia doesn't support it (last i looked).

Maybe try a barrier before the loop, but in reality the relaxed memory model of opencl (spec 1.1, section 3.3.1) means reads could be done anywhere - including before the loop even starts, and that writes never leave registers ...

e.g. with no explicit barriers/fences, i reckon it would be entirely valid to compile into this (someone correct me if i'm wrong here):

v = input[globx];

r0 = buffer[idx-1];

r1 = buffer[idx-2];

r2 = buffer[idx-4];

...

v += r0;

v += r1;

etc.

input[globx] = v;

This relaxed memory model is very important for allowing the compiler a lot of freedom for optimisations that would not otherwise be possible (i know nothing of cuda's, but i'd be surprised if it was different for this reason).  But obviously you have to deal with it too.

Perhaps the reason the first one works is that it has to make an in-loop run-time decision on either a constant or a local memory access, and it is choosing to optimise this by making the local access conditional.  So maybe it only works by quirk.

(one could always check the asm to see what's going on).

fwiw i've had a lot of problems using #pragma unroll with AMD's compiler, everything from very very slow code (10x slower, probably register spills?) to invalid results - it broke so much code at the time I just deleted all unroll pragmas and never even consider it anymore (in most cases it was pretty insignificant anyway).  If the loop parameters are compile-time constant like this it will unroll by itself.

0 Likes

Being curious, I tried it out: my guess was right, it just compiled out all the writes.

nobarrier.isa - your second example with, but with no barrier: it just compiled out all the writes

barrier.isa - original code with barrier, valid but lots of clauses

barrier-rws.isa same code with barrier, but __attribute__((reqd_work_group_size(64,1,1))) specified on the kernel.  i.e. barriers removed and what you were after.

0 Likes

Thanks for your answer. I was aware of the reqd_work_group_size attribute but I never got the exact same performance from barrier + reqd_work_group_size compared to no barrier at all, so I tried to avoid it. Thinking of it, this might suggest a wavefront size less than 64 on my hardware (i.e. only 5 barriers optimized away and one left in). But you are right, to produce correct, portable code on should include barriers and leave optimizations to the compiler.

I am on Mac OS X and my compiler seems to ignore the GPU_DUMP_DEVICE_KERNEL environment variable. So I cannot get the ISA for my particular hardware.

0 Likes

We do not support the OpenCL implementation on OS X so none of our environment variables will work on that platform. You will need to contact Apple to find a way to dump IL/ISA for an OpenCL kernel.

0 Likes
0xfeedface
Journeyman III

For the record, here is a working version of the optimized CUDA kernel without barrier calls. It turns out the pointer juggling I tried to omit was indeed necessary to prevent the compiler from doing the optimizations notzed described.

__kernel
void scan_wavefront_optimized2(__global value_t * input)
{
    const uint globx = get_global_id(0);
    const uint locx  = get_local_id(0);

    __local value_t buffer[WAVEFRONT_SIZE + HALF_WAVEFRONT_SIZE];
    buffer[locx] = 0;
    volatile value_t * s = (value_t *)buffer + WAVEFRONT_SIZE / 2 + locx;
    value_t x = input[globx];
    s[0] = x;

    value_t sum = x;
    #pragma unroll
    for (uint offset = 1; offset < WAVEFRONT_SIZE; offset <<= 0x1) {
        sum += s[-offset];
        s[0] = sum;
    }

    input[globx] = sum;
}
0 Likes