AnsweredAssumed Answered

How to do parallel reduction correctly?

Question asked by joggel on Feb 11, 2013
Latest reply on Feb 13, 2013 by himanshu.gautam

Hi,

i am trying to write some kernels, which use local memory and parallel reduction. With this I have some problems.

My kernel tries to evaluate the min value and the max value of an array.

 

Kernel:

http://pastebin.com/gRbDcgyU

 

This kernel doesn't work. Whenever it runs, it crashes and CL_INVALID_COMMAND_QUEUE is returned by clfinish().

What I find is very interesting is, if I replace this:

for(int offset = local_size/2;offset>0;offset/=2)
        {
                barrier(CLK_LOCAL_MEM_FENCE);

                if(gid_local<offset)
                {
                        Minvals[gid_local] = fmin(Minvals[gid_local],Minvals[gid_local+offset]);
                        Maxvals[gid_local] = fmax(Maxvals[gid_local],Maxvals[gid_local+offset]);
                }
        }
    barrier(CLK_GLOBAL_MEM_FENCE);

By this:

for(int offset = local_size/2;gid_local<offset;offset/=2)
        {
                barrier(CLK_LOCAL_MEM_FENCE);
                Minvals[gid_local] = fmin(Minvals[gid_local],Minvals[gid_local+offset]);
                Maxvals[gid_local] = fmax(Maxvals[gid_local],Maxvals[gid_local+offset]);
        }
    barrier(CLK_GLOBAL_MEM_FENCE);

 

The kernel is working and even returns the correct values, although many threads in the same work-group don't reach the barrier-statement in my loop, which shouldn't be possible.

 

My idea of parallel reduction is pretty simple. I take the half of my work-group size and use that as an offset for an work-item to access the right elements.

This offset is halfed by each iteration until there is only one work-item alive. Both for loops are doing the same(in my opinion). If I use the first one, the kernel crashes. Sadly I don't know the reason for this. The second loop should not work...but it works.

 

Many thanks in advance

Outcomes