Yeah, you are right. That means I need another method to perform the reduction at a global level. This is the reason that kernel is not working. Thanks a lot.
Is it possible to perform any other global reduction with more than one workgroup? It seems that using only one workgroup is the key to my problem.
So the fact that the Tesla machine returned the right values was just a coincidence. I will rewrite the code.
Thanks a lot again. Now I know what I shouldn't do in the future.
Mhh, what do you mean by "spawning only one workgroup"? Do you mean I should spawn a single workgroup the size of the whole array (so that there is only one)? If I do that, my device runs out of memory.
It looks like I need a completely different approach, or I let the CPU do the final reduction.
No... What I meant was this:
I wrote some host code that allocates memory, builds the program, sets the arguments, and calls your kernel.
In this host code, if I spawn only 1 workgroup, your code works fine.
The code snippet that you complained about works fine on a 7970 card here.
However, if I spawn multiple workgroups (i.e. increase the image size), the code does not return correct output.
This is because of the bug in the code that assumes global synchronization. OpenCL's barrier() only synchronizes work-items within a single workgroup; there is no barrier across workgroups within one kernel launch.
So, as you rightly said, you can let the CPU do the final reduction,
(or) spawn one more kernel with only 1 workgroup to do the final reduction alone.