
Archives Discussions

joggel
Journeyman III

How to do parallel reduction correctly?

Hi,

I am trying to write some kernels that use local memory and parallel reduction, and I am running into some problems.

My kernel tries to compute the minimum and the maximum value of an array.

Kernel:

http://pastebin.com/gRbDcgyU

This kernel doesn't work. Whenever it runs, it crashes and clFinish() returns CL_INVALID_COMMAND_QUEUE.

What I find very interesting is: the kernel behaves differently if I replace this:

for(int offset = local_size/2; offset > 0; offset /= 2)
{
        barrier(CLK_LOCAL_MEM_FENCE);

        if(gid_local < offset)
        {
                Minvals[gid_local] = fmin(Minvals[gid_local], Minvals[gid_local+offset]);
                Maxvals[gid_local] = fmax(Maxvals[gid_local], Maxvals[gid_local+offset]);
        }
}
barrier(CLK_GLOBAL_MEM_FENCE);

By this:

for(int offset = local_size/2; gid_local < offset; offset /= 2)
{
        barrier(CLK_LOCAL_MEM_FENCE);

        Minvals[gid_local] = fmin(Minvals[gid_local], Minvals[gid_local+offset]);
        Maxvals[gid_local] = fmax(Maxvals[gid_local], Maxvals[gid_local+offset]);
}
barrier(CLK_GLOBAL_MEM_FENCE);

With the second version the kernel works and even returns the correct values, although many work-items in the same work-group never reach the barrier inside the loop, which shouldn't be possible.

My idea of parallel reduction is pretty simple. I take half of my work-group size and use it as an offset for each work-item to access the right elements.

This offset is halved each iteration until only one work-item remains active. Both loops do the same thing (in my opinion). With the first one the kernel crashes, and sadly I don't know why. The second loop should not work... but it does.

Many thanks in advance

15 Replies
himanshu_gautam
Grandmaster

The second loop should not work. You are correct.

How many workitems are you using in your workgroup? Is it 64? or something greater than that?

I hope "Minvals" and "Maxvals" are in Local Memory. Please confirm.

for(int offset = local_size/2; offset > 0; offset /= 2)
{
        barrier(CLK_LOCAL_MEM_FENCE);

        if(gid_local < offset)
        {
                Minvals[gid_local] = fmin(Minvals[gid_local], Minvals[gid_local+offset]);
                Maxvals[gid_local] = fmax(Maxvals[gid_local], Maxvals[gid_local+offset]);
        }
}
barrier(CLK_GLOBAL_MEM_FENCE);

This code looks innocent. It should actually work; I don't see any possibility of a race.

Please post the following details for reproducing here

===========================================

Please post a copy of your code (as zip file) so that we can reproduce here.
Please include the following details as well.

1. Platform - win32 / win64 / lin32 / lin64 or some other?

    Win7, Vista, or Win8; similarly for Linux, your distribution

2. Version of driver

3. CPU or GPU Target?

4. CPU/GPU details of your hardware


Thanks,


I can give you the kernel, but the remaining source code is complicated. I need to use the ITK Toolkit (http://www.itk.org/), which wraps many parts of the OpenCL API in its own functions and methods. Creating a test project is very complicated because I rely on the ITK functionality.

I need to parallelize an existing segmentation algorithm (FSL-Fast) with OpenCL. One part of that is finding the minimum and maximum elements of a voxel field.

Here are some parts of the Code:

http://dfiles.eu/files/auyb9avor

Some information:

The global size of the kernel is derived from the image size and the work-group size. A #define statement is then prepended to the kernel source to make the work-group size dynamic and controllable (instead of letting OpenCL pick it). After that the kernel is built, all parameters are set, and the kernel is executed. For example, one of my voxel images is 160 x 256 x 173. With a work-group size of 4x4x4 I get a global size of 160 x 256 x 176 (each dimension rounded up to a multiple of 4). The voxel data is stored in a 1D array, so the index has to be computed as gid_x + gid_y*sizex + gid_z*sizex*sizey.

I have three machines.

1. Notebook:

Dell Vostro 3450

Intel Core I5 2410m

AMD HD6630m

4GB Ram

2. Desktop PC:

AMD Phenom II 945 x4

Nvidia GTS450

4GB RAM

3. Machine at university:

Intel Xeon X5675

AMD Firepro V5900

Nvidia Tesla C2075

48GB Ram

On the notebook, only very simple kernels work. For more information: http://devgurus.amd.com/thread/160282 . It looks like the debugger won't step into any if or for statements. I am using modded drivers (UnifL Leshcat), because the Mobility drivers don't work (CCC does not start). The original Dell drivers are from 2011 and still have the ATI logo integrated (the card is not recognized as an AMD graphics card).

That kernel (the second one, which should not work...) runs correctly on the Tesla machine, but only if the work-group size is 64 or below. The same kernel crashes on my GTS450. Because I consider debugging extremely important and Nvidia's OpenCL tooling is very complicated, an AMD GCN card (HD 7750) is on the way. AMD's CodeXL looks very promising.

I hope this is enough information. I just need to know whether my kernel is workable.


Request you to post your kernel (at least the kernel source) as a ZIP here, and just explain the args to me. I can then build a sample repro case and take this forward.

PS: At the moment, the URL you provided is blocked by my company's Websense. Also, a ZIP will be a permanent record.


Oh, okay.

There you go!


WORKGROUPSIZE is defined to 64. Is this what you use?

That's actually the wavefront size, which probably (not sure) explains why the second code snippet does not hang.

If you increase it to 128 or so, the second one should hang at the barrier.

Anyway, thanks for posting. Will try it out.


You mean that my work-group size is the same as the wavefront size? That is definitely true for AMD hardware; for Nvidia it should be 32. I wanted to run some performance tests varying the local memory and array sizes, in order to find out which implementation is best.

Edit: I've done some additional testing. The kernel that should work does work on the Tesla system, but neither on my notebook (probably driver issues) nor on my desktop (GTS450, ???). I will do more testing after I get my GCN card, then report back.


    barrier(CLK_GLOBAL_MEM_FENCE);

    //global reduction
    if(gid == 0)
    {
        float mint = 256;
        float maxt = 0;

        for(int i = 0; i < group_count; i = i+2)
        {
            mint = fmin(mint, output[i]);
            maxt = fmax(maxt, output[i+1]);
        }

        //write to output -> end
        output[0] = mint;
        output[1] = maxt;
    }

I fear this snippet of code assumes "global synchronization" of all workgroups, which is not possible in OpenCL.

This will not work; it will not give correct results.


Yeah, you are right, which means I need another method to perform the reduction at the global level. That is why the kernel is not working. Thanks a lot.

Is it possible to perform any other global reduction with more than one work-group? It looks like using only one work-group is the key to my problem.

So the Tesla machine returning the right values was just a coincidence. I will rewrite the code.

Thanks a lot again. Now I know what I shouldn't do in the future.


This kernel does not compile, as "group_id" is not declared at all.

Maybe you mean "get_group_id"? But even then, the group id has to be mapped from 3D to 1D.

Can you provide the code for that? (Meanwhile I will fill in my own logic, but I just want to be sure; that's why I am asking.)

The logic I am using is as follows. Please confirm:

        int group_idx = get_group_id(0);
        int group_idy = get_group_id(1);
        int group_idz = get_group_id(2);

        int group_id = group_idz * get_num_groups(0) * get_num_groups(1) +
                       group_idy * get_num_groups(0) +
                       group_idx;


I am able to run your code without any problems on a Tahiti 7970 on a Linux 64 box.

The only problem is the bug(?) in your code that assumes global synchronization of workgroups.

As long as I spawn only 1 workgroup, the code returns the correct result. With more than 1, the code does not give correct output, which is as expected.

Also, I hope you are using the correct drivers for your notebook.

Please download and install the drivers from AMD directly: http://support.amd.com/us/gpudownload/Pages/index.aspx


Hmm, what do you mean by "spawning only one workgroup"? Do you mean I spawn only one work-group that covers the whole array (so there is only one)? If I do that, my device runs out of memory.

It looks like I need a completely different approach, or I let the CPU do the final reduction.


No... what I meant was this:

"

I wrote some host code that allocates memory, builds the program, sets the arguments and calls your kernel.

In this host code, if I spawn only 1 workgroup, your code works fine. The code snippet that you complained about works fine on the 7970 card here.

However, if I spawn multiple workgroups (i.e. increase the image size), the code does not return correct output. This is because of the bug in the code that assumes global synchronization.

"

So, as you rightly said, you can let the CPU do the final reduction, or spawn another kernel with only 1 workgroup that does just the final reduction.

Thanks again. I think all problems are cleared up now.


Glad to know! Good luck!


Yeah, this is right!
