15 Replies Latest reply on Feb 13, 2013 7:19 AM by himanshu.gautam

    How to do parallel reduction correctly?



      i am trying to write some kernels, which use local memory and parallel reduction. With this I have some problems.

      My kernel tries to evaluate the min value and the max value of an array.





      This kernel doesn't work. Whenever it runs, it crashes and CL_INVALID_COMMAND_QUEUE is returned by clfinish().

      What I find is very interesting is, if I replace this:

      for(int offset = local_size/2;offset>0;offset/=2)
                              Minvals[gid_local] = fmin(Minvals[gid_local],Minvals[gid_local+offset]);
                              Maxvals[gid_local] = fmax(Maxvals[gid_local],Maxvals[gid_local+offset]);

      By this:

      for(int offset = local_size/2;gid_local<offset;offset/=2)
                      Minvals[gid_local] = fmin(Minvals[gid_local],Minvals[gid_local+offset]);
                      Maxvals[gid_local] = fmax(Maxvals[gid_local],Maxvals[gid_local+offset]);


      The kernel is working and even returns the correct values, although many threads in the same work-group don't reach the barrier-statement in my loop, which shouldn't be possible.


      My idea of parallel reduction is pretty simple. I take the half of my work-group size and use that as an offset for an work-item to access the right elements.

      This offset is halfed by each iteration until there is only one work-item alive. Both for loops are doing the same(in my opinion). If I use the first one, the kernel crashes. Sadly I don't know the reason for this. The second loop should not work...but it works.


      Many thanks in advance

        • Re: How to do parallel reduction correctly?

          The second loop should not work. You are correct.

          How many workitems are you using in your workgroup? Is it 64? or something greater than that?

          I hope "Minvals" and "Maxvals" are in Local Memory. Please confirm.


          1. for(int offset = local_size/2;offset>0;offset/=2)  
          2.         {  
          3.                 barrier(CLK_LOCAL_MEM_FENCE);  
          5.                 if(gid_local<offset)  
          6.                 {  
          7.                         Minvals[gid_local] = fmin(Minvals[gid_local],Minvals[gid_local+offset]);  
          8.                         Maxvals[gid_local] = fmax(Maxvals[gid_local],Maxvals[gid_local+offset]);  
          9.                 }  
          10.         }  
          11.     barrier(CLK_GLOBAL_MEM_FENCE); 


          This code looks innocent. This should actually work. I dont see any possibility of a race.


          Please post the following details for reproducing here


          Please post a copy of your code (as zip file) so that we can reproduce here.
          Please include the following details as well.

          1. Platform - win32 / win64 / lin32 / lin64 or some other?

              Win7 or win vista or Win8.. Similarly for linux, your distribution

          2. Version of driver

          3. CPU or GPU Target?

          4. CPU/GPU details of your hardware


            • Re: How to do parallel reduction correctly?

              I can give you the kernel, but for the remaining source code it is complicated. I need to use ITK Toolkit(http://www.itk.org/), which encapsulates many parts of OpenCL-API in its own functions and methods. Creating an test project is very complicated, because I rely on the ITK functionalities.


              I need to parallelise and existing segmentation algorithm(FSL-Fast) with Opencl.  One part of that is to check the min and max element of a voxel field.


              Here are some parts of the Code:



              Some information:

              According to image size and work-group size the globalSize of the kernel is generated. Then a #define statement is added in front of the kernel, to make work-group size dynamically and controllable(not letting OpenCl determine right work-group size). After that the kernel is build and all parameters are set. Then the Kernel is executed. For example, one of my voxel images is 160, 256, 173 in each direction. With a work-group size of 4,4,4, i get a globalsize of 160, 256, 176. The voxel data is saved in a 1D-array. Therefore it is needed to calculate the index(gid_x + gid_y*sizex + gid_z*sizex*sizez).


              I have three machines.


              1. Notebook:

              Dell Vostro 3450

              Intel Core I5 2410m

              AMD HD6630m

              4GB Ram


              2. Desktop PC:

              AMD Phenom II 945 x4

              Nvidia GTS450

              4GB RAM


              3. Machine at university:

              Intel Xeon X5675

              AMD Firepro V5900

              Nvidia Tesla C2075

              48GB Ram


              One the Notebook very simple kernels only work. For more information: http://devgurus.amd.com/thread/160282 . It looks like debugging won't enter any If oder For-Statements. I am using modded drivers(Unifl Leshcat), because Mobility Drivers don't work(CCC is not starting). The original Dell drivers are from 2011 and still have the ATI logo integrated(is not recognized as AMD graphics card).

              That kernel(the second one, which should not work...) runs correctly on that Tesla-machine, but only if workgroup size is 64 or below. The same kernel crashes on my GTS450. Because I find debugging extremely important and Nvidias approach on OpenCL is very complicated, an AMD GCN(HD 7750) is on the way. AMDs CodeXL still looks very promising.


              I hope this is enough information. I just need to know, if my kernel is workable.