9 Replies Latest reply on Jun 30, 2014 10:29 AM by boxerab

    Most efficient way of dividing by power of two

    boxerab

      I have an array of signed ints, and I would like to divide by 2^n.

      I currently use this routine to do it:

       

      int out = in >= 0 ? int >> n : -((int & 0x7FFFFFFF) >> n);

       

      Is there a better way that does not involve a branch?

        • Re: Most efficient way of dividing by power of two
          realhet

          Checking the OpenCL docs, it says that arithmetic shift is used when the left operand is signed, so that workaround is unnecessary.

          http://www.khronos.org/registry/cl/specs/opencl-1.x-latest.pdf#page=143

            • Re: Most efficient way of dividing by power of two
              boxerab

              Thanks, realhet. So, negative numbers are rounded up towards zero?

              The spec mentiones setting empty bits, but this is a bit vague>

                • Re: Most efficient way of dividing by power of two
                  dipak

                  Hi,

                  As you know, signed number's are represented by 2's complement method where most significant bit tells sign of the number and set to 1 for negative number. When ">>" is applied to a signed number with -ve value, after the shift operation, it's most significant bit(empty bit) is set to 1 to mark -ve number. After applying ">>" several times or n times for n-bit number, all bits of the number becomes 1 which is nothing but -1 in decimal. So, negative numbers are rounded up towards -1 not 0.

                   

                  Regards,

                    • Re: Most efficient way of dividing by power of two
                      realhet

                      Hi,

                       

                      Indeed that -1 problem... A simlpe >> just fails on it.

                       

                      So I've checked how efficient are the different methods:

                       

                      int out = in >= 0 ? in >> n : -(-in >> n);  //this is re OP's version repaired a bit to work.

                        v_cmp_lt_i32  s[0:1], v3, 0        //v3:in, v0:n              
                       
                      s_and_saveexec_b64  s[0:1], s[0:1]                       
                       
                      v_sub_i32     v2, vcc, 0, v3                             
                       
                      v_ashr_i32    v0, v2, v0                                 
                       
                      v_sub_i32     v0, vcc, 0, v0                             
                       
                      tbuffer_store_format_x  v0, v1, s[4:7]... result for <0 goes out here
                       
                      s_andn2_b64   exec, s[0:1], exec                         
                       
                      v_ashr_i32    v0, v3, v0                                 
                       
                      tbuffer_store_format_x  v0, v1, s[4:7]... result for >=0 goes out here

                      this is like 7 cycles (without writing the data out) It's bad because the IF/ELSE

                       

                      Then I've found something much better here: c# - Dividing by power of 2 using bit shifting - Stack Overflow

                      int out = (in + ((in >> 31) & ((1 << n) -1))) >> n;

                        v_ashrrev_i32  v2, 31, v3     //v3:in, v0:n                           
                       
                      v_bfe_u32     v2, v2, 0, v0                              
                       
                      v_add_i32     v2, vcc, v3, v2                            
                       
                      v_ashr_i32    v0, v2, v0                                 
                       
                      tbuffer_store_format_x  v0, v1, s[4:7], ... result in v0

                      It uses only 4 cycles and it's pure arithmetic.

                      The CAL compiler does some magic here as it recognises that  x & ((1 << n) -1) expression can be implemented with the BitFieldExtract instruction.

                       

                      I also think of an opportunity to use the highest 16 bit of the 48 bit signed integer multiplier (mul24_hi, or don't know the name in OpenCL). Using that, 16bit results can be produced for integer fixed point division but the inputs must be shifted up. That could affect the previous parts of the program badly and also loses precision.