I have an array of signed ints, and I would like to divide by 2^n.

I currently use this routine to do it:

int out = in >= 0 ? int >> n : -((int & 0x7FFFFFFF) >> n);

Is there a better way that does not involve a branch?

I have an array of signed ints, and I would like to divide by 2^n.

I currently use this routine to do it:

int out = in >= 0 ? int >> n : -((int & 0x7FFFFFFF) >> n);

Is there a better way that does not involve a branch?

Checking the OpenCL docs, it says that arithmetic shift is used when the left operand is signed, so that workaround is unnecessary.

http://www.khronos.org/registry/cl/specs/opencl-1.x-latest.pdf#page=143

Thanks, realhet. So, negative numbers are rounded up towards zero?

The spec mentiones setting empty bits, but this is a bit vague>

Hi,

As you know, signed number's are represented by 2's complement method where most significant bit tells sign of the number and set to 1 for negative number. When ">>" is applied to a signed number with -ve value, after the shift operation, it's most significant bit(empty bit) is set to 1 to mark -ve number. After applying ">>" several times or n times for n-bit number, all bits of the number becomes 1 which is nothing but -1 in decimal. So, negative numbers are rounded up towards -1 not 0.

Regards,

Hi,

Indeed that -1 problem... A simlpe >> just fails on it.

So I've checked how efficient are the different methods:

int out = in >= 0 ? in >> n : -(-in >> n); //this is re OP's version repaired a bit to work.

`v_cmp_lt_i32`

`s`

`[`

`0:1`

`],`

`v3`

`,`

`0`

`//v3:in, v0:n`

`s_and_saveexec_b64`

`s`

`[`

`0:1`

`],`

`s`

`[`

`0:1`

`]`

`v_sub_i32`

`v2`

`,`

`vcc`

`,`

`0`

`,`

`v3`

`v_ashr_i32`

`v0`

`,`

`v2`

`,`

`v0`

`v_sub_i32`

`v0`

`,`

`vcc`

`,`

`0`

`,`

`v0`

`tbuffer_store_format_x`

`v0`

`,`

`v1`

`,`

`s`

`[`

`4:7`

`]... result for <0 goes out here`

`s_andn2_b64`

`exec`

`,`

`s`

`[`

`0:1`

`],`

`exec`

`v_ashr_i32`

`v0`

`,`

`v3`

`,`

`v0`

`tbuffer_store_format_x`

`v0`

`,`

`v1`

`,`

`s`

`[`

`4:7`

`]... result for >=0 goes out here`

`this is like 7 cycles (without writing the data out) It's bad because the IF/ELSE`

Then I've found something much better here: c# - Dividing by power of 2 using bit shifting - Stack Overflow

int out = (in + ((in >> 31) & ((1 << n) -1))) >> n;

`v_ashrrev_i32`

`v2`

`,`

`31`

`,`

`v3`

`//v3:in, v0:n`

`v_bfe_u32`

`v2`

`,`

`v2`

`,`

`0`

`,`

`v0`

`v_add_i32`

`v2`

`,`

`vcc`

`,`

`v3`

`,`

`v2`

`v_ashr_i32`

`v0`

`,`

`v2`

`,`

`v0`

`tbuffer_store_format_x`

`v0`

`,`

`v1`

`,`

`s`

`[`

`4:7`

`],`

`... result in v0`

It uses only 4 cycles and it's pure arithmetic.

The CAL compiler does some magic here as it recognises that x & ((1 << n) -1) expression can be implemented with the BitFieldExtract instruction.

I also think of an opportunity to use the highest 16 bit of the 48 bit signed integer multiplier (mul24_hi, or don't know the name in OpenCL). Using that, 16bit results can be produced for integer fixed point division but the inputs must be shifted up. That could affect the previous parts of the program badly and also loses precision.

Thanks, guys.

What about:

int comp = (in >= 0);

int out = ((in >> n) & comp) + ((-in >> n) & ~comp);

No branch, and rounds negative numbers up to zero.

It's incorect:

I guess, the result of (in >= 0) is either 0 or 1. And it already takes 2 clocks. With the correct comp = i>=0 ? -1 : 0 it costs 3.

in>>n, -(-in>>n) is another 3

And the last instruction is a bit-select: (a & c) | (a & ~c) -> 1 clock. So it's not too optimal.

Thanks. You're right, your solution is better in this case.

But in general, to replace

if (a<b)

return x;

else

return y;

you can use:

unsigned int cmp = (a < b);

return (a & cmp) + (b & ~cmp) ;

and this removes the branch.

Well for that specific example, you mentioned there is a shortcut: min() and max() instructions. (btw the GCN hardware has 3 operand min/max too)

And IMO it's quiet dangerous to use opencl operations for masking: "For scalar types, these operators return

`0`if the specified relation is false and`1`if the specified relation is true. For vector types, these operators return`0`if the specified relation is false and`-1`(i.e., all bits set) if the specified relation is true""and this removes the branch." -> You can only be sure when you check the generated assembly code. If you see the s_cbranch or an instruction that alters the EXEC register, then the compiler made divergent codepaths.

Hi,

Indeed that -1 problem... A simlpe >> just fails on it.

So I've checked how efficient are the different methods:

int out = in >= 0 ? in >> n : -(-in >> n); //this is re OP's version repaired a bit to work.

`v_cmp_lt_i32`

`s`

`[`

`0:1`

`],`

`v3`

`,`

`0`

`//v3:in, v0:n`

`s_and_saveexec_b64`

`s`

`[`

`0:1`

`],`

`s`

`[`

`0:1`

`]`

`v_sub_i32`

`v2`

`,`

`vcc`

`,`

`0`

`,`

`v3`

`v_ashr_i32`

`v0`

`,`

`v2`

`,`

`v0`

`v_sub_i32`

`v0`

`,`

`vcc`

`,`

`0`

`,`

`v0`

`tbuffer_store_format_x`

`v0`

`,`

`v1`

`,`

`s`

`[`

`4:7`

`]... result for <0 goes out here`

`s_andn2_b64`

`exec`

`,`

`s`

`[`

`0:1`

`],`

`exec`

`v_ashr_i32`

`v0`

`,`

`v3`

`,`

`v0`

`tbuffer_store_format_x`

`v0`

`,`

`v1`

`,`

`s`

`[`

`4:7`

`]... result for >=0 goes out here`

`this is like 7 cycles (without writing the data out) It's bad because the IF/ELSE`

Then I've found something much better here: c# - Dividing by power of 2 using bit shifting - Stack Overflow

int out = (in + ((in >> 31) & ((1 << n) -1))) >> n;

`v_ashrrev_i32`

`v2`

`,`

`31`

`,`

`v3`

`//v3:in, v0:n`

`v_bfe_u32`

`v2`

`,`

`v2`

`,`

`0`

`,`

`v0`

`v_add_i32`

`v2`

`,`

`vcc`

`,`

`v3`

`,`

`v2`

`v_ashr_i32`

`v0`

`,`

`v2`

`,`

`v0`

`tbuffer_store_format_x`

`v0`

`,`

`v1`

`,`

`s`

`[`

`4:7`

`],`

`... result in v0`

It uses only 4 cycles and it's pure arithmetic.

The CAL compiler does some magic here as it recognises that x & ((1 << n) -1) expression can be implemented with the BitFieldExtract instruction.

I also think of an opportunity to use the highest 16 bit of the 48 bit signed integer multiplier (mul24_hi, or don't know the name in OpenCL). Using that, 16bit results can be produced for integer fixed point division but the inputs must be shifted up. That could affect the previous parts of the program badly and also loses precision.