
Integer division problem
nou Jan 30, 2011 12:42 PM (in response to eugenek)try look at native_div()

Integer division problem
eugenek Jan 30, 2011 6:49 PM (in response to nou)Originally posted by: nou try look at native_div()
native_divide() is "only" 195 instructions, but it is inexact.

Integer division problem
eugenek Jan 31, 2011 7:00 AM (in response to eugenek)As further proof that integer division can be done better, here's a simple inline function that does 64bit by 32bit division about 5 times faster than the native '/', and manages to do that correctly except perhaps in some corner cases.
ulong do_fast_div(ulong v1, uint v2) { ulong rem = (ulong) ( ((float)v1)/((float)v2) ); ulong check = rem * v2; ulong retval = rem + ((int)(v1check))/((uint)v2); if(v1<check && retval*v2 != v1) retval = 1; return retval; }

Integer division problem
nou Jan 31, 2011 7:10 AM (in response to eugenek)did you tried it with realy big numbers? which are too big fit into a float type?

Integer division problem
eugenek Jan 31, 2011 7:33 AM (in response to nou)uint64_t will always fit in a float type. 32bit float goes up to 10^38. uint64_t only goes up to 10^19. The only issue here is the loss of precision during the int to float cast (hence the need for lines 2 through 5).
This code will blow up if v1check does not fit into 32 bit. Which is possible, I suppose, with v1 larger than 2^55 (~10^16). That problem should be rectifiable at a relatively low cost by using double instead of float.
Edit: but, that, of course, presumes that doubleprecision FP is supported by the GPU, which is, according to the OpenCL guide, not the case for my 6800 ... ugh ... Is that a hardware or a software limitation?

Integer division problem
genaganna Jan 31, 2011 11:24 AM (in response to eugenek)Originally posted by: eugenek uint64_t will always fit in a float type. 32bit float goes up to 10^38. uint64_t only goes up to 10^19. The only issue here is the loss of precision during the int to float cast (hence the need for lines 2 through 5).
This code will blow up if v1check does not fit into 32 bit. Which is possible, I suppose, with v1 larger than 2^55 (~10^16). That problem should be rectifiable at a relatively low cost by using double instead of float.
Edit: but, that, of course, presumes that doubleprecision FP is supported by the GPU, which is, according to the OpenCL guide, not the case for my 6800 ... ugh ... Is that a hardware or a software limitation?
float is not able to represent integers which are greater than 2 ^ 24 and not a power of 2.
so your code will not work for range of integers.






Integer division problem
MicahVillmow Jan 31, 2011 1:38 PM (in response to eugenek)eugenek,
We have to be correct first and fast second. So far 64bit division has not been a bottleneck for a majority of the developers, so we have focused on things that are bottlenecks. We will add your request to the list of things to improve, but we can't give any timeline on if/when it gets implemented.
Integer division problem
eugenek Feb 3, 2011 7:52 AM (in response to MicahVillmow)OK,
On the same subject, ulong to float and float to ulong casts aren't done well either (80 and 40 instructions, respectively). The following inlines seem to work correctly, and they take only 11 and 6 instructions.inline float cast_ulong_to_float(ulong x) { return (float)(uint)(x & 0xFFFFFFFF) + 4294967296.*(float)(uint)(x>>32); } inline ulong cast_float_to_ulong(float x) { float y = x * (1./4294967296.); uint hiword = (uint)y; uint loword = (uint)(x4294967296.*(float)hiword); return (((ulong)hiword)<<32)loword; }

Integer division problem
eugenek Feb 3, 2011 7:55 AM (in response to eugenek)And what exactly is going on here? Why am I not getting a single instruction (RCP_e) and a store?
__kernel void do_cast(float v1, __global float* v2) { *v2 = 1./v1; } ;  Disassembly  00 ALU: ADDR(32) CNT(32) KCACHE0(CB1:015) 0 x: AND_INT T0.x, KC0[0].x, (0x80000000, 0.0f).x z: AND_INT ____, KC0[0].x, (0x807FFFFF, 1.175494211e38f).y w: AND_INT ____, KC0[0].x, (0x7F800000, 0.inff).z t: LSHR R1.x, KC0[1].x, (0x00000002, 2.802596929e45f).w 1 x: SETE_INT ____, PV0.w, 0.0f y: SETE_INT T0.y, PV0.w, (0x7F800000, 0.inff).x z: OR_INT ____, PV0.z, (0x3F800000, 1.0f).y w: SUB_INT T0.w, (0x3F800000, 1.0f).y, PV0.w 2 y: CNDE_INT R123.y, PV1.x, PV1.z, T0.x w: OR_INT T1.w, PV1.y, PV1.x 3 x: CNDE_INT R123.x, T0.y, PV2.y, KC0[0].x y: CNDE_INT T0.y, PV2.w, T0.w, 0.0f 4 x: ASHR T0.x, PV3.y, (0x00000017, 3.222986468e44f).x t: RCP_e ____, PV3.x 5 x: ADD_INT T1.x, T0.y, PS4 z: AND_INT T0.z, (0x80000000, 0.0f).x, PS4 w: AND_INT ____, (0x7FFFFFFF, 0.nanf).y, PS4 6 x: OR_INT T2.x, PV5.z, (0x7F800000, 0.inff).x y: ASHR ____, PV5.w, (0x00000017, 3.222986468e44f).y 7 z: ADD_INT ____, PV6.y, T0.x 8 y: SETGE_INT ____, 0.0f, PV7.z w: SETGE_INT ____, PV7.z, (0x000000FF, 3.573311084e43f).x 9 y: CNDE_INT R123.y, PV8.y, T1.x, T0.z z: CNDE_INT R123.z, T1.w, PV8.w, 0.0f 10 x: CNDE_INT R0.x, PV9.z, PV9.y, T2.x 01 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R1].x___, R0, ARRAY_SIZE(4) VPM END_OF_PROGRAM

Integer division problem
nou Feb 3, 2011 8:04 AM (in response to eugenek)becuase it must compute memory address where it store.

Integer division problem
eugenek Feb 3, 2011 8:09 AM (in response to nou)Originally posted by: nou becuase it must compute memory address where it store.
Nope. Compare:
__kernel void do_cast(float v1, __global float* v2) { *v2 = v1; } ;  Disassembly  00 ALU: ADDR(32) CNT(3) KCACHE0(CB1:015) 0 x: LSHR R1.x, KC0[1].x, (0x00000002, 2.802596929e45f).x t: MOV R0.x, KC0[0].x 01 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R1].x___, R0, ARRAY_SIZE(4) VPM





Integer division problem
MicahVillmow Feb 3, 2011 2:33 PM (in response to eugenek)eugenek,
This is simply because the hardware recip does not follow all of the OpenCL requirements. If you have only the hardware recip, use native_recip.
Integer division problem
eugenek Feb 3, 2011 6:28 PM (in response to MicahVillmow)Does it fail the accuracy requirement? What is the accuracy of hardware reciprocal?

Integer division problem
eugenek Feb 7, 2011 3:14 AM (in response to eugenek)OK, so here's the best I could do in the 64 bit by 32 bit case. (The 64 bit by 64 bit case is very similar.) I believe that my implementation is exact. I've tested the code on a 6970 with 10^9 random input values all over the valid range, and didn't see any mismatch errors. It compiles into 47 VLIW instructions on Cayman and 42 VLIW instructions on Barts. (vs. 397 and 360 for the default implementation.) Performance wise, on Cayman, it beats the default implementation by about the factor of 4.
A major constraint is that the algorithm is heavy in integer multiplications, and all AMD architectures can only do one integer multiplication per VLIW. (AMD's can do 4 or 5 floatingpoint multiplications per VLIW, but I don't see how that helps me here.) Therefore, AMD's nominal advantage in processing units does not apply here. Unless something changes at the hardware level, I'd expect AMD to remain slower than NVIDIA in integer arithmetics.
Edit: but, on the second thought, Fermi takes two clocks per integer multiplication, so, theoretically speaking, the 6970 should have the throughput of 384 x 880 = 338G multiplications per second, which should still beat NVIDIA 470's 448 x 1215 / 2 = 272G multiplications per second. Hrm. Need to take a closer look at disassemblies...
inline ulong mul_ulong_by_float(ulong v, uint mantissa, uint exponent) { uint2 product; uint low_dword; product.y = mul_hi(as_uint2(v).y, mantissa); product.x = as_uint2(v).y*mantissa; low_dword = mul_hi(as_uint2(v).x, mantissa); ulong retval = as_ulong(product) + low_dword; retval >>= exponent; return retval; } inline ulong do_fast_div(ulong v1, uint v2) { if(v2==1) return v1; float inv_v2 = native_recip((float)v2); uint mantissa = (as_uint(inv_v2) << 8)  0x80000000; uint exponent = 126  (as_uint(inv_v2) >> 23) ; mantissa = 512; ulong div = mul_ulong_by_float(v1, mantissa, exponent); v1 = div*v2; ulong retval = div; // max error 2^41 div = mul_ulong_by_float(v1, mantissa, exponent); v1 = div*v2; retval += div; // max error 2^19 uint div32 = mul_hi((uint)v1, mantissa) >> exponent; v1 = div32*v2; retval += div32; // max error 1 if((uint)v1 >= v2) retval += 1; return retval; }

Integer division problem
eugenek Feb 7, 2011 9:10 AM (in response to eugenek)Well, surprise surprise. mul_hi(ulong,ulong) isn't implemented right, either. Why would you possibly use 11 multiplications to do that? You need 7, no more and no less.
I'm starting to get a feeling that there's a bunch of software people in need of getting fired somewhere.
I do hope I won't need to reimplement ulong+ulong manually, at least.


