
eugenek
Journeyman III

Integer division problem

This is a continuation of my thread in the OpenCL subforum.

AMD GPU architecture does not have any instructions that can do integer divisions. That, by itself, is understandable (sometimes you can't have an instruction for everything). So when I try to do an integer division in a GPGPU kernel, in reality the compiler will emit a sequence of low-level instructions to the same effect.

The problem is that the low-level implementation of integer division in the APP SDK is grossly unreasonable. According to my dumps, dividing a 64-bit integer by a 32-bit integer takes 870 instructions, and dividing a 64-bit integer by a 64-bit integer takes 900. There's some branching, so the real cost may be somewhat lower, but even half of 900 is way too much.
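For illustration, even a kernel this trivial (a minimal reproducer I'm sketching here, not the actual benchmark) triggers the whole sequence:

    __kernel void div_test(__global const ulong* a, __global const uint* b,
                           __global ulong* q)
    {
        size_t i = get_global_id(0);
        q[i] = a[i] / b[i];   // this one '/' expands to hundreds of ISA instructions
    }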

The reason I know that it's unreasonable is that NVIDIA, while operating within similar constraints (mostly 32-bit native low-level ops and nothing directly relevant for the division), gets the same task done in just over 70 instructions.

Consequently, once again according to my tests, a Radeon 6870 (1120 processing units @ 900 MHz) can handle about 700M 64-bit integer divisions per second, whereas the nominally slower NVIDIA GTX 470 (448 processing units @ 1215 MHz) can handle 6 billion of the same divisions per second.

I would appreciate it if someone got around to fixing the APP SDK compiler for some future release.

APPENDIX. Low-level code generated for the task on both platforms:

http://www.ideone.com/MXFc9

http://www.ideone.com/roFD3

Can't vouch for the second version, since it's been generated by a third-party disassembler tool.

nou
Exemplar

Try taking a look at native_div().


Originally posted by: nou Try taking a look at native_div().


native_divide() is "only" 195 instructions, but it is inexact.


As further proof that integer division can be done better, here's a simple inline function that does 64-bit by 32-bit division about 5 times faster than the native '/', and manages to do that correctly except perhaps in some corner cases.


ulong do_fast_div(ulong v1, uint v2) {
    ulong rem = (ulong)(((float)v1) / ((float)v2));
    ulong check = rem * v2;
    ulong retval = rem + ((int)(v1 - check)) / ((uint)v2);
    if (v1 < check && retval * v2 != v1) retval -= 1;
    return retval;
}
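For example, usage is just (a hypothetical wrapper kernel, assuming do_fast_div is defined above):

    __kernel void div_all(__global ulong* v, uint d)
    {
        size_t i = get_global_id(0);
        v[i] = do_fast_div(v[i], d);   // roughly 5x faster than v[i] / d here
    }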


Did you try it with really big numbers, ones too big to fit into a float type?


uint64_t will always fit in a float type: a 32-bit float goes up to about 10^38, while uint64_t only goes up to about 1.8 x 10^19. The only issue here is the loss of precision during the int-to-float cast (hence the need for lines 2 through 5).

This code will blow up if v1 - check does not fit into 32 bits, which is possible, I suppose, with v1 larger than 2^55 (~3.6 x 10^16). That problem should be rectifiable at a relatively low cost by using double instead of float.

Edit: but that, of course, presumes that double-precision FP is supported by the GPU, which, according to the OpenCL guide, is not the case for my 6800 ... ugh ... Is that a hardware or a software limitation?


Originally posted by: eugenek uint64_t will always fit in a float type. [...]

float is not able to exactly represent integers greater than 2^24, unless they happen to be multiples of a suitable power of 2.

so your code will not work for part of the integer range.
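For instance (illustrating the point):

    float a = 16777216.0f;   // 2^24: below this, every integer is exactly representable
    float b = a + 1.0f;      // 16777217 has no float representation; rounds back down
    // b == a evaluates to true, so any quotient computed through float inherits this error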


eugenek,
We have to be correct first and fast second. So far, 64-bit division has not been a bottleneck for the majority of developers, so we have focused on things that are bottlenecks. We will add your request to the list of things to improve, but we can't give any timeline on if/when it gets implemented.

OK.

On the same subject, ulong-to-float and float-to-ulong casts aren't done well either (80 and 40 instructions, respectively). The following inlines seem to work correctly, and they take only 11 and 6 instructions:

inline float cast_ulong_to_float(ulong x)
{
    return (float)(uint)(x & 0xFFFFFFFF) + 4294967296. * (float)(uint)(x >> 32);
}

inline ulong cast_float_to_ulong(float x)
{
    float y = x * (1. / 4294967296.);
    uint hiword = (uint)y;
    uint loword = (uint)(x - 4294967296. * (float)hiword);
    return (((ulong)hiword) << 32) | loword;
}
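A quick self-check along these lines (my sketch, not part of the original measurements) compares the inlines against the built-in conversions:

    __kernel void check_casts(__global const ulong* in, __global int* bad)
    {
        size_t i = get_global_id(0);
        ulong x = in[i] >> 1;   // keep x < 2^63 so (float)x cannot round out of ulong range
        float f = (float)x;
        bad[i] = (cast_ulong_to_float(x) != f) ||
                 (cast_float_to_ulong(f) != (ulong)f);
    }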


And what exactly is going on here? Why am I not getting a single instruction (RCP_e) and a store?


__kernel void do_cast(float v1, __global float* v2)
{
    *v2 = 1./v1;
}

; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(32) KCACHE0(CB1:0-15)
      0  x: AND_INT     T0.x,  KC0[0].x,  (0x80000000, -0.0f).x
         z: AND_INT     ____,  KC0[0].x,  (0x807FFFFF, -1.175494211e-38f).y
         w: AND_INT     ____,  KC0[0].x,  (0x7F800000, 0.inff).z
         t: LSHR        R1.x,  KC0[1].x,  (0x00000002, 2.802596929e-45f).w
      1  x: SETE_INT    ____,  PV0.w,  0.0f
         y: SETE_INT    T0.y,  PV0.w,  (0x7F800000, 0.inff).x
         z: OR_INT      ____,  PV0.z,  (0x3F800000, 1.0f).y
         w: SUB_INT     T0.w,  (0x3F800000, 1.0f).y,  PV0.w
      2  y: CNDE_INT    R123.y,  PV1.x,  PV1.z,  T0.x
         w: OR_INT      T1.w,  PV1.y,  PV1.x
      3  x: CNDE_INT    R123.x,  T0.y,  PV2.y,  KC0[0].x
         y: CNDE_INT    T0.y,  PV2.w,  T0.w,  0.0f
      4  x: ASHR        T0.x,  PV3.y,  (0x00000017, 3.222986468e-44f).x
         t: RCP_e       ____,  PV3.x
      5  x: ADD_INT     T1.x,  T0.y,  PS4
         z: AND_INT     T0.z,  (0x80000000, -0.0f).x,  PS4
         w: AND_INT     ____,  (0x7FFFFFFF, 0.nanf).y,  PS4
      6  x: OR_INT      T2.x,  PV5.z,  (0x7F800000, 0.inff).x
         y: ASHR        ____,  PV5.w,  (0x00000017, 3.222986468e-44f).y
      7  z: ADD_INT     ____,  PV6.y,  T0.x
      8  y: SETGE_INT   ____,  0.0f,  PV7.z
         w: SETGE_INT   ____,  PV7.z,  (0x000000FF, 3.573311084e-43f).x
      9  y: CNDE_INT    R123.y,  PV8.y,  T1.x,  T0.z
         z: CNDE_INT    R123.z,  T1.w,  PV8.w,  0.0f
     10  x: CNDE_INT    R0.x,  PV9.z,  PV9.y,  T2.x
01 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R1].x___, R0, ARRAY_SIZE(4) VPM
END_OF_PROGRAM


Because it must compute the memory address where it stores.


Originally posted by: nou Because it must compute the memory address where it stores.


Nope. Compare:


__kernel void do_cast(float v1, __global float* v2)
{
    *v2 = v1;
}

; -------- Disassembly --------------------
00 ALU: ADDR(32) CNT(3) KCACHE0(CB1:0-15)
      0  x: LSHR  R1.x,  KC0[1].x,  (0x00000002, 2.802596929e-45f).x
         t: MOV   R0.x,  KC0[0].x
01 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R1].x___, R0, ARRAY_SIZE(4) VPM


eugenek,
This is simply because the hardware recip does not follow all of the OpenCL requirements. If all you need is the hardware recip, use native_recip.
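In other words, something like this (my example, not from the thread) should compile down to the bare RCP_e plus the store:

    __kernel void do_cast(float v1, __global float* v2)
    {
        *v2 = native_recip(v1);   // hardware reciprocal, no inf/NaN/denormal fixup
    }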

Does it fail the accuracy requirement? What is the accuracy of the hardware reciprocal?


OK, so here's the best I could do in the 64-bit by 32-bit case. (The 64-bit by 64-bit case is very similar.) I believe that my implementation is exact: I've tested the code on a 6970 with 10^9 random input values spread over the valid range and didn't see any mismatch errors. It compiles into 47 VLIW instructions on Cayman and 42 on Barts (vs. 397 and 360 for the default implementation). Performance-wise, on Cayman, it beats the default implementation by about a factor of 4.

A major constraint is that the algorithm is heavy on integer multiplications, and all AMD architectures can do only one integer multiplication per VLIW bundle. (They can do 4 or 5 floating-point multiplications per bundle, but I don't see how that helps me here.) So AMD's nominal advantage in processing units does not apply here, and unless something changes at the hardware level, I'd expect AMD to remain slower than NVIDIA at integer arithmetic.

Edit: but, on second thought, Fermi takes two clocks per integer multiplication, so, theoretically speaking, the 6970 should have a throughput of 384 x 880 MHz = 338G multiplications per second, which should still beat the GTX 470's 448 x 1215 MHz / 2 = 272G multiplications per second. Hrm. Need to take a closer look at the disassemblies...

inline ulong mul_ulong_by_float(ulong v, uint mantissa, uint exponent)
{
    uint2 product;
    uint low_dword;
    product.y = mul_hi(as_uint2(v).y, mantissa);
    product.x = as_uint2(v).y * mantissa;
    low_dword = mul_hi(as_uint2(v).x, mantissa);
    ulong retval = as_ulong(product) + low_dword;
    retval >>= exponent;
    return retval;
}

inline ulong do_fast_div(ulong v1, uint v2)
{
    if (v2 == 1) return v1;

    float inv_v2 = native_recip((float)v2);
    uint mantissa = (as_uint(inv_v2) << 8) | 0x80000000;
    uint exponent = 126 - (as_uint(inv_v2) >> 23);
    mantissa -= 512;                                  // bias the reciprocal low

    ulong div = mul_ulong_by_float(v1, mantissa, exponent);
    v1 -= div * v2;
    ulong retval = div;                               // max error 2^41

    div = mul_ulong_by_float(v1, mantissa, exponent);
    v1 -= div * v2;
    retval += div;                                    // max error 2^19

    uint div32 = mul_hi((uint)v1, mantissa) >> exponent;
    v1 -= div32 * v2;
    retval += div32;                                  // max error 1

    if ((uint)v1 >= v2) retval += 1;
    return retval;
}
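A verification harness along these lines (a reconstruction, not the exact test code) is enough to compare the fast path against the built-in '/':

    __kernel void verify(__global const ulong* v1, __global const uint* v2,
                         __global int* mismatch)
    {
        size_t i = get_global_id(0);
        uint d = v2[i] | 1;                          // keep the divisor nonzero
        if (do_fast_div(v1[i], d) != v1[i] / d)
            mismatch[i] = 1;
    }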


Well, surprise surprise: mul_hi(ulong, ulong) isn't implemented right, either. Why would you possibly use 11 multiplications for that? You need 7, no more and no less.
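For reference, here is a sketch (my code, not what the SDK emits) of the schoolbook decomposition that gets it done in 7: three full 32x32 products plus one lone mul_hi for the low-by-low partial product, whose low half can never carry into the top 64 bits:

    inline ulong mul_hi_64(ulong a, ulong b)
    {
        uint al = as_uint2(a).x, ah = as_uint2(a).y;
        uint bl = as_uint2(b).x, bh = as_uint2(b).y;

        uint  albl_hi = mul_hi(al, bl);                  // mul #1
        ulong albh = upsample(mul_hi(al, bh), al * bh);  // muls #2, #3
        ulong ahbl = upsample(mul_hi(ah, bl), ah * bl);  // muls #4, #5
        ulong ahbh = upsample(mul_hi(ah, bh), ah * bh);  // muls #6, #7

        // sum the bit-32 column first, so its carry-out is accounted for exactly
        ulong mid = (ulong)albl_hi + (albh & 0xFFFFFFFFUL) + (ahbl & 0xFFFFFFFFUL);
        return ahbh + (albh >> 32) + (ahbl >> 32) + (mid >> 32);
    }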

I'm starting to get a feeling that there's a bunch of software people in need of getting fired somewhere.

I do hope I won't need to reimplement ulong+ulong manually, at least.
