This is a continuation of my thread in the OpenCL subforum.
The AMD GPU architecture does not have any instructions for integer division. That, by itself, is understandable (you can't have an instruction for everything). So when I do an integer division in a GPGPU kernel, the compiler actually emits a sequence of low-level instructions to the same effect.
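To illustrate what "a sequence of low-level instructions to the same effect" means, here is a minimal sketch of unsigned 64-bit division done purely in software with shifts, compares and subtracts. This is the textbook restoring (shift-and-subtract) method, not what either the AMD or the NVIDIA compiler actually emits, and the function name is made up for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: unsigned 64-bit division without a divide instruction.
 * Restoring division, producing one quotient bit per loop iteration. */
uint64_t udiv64_shift_subtract(uint64_t n, uint64_t d, uint64_t *rem)
{
    assert(d != 0);                 /* division by zero is not handled */

    uint64_t q = 0;                 /* quotient, built from the top bit down */
    uint64_t r = 0;                 /* running remainder, always < d */

    for (int i = 63; i >= 0; --i) {
        /* Remember the bit that falls off the top when r is shifted;
         * this only matters when d is larger than 2^63. */
        uint64_t carry = r >> 63;

        /* Bring down the next bit of the dividend. */
        r = (r << 1) | ((n >> i) & 1u);

        /* If the (possibly 65-bit) remainder is at least d, subtract d
         * and set the corresponding quotient bit. */
        if (carry || r >= d) {
            r -= d;
            q |= (uint64_t)1 << i;
        }
    }

    if (rem)
        *rem = r;
    return q;
}
```

For example, udiv64_shift_subtract(100, 7, &r) returns 14 and leaves 2 in r. On hardware with only 32-bit native ops, every 64-bit shift, compare and subtract above turns into several instructions, and the loop runs 64 times, so a naive routine like this is already expensive. Faster schemes exist, which is presumably where the large differences between compilers come from.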
The problem is that the low-level implementation of integer division in the APP SDK is grossly unreasonable. According to my dumps, dividing a 64-bit integer by a 32-bit integer takes 870 instructions, and dividing a 64-bit integer by a 64-bit integer takes 900 instructions. There's some branching, so it's possible that the real cost will be somewhat lower, but even half of 900 is way too much.
The reason I know that it's unreasonable is that NVIDIA, while operating within similar constraints (mostly 32-bit native low-level ops and nothing directly relevant for the division), gets the same task done in just over 70 instructions.
Consequently, once again according to my tests, a Radeon 6870 (1120 processing units @ 900 MHz) can handle about 700 million 64-bit integer divisions per second, whereas the nominally slower NVIDIA GTX 470 (448 processing units @ 1215 MHz) can handle 6 billion of the same divisions per second.
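As a rough sanity check, the throughput figures can be converted into processing-unit cycles spent per division. The sketch below assumes every processing unit is busy on every clock, which is an idealization, and the function name is hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: processing-unit cycles available per division,
 * assuming all units are fully occupied on every clock cycle. */
double unit_cycles_per_division(double units, double clock_hz,
                                double divisions_per_sec)
{
    assert(divisions_per_sec > 0.0);
    return units * clock_hz / divisions_per_sec;
}
```

With the numbers quoted here, unit_cycles_per_division(1120, 900e6, 700e6) gives 1440 for the Radeon 6870, and unit_cycles_per_division(448, 1215e6, 6e9) gives about 91 for the GTX 470. Both are in the same ballpark as the instruction counts from the dumps (roughly 900 versus just over 70), so the measured gap looks consistent with the difference in generated code rather than with some unrelated bottleneck.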
I would appreciate it if someone got around to fixing the APP SDK compiler for some future release.
APPENDIX. Low-level code generated for the task on both platforms:
Can't vouch for the second version, since it's been generated by a third-party disassembler tool.