This optimization depends on hardware support for the mad24 operation. Not all hardware supports both the signed and unsigned versions. Also, if your program is not ALU bound, you might not see any performance gain.
Thanks for your reply Micah.
The code runs on a HD6970. Yes, it is memory bound - but since I have lots of those a*b+c I thought I might see at least *some* difference. No?
This is not directly related since you have a 6970, but it should be noted anyway.
On the 5800 series, signed mul24(a,b) is turned into (((a<<8)>>8)*((b<<8)>>8)). This makes it noticeably SLOWER than simply using a*b. Unsigned mul24(a,b) uses a native function. mad24 is similar. I made some kernels which just looped the same operation over and over:
signed a * b: 0.9736s
unsigned mul24(a,b): 0.9734s
signed mul24(a,b): 2.2771s
I don't think mul24 should EVER be slower than 32-bit multiplication. The OpenCL spec says that if the inputs don't fit into a 24-bit number, the answer is implementation-defined. Doing all those bit-shifts will only affect the inputs if they are in the "implementation-defined" range, where the answer can't be relied on anyway. If the inputs already fit into a 24-bit number, the shifts will not affect the result and therefore simply waste time.
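To make the argument concrete, here is a rough host-side C sketch of what the shift pair does. The names sext24, in_range24 and mul24_emulated are made up for illustration and are not what the compiler actually emits. The sign extension is written with mask/xor rather than (x<<8)>>8 because left-shifting a negative int is undefined in plain C; the result is meant to be the same. It also assumes the usual two's-complement conversion from uint32_t back to int32_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: sign-extend the low 24 bits of x to 32 bits.
   Intended to match the effect of (x << 8) >> 8 on a signed int. */
int32_t sext24(int32_t x)
{
    int32_t low = (int32_t)((uint32_t)x & 0x00FFFFFFu); /* keep bits 0..23 */
    return (low ^ 0x00800000) - 0x00800000;             /* copy bit 23 upward */
}

/* Hypothetical helper: 1 if x is in the signed 24-bit range [-2^23, 2^23-1]. */
int in_range24(int32_t x)
{
    return x >= -8388608 && x <= 8388607;
}

/* Emulated signed mul24: sign-extend both inputs, then a 32-bit multiply.
   The multiply is done in unsigned to avoid signed-overflow UB; converting
   back to int32_t assumes two's-complement wraparound. */
int32_t mul24_emulated(int32_t a, int32_t b)
{
    return (int32_t)((uint32_t)sext24(a) * (uint32_t)sext24(b));
}

/* Spot check: for in-range inputs the sign extension changes nothing. */
void check_in_range_is_noop(void)
{
    int32_t samples[] = { 0, 1, -1, 12345, -3000, 8388607, -8388608 };
    for (int i = 0; i < 7; ++i) {
        assert(in_range24(samples[i]));
        assert(sext24(samples[i]) == samples[i]);
    }
}
```

For any input that passes in_range24, sext24 returns it unchanged, so mul24_emulated does the same work as a plain 32-bit multiply plus the extra sign-extension ops. Only out-of-range inputs come out different, and those are the implementation-defined case anyway.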
If your code is memory bound, then your ALU is basically free.