I have a physical problem that can be solved almost entirely by doing massive amounts of bitwise operations: roughly 1e13 iterations, each doing roughly 1e14 bitwise operations, all driven by a single random number. I already have GPU-accelerated code using OpenCL; however, I have seen that 24-bit integer operations are significantly faster than their 32-bit counterparts.
What is a general approach to hint to the compiler that I wish to do all my calculations on vectors of 24-bit integers? If I try extracting bits from a 32-bit integer (either with shifting or masking) I am bound to using 32-bit operations, and my guess is that the compiler cannot help me in this regard.
Since the problem at hand is pathological in the sense that roughly 5 assembly operations dominate 99% of the runtime, I'm looking to optimize this part to its fullest, having every bit of every ASM operation do useful work. I am somewhat free to rearrange my data if necessary.
The data is essentially a two-dimensional surface of zeros and ones, stored bit-coded. I randomly access small 4-bit parts of this surface and modify their values based on random numbers.
How can I get the compiler to only use 24-bit operations?
I believe that only 32-bit multiply instructions are slow; all other 32-bit operations execute in a single clock (on average), which is as fast as it gets. A 24-bit multiply can be invoked in OpenCL with the x = mul24(x, y) function.
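To make the semantics concrete, here is a minimal host-side C sketch of what the unsigned form of mul24 computes when both operands are in range: only the low 24 bits of each operand are used and the low 32 bits of the product are returned. The helper name mul24_u is hypothetical and exists only for this illustration. In OpenCL itself the result is implementation-defined when an operand falls outside [0, 2^24-1], so the masking below is a modelling assumption, not guaranteed device behavior.

```c
#include <stdint.h>

/* Hypothetical host-side model of OpenCL's unsigned mul24.
 * Assumption: out-of-range operands are simply masked to 24 bits;
 * on a real device that case is implementation-defined. */
static uint32_t mul24_u(uint32_t x, uint32_t y)
{
    uint32_t a = x & 0xFFFFFFu;   /* keep the low 24 bits of x */
    uint32_t b = y & 0xFFFFFFu;   /* keep the low 24 bits of y */
    return a * b;                 /* unsigned wraparound keeps the low 32 bits */
}
```

Note that this only concerns multiplication; shifts, masks, and other bitwise operations do not have a 24-bit variant of this kind.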
Do you have access to assembly code? Have you tried the bit field extract / insert instructions?
They can also be quite fast.
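For reference, a bit-field extract and insert on a 32-bit word can be sketched in plain C as below, which matches the 4-bit accesses described in the question. The helper names bfe and bfi are hypothetical, and both assume 0 < len < 32 and pos + len <= 32. Whether the compiler turns these shift-and-mask expressions into a single hardware bit-field instruction depends on the target, so that is something to check in the generated assembly rather than assume.

```c
#include <stdint.h>

/* Extract `len` bits starting at bit `pos` (hypothetical helper).
 * Assumes 0 < len < 32 and pos + len <= 32. */
static uint32_t bfe(uint32_t word, unsigned pos, unsigned len)
{
    return (word >> pos) & ((1u << len) - 1u);
}

/* Overwrite `len` bits of `word` at bit `pos` with the low bits of `field`
 * (hypothetical helper, same assumptions as bfe). */
static uint32_t bfi(uint32_t word, uint32_t field, unsigned pos, unsigned len)
{
    uint32_t mask = ((1u << len) - 1u) << pos;     /* bits to replace */
    return (word & ~mask) | ((field << pos) & mask);
}
```

With these, reading a 4-bit cell at bit offset 8 is bfe(w, 8, 4), and writing it back is bfi(w, v, 8, 4).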