As far as I can see, the OpenCL 1.1 specification as posted on the AMD site does not provide a way to obtain the top 16 bits of a mul24 / mad24 product. Yet the manuals show that the GPU does have this instruction available: MULHI_UINT24, on page 267 of the 6900 series instruction set architecture manual.
Right now the GPU can deliver 64 significant bits per cycle per streamcore, whereas combined with the top 16 bits it could deliver 96 bits per cycle per streamcore. So it would be big progress if this were available.
There are a couple of ways the problem of obtaining the top 16 bits could be solved:
a) The instruction is already available in OpenCL, but I missed it or it is not yet documented. Of course that's my hope, as it would solve my problem of doing multiplications faster (I'm multiplying multi-bit prime numbers, so I code the multiplication out by hand).
b) There is a way to write OpenCL code such that the compiler, being a clever compiler, optimizes it to this instruction.
The best way is of course to add it to OpenCL. I'll also email Khronos with this request; in the meantime I'm most interested in a solution to this problem. It's quite possible that solution b) already exists if a) doesn't. If so, what is it?
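To make the request concrete, here is a minimal sketch in plain C of the split I am after. The helper names mul24_lo, mul24_hi16 and mul24_full are made up for illustration; they only emulate the arithmetic on the host with a 64-bit multiply and say nothing about what the OpenCL compiler actually emits. On the device, the closest standard built-in I know of is mul_hi(a, b), which for unsigned operands below 2^24 returns the same bits 32..47 of the product. Whether the compiler lowers that to MULHI_UINT24 is an assumption I cannot confirm.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: emulates unsigned mul24, i.e. the low 32 bits
   of the 48-bit product of two 24-bit operands. */
static uint32_t mul24_lo(uint32_t a, uint32_t b) {
    uint64_t p = (uint64_t)(a & 0xFFFFFFu) * (uint64_t)(b & 0xFFFFFFu);
    return (uint32_t)p;
}

/* Hypothetical helper: the missing piece, bits 32..47 of the same
   48-bit product (the "top 16 bits"). */
static uint32_t mul24_hi16(uint32_t a, uint32_t b) {
    uint64_t p = (uint64_t)(a & 0xFFFFFFu) * (uint64_t)(b & 0xFFFFFFu);
    return (uint32_t)(p >> 32);
}

/* Recombine both halves into the full 48-bit product. */
static uint64_t mul24_full(uint32_t a, uint32_t b) {
    assert(a <= 0xFFFFFFu && b <= 0xFFFFFFu); /* operands must fit in 24 bits */
    return ((uint64_t)mul24_hi16(a, b) << 32) | mul24_lo(a, b);
}
```

For example, 0xFFFFFF * 0xFFFFFF = 0xFFFFFE000001, so the low part is 0xFE000001 and the top 16 bits are 0xFFFF.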
Please note that all my questions concern unsigned integer calculations; signed integers are of course not used at all when emulating multi-bit precision.
Note that to get 64 bits of output per cycle per streamcore, one stores 16 bits in an integer and multiplies it by another 16-bit integer, as the output of a mul24/mad24 is 32 bits. You can safely ignore mul24 here as well; it's all mad24, as that of course saves an instruction.
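The 16-bit scheme described above can be sketched on the host in plain C as follows. This is only an illustration under my own assumptions: mad24_u emulates the unsigned mad24 arithmetic, and mul_limbs16 is a made-up helper that multiplies a number stored as little-endian 16-bit limbs (one limb per 32-bit integer) by a single 16-bit limb. It is not the actual kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emulates unsigned mad24: low 32 bits of (a * b + c), where only the
   low 24 bits of a and b take part in the multiplication. */
static uint32_t mad24_u(uint32_t a, uint32_t b, uint32_t c) {
    uint64_t p = (uint64_t)(a & 0xFFFFFFu) * (uint64_t)(b & 0xFFFFFFu);
    return (uint32_t)(p + c);
}

/* Hypothetical helper: out = x * m.
   x has n limbs of 16 bits each (least significant first), m is one
   16-bit limb, out must have room for n + 1 limbs. */
static void mul_limbs16(const uint32_t *x, size_t n, uint32_t m, uint32_t *out) {
    uint32_t carry = 0;
    assert(m <= 0xFFFFu);
    for (size_t i = 0; i < n; i++) {
        assert(x[i] <= 0xFFFFu);
        /* Worst case 0xFFFF * 0xFFFF + 0xFFFF = 0xFFFF0000, which still
           fits in 32 bits, so one mad24 covers multiply plus carry-in. */
        uint32_t t = mad24_u(x[i], m, carry);
        out[i] = t & 0xFFFFu; /* low 16 bits stay in this limb */
        carry = t >> 16;      /* high 16 bits move to the next limb */
    }
    out[n] = carry;
}
```

Each mad24 here yields only 16 new result bits per limb plus a 16-bit carry, which is why access to the top 16 bits of a full 24 x 24 product would allow wider limbs.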
We need this badly to compete against the Nvidia implementation.