
diepchess
Adept I

Top 16 bits from mad24 / mul24 in opencl

Hello, 

 

As far as I can see, the OpenCL 1.1 specification as posted on the AMD site does not provide a method to obtain the top 16 bits from a mul24 / mad24, even though the manuals show that the GPU does have such an instruction: MULHI_UINT24, on page 267 of the 6900 series instruction set architecture manual.

 

Right now the GPU can deliver 64 bits of significance per stream core, whereas combined with the top 16 bits it could deliver 96 bits per cycle per stream core. So it would be big progress if this were available.

 

Now there are several ways to solve this problem of obtaining the top 16 bits:

a) the instruction is available in OpenCL and I missed it, or it is not yet documented. Of course that is my hope, as it would solve my problem of doing multiplications faster (I am multiplying multi-precision prime numbers, so I code the multiplication out manually).

b) there is a way to write the OpenCL code in a manner that the compiler, being clever, optimizes to this instruction.

c) other?

 

The best way is of course to add it to OpenCL. I will also email Khronos with this request; in the meantime I am most interested in a solution to this problem. It is quite possible that solution (b) already exists if (a) does not. If so, what does that solution look like?

 

Please note that all my questions are with respect to unsigned integer calculations; to emulate multi-word precision, signed integers are of course not used at all, it is all unsigned arithmetic.

 

Note that to get 64 bits of output per cycle per stream core, one stores 16 bits in an integer and multiplies it with another 16-bit integer, as the resulting output of a mul24/mad24 is 32 bits. You can safely ignore mul24 here as well; it is all mad24, as that of course saves an instruction.
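
To make the limb sizes concrete, here is a minimal sketch (my own helper names, not code from this thread) of the two cases, assuming the operands are already reduced to their limb range and the accumulator leaves enough headroom:

uint partial_16bit(uint a, uint b, uint acc)   // a, b < 2^16
{
    // The 16 x 16 product fits entirely in 32 bits, so mad24 loses nothing
    // as long as the accumulator stays small enough.
    return mad24(a, b, acc);
}

uint partial_24bit_lo(uint c, uint d)          // c, d < 2^24
{
    // Low 32 bits of the 48-bit product.
    return mul24(c, d);
}

uint partial_24bit_hi(uint c, uint d)          // c, d < 2^24
{
    // The top 16 bits that MULHI_UINT24 would deliver directly; today this
    // has to go through the full 32 x 32 mul_hi plus two masks.
    return mul_hi(c & 0xFFFFFF, d & 0xFFFFFF);
}

The whole request here is about collapsing the body of that last helper into a single instruction.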

 

We need this badly to compete against the Nvidia implementation.

0 Likes
12 Replies

diepchess,
There currently is no way to directly access this hardware instruction. If you have a test case that you can provide and believe should be compiled down to this instruction, we will see what we can do to optimize this case.
0 Likes

Originally posted by: MicahVillmow diepchess, There currently is no way to directly access this hardware instruction. If you have a test case that you can provide and believe should be compiled down to this instruction, we will see what we can do to optimize this case.


 

That is a very generous offer, which I of course accept!

Please create intrinsic functions mul24_hi and mad24_hi (or any names of your choosing that follow the naming conventions) that do exactly this and map to the assembler opcode MULHI_UINT24:

gentype mul24_hi (gentype a, gentype b)
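
For clarity, here is a scalar reference emulation of the semantics I am asking for (my own sketch, assuming MULHI_UINT24 returns bits [47:32] of the 48-bit product, i.e. the top 16 bits this thread is about):

uint mul24_hi_emulated(uint a, uint b)
{
    // Only the low 24 bits of each operand take part in the multiply.
    ulong p = (ulong)(a & 0xFFFFFF) * (ulong)(b & 0xFFFFFF);  // 48-bit product
    return (uint)(p >> 32);  // top 16 bits, what the intrinsic would return in one instruction
}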

 

Another correct approach would of course be to write the code in a way from which the compiler can automatically conclude that it may optimize to this instruction. Yet that is not possible in an efficient manner for the top-16-bits instruction. Let me show why, as generic code would look like this:

 

unsigned int tx, ty, tz;
tx = .. & 0xffffff;
ty = .. & 0xffffff;
tz = mul_hi(tx, ty); // or mad_hi

 

Obviously one would be able to generate a MULHI_UINT24 here, yet additionally I would also want the two ANDs removed from the code, as they hurt big time: three instructions instead of one.

 

I realize it is not possible for a generic compiler optimization to conclude that it can remove the ANDs, because only I as the programmer know that the value holds just 24 bits in most cases.

 

A hack would be to define new data types uint24 and uint4_24 which for all multiplications assume a maximum of 24 bits, but not for any other instructions. Then * would automatically map to mul24 and mul_hi would map to MULHI_UINT24. Yet I guess such hacks make things more complex, and I am not a big fan of that (note that finding a solution here is more important than the particular method).

 

So the best is probably to create a new intrinsic function here and also try to get it into the OpenCL standard (which will probably take many months if not longer; if that fails, it is simply an AMD intrinsic), as I realize all too well that every programmer will solve such bit fiddling in a different way and the compiler would need very extensive logic to recognize all of them.

In the long run I am pretty sure that allowing inline assembler is also the way to go, which would be another solution for a lot of other problems, yet I guess that is going to be like designing a space shuttle to deliver a small parcel to Australia.

 

Am I correct that a quick intrinsic, which hopefully makes it into the OpenCL standard, is the best way to get this done?

It's simply something that OpenCL

0 Likes

OK, I edited my post too much and it got messed up; for a few seconds I had a short-lived hope that a MULADDHI_UINT24 would also exist, yet that is not the case.

 

I hope the proposal is clear nevertheless: a new intrinsic, and get it into the OpenCL standard, as it would be easy to implement for CPUs as well.

 

Regards,

Vincent Diepeveen

0 Likes

Once I have it available (since the new possibility will lead to new OpenCL code), I will also e-mail a 72 x 72-bit multiplication and maybe a few other multiplications around 72 bits (and especially a squaring), with code where I would like to call MULHI_UINT24 at specific spots; this code does not exist yet.

 

As it is meant for the public domain in the long run anyway, it is not a problem to share such code publicly.

 

Where can I email it to? Otherwise, drop me an email address I can mail it to.

 

Regards,

Vincent Diepeveen

email: diep@xs4all.nl

skype: diepchess

0 Likes

Originally posted by: MicahVillmow If you have a test case that you can provide and believe should be compiled down to this instruction, we will see what we can do to optimize this case.


Not exactly the same request, but 16-bit multiplies use the 32-bit multiplier instead of the 24-bit one; on Cayman this means using four slots instead of one.

For 24 bit mul high, a new built-in function, please.

For the hardware, if it's possible a cheap 16x16=>32, for completeness

 

ushort a, b, c;
// Initialize a, b
c = a * b;
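
As a possible workaround until the compiler handles this by itself (my own sketch; whether the current compiler actually maps it to the 24-bit multiplier is something to check in SKA): since both operands fit in 16 bits, they are well within the 24-bit limit, so mul24 already computes the exact product.

ushort a, b, c;
// Initialize a, b
c = (ushort)mul24((uint)a, (uint)b);  // low 16 bits of the exact product, as in c = a * b above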

0 Likes

Originally posted by: eduardoschardong
Originally posted by: MicahVillmow If you have a test case that you can provide and believe should be compiled down to this instruction, we will see what we can do to optimize this case.


 

Not exactly the same request, but 16-bit multiplies use the 32-bit multiplier instead of the 24-bit one; on Cayman this means using four slots instead of one.

 

I see you address two points. Please confirm whether I understand them correctly.

Do you mean that Cayman has no dedicated transistors for the 16-bit result of the MULHI_UINT24 instruction and uses the 32 x 32-bit mul_hi logic to emulate the result?

So only 1 out of 4 units is capable of doing this multiplication?

 

Question 2: the 32 x 32 == 32-bit mul_hi instruction needs all 4 PEs to form the result, so you can't pair it with other instructions to obtain the result?

How do you obtain this knowledge?

Do you have the transistor layout of Cayman at hand?

Where can I also take a look at it?

 

For 24 bit mul high, a new built-in function, please.

 

This of course only makes sense when the hardware has transistors to produce these results; if it keeps all 4 PEs busy just to output 16 bits, that is going to be a massive waste of time, except perhaps for a new 22 nm GPU that might release in 2013.

 

For integer calculations we of course need fast multiplication on each PE, or even on 2 PEs to obtain the high bits.

 

For the hardware, if it's possible a cheap 16x16=>32, for completeness

 

0 Likes

For the hardware, if it's possible a cheap 16x16=>32, for completeness

 

 

Well, how to explain this: to emulate calculations with many bits, splitting the numbers into more, smaller pieces also means a massive increase in the number of multiplications.

 

Emulating a 32 x 32 == 64-bit multiply with just 16-bit opcodes takes 4 multiplications and some shifting (see the sketch below). Sure, Karatsuba can do it with 3 multiplications there, but with more complicated shifting.
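
A minimal sketch of that schoolbook decomposition (my own helper, not code from this thread), written with mul24 since each 16 x 16 partial product fits comfortably in 32 bits:

uint2 mul32x32_from_16(uint a, uint b)
{
    uint a_lo = a & 0xFFFF, a_hi = a >> 16;
    uint b_lo = b & 0xFFFF, b_hi = b >> 16;

    // Four 16 x 16 partial products.
    uint p0 = mul24(a_lo, b_lo);   // weight 2^0
    uint p1 = mul24(a_lo, b_hi);   // weight 2^16
    uint p2 = mul24(a_hi, b_lo);   // weight 2^16
    uint p3 = mul24(a_hi, b_hi);   // weight 2^32

    // Recombine with carries into a (low, high) pair of 32-bit words.
    uint lo    = p0 + (p1 << 16);
    uint carry = (lo < p0) ? 1u : 0u;
    lo        += (p2 << 16);
    carry     += (lo < (p2 << 16)) ? 1u : 0u;
    uint hi    = p3 + (p1 >> 16) + (p2 >> 16) + carry;

    return (uint2)(lo, hi);
}

This already shows the shifting and carry handling that comes on top of the four multiplies.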

 

So if there is hardware support for one PE of each stream core to do the 32 x 32-bit calculation for the low bits and one for the high bits, while the two others are free to execute simple instructions, then that is of course a lot faster.

 

If all 4 PEs are blocked and work together to create the 32 x 32 == 64 output, that would be very bad news. Is it possible to get an explanation of how this works from a hardware viewpoint, as the 6900 hardware instruction manual has nothing on it?

 

0 Likes

[ED1980 posted a diagram of the 6900 series ALU capabilities; the image is not preserved here.]

0 Likes

Originally posted by: ED1980

Not such a useful diagram, as it says nothing about high bits and reveals nothing that the 5000 series couldn't do either, except for the '64 bits FMA'.

The question is also what a 64-bit FMA is: whether that is a 64 x 64 == 128-bit multiply with 128-bit output, so the 64 bits *not* being a vector, or whether, being creative, you can say "heh, it's a vector". It says 1x 64-bit multiplication. So if I define a 64-bit unsigned integer, do I get 1 output for it per cycle? Shifting creates complicated code then (unions, huh?).

 

Does it automatically follow that if I pair a multiplication (32 x 32 == 32 low bits) with a mul_hi (32 x 32 == 32 high bits), those also get executed in the same cycle?

 

From the diagram that doesn't look logical, and/or the OpenCL compiler doesn't know this yet for the 6900 series.

 

Right?

 

Regards,

Vincent

 

0 Likes

Sorry, it is late here. The 64-bit FMA is floating point.

 

The main point is that it says nothing about the high bits for integers, neither for 32 x 32 bits nor for MULHI_UINT24, and that is the relevant discussion here, I would argue.

 

Going back to 16 bits: to achieve an 80 x 80-bit multiply (a squaring) == 160 bits takes something like 15 multiplications, and on top of that loads of shifts and adds and a bunch of overflow checks. Soon 50+ instructions (see the sketch below).
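
A rough reference sketch of where that count comes from (my own code, written to make the operation count clear rather than for GPU efficiency): with five 16-bit limbs, a schoolbook squaring needs n(n+1)/2 = 15 distinct partial products for n = 5, before any of the carry handling.

void square80_16bit_limbs(const uint a[5], uint r[10])
{
    ulong acc = 0;
    for (int k = 0; k < 10; ++k) {                 // result column k has weight 2^(16*k)
        for (int i = 0; i < 5; ++i) {
            int j = k - i;
            if (j < 0 || j >= 5) continue;
            if (i < j)
                acc += 2 * ((ulong)a[i] * a[j]);   // symmetric pair, counted once
            else if (i == j)
                acc += (ulong)a[i] * a[i];         // square term
            // i > j is already covered by the i < j case
        }
        r[k] = (uint)(acc & 0xFFFF);               // emit one 16-bit limb of the result
        acc >>= 16;
    }
}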

 

Regards,

Vincent

0 Likes

Originally posted by: diepchess

Do you mean that Cayman has no dedicated transistors for the 16-bit result of the MULHI_UINT24 instruction and uses the 32 x 32-bit mul_hi logic to emulate the result?



No, I mean Cayman has all the necessary hardware to get the low part of a 16 x 16-bit multiplication with just one unit, but instead the compiler emits a 32 x 32-bit multiplication, which uses all four units.

Originally posted by: diepchess

So only 1 out of 4 units is capable of doing this multiplication?



No, all 4 units are used for doing this multiplication.

Originally posted by: diepchess

Question 2: the 32 x 32 == 32-bit mul_hi instruction needs all 4 PEs to form the result, so you can't pair it with other instructions to obtain the result?



Yes.

Originally posted by: diepchess

How do you obtain this knowledge?



I'm sure it's in the manual somewhere...

 

Anyway, when you want to know how your code will be compiled you can use SKA (the Stream KernelAnalyzer); it's simple and fast.

Originally posted by: diepchess

Do you have the transistor layout of Cayman at hand?



Transistor layout? Oh no... I wouldn't understand it anyway

Originally posted by: diepchess

Where can I also take a look at it?



I suppose you can't.

Originally posted by: diepchess

This of course only makes sense when the hardware has transistors to produce these results; if it keeps all 4 PEs busy just to output 16 bits, that is going to be a massive waste of time, except perhaps for a new 22 nm GPU that might release in 2013.



The hardware does support mul24_hi and it only takes one unit; the 16-bit problem is just the compiler generating sub-optimal code.

 

Originally posted by: diepchess

If all 4 PEs are blocked and work together to create the 32 x 32 == 64 output, that would be very bad news. Is it possible to get an explanation of how this works from a hardware viewpoint, as the 6900 hardware instruction manual has nothing on it?



All 4 PEs work together to produce either the high part or the low part; you need 2 cycles with all four units to produce the full 64-bit result.

I suppose that to calculate the high part the low part must also be calculated, so AMD could have included an instruction using all four PEs to produce the full 64-bit result at little cost, but they didn't.
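
In OpenCL terms, the point above looks roughly like this (my own illustration, not code from the compiler output):

ulong full_mul64(uint a, uint b)
{
    uint lo = a * b;          // low 32 bits: one full-width multiply using all four units
    uint hi = mul_hi(a, b);   // high 32 bits: a second full-width multiply
    return ((ulong)hi << 32) | lo;
}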

 

0 Likes

Your answer does not really answer the question I had, as it is ambiguous in all respects. I suppose you do this deliberately.

I am just interested in what the MULHI_UINT24 instruction can mean for me, and if you are not deliberately trying to create confusion, then you cannot see the difference between a question about what the hardware can deliver and what the software implementation (the compiler) actually delivers.

Right now you seem to mix up mul24 with mul_hi with MULHI_UINT24 with 32 x 32 == 32 low bits.

 

Let me create a ticket for this, as this forum discussion is getting nowhere.

Regards,

Vincent

0 Likes