
diepchess
Adept I

Top 16 bits from mad24 / mul24 in opencl

Sorry, it is late here. The 64-bit FMA is floating point.

 

The main point is that it says nothing about the high bits for integers,

neither for 32 bits * 32 bits nor for MULHI_UINT24.

And that is the relevant discussion here, I'd argue.

 

Going back to 16 bits: an 80-bit * 80-bit multiply (a square) giving 160 bits

takes something like 15 multiplications, plus loads of shifts and adds

and a bunch of overflow checks. That is soon 50+ instructions.
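
The count of 15 can be checked with a small host-side sketch in plain C (not OpenCL and not Cayman ISA). It assumes the 80-bit value is split into five 16-bit limbs, least significant first; the name square80 and the limb layout are illustrative choices, not something from this thread. Five diagonal products plus ten doubled cross products give the 15 multiplications, and the carry loop stands in for the shifts, adds and overflow handling.

```c
#include <stdint.h>

#define LIMBS 5  /* 5 limbs x 16 bits = 80 bits (assumed layout) */

/* Squares an 80-bit value held as five 16-bit limbs, least significant first.
   Writes the 160-bit result into ten 16-bit limbs and returns the number of
   16 x 16 multiplications that were performed. */
static int square80(const uint16_t a[LIMBS], uint16_t out[2 * LIMBS])
{
    uint64_t col[2 * LIMBS] = {0};  /* wide per-column accumulators */
    int muls = 0;

    for (int i = 0; i < LIMBS; i++) {
        /* Diagonal term a[i] * a[i], counted once. */
        uint32_t sq = (uint32_t)a[i] * a[i];
        muls++;
        col[2 * i] += sq;

        for (int j = i + 1; j < LIMBS; j++) {
            /* Cross term a[i] * a[j]; it appears twice in the square. */
            uint32_t cross = (uint32_t)a[i] * a[j];
            muls++;
            col[i + j] += 2 * (uint64_t)cross;
        }
    }

    /* Propagate carries from each column into the next 16-bit limb. */
    uint64_t carry = 0;
    for (int k = 0; k < 2 * LIMBS; k++) {
        uint64_t t = col[k] + carry;
        out[k] = (uint16_t)(t & 0xFFFF);
        carry = t >> 16;
    }
    return muls;
}
```

On a GPU the wide accumulators would not be available for free, which is where the extra shifts, adds and overflow checks mentioned above come from.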

 

Regards,

Vincent

eduardoschardong
Journeyman III

Originally posted by: diepchess

Do you mean that the Cayman has no special transistors for the 16-bit result of the MULHI_UINT24 instruction and uses the 32 x 32 bits mul_hi logic to emulate the result in a virtual manner?



No, I mean Cayman has all the necessary hardware to get the low part of a 16x16-bit multiplication with just one unit, but instead the compiler emits a 32x32-bit multiplication, which uses all four units.

Originally posted by: diepchess

So only 1 out of 4 units is capable of doing this multiplication?



No, all 4 units are used for doing this multiplication.

Originally posted by: diepchess

Question 2: the 32 x 32 bits == 32 bits mul_hi command needs all 4 PEs to form the result, so you can't pair this instruction with other instructions to obtain the result?



Yes.

Originally posted by: diepchess

How do you obtain this knowledge?



I'm sure it's in the manual somewhere...

 

Anyway, when you want to know how a piece of code will be compiled, you can use SKA; it's simple and fast.

Originally posted by: diepchess

You have the transistor layout at hand of the Cayman?



Transistor layout? Oh no... I wouldn't understand it anyway.

Originally posted by: diepchess

Where can i also take a look there?



I suppose you can't.

Originally posted by: diepchess

This of course only makes sense when the hardware has transistors to obtain these results; if it keeps all 4 PEs busy just to output 16 bits, that's going to be a massive waste of time, except for a new 22 nm GPU that will maybe be released in 2013.



The hardware does support mul24_hi, and it only takes one unit; the 16-bit problem is just the compiler generating sub-optimal code.
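
To make the terms concrete, here is a host-side sketch in plain C of the arithmetic being discussed. It assumes that a 24 x 24-bit unsigned multiply yields a 48-bit product, that the low half is bits 0..31 and that the "top 16 bits" are bits 32..47. The helper names mul24_lo and mul24_hi are illustrative only; this emulates the arithmetic and says nothing about how many units the hardware uses.

```c
#include <stdint.h>

/* Low 32 bits of the 48-bit product of two 24-bit unsigned operands.
   Operands are masked to 24 bits, as a mul24-style operation would assume. */
static uint32_t mul24_lo(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)(a & 0xFFFFFF) * (b & 0xFFFFFF);
    return (uint32_t)p;
}

/* High 16 bits (bits 32..47) of the same 48-bit product. */
static uint32_t mul24_hi(uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)(a & 0xFFFFFF) * (b & 0xFFFFFF);
    return (uint32_t)(p >> 32);
}
```

For the largest operands, 0xFFFFFF * 0xFFFFFF = 0xFFFFFE000001, so the high part is 0xFFFF and the low part is 0xFE000001.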

 

Originally posted by: diepchess

If all 4 PEs are blocked while they work together to create the 32 x 32 == 64 output, that would be very bad news. Is it possible to get an explanation of how this works from a hardware viewpoint, as the 6900 hardware instruction manual has nothing on it?



All 4 PEs work together to produce either the high part or the low part; you need 2 cycles with all four units to produce the full 64-bit result.

I suppose that to calculate the high part the low part must also be calculated, so AMD could have included an instruction using all four PEs to produce the full 64-bit result at little cost, but they didn't.
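
The two-step scheme described here can be sketched on the host in plain C: one operation for the low 32 bits, one mul_hi-style operation for the high 32 bits, then the two halves are joined. The helper names mul_lo32, mul_hi32 and mul_full64 are illustrative assumptions, and the 64-bit intermediate is only there to emulate what the hardware computes.

```c
#include <stdint.h>

/* Low 32 bits of a 32 x 32-bit unsigned product (an ordinary wrapping multiply). */
static uint32_t mul_lo32(uint32_t a, uint32_t b)
{
    return a * b;
}

/* High 32 bits of the same product, emulating a mul_hi-style operation. */
static uint32_t mul_hi32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Full 64-bit product assembled from the two separately computed halves. */
static uint64_t mul_full64(uint32_t a, uint32_t b)
{
    return ((uint64_t)mul_hi32(a, b) << 32) | mul_lo32(a, b);
}
```

Both halves come from the same underlying product, which is the reason a single instruction returning both would seem cheap to provide.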

 

diepchess
Adept I

Your answer does not answer the question I had, as it is ambiguous in all respects. I suppose you do this deliberately.

I'm just interested in what the instruction MULHI_UINT24 can mean for me. If you did not try to create confusion deliberately, then you cannot see the difference between a question about what the hardware can deliver and what the software implementation (the compiler) actually delivers.

Right now you seem to mix up mul24, mul_hi, MULHI_UINT24 and the 32 x 32 == 32 low bits multiply.

 

Let me create a ticket for this, as this forum thread is getting nowhere.

Regards,

Vincent
