
Ceq
Journeyman III

MAD instruction, RV770 vs RV870

According to Stream KernelAnalyzer, when the target is the Radeon 58xx the generated assembly will take advantage of the hardware MAD capabilities.

Radeon 48xx family also has MAD capabilities, but the IL compiler does not generate the proper instructions to use them.

Is this normal? Is there any reason for this behavior?

0 Likes
10 Replies
genaganna
Journeyman III

Originally posted by: Ceq

According to Stream KernelAnalyzer, when the target is the Radeon 58xx the generated assembly will take advantage of the hardware MAD capabilities.

Radeon 48xx family also has MAD capabilities, but the IL compiler does not generate the proper instructions to use them.

Is this normal? Is there any reason for this behavior?

Ceq,

    Please give us the kernel code which shows this problem.

0 Likes

The simplest example is this trivial kernel:


kernel void
MADD(float a< >, float b< >, float c< >, out float s< > )
{
    s = a * b + c;
}

 

Using Stream KernelAnalyzer you can check that this kernel takes 2 ALU instructions on the RV7xx family, but only 1 ALU instruction on the RV8xx family. According to the documentation in R700-Family_Instruction_Set_Architecture, the RV7xx family should support both MULADD and MULADD_IEEE, so I think that both architectures should be able to take advantage of the hardware MADD capabilities.

 

0 Likes

The CAL IL compiler sometimes inserts mad instructions by itself, but usually it doesn't; you need to use the mad instruction explicitly to force it. Automatic conversion to mads isn't a good idea anyway: mad has different computational properties from a mul followed by an add (a*b+c), and such an automatic conversion could lead to huge problems in more advanced numerical algorithms.

0 Likes

The main reason why this optimization is not done is as hazeman alluded to. According to the OpenCL spec, a * b + c must be correctly rounded, but the mad instruction has no restriction on the precision. Since our hardware mad does not generate a correctly rounded result in all cases, we cannot convert to a mad instruction in OpenCL.
0 Likes

Thanks for your answers, Hazeman and Micah. In this case I was talking about Brook+, but I think it should work in the same way as in OpenCL.

Note that when targeting the RV870, the IL compiler usually generates MAD instructions. In the simple kernel I used above, the code for the Radeon 5xxx series is optimized, but not for the Radeon 4xxx. So I assume there is some kind of bug/deviation in the MULADD instruction that only affects the RV770 and prevents its use by default. Is this correct?

0 Likes

Originally posted by: Ceq Note that when targeting the RV870, the IL compiler usually generates MAD instructions. In the simple kernel I used above, the code for the Radeon 5xxx series is optimized, but not for the Radeon 4xxx. So I assume there is some kind of bug/deviation in the MULADD instruction that only affects the RV770 and prevents its use by default. Is this correct?


I would not think of it as a bug, as it was doing what was specified for RV770 and all previous implementations (actually, nvidia hardware handles it the same way). Cypress simply has more powerful multipliers (besides other improvements), which allow this optimization to be enabled. Cypress can even add to the unrounded intermediate value from the multiplication (which has a 48-bit mantissa), offering higher precision than the "normal" instruction sequence, which rounds to a 24-bit mantissa (the first bit is implied) after the multiplication. This is called a "fused" multiply-add, or fma for short, and is one of the improvements that came with Cypress (but fma is not available on Juniper and the smaller GPUs, as it obviously shares some resources needed for the double-precision support in Cypress).
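To make the rounding difference concrete, here is a small sketch in Python that emulates single precision on the CPU. It is not a model of the actual RV770 or Cypress hardware; the helper names (to_f32, mul_then_add, fused_mad) and the input values are made up for illustration. It only compares "round after the multiply, then add" against "add to the unrounded product, then round once", which is the distinction described above.

```python
import struct
from fractions import Fraction


def to_f32(x):
    """Round x to single precision (goes through a double first;
    that intermediate step is exact for the values used below)."""
    return struct.unpack("f", struct.pack("f", float(x)))[0]


def mul_then_add(a, b, c):
    """Separate mul and add: the product is rounded to 24 bits before the add."""
    product = to_f32(Fraction(a) * Fraction(b))
    return to_f32(Fraction(product) + Fraction(c))


def fused_mad(a, b, c):
    """Fused multiply-add: add to the exact product, round only once."""
    return to_f32(Fraction(a) * Fraction(b) + Fraction(c))


# Hypothetical inputs, chosen so the exact product needs more than 24 bits:
# (1 + 2^-12)^2 = 1 + 2^-11 + 2^-24
a = 1.0 + 2.0 ** -12
c = -(1.0 + 2.0 ** -11)

print("mul + add:", mul_then_add(a, a, c))
print("fused    :", fused_mad(a, a, c))
```

With these inputs the separate sequence loses the 2^-24 term when the product is rounded and returns 0, while the fused version keeps it. Neither result is "wrong" by itself, but they differ, which is why a compiler cannot silently swap one form for the other.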

Changing mul_ieee to mul in the IL makes the compiler combine mul + add into a single mad instruction.

So, the required precision is only guaranteed if you use ieee instructions.

0 Likes

Gaurav, I'm sorry but there is something I couldn't understand clearly. If I don't change that, in the simple kernel above, I get this:

RV770:

3    y: MOV    R1.y, 0.0f
    z: MUL_e    ____, R1.x, R2.x
4    x: ADD    R1.x, R0.x, PV3.z

Cypress:

3    x: MULADD_e R0.x,  R1.x,  R2.x,  R0.x
    y: MOV    R0.y,  0.0f

So certainly, the RV770 is not taking advantage of the MULADD instruction but the RV870 is using it.

Now suppose that I change MUL_IEEE to MUL in the IL code as you said. We get the following code:

RV770 and Cypress:

3    x: MULADD R0.x,  R1.x,  R2.x,  R0.x
    y: MOV    R0.y,  0.0f

But that is not exactly what we want: as Gipsel pointed out, doing so can cost some precision due to internal conversions and IEEE conformance.

Now suppose that I change the MUL_IEEE and the following ADD instruction in the IL file into a single MAD_IEEE. Surprise! Now the IL compiler generates, for both architectures, the same assembly as it did for Cypress in the first case:

RV770 and Cypress:

3    x: MULADD_e R0.x,  R1.x,  R2.x,  R0.x
    y: MOV    R0.y,  0.0f

 

This is the strange part. I thought that there was some bug or issue, affecting only the RV770, that prevented the IL compiler from generating the MULADD_e instruction. But seeing how that small change in the IL code leads to the same code on both architectures, using that instruction, is intriguing.

If this optimization shouldn't be done automatically (as happens on the RV770) due to equivalency issues, then, assuming the assembly instructions have the same behavior on both architectures, I think it should be handled the same way on both.

Is it some kind of bug in the IL compiler, or is it maybe related to the instruction latency on the RV770? Does MULADD_IEEE have the same properties on both architectures? The documentation gives no information beyond the instruction microcode.

 

 

Note:

- If somebody wants to try it, just paste the following kernel into Stream KernelAnalyzer v1.4:

kernel void
MADD(float a< >, float b< >, float c< >, out float s< > )
{
    s = a * b + c;
}

- Now select IL as the output in the object code window and paste the generated IL in the source window again, then modify the following lines:

mul_ieee r273.x___, r269.x000, r270.x000
add r274.x___, r273.x000, r271.x000

- Instead of changing mul_ieee into mul, delete both lines and write:

mad_ieee r274.x___, r269.x000, r270.x000, r271.x000

Done. The generated assembly is now the same on both architectures, with IEEE conformance, and matches the optimized code generated by default for the Cypress architecture.

0 Likes

If this optimization shouldn't be done automatically (as happens on the RV770) due to equivalency issues, then, assuming the assembly instructions have the same behavior on both architectures, I think it should be handled the same way on both.

Is it some kind of bug in the IL compiler, or is it maybe related to the instruction latency on the RV770? Does MULADD_IEEE have the same properties on both architectures? The documentation gives no information beyond the instruction microcode.



This instruction is not equivalent in both architectures as mentioned by Gipsel.

0 Likes

Thanks! My misunderstanding came from the MAD_IEEE instruction (not the normal MAD) doing different things on the two architectures: I thought the IEEE version would dictate not only the behavior of special cases but also the precision of the operation, as it normally does on CPUs.

0 Likes