mad_hi(uint) slower than mul_hi + addition

Question asked by Bdot on Oct 4, 2012
Latest reply on Oct 16, 2012 by realhet

Hi,

 

I'm working with 96-bit unsigned integers, each stored as three 32-bit uints, and I process several of them in parallel in a vector:

 

#define CON2(a,b) a##b
#define CONC(a,b) CON2(a,b)
#define VECTOR_SIZE 2
// #define VECTOR_SIZE 4
#define AS_UINT_V CONC(as_uint, VECTOR_SIZE)
#define uint_v CONC(uint, VECTOR_SIZE)  // e.g. uint2, used for temporaries

typedef struct _int96_t
{
  CONC(uint, VECTOR_SIZE) d0,d1,d2;  // e.g. uint2 d0,d1,d2; d0 is the least significant word
}int96_t;

 

Now I have a function that calculates the lower half (the low 96 bits) of the product of two such 96-bit vectors:

 

void mul_96(int96_t * const res, const int96_t a, const int96_t b)
/* res = a * b (low 96 bits only) */
{
  __private uint_v tmp;

  res->d0  = a.d0 * b.d0;
  res->d1  = mul_hi(a.d0, b.d0);

  res->d2  = mul_hi(a.d1, b.d0);

  tmp = a.d1 * b.d0;
  res->d1 += tmp;
  res->d2 += AS_UINT_V((tmp > res->d1)? 1 : 0);  // carry: d1 wrapped if it is now smaller than tmp

  res->d2 += mul_hi(a.d0, b.d1);

  tmp = a.d0 * b.d1;
  res->d1 += tmp;
  res->d2 += AS_UINT_V((tmp > res->d1)? 1 : 0);  // carry

  // only the low 32 bits of these products fit into the 96-bit result
  res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
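To check the limb arithmetic on the host, here is a minimal scalar C sketch of the same multiply for a single vector lane. It is only an illustration and not the kernel code: u96_ref, mul_lo_ref, mul_hi_ref, mul_96_ref and mul_96_ref_selfcheck are hypothetical names, and mul_hi is emulated with a 64-bit product.

```c
#include <assert.h>
#include <stdint.h>

/* One scalar lane of the 96-bit value; d0 is the least significant limb. */
typedef struct { uint32_t d0, d1, d2; } u96_ref;

/* Low 32 bits of a * b (what the kernel's plain '*' on uint gives). */
static uint32_t mul_lo_ref(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)a * b);
}

/* Host-side stand-in for OpenCL mul_hi on uint: high 32 bits of a * b. */
static uint32_t mul_hi_ref(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Low 96 bits of a * b, in the same order of operations as mul_96 above. */
static u96_ref mul_96_ref(u96_ref a, u96_ref b)
{
    u96_ref r;
    uint32_t tmp;

    r.d0 = mul_lo_ref(a.d0, b.d0);
    r.d1 = mul_hi_ref(a.d0, b.d0);
    r.d2 = mul_hi_ref(a.d1, b.d0);

    tmp = mul_lo_ref(a.d1, b.d0);
    r.d1 += tmp;
    r.d2 += (tmp > r.d1) ? 1u : 0u;   /* carry out of d1 */

    r.d2 += mul_hi_ref(a.d0, b.d1);

    tmp = mul_lo_ref(a.d0, b.d1);
    r.d1 += tmp;
    r.d2 += (tmp > r.d1) ? 1u : 0u;   /* carry out of d1 */

    /* Only the low 32 bits of these products land inside 96 bits. */
    r.d2 += mul_lo_ref(a.d0, b.d2) + mul_lo_ref(a.d1, b.d1)
          + mul_lo_ref(a.d2, b.d0);
    return r;
}

/* Two products whose low 96 bits are easy to work out by hand. */
static void mul_96_ref_selfcheck(void)
{
    u96_ref ones = { 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu };
    u96_ref low  = { 0xFFFFFFFFu, 0u, 0u };
    u96_ref r;

    r = mul_96_ref(ones, ones);   /* (2^96 - 1)^2 mod 2^96 == 1 */
    assert(r.d0 == 1u && r.d1 == 0u && r.d2 == 0u);

    r = mul_96_ref(low, low);     /* (2^32 - 1)^2 == 0xFFFFFFFE00000001 */
    assert(r.d0 == 1u && r.d1 == 0xFFFFFFFEu && r.d2 == 0u);
}
```

Comparing kernel output against such a reference for a few inputs per lane should catch a wrong carry before any timing is compared.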

 

In order to optimize performance, I tried to use mad_hi instead of mul_hi and an addition:

 

void mul_96(int96_t * const res, const int96_t a, const int96_t b)
/* res = a * b */
{
  __private uint_v tmp;

  res->d0  = a.d0 * b.d0;

  tmp = a.d1 * b.d0;
  res->d1  = mad_hi(a.d0, b.d0, tmp);

  res->d2  = mad_hi(a.d1, b.d0, AS_UINT_V((tmp > res->d1)? 1 : 0));
  res->d2  = mad_hi(a.d0, b.d1, res->d2);

  tmp = a.d0 * b.d1;
  res->d1 += tmp;
  res->d2 += AS_UINT_V((tmp > res->d1)? 1 : 0);

  res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}

 

However, this second function is between 2% and 15% slower in various kernels, probably depending on the code surrounding the function call. This is with Catalyst 12.6 on an HD5770, on both Win64 and Linux64. I just updated to 12.8, but that makes no difference.

 

I already found out that mad_hi will be translated (for example) to

33  t: MULHI_UINT  ____,  R10.x,  R2.w
34  x: ADD_INT ____,  T0.z,  PS33

But then it should be the same as if I used mul_hi and + myself, so why is it so much slower? In other places that are executed just once per kernel, I noticed big differences as well: sometimes mad_hi makes it a bit faster, sometimes much slower than mul_hi plus addition.
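To make the expected equivalence concrete: the OpenCL specification defines mad_hi(a, b, c) as mul_hi(a, b) + c, so both ways of computing d1 must give the same value and any difference can only come from how the compiler schedules the instructions. The following is a minimal host-side C sketch of that claim for one scalar lane; mul_hi_u32, mad_hi_u32, d1_separate, d1_fused and check_d1_forms_agree are hypothetical helper names, not OpenCL or driver functions.

```c
#include <assert.h>
#include <stdint.h>

/* Emulates OpenCL mul_hi on uint: high 32 bits of the 64-bit product. */
static uint32_t mul_hi_u32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* Emulates OpenCL mad_hi on uint: mul_hi(a, b) + c with 32-bit wraparound. */
static uint32_t mad_hi_u32(uint32_t a, uint32_t b, uint32_t c)
{
    return (uint32_t)(mul_hi_u32(a, b) + c);
}

/* d1 as in the first mul_96: mul_hi followed by a separate addition. */
static uint32_t d1_separate(uint32_t a0, uint32_t a1, uint32_t b0)
{
    uint32_t tmp = (uint32_t)((uint64_t)a1 * b0);
    uint32_t d1 = mul_hi_u32(a0, b0);
    d1 += tmp;
    return d1;
}

/* d1 as in the second mul_96: a single mad_hi. */
static uint32_t d1_fused(uint32_t a0, uint32_t a1, uint32_t b0)
{
    uint32_t tmp = (uint32_t)((uint64_t)a1 * b0);
    return mad_hi_u32(a0, b0, tmp);
}

/* Compare both forms over a small set of edge-case operands. */
static void check_d1_forms_agree(void)
{
    static const uint32_t s[] = {
        0u, 1u, 0x7FFFFFFFu, 0x80000000u, 0xFFFFFFFFu, 0x12345678u
    };
    const unsigned n = (unsigned)(sizeof s / sizeof s[0]);
    unsigned i, j, k;

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                assert(d1_separate(s[i], s[j], s[k]) ==
                       d1_fused(s[i], s[j], s[k]));
}
```

This only shows that the results are identical; it says nothing about why the generated ISA for the mad_hi version runs slower.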

 

Under which conditions would it use native mad_hi instructions?

 

Also, I have rather bad ALU Packing (~75%), coming from loads of MULLO_INT and MULHI_UINT that only run in the t-unit, leaving x-w empty. Can anyone suggest how to improve that generally?

 

Thanks a lot,

Bdot
