Hi,

I'm working with 96-bit unsigned integers by using 3x32-bit uints, several of them in parallel in a vector:

#define CON2(a,b) a##b
#define CONC(a,b) CON2(a,b)

#define VECTOR_SIZE 2
// #define VECTOR_SIZE 4

#define uint_v CONC(uint, VECTOR_SIZE)        // e.g. uint2 (used below)
#define AS_UINT_V CONC(as_uint, VECTOR_SIZE)

typedef struct _int96_t
{
    uint_v d0, d1, d2; // e.g. uint2 d0,d1,d2;
} int96_t;

Now I have a function that calculates the lower half of the product of two of such 96-bit vectors:

void mul_96(int96_t * const res, const int96_t a, const int96_t b)
/* res = a * b */
{
    __private uint_v tmp;

    res->d0 = a.d0 * b.d0;
    res->d1 = mul_hi(a.d0, b.d0);
    res->d2 = mul_hi(a.d1, b.d0);

    tmp = a.d1 * b.d0;
    res->d1 += tmp;
    res->d2 += AS_UINT_V((tmp > res->d1) ? 1 : 0); // carry

    res->d2 += mul_hi(a.d0, b.d1);
    tmp = a.d0 * b.d1;
    res->d1 += tmp;
    res->d2 += AS_UINT_V((tmp > res->d1) ? 1 : 0); // carry

    res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}

In order to optimize performance, I tried to use mad_hi instead of mul_hi and an addition:

void mul_96(int96_t * const res, const int96_t a, const int96_t b)
/* res = a * b */
{
    __private uint_v tmp;

    res->d0 = a.d0 * b.d0;
    tmp = a.d1 * b.d0;
    res->d1 = mad_hi(a.d0, b.d0, tmp);
    res->d2 = mad_hi(a.d1, b.d0, AS_UINT_V((tmp > res->d1) ? 1 : 0));
    res->d2 = mad_hi(a.d0, b.d1, res->d2);

    tmp = a.d0 * b.d1;
    res->d1 += tmp;
    res->d2 += AS_UINT_V((tmp > res->d1) ? 1 : 0);

    res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}

However, this second function is between 2 and 15% slower in various kernels, probably depending on the code surrounding the function call. This is with Catalyst 12.6 on an HD5770, on both Win64 and Linux64. I just updated to 12.8, but that makes no difference.

I already found out that mad_hi will be translated (for example) to

33  t: MULHI_UINT  ____, R10.x, R2.w
34  x: ADD_INT     ____, T0.z, PS33

But then it should be the same as if I used mul_hi and + myself? Why is it so much slower? In other places that are executed just once per kernel, I noticed big differences as well, sometimes making it a bit faster, sometimes much slower than mul_hi plus an addition.

Under which conditions would it use native mad_hi instructions?

Also, I have rather bad ALU packing (~75%), caused by loads of MULLO_INT and MULHI_UINT instructions that can only run in the t-unit, leaving x-w empty. Can anyone suggest how to improve that in general?

Thanks a lot,

Bdot

If the kernel analyzer shows that the instructions are the same for the two kernels but their order is not, how about rearranging them so that the compiler translates both into exactly the same thing? What would that do to the performance?