Hi,
I'm working with 96-bit unsigned integers by using 3x32-bit uints, several of them in parallel in a vector:
#define CON2(a,b) a##b
#define CONC(a,b) CON2(a,b)
#define VECTOR_SIZE 2
// #define VECTOR_SIZE 4
#define AS_UINT_V CONC(as_uint, VECTOR_SIZE)
typedef struct _int96_t
{
CONC(uint, VECTOR_SIZE) d0,d1,d2; // e.g. uint2 d0,d1,d2;
}int96_t;
#define uint_v CONC(uint, VECTOR_SIZE) // vector type used in the functions below, e.g. uint2
Now I have a function that calculates the lower half of the product of two of such 96-bit vectors:
void mul_96(int96_t * const res, const int96_t a, const int96_t b)
/* res = a * b */
{
__private uint_v tmp;
res->d0 = a.d0 * b.d0;
res->d1 = mul_hi(a.d0, b.d0);
res->d2 = mul_hi(a.d1, b.d0);
tmp = a.d1 * b.d0;
res->d1 += tmp;
res->d2 += AS_UINT_V((tmp > res->d1)? 1 : 0); // carry
res->d2 += mul_hi(a.d0, b.d1);
tmp = a.d0 * b.d1;
res->d1 += tmp;
res->d2 += AS_UINT_V((tmp > res->d1)? 1 : 0); // carry
res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
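For readers without an OpenCL toolchain at hand, the same carry logic can be checked in plain C. This is a scalar sketch (each vector lane behaves identically), with mul_hi emulated through a 64-bit multiply; the names mul_hi32, int96_s and mul_96_ref are mine, not from the kernel:

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint32_t d0, d1, d2; } int96_s; /* scalar stand-in for int96_t */

/* emulate OpenCL mul_hi for 32-bit unsigned operands */
static uint32_t mul_hi32(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a * b) >> 32);
}

/* res = low 96 bits of a * b, mirroring mul_96 above */
static void mul_96_ref(int96_s *res, int96_s a, int96_s b)
{
    uint32_t tmp;
    res->d0 = a.d0 * b.d0;
    res->d1 = mul_hi32(a.d0, b.d0);
    res->d2 = mul_hi32(a.d1, b.d0);
    tmp = a.d1 * b.d0;
    res->d1 += tmp;
    res->d2 += (tmp > res->d1) ? 1 : 0;  /* carry out of d1 */
    res->d2 += mul_hi32(a.d0, b.d1);
    tmp = a.d0 * b.d1;
    res->d1 += tmp;
    res->d2 += (tmp > res->d1) ? 1 : 0;  /* carry out of d1 */
    res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
```

The carry test works because after `d1 += tmp`, the addition wrapped exactly when the new d1 is smaller than tmp.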
In order to optimize performance, I tried to use mad_hi instead of mul_hi and an addition:
void mul_96(int96_t * const res, const int96_t a, const int96_t b)
/* res = a * b */
{
__private uint_v tmp;
res->d0 = a.d0 * b.d0;
tmp = a.d1 * b.d0;
res->d1 = mad_hi(a.d0, b.d0, tmp);
res->d2 = mad_hi(a.d1, b.d0, AS_UINT_V((tmp > res->d1)? 1 : 0));
res->d2 = mad_hi(a.d0, b.d1, res->d2);
tmp = a.d0 * b.d1;
res->d1 += tmp;
res->d2 += AS_UINT_V((tmp > res->d1)? 1 : 0);
res->d2 += a.d0 * b.d2 + a.d1 * b.d1 + a.d2 * b.d0;
}
However, this second function is between 2% and 15% slower in various kernels, probably depending on the code surrounding the function call. This is with Catalyst 12.6, on an HD5770, on both Win64 and Linux64. I just updated to 12.8, but that makes no difference.
I already found out that mad_hi will be translated (for example) to
33 t: MULHI_UINT ____, R10.x, R2.w
34 x: ADD_INT ____, T0.z, PS33
But then it should be the same as if I used mul_hi and + myself, so why is it so much slower? In other places that are executed just once per kernel, I noticed big differences as well, sometimes making it a bit faster, sometimes much slower than mul_hi plus addition.
Under which conditions would it use native mad_hi instructions?
Also, I have rather bad ALU Packing (~75%), coming from loads of MULLO_INT and MULHI_UINT that only run in the t-unit, leaving x-w empty. Can anyone suggest how to improve that generally?
Thanks a lot,
Bdot
If the kernel analyzer shows that the instructions are the same but their order is not the same for the two kernels, how about re-arranging them so that the compiler translates them into exactly the same thing? What would happen to the performance then?
Hi,
Even a small change in source code can cause big changes in the generated ISA asm. So it would be better to see the full disassembly of those functions to understand what the compiler does with and without mad.
Can anyone suggest how to improve that generally?
Yea, do not use those 32-bit muls; they are exactly as slow as double-precision operations. (On the 7xxx they will be even slower, depending on the reduced DP/SP ratio on the cheaper models.)
- 24-bit muls only; they are 4x (or more) faster.
- They run at full SP rate (using the same 24-bit float multiplier circuits).
- They can go to the xyzw units.
But it needs some pre- and post-processing, and I'm afraid it's not worth it for only 96-bit precision.
I've tried it at 384-bit precision, and it was 30% faster than the mul32 method.
For preprocessing you need to split the input data into 24-bit parts: there is an AMD-specific instruction for it called align_byte (or something like that).
For postprocessing you have to propagate the carries (top 8 bits) and combine the 4x 24-bit limbs into the 3x 32-bit result.
Thinking again, this extra work may be too much, but it could use that 25% of under-utilization... who knows.
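The split/multiply/carry-propagate scheme described above can be sketched on the host in plain C. This is only an illustration of the idea, not tuned kernel code: mul24 is modelled by ordinary 64-bit multiplies on masked values, the limb count is chosen for 96-bit inputs, and all the names (split24, mul24_limbs) are mine:

```c
#include <assert.h>
#include <stdint.h>

#define LIMBS 4            /* 4 x 24 bits cover 96 bits */
#define LIMB_MASK 0xFFFFFFu

/* split a 96-bit number (3 x 32-bit words) into 4 x 24-bit limbs */
static void split24(const uint32_t w[3], uint32_t l[LIMBS])
{
    l[0] = w[0] & LIMB_MASK;
    l[1] = ((w[0] >> 24) | (w[1] << 8)) & LIMB_MASK;
    l[2] = ((w[1] >> 16) | (w[2] << 16)) & LIMB_MASK;
    l[3] = w[2] >> 8;
}

/* schoolbook multiply, column by column; each partial product is at
   most 48 bits, so several of them fit in a 64-bit accumulator without
   checking for carries after every addition */
static void mul24_limbs(const uint32_t a[LIMBS], const uint32_t b[LIMBS],
                        uint32_t r[2 * LIMBS])
{
    uint64_t acc = 0;
    for (int col = 0; col < 2 * LIMBS; col++) {
        for (int i = 0; i < LIMBS; i++) {
            int j = col - i;
            if (j >= 0 && j < LIMBS)
                acc += (uint64_t)a[i] * b[j];
        }
        r[col] = (uint32_t)(acc & LIMB_MASK); /* keep 24 bits per limb */
        acc >>= 24;                           /* propagate the rest */
    }
}
```

The postprocessing step then consists of recombining the 24-bit result limbs into 32-bit words, which is where the extra work comes from.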
Thanks a lot for your answers.
It turns out that reordering the instructions so that the sequence is the same (with mad_hi separated into mul_hi and addition) makes no difference at all (EDIT: compared to the previous mul_hi example. The difference to the mad_hi code remains). This kind of confirms that the compiler is quite good at finding independent instructions and reordering them anyway.
But then, combining one of the mul_hi + add into a mad_hi makes the performance of my kernels drop by 1-2% per mad_hi.
I'll check the IL-code for that, but have not enough time right now ... I'll post again what the difference is.
Also thanks for the hint about amd_bitalign(). I have many constructs similar to
nn.d2 = (nn.d2 << 23) + (nn.d1 >> 9);
in my code. When replacing them with amd_bitalign(nn.d2, nn.d1, 9) (for this example), there is not a big change in performance, but where there is a change, it is a tiny bit faster. As I have a lot of shifts like that, I could improve overall performance by ~1.2%. Thanks again!
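For reference, amd_bitalign (from AMD's cl_amd_media_ops extension) behaves like a funnel shift: it takes 32 bits out of the 64-bit concatenation (hi:lo) shifted right. A plain-C model (a sketch; function names are mine) shows why it can replace the shift/add pair:

```c
#include <assert.h>
#include <stdint.h>

/* plain-C model of amd_bitalign(hi, lo, shift) for shift in 1..31 */
static uint32_t bitalign(uint32_t hi, uint32_t lo, uint32_t shift)
{
    return (uint32_t)((((uint64_t)hi << 32) | lo) >> shift);
}

/* the shift/add pattern from the post, e.g. (d2 << 23) + (d1 >> 9) */
static uint32_t shift_add(uint32_t hi, uint32_t lo, uint32_t shift)
{
    return (hi << (32 - shift)) + (lo >> shift);
}
```

The two agree because the high and low parts occupy disjoint bit ranges, so the + never carries.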
When you say 24-bit muls only, do you mean to use 24-bit integers and the combination mul24/mad24 + mul_hi, or would you use 16-bit integers and only mul24/mad24? I tried both, and the 24-bit approach is rather slow, not (yet) sure why.
For the other attempt I used 15-bit integers in order to leave room for carries, and this is quite fast for 5x15bits, but for 6x15bits the kernels use too many registers, and it starts to slow down dramatically. I need to reduce the vector size to lower the register usage, but that also lowers the ALU packing ...
How did you do that for 384bit? Could I see an example?
I just started examining how this runs on GCN (I have a 7850 for that). There, I have pretty low kernel occupancy due to high VGPR usage. I have low SGPR and no LDS usage. I'll probably start another thread about that, but it seems I'll need help with that too ...
MUL_HI and MUL_LO are the slow 32-bit versions; they run on the double-precision circuits.
Somehow you have to force OpenCL to use the MUL_INT24 and MULHI_INT24 ISA instructions. (They should be in the AMD-specific OpenCL instructions section of the docs.)
It's a 48-bit mul output in 2 cycles, compared to a 64-bit mul output in 4 cycles.
Oh, never mind!!! According to the 69xx manual (it's VLIW4 too), mul24 has only a signed version, no unsigned version. So it's GCN only, sorry for misleading you.
>How did you do that for 384bit? Could I see an example?
Well, it was a bit of a failed project for me. I tried to achieve maximum ALU utilization while multiplying big fractions which can't fit inside the VGPRs. So I made the multiply algo work on 12-dword parts at a time. It worked like this:
- fetch 12 dwords from operand A (the pointer goes forward)
- fetch 12 dwords from operand B (starting from the end of the number, the pointer goes backwards)
- split those 12 dwords into 16 dwords, with 24 bits residing in each dword
- do the schoolbook multiplication column by column, and accumulate every 16 columns (that's a long unrolled piece of code)
- adjust the carries: the topmost 8 bits of each limb are added to the next limb and zeroed out
- go to the next block 'downwards'
- the last block is processed differently (it's a half-square shape, not a full square).
All I learned is that this is a bad memory access pattern (one pointer going up and one going down simultaneously). And because of that, it was funny that overall performance was better when I used only 1024 threads and not more (on an HD 7970 that's not much). Too many threads meant too much fighting for memory access.
mul/mad24 are available as both signed and unsigned, also on VLIW4 and VLIW5, so that is not the problem.
I take it from your description that you worked mostly with local or global memory. So far that was not needed in my case (except for reading the input once and writing a few bytes of output). It is not required for interaction either, as my threads all work independently.
It's 48bits mul output in 2 cycles, compared to 64bit mul output in 4 cycles.
Isn't that rather like 48 bits in 5 cycles, compared to 64 bits in 8 cycles:
mul24 (for the low 32 bits) - 1 cycle
mul_hi (for the high 16 bits) - 4 cycles
compared to
mul (for the low 32 bits) - 4 cycles
mul_hi (for the high 32 bits) - 4 cycles
That's 9.6 bits per cycle, compared to 8 bits per cycle. If you add the effort for splitting and combining the integer's 24-bit chunks, there's not a lot of advantage left.
But if I understand it right, on 77xx and 78xx, a 32-bit mul[hi] takes 16 cycles? If that is true, then it could explain why my 7850 is only marginally faster than a 5770 on integer code with lots of multiplications.
BTW, the amd_bitalign is quite a failure as well. While the kernel I tested first with it was indeed 1.2% faster, another kernel that shares some of the amd_bitalign-optimized functions is 5% slower now. Optimizing code for VLIW5 is a pain; you never know in advance if or how a certain change will influence the runtime. I think I'll just leave the code for VLIW as it is and try optimizing for GCN, hoping that it behaves more consistently. Although the 4-to-16-cycle difference for multiplications will also make it quite hard to come up with a single "best" approach for GCN ...
An unsigned 24-bit mul exists -> then the 6900 ISA doc is a bit incomplete (or I can't search it properly).
On VLIW5 the bit_align is T-unit exclusive; yea, that's a limitation.
"It's 48bits mul output in 2 cycles, compared to 64bit mul output in 4 cycles."
My error, sorry.
There IS a 24-bit variant of mul_hi as well; that's why the 24->48-bit mul took 2 cycles. And the 32->64-bit mul (as you said) is 8 cycles. The perf ratio is then (24/2)=12 input bits/cycle versus (32/8)=4 input bits/cycle (if I'm calculating right), so the 24-bit version is 3x as effective. (It just has some packing side effects, but 24-bit is also a win when you accumulate values -> you don't have to check the carry at every addition.)
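The accumulate-without-carry-checks point can be illustrated in plain C (a sketch with a made-up helper name): a 24-bit limb stored in a 32-bit word leaves 8 spare bits, so up to 256 worst-case additions can be accumulated before any carry handling is needed.

```c
#include <assert.h>
#include <stdint.h>

#define LIMB_MASK 0xFFFFFFu

/* sum up to 256 24-bit limb values with no per-addition carry check;
   the 8 spare bits of the 32-bit accumulator absorb the overflow */
static uint32_t sum_limbs(const uint32_t *limbs, int n)
{
    uint32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += limbs[i] & LIMB_MASK; /* no carry check needed here */
    return acc;
}
```

Afterwards a single normalization step (add acc >> 24 to the next limb, mask acc to 24 bits) restores the limb invariant.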
>Optimizing code for VLIW5 is a pain, you never know in advance if/how a certain change will influence runtime.
Yea, it's voodoo magic, because the optimizer at the CAL level is very complicated. Taming GCN also needs some black magic. It's a really good architecture, but it's far away from OpenCL. From OpenCL you just can't use the S-ALU, but it has 105 more registers you can use for storing and precalculating per-workgroup constants. With the S-ALU you can also do x86-like flexible flow control.
I now tested mad_hi on a pitcairn, and it's even worse there. I'd say this is really a bug:
When I change my source code from
k.d0 = mad24(t, 4620u, k_base.d0);
k.d1 = mul_hi(t, 4620u) + k_base.d1 + AS_UINT_V((k_base.d0 > k.d0)? 1 : 0);
to
k.d0 = mad24(t, 4620u, k_base.d0);
k.d1 = mad_hi(t, 4620u, k_base.d1) + AS_UINT_V((k_base.d0 > k.d0)? 1 : 0);
then the ISA code differences (apart from different register usage) are:
from
v_mul_hi_u32 v7, v3, s7
v_mul_hi_u32 v8, v2, s7
to
v_mul_hi_u32 v5, v2, s1
v_mul_lo_u32 v4, v2, s1
v_mul_hi_u32 v7, v3, s1
v_mul_lo_u32 v6, v3, s1
v_lshr_b64 v[4:5], v[4:5], 32
v_lshr_b64 v[5:6], v[6:7], 32
So it seems a mad_hi(uint, uint, uint) results in calculating the whole 64-bit product and then shifting that result right by 32 bits. No wonder it is so much slower here ...
(The v_mad_u32_u24 before and the addition after these code parts are the same, so I left them out.)
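For completeness, the two instruction sequences are arithmetically equivalent; what differs is only the instruction count. A plain-C model of both (a sketch; mul_hi is emulated via a 64-bit multiply, and the function names are mine):

```c
#include <assert.h>
#include <stdint.h>

/* what mad_hi(a, b, c) should compute: high 32 bits of a*b, plus c */
static uint32_t mad_hi_ref(uint32_t a, uint32_t b, uint32_t c)
{
    return (uint32_t)(((uint64_t)a * b) >> 32) + c;
}

/* what the ISA above does: full 64-bit product (v_mul_lo_u32 +
   v_mul_hi_u32), then a 64-bit shift right by 32 (v_lshr_b64) */
static uint32_t mad_hi_as_emitted(uint32_t a, uint32_t b, uint32_t c)
{
    uint64_t p = (uint64_t)a * b; /* two multiply instructions */
    p >>= 32;                     /* the redundant 64-bit shift */
    return (uint32_t)p + c;
}
```

Both return the same value for all inputs; the emitted form just spends an extra multiply and an extra shift on it.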
I'm not that good at reading IL code, but it appears this problem is already present there. The mul_hi code (first major difference; I hope I caught the right spot):
mov r72.__z_, l14
ishl r72.___w, r1.x, r72.z
mov r73.xy__, r72.w
umul_high r73.__zw, r70.zwzw, r73.xyxy
iadd r72.xy__, r73.zwzw, r72.xyxy
mov r72.___w, r72.000y
ult r71._y__, r72.w, r71.y
mov r71._y__, r71.y
mov r72.___w, r72.000x
ult r71.___w, r72.w, r71.w
mov r71.___w, r71.w
mov r73.__zw, r71.w
iadd r73.__zw, r73.00z0, r71.000y
mov r71._y__, l15
mov r74.xy__, r71.y
iand r73.__zw, r73.zwzw, r74.xyxy
mov r71._y__, l16
mov r74.__zw, r71.y
mov r75.xy__, r72.z
cmov_logical r73.__zw, r73.zwzw, r75.xyxy, r74.zwzw
mov r71.___w, r70.000y
ult r68._y__, r68.y, r71.w
The mad_hi code:
mov r72.__z_, r72.00y0
mov r72.___w, r68.y
mov r73.x___, l14
iadd r73.__zw, r72.00w0, r73.000x
mov r72.___w, l15
ishl r73._y__, r1.x, r72.w
mov r74.x___, r73.y
iadd r74.xy__, r74.x000, r73.0x00
i64mul r73.__zw, r73.zwzw, r74.xyxy
mov r74.__z_, l16
u64shr r73.__zw, r73.zwzw, r74.z
mov r73.__z_, r73.00z0
iadd r72.__z_, r73.z, r72.z
ult r71._y__, r72.z, r71.y
mov r71._y__, r71.y
mov r72.x___, r72.x000
mov r72._y__, r71.z
iadd r73.__zw, r72.00y0, r73.000x
i64mul r73.__zw, r73.zwzw, r74.xyxy
u64shr r73.__zw, r73.zwzw, r74.z
mov r72._y__, r73.0z00
iadd r72.x___, r72.y, r72.x
ult r71.___w, r72.x, r71.w
mov r71.___w, r71.w
mov r73.__zw, r71.w
iadd r73.__zw, r73.00z0, r71.000y
mov r71._y__, l17
mov r74.xy__, r71.y
iand r73.__zw, r73.zwzw, r74.xyxy
mov r75.xy__, r73.x
mov r75.__zw, r72.w
cmov_logical r73.__zw, r73.zwzw, r75.zwzw, r75.xyxy
mov r71._y__, r70.0y00
ult r68._y__, r68.y, r71.y
The source variables are declared as
typedef struct _int96_1t
{
uint d0,d1,d2;
}int96_1t;
typedef struct _int96_t
{
uint2 d0,d1,d2;
}int96_t;
#define uint_v uint2
#define AS_UINT_V as_uint2
__private uint_v t; // read from input array
__private int96_t k;
const int96_1t k_base (input parameter)
The AMD IL code generated from OpenCL usually hurts my eyes
Umm, well, it's not too good that OpenCL's mad_hi() compiles to AMD IL's i64mul and u64shr (and why doesn't mul_hi()?).
And that translates to GCN ISA: v_mul_hi_u32 v5, v2, s1 \ v_mul_lo_u32 v4, v2, s1 \ v_lshr_b64 v[4:5], v[4:5], 32
Here the CAL compiler noticed that the shr's parameter is the constant 32, but wasn't clever enough to optimize the 2 extra instructions out. So it's pretty certain that there are no mad24_lo and mad24_hi in OpenCL.
(And why does a signed i64mul become an unsigned v_mul_hi_u32? o.O)