Currently I am trying to improve my kernels by inserting assembly code: for Vega GPUs I use clrxasm or test on ROCm with inline assembly, and for Navi I test by placing inline asm directly in my CL code.
That said, I have one situation where I need to perform certain uint64 add operations involving registers from neighboring lanes. To achieve that, I am using the VOP2 variants of v_add_co_u32 and v_addc_co_u32. Concretely, this is valid code in ROCm OpenCL for Vega:
__asm volatile("v_add_co_u32_e32 %0, vcc, %1, %2 \n" : "=v" (r.s0) : "v" (i.s0), "v" (i.s2) : "vcc");
__asm volatile("v_addc_co_u32_e32 %0, vcc, %1, %2, vcc \n" : "=v" (r.s1) : "v" (i.s1), "v" (i.s3) : "vcc");
My problem is that this is not accepted on Navi ("instruction not supported on this GPU" / "not a valid operand"). The RDNA ISA document lists only one VOP2 opcode for add with carry-in and carry-out, "v_add_co_ci_u32"; the other opcodes are listed only in the VOP3B section (although the text mentions that a VOP2 version should exist).
Anyway, right now I cannot figure out the right opcodes for my ulong add on Navi at all. As a temporary fix I am using "v_mov_b32" with a DPP modifier plus the normal OpenCL add, but that doubles my total instruction count.
So I would like to know: what is the right Navi equivalent of the Vega code I posted above?
Thank you for the query. I've forwarded it to the OpenCL compiler team. I'll let you know once I get any reply from them.
I wanted to give a quick update on this matter.
First of all, I made a mistake when using v_add_co_ci_u32_dpp (which does exist and works fine): I used "vcc" as both the input and the destination register for the carry. It seems this instruction expects a 32-bit register for the carry (due to wave32 mode) and thus only works with "vcc_lo". That said, the compiler (amdgpu-pro 20.20) returned the error message "instruction not supported on this GPU", which is very misleading. It should rather complain about the input and output operands; then I might have found the real issue earlier.
v_add_co_ci_u32_dpp (Navi) is the actual, working replacement for v_addc_co_u32_dpp (Vega).
The open question is still: what about v_add_co_u32_dpp? I used LLVM to show me the disassembly of a Navi kernel and saw that for a ulong add it ALWAYS uses v_add_co_u32_e64, even if the destination s_reg is vcc_lo. So it seems v_add_co_u32_e32 (and thus also the DPP version) is either missing from the Navi ISA completely, or nobody told the compiler about it. Whichever of the two it is, it is hard to see how it happened or stayed undiscovered for over a year now.
My current workaround is to use "s_mov_b32 vcc_lo, 0" followed by v_add_co_ci_u32_dpp for the addition of the low 32 bits. This adds one extra instruction at every occurrence in my sources. For the moment it works correctly, but I hope the compiler team finds a better solution or workaround.
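The workaround above can be modeled on the host in plain C. This is only a sketch of the per-lane arithmetic, assuming the usual semantics of an add with carry-in and carry-out; the function names add_co_ci_u32 and add_u64_via_carry_chain are hypothetical helpers for illustration, not part of any AMD or OpenCL API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical host-side model of one lane of v_add_co_ci_u32:
   returns a + b + carry-in and stores the carry-out bit in *carry. */
static uint32_t add_co_ci_u32(uint32_t a, uint32_t b, uint32_t *carry)
{
    assert(*carry <= 1u);                 /* a per-lane carry is a single bit */
    uint64_t wide = (uint64_t)a + b + *carry;
    *carry = (uint32_t)(wide >> 32);      /* bit 32 of the wide sum is the carry-out */
    return (uint32_t)wide;
}

/* 64-bit add built from two carry-chained 32-bit adds. */
static uint64_t add_u64_via_carry_chain(uint64_t x, uint64_t y)
{
    uint32_t carry = 0;                   /* models "s_mov_b32 vcc_lo, 0" */
    uint32_t lo = add_co_ci_u32((uint32_t)x, (uint32_t)y, &carry);
    uint32_t hi = add_co_ci_u32((uint32_t)(x >> 32), (uint32_t)(y >> 32), &carry);
    return ((uint64_t)hi << 32) | lo;
}
```

If the carry were not cleared before the first add, a stale carry bit from earlier code would be added to the low word, which is why the extra s_mov_b32 is needed.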
Regarding your original query, the compiler team has shared the below feedback.
The only add and sub instructions that support carry and have a VOP2 form (and are thus also promotable to the DPP form) on Navi are v_add_co_ci_u32_e32 and v_sub_co_ci_u32_e32. Therefore, a pair of such instructions has to be used for a ulong add. The user must make sure the carry-in is zero for the first instruction.
1. The above example for Vega does not use DPP, unlike what is stated.
2. If compiled for wave32, vcc_lo has to be used instead of vcc.
3. DPP is different on Navi. There are DPP16 and DPP8, but DPP can operate across at most 16 lanes.
Otherwise, these are the instructions to use with and without DPP:
llvm-mc -arch=amdgcn -mcpu=gfx1010 -show-encoding <<< 'v_add_co_ci_u32 v5, vcc_lo, v1, v2, vcc_lo quad_perm:[3,2,1,0] row_mask:0x0 bank_mask:0x0'
v_add_co_ci_u32_dpp v5, vcc_lo, v1, v2, vcc_lo quad_perm:[3,2,1,0] row_mask:0x0 bank_mask:0x0 ; encoding: [0xfa,0x04,0x0a,0x50,0x01,0x1b,0x00,0x00]
llvm-mc -arch=amdgcn -mcpu=gfx1010 -show-encoding <<< 'v_add_co_ci_u32 v5, vcc_lo, v1, v2, vcc_lo'
v_add_co_ci_u32_e32 v5, vcc_lo, v1, v2, vcc_lo ; encoding: [0x01,0x05,0x0a,0x50]
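For readers who want to check what the DPP form computes, here is a rough host-side C model of a wave32 v_add_co_ci_u32 with quad_perm. It is a sketch under stated assumptions: DPP permutes the first source operand, quad_perm acts within each group of four consecutive lanes, vcc_lo holds one carry bit per lane, and all rows and banks are enabled (the row_mask and bank_mask fields are not modeled). The names quad_perm_src_lane and wave32_add_co_ci_u32_dpp are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define WAVE_SIZE 32

/* Lane that quad_perm makes `lane` read from: the permutation is applied
   inside each group of four consecutive lanes. */
static int quad_perm_src_lane(int lane, const int perm[4])
{
    assert(perm[lane & 3] >= 0 && perm[lane & 3] <= 3);
    return (lane & ~3) | perm[lane & 3];
}

/* Hypothetical wave32 model: dst = dpp(src0) + src1 + carry-in per lane.
   vcc_lo carries one bit per lane; the new carry mask is returned.
   dst must not alias src0, since lanes read their neighbors' src0. */
static uint32_t wave32_add_co_ci_u32_dpp(uint32_t dst[WAVE_SIZE],
                                         const uint32_t src0[WAVE_SIZE],
                                         const uint32_t src1[WAVE_SIZE],
                                         uint32_t vcc_lo,
                                         const int perm[4])
{
    uint32_t carry_out = 0;
    for (int lane = 0; lane < WAVE_SIZE; ++lane) {
        uint32_t a = src0[quad_perm_src_lane(lane, perm)];
        uint64_t wide = (uint64_t)a + src1[lane] + ((vcc_lo >> lane) & 1u);
        dst[lane] = (uint32_t)wide;
        carry_out |= (uint32_t)(wide >> 32) << lane;  /* set this lane's carry bit */
    }
    return carry_out;
}
```

With perm = {3, 2, 1, 0}, as in quad_perm:[3,2,1,0], each lane reads src0 from the mirrored lane of its own quad. Calling the function twice, first with vcc_lo = 0 on the low words and then with the returned mask on the high words, models the ulong add pair described above.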
To avoid the need for a zero carry-in, the first instruction can be v_add_co_u32, but this is VOP3 and cannot be DPP:
llvm-mc -arch=amdgcn -mcpu=gfx1010 -show-encoding <<< 'v_add_co_u32 v5, vcc_lo, v1, v2'
v_add_co_u32_e64 v5, vcc_lo, v1, v2 ; encoding: [0x05,0x6a,0x0f,0xd7,0x01,0x05,0x02,0x00]
In wave64 a null register can be used as a carry-in:
llvm-mc -arch=amdgcn -mcpu=gfx1010 -show-encoding -mattr=+wavefrontsize64 <<< 'v_add_co_ci_u32 v5, vcc, v1, v2, null'
v_add_co_ci_u32_e64 v5, vcc, v1, v2, null ; encoding: [0x05,0x6a,0x28,0xd5,0x01,0x05,0xf6,0x01]
Null can be used in wave32 as well.
llvm-mc -arch=amdgcn -mcpu=gfx1010 -show-encoding -mattr=+wavefrontsize32 <<< 'v_add_co_ci_u32 v5, vcc_lo, v1, v2, null'
v_add_co_ci_u32_e64 v5, vcc_lo, v1, v2, null ; encoding: [0x05,0x6a,0x28,0xd5,0x01,0x05,0xf6,0x01]
That does not help to get the VOP2 or DPP form, though. In practice it is no better than v_add_co_u32, because using any non-vcc operand for the carry pushes the instruction into VOP3 land.