Currently I am trying to improve my kernels by inserting assembly code. For Vega GPUs I use clrxasm or test them on ROCm with inline assembly, and for Navi I test by putting inline asm directly into my CL code.
That said, I have one situation where I need to perform certain uint64 add operations involving registers from neighboring lanes. To achieve that I am using the VOP2 variants of v_add_co_u32 and v_addc_co_u32. Concretely, this is valid code in ROCm OpenCL for Vega:
__asm volatile("v_add_co_u32_e32 %0, vcc, %1, %2 \n" : "=v" (r.s0) : "v" (i.s0), "v" (i.s2) : "vcc");
__asm volatile("v_addc_co_u32_e32 %0, vcc, %1, %2, vcc \n" : "=v" (r.s1) : "v" (i.s1), "v" (i.s3) : "vcc");
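To make the intent explicit, the two instructions together are meant to behave roughly like the plain C sketch below. This models a single lane only; the names u32x2, add64_via_carry and add64_selfcheck are made up for illustration, and the carry variable stands in for the role VCC plays between the two instructions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper type: one 64-bit value split into two 32-bit halves,
   like the (s0, s1) and (s2, s3) pairs in the snippet above. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u32x2;

/* Sketch of a 64-bit add built from two 32-bit adds.
   Step 1 corresponds to the add that produces a carry-out,
   step 2 to the add that consumes that carry as carry-in. */
u32x2 add64_via_carry(uint32_t a_lo, uint32_t a_hi,
                      uint32_t b_lo, uint32_t b_hi)
{
    u32x2 r;
    r.lo = a_lo + b_lo;                    /* low halves, wraps modulo 2^32 */
    uint32_t carry = (r.lo < a_lo) ? 1u : 0u; /* carry-out of the low add */
    r.hi = a_hi + b_hi + carry;            /* high halves plus carry-in */
    return r;
}

/* Compares the split add against a native 64-bit add. */
void add64_selfcheck(uint64_t a, uint64_t b)
{
    u32x2 r = add64_via_carry((uint32_t)a, (uint32_t)(a >> 32),
                              (uint32_t)b, (uint32_t)(b >> 32));
    uint64_t joined = ((uint64_t)r.hi << 32) | r.lo;
    assert(joined == a + b);
}
```

Whatever the Navi replacement turns out to be, it has to preserve exactly this carry hand-off between the low and the high add.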
My problem is that Navi does not accept this ("instruction not supported on this GPU" / "not a valid operand"). The RDNA ISA document lists only one VOP2 opcode for add with carry-in and carry-out, "v_add_co_ci_u32"; the other opcodes are listed only in the VOP3B section, although the text mentions that a VOP2 version should exist.
Anyway, right now I am not able to figure out which opcodes are the right ones for my ulong add on Navi at all. As a temporary fix I am using "v_mov_b32" with a DPP modifier plus the normal OpenCL add, but that doubles my total instruction count.
So I would like to know: what would be the right Navi equivalent of the Vega code I posted above?