Thanks for the detailed reply. I understand that computing two addresses and then scheduling the work might provide no real savings, but there are other cases where using LDS_READ2_RET would improve performance. For example, a pattern like this:
float result = localBuf[ndx * stride] + localBuf[ndx * stride + offset]; // the addition could be any operation that takes two inputs and produces one output
could compile into:
R0.x = ndx
R0.y = stride
R0.z = offset
R0.w = pointer to localBuf
R1.x = result
0  x: MUL_UINT24     ____, R0.x, R0.y
   y: MULADD_UINT24  ____, R0.x, R0.y, R0.z
1  x: MULADD_UINT24  ____, PV0.x, 4, R0.w
   y: MULADD_UINT24  ____, PV0.y, 4, R0.w
2  x: LDS_READ2_RET  QAB, PV1.x, PV1.y
3  x: ADD            R1.x, QA[2].pop, QB[2].pop
When stride and offset are known at compile time and the pointer to localBuf is 0, this could be optimized even further:
R0.x = ndx
R1.x = result
0  x: MUL_UINT24     ____, R0.x, <stride * 4>
   y: MULADD_UINT24  ____, R0.x, <stride * 4>, <offset * 4>
1  x: LDS_READ2_RET  QAB, PV0.x, PV0.y
2  x: ADD            R1.x, QA[2].pop, QB[2].pop
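The folding in the second listing relies on (ndx * stride + offset) * 4 + base being equal to ndx * (stride * 4) + offset * 4 when base is 0. Here is a minimal C sketch of that equivalence. The helper names are hypothetical (they are not part of any SDK or compiler), and it assumes 4-byte floats and a byte-addressed LDS:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers for illustration only. They model the byte
   addresses computed by the two listings, assuming sizeof(float) == 4. */

/* General form: scale the element index by 4 and add the buffer base. */
static uint32_t lds_addr_general(uint32_t ndx, uint32_t stride,
                                 uint32_t offset, uint32_t base)
{
    return (ndx * stride + offset) * 4u + base;
}

/* Folded form: stride4 = stride * 4 and offset4 = offset * 4 are
   compile-time constants, and the buffer base is 0. */
static uint32_t lds_addr_folded(uint32_t ndx, uint32_t stride4,
                                uint32_t offset4)
{
    return ndx * stride4 + offset4;
}

/* Returns 1 if both forms agree for every ndx whose address stays
   below lds_bytes, 0 otherwise. */
static int folded_matches_general(uint32_t stride, uint32_t offset,
                                  uint32_t lds_bytes)
{
    assert(stride > 0); /* a zero stride would never leave the loop */
    for (uint32_t ndx = 0;
         lds_addr_general(ndx, stride, offset, 0u) < lds_bytes; ++ndx) {
        if (lds_addr_folded(ndx, stride * 4u, offset * 4u) !=
            lds_addr_general(ndx, stride, offset, 0u))
            return 0;
    }
    return 1;
}
```

Calling folded_matches_general(17, 5, 32 * 1024), for instance, walks every in-range index for stride 17 and offset 5. The folded form saves the separate scale-and-add step, which is why the second listing is one instruction group shorter.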
It's safe to use the 24-bit integer mul/muladd here because LDS is 32 KB in size, so every value used in the address computation should need no more than 15 bits. Also, ndx and stride should be unsigned, as there is no signed 24-bit integer mul/muladd.
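To illustrate the range argument, here is a rough C emulation of the 24-bit path. The semantics are my assumption about MUL_UINT24/MULADD_UINT24 (only bits [23:0] of each multiplicand are used, the low 32 bits of the result are kept, and the addend is a full 32-bit value), and the function names are made up for this sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed semantics: multiply the low 24 bits of each operand and
   keep the low 32 bits of the product. */
static uint32_t mul_uint24(uint32_t a, uint32_t b)
{
    return (a & 0xFFFFFFu) * (b & 0xFFFFFFu);
}

/* Assumed semantics: 24-bit multiply followed by a 32-bit add. */
static uint32_t muladd_uint24(uint32_t a, uint32_t b, uint32_t c)
{
    return mul_uint24(a, b) + c;
}

/* Returns 1 if the 24-bit path yields the same byte address as plain
   32-bit arithmetic for every ndx that stays inside a 32 KB LDS. */
static int uint24_path_is_exact(uint32_t stride, uint32_t offset)
{
    const uint32_t lds_bytes = 32u * 1024u;
    assert(stride > 0); /* a zero stride would never leave the loop */
    for (uint32_t ndx = 0;
         (ndx * stride + offset) * 4u < lds_bytes; ++ndx) {
        uint32_t full = (ndx * stride + offset) * 4u;
        uint32_t elem = muladd_uint24(ndx, stride, offset);
        uint32_t narrow = muladd_uint24(elem, 4u, 0u);
        if (narrow != full)
            return 0;
    }
    return 1;
}
```

Under these assumptions, in-range unsigned operands give identical results, while a negative operand does not: mul_uint24((uint32_t)-1, 4) yields 0x03FFFFFC rather than the 32-bit result 0xFFFFFFFC, which is the reason for the unsigned requirement.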
As for LDS_WRITE_REL, it takes only one address, so there is no second address to compute, and the compiler shouldn't split it into two LDS_WRITE instructions unless there's a good reason to.