Yes, that can be done, except that you need to compute two addresses prior to the READ2. In other words, the compiler would schedule the code as:
LDS_READ2_RET address1, address2
This would provide no real savings as you need to schedule work before accessing the first request anyway. If you perform several LDS reads back-to-back, you'll see a pattern like this:
LDS_READ2_RET address2, address3
LDS_READ2_RET address4, address5
Thanks for the detailed reply. I understand that computing two addresses and then scheduling work might provide no real savings, but there are other cases where using LDS_READ2_RET would increase performance. For example, this kind of pattern:
float result = localBuf[ndx * stride] + localBuf[ndx * stride + offset]; // instead of addition it can be any operation that takes two inputs, and produces one output
could compile into:
R0.x = ndx
R0.y = stride
R0.z = offset
R0.w = pointer to localBuf
R1.x = result

0  x: MUL_UINT24     ____, R0.x, R0.y
   y: MULADD_UINT24  ____, R0.x, R0.y, R0.z
1  x: MULADD_UINT24  ____, PV0.x, 4, R0.w
   y: MULADD_UINT24  ____, PV0.y, 4, R0.w
2  x: LDS_READ2_RET  QAB, PV1.x, PV1.y
3  x: ADD            R1.x, QA.pop, QB.pop
In cases where stride and offset are known at compile time and pointer to localBuf is 0, it could be optimized even more:
R0.x = ndx
R1.x = result

0  x: MUL_UINT24     ____, R0.x, <stride * 4>
   y: MULADD_UINT24  ____, R0.x, <stride * 4>, <offset * 4>
1  x: LDS_READ2_RET  QAB, PV0.x, PV0.y
2  x: ADD            R1.x, QA.pop, QB.pop
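To make the address arithmetic in the two listings easier to check, here is a rough C sketch of what each one computes. The function names are mine and purely illustrative. It assumes 4-byte floats and plain 32-bit arithmetic, and it only models the two byte addresses passed to LDS_READ2_RET, not the read itself.

```c
#include <stdint.h>

/* General case: stride, offset and the localBuf base are runtime values.
 * Mirrors instruction groups 0 and 1 of the first listing. */
void lds_addrs_general(uint32_t ndx, uint32_t stride, uint32_t offset,
                       uint32_t base, uint32_t *addr_a, uint32_t *addr_b)
{
    uint32_t elem_a = ndx * stride;          /* 0 x: MUL_UINT24    */
    uint32_t elem_b = ndx * stride + offset; /* 0 y: MULADD_UINT24 */
    *addr_a = elem_a * 4u + base;            /* 1 x: MULADD_UINT24 */
    *addr_b = elem_b * 4u + base;            /* 1 y: MULADD_UINT24 */
}

/* Folded case: stride * 4 and offset * 4 are compile-time constants and
 * the localBuf base is 0. Mirrors group 0 of the second listing. */
void lds_addrs_folded(uint32_t ndx, uint32_t stride_x4, uint32_t offset_x4,
                      uint32_t *addr_a, uint32_t *addr_b)
{
    *addr_a = ndx * stride_x4;             /* 0 x: MUL_UINT24    */
    *addr_b = ndx * stride_x4 + offset_x4; /* 0 y: MULADD_UINT24 */
}
```

With base 0, both functions should produce the same pair of addresses, which is why the second listing can drop one instruction group.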
It's safe to use the 24-bit integer mul/muladd here: LDS is 32 KB in size, so every value used in computing an address needs no more than 15 bits. Also, ndx and stride should be unsigned, as there is no signed 24-bit integer mul/muladd.
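As a sanity check on the 24-bit argument, here is a small C sketch that emulates the two instructions and compares the result with ordinary 32-bit arithmetic. The emulation is an assumption on my part: it treats MUL_UINT24 as multiplying the low 24 bits of each operand and keeping the low 32 bits of the product, and MULADD_UINT24 as that product plus a 32-bit addend. The helper names are hypothetical.

```c
#include <stdint.h>

#define MASK24 0xFFFFFFu

/* Assumed semantics: only the low 24 bits of each operand take part,
 * and the low 32 bits of the product are returned. */
uint32_t mul_uint24(uint32_t a, uint32_t b)
{
    return (uint32_t)((uint64_t)(a & MASK24) * (uint64_t)(b & MASK24));
}

/* Assumed semantics: 24-bit multiply followed by a 32-bit add. */
uint32_t muladd_uint24(uint32_t a, uint32_t b, uint32_t c)
{
    return mul_uint24(a, b) + c;
}

/* Returns 1 when the 24-bit path yields the same byte address as
 * full 32-bit arithmetic for (ndx * stride + offset) * 4. */
int addr24_matches(uint32_t ndx, uint32_t stride, uint32_t offset)
{
    uint32_t full  = (ndx * stride + offset) * 4u;
    uint32_t via24 = muladd_uint24(muladd_uint24(ndx, stride, offset), 4u, 0u);
    return full == via24;
}
```

For operands that fit in 24 bits, which is always the case for indices into a 32 KB LDS, the two paths agree. They only diverge once an operand has bits set above bit 23.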
As for LDS_WRITE_REL, it takes only one address, so there's no need to compute two. The compiler therefore shouldn't split it into two LDS_WRITE instructions unless there's a good reason to.