From section 9.3.1:
LDS Direct reads occur in vector ALU (VALU) instructions and allow the LDS to
supply a single DWORD value which is broadcast to all threads in the wavefront
and is used as the SRC0 input to the ALU operations. A VALU instruction
indicates that input is to be supplied by LDS by using the LDS_DIRECT for the
SRC0 field.
I would like to know how many clock cycles of penalty this has compared to using data that is already in a register.
Do the ALUs have some hidden registers to receive the data for SRC0? If not, where does the broadcast data get stored?
Hi,
src_lds_direct takes exactly the same amount of time as a vector or a scalar register operand (measured with s_memtime).
It is like broadcasting a scalar register to the whole wavefront, except that you can have up to 16KB of constants instead of only 103*4 bytes, and the ALU can still work at maximum utilization.
SRC0 can select from 512 different things: 256 vregs, 128 sregs and 128 special things (I guess those come from the scalar ALU as well). lds_direct is one of these specials. There are also many int and float constants, debug/trap registers, state flags, and even a value that marks immediate data right after the instruction dword.
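To make the "512 different things" concrete, here is a rough Python sketch that classifies a 9-bit SRC0 value. The function name `classify_src0` is made up for illustration, and the numeric ranges are my recollection of the GCN ISA operand table (254 for lds_direct, 255 for the literal constant, 256 to 511 for vregs), so treat them as assumptions and check the ISA manual for your GPU generation.

```python
# Hypothetical helper: coarse decode of the 9-bit VALU SRC0 field.
# The ranges below are assumptions based on the GCN ISA operand table;
# exact values can differ between hardware generations.

INLINE_FLOATS = [0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0]  # codes 240..247


def classify_src0(code: int) -> str:
    if not 0 <= code <= 511:
        raise ValueError("SRC0 is a 9-bit field (0..511)")
    if code >= 256:
        # Upper half of the encoding space: the 256 vector registers.
        return f"vgpr v{code - 256}"
    if code < 128:
        # Lower quarter: sregs plus scalar-side registers (VCC, M0, EXEC, ...).
        return "scalar register or scalar-side special"
    if code <= 192:
        return f"inline integer {code - 128}"      # 0..64
    if code <= 208:
        return f"inline integer {-(code - 192)}"   # -1..-16
    if 240 <= code <= 247:
        return f"inline float {INLINE_FLOATS[code - 240]}"
    if code == 254:
        return "lds_direct"
    if code == 255:
        return "literal constant (next dword)"
    return "other special or reserved"


if __name__ == "__main__":
    for code in (5, 128, 193, 242, 254, 255, 256):
        print(code, "->", classify_src0(code))
```

The point of the sketch is only that lds_direct is selected the same way as any other operand: it is one code in the SRC0 field, not a separate instruction.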