Using src_lds_direct as an operand takes exactly the same amount of time as using a vector or a scalar register (measured with s_memtime).
It is like broadcasting a scalar register to the whole WF, except that you can have up to 16KB of constants instead of only 103*4 bytes, and the ALU can still work at maximum utilization.
SRC0 can select from 512 different things: 256 vregs, 128 sregs and 128 special things (I guess those also come from the scalar ALU). lds_direct is one of these specials. There are many int and float constants, debug/trap registers, state flags, and even a value that marks immediate data right after the instruction dword.
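
To make the SRC0 selector more concrete, here is a rough Python sketch of how the 9-bit value could be decoded. The function name classify_src0 is made up for illustration, and the exact code points (254 for lds_direct, 255 for the literal marker, 128..208 for the integer constants, 240..247 for the float constants) are my reading of the GCN ISA documents, so treat them as assumptions and check them against the manual for your chip generation.

```python
# Illustrative decoder for the 9-bit SRC0 operand selector.
# All code points below are assumptions taken from my reading of the
# GCN ISA manuals; verify them for the specific hardware generation.

INLINE_FLOATS = (0.5, -0.5, 1.0, -1.0, 2.0, -2.0, 4.0, -4.0)


def classify_src0(code: int) -> str:
    """Return a human-readable description of a SRC0 selector value."""
    if not 0 <= code <= 511:
        raise ValueError("SRC0 is a 9-bit field (0..511)")
    if code >= 256:
        # Upper half: the 256 vector registers.
        return f"v{code - 256}"
    if code < 128:
        # Lower quarter: scalar registers and other scalar-side state.
        return "scalar-side operand"
    # 128..255: the "special things".
    if code == 254:
        return "lds_direct"
    if code == 255:
        return "literal dword following the instruction"
    if code <= 192:
        return f"int {code - 128}"      # 0 .. 64
    if code <= 208:
        return f"int {192 - code}"      # -1 .. -16
    if 240 <= code <= 247:
        return f"float {INLINE_FLOATS[code - 240]}"
    return "other special (flags, reserved, ...)"


if __name__ == "__main__":
    for c in (0, 128, 193, 242, 254, 255, 256, 511):
        print(c, "->", classify_src0(c))
```

The point is only that lds_direct sits in the same selector space as the inline constants, so the instruction does not get any longer when you use it.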