1 Reply Latest reply on May 27, 2014 8:51 AM by realhet

    LDS Direct Read performance


      In http://developer.amd.com/wordpress/media/2012/12/AMD_Southern_Islands_Instruction_Set_Architecture.pdf

      at section 9.3.1


      LDS Direct reads occur in vector ALU (VALU) instructions and allow the LDS to supply a single DWORD value which is broadcast to all threads in the wavefront and is used as the SRC0 input to the ALU operations. A VALU instruction indicates that input is to be supplied by LDS by using the LDS_DIRECT for the SRC0 field.


      How many clock cycles of penalty does this have compared to using data that is already in a register?


      Do the ALUs have some hidden registers to receive the data for SRC0? If not, where is the broadcast data stored?

        • Re: LDS Direct Read performance


          src_lds_direct takes exactly the same amount of time as a vector or scalar register operand (measured with s_memtime).

          It works like broadcasting a scalar register to the whole wavefront, except that you can have up to 16KB of constants instead of only 103*4 bytes, and the ALU can still run at maximum utilization.


          SRC0 can select from 512 different sources: 256 vregs, 128 sregs, and 128 special values (I guess those also come from the scalar ALU). lds_direct is one of these specials. The others include many int and float constants, debug/trap registers, state flags, and even a code that marks immediate data stored right after the instruction dword.
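
To make the 512-way SRC0 selection concrete, here is a rough sketch of how a 9-bit SRC0 selector could be classified. The coarse split (256 vregs, 128 scalar-side registers, 128 specials) follows the description above. The function name `classify_src0` is hypothetical. The specific encodings 254 for LDS_DIRECT and 255 for the trailing literal are assumptions based on my reading of the operand table in the Southern Islands ISA document, as is placing the vregs in the upper half of the range. Please check these against the manual before relying on them.

```python
# Sketch only: coarse classification of the 9-bit SRC0 operand selector.
# The 128/128/256 split follows the forum reply; the two constants below
# are ASSUMED encodings and should be verified against the ISA manual.

LDS_DIRECT = 254  # assumed selector value for LDS_DIRECT
LITERAL = 255     # assumed selector value for "32-bit literal follows the instruction"


def classify_src0(code: int) -> str:
    """Return a coarse class name for a SRC0 selector (hypothetical helper)."""
    if not 0 <= code <= 511:
        raise ValueError("SRC0 is a 9-bit field: expected 0..511")
    if code >= 256:
        # Upper half of the range: one of the 256 vector registers.
        return f"vgpr{code - 256}"
    if code == LDS_DIRECT:
        return "lds_direct"
    if code == LITERAL:
        return "literal"
    if code >= 128:
        # Remaining specials: inline int/float constants, state flags, etc.
        return "special"
    # Lower 128 values: scalar registers and other scalar-side registers.
    return "scalar"
```

For example, `classify_src0(254)` yields `"lds_direct"` and `classify_src0(300)` yields `"vgpr44"`. This only models the operand encoding; it says nothing about timing, which is what the s_memtime measurement above addresses.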