3 Replies Latest reply on Mar 25, 2013 6:46 AM by realhet

    How to do wave shuffle?


      Hi boys and girls,


      I read the GCN ISA manual and found a DS_SWIZZLE instruction, which can exchange data between threads without touching LDS memory.


      But the instruction is not exposed in the OpenCL language of the AMD APP SDK. So how can I use it?


      It's a great feature, which is exactly the AMD version of the "warp shuffle" feature of NVIDIA's Kepler cards.

      So it would be good to be able to use it.
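
      To make the idea of a lane-to-lane exchange concrete, here is a minimal host-side sketch in Python. It is only a conceptual model: the wavefront is a plain list, and the helper names `xor_shuffle` and `wave_sum` are made up for illustration. It does not model the real DS_SWIZZLE pattern encoding or its reach, which are defined in the ISA manual.

```python
# Conceptual model only: a 64-lane wavefront as a Python list, one value per lane.
# xor_shuffle and wave_sum are hypothetical helper names, not part of any SDK.
WAVE_SIZE = 64  # GCN wavefront width


def xor_shuffle(lanes, mask):
    """Each lane i reads the value held by lane (i ^ mask); no scratch buffer is involved."""
    assert len(lanes) == WAVE_SIZE and 0 <= mask < WAVE_SIZE
    return [lanes[i ^ mask] for i in range(WAVE_SIZE)]


def wave_sum(lanes):
    """Butterfly reduction: after log2(64) = 6 exchange steps every lane holds the total."""
    vals = list(lanes)
    mask = WAVE_SIZE // 2
    while mask >= 1:
        partner = xor_shuffle(vals, mask)          # fetch the partner lane's value
        vals = [a + b for a, b in zip(vals, partner)]
        mask //= 2
    return vals


lanes = list(range(WAVE_SIZE))
swapped = xor_shuffle(lanes, 1)   # neighbouring lanes trade values
totals = wave_sum(lanes)          # every lane ends up with the same sum
```

      The reduction is the typical reason to want such an instruction: the same loop done through LDS would need a write, a barrier and a read per step.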


      Thank you.

        • Re: How to do wave shuffle?

          OpenCL is an open standard. It does not support this swizzling concept yet. It does not even expose the wavefront/warp yet.

          So you cannot use this feature in OpenCL.


          There are others who try to code in IL. They may be able to help you out here.

            • Re: How to do wave shuffle?

              Thank you, Himanshu.


              It's a pity that AMD didn't introduce any extensions for that. It's a waste, for sure.


              And for now, I don't want to learn AMD IL, which is expected to be deprecated soon. I'm waiting for the new HSA IL; maybe I could use that for shuffling.


              You have answered my question completely.

            • Re: How to do wave shuffle?

              That's a nice find!


              Although I don't know of any IL instruction that explicitly uses DS_Swizzle.

              I checked whether there are other new instructions, and found some undocumented gems (introduced sometime after cat11.12):

              96bit, 128bit (continuous) DS_ instructions with one offset.


              v_floor/ceil/trunc for f64

              s_cbranch_debug_system, s_cbranch_debug_user : Maybe this is the equivalent of Windows's one-byte "int 3" debug breakpoint.

              ds_wrap_rtn_b32 : another complex ds operation

              v_mad_i64_i32  ->  64bit(32bit * 32bit) + 64bit. Now that's great for 64bit address arithmetic. I guess it takes only 4 cycles and is built by reusing some parts of the f64 unit. With mul_lo, mul_hi, add, addc it would take 10 cycles.

              flat_* : Memory IO operations: I think it only needs a flat 64bit address, but IDK...
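
              To show what the v_mad_i64_i32 line above means arithmetically, here is a small Python sketch that compares a single 64bit multiply-add against the same result built from the four 32bit steps (mul_lo, mul_hi, add, addc). The function names are made up for illustration, the signed interpretation is taken from the "i64_i32" in the mnemonic, and nothing here models cycle counts.

```python
# Arithmetic sketch only; mad_i64_i32 / mad_via_32bit_ops are hypothetical helpers.
MASK32 = 0xFFFFFFFF


def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement signed integer."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


def mad_i64_i32(a, b, c):
    """Reference: signed 32x32 -> 64bit product plus a 64bit addend, wrapped to 64 bits."""
    return to_signed(to_signed(a, 32) * to_signed(b, 32) + to_signed(c, 64), 64)


def mad_via_32bit_ops(a, b, c):
    """Same result assembled from four 32bit steps: mul_lo, mul_hi, add, addc."""
    sa, sb = to_signed(a, 32), to_signed(b, 32)
    lo = (sa * sb) & MASK32                  # mul_lo: low 32 bits of the product
    hi = ((sa * sb) >> 32) & MASK32          # signed mul_hi: high 32 bits of the product
    c_lo, c_hi = c & MASK32, (c >> 32) & MASK32
    sum_lo = lo + c_lo                       # add: may carry out of bit 31
    carry = sum_lo >> 32
    sum_hi = (hi + c_hi + carry) & MASK32    # addc: add high halves plus the carry
    return to_signed((sum_hi << 32) | (sum_lo & MASK32), 64)


# Address arithmetic example: base + index * stride with a 64bit base.
address = mad_i64_i32(1000, 16, 0x100000000)
```

              In the address example, a 32bit index and a 32bit stride are combined with a 64bit base in one step, which is the use case guessed at above.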