10 Replies Latest reply on Feb 24, 2012 10:31 AM by sarnath

    AMD IL - Instruction Schedule

    sarnath

      Hi,

      Is the instruction schedule in the IL code the final one?

      Or will the schedule change as the code passes through subsequent compilation phases?

       

      I have code that shows a lot of dependency chains in the IL.

      I am not sure whether this is a performance bottleneck.

      How can I verify this?

       

      Thanks for any info,

      Best Regards,

      Sarnath

        • Re: AMD IL - Instruction Schedule
          MicahVillmow

          Analyze the ISA and not the IL for the final instruction schedule or to determine bottlenecks.

          • Re: AMD IL - Instruction Schedule
            sarnath

            Thanks for the answers, both of you. I went through the ISA as well as the Cayman ISA documentation, and I have reason to believe that the instruction sequence below suffers heavily from dependency-chain problems.

             

            The "PV" operand clearly indicates that instructions may have to wait for previous instructions to complete, leaving the ALU pipeline idle most of the time.

            I am new to the AMD architecture. Can somebody enlighten me? Thanks much!

             

                    112  z: MULADD_e    R127.z,  R17.x,  R2.x,  R29.w
                         w: MULADD_e    R127.w,  R13.x,  R2.x,  R7.z      VEC_210
                    113  x: MULADD_e    R127.x,  R13.y,  R2.y,  PV112.w
                         y: MULADD_e    R127.y,  R15.x,  R2.x,  R7.x
                         z: MULADD_e    R127.z,  R17.y,  R2.y,  PV112.z      VEC_210
                    114  x: MULADD_e    R127.x,  R17.z,  R2.z,  PV113.z
                         y: MULADD_e    R127.y,  R15.y,  R2.y,  PV113.y
                         z: MULADD_e    R127.z,  R13.z,  R2.z,  PV113.x      VEC_210
                         w: MULADD_e    R127.w,  R19.x,  R2.x,  R8.x
                    115  x: MULADD_e    R127.x,  R17.w,  R2.w,  PV114.x
                         y: MULADD_e    R127.y,  R13.w,  R2.w,  PV114.z      VEC_210
                         z: MULADD_e    R127.z,  R19.y,  R2.y,  PV114.w
                         w: MULADD_e    R127.w,  R15.z,  R2.z,  PV114.y
                    116  x: MULADD_e    R127.x,  R15.w,  R2.w,  PV115.w
                         y: MULADD_e    R127.y,  R16.x,  R3.x,  PV115.x
                         z: MULADD_e    R127.z,  R12.x,  R3.x,  PV115.y      VEC_210
                         w: MULADD_e    R127.w,  R19.z,  R2.z,  PV115.z
                    117  x: MULADD_e    R127.x,  R14.x,  R3.x,  PV116.x
                         y: MULADD_e    R127.y,  R12.y,  R3.y,  PV116.z
                         z: MULADD_e    R127.z,  R16.y,  R3.y,  PV116.y      VEC_210
                         w: MULADD_e    R127.w,  R19.w,  R2.w,  PV116.w
                    118  x: MULADD_e    R0.x,  R16.z,  R3.z,  PV117.z
                         y: MULADD_e    R127.y,  R14.y,  R3.y,  PV117.x
                         z: MULADD_e    R127.z,  R18.x,  R3.x,  PV117.w
                         w: MULADD_e    R127.w,  R12.z,  R3.z,  PV117.y      VEC_210
                    119  x: MULADD_e    R127.x,  R18.y,  R3.y,  PV118.z
                         y: MULADD_e    R127.y,  R21.x,  R2.x,  R28.y
                         z: MULADD_e    R7.z,  R12.w,  R3.w,  PV118.w
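            One way to check the dependency pattern mechanically, rather than by eye, is to scan the listing for `PV<n>.<c>` operands and record which ALU group each group reads from. The following is only a minimal sketch: the function name `pv_dependencies` and the regular expressions are hypothetical helpers written for this listing format, not part of any AMD tool, and the sample text is just the first three groups copied from the listing above.

```python
import re

# First three ALU groups (112-114) copied from the ISA listing above.
ISA = """
112  z: MULADD_e    R127.z,  R17.x,  R2.x,  R29.w
     w: MULADD_e    R127.w,  R13.x,  R2.x,  R7.z      VEC_210
113  x: MULADD_e    R127.x,  R13.y,  R2.y,  PV112.w
     y: MULADD_e    R127.y,  R15.x,  R2.x,  R7.x
     z: MULADD_e    R127.z,  R17.y,  R2.y,  PV112.z      VEC_210
114  x: MULADD_e    R127.x,  R17.z,  R2.z,  PV113.z
     y: MULADD_e    R127.y,  R15.y,  R2.y,  PV113.y
     z: MULADD_e    R127.z,  R13.z,  R2.z,  PV113.x      VEC_210
     w: MULADD_e    R127.w,  R19.x,  R2.x,  R8.x
"""

# Hypothetical patterns for this dump format (an assumption, not an AMD spec).
GROUP_RE = re.compile(r"^\s*(\d+)\s+[xyzwt]:")       # line that opens a new ALU group
SLOT_RE = re.compile(r"^\s*(?:\d+\s+)?([xyzwt]):")   # any line that fills a VLIW slot
PV_RE = re.compile(r"PV(\d+)\.([xyzw])")             # operand read from an earlier group


def pv_dependencies(listing):
    """Map each ALU group number to its slot count and the groups it reads via PV."""
    groups = {}
    current = None
    for line in listing.splitlines():
        opener = GROUP_RE.match(line)
        if opener:
            current = int(opener.group(1))
            groups[current] = {"slots": 0, "reads_from": set()}
        if current is None or not SLOT_RE.match(line):
            continue
        groups[current]["slots"] += 1
        for src_group, _channel in PV_RE.findall(line):
            groups[current]["reads_from"].add(int(src_group))
    return groups


deps = pv_dependencies(ISA)
for group, info in sorted(deps.items()):
    print(group, info["slots"], sorted(info["reads_from"]))
```

            For the three groups shown, each group after the first reads only from the group immediately before it, which is the back-to-back chain described in the post. Whether that chain actually stalls the ALU depends on how many wavefronts are resident, which the listing alone does not tell you.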

              • Re: AMD IL - Instruction Schedule
                sarnath

                Almost every ALU group has 4 instructions in it (the exceptions are the first, second and last groups).

                So these must be combined into a VLIW packet and issued straightaway in a single cycle.

                Moreover, each ALU group has a dependency on the previous ALU group.

                Let us assume that a wavefront would be issued in 4 cycles (as 4 quarter wave-fronts).

                By the end of 4 cycles the first ALU group1 would have been scheduled for all 64 threads.

                 

                My kernel's workgroup size is 256, and the kernel is very register-intensive.

                32 registers per thread. 256*32 = 8K.

                Since 8K is such a round number, the number of active wavefronts could either be just 4 or 8. (Is there a way to figure this out?)

                Assuming 4 active wave-fronts, the CU would schedule group1 within 16 cycles.

                At the beginning of the 17th cycle, "dependencies" start playing up and will stall the wavefronts.

                Now, the question is, will the GPU finish a MULADD instruction within 16 cycles or not?

                 

                If the number of active wavefronts is 8, then I would have 32 cycles of leisure time before dependencies start playing up.

                I am not too sure how deep the pipeline is or how much latency a MULADD instruction has.

                 

                Can somebody throw light? Thanks!

                  • Re: AMD IL - Instruction Schedule
                    MicahVillmow

                    As long as he has two wavefronts active, his code will not stall the machine on a per-ALU-cycle basis. The instruction latency is 8 cycles, and each wavefront executes 64 work-items over 4 cycles (16 work-items per cycle), so two wavefronts cover the ALU latency. The problem comes not from the ALU latency, which in his case is pretty well packed, but at the clause boundary.
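                    The arithmetic in this reply can be written out as a small calculation. This is only a back-of-the-envelope sketch using the figures quoted above (8-cycle ALU latency, 64 work-items per wavefront, 16 work-items issued per cycle); the function names are hypothetical, and the model ignores clause switches and memory latency entirely.

```python
import math


def issue_cycles_per_wavefront(wavefront_size=64, lanes_per_cycle=16):
    """Cycles needed to issue one ALU group for a whole wavefront."""
    return wavefront_size // lanes_per_cycle


def wavefronts_to_hide_latency(alu_latency_cycles=8, wavefront_size=64, lanes_per_cycle=16):
    """Smallest number of interleaved wavefronts that covers the ALU latency."""
    per_wave = issue_cycles_per_wavefront(wavefront_size, lanes_per_cycle)
    return math.ceil(alu_latency_cycles / per_wave)


def idle_cycles_per_group(active_wavefronts, alu_latency_cycles=8,
                          wavefront_size=64, lanes_per_cycle=16):
    """Cycles the ALU would sit idle per dependent group in this simplified model."""
    per_wave = issue_cycles_per_wavefront(wavefront_size, lanes_per_cycle)
    return max(0, alu_latency_cycles - active_wavefronts * per_wave)


print(wavefronts_to_hide_latency())   # wavefronts needed with the quoted figures
print(idle_cycles_per_group(1))       # a single wavefront leaves a gap
print(idle_cycles_per_group(4))       # 4 active wavefronts leave none
```

                    With the quoted numbers, two wavefronts are enough, so a kernel with 4 or 8 active wavefronts has latency to spare as far as back-to-back ALU dependencies are concerned.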

                      • Re: AMD IL - Instruction Schedule
                        sarnath

                        Hi Micah,

                        Thanks for your answer. Great to know that! So dependency is not killing the pipe here, and all of my performance bottleneck is coming from memory, which is quite understandable. What I have is a strided memory-access pattern.

                        I did optimize for cache and got around an 80% cache-hit rate in the profiler. But my ALU is still not busy, and I do not have a dependency issue either. How do I interpret that? Any clues?

                         

                        Hi Lihan,

                        Thanks for the tip. I am running Linux, and sprofile does not show the occupancy. I think the Windows version has this feature. Anyway, thanks a lot for pointing me to that. It is going to be useful to me someday.

                         

                        Thanks all of you,

                        Best Regards,

                        Sarnath

                      • Re: AMD IL - Instruction Schedule
                        lbin

                        You can use APP Profiler to find out the theoretical number of active waves per CU on pre-SI hardware. The calculation is based on GPR usage, LDS usage and work-group size.

                         

                        Note that APP Profiler occupancy calculator doesn't work with Catalyst 12.1.
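                        For readers without access to the occupancy calculator, a rough estimate from GPR usage and work-group size alone can be sketched as below. Everything here is an assumption for illustration: the function name is hypothetical, the default budget of 256 GPRs per work-item lane is an assumed figure for pre-SI hardware that should be checked against the documentation for the actual device, and LDS usage and any hardware cap on resident wavefronts are ignored. The profiler's own number remains the authoritative one.

```python
def estimate_active_wavefronts(gprs_per_work_item, work_group_size,
                               gpr_budget_per_lane=256, wavefront_size=64):
    """Rough GPR-limited wavefront count per SIMD, rounded to whole work-groups.

    gpr_budget_per_lane=256 is an ASSUMED pre-SI figure, not taken from the thread.
    LDS usage and hardware wavefront limits are deliberately ignored.
    """
    if gprs_per_work_item <= 0 or work_group_size <= 0:
        raise ValueError("register count and work-group size must be positive")
    # Wavefronts that fit if registers are the only limiting resource.
    by_gpr = gpr_budget_per_lane // gprs_per_work_item
    # A work-group is scheduled as a unit, so round down to whole work-groups.
    waves_per_group = -(-work_group_size // wavefront_size)  # ceiling division
    whole_groups = by_gpr // waves_per_group
    return whole_groups * waves_per_group


# Figures from the thread: 32 registers per work-item, work-group size 256.
print(estimate_active_wavefronts(32, 256))
```

                        Under these assumptions the kernel discussed in this thread would land on the higher of the two guesses made earlier (8 rather than 4), which is comfortably above the two wavefronts needed to cover ALU latency.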