3 Replies Latest reply on Feb 24, 2015 11:39 PM by realhet

    Comprehensive instruction latency table

    Meteorhead

      Hi!

       

      I am creating slides for a university course and want to compare instruction latencies on CPUs and GPUs. The OpenCL Optimization Guide contains only a very short table of VLIW instruction latencies. Is there anywhere I could find a comprehensive table of instruction latencies for VLIW4, VLIW5, GCN 1.0, etc. on various hardware? The same goes for Bulldozer-derived CPUs. Intel has very nice documentation on instruction latencies for its hardware, but I have failed to find the counterpart from AMD.

       

      Does anyone have a clue?

        • Re: Comprehensive instruction latency table
          maxdz8

          What a coincidence, I have been wondering about the same thing these days.

           

          Have you looked at the ISA-related documents? I have a file, "AMD_Southern_Islands_Instruction_Set_Architecture1.pdf", which seems the most likely place to find this information, but I never really looked at the details, as I don't plan to go lower level than CL any time soon.

            • Re: Comprehensive instruction latency table
              Meteorhead

              Yes, I checked, but that document describes the ISA, not the hardware implementing it. I went through both the Southern Islands and the Sea Islands documentation, but only the simplest operations have latencies listed, and INTADD is 1 cycle on any post-1980 architecture anyway. I was much more curious about transcendental operations like sin() or tan().

            • Re: Comprehensive instruction latency table
              realhet

              Hi,

              I see you have found docs for the CPU, but maybe you'll find this PDF even better: http://www.agner.org/optimize/instruction_tables.pdf

               

              But on the GPU, the ideal-case instruction speeds really are as simple as stated in the OpenCL guide.

              Beyond the ideal instruction speeds: on VLIW, clauses and loops can degrade the ideal throughput, and on GCN some vector/scalar instruction combinations can introduce stalls. Oh, and there are some special GCN instructions that deal with cycles and latencies: s_memtime, s_waitcnt, s_sleep. Conditional and unconditional jumps are super fast on GCN: loop overhead is only 1 cycle (on VLIW it is 40 cycles or so).
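
              To get a feel for what those loop-overhead figures mean in practice, here is a rough back-of-the-envelope sketch. It takes the two overhead numbers quoted above (1 cycle on GCN, roughly 40 on VLIW) at face value and assumes a purely hypothetical loop body of 10 cycles; the function names and the body size are made up for illustration and do not come from any AMD documentation.

```python
def cycles_per_iteration(body_cycles: int, loop_overhead_cycles: int) -> int:
    """Total cycles spent on one loop iteration: useful work plus loop overhead."""
    return body_cycles + loop_overhead_cycles


def overhead_fraction(body_cycles: int, loop_overhead_cycles: int) -> float:
    """Share of each iteration that is spent on loop overhead rather than work."""
    total = cycles_per_iteration(body_cycles, loop_overhead_cycles)
    return loop_overhead_cycles / total


# Hypothetical loop body size, chosen only for illustration.
BODY_CYCLES = 10

# Overhead figures as quoted in the post above (the VLIW one is approximate).
LOOP_OVERHEAD = {"GCN": 1, "VLIW": 40}

if __name__ == "__main__":
    for arch, overhead in LOOP_OVERHEAD.items():
        total = cycles_per_iteration(BODY_CYCLES, overhead)
        share = overhead_fraction(BODY_CYCLES, overhead)
        print(f"{arch}: {total} cycles per iteration, {share:.0%} of it loop overhead")
```

              Under these assumptions a short loop body is dominated by overhead on VLIW, which is why unrolling matters so much more there. Real numbers will depend on the actual kernel and on how the compiler arranges clauses.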