Archives Discussions

Meteorhead · ‎02-24-2015

Hi!

I am creating slides for a university course and I was looking to compare various instruction latencies on CPUs and GPUs. Inside the OpenCL Optimization Guide, there is a very short table of VLIW instruction latencies. Is there any place where I could find a comprehensive table of instruction latencies for VLIW4, VLIW5, GCN 1.0, etc. on various HW? Same goes for Bulldozer-derived CPUs. Intel has very nice documentation on instruction latencies in their HW, but I fail to find the counterpart from AMD's side.

Anyone have a clue?

realhet · ‎02-24-2015

Hi,

I see you have found docs for the CPU, but maybe you'll find this PDF even better: http://www.agner.org/optimize/instruction_tables.pdf

But on the GPU, the ideal-case instruction speeds really are as simple as stated in the OCL Guide.

In addition to the ideal instruction speeds: on VLIW, the clauses/loops can degrade the ideal throughput. And on GCN there are some vector/scalar instruction combinations that can introduce stalls. Oh, and there are some special GCN instructions dealing with cycles/latencies: s_memtime, s_waitcnt, s_sleep. Conditional/unconditional jumps are super fast on GCN: loop overhead is only 1 cycle. (Loops on VLIW: 40 cycles or something.)
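To get a feel for what those loop overhead figures mean in practice, here is a rough back-of-the-envelope sketch in Python. It assumes a deliberately simplified model that is not taken from any AMD document: each iteration costs a fixed number of useful cycles plus a fixed loop overhead, and nothing overlaps. The function name `loop_efficiency` and the constants are made up for illustration; the 1-cycle and 40-cycle values are just the figures quoted above, and the VLIW one is only a rough estimate.

```python
def loop_efficiency(body_cycles: int, overhead_cycles: int) -> float:
    """Fraction of an iteration's cycles spent on useful work.

    Simplified model (an assumption, not a hardware spec): one iteration
    costs body_cycles of useful work plus overhead_cycles for the loop
    itself, with no overlap between the two.
    """
    if body_cycles <= 0 or overhead_cycles < 0:
        raise ValueError("body_cycles must be > 0 and overhead_cycles >= 0")
    return body_cycles / (body_cycles + overhead_cycles)


# Overhead figures quoted in the post; the VLIW value is a rough estimate.
GCN_LOOP_OVERHEAD = 1
VLIW_LOOP_OVERHEAD = 40

if __name__ == "__main__":
    for body in (10, 100, 1000):
        gcn = loop_efficiency(body, GCN_LOOP_OVERHEAD)
        vliw = loop_efficiency(body, VLIW_LOOP_OVERHEAD)
        print(f"body={body:5d} cycles  GCN {gcn:.1%}  VLIW {vliw:.1%}")
```

Under this model a 10-cycle loop body keeps about 91% of the ideal throughput with a 1-cycle overhead, but only 20% with a 40-cycle overhead, which is why short loops hurt so much more on VLIW. Real hardware can hide some of this, so treat the numbers as an illustration only.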

maxdz8 · ‎02-24-2015

What a coincidence, I have been wondering about the same thing these days.

Have you looked at the ISA-related documents? I have a file "AMD_Southern_Islands_Instruction_Set_Architecture1.pdf" which seems more likely to include this information, but I never really looked at the details as I don't plan to go lower level than CL any time soon.

Meteorhead · ‎02-24-2015

Yes, I checked, but that document is about the ISA, not about the HW implementing the ISA. I checked both the Southern Islands and the Sea Islands documentation, and only the simplest operations had latencies listed. But INTADD is 1 cycle on any post-1980 architecture; I was a lot more curious about the transcendental operations like sin() or tan().

Comprehensive instruction latency table