What a coincidence, I have been wondering about the same thing these days.
Have you looked at the ISA-related documents? I have a file, "AMD_Southern_Islands_Instruction_Set_Architecture1.pdf", which seems more likely to include this information, but I never really looked at the details, as I don't plan to go lower level than CL any time soon.
Yes, I checked, but that document describes the ISA, not the hardware implementing it. I checked both the Southern Islands and the Sea Islands documentation, and only the simplest operations had latencies listed. That doesn't tell me much, since INTADD is 1 cycle on any post-1980 architecture. I was a lot more curious about the transcendental operations like sin() or tan().
I see you have found docs for the CPU, but maybe you'll find this PDF even better: http://www.agner.org/optimize/instruction_tables.pdf
But on the GPU, the ideal-case instruction speeds really are as simple as stated in the OCL Guide.
In addition to the ideal instruction speeds: on VLIW, the clauses/loops can degrade the ideal throughput. On GCN, some vector/scalar instruction combinations can introduce stalls. Oh, and there are some special GCN instructions dealing with cycles/latencies: s_memtime, s_waitcnt, s_sleep. Conditional and unconditional jumps are super fast on GCN: loop overhead is only 1 cycle. (Loops on VLIW cost 40 cycles or so.)
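To put the loop overhead figures above in perspective, here is a rough back-of-the-envelope sketch in Python. It only takes the two overheads quoted above (1 cycle on GCN, roughly 40 on VLIW) and computes what fraction of each iteration is useful work. The function name and the loop body sizes are made up for illustration, and the model assumes the overhead is not hidden by other wavefronts, which real hardware may well do.

```python
def loop_efficiency(body_cycles, overhead_cycles):
    """Fraction of one loop iteration spent on useful work.

    Simplified model: every iteration pays the full overhead and
    nothing overlaps with it (an assumption, not a measured fact).
    """
    return body_cycles / (body_cycles + overhead_cycles)


# Per-iteration loop overheads as quoted in the post above.
GCN_OVERHEAD = 1
VLIW_OVERHEAD = 40  # "40 cycles or so", so treat this as approximate

# Hypothetical loop body sizes, in cycles.
for body in (10, 100, 1000):
    gcn = loop_efficiency(body, GCN_OVERHEAD)
    vliw = loop_efficiency(body, VLIW_OVERHEAD)
    print(f"body={body:4d} cycles  GCN {gcn:.1%}  VLIW {vliw:.1%}")
```

Under this model, short loop bodies lose most of their throughput on VLIW, which is one reason to unroll small loops there, while on GCN the overhead hardly matters even for tiny bodies.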