v_sqrt_f32 took 4 ticks on an HD7970 (where 1 tick is the time of a simple instruction like v_mul_f32; 1 tick = 4 hardware clock cycles). And every lane in a CU can do it simultaneously. Not sure if it's 1/4 SP rate or exactly DP rate...
FYI here are some more:
v_sin_f32 -> 4
v_mul_lo_i32 -> 4 (DP rate)
v_cvt_f32_i32, v_cvt_i32_f32 -> 1 (it's fast on GCN)
v_rcp_f32 -> 4
v_rsq_f32 -> 4
s_nop 7 -> 8 (so it works)
buffer_atomic_umax v0, v1, s[4:7], 0 offen -> 1 (the whole kernel writes to the same place, so it's always cached, and it's faaaaaaast)
"How can one tell..."
If you have a lot of free time you can fiddle with the ISA and measure it with the s_memtime instruction (which is like QueryPerformanceCounter on Windows).
Or just write a very big kernel filled with the instruction under test and compare its time against simple, well-known instructions. You can do this because there are no penalties based on V instruction ordering. There is a penalty when you interleave many S instructions with V instructions AND use more than 83 vregs, but OpenCL does not 'abuse' the S ALU that much, so it's not a problem.
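A rough sketch of the comparison described above, in Python. It assumes you have already timed two kernels that are identical except for the repeated instruction (same instruction count, same launch size). The function name, the `overhead` parameter and the timings are made up for illustration; they are not measurements.

```python
def relative_ticks(t_test, t_ref, ref_ticks=1.0, overhead=0.0):
    """Estimate the cost of the inspected instruction in ticks.

    t_test    -- run time of the kernel filled with the inspected instruction
    t_ref     -- run time of the same kernel filled with the reference
                 instruction (e.g. v_mul_f32)
    ref_ticks -- known cost of the reference instruction (1 tick here)
    overhead  -- fixed time shared by both kernels (launch, loop control);
                 an assumption, measure it with an empty kernel if needed
    """
    ref_time = t_ref - overhead
    if ref_time <= 0:
        raise ValueError("reference time must exceed the overhead")
    return ref_ticks * (t_test - overhead) / ref_time


# Hypothetical timings in milliseconds, not real measurements.
t_mul = 2.0   # kernel full of v_mul_f32
t_sqrt = 8.0  # same kernel full of v_sqrt_f32
print(relative_ticks(t_sqrt, t_mul))  # 4.0
```

The ratio only means something if the kernel is long enough that launch overhead is negligible, or if you subtract it as above.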
I wonder why AMD wouldn't document these...
So if v_sqrt_f32 takes 4 clock cycles, am I understanding correctly that with 2048 ALUs (32 x 4 x 16) at 1 GHz it does roughly 512 GFLOPS? Is there a faster sqrt instruction that uses fewer clock cycles (for float)?
Yes, about half a TFLOP/s.
No, v_sqrt_f32 is the only one; there's no faster but less precise version of it, as far as I know.
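For reference, the half-TFLOP figure can be checked with a few lines of Python. This is only the arithmetic from the question (32 CUs x 4 SIMDs x 16 lanes at 1 GHz, one result per lane every 4 clocks); the function name is made up, and it gives a theoretical peak, not a measured rate.

```python
def peak_ops_per_second(num_cus, simds_per_cu, lanes_per_simd, clock_hz, cost):
    """Theoretical peak rate for an instruction that costs `cost` ticks,
    where a 1-tick instruction retires one result per lane per clock."""
    alus = num_cus * simds_per_cu * lanes_per_simd
    return alus * clock_hz / cost


GHZ = 1_000_000_000

# HD7970 figures as given in the question above.
sqrt_rate = peak_ops_per_second(32, 4, 16, 1 * GHZ, cost=4)  # v_sqrt_f32
mul_rate = peak_ops_per_second(32, 4, 16, 1 * GHZ, cost=1)   # v_mul_f32

print(sqrt_rate / 1e9)  # 512.0  (G sqrt/s)
print(mul_rate / 1e9)   # 2048.0 (G mul/s)
```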
Why doesn't AMD document this stuff much?
I think there is not much profit they can get out of this low-level stuff. They would rather spend time on high-level things so they can address the masses.
Well, both Intel and AMD publish latency / throughput figures for CPU instructions, and Nvidia also has figures for its GPUs. This information is fairly important for calculating how efficiently our programs run compared to the theoretical limits of the devices.
You know, the philosophy is like this: the OpenCL compiler knows it better, you don't have to bother with it, you're more productive.
I'm only sad because there's no official way to use ISA code, in the way traditional x86 languages have intrinsics. Although I know it's quite impossible to do with the current OpenCL->LLVM->AMD_IL->ISA chain. And if you go ISA only, then you have to know the whole undocumented elf_in_an_elf file format. But this excellent hw architecture is worth the effort.