Hi,

How can one tell how many clock cycles an ISA operation takes, for example v_sqrt_f32, assuming there are no bottlenecks? And how many of them can a compute unit run simultaneously? I found some documentation, but none of it had figures showing execution speeds. Why not?

Thanks,

Evren

Hi,

v_sqrt_f32 took 4 ticks on an HD7970, where 1 tick is the time of a simple instruction like v_mul_f32 (1 tick = 4 hardware clock cycles). Every lane in a CU can do it simultaneously. I'm not sure whether that's 1/4 of the SP rate or exactly the DP rate...

FYI here are some more:

v_sin_f32 -> 4

v_mul_lo_i32 -> 4 (DP rate)

v_cvt_f32_i32, v_cvt_i32_f32 -> 1 (it's fast on GCN)

v_rcp_f32 -> 4

v_rsq_f32 -> 4

s_nop 7 -> 8 (so it works)

buffer_atomic_umax v0, v1, s[4:7], 0 offen -> 1 (the whole kernel writes to the same place, so it's always cached, and it's faaaaaaast )
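To put the tick counts above into hardware clocks, here is a rough sketch. The tick values and the "1 tick = 4 hardware clock cycles" factor come from this post; the 64 lanes per CU (4 SIMDs x 16 lanes on GCN) is my assumption for the throughput column, and the helper names are made up for illustration. s_nop and the atomic are left out because they are not plain per-lane ALU ops.

```python
# Ticks per wavefront instruction, copied from the list above (HD7970).
TICKS = {
    "v_mul_f32": 1,      # the reference "simple" instruction
    "v_sqrt_f32": 4,
    "v_sin_f32": 4,
    "v_mul_lo_i32": 4,
    "v_cvt_f32_i32": 1,
    "v_cvt_i32_f32": 1,
    "v_rcp_f32": 4,
    "v_rsq_f32": 4,
}

CLOCKS_PER_TICK = 4   # from the post: 1 tick = 4 hardware clock cycles
LANES_PER_CU = 64     # ASSUMPTION: GCN CU = 4 SIMDs x 16 lanes


def hardware_clocks(ticks):
    """Hardware clock cycles one wavefront instruction occupies its SIMD."""
    return ticks * CLOCKS_PER_TICK


def lane_results_per_clock(ticks):
    """Per-lane results a whole CU finishes per clock, with all SIMDs busy."""
    return LANES_PER_CU / ticks


for name, ticks in TICKS.items():
    print(f"{name:14s} {hardware_clocks(ticks):3d} clocks "
          f"{lane_results_per_clock(ticks):5.1f} results/clock/CU")
```

Under those assumptions v_sqrt_f32 comes out at 16 hardware clocks per wavefront instruction, or 16 results per clock per CU with all lanes busy.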

"How can one tell..."

If you have a lot of free time, you can fiddle with the ISA and measure it with the s_memtime instruction (which is like QueryPerformanceCounter on Windows).

Or just write a very big kernel filled with the instruction under inspection and compare its run time against simple, well-known instructions. You can do this because there are no penalties based on the ordering of V instructions. There is a penalty when you use many S instructions interleaved with V instructions AND use more than 83 vregs, but OpenCL doesn't 'abuse' the S ALU that much, so it's not a problem.
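The comparison itself is just a ratio. A minimal sketch, assuming you timed three kernels with the same number of unrolled instructions: one full of the inspected instruction, one full of v_mul_f32, and an empty one to subtract the launch overhead. The function name and the timings are hypothetical placeholders, not measurements.

```python
def estimate_ticks(t_test, t_ref, t_empty=0.0, ref_ticks=1.0):
    """Cost of the inspected instruction in ticks, relative to a reference.

    t_test  -- run time of the kernel full of the inspected instruction
    t_ref   -- run time of the same kernel full of the reference instruction
    t_empty -- run time of an empty kernel (launch overhead), optional
    """
    test = t_test - t_empty
    ref = t_ref - t_empty
    if ref <= 0:
        raise ValueError("reference kernel must take longer than the empty one")
    return ref_ticks * test / ref


# Placeholder timings in milliseconds, NOT real measurements.
t_empty = 0.5    # empty kernel
t_mul = 10.5     # kernel full of v_mul_f32 (1 tick by definition)
t_sqrt = 40.5    # kernel full of v_sqrt_f32

print(estimate_ticks(t_sqrt, t_mul, t_empty))  # 4.0
```

Any time unit works as long as all three timings use the same one, since only the ratio matters.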