thanks for listening. here's the riddle:
i have written an opencl kernel that is 99% alu bound and uses only bit-operation instructions
(v_xor, v_not, v_or, v_bfi).
it makes no memory accesses (including lds) and uses only v_ instructions, with the exception of s_cmp, s_add
and s_branch for the for loop. there are 100 alu instructions in the loop body, but the measurements
(valu utilization, aka VALUBusy) do not change if there are 1000.
the loop is executed 10000 times per kernel invocation.
there are minimal read after write conflicts and all read after write accesses to registers happen in
the loop body is a mix of ~70 VOP3 (v_bfi) instructions and ~30 VOP2 (and, or, xor).
the whole kernel program size is a mere 3k (fits instruction cache).
there is no thread divergence.
i cannot determine whether the 8 wavefronts per CU are executed concurrently or sequentially (4, then 4),
because i don't see how i can access s_memtime from opencl.
yet sprofile reports a VALUBusy of only 60%. doing the math myself (number of instructions vs. kernel time),
i come to the same conclusion.
i am curious what a kernel with more than 90% VALUBusy looks like. any examples?