I'm planning to do an ISA optimization on my existing kernel, and I'm wondering if anybody has some info on how to feed the S and V ALUs to achieve maximum utilization.
On the GCN .ppt slides I've seen that there is an instruction 'arbitrator' which feeds 4 vector ALUs and 1 scalar ALU. They also say that the S ALU runs 4x faster than the V ALUs, which in total gives a 1:1 V:S ALU operation capacity.
For example: what if I interleave independent V_ and S_ instructions in a pattern like SVSVSVSV?
Will the instruction decoder in the SIMD engine be able to feed all four V ALUs every 4 cycles while also feeding the S ALU with one instruction per cycle (one for each of the 4 V ALUs)?
What if I use larger VOP3 instructions and/or 32-bit immediate constants, so the instruction decoder has to look more dwords ahead? Are there any specifications on the capabilities/limitations of the instruction decoder/arbitration unit?
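For illustration, the SVSV interleave I have in mind would look something like this (a sketch only; the opcodes and register numbers are made up, it's just the issue pattern that matters):

```
s_add_i32  s10, s10, s11   ; S - independent scalar op
v_add_f32  v4, v4, v5      ; V - independent vector op
s_and_b32  s12, s12, s13   ; S
v_mul_f32  v6, v6, v7      ; V
s_lshl_b32 s14, s14, 2     ; S
v_mac_f32  v8, v9, v10     ; V
```

The question is whether the arbitration unit can actually co-issue pairs like these from the same wavefront, or whether the scalar slots only get filled from other wavefronts on the CU.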
Why am I doing this: I have a quite big kernel (25KB on the 7970) which runs at 98% ALU utilization on the 6970, but unfortunately on the 7970 it runs out of VRegs. Above 128 vgprs I noticed a 'task scheduler' bottleneck: the hardware can't put 2 kernels in the queues, so when a task is done, the SIMD engines can't immediately start another one. That means a 30% performance loss in my case. My plan is to use more sgprs instead of some of the vgprs to get below the 128 vgpr 'limit'. I can move many calculations over to S registers, since the values are the same for every 16 threads of the SIMD vector; I just have to learn how to effectively schedule the SOPs and VOPs to achieve maximum ALU utilization.
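As an example of the kind of rewrite I mean, a per-thread calculation whose inputs are uniform across the wavefront could be hoisted from the vector unit to the scalar unit (again a sketch with invented registers, assuming s1 already holds the uniform input):

```
; before: uniform value computed redundantly in every lane, burning a VGPR
v_lshlrev_b32 v2, 2, v1        ; v1 is the same in all lanes

; after: computed once per wavefront on the scalar unit, result in an SGPR
s_lshl_b32 s2, s1, 2           ; s1 holds the uniform input
```

Every such hoist frees a VGPR at the cost of an SGPR and an SOP, which is exactly why the S/V issue behavior matters to me.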
In my worst expectations there is no S/V parallelism at all, and every 4 cycles either 1 VOP or 1 SOP can be executed. That way there would be no need for hardware dependency checks or special compiler instruction reordering.
I also noticed the s_buffer_load_dwordx16 instruction, which is another reason to go down below CAL.
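If I understand the docs correctly, a single instruction like this could pull 16 dwords of constants into SGPRs at once (the register ranges and offset here are invented for illustration):

```
s_buffer_load_dwordx16 s[8:23], s[0:3], 0x0   ; 16 dwords from the buffer at s[0:3] into s8..s23
s_waitcnt lgkmcnt(0)                          ; wait for the scalar load to complete
```

That would make it much cheaper to keep wavefront-uniform data in SGPRs instead of VGPRs.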
Thank you for your answers!