7970 ISA Vector/Scalar instruction level parallelism

Question asked by realhet on Mar 20, 2012
Latest reply on Apr 7, 2012 by realhet



I'm planning to do an ISA-level optimization of my existing kernel, and I'm wondering if anybody has some info on how to feed the S and V ALUs to achieve maximum utilization.

On the GCN .ppt slides I've seen that there is an instruction 'arbitrator' which feeds 4 vector ALUs and 1 scalar ALU. The slides also say that the S ALU works 4x faster than the V ALUs. In total that gives a 1:1 V:S ALU operation capacity.
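
To make that 1:1 arithmetic explicit, here is a rough back-of-the-envelope sketch. It only restates the numbers I read off the slides (4 vector ALUs, each taking 4 cycles per vector instruction, and 1 scalar ALU taking one instruction per cycle); the function and parameter names are made up for illustration, and it says nothing about whether the hardware can really sustain both rates at once.

```python
def peak_issue_rates(num_vector_alus=4, cycles_per_vop=4, sops_per_cycle=1):
    """Peak instructions per cycle for one compute unit, under the slide numbers.

    All three defaults are assumptions taken from the GCN slides, not measurements.
    """
    # Each vector ALU finishes one VOP every `cycles_per_vop` cycles.
    vops_per_cycle = num_vector_alus / cycles_per_vop
    return vops_per_cycle, sops_per_cycle


vops, sops = peak_issue_rates()
print(f"VOPs/cycle: {vops}, SOPs/cycle: {sops}, V:S ratio: {vops / sops}")
```

If these numbers are right, the scalar ALU has exactly enough capacity for one SOP per VOP, which is why the SVSVSVSV pattern looks interesting.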


For example: what if I interleave independent V_ and S_ instructions in a pattern like SVSVSVSV?

Will the instruction decoder in the SIMD engine be able to feed all four V ALUs every 4 cycles while also feeding the S ALU with one instruction every cycle (one for each of the 4 V ALUs)?

What if I use larger VOP3 instructions and/or 32-bit immediate constants, so that the instruction decoder has to look more dwords ahead? Are there any specifications on the capabilities/limitations of the instruction decoder/arbitration unit?


Why am I doing this: I have a quite big kernel (25KB on 7970) which runs at 98% ALU utilization on the 6970, but unfortunately on the 7970 it runs out of VRegs. Above 128 vgprs I noticed a 'task scheduler' bottleneck, because it can't put 2 kernels in the queues: when a task is done, the SIMD engines can't immediately start another task. That means a 30% performance loss in my case. My plan is to use more sgprs instead of some of the vgprs to get below the 128 vgprs 'limit'. I can move many calculations out to S registers, whose values are the same for every 16 simd vector threads; I just have to learn how to schedule the SOPs and VOPs effectively to achieve maximum ALU utilization.
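
The 128 vgprs 'limit' can be sketched with a small occupancy estimate. This is only my working model: it assumes a budget of 256 VGPRs per lane per SIMD and a cap of 10 resident wavefronts per SIMD (both are assumptions on my side, not something confirmed in this thread), and it ignores register allocation granularity. The names are hypothetical.

```python
# Assumed hardware limits (not confirmed here): 256 VGPRs per lane per SIMD,
# at most 10 resident wavefronts per SIMD. Allocation granularity is ignored.
VGPR_BUDGET = 256
MAX_WAVES_PER_SIMD = 10


def waves_per_simd(vgprs_per_thread):
    """Estimate how many wavefronts fit on one SIMD for a given VGPR usage."""
    if vgprs_per_thread < 1:
        raise ValueError("kernel must use at least one VGPR")
    return min(MAX_WAVES_PER_SIMD, VGPR_BUDGET // vgprs_per_thread)


for vgprs in (129, 128, 84, 64):
    print(vgprs, "vgprs ->", waves_per_simd(vgprs), "wavefront(s) per SIMD")
```

Under these assumptions, anything above 128 vgprs leaves room for only one wavefront per SIMD, so there is nothing queued to start when the current one finishes. That would match the bottleneck I see, and it is why moving values into sgprs to get to 128 or below should help.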


In my worst expectations maybe there is no S/V parallelism, and in every 4 cycles either 1 VOP or 1 SOP can be executed. That way there would be no need for hardware dependency checks or special compiler instruction reordering.


I also noticed the s_buffer_load_dwordx16 thing, which is another reason to go down below CAL.


Thank you for your answers!