I was pleased to see that the Evergreen instruction set was published with the 2.0 release. But try as I might, I can't find any documentation on how to pack the ALU instructions optimally.
For instance, I assume that the cosine instruction can only be issued in the t unit, although this is not stated. What's more, the IL specification describes the cosine instruction as operating on a vector (xyzw of a register), which seems to conflict with the microcode operating on a single 32-bit register.
The kind of documentation I am looking for would be:
How many and which instructions can be coissued in a VLIW.
Which instructions are only legal in the xyzw units.
Which instructions are only legal in the t unit.
Which instructions can be issued to any unit.
In short, the information needed to get fuller utilization of the stream cores in the ALU clauses. Currently my kernels very often use four or fewer of the five units (80% or less), even when there is no data dependency, and I am trying to understand which changes I can make to get closer to 100% utilization.
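To make the utilization figure concrete, this is roughly how I am counting it. It is only a minimal Python sketch: the function name and the per-bundle slot counts are made up for illustration and are not taken from a real kernel disassembly.

```python
# Each VLIW bundle has five ALU slots: x, y, z, w and t.
SLOTS_PER_BUNDLE = 5


def alu_utilization(slots_used_per_bundle):
    """Return the fraction of ALU slots filled across a list of VLIW bundles.

    slots_used_per_bundle: one integer per bundle, the number of slots
    (0 to 5) that carry an instruction in that bundle.
    """
    if not slots_used_per_bundle:
        raise ValueError("need at least one bundle")
    for used in slots_used_per_bundle:
        if not 0 <= used <= SLOTS_PER_BUNDLE:
            raise ValueError(f"slot count out of range: {used}")
    total_slots = SLOTS_PER_BUNDLE * len(slots_used_per_bundle)
    return sum(slots_used_per_bundle) / total_slots


if __name__ == "__main__":
    # Hypothetical counts for five consecutive bundles in one ALU clause.
    example_bundles = [4, 4, 3, 5, 4]
    print(f"ALU utilization: {alu_utilization(example_bundles):.0%}")
```

Counting filled slots per bundle like this is how I arrive at the 80% or lower figure.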
Any pointers will be much appreciated.