I was pleased to see that the Evergreen instruction set was published with the 2.0 release. But try as I might, I can't find any documents with information on how to optimally pack the ALU instructions.
For instance, I assume that the cosine instruction can only be issued in the t unit, although this is not stated. What's more, the IL specification talks about the cosine instruction operating on a vector (xyzw of a register), which seems to conflict with the microcode operating on a single 32-bit register.
The kind of documentation I am looking for would be:
How many and which instructions can be co-issued in a VLIW.
Which instructions are only legal in the xyzw units.
Which instructions are only legal in the t unit.
Which instructions can be issued to any unit.
In short, the information needed to get fuller utilization of the stream cores in the ALU clauses. Currently my kernels very often use 4 or fewer of the 5 units (80% or less), even when there is no data dependency, and I am trying to understand which changes I can make to get closer to 100% utilization.
Any pointers will be much appreciated.
Hi, frankas
As far as I know, sine/cosine/exp/log and 32-bit float/integer conversions are only legal on the t slot. Normal 32-bit arithmetic operations such as add and mul are legal on any slot. 64-bit operations may take 2 or 4 slots that run together.
As you mentioned, cos_vec needs 4 instructions, all on the t unit. However, the shader compiler that compiles IL to ISA does heavy optimization, trying to utilize as many slots as possible. It can automatically detect data dependencies and reorder instructions, or remove useless instructions.
You can inspect the ISA code with SKA (Stream KernelAnalyzer) and see which parts of your code waste the most ALU slots.
Thank you, I knew this much. But my question really is where I can find more information on which instructions will fit in a particular slot.
I had another look in the Stream KernelAnalyzer, and it seems that what I observed is a weakness in the IL compiler (v1.3?).
For instance:
"mov r324, r271 \n"
"mov r325, r272 \n"
"mov r326, r273 \n"
compiles to:
47  x: MOV R3.x, R80.x
    y: MOV R3.y, R80.y
    z: MOV R3.z, R80.z
    w: MOV R3.w, R80.w
48  x: MOV R4.x, R81.x
    y: MOV R4.y, R81.y
    z: MOV R4.z, R81.z
    w: MOV R4.w, R81.w
49  x: MOV R5.x, R7.x
    y: MOV R5.y, R7.y
    z: MOV R5.z, R7.z
    w: MOV R5.w, R7.w
Here the compiler seems unable to split or merge the move instructions, and leaves the "t" slot unused. In other places where there are similar moves, but with swizzles, the t slot is used for moving. The same goes for xor.
Example of a fully packed move elsewhere:
162 x: MOV R55.x, R55.y
    y: MOV R55.y, R55.z
    z: MOV R55.z, R55.w
    w: MOV R56.w, R55.x
    t: MOV R55.w, R54.x
163 x: MOV R54.x, R54.y
    y: MOV R54.y, R54.z
    z: MOV R54.z, R54.w
    w: MOV R54.w, R53.x
    t: MOV R53.x, R53.y
164 x: MOV R52.x, R52.y
    y: MOV R53.y, R53.z
    z: MOV R53.z, R53.w
    w: MOV R53.w, R52.x
    t: MOV R52.y, R52.z
I will upgrade to 2.0 soon, and hope this has been corrected in the latest version.
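To put a number on the packing in listings like the two above, here is a minimal Python sketch that counts used slots per instruction group. It assumes the text layout shown in this thread (a decimal group number opens each group, and each slot line reads "<slot>: OPCODE ..."). The function name slot_utilization and the sample strings are made up for illustration and are not part of any AMD tool.

```python
import re

SLOTS_PER_GROUP = 5  # x, y, z, w and t

def slot_utilization(isa_text):
    """Return (groups, used_slots, utilization) for an ISA listing.

    Assumption: a group starts with a decimal group number, and every
    slot line contains "<slot>: OPCODE ..." as in the listings above.
    """
    groups = 0
    used = 0
    for line in isa_text.splitlines():
        m = re.match(r"\s*(\d+\s+)?([xyzwt]):\s+\S+", line)
        if not m:
            continue  # not a slot line, ignore it
        if m.group(1):
            groups += 1  # a leading number opens a new instruction group
        used += 1
    if groups == 0:
        return 0, 0, 0.0
    return groups, used, used / (groups * SLOTS_PER_GROUP)

# Two groups from the first listing: the t slot is empty in both.
unpacked = """\
47  x: MOV R3.x, R80.x
    y: MOV R3.y, R80.y
    z: MOV R3.z, R80.z
    w: MOV R3.w, R80.w
48  x: MOV R4.x, R81.x
    y: MOV R4.y, R81.y
    z: MOV R4.z, R81.z
    w: MOV R4.w, R81.w
"""

# One group from the second listing: all five slots are filled.
packed = """\
162 x: MOV R55.x, R55.y
    y: MOV R55.y, R55.z
    z: MOV R55.z, R55.w
    w: MOV R56.w, R55.x
    t: MOV R55.w, R54.x
"""

print(slot_utilization(unpacked))
print(slot_utilization(packed))
```

Running this over a whole kernel dump, rather than eyeballing it, should make it easier to spot which clauses leave the t slot idle.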
Originally posted by: MicahVillmow frankas, I do not believe there is a list anywhere, but you should be able to derive this information from each instruction in the ISA doc.
Actually there is such a list. In the earlier R600 (and maybe also the R7xx) documentation there was such a listing, with the instructions separated into xyzw-only, xyzwt, and t-only groups.
In the current ISA reference guide for Evergreen GPUs it is organized a bit differently, but one finds this information in chapter 2.2 starting on page 2-27 for instructions with up to two source operands:
Opcodes 0..95 can be used in either the Vector or Trans unit. Opcodes 128..159 are Trans only. Opcodes 160..255 are vector only.
with an opcode list following that sentence. For instructions with three source operands, the list starts on page 2-34.
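The three ranges quoted above can be turned into a small lookup. This is only a sketch of that one sentence for the two-source-operand opcodes. The function name legal_units is hypothetical, and opcodes 96..127 return None because the quoted sentence says nothing about them.

```python
def legal_units(opcode):
    """Map a two-source ALU opcode number to the units it may issue in,
    using only the three ranges quoted from the ISA guide."""
    if 0 <= opcode <= 95:
        return {"vector", "trans"}  # either the xyzw units or the t unit
    if 128 <= opcode <= 159:
        return {"trans"}            # t unit only
    if 160 <= opcode <= 255:
        return {"vector"}           # xyzw units only
    return None                     # not covered by the quoted sentence

for op in (0, 95, 100, 128, 200):
    print(op, legal_units(op))
```

The actual opcode numbers for a given instruction still have to be looked up in the opcode list in the guide.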
By the way, there are some nice improvements to the instruction set in Evergreen, which can reduce the number of instructions needed or increase the ILP for certain tasks. But I have already found a glitch in the IL documentation for the bitalign instruction:
bitalign dst, src0, src1, src2
Description
Aligns bit data for video. This is a special instruction for multi-media video.
dst = (src0 << src2.x) || (src1 >> (32-src2.x))
src2.x must be 0, 8, 16, 24, or 32.
The last sentence is not true. It works with arbitrary values (at least I have successfully tested it with 20). The ISA documentation also does not mention such a limitation. If it were true, it would be a redundant instruction anyway, as bytealign should be enough in that case. And the description should use a single "|" for the bitwise OR, not "||" for the logical one.
Yeah, I'm also curious about the bitalign documentation. I've tested it with almost every value from 1 to 31 and it works perfectly. One bitalign replaces 3 other instructions when doing 32-bit cyclic rotations, which is quite welcome in cryptography.
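As a rough illustration of the rotation trick, here is a Python sketch that emulates the formula from the IL description as quoted above, with the "||" read as a bitwise OR. It assumes 32-bit unsigned inputs, and the helper names are made up. It only models the quoted formula, so the shift direction of the real instruction should be checked against the ISA document.

```python
MASK32 = 0xFFFFFFFF

def bitalign(src0, src1, shift):
    """Emulate the quoted IL formula with a bitwise OR:
    dst = (src0 << shift) | (src1 >> (32 - shift)), truncated to 32 bits."""
    if not 0 <= shift <= 32:
        raise ValueError("shift must be in 0..32 for this sketch")
    return ((src0 << shift) | (src1 >> (32 - shift))) & MASK32

def rotl32_three_ops(x, n):
    """The classic rotate built from three operations: shl, shr, or."""
    left = (x << n) & MASK32   # shift left, drop bits above bit 31
    right = x >> (32 - n)      # bring the top n bits down
    return left | right        # combine both halves

x = 0x12345678
# Feeding the same register as src0 and src1 gives a cyclic rotation.
print(hex(bitalign(x, x, 20)))        # 0x67812345
print(hex(rotl32_three_ops(x, 20)))   # 0x67812345
```

Under this reading, a shift of 20 (not a multiple of 8) is just another rotation amount, which matches the test results reported above.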
I also noticed that there is an "NSA instruction" (bit population count) which is extremely useful in cryptography.
But am I cursed with temporary blindness, or are the IL instructions for doing 24-bit integer multiplies missing? One finds them in the ISA documentation but nowhere in the IL docs. What's up with them?
I searched the whole IL v2 document for "24" and also found nothing about muladd_uint24, mul_int24, or anything similar.