With its Maxwell architecture, NVIDIA introduced two instructions that are hugely important for cryptographic and integer compute:
- LOP3.LUT, which lets applications execute FPGA-style lookup tables: an 8-bit truth table applied bitwise across three 32-bit operands. This lets you execute any 3-input logic operation, such as SHA-256's Ch (choose/bitselect), SHA-256's Maj (majority), SHA3's Chi, bitslice S-Boxes, etc., in a single op. While AMD does have the bitselect operation (which can be used to compose the general LUT operation), this is a vastly inferior option.
- IADD3, which lets applications add three 32-bit integers. Simple and effective, and widely applicable because functions like SHA-256 have long 32-bit adder chains.
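To make the LOP3.LUT semantics concrete, here is a minimal software model in Python. It is a sketch only, not NVIDIA's implementation: the names `lop3`, `lut_of`, `CH_LUT`, `MAJ_LUT` and `CHI_LUT` are made up for illustration. It assumes the convention used for PTX `lop3.b32`, where the 8-bit table is obtained by evaluating the desired function on the constants 0xF0, 0xCC and 0xAA.

```python
MASK32 = 0xFFFFFFFF

def lop3(a, b, c, lut):
    """Software model of a bitwise 3-input LUT on 32-bit words.

    At every bit position the bits of a, b, c form a 3-bit index
    (a is the high bit) into the 8-bit truth table `lut`.
    """
    out = 0
    for idx in range(8):
        if (lut >> idx) & 1:
            # Select the bit positions whose (a, b, c) bits equal idx.
            ta = a if idx & 4 else ~a
            tb = b if idx & 2 else ~b
            tc = c if idx & 1 else ~c
            out |= ta & tb & tc
    return out & MASK32

def lut_of(f):
    """Derive the 8-bit table by evaluating f on the selector constants."""
    return f(0xF0, 0xCC, 0xAA) & 0xFF

# Truth tables for the functions mentioned above.
CH_LUT = lut_of(lambda x, y, z: (x & y) ^ (~x & z))            # SHA-256 Ch
MAJ_LUT = lut_of(lambda x, y, z: (x & y) ^ (x & z) ^ (y & z))  # SHA-256 Maj
CHI_LUT = lut_of(lambda x, y, z: x ^ (~y & z))                 # SHA3 Chi step

e, f, g = 0x510E527F, 0x9B05688C, 0x1F83D9AB
ch = lop3(e, f, g, CH_LUT)  # one LUT op instead of three logic ops
```

With these definitions `CH_LUT` comes out as 0xCA and `MAJ_LUT` as 0xE8, so each SHA-256 round function collapses into a single table lookup per bit.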
Are we getting anything like this with Arctic Islands? We'd like to get a head start on optimizing assembly if possible.
I second OP. LOP3.LUT would be a great addition for tripcode generators, or any applications that make extensive use of bitwise logical operations.
It would also be great to have LOP2.LUT - the ISA could be greatly simplified since all 2- or 3-input logic operations could be performed in terms of LOP2.LUT and LOP3.LUT.
Additionally, developers could do single-cycle operations like XNOR ~(A ^ B), AND NOT (A & ~B), NAND ~(A & B), NOR ~(A | B), etc.
Does anybody know whether the upcoming GCN 4.0 has an equivalent of LOP3.LUT?
I still couldn't find anything about GCN 4.0 ISA at AMD's website:
AMD Team, would there be an ETA by chance for when the GCN4 ISA (Instruction Set Architecture) manual might be published? It would help out many of us waiting to get a head start on programming/planning for the updated architecture.
It takes 3 operands and covers all 2^8 = 256 possible bitwise functions of them (one 8-entry truth table), so it would be super useful for crypto stuff. And the other cool thing in NV is the 3-operand integer sum.
But it's all about sacrificing die space...
Oh, never mind! Later I realized you mentioned a 2-operand LUT. Well, that can be done with the 3-operand LUT too by ignoring one of the operands.
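A quick sketch of that "ignore one operand" idea, in Python. The function `lop2_to_lop3` and the table names are hypothetical, and the bit ordering (index = a, b, c from high to low) is an assumption matching the 0xF0/0xCC/0xAA convention of PTX `lop3.b32`.

```python
def lop2_to_lop3(lut2):
    """Widen a 4-bit table over (a, b) into an 8-bit table over (a, b, c).

    The result ignores c: every 2-input entry is simply duplicated
    for c = 0 and c = 1.
    """
    lut3 = 0
    for idx in range(8):
        ab = idx >> 1  # drop the c bit, keep (a, b)
        if (lut2 >> ab) & 1:
            lut3 |= 1 << idx
    return lut3

# 4-bit tables, indexed by (a << 1) | b.
XNOR2 = 0b1001  # ~(a ^ b): true for 00 and 11
ANDN2 = 0b0100  # a & ~b:   true only for 10
NAND2 = 0b0111  # ~(a & b): false only for 11
NOR2 = 0b0001   # ~(a | b): true only for 00

for name, lut2 in [("XNOR", XNOR2), ("ANDN", ANDN2),
                   ("NAND", NAND2), ("NOR", NOR2)]:
    print(f"{name}: LOP3 table 0x{lop2_to_lop3(lut2):02X}")
```

So a dedicated 2-input LUT opcode would mostly save immediate bits; the 3-input form already covers every 2-input function.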
Yes! Also the basic circuit for LOP3.LUT is relatively cheap (from a die-space standpoint) as it's one 8-input MUX * N bits. This design is neat since you can implement backwards compatibility by hard-coding input truth-tables for legacy logic opcodes. Most standard cell libraries include MUXes but I'd guess AMD uses more optimized custom cells.
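Here is a rough Python model of that per-bit MUX structure. It is a behavioral sketch, not a description of any real NVIDIA or AMD datapath; `mux8`, `lop3_mux` and the `LEGACY` table are names invented for illustration, and the legacy encodings assume operand a is the high select bit.

```python
def mux8(table, sel):
    """8:1 multiplexer: the 3-bit select picks one bit of the 8-bit table."""
    return (table >> sel) & 1

def lop3_mux(a, b, c, lut, width=32):
    """One mux8 per bit lane; the lane's a, b, c bits drive the select lines."""
    out = 0
    for i in range(width):
        sel = (((a >> i) & 1) << 2) | (((b >> i) & 1) << 1) | ((c >> i) & 1)
        out |= mux8(lut, sel) << i
    return out

# Hypothetical decode table: legacy opcodes become hard-coded truth tables
# that ignore the third operand.
LEGACY = {
    "AND": 0xC0,  # a & b
    "OR": 0xFC,   # a | b
    "XOR": 0x3C,  # a ^ b
    "NOT": 0x0F,  # ~a
}

x, y = 0x12345678, 0x0F0F0F0F
legacy_and = lop3_mux(x, y, 0, LEGACY["AND"])  # same result as x & y
```

The point of the sketch is that the old opcodes need no separate logic: the decoder only has to supply a constant table to the same MUX array.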
IADD3 should also be relatively inexpensive since they already have the logic for MADD (multiply + add) - MADD is a 3-input opcode that usually is minimally implemented as a series of carry-save adders.
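To illustrate the carry-save idea, here is a small Python sketch of a 3-input add built from one 3:2 compressor followed by a single ordinary add. This is only a model of the general technique under the assumption of 32-bit wraparound; `csa` and `iadd3` are illustrative names, not anything from a vendor ISA.

```python
MASK32 = 0xFFFFFFFF

def csa(a, b, c):
    """3:2 carry-save compressor: three words in, sum word and carry word out.

    No carry propagates between bit lanes here, so the delay does not
    depend on the word width.
    """
    s = a ^ b ^ c                               # per-bit sum
    carry = ((a & b) | (a & c) | (b & c)) << 1  # per-bit majority, shifted up
    return s & MASK32, carry & MASK32

def iadd3(a, b, c):
    """a + b + c modulo 2^32 using a single carry-propagate add."""
    s, carry = csa(a, b, c)
    return (s + carry) & MASK32

total = iadd3(0xFFFFFFFF, 1, 2)  # wraps around modulo 2^32
```

Incidentally, the two halves of the compressor are themselves 3-input bitwise functions: the sum is a 3-way XOR and the carry is the same majority function as SHA-256's Maj.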
Of course these are all assumptions as I'm not privy to AMD's internal design trade-offs.