NVIDIA introduced two instructions with their Maxwell architecture that matter a great deal for cryptographic and integer compute:
- LOP3.LUT, which lets applications apply an FPGA-style 3-input lookup table bitwise across 32-bit operands. This lets you execute any 3-input logic operation, such as SHA-256's Ch (choose/bitselect), SHA-256's Maj (majority), SHA-3's Chi, bitslice S-boxes, etc., in a single op. While AMD does have the bitselect operation (which can be used to compose the general LUT operation), composing it that way is a vastly inferior option.
- IADD3, which lets applications add three 32-bit integers in one instruction. Simple, but widely applicable because functions like SHA-256 have long 32-bit adder chains.
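
For anyone not familiar with these two, here is a rough software model in plain C++ of what they compute. This is only a sketch of the semantics, not real GPU code: the helper names `lop3` and `iadd3` and the `LUT_*` constants are made up for illustration. The truth-table bytes are derived the way the PTX `lop3.b32` documentation describes, by evaluating the desired function on the constants 0xF0, 0xCC and 0xAA.

```cpp
#include <cstdint>

// Truth-table seeds: evaluating f(TA, TB, TC) yields the 8-bit LUT for f.
constexpr uint8_t TA = 0xF0, TB = 0xCC, TC = 0xAA;

// SHA-256 Ch(a,b,c)  = (a & b) ^ (~a & c)
constexpr uint8_t LUT_CH  = static_cast<uint8_t>((TA & TB) ^ (~TA & TC));
// SHA-256 Maj(a,b,c) = (a & b) | (a & c) | (b & c)
constexpr uint8_t LUT_MAJ = static_cast<uint8_t>((TA & TB) | (TA & TC) | (TB & TC));
// SHA-3 Chi(a,b,c)   = a ^ (~b & c)
constexpr uint8_t LUT_CHI = static_cast<uint8_t>(TA ^ (~TB & TC));

// Hypothetical software model of a 3-input bitwise LUT.
// At every bit position the input bits (a, b, c) form an index 0..7,
// with a as the most significant bit; the output bit is lut[index].
inline uint32_t lop3(uint32_t a, uint32_t b, uint32_t c, uint8_t lut) {
    uint32_t r = 0;
    for (int i = 0; i < 8; ++i) {
        if ((lut >> i) & 1u) {
            // Mask of bit positions where (a, b, c) matches index i.
            uint32_t ma = (i & 4) ? a : ~a;
            uint32_t mb = (i & 2) ? b : ~b;
            uint32_t mc = (i & 1) ? c : ~c;
            r |= ma & mb & mc;
        }
    }
    return r;
}

// Hypothetical model of a 3-input add: wraps modulo 2^32.
inline uint32_t iadd3(uint32_t a, uint32_t b, uint32_t c) {
    return a + b + c;
}
```

With this model, `lop3(e, f, g, LUT_CH)` gives the same result as the usual two-AND-one-XOR expression for Ch. On hardware with a real 3-input LUT instruction the whole loop collapses into one op, and a 3-input add removes one instruction from each pair of additions in a SHA-256 round.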
Are we getting anything like this with Arctic Islands? We'd like to get a head start on optimizing assembly if possible.