In resource-limited, high-performance, and low-latency scenarios, AI inference must deliver lower power consumption and higher performance without sacrificing accuracy. These requirements are especially critical in edge applications and low-latency ADAS. While 8-bit quantization can preserve high accuracy, it consumes more hardware resources; extremely low-bit quantization, such as binary or ternary, often suffers significant accuracy degradation. Therefore, a full-process, hardware-friendly quantization solution using 4-bit activations and 4-bit weights (4A4W) is proposed as a better accuracy/resource trade-off. With INT4 optimization, Xilinx achieves up to a 77% performance boost on real hardware compared with INT8, while maintaining accuracy comparable to full-precision models.
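To make the bit-width trade-off concrete, the sketch below shows basic symmetric uniform quantization to signed 4-bit integers (range [-8, 7]). This is only an illustration of the quantize/dequantize math that low-bit schemes build on, not the Xilinx 4A4W pipeline itself; the function names and per-tensor scaling choice are assumptions made for the example.

```python
# Minimal sketch: symmetric uniform quantization to signed 4-bit integers.
# Illustrative only -- not the actual Xilinx 4A4W quantization flow.
import numpy as np

def quantize_int4(x: np.ndarray):
    """Map float values to signed 4-bit integers with a per-tensor scale."""
    qmin, qmax = -8, 7                      # signed 4-bit range
    scale = np.abs(x).max() / qmax          # per-tensor symmetric scale
    q = np.clip(np.round(x / scale), qmin, qmax).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover an approximation of the original float values."""
    return q.astype(np.float32) * scale

# Example: the reconstruction error reflects the coarseness of 4-bit levels.
w = np.random.randn(64).astype(np.float32)
q, s = quantize_int4(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```

Fewer quantization levels mean coarser steps and larger reconstruction error, which is why 4-bit schemes typically need careful calibration or quantization-aware training to approach full-precision accuracy.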