Author: Michael Schulte, Sr. Fellow Engineer at AMD
Recent research has demonstrated that narrow data formats (e.g., formats using 8 bits or fewer per number) can greatly improve the performance and energy efficiency of AI training and inference with a negligible impact on accuracy [1][2]. Narrow data formats also reduce the memory footprint of AI models and increase effective memory and network bandwidth, since fewer bits need to be stored and transmitted.
To increase their effective range and improve the accuracy of the models that use them, narrow data formats are often scaled at the tensor, channel, and/or block level. Harnessing the full potential of these narrow formats requires software tools to quantize wider data formats to narrow ones, determine scales, and emulate new advanced data formats for research and development.
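As a concrete illustration of block-level scaling, the sketch below quantizes each block of 32 values to signed 8-bit integers, with one shared floating-point scale per block chosen so the block's largest magnitude maps to the int8 maximum. The function names and block size here are illustrative, not part of any particular library's API.

```python
import torch

def quantize_block_int8(x: torch.Tensor, block_size: int = 32):
    """Illustrative block-scaled int8 quantization: each block of
    `block_size` values shares one floating-point scale."""
    x = x.reshape(-1, block_size)                      # one row per block
    scale = x.abs().amax(dim=1, keepdim=True) / 127.0  # map block max to 127
    scale = scale.clamp(min=torch.finfo(torch.float32).tiny)  # avoid div by zero
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def dequantize_block_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Approximately reconstruct the original values."""
    return q.float() * scale

x = torch.randn(8, 32)
q, scale = quantize_block_int8(x)
err = (x.reshape(-1, 32) - dequantize_block_int8(q, scale)).abs().max()
```

Per-tensor and per-channel scaling work the same way, just with one scale shared across a larger (or differently shaped) group of values; smaller groups track local dynamic range more closely at the cost of storing more scales.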
Brevitas is an open-source PyTorch library for neural network quantization and emulation with support for both post-training quantization (PTQ) and quantization-aware training (QAT) [3][4]. With PTQ, the model is quantized after it has been trained, while with QAT, the model is trained or fine-tuned with quantization in the training loop. Brevitas provides composable building blocks at multiple levels of abstraction to model quantized neural networks. It also features first-class support for custom datatypes and operators at the AI framework level, including integer, floating-point, and scaled datatypes, along with the ability to specify user-defined datatypes and operators. Brevitas integrates with multiple inference toolchains through export to a variety of intermediate representations, which enables models quantized with Brevitas to run on CPUs, GPUs, FPGAs, and AI Engines.
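To give a flavor of the building-block style, here is a minimal QAT-style model assembled from Brevitas quantized layers. The bit widths and layer sizes are arbitrary choices for illustration; the model trains like any other PyTorch module, with quantization emulated in the forward pass.

```python
import torch
import brevitas.nn as qnn

class QuantMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # 8-bit quantized input, 4-bit weights and activations (illustrative)
        self.quant_inp = qnn.QuantIdentity(bit_width=8, return_quant_tensor=True)
        self.fc1 = qnn.QuantLinear(784, 128, bias=True, weight_bit_width=4)
        self.relu1 = qnn.QuantReLU(bit_width=4, return_quant_tensor=True)
        self.fc2 = qnn.QuantLinear(128, 10, bias=True, weight_bit_width=4)

    def forward(self, x):
        x = self.quant_inp(x)
        x = self.relu1(self.fc1(x))
        return self.fc2(x)

model = QuantMLP()
out = model(torch.randn(1, 784))  # fine-tune with a standard training loop
```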
Recently, AMD and other companies formed the Microscaling Formats (MX) Alliance to create and standardize advanced data formats for AI training and inference [5][6]. Experimental results demonstrate that the MX data formats can be used effectively for inference and training across a variety of deep learning models, including generative language, image classification, speech recognition, recommendation, and translation models [1].
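In the MX formats, each block of 32 elements shares a power-of-two scale stored as an 8-bit exponent, while the elements themselves use a narrow format such as FP8, FP6, FP4, or INT8 [6]. In contrast to the floating-point scales sketched earlier, the sketch below conveys the shared power-of-two-scale idea with int8 elements; it is a simplified illustration, not a bit-exact implementation of the OCP MX specification.

```python
import torch

def mx_style_quantize(x: torch.Tensor, block_size: int = 32):
    """Simplified sketch of MX-style block quantization (not bit-exact
    with the OCP MX v1.0 spec): each block of `block_size` elements
    shares one power-of-two scale, which MX stores as an 8-bit exponent."""
    x = x.reshape(-1, block_size)
    amax = x.abs().amax(dim=1, keepdim=True).clamp(min=2.0 ** -126)
    # Shared per-block scale: a power of two aligned to the block maximum,
    # shifted so quantized elements land roughly in the int8 range.
    scale = torch.exp2(torch.floor(torch.log2(amax)) - 6)
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

x = torch.randn(4, 32)
q, scale = mx_style_quantize(x)
x_hat = q.float() * scale  # dequantize: element value times block scale
```

Restricting the scale to a power of two keeps the per-block metadata to a single exponent byte and makes applying the scale in hardware a simple exponent adjustment rather than a full multiplication.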
Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. GD-5.
References:
[1] Bita Darvish Rouhani et al., “Microscaling Data Formats for Deep Learning,” Oct. 19, 2023. https://arxiv.org/abs/2310.10537.
[2] Shivam Aggarwal et al., “Post-Training Quantization with Low-precision Minifloats and Integers on FPGAs,” Nov. 21, 2023. https://arxiv.org/abs/2311.12359.
[3] Alessandro Pappalardo et al., “Brevitas: Neural Network Quantization in PyTorch,” Dec. 8, 2023. https://github.com/Xilinx/brevitas.
[4] Alessandro Pappalardo, “Neural Network Quantization with Brevitas,” tutorial from TVMCon 2021, Dec. 22, 2021. https://www.youtube.com/watch?v=wsXx3Hr5kZs.
[5] “AMD, Arm, Intel, Meta, Microsoft, NVIDIA, and Qualcomm Standardize Next-Generation Narrow Precision Data Formats for AI,” Oct. 17, 2023. https://www.opencompute.org/blog/amd-arm-intel-meta-microsoft-nvidia-and-qualcomm-standardize-next-g....
[6] “OCP Microscaling Formats (MX) Specification, Version 1.0,” Sept. 7, 2023. https://www.opencompute.org/documents/ocp-microscaling-formats-mx-v1-0-spec-final-pdf.