
LevSvalov
Journeyman III

Ryzen-AI: FPS performance issue after model quantization

Hello! 

I have encountered unexpected inference performance behavior while testing classification models targeting the IPU device.

Problem: FPS performance for the INT8 model quantized by vai_q_onnx (with the IPU as target) is worse than for the original FP32 model without quantization.
Ryzen series: 7940HS
RyzenAI version: 0.8
Description: 
Two models, ResNet50 and MobileNet_V2 from Torchvision, were tested in the Ryzen-AI project.
First, the pre-trained models were converted to the ONNX format, and then the main development flow from the docs was applied.
The VitisAI-ONNX quantizer was used for INT8 quantization. The inference tests were run both with IPU targeting and with CPU targeting.
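For context, the FP32 export step looked roughly like the following sketch (a minimal illustration assuming torchvision >= 0.13; the file name and opset version are placeholders, not the exact values used):

import torch
import torchvision

# Export a pre-trained torchvision model to ONNX (illustrative)
model = torchvision.models.resnet50(weights="DEFAULT").eval()
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy_input,
    "resnet50_fp32.onnx",
    input_names=["input"],
    output_names=["output"],
    opset_version=13,
)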
To reproduce the issue, the quantization parameters were as follows:
         - For IPU (the recommended config from the docs): 

import vai_q_onnx
from onnxruntime.quantization import QuantFormat, QuantType

vai_q_onnx.quantize_static(
    model_input,
    model_output,
    calibration_data_reader,
    quant_format=QuantFormat.QDQ,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
    enable_dpu=True,  # emit DPU/IPU-compatible quantized ops
    extra_options={'ActivationSymmetric': True}
)

         - For CPU:

vai_q_onnx.quantize_static(
    input_model_path,
    output_model_path,
    data_reader,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.NonOverflow,
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QInt8,
    weight_type=QuantType.QInt8,
)
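In both calls, the calibration_data_reader / data_reader argument is an ONNX Runtime CalibrationDataReader. A minimal sketch of such a reader (class and variable names are illustrative, not the exact implementation used here):

from onnxruntime.quantization import CalibrationDataReader

class ImageNetCalibrationReader(CalibrationDataReader):
    # Feeds preprocessed calibration images to the quantizer one at a time.
    def __init__(self, images, input_name):
        # images: iterable of float32 numpy arrays shaped (1, 3, 224, 224)
        self._iter = iter(images)
        self._input_name = input_name

    def get_next(self):
        # Return None to signal that the calibration data is exhausted.
        batch = next(self._iter, None)
        return None if batch is None else {self._input_name: batch}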

The inference was evaluated on stratified ImageNet data (1,000 images, one image per class).
The results are as follows:

[Attachment: LevSvalov_0-1697622966611.png (FPS results table)]

[Attachment: LevSvalov_1-1697623054376.png (FPS results table)]
The original FP32 model in ONNX format performs faster than the quantized versions, which should not be the case.
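For reference, the FPS numbers were produced with a simple timing loop along these lines (a hedged sketch, not the exact harness; the provider list and input shape are placeholders):

import time
import numpy as np
import onnxruntime as ort

def measure_fps(model_path, providers, provider_options=None, n_images=1000):
    # Rough single-image-at-a-time FPS measurement.
    session = ort.InferenceSession(model_path, providers=providers,
                                   provider_options=provider_options)
    input_name = session.get_inputs()[0].name
    x = np.random.rand(1, 3, 224, 224).astype(np.float32)
    session.run(None, {input_name: x})  # warm-up: exclude first-run/compile overhead
    start = time.perf_counter()
    for _ in range(n_images):
        session.run(None, {input_name: x})
    return n_images / (time.perf_counter() - start)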

Do you have any suggestions on how to adjust the flow to resolve the issue?
Looking forward to collaborating. Thank you!

 

Headbandwig
Journeyman III

  1. Optimize the model architecture for efficiency by simplifying layers and reducing size.

  2. Utilize hardware acceleration, such as GPUs or TPUs, for faster inference.

  3. Adjust quantization parameters, experimenting with different settings to balance accuracy and speed.

  4. Profile and benchmark the model to identify bottlenecks and make data-driven optimizations (see the profiling sketch after this list).

  5. Consider model pruning, caching, framework-specific tools, and parallelization to further enhance performance while evaluating trade-offs between accuracy and speed.
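For point 4, a minimal ONNX Runtime profiling sketch (the model path and provider are placeholders):

import onnxruntime as ort

session_options = ort.SessionOptions()
session_options.enable_profiling = True  # write a JSON timeline of per-node execution
session = ort.InferenceSession("model_int8.onnx",
                               sess_options=session_options,
                               providers=["CPUExecutionProvider"])
# ... run inference as usual ...
profile_path = session.end_profiling()  # path of the generated JSON trace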

Uday_Das
Staff

Hello LevSvalov,

We can check on this and let you know. 

But I have a few questions: 

1. Why are you quantizing with the CPU target but running with the ONNX IPU runtime (which I believe is the Vitis AI EP)? And why are you quantizing with the IPU target but running with the ONNX CPU runtime (CPU EP)? In my view, a model quantized for the CPU target should use the CPU EP, and a model quantized for the IPU target should use the Vitis AI EP only (see the session-creation sketch after these questions), so I cannot justify those rows.

2. What is the difference between "run1-no quantization" and "run 2-no quantization"? A non-quantized model should be the same in run1 and run2, so is this just the exact same run executed twice?

3. What is the %IPU column, and how do you measure it?

4. We have the 0.9 release now. Can you try with that release? The ONNX quantization and the EP have changed since 0.8.

5. When quantizing for the CPU target, can you use the following settings:

vai_q_onnx.quantize_static(
    model_input,
    model_output,
    calibration_data_reader,
    quant_format=QuantFormat.QDQ,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    activation_type=QuantType.QUInt8,  # unsigned activations for the CPU target
    weight_type=QuantType.QInt8
)
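For question 1, the pairing I have in mind looks like this (a sketch; the model file names are placeholders, and the Vitis AI EP config path is assumed to follow the standard RyzenAI package layout):

import onnxruntime as ort

# Model quantized for the CPU target -> CPU EP
cpu_session = ort.InferenceSession("model_int8_cpu.onnx",
                                   providers=["CPUExecutionProvider"])

# Model quantized for the IPU target -> Vitis AI EP
ipu_session = ort.InferenceSession("model_int8_ipu.onnx",
                                   providers=["VitisAIExecutionProvider"],
                                   provider_options=[{"config_file": "vaip_config.json"}])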

 

hanerry
Journeyman III

Thank you very much for providing this information.
