This article was originally published on May 1, 2020.
Motivation
Regardless of the final target technology (e.g., FPGA or CPU), available resources are typically limited. An optimized architecture and network design is therefore essential when integrating neural-network-based approaches into embedded projects.
This article covers the Solectrix AI workflow, including both proper network handling and the transfer to the chosen target technology. An example is given for an object detection task running on Xilinx MPSoC technology using Vitis AI.
AI Ecosystem
The Solectrix AI Ecosystem, shown in Figure 1, provides a complete framework covering all machine-learning-related steps, from model generation and training to network pruning and model deployment.
The following section covers an example workflow for transferring a deployed network to Xilinx MPSoC technology.
Workflow for Xilinx ZCU102 Evaluation Kit based on Vitis AI
As an example target hardware, the Xilinx ZCU102 evaluation kit was chosen. The corresponding workflow is illustrated in Figure 2.
Using the above-mentioned AI Ecosystem, a network is selected, trained, pruned, and finally deployed. The deployment results in a TensorFlow frozen-graph model. Next, the frozen graph is fed into the Vitis AI TensorFlow translation tools.
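As an illustration, a minimal sketch of the freezing step is given below, assuming the TensorFlow 1.x graph mode supported by Vitis AI at the time; the checkpoint path and output node names are placeholders that depend on the actual model.

    import tensorflow as tf

    # Placeholder checkpoint path and output node names; the actual values
    # depend on the trained model.
    CHECKPOINT = "train/model.ckpt-100000"
    OUTPUT_NODES = ["ssd/box_predictions", "ssd/class_predictions"]

    with tf.compat.v1.Session() as sess:
        saver = tf.compat.v1.train.import_meta_graph(CHECKPOINT + ".meta")
        saver.restore(sess, CHECKPOINT)
        # Replace all variables by constants so the graph becomes
        # a self-contained frozen graph.
        frozen = tf.compat.v1.graph_util.convert_variables_to_constants(
            sess, sess.graph_def, OUTPUT_NODES)
        with tf.io.gfile.GFile("frozen_graph.pb", "wb") as f:
            f.write(frozen.SerializeToString())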
In parallel, the Deep Learning Processor Unit (DPU) IP core is configured and integrated into the FPGA image.
Then, after quantizing the deployed network structure, the Vitis AI compiler is used to translate the TensorFlow network into the instruction format required by the DPU core.
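The two tool invocations might look as follows. This is a sketch only: the input/output node names, input shape, calibration function, and the arch.json file describing the DPU configuration are placeholders for the actual project values.

    import subprocess

    # Quantize the frozen graph (Vitis AI 1.x TensorFlow quantizer).
    subprocess.run([
        "vai_q_tensorflow", "quantize",
        "--input_frozen_graph", "frozen_graph.pb",
        "--input_nodes", "image_input",
        "--input_shapes", "?,300,300,3",
        "--output_nodes", "ssd/box_predictions,ssd/class_predictions",
        "--input_fn", "calib_input.calib_fn",  # feeds calibration batches
        "--calib_iter", "100",
        "--output_dir", "quantized",
    ], check=True)

    # Compile the quantized graph for the configured DPU core.
    subprocess.run([
        "vai_c_tensorflow",
        "--frozen_pb", "quantized/deploy_model.pb",
        "--arch", "arch.json",  # DPU configuration of the target
        "--output_dir", "compiled",
        "--net_name", "mobilenetv2_ssd",
    ], check=True)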
After compiling the network structure, the desired application can be implemented within Vitis, and the compiled network is integrated into the application.
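One possible integration path in Vitis AI 1.x was the legacy DNNDK (n2cube) Python runtime sketched below; newer releases use the VART API instead. The kernel and tensor node names are placeholders reported by the Vitis AI compiler for the actual model.

    import numpy as np
    from dnndk import n2cube

    # Placeholder kernel and node names, as reported by the compiler.
    KERNEL = "mobilenetv2_ssd"
    INPUT_NODE = "image_input"
    OUTPUT_NODE = "ssd_box_predictions"

    n2cube.dpuOpen()
    kernel = n2cube.dpuLoadKernel(KERNEL)
    task = n2cube.dpuCreateTask(kernel, 0)

    image = np.zeros((300, 300, 3), dtype=np.float32)  # preprocessed frame
    n2cube.dpuSetInputTensorInHWCFP32(task, INPUT_NODE, image, image.size)
    n2cube.dpuRunTask(task)

    # Fetch raw detector output for post-processing (e.g., SSD decoding).
    size = n2cube.dpuGetOutputTensorSize(task, OUTPUT_NODE)
    raw = n2cube.dpuGetOutputTensorInHWCFP32(task, OUTPUT_NODE, size)

    n2cube.dpuDestroyTask(task)
    n2cube.dpuDestroyKernel(kernel)
    n2cube.dpuClose()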
Using a suitable PetaLinux operating system, the application is finally cross-compiled for the target hardware.
Experimental Results
In the following, an object detection application is briefly described and evaluated. For that, a MobileNetV2 network was used together with an SSD detector for the final localization.
Initially, the designed model consists of 880,124 parameters.
By network pruning, the number of parameters was reduced to 773,692, a decrease of about 12 %.
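For a frozen graph, a rough parameter count can be obtained by summing the element counts of all constant nodes, since freezing turns every weight into a constant. The sketch below illustrates this together with the arithmetic behind the quoted reduction; counting every Const node slightly overestimates, as a few non-weight constants are included.

    import numpy as np
    import tensorflow as tf
    from tensorflow.python.framework import tensor_util

    def count_params(pb_path):
        # In a frozen graph all weights are Const nodes, so summing their
        # element counts approximates the total number of parameters.
        graph_def = tf.compat.v1.GraphDef()
        with tf.io.gfile.GFile(pb_path, "rb") as f:
            graph_def.ParseFromString(f.read())
        total = 0
        for node in graph_def.node:
            if node.op == "Const":
                tensor = tensor_util.MakeNdarray(node.attr["value"].tensor)
                total += int(np.prod(tensor.shape))
        return total

    # Reduction achieved by pruning, using the numbers quoted above:
    before, after = 880124, 773692
    print(f"reduction: {100 * (1 - after / before):.1f} %")  # -> 12.1 %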
For the inference time evaluation, a DPU core configuration with a maximum of 4096 operations per clock cycle was chosen.
Figure 3 compares the performance of the trained and the pruned network; the trained model is shown in blue, the pruned model in orange.
As can be seen, the inference time of the pruned network is about 7 % lower than that of the purely trained network, which clearly differs from the 12 % reduction in parameters. A reduction in the number of parameters therefore does not automatically translate into the same reduction in inference time on the target hardware. Especially in early network layers, which operate on large feature maps and whose runtime is dominated by the amount of data to be processed rather than by the parameter count, pruning leads to a clear reduction in parameters but to no reduction in inference time.
As a consequence, it can be beneficial to profile the inference time of differently designed networks prior to network training. By doing so, a model can be found that performs close to optimally on the configured DPU core structure. Further, the pruning itself may be guided by the profiling results; a simple profiling sketch is given below.
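A minimal wall-clock profiling sketch, reusing the hypothetical n2cube task from the integration sketch above; the DPU latency is averaged over repeated runs.

    import time
    from dnndk import n2cube  # assumption: legacy n2cube runtime as above

    def profile_task(task, runs=100):
        # Average wall-clock latency over several runs; the input tensor is
        # assumed to have been set once beforehand.
        start = time.perf_counter()
        for _ in range(runs):
            n2cube.dpuRunTask(task)
        return (time.perf_counter() - start) / runs

    # latency = profile_task(task)
    # print(f"mean inference time: {1e3 * latency:.2f} ms")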
Summary & Conclusion
The Solectrix AI Ecosystem covers all relevant aspects related to neural networks. The proposed workflow gives full control and optimization flexibility for different target technologies without the need for external black-box solutions. As an example, the workflow for network inference on a ZCU102 evaluation board based on Vitis AI was described, and experimental results were briefly discussed for an object detection task. Since network selection, training, and pruning are fully controllable, an optimal solution can be found that takes the trade-off between resulting inference time and achievable accuracy into account.
Learn more about Solectrix at https://www.fpga.sx/ai.