How AMD Drives Higher Hardware Efficiency in AI Algorithm Development

AMD_AI · 06-07-2024

Paper Share: A Unified Progressive Depth Pruner for CNN and Vision Transformer for AAAI 2024

Blog authors: Ji Liu, Dehua Tang

AMD, one of the world's largest semiconductor suppliers, is recognized by users around the world for its leading chip architecture design and its development tools for artificial intelligence. With the rapid evolution of AI, designing high-performance algorithms that are better suited to AMD hardware has become one of our missions.

The top-tier conference AAAI 2024 has accepted a recent paper from the AMD algorithm development team, “A Unified Progressive Depth Pruner for CNN and Vision Transformer” (https://arxiv.org/pdf/2401.06426.pdf). In this blog, we share the key ideas behind an efficient depth pruning method for both CNN and vision transformer models, and show how it delivers strong compression performance across various AI models.

Motivation

Deep neural networks (DNNs) have made significant strides across various tasks and are now widely deployed in industrial applications. In these deployments, model optimization is a common need: it can raise inference speed while keeping the accuracy trade-off small. The main techniques are model pruning, quantization, and efficient model design. Model pruning, a primary acceleration approach, deliberately removes redundant weights while maintaining accuracy.

The conventional channel-wise pruning method struggles with depth-wise convolutional layers, which already have sparse computation and few parameters. Moreover, modern platforms such as GPUs favor a high degree of parallel computing, and channel-wise pruning makes efficient models thinner and sparser, which lowers hardware utilization and therefore the achievable hardware efficiency.

To address these issues, DepthShrinker and Layer-Folding were proposed to optimize MobileNetV2 by reducing model depth through reparameterization techniques. However, these methods have limitations: (1) Finetuning a subnet after directly removing activation layers can compromise the integrity of the baseline model weights, which hinders high performance; (2) They come with usage constraints, as they cannot prune models that contain certain normalization layers such as LayerNorm; and (3) As a consequence, they cannot be applied to vision transformer models, which rely on LayerNorm layers.

To alleviate these problems, we propose a depth pruning approach that combines a progressive training strategy with a novel block pruning method and can prune both CNN and vision transformer models. The progressive training strategy smoothly transfers the baseline model structure to the subnet structure while making full use of the baseline model weights, which leads to higher accuracy. The block pruning method resolves the normalization layer issue and can, in theory, handle all activation and normalization layers. As a result, the AMD method can prune vision transformer models, which existing depth pruning methods cannot do.

Key Technologies

The AMD depth pruning approach reduces model depth through a novel block pruning strategy based on reparameterization, rather than by omitting blocks directly. As shown in Figure 2, the block pruning strategy converts a complex, slow block into a simple, fast block through block merging. Within a block, we replace the activation layer with an identity layer, replace the LayerNorm (LN) or GroupNorm (GN) layer with a BatchNorm (BN) layer, and insert an activation layer together with a BatchNorm layer at the end of the block, which creates the conditions for reparameterization. The reparameterization technique can then merge the BatchNorm layers, the adjacent convolutional or fully connected layers, and the skip connections.
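The merging step relies on two standard algebraic identities, which the following minimal sketch illustrates for a single fully connected layer. It is not the paper's implementation: the helper names (`fold_batchnorm`, `merge_skip`) and the toy dimensions are assumptions made for illustration, and BatchNorm is taken in inference mode with fixed statistics.

```python
import math
import random

def linear(x, weight, bias):
    """Fully connected layer: y[j] = sum_i weight[j][i] * x[i] + bias[j]."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-mode BatchNorm with fixed running statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(y, gamma, beta, mean, var)]

def fold_batchnorm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm that follows a linear layer into that layer's parameters."""
    scales = [g / math.sqrt(s + eps) for g, s in zip(gamma, var)]
    fused_weight = [[w * sc for w in row] for row, sc in zip(weight, scales)]
    fused_bias = [(b - m) * sc + bt for b, m, sc, bt in zip(bias, mean, scales, beta)]
    return fused_weight, fused_bias

def merge_skip(weight):
    """Fold an identity skip connection: W x + x == (W + I) x for a square W."""
    return [[w + (1.0 if i == j else 0.0) for j, w in enumerate(row)]
            for i, row in enumerate(weight)]

# Toy data (hypothetical values, only used to check the identities numerically).
rng = random.Random(0)
dim = 4
x = [rng.uniform(-1, 1) for _ in range(dim)]
W = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]
b = [rng.uniform(-1, 1) for _ in range(dim)]
gamma = [rng.uniform(-1, 1) for _ in range(dim)]
beta = [rng.uniform(-1, 1) for _ in range(dim)]
mean = [rng.uniform(-1, 1) for _ in range(dim)]
var = [rng.uniform(0.5, 2.0) for _ in range(dim)]

# Unmerged block: linear -> BatchNorm -> add skip connection.
reference = [v + xi for v, xi in zip(batchnorm(linear(x, W, b), gamma, beta, mean, var), x)]

# Merged block: one linear layer with folded parameters.
fused_W, fused_b = fold_batchnorm(W, b, gamma, beta, mean, var)
merged = linear(x, merge_skip(fused_W), fused_b)

max_error = max(abs(r - m) for r, m in zip(reference, merged))
```

The merged layer reproduces the unmerged block up to floating-point rounding. The same folding applies to convolutions channel by channel, and it is only valid once no nonlinearity sits between the layers being merged, which is why the activation inside the block is first replaced with an identity layer.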

Figure 2: Framework overview of the proposed AMD depth pruner. Each pruned baseline block gradually evolves into a smaller merged block, which speeds up inference and saves memory. Experiments cover four baselines: three CNN-based networks (ResNet34, MobileNetV2, and ConvNeXtV1) and one vision transformer network (DeiT-Tiny).

The approach consists of four main steps: Supernet training, Subnet searching, Subnet training, and Subnet merging. First, we construct a Supernet based on the baseline model, applying the block modifications shown in Figure 2. After Supernet training, a search algorithm selects an optimal Subnet. We then apply the proposed progressive training strategy to train the optimal Subnet with less accuracy loss. Finally, the Subnet is merged into a shallower model with the reparameterization technique.
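One way to picture the progressive training step is as a gradual blend between the original activation and the identity function that replaces it. The sketch below is only an illustration under stated assumptions: the blend factor `lam`, the linear schedule in `blend_factor`, and the choice of ReLU are hypothetical and are not taken from this blog.

```python
def relu(x):
    """Baseline activation used for illustration."""
    return max(0.0, x)

def blend_factor(step, total_steps):
    """Assumed linear schedule: 0.0 at the start of training, 1.0 at the end."""
    return min(1.0, max(0.0, step / total_steps))

def progressive_activation(x, lam, act=relu):
    """Blend the activation with identity: lam=0 is the baseline, lam=1 is identity."""
    return (1.0 - lam) * act(x) + lam * x

# Trace how a negative input is treated as training progresses.
total_steps = 10
trace = [progressive_activation(-2.0, blend_factor(step, total_steps))
         for step in (0, 5, 10)]
```

At the start the layer behaves exactly like the baseline activation, so the baseline weights remain valid. At the end it is a pure identity, which is the form that reparameterization can merge.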

Benefits

The main contributions can be summarized as follows: (1) We propose a unified and efficient depth pruning method for optimizing both CNN and vision transformer models; (2) We propose a progressive training strategy for subnet optimization, coupled with a novel block pruning strategy that uses the reparameterization technique; and (3) We conduct comprehensive experiments on both CNN and vision transformer models to demonstrate the pruning performance of our depth pruning method. As shown in Table 3 [1], where P6 indicates pruning 6 blocks of the model, applying the AMD method to ConvNeXtV1 yields three pruned ConvNeXtV1 models that surpass popular models with comparable inference performance. Further, as shown in Table 4 [2], our method outperforms other state-of-the-art methods in both accuracy and speedup ratio. Our proposed depth pruner achieves up to a 1.26X speedup on the AMD Instinct™ MI100 GPU accelerator, with only a 1.9% drop in top-1 accuracy.

Table 3: ConvNeXtV1 depth pruning results on ImageNet. Speedups are measured on an AMD Instinct MI100 GPU with a batch size of 128. The slowest network in the table (EfficientFormerV2) is used as the baseline (1.0 speedup) for comparison.

Table 4: DeiT depth pruning results on ImageNet. The results of S²ViTE (Tang et al. 2022) and WD-Pruning (Yu et al. 2022) are taken from their papers. SCOP (Tang et al. 2020), HVT (Pan et al. 2021), and XPruner (Yu and Xiang 2023) do not report the number of parameters or the speedup ratio. “*” denotes results taken from (Yu et al. 2022).

Conclusion

We present a unified depth pruner that prunes both efficient CNN models and vision transformer models in the depth dimension, and we have applied it to several CNN and transformer models. Its state-of-the-art pruning performance demonstrates the advantages of this method. In the future, we will explore the method on more transformer models and tasks.

“UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer” is publicly available at https://arxiv.org/pdf/2401.06426.pdf. Please reach out to amd_ai_mkt@amd.com with any questions.

Note:

[1][2]: The data comes from the public paper “UPDP: A Unified Progressive Depth Pruner for CNN and Vision Transformer,” https://arxiv.org/pdf/2401.06426.pdf