Greedy Search Algorithm for Mixed Precision in Post-Training Quantization of Convolutional Neural Network Inspired by Submodular Optimization

Satoki Tsuji (Fujitsu Laboratories Ltd.)*; Hiroshi Kawaguchi (Kobe University); Atsuki Inoue (Kobe University); Yasufumi Sakai (Fujitsu Laboratories Ltd.); Fuyuka Yamada (Fujitsu Laboratories Ltd.)

PMLR Page

Abstract

For bit-widths below 8 bits, many quantization strategies rely on re-training to recover the accuracy lost to the significant reduction in data. However, re-training works against the rapid deployment and wide distribution of quantized models. Post-training quantization has therefore been attracting more attention in recent years. One example is partial quantization guided by layer sensitivity, where sensitivity is measured as the accuracy after quantizing each layer; however, that approach does not take into account the impact that quantizing one layer has on the other layers. To further suppress the accuracy loss, we propose a quantization scheme that accounts for the impact of partial quantization on other layers by updating the accuracy measurements after each layer is quantized. Moreover, for greater data compression, we extend the scheme to mixed precision, which assigns a fitted bit-width to each layer. Since the search space for per-layer bit allocation grows exponentially with the number of layers N, conventional methods require computationally intensive approaches such as network training by stochastic gradient descent. Here, we derive practical solutions to the bit allocation problem in polynomial time O(N^2), without any training, using a deterministic greedy search algorithm inspired by submodular optimization. For example, the proposed algorithm completes a search on ResNet18 for ImageNet in 1 hour on a single GPU. Compared to the case without updating the layer sensitivity, our method improves the accuracy of the quantized model by more than 1% on multiple convolutional neural networks. For example, 6-bit quantization of MobileNetV2 achieves an 80.1% reduction in model size with an accuracy change of -1.10%, and 4-bit quantization of ResNet50 achieves an 82.9% size reduction with an accuracy change of -0.194%. Furthermore, the results show that the proposed method reduces the accuracy loss by roughly 0.7% or more compared to various recent post-training quantization strategies.
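The greedy search with sensitivity updates described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It is a simplified two-level version (each layer is either kept at a high bit-width or lowered to a low one), and every name in it (`greedy_quantize`, `toy_accuracy`, `PENALTY`, the layer names, and the accuracy numbers) is a hypothetical stand-in chosen for illustration. In practice, the `evaluate` callback would run the partially quantized network on a validation set.

```python
from typing import Callable, Dict, List


def greedy_quantize(
    layers: List[str],
    evaluate: Callable[[Dict[str, int]], float],
    low_bits: int,
    high_bits: int,
    max_drop: float,
) -> Dict[str, int]:
    """Greedily lower layers to low_bits, re-measuring sensitivity every round.

    Each round re-evaluates all remaining layers on top of the layers already
    quantized, so interactions between layers are reflected in the ranking.
    At most N + (N-1) + ... + 1 evaluations are made, i.e. O(N^2).
    """
    bits = {name: high_bits for name in layers}
    baseline = evaluate(bits)
    remaining = list(layers)
    while remaining:
        scores = {}
        for name in remaining:
            trial = dict(bits)
            trial[name] = low_bits
            scores[name] = evaluate(trial)
        # Pick the layer whose quantization hurts accuracy the least right now.
        best = max(remaining, key=lambda n: scores[n])
        if baseline - scores[best] > max_drop:
            break  # even the least harmful choice exceeds the accuracy budget
        bits[best] = low_bits
        remaining.remove(best)
    return bits


# Hypothetical per-layer accuracy penalties (assumption, not measured data).
PENALTY = {"conv1": 0.5, "conv2": 0.1, "conv3": 0.2, "fc": 0.6}


def toy_accuracy(bits: Dict[str, int]) -> float:
    """Made-up accuracy model with one interaction between two layers."""
    low = [name for name, b in bits.items() if b < 8]
    drop = sum(PENALTY[name] for name in low)
    if "conv2" in low and "conv3" in low:
        drop += 0.5  # quantizing both layers together costs extra accuracy
    return 70.0 - drop


result = greedy_quantize(list(PENALTY), toy_accuracy, low_bits=4, high_bits=8, max_drop=0.8)
print(result)
```

In this toy setting, a one-shot ranking by single-layer sensitivity would quantize `conv2` and then `conv3`, paying the interaction penalty. The updated ranking instead picks `conv2` and then `conv1`, because `conv3` is re-scored as more harmful once `conv2` is already quantized.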