
Performance Evaluation of Low-Precision Quantized LeNet and ConvNet Neural Networks

Guner TATAR, Dept. of Electrical-Electronics Eng., Fatih Sultan Mehmet Vakıf University, Istanbul, Turkey ([email protected])

Salih BAYAR, Dept. of Electrical-Electronics Eng., Marmara University, Istanbul, Turkey ([email protected])

Ihsan CICEK, Dept. of Electronics Eng., Gebze Technical University, Kocaeli, Turkey ([email protected])

2022 International Conference on INnovations in Intelligent SysTems and Applications (INISTA), DOI: 10.1109/INISTA55318.2022.9894261

Abstract—Low-precision neural network models are crucial for reducing memory footprint and computational density. However, existing methods typically rely on 32-bit floating-point (FP32) arithmetic to maintain accuracy. Floating-point parameters impose heavy memory requirements on convolutional and deep neural network models, and large bit-widths cause high computational density in hardware architectures. Moreover, existing models must evolve into deeper networks with millions or billions of parameters to solve today's problems. The large number of model parameters increases computational complexity and causes memory allocation problems, so existing hardware accelerators become insufficient to address them. In applications where accuracy can be traded off for lower hardware complexity, quantization of models enables the use of limited hardware resources to implement neural networks. From a hardware design point of view, quantized models are more advantageous than FP32 models in terms of speed, memory and power consumption. In this study, we compared the training and testing accuracy of the quantized LeNet and our own ConvNet neural network models at different epochs. We quantized the models using low-precision int-4, int-8 and int-16 arithmetic. As a result of the tests, we observed that the LeNet model could only reach 63.59% test accuracy at 400 epochs with int-16, whereas the ConvNet model achieved a test accuracy of 76.78% at only 40 epochs with low-precision int-8 quantization.

Index Terms—Convolutional neural networks, Quantized neural networks, FPGA, Hardware accelerators, Floating-point arithmetic, Fixed-point arithmetic, LeNet, ConvNet

I. INTRODUCTION

Convolutional neural networks (CNNs) have demonstrated significant performance improvements in a variety of applications, including object detection, recognition, and classification [1], [2]. CNN models often have extensive computational density, so software optimization and powerful hardware accelerator architectures are needed to alleviate this computational complexity. Neural network designers can opt for CPU/GPU clusters [3], FPGAs [4], or application-specific integrated circuits (ASICs) [5]–[7] for training and inference of CNN models. Customized accelerators on FPGAs have demonstrated promising throughput and power efficiency compared to traditional CPU/GPU architectures, owing to their inherent parallel processing capabilities [8].

Fig. 1. The trade-off between computational complexity and memory requirements [9].

Designers have developed more extensive architectures to increase the performance of CNNs. In a typical classification problem, using a wider and deeper CNN architecture reduces the classification error rate. In [9], the authors clearly demonstrated the relationship between computational density and memory requirement for the ImageNet classification task using different CNN models [10], [12]. As can be seen from Fig. 1, the ImageNet classification error rate drops from 17% to 2.9% as the network model grows, while the computational density and the memory requirement increase noticeably. General-purpose CPUs cannot handle CNNs with such computational complexity and memory requirements. Another problem is that CNN models consist of millions of parameters, which creates a bandwidth problem and makes memory accesses difficult.

Model pruning and weight/activation quantization methods are used to reduce the computational density and the number of parameters of CNN models [11]. While the pruning method prevents the network from overfitting during training, the quantization method can decrease inference accuracy if not used properly.


The designer should therefore realize a goal-oriented model by considering these trade-offs.

Quantizing the weights and activation functions of a CNN model has the drawback of potentially reducing inference accuracy. When the model parameters are kept in FP32 (32-bit floating-point) arithmetic, the model exhibits successful CNN inference after training. If we instead set the network to int-8 (8-bit integer) or lower precision, inference accuracy drops even if the network is well trained. On the other hand, the memory requirement of the quantized network is much lower than that of its floating-point counterpart, and the system consumes less energy. Hence, the quantized model is more suitable for battery-powered embedded devices. In [13], the authors performed image classification using the ResNet model and compared the inference accuracy of a floating-point implementation against an integer-quantized implementation; they showed that the floating-point network gives better accuracy than the integer-quantized one. On the other hand, in the study conducted in [14], the authors stated that a quantized neural network reduces computation time and energy consumption.
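To make the memory argument concrete, the back-of-the-envelope sketch below compares the weight storage needed at FP32 and at the integer precisions considered here. The parameter count is an illustrative, roughly LeNet-scale value and is not a figure taken from the paper.

```python
# Rough weight-storage footprint at different precisions.
# The parameter count is illustrative, not reported in the paper.
params = 60_000

for name, bits in [("FP32", 32), ("int-16", 16), ("int-8", 8), ("int-4", 4)]:
    kib = params * bits / 8 / 1024          # bits -> bytes -> KiB
    print(f"{name:>6}: {kib:6.1f} KiB")
```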

In this study, we used the LeNet [15] and our own ConvNet model to classify the CIFAR-10 dataset, targeting FPGA-based hardware accelerators. We set the weights and activation functions of the models to int-4, int-8 and int-16 low-precision integers and compared the performance of the networks.

We used the PyTorch framework to build our models and the Brevitas library to quantize weights and activation functions [16]. Brevitas is a PyTorch library for quantization-aware training (QAT). The idea of QAT is to reduce the effect of quantization-induced data loss in neural networks during training, so that accuracy is degraded less at inference time. The QAT representation scheme is given in Fig. 2: an int-32 accumulator is obtained by adding the int-8 valued inputs and biases, and the QuantReLU at the output re-quantizes the result to the intended integer bit-width. We used the QuantReLU method from the Brevitas library to quantize the activation functions of the models, and we imported the Int8Bias quantizer from Brevitas to quantize the bias values of the network and adjusted the network accordingly.

Fig. 2. Representative quantization-aware training scheme.

ONNX Runtime [18], FINN [19], [20], and PyTorch's quantized inference operators can be used to transfer the quantized model to FPGA hardware platforms.
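As a concrete illustration of this setup, the sketch below builds one quantized convolution stage with Brevitas, using QuantReLU for the activation and Int8Bias for the bias as described above. It is a minimal sketch assuming a recent Brevitas release; the layer sizes are arbitrary and keyword names may differ slightly between versions.

```python
import torch
import torch.nn as nn
from brevitas.nn import QuantIdentity, QuantConv2d, QuantReLU
from brevitas.quant import Int8Bias

class QuantStage(nn.Module):
    """One quantized conv + activation stage (illustrative sizes)."""
    def __init__(self, bit_width=4):
        super().__init__()
        # Quantize the incoming activations so the Int8Bias quantizer can
        # derive its scale from the input and weight scales.
        self.quant_in = QuantIdentity(return_quant_tensor=True)
        self.conv = QuantConv2d(3, 6, kernel_size=5,
                                weight_bit_width=bit_width,
                                bias=True, bias_quant=Int8Bias,
                                return_quant_tensor=True)
        self.relu = QuantReLU(bit_width=bit_width)

    def forward(self, x):
        return self.relu(self.conv(self.quant_in(x)))

stage = QuantStage(bit_width=4)            # int-4 weights and activations
out = stage(torch.randn(1, 3, 32, 32))     # CIFAR-10 sized input
```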

The quantized activation function QuantReLU is defined as [17]:

$$\mathrm{QuantReLU}(x, z_x, z_y, k) = \begin{cases} z_y & \text{if } x < z_x \\ z_y + k(x - z_x) & \text{if } x \ge z_x \end{cases} \tag{1}$$

When $z_x = 0$, $z_y = 0$ and $k = 1$, the ReLU commonly used in deep learning models is a special case of the above definition:

$$\mathrm{ReLU}(x, 0, 0, 1) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases} \tag{2}$$

We now show how QuantReLU is computed on quantized values. Starting from the floating-point ReLU,

$$y = \mathrm{ReLU}(x, 0, 0, 1) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } x \ge 0 \end{cases} \tag{3}$$

and substituting the quantization relations $x = s_x(x_q - z_x)$ and $y = s_y(y_q - z_y)$:

$$s_y(y_q - z_y) = \mathrm{ReLU}\big(s_x(x_q - z_x), 0, 0, 1\big) = \begin{cases} 0 & \text{if } s_x(x_q - z_x) < 0 \\ s_x(x_q - z_x) & \text{if } s_x(x_q - z_x) \ge 0 \end{cases} = \begin{cases} 0 & \text{if } x_q < z_x \\ s_x(x_q - z_x) & \text{if } x_q \ge z_x \end{cases}$$

Accordingly,

$$y_q = \begin{cases} z_y & \text{if } x_q < z_x \\ z_y + \dfrac{s_x}{s_y}(x_q - z_x) & \text{if } x_q \ge z_x \end{cases} = \mathrm{QuantReLU}\!\left(x_q, z_x, z_y, \frac{s_x}{s_y}\right) \tag{4}$$

As a result, to perform the quantized equivalent of the floating-point $y = \mathrm{ReLU}(x, 0, 0, 1)$, we simply need to compute

$$y_q = \mathrm{QuantReLU}\!\left(x_q, z_x, z_y, \frac{s_x}{s_y}\right) \tag{5}$$

where $x_q$ is the quantized input, $z_x$ and $z_y$ are zero points, and $s_x$, $s_y$ are positive floating-point scale factors.
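A small numerical sketch of Eq. (1) and Eq. (5) may help: it applies QuantReLU directly to quantized integers and checks the result against the floating-point ReLU of the dequantized input. The scales and zero points are made-up example values, not parameters from the paper.

```python
import numpy as np

def quant_relu(x_q, z_x, z_y, s_x, s_y):
    """Integer-domain ReLU from Eq. (5): y_q = QuantReLU(x_q, z_x, z_y, s_x/s_y)."""
    x_q = np.asarray(x_q, dtype=np.int32)
    y = np.where(x_q < z_x, z_y, z_y + (s_x / s_y) * (x_q - z_x))
    return np.clip(np.rint(y), -128, 127).astype(np.int8)   # int-8 output range

# Made-up example scales and zero points
s_x, z_x = 0.05, 3
s_y, z_y = 0.05, 3

x_q = np.array([-20, 0, 3, 40, 100], dtype=np.int32)
x = s_x * (x_q - z_x)                       # dequantized input
y_q = quant_relu(x_q, z_x, z_y, s_x, s_y)
y = s_y * (y_q.astype(np.float64) - z_y)    # dequantized output

print(y_q)                                  # quantized outputs
print(np.allclose(y, np.maximum(x, 0.0)))   # matches the float ReLU -> True
```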


The remainder of the paper is organized as follows. In Section II, we review the most recent advances and methods in quantization, provide background on LeNet models with low-precision integer quantization, and discuss the benefits and drawbacks of floating-point and low-precision quantized network models. In Section III, we describe the environment set up for training and inference of the LeNet and ConvNet models, along with the challenges we encountered and how we overcame them. In Section IV, we assess the performance of the LeNet and ConvNet models and show the effect of the training steps. Finally, Section V provides a brief overview of our work, its contribution to the literature, and our planned future studies.

II. BACKGROUND AND MOTIVATION

Researchers use CNN models for many purposes, including feature extraction, classification, and object recognition, by passing the input data through several types of layers. Each layer contains neurons that hold the input feature values and transfer them to the next layer. Depending on the intended use, the layers of a CNN architecture can be convolutional, pooling, normalization, fully connected, or flatten layers. Convolutional and fully connected layers often require high computational density and large memory size.

Generally, we can classify CNN architectures as thin if they have fewer than 50 layers, as medium if they have 50 to 100 layers, and as deep if they have more than 100 layers [9]. For battery-powered portable systems, small memory requirements and low computational density are crucial, because sustained computational load consumes more power. Quantizing the CNN model therefore extends the battery life of such devices.

Studies have shown that low-precision fixed-point arithmetic provides almost the same inference accuracy as floating-point arithmetic [22], [23]. Careful quantization of the weights and activation functions of a neural network model is advantageous in terms of inference accuracy as well as memory requirements and power consumption. In [21], the authors argued that the best scale for the network layers of their model is 16-bit fixed-point quantization. In [22], the authors showed that, for the LeNet model, 62.75% accuracy was achieved with 8-bit quantization and 68.70% accuracy with 16-bit quantization, while 68.6% accuracy was obtained with FP32. In [23], the authors proposed 16-bit fixed-point arithmetic as the best scale factor at each layer of the model and showed that fixed-point incurs an accuracy loss of about 2% compared to floating-point. Inference accuracy and latency depend on software optimizations and the use of hardware accelerators, so designers should keep their work goal-oriented. A one-to-one comparison of studies in the literature is often not possible. Moreover, the configuration preferred for one model may not give good results for another, or vice versa. For example, Adam optimization may give the desired results in one model, while SGD may perform better in another; this can only be determined by trial and error.

III. MODEL SETUP, TRAINING AND INFERENCE

We built our model following the LeNet structure shown in Fig. 3. In contrast to the conventional LeNet, we added a dropout layer after the fully-connected (FC) layers in our model. We observed that with this approach the network trained properly without memorizing the data. We also realized that, even with the dropout layer, the network trains normally up to a certain number of epochs (50-60), after which it begins memorizing at higher epoch counts (more than 60). We chose categorical cross-entropy as the model's cost function for classification, which is the mean of the cross-entropy over the whole training data set. Categorical cross-entropy is used when the labels of the data set are one-hot encoded, meaning that only one element of the label vector is set at a time, such as [0, 0, 1], [0, 1, 0], and [1, 0, 0]. We used stochastic gradient descent (SGD) for optimization in our model. We also tried the Adam optimizer, but the network trained more slowly than with SGD, and SGD also gave better inference accuracy. Categorical cross-entropy and the SGD update can be defined mathematically as in Eq. (6) and Eq. (7), respectively.

$$\mathrm{Loss} = -\sum_{i=1}^{N} y_i \log(\hat{y}_i) \tag{6}$$

where $y_i$ is the true label and $\hat{y}_i$ is the predicted label.

$$\theta = \theta - \eta \, \nabla_\theta J\!\left(\theta; x^{(i)}; y^{(i)}\right) \tag{7}$$

$$\nabla = \frac{\partial}{\partial x}\hat{\imath} + \frac{\partial}{\partial y}\hat{\jmath} + \frac{\partial}{\partial z}\hat{k} \tag{8}$$

where $\nabla$ is the gradient operator, $\eta$ is the learning rate, $x^{(i)}$ is a training example and $y^{(i)}$ is its label.

Fig. 3. LeNet model classifier architecture [15].

We trained the models according to the algorithmic flowchart presented in Fig. 4. In the flowchart, the weights and activation functions of the model are first initialized randomly. Then, feed-forward propagation is performed on the input data set, and the error between the predicted result and the actual result is computed. Training ends if the error is below the preset threshold; otherwise the weights continue to be updated. Finally, if the epoch count is insufficient, it is increased.

Fig. 4. Algorithmic flow of the model training process.
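The flow in Fig. 4 corresponds to a standard PyTorch training loop. The sketch below shows the shape of that loop with the categorical cross-entropy loss of Eq. (6) and the SGD update of Eq. (7); the tiny random data set and the linear model are placeholders, not the paper's setup. Note that PyTorch's CrossEntropyLoss takes integer class indices, which is equivalent to the one-hot formulation above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder model and data; substitute the quantized LeNet/ConvNet and CIFAR-10.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
data = TensorDataset(torch.randn(64, 3, 32, 32), torch.randint(0, 10, (64,)))
loader = DataLoader(data, batch_size=16, shuffle=True)

criterion = nn.CrossEntropyLoss()                         # Eq. (6)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # Eq. (7)

for epoch in range(3):                     # increase the epoch count if the loss stays high
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)      # feed-forward and error computation
        loss.backward()                    # back-propagate the error
        optimizer.step()                   # update the weights
    print(f"epoch {epoch}: loss {loss.item():.4f}")
```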

As can be observed in Table I, the LeNet model achieved 66% training accuracy with int-16 weights and activation functions at 400 epochs, while the testing accuracy was 63.59%. As is commonly known, LeNet is one of the first and simplest network models [24], and the low training and test accuracy stems from its limited capacity to learn. Additionally, CIFAR-10 is a data set that is quite difficult to classify.

We included our own ConvNet model in the study to obtain better results and demonstrate its performance. Fig. 5 illustrates the ConvNet model, which consists of four convolutional, three fully connected (FC), and two max-pooling layers. The number of parameters increases with the number of layers in the model, which directly affects training and inference time. With this new model, we performed a small number of training and inference runs at 20, 30 and 40 epochs.
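For orientation, the sketch below shows one possible PyTorch layout matching this description (four convolutional, two max-pooling and three FC layers). The channel widths and kernel sizes are assumptions, since the paper does not list them, and the quantized version would replace these layers with their Brevitas counterparts.

```python
import torch
import torch.nn as nn

class ConvNet(nn.Module):
    """Hypothetical layout: four conv, two max-pool, three FC layers."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 32x32 -> 16x16
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 16x16 -> 8x8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = ConvNet()(torch.randn(1, 3, 32, 32))   # CIFAR-10 sized input
```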

We set the weights and activation functions to int-4, int-8 and int-16 for each training run. Training took approximately 11 minutes per epoch and about 4.6 hours in total for 40 epochs.

TABLE I
PERFORMANCE COMPARISON OF LOW-PRECISION QUANTIZATION IN LENET

Act. & weight bits | Epochs | Loss   | Train acc. (%) | Test acc. (%)
4                  | 100    | 1.0767 | 62             | 60.69
4                  | 200    | 1.0395 | 63             | 60.97
4                  | 300    | 1.0224 | 63             | 60.86
4                  | 400    | 1.0147 | 64             | 60.59
8                  | 100    | 1.0826 | 62             | 60
8                  | 200    | 1.0337 | 63             | 61.13
8                  | 300    | 0.9957 | 65             | 62
8                  | 400    | 0.9906 | 65             | 62.477
16                 | 100    | 1.0696 | 64             | 61.13
16                 | 200    | 1.0353 | 65             | 61.94
16                 | 300    | 0.0997 | 65             | 63.294
16                 | 400    | 0.9876 | 66             | 63.59

Fig. 5. Proposed ConvNet model classifier architecture.

As is obvious from these results, we were limited to only a few training and inference runs due to time constraints. Table II shows the performance results of the ConvNet model. The ConvNet model clearly outperforms the LeNet model even after only 20 epochs. Furthermore, Table II shows that int-4 and int-8 quantization outperform int-16 quantization, with int-8 performing best. In other words, increasing the bit-width of the weight and activation quantization does not improve, and can even degrade, inference accuracy for this model. We used a computer with the specifications listed in Table III to train both models.

IV. PERFORMANCE EVALUATION

CNN and deep neural network (DNN) based algorithms are among the most commonly used methods for solving artificial intelligence problems. Depending on the level and type of problem, CNN or DNN models can be simple or complex. Simple CNN models performed exceptionally well when machine learning and deep learning were first introduced, but the complexity of today's problems necessitates more complex and deeper models. The success of deeper networks in solving today's problems also depends on


TABLE II
PERFORMANCE COMPARISON OF LOW-PRECISION QUANTIZATION IN THE PROPOSED CONVNET

Act. & weight bits | Epochs | Loss   | Train acc. (%) | Test acc. (%)
4                  | 20     | 0.6546 | 77             | 72.34
4                  | 30     | 0.1016 | 96             | 74.56
4                  | 40     | 0.0546 | 98             | 76.13
8                  | 20     | 0.6493 | 77             | 72.56
8                  | 30     | 0.1131 | 96             | 74.57
8                  | 40     | 0.0543 | 98             | 76.78
16                 | 20     | 0.6868 | 75             | 70.56
16                 | 30     | 0.1178 | 95             | 73.48
16                 | 40     | 0.0646 | 96             | 74.13

TABLE III
SPECIFICATIONS OF THE COMPUTER USED IN CALCULATIONS

Parameter           | Value      | Unit
CPU Manufacturer    | INTEL      | -
CPU Variant         | i7-7700HQ  | -
CPU Clock Frequency | 2.8 to 3.8 | GHz
CPU Core Size       | 4          | -
Cache Size          | 6          | MB
RAM Size            | 16         | GB
GPU Manufacturer    | NVIDIA     | -
GPU Chipset         | GTX1050    | -
GPU RAM Size        | 4          | GB

software optimization and the utilization of hardware accelerators. Because complex models require more memory, they cause performance bottlenecks in hardware as well as power consumption issues, particularly in battery-powered portable devices. These constraints place the designer in a bind. As a result, designers have devised techniques for reducing the computational density and power consumption of complex neural networks; pruning, pooling, and thinning methods are frequently used to this end [25], [26]. In addition, by carefully balancing accuracy and complexity, model quantization can yield hardware-implementable neural networks. Indeed, our study has shown that quantization can ease memory requirements and computational complexity. Since CIFAR-10 is already a difficult data set to classify, the highest test accuracy we obtained with LeNet was 63.59%, whereas, as shown in Table II, we achieved 76.78% test accuracy in just 40 epochs with our more advanced ConvNet model using low-precision int-8.

V. CONCLUSION AND FUTURE WORK

In this paper, we compared the performance of CIFAR-10 classification using different low-precision bit-widths for use in FPGA-based hardware accelerator architectures. We first applied the parameter settings shown in Table I to the LeNet model and found that low-precision int-16 quantization outperformed the other bit-widths. The accuracy and loss function during training of the LeNet model for 400 epochs are shown in Fig. 6-a and Fig. 6-b, respectively. Second, we applied the parameter settings from Table II to our ConvNet model with fewer epochs.

Fig. 6. Low-precision int-8 LeNet test accuracy (a) and cost function (b) at 400 epochs.

Consequently, low-precision int-8 quantization outperformed the others. We also present the accuracy and loss function during training of the ConvNet model for 40 epochs in Fig. 7-a and Fig. 7-b, respectively. We can conclude that a similar problem can be solved with a smaller bit-width rather than a large one, which reduces hardware computational complexity and memory requirements.

Artificial intelligence and its sub-branches ML and DL are increasingly used in consumer electronics, industry, automotive, and defense. Our research aims to develop an infrastructure for the implementation of digital-design-based advanced driver assistance systems, including FPGAs, for vehicles in use today. The automotive industry has already started the new era of self-driving vehicles, in which the driver is replaced by computers running ML and DL algorithms along with an operating system. All these enhancements increase the power consumption of the autonomous car's electrical infrastructure. As a consequence, future AI problems require solutions that not only provide enough accuracy for proper decision making but also consume less power, which mandates hardware solutions. The balance that model quantization strikes between the required accuracy and

Fig. 7. Low-precision int-8 ConvNet test accuracy (a) and cost function (b) at 40 epochs.

hardware complexity will set the performance boundaries for what these algorithms can achieve in the future.

REFERENCES

[1] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.

[2] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, "You only look once: Unified, real-time object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.

[3] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, A. Senior, P. Tucker, K. Yang, Q. V. Le et al., "Large scale distributed deep networks," in Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.

[4] K. Ovtcharov, O. Ruwase, J.-Y. Kim, J. Fowers, K. Strauss, and E. S. Chung, "Accelerating deep convolutional neural networks using specialized hardware," Microsoft Research Whitepaper, vol. 2, no. 11, pp. 1–4, 2015.

[5] T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, "DianNao: A small-footprint high-throughput accelerator for ubiquitous machine-learning," in ACM SIGPLAN Notices, vol. 49, no. 4, 2014, pp. 269–284.

[6] G. Tatar, S. Bayar, and I. Cicek, "FPGA design of a high-resolution FIR band-pass filter by using LabVIEW environment," Avrupa Bilim ve Teknoloji Dergisi, no. 29, pp. 273–277, Dec. 2021, doi: 10.31590/ejosat.1016363.

[7] G. Tatar, I. Cicek, and S. Bayar, "FPGA design of a fourth order elliptic IIR band-pass filter using LabVIEW," Avrupa Bilim ve Teknoloji Dergisi, no. 26, pp. 122–127, Jul. 2021, doi: 10.31590/ejosat.951601.

[8] X. Wei, C. H. Yu, P. Zhang, Y. Chen, Y. Wang, H. Hu, Y. Liang, and J. Cong, "Automated systolic array architecture synthesis for high throughput CNN inference on FPGAs," in Proceedings of the 54th Annual Design Automation Conference, 2017, p. 29.

[9] C. Wu et al., "Low-precision floating-point arithmetic for high-performance FPGA-based CNN acceleration," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 15, no. 1, pp. 1–21, 2021.

[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.

[11] M. P. Véstias et al., "A fast and scalable architecture to run convolutional neural networks in low density FPGAs," Microprocessors and Microsystems, vol. 77, p. 103136, 2020.

[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, "ImageNet large scale visual recognition challenge," International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.

[13] B. Jacob et al., "Quantization and training of neural networks for efficient integer-arithmetic-only inference," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[14] M. Nagel et al., "A white paper on neural network quantization," arXiv preprint arXiv:2106.08295, 2021.

[15] Y. LeCun et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[16] A. Pappalardo, Xilinx/brevitas, 2021. [Online]. Available: https://doi.org/10.5281/zenodo.3333552

[17] L. Mao, "Quantization for neural networks," https://leimao.github.io/article/Neural-Networks-Quantization/ (accessed May 10, 2022).

[18] ONNX Runtime developers, ONNX Runtime, https://onnxruntime.ai/, 2021.

[19] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers, "FINN: A framework for fast, scalable binarized neural network inference," in Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA '17), 2017, pp. 65–74.

[20] M. Blott, T. B. Preußer, N. J. Fraser, G. Gambardella, K. O'Brien, Y. Umuroglu, M. Leeser, and K. Vissers, "FINN-R: An end-to-end deep-learning framework for fast exploration of quantized neural networks," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 11, no. 3, pp. 1–23, 2018.

[21] Q. Xiao, Y. Liang, L. Lu, S. Yan, and Y.-W. Tai, "Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs," in 2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017, pp. 1–6.

[22] K. Guo, L. Sui, J. Qiu, J. Yu, J. Wang, S. Yao, S. Han, Y. Wang, and H. Yang, "Angel-Eye: A complete design flow for mapping CNN onto embedded FPGA," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 37, no. 1, pp. 35–47, 2017.

[23] Y. Ma, Y. Cao, S. Vrudhula, and J.-S. Seo, "Optimizing the convolution operation to accelerate deep neural networks on FPGA," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 26, no. 7, pp. 1354–1367, 2018.

[24] A. Bouti et al., "A robust system for road sign detection and classification using LeNet architecture based on convolutional neural network," Soft Computing, vol. 24, no. 9, pp. 6721–6733, 2020.

[25] T. Liang et al., "Pruning and quantization for deep neural network acceleration: A survey," Neurocomputing, vol. 461, pp. 370–403, 2021.

[26] J.-H. Luo and J. Wu, "AutoPruner: An end-to-end trainable filter pruning method for efficient deep model inference," Pattern Recognition, vol. 107, p. 107461, 2020.
