A Novel Approximate Synthesis Flow for Convolutional Neural Networks

We propose an approximate synthesis technique for an energy-efficient FIR filter with an acceptable level of accuracy. In addition, we propose an approximate synthesis flow for convolutional neural networks (CNNs) to take advantage of the fault tolerance of neural networks. The proposed approximate synthesis flow is applied to CNN Multiply and Accumulate (MAC) operations to improve energy efficiency.

The power consumption of the MAC modules for convolution of 3 × 3 and 5 × 5 matrices improved by 46.4% and 43.4%, respectively, while the degradation of output quality was negligible for handwritten digit recognition.

Yesung Kang, Jaewoo Kim, and Seokhyeong Kang, “A New Approximate Synthetic Flow for Energy Efficient FIR Filter,” Proc.

Energy domain of the proposed synthesis flow and exhaustive search of MAC modules for (a) 3 × 3 filter and (b) 5 × 5 filter.

List of Tables

List of Algorithms

Sensitivity-based approximate synthesis flow

Nomenclature

Chapter I Introduction

A popular idea for reducing complexity is the multiplierless FIR filter [1], in which multiplication is performed with shifters and adders instead of multipliers. In conventional multiplierless FIR filters, all coefficients are expressed in the signed power-of-two (SPT) space rather than in plain binary, because SPT can reduce the number of nonzero digits. Among the SPT codes, the canonical signed digit (CSD) code is well known to reduce the complexity of FIR filters effectively.
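As an illustration of why CSD helps, the following minimal sketch converts a nonnegative integer coefficient to CSD digits and multiplies by it using only shifts, additions, and subtractions. The function names `to_csd` and `shift_add_multiply` are hypothetical and are not taken from the thesis.

```python
def to_csd(value):
    """Return the CSD digits of a nonnegative integer, least significant first.

    Each digit is -1, 0, or +1, and no two adjacent digits are both nonzero.
    """
    digits = []
    while value != 0:
        if value & 1:
            # value % 4 == 1 gives digit +1; value % 4 == 3 gives digit -1
            digit = 2 - (value & 3)
            value -= digit
        else:
            digit = 0
        digits.append(digit)
        value >>= 1
    return digits


def shift_add_multiply(x, coefficient):
    """Multiply x by a constant using only shifts, additions, and subtractions."""
    result = 0
    for shift, digit in enumerate(to_csd(coefficient)):
        if digit == 1:
            result += x << shift   # one adder
        elif digit == -1:
            result -= x << shift   # one subtractor
    return result
```

For example, the coefficient 7 is 0111 in binary (three nonzero digits), whereas its CSD form is 8 − 1 (two nonzero digits), so the constant multiplication needs one adder/subtractor fewer.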

They replaced certain modules with approximate modules based on lookup tables to reduce energy consumption with only a small degradation in output quality. The work in [3] considered voltage scaling to save energy, but it was observed that errors occurring along the critical path were generally more severe than those due to approximation. The work in [5] applied approximate computing to an FIR filter, but did not provide an automated synthesis flow for the approach.

As the FIR filter design grows, it becomes difficult to find optimal configurations for the approximate adders. In this paper, we propose a new approximate synthesis technique that reduces energy consumption by replacing the conventional adders/subtractors in the FIR filter with approximate adders/subtractors using an automated synthesis flow, as shown in Figure 1.1. An accuracy-configurable adder/subtractor is proposed that is energy efficient and has relatively high accuracy.

The maximum error due to the proposed adder/subtractor configuration is analyzed to evaluate the quality of the output. By using the proposed approximate synthesis flow, we can reduce energy and power consumption and improve performance while ensuring a reasonable level of accuracy. In Section III, our proposed approximate adder/subtractor is presented and its accuracy is analyzed.

In Section V, the approximate synthesis flow for the convolutional neural networks is described, and in Section VI we provide a summary of our work.

Chapter II Related Work

The work in [7] introduced a fast adder with shorter carry chains that considers only the previous k bits of the input when computing each carry bit. The work in [8] proposed a variable latency speculative adder (VLSA), a reliable version of the Lu adder [7] with error detection and correction. The work in [9] also proposed a datapath redesign technique for different adders that reduces the length of the critical paths in the carry chain.

ETAI is divided into an accurate part and an inaccurate part to achieve approximation. ETAII reduces carry propagation to speed up the adder, and ETAIIM modifies ETAII by connecting the carry chains to correct the MSBs. The work in [5] performed approximation at the transistor level and proposed approximate full adder cells to design multibit adders for video applications, saving power and area.

Figure 2.1: Schematic of FIR filter.

The ACA adder can save power in the approximate mode and provide exact results in the accurate mode. The work in [11] proposed a systematic design methodology for approximate computing that eliminates certain nodes from the original set of nodes, and analyzed how the eliminated nodes affect accuracy and power consumption. The work in [16] proposed a broken-state multiplier, but it has a low probability of producing the correct result.

The simplified 2×2 approximate multiplier has only five unit cells, while the precise multiplier has eight unit cells. The simplification not only reduces the length of the critical paths of approximate multipliers, but also consumes less power and performs better than precise multipliers.

Chapter III APPROXIMATE SYNTHESIS FLOW FOR FIR FILTER

The accuracy of the adder/subtractor can be configured by changing the parameter AP, the bit width of the approximate part. If AP is 0, the result of the proposed adder/subtractor is identical to that of the conventional adder/subtractor. As AP increases, the output accuracy decreases, while power consumption is reduced or performance improves.

However, if AP is greater than a certain value, the propagation delay of the approximate part reaches the delay of the precise part, and the benefits of further approximation are reduced. The largest approximation error occurs when all input bits are in the approximate part. In this case, both input operands are $2^{AP} - 1$. As a result, the maximum error that can occur in the approximate adder is $2^{AP} - 1$.
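The error bound above can be checked numerically with a simple stand-in model. The sketch below is not the adder proposed in this work: it assumes, purely for illustration, that the lower AP bits are combined with a bitwise OR and that no carry enters the precise upper part. The name `approx_add` is hypothetical. Under this assumption the error equals the bitwise AND of the two lower parts, so it peaks at $2^{AP} - 1$ when both operands are $2^{AP} - 1$, and the approximate result never exceeds the exact one.

```python
def approx_add(a, b, ap):
    """Add two nonnegative integers with an approximate lower part of `ap` bits.

    Assumed stand-in model: lower bits are OR-ed, and the upper bits are added
    exactly without a carry from the lower part.
    """
    mask = (1 << ap) - 1
    low = (a | b) & mask                      # approximate part, no carry chain
    high = ((a >> ap) + (b >> ap)) << ap      # precise part
    return high + low


def max_error(ap, bit_width=8):
    """Exhaustively find the largest error over all operand pairs."""
    limit = 1 << bit_width
    return max(a + b - approx_add(a, b, ap)
               for a in range(limit) for b in range(limit))
```

With `ap = 0` the mask is empty and the function reduces to an exact addition, matching the statement that AP = 0 reproduces the conventional adder.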

The goal of the synthesis flow is to find the optimal AP configurations of the approximate adders. However, finding the optimal AP configuration is difficult because the number of possible configuration combinations is proportional to $N^{M_{adder}}$, where $M_{adder}$ is the number of adders and $N$ is the bit width of the adders. The bit widths of the input, coefficients, and output in the example are 15, 12, and 28 bits, respectively.

Although the design in the example is small, the number of adders in conventional FIR filters is typically greater than six, so there are too many possible combinations of AP configurations to analyze exhaustively. Considering that the number of adder steps (AS) of the FIR filter is much smaller than $M_{adder}$, we can significantly reduce the design space. In the first step, all adders in the baseline design are classified according to their AS (Line 2).
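To give a feel for the reduction described above, the following sketch counts AP configurations when every adder is configured independently and when all adders in the same adder step share one AP. The numbers used (16-bit adders, 10 adders, 4 adder steps) are hypothetical and are not taken from the example design; `count_configurations` is likewise an illustrative name.

```python
def count_configurations(bit_width, independent_choices):
    """Number of AP assignments when each choice ranges over `bit_width` values."""
    return bit_width ** independent_choices


# Hypothetical design: 16-bit adders, 10 adders arranged in 4 adder steps.
per_adder = count_configurations(16, 10)   # one AP per adder
per_step = count_configurations(16, 4)     # one AP shared per adder step
reduction = per_adder // per_step
```

In this hypothetical case the search space shrinks from about 1.1 × 10^12 configurations to 65,536, which is small enough to explore with repeated synthesis runs.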

The perturbed Verilog design is synthesized and the delay in the design is calculated (Lines 6-7).
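The text here only states that adders are grouped by adder step, that a perturbed design is synthesized, and that its delay is measured. The sketch below is one plausible reading of such a sensitivity-based loop, not the algorithm of this work: it greedily raises the AP of whichever adder step yields the largest delay reduction while a maximum-error budget holds. The callbacks `delay_of` and `error_of` are hypothetical stand-ins for synthesis and error analysis.

```python
def sensitivity_search(n_steps, max_ap, delay_of, error_of, error_budget):
    """Greedy AP search: perturb one adder step at a time, keep the best move."""
    config = [0] * n_steps                    # AP = 0 everywhere: exact baseline
    while True:
        base_delay = delay_of(config)
        best_gain, best_trial = 0, None
        for step in range(n_steps):
            if config[step] >= max_ap:
                continue
            trial = list(config)
            trial[step] += 1                  # perturb a single adder step
            if error_of(trial) > error_budget:
                continue                      # violates the accuracy constraint
            gain = base_delay - delay_of(trial)   # sensitivity of delay to this step
            if gain > best_gain:
                best_gain, best_trial = gain, trial
        if best_trial is None:
            return config                     # no feasible improving perturbation
        config = best_trial


# Toy cost models (assumptions): delay shrinks linearly with AP, and the
# worst-case error of each step is 2**AP - 1.
toy_delay = lambda cfg: sum(10 - ap for ap in cfg)
toy_error = lambda cfg: sum((1 << ap) - 1 for ap in cfg)
result = sensitivity_search(2, 4, toy_delay, toy_error, error_budget=10)
```

In a real flow each call to `delay_of` would correspond to a synthesis run, which is why keeping the number of adder steps small matters for runtime.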

Figure 3.4 shows a schematic diagram of the carry generator and the sum generator. If a carry is generated by a previous carry generator, it is passed on to the next one.

Chapter IV

Energy domain of the proposed synthesis flow (red) and exhaustive search of the FIR filter (black).

As shown in Figure 4.1, the proposed synthesis flow can successfully follow the minimum delay design. Furthermore, the proposed synthesis flow can effectively reduce power and energy consumption.

Since the main concern of our work is high energy efficiency, we resynthesize the design obtained from the synthesis flow under different timing constraints. We then select the result with the lowest energy consumption whose delay does not exceed that of the baseline design. However, after resynthesis, the energy consumption of the droplet is greater than that of the final solution.

The running time of the proposed synthesis flow is 84 minutes for a four-tap FIR filter. To verify our methodology, we apply the proposed synthesis flow to five different FIR filters [21–. Delay, power, and energy data for the baseline FIR filter designs are also summarized in Table 4.3.

The FIR filters are synthesized using the proposed synthesis flow, with the input, coefficient, and output bit widths set to 8, 16, and 24 bits, respectively. The energy consumption of the FIR filters is reduced on average by up to 38.9% and 31.2%. To verify the output quality of the processed image, the peak signal-to-noise ratio (PSNR) is used.
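For reference, PSNR follows the standard definition $10 \log_{10}(MAX^2 / MSE)$, where MSE is the mean squared error between the exact and approximate outputs. The sketch below assumes 8-bit pixels stored as flat sequences; the function name and the default `max_value` of 255 are assumptions and do not come from the experimental setup.

```python
import math


def psnr(reference, test, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(test) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf                       # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

A higher PSNR means the output of the approximate filter is closer to the output of the exact filter.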

This is because the proposed adder approximates the previous carry and the approximation error makes the result lower in value than the exact result.

Figure 4.1: Accuracy vs. delay domain of the proposed synthesis flow (red) and the exhaustive search of the FIR filter (black)

Chapter V APPROXIMATE SYNTHESIS FLOW FOR CONVOLUTIONAL NEURAL NETWORKS

A kernel of C channels is convolved with an input feature map of C channels to produce a single channel of the output feature map. After additional processing, such as subsampling, the computed output feature maps are fed to the next convolution layer as input feature maps. Despite the excellent image recognition capabilities of CNNs, their high computational intensity is one of the major obstacles to hardware implementation.
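The computation of one output channel can be sketched as follows. This is a minimal illustration that assumes stride 1, no padding, and no bias term; the function name `conv_output_channel` and the nested-list layout (channel, row, column) are choices made for the example only.

```python
def conv_output_channel(inputs, kernel):
    """Convolve a C x H x W input with a C x K x K kernel into one output channel."""
    channels = len(inputs)
    height, width = len(inputs[0]), len(inputs[0][0])
    k = len(kernel[0])
    output = []
    for row in range(height - k + 1):
        out_row = []
        for col in range(width - k + 1):
            acc = 0
            # Multiply-and-accumulate over all channels and kernel positions.
            for c in range(channels):
                for i in range(k):
                    for j in range(k):
                        acc += inputs[c][row + i][col + j] * kernel[c][i][j]
            out_row.append(acc)
        output.append(out_row)
    return output
```

Each output pixel requires C × K × K multiply-and-accumulate operations, which is why the MAC module dominates the cost of the convolution layer.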

One of the dominant computational workloads is the multiply-and-accumulate (MAC) operation in the convolution layer. A scheme similar to the approximate synthesis flow for the FIR filter can also be applied to a multiplierless artificial neuron. As in the case of the FIR filter, the adders belonging to the same adder step share the same AP for the approximate adder/subtractor, which reduces the number of combinations to a computable level.
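The idea of sharing one AP per adder step can be illustrated with a small accumulation tree. The sketch below is an assumption-laden model, not the MAC module of this work: it handles nonnegative partial products only, pairs values level by level, and uses a stand-in approximate adder whose lower AP bits are OR-ed with no carry into the upper part. The names `approx_add` and `adder_tree` are hypothetical.

```python
def approx_add(a, b, ap):
    """Stand-in approximate adder: lower `ap` bits OR-ed, upper bits added exactly."""
    mask = (1 << ap) - 1
    return (((a >> ap) + (b >> ap)) << ap) + ((a | b) & mask)


def adder_tree(values, ap_per_step):
    """Accumulate nonnegative values in a tree; one AP per adder step."""
    level = list(values)
    step = 0
    while len(level) > 1:
        ap = ap_per_step[step]                # every adder in this step shares AP
        nxt = [approx_add(level[i], level[i + 1], ap)
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            nxt.append(level[-1])             # odd element passes through unchanged
        level = nxt
        step += 1
    return level[0]


# Nine partial products, as in a 3 x 3 kernel: four adder steps are needed.
products = [3, 5, 7, 9, 11, 13, 15, 17, 19]
exact = adder_tree(products, [0, 0, 0, 0])
approximate = adder_tree(products, [2, 2, 1, 0])
```

With this grouping, a 3 × 3 MAC exposes only four AP values to the search instead of one per adder.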

Our proposed approximate synthesis flow, described in Section 3.2, finds the optimal approximate MAC module design with minimum delay and the required accuracy. The structure of the 5 × 5 kernel MAC module can be designed similarly to that of the 3 × 3 kernel MAC module. The 3 × 3 kernel and 5 × 5 kernel MAC modules are designed in Verilog to verify the power and energy consumption of the approximate MAC modules obtained with the proposed synthesis flow.

To analyze the output quality of the proposed approximate CNN, we implemented the CNN model in C++ and evaluated its classification accuracy. MAC modules for convolution with a 3 × 3 kernel and a 5 × 5 kernel are implemented in Verilog and run through the proposed synthesis flow to calculate the energy consumption. The black dots represent the results obtained by randomly configuring the APs, and the red dots represent the results of the proposed synthesis flow.

To account for the worst-case error, the minimum precision of the MAC operations in the convolutional layer is assumed.

Figure 5.2: Computation of convolution layer [26]

Chapter VI CONCLUSION