A New Parallel Prefix Adder Structure With Efficient Critical Delay Path And Gradded Bits
Efficiency In CMOS 90nm Technology
H. Moqadasi Dept. Elect. Engineering
Shahed university Tehran- IRAN h.moqadasi AT shahed.ac.ir
M. B. Ghaznavi-Ghoushchi Dept. Elect. Engineering
Shahed university Tehran-IRAN ghaznavi AT shahed.ac.ir
Abstract— In this work we have proposed an efficient parallel prefix adder (PPA) that is a variation of the popular Brent-Kung PPA. In this proposed adder, with a glance on the sklansky adder and by varying the graph topology, we have reduced the number of stages in the critical delay path in comparison with Brent-Kung PPA and so lowered the delay. This advantage is along with a little increase in power consumption and area due to increasing the cells number so that the consequent Power Delay Product (PDP) is improved. Also the performance of our proposed PPA increases by increasing the input bits number from 32 bits up to higher orders. The experimental results indicate that our proposed adder is faster than the Brent–Kung adder by 10.1% in 32 bits, 17.2% in 64 bits and 21.3% in 128 bits and the consequent PDP is reduced by 8.8% in 32 bits, 15.6% in 64 bits and 19.5% in 128 bits. Circuit level simulations were performed with SPICE and CMOS 90nm technology.
Keywords: Parallel Prefix Adder; Brent-Kung Adder; Sklansky Adder; Power-Delay Product; CMOS Technology;
I. INTRODUCTION
VLSI adders have a lot of applications in many fields such as arithmetic and logic units (ALU’s), microprocessors, floating point arithmetic units, memory addressing and program counting units and etc [1], [2], [3], [4], [5] so, much attention has been paid to the increasing of their performances.
The prefix operation is an essential operation which has applications in the design of fast adders. Many fast adders were proposed based on the prefix computation [6]. Among all adders the parallel prefix adders (PPA’s) are in the spotlight recently [7]. That is because they use a tree network for reducing latency to log2(n) through the carry path compared to O(n) in the RCA where ‘n’ is number of input bits[8], [9], [10]
and also they have regular structure and simple cells [4].
Besides, they are appropriate for implementation of adders with wider word lengths [10]. Based on the prefix structure, various PPAs are presented up to now. Sklansky [11]. Kogge- Stone [12] and Brent-Kung [13] adders are the classic and representative PPAs. The Sklansky adder has the minimum logic depth which is log2 (n) (‘n’ is number of input bits) and also has the minimum wiring tracks. These performances are
combined with a drawback that is the large fan-out that increases linearly with logic depth. Therefore causes a big layout area and lowers the circuit speed [2]. Ladner-Fischer prefix network has higher logic depth in comparison with the Sklansky adder with a lower fan-out in the critical delay path nodes [8]. The Kogge-Stone adder is the fastest PPA and has the optimal logic depth as log2(n)and low fan-out in its prefix network, instead the wire connections between its stages are complex, this results layout implementation problems. Also their computational cells are numerous which cause to more chip area and more power consumption [2], [4], [14]. The Brent-Kung adder is proposed to reduce the drawbacks of Kogge-Stone adder with less carry merge cells and less complexity. But in turn it has some disadvantages. For example it has higher logic depth and hence more delay [15]. Han- Carlson adder is the combination of the Brent Kung and Kogge Stone adders, which trades between the wiring tracks (similarly the prefix cells) and the logic depth [14]. Hybrid Han-Carlson adder is a variation of the Han-Carlson adder which is suited for large word sizes. In this adder delay increases slightly but the complexity, silicon area and power consumption are reduced considerably [16]. Knowles has presented a family of adders that trades between wiring tracks and the fan-out of intermediate nodes [17]. Harris has presented a three- dimensional taxonomy that describes the trade-offs among existing parallel prefix adders [18].
In this paper we have presented a parallel prefix adder based on the Brent-Kung adder so that by taking idea form the Sklansky adder the lateral fan-out of some computational nodes in the prefix tree is increased and the number of stages in its critical delay path is reduced. Consequently the speed and also power delay product is improved. Our simulations show this PPA lets us have an improved efficiency by increasing the inputs word sizes from 32 bits up to higher bit numbers.
This paper is organized as follows: Section II describes the theme and basics of parallel prefix addition. Section III reviews the Brent-Kung adder, section IV presents our proposed PPA and section V is dedicated to the simulations. Finally section VI is conclusions.
Fig. 1: Three steps of addition operation in CLA and PPA adders [20].
II. PARALLEL PREFIX ADDITION
Parallel prefix adders also known as carry tree adders [19]
are based on the Carry Look Ahead (CLA) adder and similar to it perform addition in three steps shown in Fig .1 [20]. In the first step the carry generate and carry propagate signals which are intermediate variables [18]. Are pre-computed from input bits (pre-computation). Second step calculates all the carry signals in parallel (prefix-network), finally in the third step, sum signals are computed from the propagate and carry signals generated in the previous stages (post-computation) [21].
The addition logic of three stages in a PPA adder follows (1) and (2) and (3) where Ai and Bi and Cin are inputs, Si is the output sum and G and P are the generate and propagate prefix signals (i and k are indexes representing bit positions) [20].
Pre-Computation : Gi:i =Ai•Bi G0:0 =Cin
Pi:i = Ai⊕Bi P0:0 =0 (1) Prefix-Network : Gi:j =Gi:k +Pi::k •Gk−1:j
Pi::j =Pi::k •Pk−1:j (2) Post-Computation :
S
i= P
i⊕ G
i:j (3)In order to configure the prefix network, the dot operator
" " defined as (4) and the semi-dot operator " " defined as (5) are used based on [22] with a little modification for the semi-dot operation.
) ,
( g
i:kp
i::k (gm:j,pm:j)=) , ( ) ,
(gi:k + pi:k •gm:j pi:k•pm:j = Gi:j Pi:j (4)
) ,
( G
i:jP
i:j Ck =(gi:j + pi:j •Gi:j)=Ci (5) In (4), the dot operator works with two pairs of bits (gi:k, pi:k) and (gm:j, pm:j) where j< k ≤ m < i [22]. A new pair of bits results from this operator named group-generate and group- propagate signals and can be again combined with another pairs of bits by another dot or semi-dot operator. Semi-dotoperator works with the pair (Gi:j, Pi:j) and Ck. This procedural use of dot and semi-dot operators configures a prefix network which finally generates the carry signals [14].
The prefix operator has two main properties included idempotency and associativity that lets evaluate the prefix operations according to a binary tree and have various prefix networks. But it does not have the commutativity property which means the order of two neighboring input operands must not be altered [23], [24]. Fig.2 (a), (b), (c), (d) shows the gate- level structure of dot cell, semi-dot cell, pre-processor cell and post-processor (sum) cell respectively. These cells are used in the circuit level structure of PPAs.
III. BRENT-KUNG ADDER
According to section II descriptions the associativity property of prefix operator lets us have various prefix tree networks for carry generation. Brent-Kung adder is one of the main prefix structures was proposed by Brent and Kung with a regular layout [13]. This adder has been proposed to solve the disadvantages of Kogge-Stone adder. Fig. 3 shows the Graph topology of a 32-bit Brent-Kung adder.
This adder has the least computational nodes among all PPAs [8] for a certain number of input bits which results the reduced area. Instead this structure has maximum logic depth [25] which yields the increase in the adder’s latency in comparison with other structures [8] so lowers the speed [26].
In fact this adder trades area for logic depth and is an excellent instance for the PPA with maximum logic depth and minimum area [26]. Furthermore Brent-Kung adder is an efficient choice for nowadays synthesis tools and is the state of the art [27].
IV. PROPOSED ADDER
Fig. 4 shows the graph structure of our proposed PPA for 32 bits which is a variation of Brent-Kung adder, the critical delay path is shown by a solid red line. Our proposed adder is the same as Brent-Kung adder in most stages except in the stage 2 and hence the stage9, in fact the modification of stage 2 results in the modification of stage9. The objective of our modification was to reduce the PDP, based on variation in PPA graph topology and for this, with ideas from the sklansky adder and specially its second stage we have increased the lateral fan- out of dot cells in the first stage up to two, these cells are in bit positions (N) follow (6). That is like the manner of stage2 in the sklansky adder.
5 8 +
= k
N ; k =1,2,3,…,
2
L−3− 1
;L = log
2n (6)This modification reduces the number of stages in the critical delay path from 8 in the 32-bit Brent-Kung adder to 7 in our proposed adder, by the way decreases the delay of critical path in comparison with the Brent-Kung adder. Of- course a little increase in power consumption is another outcome of our modification which is due to an increase in the number of dot cells so that in overall, the product of power and delay improves as acceptable. In this paper we have presented our design for 32-bit inputs but this idea can be extended for
designing larger adders with larger word sizes and brings more efficiency. Section V describes the results and outcome performances with more details.
V. SIMULATION
In this section we have simulated the Brent-Kung adder and also our proposed design from 4 to 128 bits with HSPICE and in CMOS 90nm technology. Basic cells are implemented according to the logic structure shown in Fig. 2 so that the l of transistors are chosen as the lmin of used technology equal to 90nm, and PMOS transistors widths as two times of NMOS types. In all simulations supply voltage was considered as 1.2 volt. In order to show the preferences of our proposed adder we have examined the specifications of both adders and compared them in each case. As we mentioned in section IV the Brent- Kung adder and our proposed one are the same in 4 bits and also 8bits so the results become equal and we have no performance for these bit widths, but by increasing the word size from 16 bits, differences are appeared which result performances. Following, we examine and compare both
adders in various aspects.
Table I shows the delay of both Brent-Kung and proposed PPAs by varying the word sizes of inputs from 4 to 128 Bits and represents that in the word size of 4 and 8 delay values are the same because of the equality of both adders, but in larger bit lengths, our proposed design exhibits better and has lower delay. Also, It seems in the word sizes larger than16, the higher the word size the better the delay performance which means by increasing the inputs word sizes the percentage of delay lowering, increases too. This is shown in Fig. 5 (a);
furthermore this figure shows that the slope of delay increase is smoother in our work in comparison with the Brent-Kung PPA which means our proposed PPA performs better by grading the input word size from the aspect of speed.
Also our work has some drawbacks in the cost of delay lowering. Table II lists the number of total computation nodes for each of the Brent-Kung and proposed PPAs. In any PPA adder the number of semi-dot cells is equal to the length of inputs so the number of total cells is the number of dot cells plus the number of semi-dot cells (the word size length). Our
(a) (b) (c) (d)
Fig. 2: Gate level structure of (a) Dot cell, (b) Semi-Dot cell, (c) Pre-computation cell, (d) Post-computation cell.
Fig. 3: 32-Bit Brent-Kung adder.
Table II shows, in word sizes larger than 16 bits, our proposed
PPA caused the increase in the number of dot cells so increases the number of total cells, hence results a grown area in comparison with the Brent-Kung adder. In fact this is the drawback of our design in domain of parameter trading. The chart in Fig. 6 represents this disadvantage. Beside, increasing the cells number roles as a factor of increasing the average power consumption according to Table III. But this secondary drawback is not considerable. Fig. 5 (b) shows this drawback negligibility. The important parameter for us was not only the power, but also the delay, so we focused on the product of them (PDP) as the Figure Of Merit (FOM) in order to improve the Brent-Kung adder performance while keeping other parameters in acceptable level. Table IV and Fig. 5 (c) present the PDP variation of both PPAs versus word size. As shown Fig. 4: 32-Bit Proposed PPA.
TABLE I:DELAY (PICO SECOND).
4bits 8Bits 16Bits 32Bits 64Bits 128Bits Brent-Kung 478.83 621.33 889.15 1167.43 1455.88 1755.67
Proposed 478.83 621.33 771.29 1049.40 1204.58 1380.18 Efficiency 0% 0% 13.2% 10.1% 17.2% 21.3%
TABLE II:THE NUMBER OF COMPUTATION NODES.
4bits 8Bits 16Bits 32Bits 64Bits 128Bits
Brent-Kung 5 12 27 58 121 248
Proposed 5 12 28 61 129 269
Overhead 0% 0% 3.5% 4.9% 6.2% 7.8%
(a) (b) (c) Fig. 5: The variation of (a) Delay, (b) Average power, (c) PDP of Brent-Kung PPA and proposed PPA versus word size.
the PDP of our proposed design is lower than Brent-Kung adder for word sizes larger than 8 bits and from 32 bits this performance increases by growing the word size, in other words the growth slope is smoother than Brent-Kung adder.
The important property of our proposed design is reducing the number of stages in the critical delay in comparison with the Brent-Kung PPA which seems it was the main reason of delay and hence PDP reduction in our design. Considering the solid red line in Fig. 3 and Fig. 4 that shows the critical delay path we can see this stage reduction in the 32-bit Brent-Kung and proposed adders. Also Table V presents it by varying the word size and also shows the percentage of stage reduction in each case. The maximum fan-out of the critical delay path was another attractive parameter in our design which defined as the fan-out of the computational node with maximum fan-out existed within the critical delay path. As Table VI shows, in our design this parameter is kept the same as Brent-Kung adder in 4, 8, 16, 32 bit lengths but in the 64 and 128 bits case we have an increase in the fan-out but in overall did not destroyed our design performance. Since we dealt with multiple parameters include "N " as the number of computational nodes, "D " as delay, "P " as the average power, "PDP " as the power delay product, "L " as the number of stages within the critical delay path and "F " as maximum fan-out of the critical delay path, and by considering the results presented in Table I to Table VI we have sketched the Design space of a 64-bit Brent-Kung and proposed PPA shown in Fig. 7. We can see the contribution and trade-offs between various parameters. This figure obviously shows the performances and drawbacks of our
design versus Brent-Kung adder. As shown our design has lowered the stages (L), and increased the fan-out (F) but trading these two parameters resulted in reducing the delay.
From the other hand in our proposed PPA the number of cells (N) are increased somewhat, consequently augmented the average power consumption (P) negligibly. Overall, the total PDP is reduced.
An important point is that these improvements are only for a single adder as a logical building block and since in most logic applications such as multicore processors we use thousands of these blocks, this improvement becomes considerable.
VI. CONCLUSIONS
This research presents an efficient parallel prefix adder based on the Brent-kung adder where by taking idea from the Sklansky adder and varying the graph topology, the number of stages in its critical delay path is reduced. Consequently the speed and also power delay product is improved. Simulation results confirmed that the proposed adder is faster than the Brent–Kung adder by 10.1% in 32 bits, 17.2% in 64 bits and 21.3% in 128 bits and the consequent PDP is reduced by 8.8%
in 32 bits, 15.6% in 64 bits and 19.5% in 128 bits. About the proposed adders it seems that the higher the input bits number, the higher the improvement percentage of performance. The Fig. 6: The variation of computational nodes versus word size.
TABLE III:AVERAGE POWER (MICRO WATT)
4bits 8Bits 16Bits 32Bits 64Bits 128Bits Brent-Kung 3.43 7.55 16.24 33.31 67.21 136.01
Proposed 3.43 7.55 16.42 33.78 68.53 139.28
Overhead 0% 0% 1.1% 1.4% 1.9% 2.3%
TABLE IV:PDP(E-15J)
4bits 8Bits 16Bits 32Bits 64Bits 128Bits Brent-Kung 1.64 4.69 14.44 38.88 97.85 238.78
Proposed 1.64 4.69 12.66 35.45 82.54 192.23 Efficiency 0% 0% 12.3% 8.8% 15.6% 19.5%
TABLE V:THE NUMBER OF STAGES WITHIN THE CRITICAL DELAY PATH. 4bits 8Bits 16Bits 32Bits 64Bits 128Bits
Brent-Kung 3 4 6 8 10 12
Proposed 3 4 5 7 8 9
Efficiency 0% 0% 16.6% 12.5% 20% 25%
TABLE VI:MAXIMUM FAN-OUT IN THE CRITICAL DELAY PATH
4bits 8Bits 16Bits 32Bits 64Bits 128Bits
Brent-Kung 3 4 5 6 7 8
Proposed 3 4 5 6 8 12
Overhead 0% 0% 0% 0% 12.5% 33.3%
Fig. 7: Design space of Brent-Kung PPA and proposed PPA.
proposed design has some disadvantages in area and power consumption which is negligible and does not destroy the overall performance of adder in speed and PDP.
ACKNOWLEDGMENT
The authors wish to thank the following persons for their helps: Prof. David Money Harris from Harvey S. Mudd university, CA, USA. Dr. Naser Mohammadzadeh, Dr. Ali Asghar Bagheri Soulla from Department of Electrical and Computer Engineering Shahed university, Tehran, Iran. Dr Poornima Sumithra from JSS Academy of Technical Education Bangalore, India, Dr. Mohammad Sharifkhani from Sharif university of Technology, Tehran, Iran. Miss Bahare Qorbany and Miss Hamideh Amiri. The first author specially wishes to thank her beloved parents for their supports and encouragements.
REFERENCES
[1] A. Nagamani and B. Shivanand, "Design and performance evaluation of Hybrid Prefix Adder and carry increment adder in 90nm regime," in Nanoscience, Engineering and Technology (ICONSET), 2011 International Conference on, 2011, pp. 198-201.
[2] Y. Sun, D. Zheng, M. Zhang, and S. Li, "High performance low-power sparse-tree binary adders," in Solid-State and Integrated Circuit Technology, 2006. ICSICT'06. 8th International Conference on, 2006, pp. 1649-1651.
[3] P. Celinski, S. Al-Sarawi, D. Abbott, S. Cotofana, and S. Vassiliadis,
"Logical effort based design exploration of 64-bit adders using a mixed dynamic-cmos/threshold-logic approach," in VLSI, 2004. Proceedings.
IEEE Computer society Annual Symposium on, 2004, pp. 127-132.
[4] M. Moghaddam and M. B. Ghaznavi-Ghoushchi, "A new low power- delay-product, low-area, parallel prefix adder with reduction of graph energy," in Electrical Engineering (ICEE), 2011 19th Iranian Conference on, 2011, pp. 1-6.
[5] P. Celinski, J. F. Lopez, S.F. Al-Sarawi and D. Abbott, "A family of low depth, threshold logic, carry lookahead adders," Proceedings 2nd WSEAS International Conference on Instrumentation, Measurement, Control, Circuits and Systems 2002, pp. 1981-1983, Mexico, May 2002.
[6] H. M. El-Boghdadi, "A class of almost-optimal size-independent parallel prefix circuits," Journal of Parallel and Distributed Computing, vol. 73, pp. 888-894, 2013.
[7] N. H. Weste and D. M. Harris, CMOS VLSI design: a circuits and systems perspective: Pearson Education India, 2005.
[8] P. Ramanathan and P. Vanathi, "A novel logarithmic prefix adder with minimized power delay product," Journal of Scientific & Industrial Research, vol. 69, pp. 17-20, 2010.
[9] L. P. D. Bollepalli, "Design and Implementation of Fault Tolerant Adders on Field Programmable Gate Arrays," Master of Science thesis, The University of Texas, 2012.
[10] P. Ramanathan and P. Vanathi, "Hybrid Prefix Adder Architecture for Minimizing the Power Delay Product," International Journal of Electronics, Circuits & Systems, vol. 3, 2009.
[11] J. Sklansky, "Conditional-sum addition logic," Electronic Computers, IRE Transactions on, pp. 226-231, 1960.
[12] P. M. Kogge and H. S. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence equations," Computers, IEEE Transactions on, vol. 100, pp. 786-793, 1973.
[13] R. P. Brent and H.-T. Kung, "A regular layout for parallel adders," IEEE Transaction on Computers, vol. C-31, no.3, March 1982, pp. 260-264.
[14] P. Ramanathan and P. Vanathi, "A Novel Power Delay Optimized 32-bit Parallel Prefix Adder For High Speed Computing," International Journal of Recent Trends in Engineering, vol. 2, 2009.
[15] R. Zimmermann, Binary adder architectures for cell-based VLSI and their synthesis: Citeseer, 1998.
[16] S. M. Sudhakar, K. P. Chidambaram, and E. Swartzlander, "Hybrid Han- Carlson adder," in Circuits and Systems (MWSCAS), 2012 IEEE 55th International Midwest Symposium on, 2012, pp. 818-821.
[17] S. Knowles, "A family of adders," in Computer Arithmetic, 1999.
Proceedings. 14th IEEE Symposium on, 1999, pp. 30-34.
[18] D. Harris, "A taxonomy of parallel prefix networks," in Signals, Systems and Computers, 2004. Conference Record of the Thirty-Seventh Asilomar Conference on, 2003, pp. 2213-2217.
[19] C. Cury and M. Nisanth, "Design of Parallel Prefix Adders using FPGAs," IOSR Journal of VLSI and Signal Processing, vol. 4, pp.45-51, May-Jun. 2014.
[20] K. Vitoroulis and A. J. Al-Khalili, "Performance of parallel prefix adders implemented with FPGA technology," in Circuits and Systems, 2007. NEWCAS 2007. IEEE Northeast Workshop on, 2007, pp. 498- 501.
[21] M. M. Basha, K. V. Ramanaiah, P. R. Reddy, and B. L. Reddy, "An efficient model for design of 64-bit High Speed Parallel Prefix VLSI adder," International Journal of Modern Engineering Research, vol. 3, pp. 2626-2630, Sep-Oct. 2013.
[22] G. Dimitrakopoulos and D. Nikolos, "High-speed parallel-prefix VLSI ling adders," Computers, IEEE Transactions on, vol. 54, pp. 225-231, 2005.
[23] J. Chen, "PARALLEL-PREFIX STRUCTURES FOR BINARY AND MODULO {2n− 1, 2n, 2n+ 1} ADDERS," Master of Science thesis, Oklahoma State University, 2008.
[24] J. E. Stine, Digital computer arithmetic datapath design using verilog HDL. vol. 1: Springer, 2004, pp. 49-53.
[25] A. Siliveru and M. Bharathi, "Design of Kogge-Stone and Brent-Kung adders using Degenerate Pass Transistor Logic," IJESE, vol. 1, pp. 38- 41, 2012.
[26] M. M. Ziegler and M. R. Stan, "A unified design space for regular parallel prefix adders," in Proceedings of the conference on Design, automation and test in Europe-Volume 2, 2004, p. 21386.
[27] V. C. Kurapati, Analysis of IP Based Implementations of Adders and Multipliers in Submicron and Deep Submicron Technologies: ProQuest, 2008.