Implementation of Floating Point Multiplier Using Dadda Algorithm
1Karthik.S, 2Sunilkumar B.S
1,2NCET, Bangalore
Email: 1[email protected]
Abstract: Floating point multiplication is among the most useful operations in computation applications such as arithmetic operations and DSP applications. To achieve higher speed, the mantissa multiplication is done using a Dadda multiplier, which works on the basis of the Dadda algorithm. This design achieves high speed, with a maximum frequency of 526 MHz, and also reduces the number of gates compared to the existing multiplier; the floating point format lets us handle the overflow and underflow cases. The multiplier is implemented in Verilog HDL and targeted at a Xilinx Virtex-5 FPGA. It is compared with the Xilinx floating point multiplier core.
Keywords: Dadda algorithm; floating point; multiplication; single precision; FPGA; Verilog HDL.
I. INTRODUCTION
Most DSP applications, FIR filters, microprocessors, etc. need floating point multiplication. Floating point is one of the possible ways to represent real numbers.
Floating point arithmetic is useful in applications where a large dynamic range is required. A floating point multiplier helps designers perform floating point multiplication on an FPGA using numbers represented in IEEE 754 format. IEEE 754 defines two standard formats: the binary interchange format and the decimal interchange format. These formats are further classified into half precision, single precision, double precision and quad precision.
The single precision normalized binary interchange IEEE 754 format is implemented in this design; its representation is shown in Figure 1. Starting from the MSB it has a one bit sign (S), an eight bit biased exponent (E), and a twenty-three bit fraction (M, or mantissa). An extra (hidden) bit is added to the fractional part to form the significand. If the biased exponent is greater than 0 and smaller than 255, and there is a 1 in the MSB of the significand, the number is said to be normalized. A normalized floating point number represents the real number given by equations (1) and (2).
Z = (-1)^S * 2^(E - Bias) * (1.M)    --- (1)
where M = n22*2^-1 + n21*2^-2 + n20*2^-3 + ... + n1*2^-22 + n0*2^-23, and Bias = 127 for single precision.
Value = (-1)^sign * 2^(exponent - 127) * (1.Mantissa)    --- (2)
Sign | Biased exponent | Mantissa
1 bit | 8 bit | 23 bit
Figure 1: IEEE single precision floating point format
Floating point multiplication of two numbers is performed in three parts. The first part is the sign bit, which is determined by an exclusive OR of the two input signs. The second part is the exponent, which is calculated by adding the two input exponents directly; the extra bias must then be subtracted. The third part is the significand (1.Mantissa), which is determined by multiplying the two input significands, each with a "1" concatenated to it; that "1" is the hidden bit. The main applications of floating point are in the fields of medical image processing, biometrics, motion capture and audio applications, including broadcast, conferencing, musical instruments and professional audio.
An interesting area is the implementation of floating-point multipliers on FPGAs. FPGA stands for Field Programmable Gate Array. It is a semiconductor device containing programmable logic components and programmable interconnects. The programmable logic components can be programmed to duplicate the functionality of basic logic gates such as AND, OR, XOR and NOT, or of more complex combinational functions that decode to simple mathematical functions.
In most FPGAs, these programmable logic components also include memory elements, which may be simple flip-flops or more complete blocks of memory. A hierarchy of programmable interconnects allows the logic blocks of an FPGA to be interconnected as needed by the system designer. These logic blocks and interconnects can be programmed after the manufacturing process by the customer/designer, so that the FPGA can perform whatever logical function is needed.
FPGAs are generally slower than their application specific integrated circuit (ASIC) counterparts, cannot handle as complex a design, and draw more power. However, they have several advantages, such as a shorter time to market, the ability to be re-programmed in the field to fix bugs, and lower non-recurring engineering costs.
The development of these designs is made on regular FPGAs and then migrated into a fixed version that more resembles an ASIC. Complex programmable logic devices, or CPLDs, are another alternative.
II. FLOATING POINT MULTIPLIER ALGORITHM
The algorithm multiplies two floating point numbers in the following steps (a behavioral sketch follows the list):
1. Multiply the significands, i.e. (1.M1 * 1.M2).
2. Place the decimal point in the result.
3. Add the biased exponents, i.e. (E1 + E2 - Bias).
4. Obtain the sign bit, i.e. S1 xor S2.
5. Normalize the result, i.e. obtain a 1 at the MSB of the result's significand.
6. Round the result to fit the available mantissa bits.
7. Check for underflow/overflow occurrence.
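A minimal behavioral Verilog sketch of these steps is given below. It is illustrative only, not the paper's actual design: it assumes truncation rounding as in step 6 and omits the underflow/overflow and special-case handling of step 7 (described in Section IV). All module and signal names are our own.

    module fp_mul_sketch (
        input  [31:0] a,
        input  [31:0] b,
        output [31:0] result
    );
        // unpack sign, biased exponent and significand (hidden bit prepended)
        wire        sign_a = a[31];
        wire        sign_b = b[31];
        wire [7:0]  exp_a  = a[30:23];
        wire [7:0]  exp_b  = b[30:23];
        wire [23:0] sig_a  = {1'b1, a[22:0]};
        wire [23:0] sig_b  = {1'b1, b[22:0]};

        wire        sign_r = sign_a ^ sign_b;   // step 4: S1 xor S2
        wire [47:0] prod   = sig_a * sig_b;     // step 1: 1.M1 * 1.M2
        wire        shift  = prod[47];          // step 5: leading 1 at bit 47?
        // step 3 (and step 5): E1 + E2 - Bias, plus 1 if a right shift was needed
        wire [7:0]  exp_r  = exp_a + exp_b - 8'd127 + shift;
        // step 6: truncation rounding to 23 fraction bits
        wire [22:0] frac_r = shift ? prod[46:24] : prod[45:23];

        assign result = {sign_r, exp_r, frac_r};
    endmodule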
III. IMPLEMENTATION OF SINGLE PRECISION FLOATING POINT MULTIPLIER
Consider the following two IEEE 754 single precision floating point numbers (a reduced 5-bit mantissa is used for illustration, per step 6 below).
a. Converting decimal to binary floating point:
A = 5.5 = 0 10000001 01100
B = -35 = 1 10000100 00011
The multiplication to be performed is A * B.
1. Significand multiplication: 1.01100 * 1.00011 = 01.1000000100.
2. Normalizing the result if necessary: 1.1000000100 (no shift is needed here, since the leading one already sits just left of the radix point).
3. Adding the two biased exponents:
10000001 + 10000100 = 100000101
The sum of the two biased exponents is not the true exponent; it is obtained by subtracting the bias value, i.e. 127, as the following equations show:
EA = EA-true + bias
EB = EB-true + bias
EA + EB = EA-true + EB-true + 2 * bias
Therefore Eresult = EA + EB - bias.
Since the bias is included twice in the sum, it has to be subtracted once from the result:
100000101 - 001111111 = 10000110
4. The sign bit of the result is extracted by XORing the sign bits of the two numbers: 0 xor 1 = 1. The intermediate result is:
1 10000110 01.1000000100
5. Then normalize the result so that there is a 1 just before the radix point (decimal point). Moving the radix point one place to the left increments the exponent by 1; moving it one place to the right decrements the exponent by 1.
6. If the mantissa has more than 5 bits (the available mantissa bits), rounding is needed. Applying the truncation rounding mode, the stored value is:
1 10000110 10000
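As a quick sanity check (ours, not part of the original example), decoding this stored value with equation (2) gives:

    (-1)^1 * 2^(134 - 127) * (1.10000)_2 = -(2^7) * 1.5 = -192

which approximates the exact product 5.5 * (-35) = -192.5; the 0.5 discrepancy is the error introduced by the truncation in step 6.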
Figure 2 shows the block diagram of the multiplier structure: exponent calculation, mantissa multiplication and sign bit calculation, all carried out in parallel.
Figure 2: floating point multiplier block diagram
1. MAIN BLOCKS OF FLOATING POINT MULTIPLIER
A. Sign bit calculation
The sign bit is calculated using an XOR gate: if the sign bits of the two floating point numbers differ, the result is negative; if the two numbers have the same sign, the result is positive.
B. Exponent Adder
In this sub-block the exponents of the two floating point numbers are added, and the bias (127) is subtracted from the result to obtain the true result, i.e. EA + EB - bias. In this design the addition is done on two 8-bit biased exponents. Most of the computation time is spent in the significand multiplication (multiplying 24 bits by 24 bits), so a very fast exponent addition is not necessary: we need a fast significand multiplier and only a moderate exponent adder.
To perform the addition of the two 8-bit exponents, an 8-bit ripple carry adder (RCA) is used. As shown in Figure 3, the ripple carry adder consists of one half adder (HA), to which the two LSBs are fed, followed by full adders (FA), to which two input bits and the previous carry are given. The HA has two inputs and two outputs; the FA has three inputs and two outputs. The carry out of each adder is fed to the next full adder, and the process continues through all 8 bits.
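A structural Verilog sketch of this adder (one half adder for the LSBs, full adders rippling the carry through the remaining bits; module and signal names are ours):

    module rca8 (
        input  [7:0] a, b,
        output [7:0] sum,
        output       cout
    );
        wire [7:0] c;   // carry chain

        // half adder for the least significant bits
        assign sum[0] = a[0] ^ b[0];
        assign c[0]   = a[0] & b[0];

        // full adders ripple the carry through the remaining bits
        genvar i;
        generate
            for (i = 1; i < 8; i = i + 1) begin : fa_chain
                assign sum[i] = a[i] ^ b[i] ^ c[i-1];
                assign c[i]   = (a[i] & b[i]) | (c[i-1] & (a[i] ^ b[i]));
            end
        endgenerate

        assign cout = c[7];
    endmodule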
Figure 3: ripple carry adder
i. One Subtractor (OS):
A single-bit subtractor is used for subtracting the bias in the exponent addition. A normal subtractor has three inputs, the minuend (X), the subtrahend (Y) and the borrow in (Bi), and two outputs, the difference (D) and the borrow out (Bo).
The subtractor logic can be optimized when one of its inputs is a constant value, as in our case, where the bias constant is 127 (decimal) = 001111111 (binary). Table I shows the truth table for a 1-bit subtractor with the input Y equal to 1, which we call a "one subtractor" (OS).
Figure 4: ripple carry adder
Table I: 1-bit subtractor with input Y = 1
X  Bi | D  Bo
0  0  | 1  1
0  1  | 0  1
1  0  | 0  0
1  1  | 1  1
where D = not(X xor Bi) and Bo = (not X) or Bi.
Figure 5: 1-bit subtractor with the input Y = 1
ii. Zero Subtractor (ZS):
A 1-bit subtractor with the input Y equal to 0 is called a "zero subtractor" (ZS). Table II shows its truth table:
X  Bi | D  Bo
0  0  | 0  0
0  1  | 1  1
1  0  | 1  0
1  1  | 0  0
where D = X xor Bi and Bo = (not X) and Bi.
Figure 6: 1-bit subtractor with input Y = 0
Figure 7: ripple borrow subtractor
Figure 7 shows the bias subtraction as a chain of seven one subtractors (one for each 1 bit of the bias constant), with zero subtractors at the remaining bit positions. If an underflow occurs, then Eresult < 0 and the number falls outside the range of normalized numbers; in this case the output is forced to 0 and an underflow flag is generated.
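A structural Verilog sketch of this chain for the 9-bit exponent sum and the bias constant 001111111 (bits 0 to 6 are 1, so they use OS cells; bits 7 and 8 are 0, so they use ZS cells). Module and signal names are ours, not the paper's:

    module bias_subtract (
        input  [8:0] e_sum,      // EA + EB (9 bits, including carry out)
        output [8:0] e_result,   // e_sum - 127
        output       borrow      // set on underflow (e_sum < 127)
    );
        wire [8:0] b;            // borrow chain

        // bit 0: one subtractor with Bi = 0 (D = not X, Bo = not X)
        assign e_result[0] = ~e_sum[0];
        assign b[0]        = ~e_sum[0];

        // bits 1..6: one subtractors (D = not(X xor Bi), Bo = (not X) or Bi)
        genvar i;
        generate
            for (i = 1; i < 7; i = i + 1) begin : os_chain
                assign e_result[i] = ~(e_sum[i] ^ b[i-1]);
                assign b[i]        = ~e_sum[i] | b[i-1];
            end
        endgenerate

        // bits 7..8: zero subtractors (D = X xor Bi, Bo = (not X) and Bi)
        assign e_result[7] = e_sum[7] ^ b[6];
        assign b[7]        = ~e_sum[7] & b[6];
        assign e_result[8] = e_sum[8] ^ b[7];
        assign b[8]        = ~e_sum[8] & b[7];

        assign borrow = b[8];
    endmodule

For the worked example above, e_sum = 100000101 yields e_result = 010000110 (i.e. 10000110) with borrow = 0, matching the subtraction in Section III.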
C. Significand Multiplication using Unsigned Multiplier
Proposed multiplier: Dadda multiplier
The name Dadda comes from the scientist Luigi Dadda, who proposed a sequence of predetermined matrix heights to obtain the minimum number of reduction stages for a given partial product matrix. To reduce the N by N partial product matrix, the Dadda multiplier develops a sequence of matrix heights found by working back from the final two-row partial product matrix. In order to realize the minimum number of reduction stages, the height of each intermediate partial product matrix is limited to the least integer that is not more than 1.5 times the height of its successor.
The reduction stages for the Dadda multiplier are built using the following recursive algorithm:
1. Let d1 = 2 and dj+1 = floor(1.5 * dj), where dj is the matrix height for the jth stage from the end. Find the smallest j such that at least one column of the original partial product matrix has more than dj elements.
2. From the end of the jth stage, employ (3, 2) and (2, 2) counters to obtain a reduced matrix with no more than dj elements in any column.
3. Let j = j - 1 and repeat step 2 until a matrix with only two rows is generated.
Because the matrix is reduced stage by stage while attempting to compress each column, this is called a column compression technique. Another advantage of the Dadda multiplier is that it utilizes the minimum number of (3, 2) counters. The intermediate matrix heights are therefore set by the lower bounds 2, 3, 4, 6, 9, ...
For Dadda multipliers there are N^2 bit elements in the original partial product matrix and 4N - 3 bit elements in the final two-row matrix. Since each (3, 2) counter (a full adder) takes three inputs and produces two outputs, the number of bits in the matrix is reduced by one with each applied (3, 2) counter; therefore the total number of (3, 2) counters is N^2 - 4N + 3, and the length of the final carry propagate adder is CPA length = 2N - 2. The number of (2, 2) counters (half adders) equals N - 1.
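As a check of these formulas for the 8 by 8 case used below, with N = 8:

    (3, 2) counters: N^2 - 4N + 3 = 64 - 32 + 3 = 35
    (2, 2) counters: N - 1 = 7
    CPA length:      2N - 2 = 14

These match the counts quoted for the dot diagram of Figure 8.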
Figure 8: dot diagram for 8 by 8 Dadda multiplier
The 8 by 8 multiplier takes 4 reduction stages, with matrix heights of 6, 4, 3 and 2. These stages use 35 (3, 2) counters, 7 (2, 2) counters and a 14-bit carry propagate adder. The total delay in generating the final product is the sum of the counter delays of the stages plus the delay through the final 14-bit carry propagate adder. In the full 24 by 24 significand multiplication, the decimal point is placed between bits 45 and 46 of the resultant. The critical path determines the time taken by the Dadda multiplier: it starts at the AND gate of the first partial products, passes through the full adders of each stage, and then through the merging adders of the final carry propagate addition.
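The (3, 2) and (2, 2) counters used in these reduction stages are ordinary full adder and half adder cells; a minimal Verilog sketch (module names are ours):

    // (3,2) counter: a full adder compressing three bits of one column
    // into a sum bit plus a carry into the next column
    module counter_3_2 (input x, y, z, output sum, carry);
        assign sum   = x ^ y ^ z;
        assign carry = (x & y) | (y & z) | (x & z);
    endmodule

    // (2,2) counter: a half adder compressing two bits of one column
    module counter_2_2 (input x, y, output sum, carry);
        assign sum   = x ^ y;
        assign carry = x & y;
    endmodule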
D. Normalizing Unit:
Normalization is done as follows. No shift is needed if the intermediate product is already a normalized number, i.e. when the leading one is at bit 46 (immediately to the left of the decimal point). If the leading one is at bit 47, the result is shifted one place to the right and the exponent is incremented by 1. This shift is performed by combinational shift logic.
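A combinational Verilog sketch of this unit, assuming a 48-bit intermediate product and a 9-bit biased exponent (module and signal names are ours, not the paper's):

    module normalize (
        input  [47:0] prod,      // intermediate significand product
        input  [8:0]  exp_in,    // biased exponent before normalization
        output [8:0]  exp_out,
        output [22:0] frac
    );
        // no shift if the leading one is already at bit 46;
        // shift right once and increment the exponent if it is at bit 47
        wire        shift  = prod[47];
        wire [47:0] norm_p = shift ? (prod >> 1) : prod;
        assign exp_out = exp_in + shift;
        assign frac    = norm_p[45:23];   // 23 fraction bits after the radix point
    endmodule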
IV. OVERFLOW AND UNDERFLOW DETERMINATION
In the exceptions module, all of the special cases of floating point representation are checked; if one is found, the appropriate output is created, and the individual output signals underflow, overflow, inexact, exception and invalid are asserted when the conditions for each case exist.
Underflow and overflow mean that the resultant exponent is too small or too large, respectively, to be represented in the exponent field. The resultant exponent must fit in 8 bits and lie between 1 and 254; otherwise the value cannot be represented as a normalized number in the IEEE floating point standard format.
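A sketch of this range check, assuming the result exponent is kept in a signed intermediate wider than 8 bits before packing (module and signal names are ours):

    module range_check (
        input  signed [9:0] exp_sum,   // EA + EB - 127, kept wider than 8 bits
        output              overflow,
        output              underflow
    );
        // a normalized IEEE 754 single needs a biased exponent in [1, 254]
        assign overflow  = (exp_sum > 10'sd254);
        assign underflow = (exp_sum < 10'sd1);
    endmodule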
V. IMPLEMENTATION AND TESTING
The entire multiplier (top unit) is tested against the Xilinx floating point multiplier core. The Xilinx core was customized to have two flags indicating the overflow and underflow cases and to have a maximum latency of three cycles.
A stimulus program was written and applied to the implemented floating point multiplier and to the Xilinx core, and the results were compared. The floating point multiplier code was also checked using Xilinx 14.5. The design was synthesized using the Xilinx ISE synthesis tool and targeted at a Virtex-5 FPGA (xc5vlx20t-2ff323).
VI. SIMULATION RESULTS:
The simulation is done using Xilinx 14.5, considering random floating point significands.
Inputs: a = 00000000000000000100000
b = 00000000000000000111000 ;
Output: result =00000000000111000000000;
Here the input 'a' is negative and input 'b' is positive, so the output result will be negative, whose sign bit is '1'.
Table 3: Comparison between Dadda multiplier and Xilinx core
Figure 9: significand multiplication simulation
VII. CONCLUSION AND FUTURE WORK
This paper describes the implementation of a single precision floating point multiplier in the binary interchange format that supports IEEE 754. The significand multiplication is done by the Dadda algorithm, through which the required time and area are reduced and the maximum frequency, and therefore the speed, is increased. The design has been implemented on a Xilinx Virtex-5 FPGA and achieved a speed of 526 MHz.
REFERENCES
[1] IEEE 754-2008, IEEE Standard for Floating-Point Arithmetic, 2008.
[2] M. Al-Ashrafy, A. Salem and W. Anis, "An Efficient Implementation of Floating Point Multiplier," 978-1-4577-0069-9/11 © 2011 IEEE, Mentor Graphics.
[3] B. Fagin and C. Renard, "Field Programmable Gate Arrays and Floating Point Arithmetic," IEEE Transactions on VLSI, vol. 2, no. 3, pp. 365-367, 1994.
[4] N. Shirazi, A. Walters, and P. Athanas, "Quantitative Analysis of Floating Point Arithmetic on FPGA Based Custom Computing Machines," Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'95), pp. 155-162, 1995.
[5] L. Louca, T. A. Cook, and W. H. Johnson, "Implementation of IEEE Single Precision Floating Point Addition and Multiplication on FPGAs," Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'96), pp. 107-116, 1996.
[6] A. Jaenicke and W. Luk, "Parameterized Floating-Point Arithmetic on FPGAs," Proc. of IEEE ICASSP, vol. 2, pp. 897-900, 2001.
[7] W. J. Townsend and E. E. Swartzlander, Jr., "A Comparison of Dadda and Wallace Multiplier Delays," Computer Engineering Research Center, The University of Texas.
[8] B. Lee and N. Burgess, “Parameterisable Floating-point Operations on FPGA,” Conference Record of the Thirty-Sixth Asilomar Conference on Signals, Systems, and Computers, 2002.
[9] Xilinx, "Synthesis and Simulation Design Guide," UG626 (v13.4), January 19, 2012.