JOURNAL OF SCIENCE & TECHNOLOGY * No. 95 - 2013
DESIGN AND IMPLEMENTATION OF DIVIDER AND SQUARE ROOT IN SINGLE/DOUBLE PRECISION FLOATING POINT NUMBER ON FPGA
Hoang Trang, Bui Quang Tung, Ngo Van Thuyen
Ho Chi Minh City University of Technology, VNU-HCM; University of Technical Education Ho Chi Minh City
Received May 29, 2012; accepted January 11, 2013
ABSTRACT
The details of the design and implementation of division and square root in single/double precision floating point on FPGA are presented. This study gives a design and implementation on FPGA with lower output latency and a much smaller size in ALUTs, DLRs, and ALMs than Altera's design. For the maximum frequency consideration, this study obtains a better result only in the division operation with single precision floating point. The choice between this study and Altera's design in a system should therefore be made considering the maximum frequency of the system or the FPGA size.
Keywords: division, square root, floating point, FPGA, low latency
1. INTRODUCTION
Computation with floating point numbers is used in many applications and algorithms requiring high precision, for example digital filters and image/video encoding and decoding. Floating-point arithmetic has been implemented both as software emulation and as dedicated hardware in most major microprocessors.
Meanwhile, the field-programmable gate array (FPGA), as reconfigurable hardware, has established itself as a good choice for hardware implementation both in industry applications and in education. It can combine the advantages of a custom hardware architecture with a reasonable cost in product and algorithm development.
FPGAs used for floating-point applications can suffer from poor performance due to the complex implementation of floating-point arithmetic, which consumes a large amount of resources [1]. This makes FPGAs less attractive for use in floating-point intensive applications.
Despite this poor performance of FPGAs, many studies on FPGA implementation of floating point arithmetic have been done for scientific applications [2]-[6]. In [2], double precision floating-point arithmetic including addition, subtraction, multiplication, division, and square root was presented. From this study, and also from other works in our literature review, the division and square root operations are the most complicated ones in terms of timing and area. In [3], a general algorithm for fast computation of the floating point square root was proposed. In this study, we introduce in detail the FPGA implementation of division and square root for single and double precision floating point according to the IEEE 754 standard.
The paper is organized as follows. Section II summarizes the important aspects of the floating-point format in the IEEE 754 standard. The details of the implementation of floating-point division on FPGA are shown in Section III, followed by the square root operation details in Section IV. Section V gives the results. The conclusion is finally presented in Section VI.
2. FLOATING-POINT FORMAT
Several different representations of real numbers have been proposed, but by far the most widely used is the floating-point representation. The IEEE 754 floating-point representation is composed of three fields: the Sign S, the Exponent E, and the Mantissa M.
In single precision floating-point representation, the Exponent E is 8 bits, whereas the Mantissa M is composed of the remaining 23 bits. In double precision floating-point representation, the Exponent E is 11 bits, and the Mantissa M is composed of the remaining 52 bits. In both cases, the hidden-1 representation of the Mantissa magnitude holds, effectively extending its representational power by one bit. The IEEE 754 floating-point number is represented as in Fig. 1.
Fig 1. Floating-point number format
The real value of the above two formats is calculated as follows:

N = (-1)^S × M × 2^E    (1)

The value of a single precision IEEE 754 floating-point number is typically given by the following formulas:

N = (-1)^S × (1.M) × 2^(E-127), for normal numbers    (2)
N = (-1)^S × (0.M) × 2^(-126), for denormal numbers    (3)

Similarly, the value of a double precision IEEE 754 floating-point number is typically given by the following formulas:

N = (-1)^S × (1.M) × 2^(E-1023), for normal numbers    (4)
N = (-1)^S × (0.M) × 2^(-1022), for denormal numbers    (5)

The special numbers of single/double precision floating point are shown in Table 1.

Table 1. Special numbers of single/double precision floating-point format

Condition on E                                   M         Description
0                                                0         ±Zero
0                                                Nonzero   ±Denormal number
0 < E < 255 (single), 0 < E < 2047 (double)      Any       ±Normal number
255 (single), 2047 (double)                      0         ±Infinity
255 (single), 2047 (double)                      Nonzero   NaN (Not a number)
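To make Eqs. (1)-(5) and the field widths above concrete, the following is a minimal C sketch (not part of the paper's hardware) that unpacks a single precision word into S, E and M and evaluates its value; the function and variable names are ours, chosen only for illustration.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

/* Minimal sketch (not the paper's RTL): decode an IEEE 754 single precision
 * word into S, E, M and evaluate Eqs. (2)-(3). Field widths follow Fig. 1. */
static double decode_single(uint32_t w)
{
    uint32_t s = (w >> 31) & 0x1;        /* sign bit                    */
    uint32_t e = (w >> 23) & 0xFF;       /* 8-bit exponent              */
    uint32_t m = w & 0x7FFFFF;           /* 23-bit mantissa             */
    double frac = (double)m / (1 << 23); /* M interpreted as a fraction */
    double sign = s ? -1.0 : 1.0;

    if (e == 0)                          /* denormal: (0.M) * 2^-126    */
        return sign * frac * ldexp(1.0, -126);
    if (e == 0xFF)                       /* special: infinity or NaN    */
        return (m == 0) ? sign * INFINITY : NAN;
    return sign * (1.0 + frac) * ldexp(1.0, (int)e - 127); /* (1.M) * 2^(E-127) */
}

int main(void)
{
    float x = -6.25f;
    uint32_t w;
    memcpy(&w, &x, sizeof w);            /* reinterpret the bit pattern */
    printf("%f -> %f\n", x, decode_single(w));
    return 0;
}
```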
3. THE IMPLEMENTATION OF FLOATING-POINT DIVISION
The floating point division implementation on FPGA is shown in Fig.2. The first step unpacks the two floating-point numbers, then checks for special values and returns immediately if the value is NaN, zero or infinity. As a special case, it returns infinity if B (the divisor) = 0.
Fig 2. Floating-point division block diagram
3.1 Unpacking input operands of division
The Unpacking block proposed in Fig.3 separates the three fields (sign, exponent and mantissa) of the input operands and indicates whether the input operands are denormalized or normalized numbers. In addition, this block also detects the exception cases of division shown in Table 2.
Table 2. The exception cases of division

div_opa   div_opb   div_result   Exception
NaN       Any       NaN          NaN
Any       NaN       NaN          NaN
∞         ∞         NaN          NaN
0         0         NaN          NaN
Any       0         Infinity     Divide by zero
0         Any       0            Underflow
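As a software illustration of the checks in Table 2 (a hedged sketch; the helper names below are hypothetical and not the paper's RTL signals), the following function classifies the operands of a single precision division and returns the special result when one of the cases applies.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper names; a sketch of the Table 2 checks, single precision. */
typedef struct { bool is_special; uint32_t result; } div_special_t;

static bool is_nan(uint32_t w)  { return ((w >> 23) & 0xFF) == 0xFF && (w & 0x7FFFFF) != 0; }
static bool is_inf(uint32_t w)  { return ((w >> 23) & 0xFF) == 0xFF && (w & 0x7FFFFF) == 0; }
static bool is_zero(uint32_t w) { return (w & 0x7FFFFFFF) == 0; }

static div_special_t check_div_special(uint32_t a, uint32_t b)
{
    uint32_t sign = (a ^ b) & 0x80000000u;             /* sign of the quotient           */
    const uint32_t QNAN = 0x7FC00000u, INF = 0x7F800000u;

    if (is_nan(a) || is_nan(b) || (is_inf(a) && is_inf(b)) || (is_zero(a) && is_zero(b)))
        return (div_special_t){ true, QNAN };           /* NaN cases of Table 2           */
    if (is_zero(b))
        return (div_special_t){ true, sign | INF };     /* divide by zero -> infinity     */
    if (is_zero(a) || is_inf(b))
        return (div_special_t){ true, sign | 0 };       /* result is a (signed) zero      */
    return (div_special_t){ false, 0 };                 /* no special case: divide normally */
}
```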
Fig 3. Unpacking input operands
3.2 The sign of division
The block in Fig.4 assigns the sign of the division result. The bias used throughout is 127 for single precision and 1023 for double precision.
Fig 4. The sign of division
Table 3 shows the two exception cases for which the sign assignment differs from Fig.4.
Table 3. The two exception cases of sign assignment in division
div_opa   div_opb   unpack_sign (single)   unpack_sign (double)
NaN       Any       div_opa[31]            div_opa[63]
Any       NaN       div_opb[31]            div_opb[63]
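In the normal case the quotient sign is simply the XOR of the two operand signs, while the two exception cases of Table 3 pass through the sign of the NaN operand. A minimal C sketch of this selection for single precision words (illustrative names, not the paper's signals):

```c
#include <stdint.h>
#include <stdbool.h>

/* Sketch of the sign selection of Fig.4 / Table 3 for single precision words. */
static bool is_nan32(uint32_t w) { return ((w >> 23) & 0xFF) == 0xFF && (w & 0x7FFFFF) != 0; }

static uint32_t div_sign(uint32_t div_opa, uint32_t div_opb)
{
    if (is_nan32(div_opa)) return div_opa >> 31;   /* exception: keep sign of NaN operand A */
    if (is_nan32(div_opb)) return div_opb >> 31;   /* exception: keep sign of NaN operand B */
    return (div_opa >> 31) ^ (div_opb >> 31);      /* normal case: XOR of the operand signs */
}
```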
3.3 The exponent of division
This block calculates the exponent of the division result, which is determined for three cases as in Fig.5:
(a) if unpack_expa >= unpack_expb
(b) if unpack_expb - unpack_expa < bias
(c) if unpack_expb - unpack_expa > bias
Fig 5. The exponent of division
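Before any normalization adjustment, the biased exponent of the quotient is unpack_expa - unpack_expb + bias; the three cases of Fig.5 distinguish whether this value is representable. A hedged C sketch under that assumption (the names below are illustrative, not the paper's signals):

```c
/* Sketch: biased exponent of a quotient, bias = 127 (single) or 1023 (double).
 * Returns -1 to flag exponent underflow; overflow handling is omitted. */
static int div_exponent(int unpack_expa, int unpack_expb, int bias)
{
    if (unpack_expa >= unpack_expb)             /* case (a): difference is non-negative */
        return unpack_expa - unpack_expb + bias;
    if (unpack_expb - unpack_expa < bias)       /* case (b): still representable        */
        return unpack_expa - unpack_expb + bias;
    return -1;                                  /* case (c): exponent underflow         */
}
```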
3.4 Dividing mantissa A by mantissa B
The mantissa division is implemented using an iterative method. The calculation for single and double precision floating point numbers, implemented in 26/55 clock cycles respectively, is shown in Fig.6.
Fig 6. Mantissa division flowchart
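The paper's exact datapath is the flowchart of Fig.6; as a hedged software model of one common iterative scheme, the restoring shift-subtract division below produces one quotient bit per iteration, which is consistent with the latency growing with the mantissa width:

```c
#include <stdint.h>

/* Software model of a restoring shift-subtract divider: one quotient bit per
 * iteration, 'bits' iterations (e.g. 24 plus guard/round bits for single
 * precision). 'ma' and 'mb' are the normalized mantissas with hidden bit,
 * both fixed point values in [1, 2) with the same scale. Illustrative only. */
static uint64_t mantissa_divide(uint64_t ma, uint64_t mb, int bits)
{
    uint64_t rem = ma;        /* running partial remainder           */
    uint64_t quo = 0;         /* quotient bits accumulated MSB first */

    for (int i = 0; i < bits; i++) {
        quo <<= 1;
        if (rem >= mb) {      /* divisor fits: emit a 1 and subtract */
            rem -= mb;
            quo |= 1;
        }
        rem <<= 1;            /* move on to the next quotient bit    */
    }
    return quo;               /* still needs normalization/rounding  */
}
```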
3.5 Rounding the output result of division
This block rounds the result by rounding the mantissa field. The rounding of the single precision and double precision floating-point results is described in Fig.7.
Fig 7. Rounding the output of division for single precision and double precision numbers
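Fig.7 defines the paper's own rounding datapath; the sketch below is only a generic illustration of round-to-nearest-even applied to a quotient that carries extra guard/round/sticky bits (the rounding mode is our assumption, since the paper does not state it explicitly):

```c
#include <stdint.h>

/* Hedged sketch: round-to-nearest-even of a quotient 'q' whose upper bits hold
 * the target mantissa and whose 'extra' (>= 1) lower bits are guard/sticky.
 * The paper's Fig.7 may implement a different rule; this is illustrative. */
static uint64_t round_mantissa(uint64_t q, int extra)
{
    uint64_t kept   = q >> extra;                              /* bits that survive        */
    uint64_t lsb    = kept & 1;                                /* last kept bit            */
    uint64_t guard  = (q >> (extra - 1)) & 1;                  /* first discarded bit      */
    uint64_t sticky = (q & ((1ull << (extra - 1)) - 1)) != 0;  /* OR of the remaining bits */

    if (guard && (sticky || lsb))   /* round up, with ties broken to even        */
        kept += 1;                  /* a carry out here may bump the exponent    */
    return kept;
}
```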
4. THE IMPLEMENTATION OF FLOATING-POINT SQUARE ROOT
4.1 Square root algorithm
A floating point number consists of three fields: sign, exponent, and mantissa. If the sign bit is "1", the square root is not calculated because the number is negative. The square root of a floating point number is given as below:

√(m × 2^e) = √m × 2^(e/2),         if e is even
           = √(2m) × 2^((e-1)/2),  if e is odd        (6)

Since floating point numbers store the exponent with a bias, the new exponent is calculated as e = ê + bias. Eq. (6) is rewritten as the following:

√(m × 2^(e-bias)) = √m × 2^((e-bias)/2),         if (e - bias) is even
                  = √(2m) × 2^((e-bias-1)/2),    if (e - bias) is odd        (7)

From Eq. (7), we realize that the square root of a floating point number is the product of the square root of the mantissa m and a power of two whose new biased exponent is (e + bias)/2. A pseudo code, following [3], is used to compute the square root of a floating point number as shown below.

SQRT(x)
1. e ← bits 23-30(1) / 52-62(2) of x
2. n ← bits 0-22(1) / 0-51(2) of x, extended to 25(1) / 54(2) bits by prepending "01"
3. e ← e + 1
4. if e is odd
5.    n ← n << 1
   endif
6. n ← f(n), where f is the implementation of the fixed point square root algorithm
7. e ← (e >> 1) + 63(1) / 511(2)

(1) For single precision
(2) For double precision
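A hedged C model of the SQRT(x) steps above for single precision; fixed_point_sqrt23 is an illustrative stand-in (computed here with double arithmetic) for the iterative algorithm f of Section 4.2.2, and the packing omits rounding, which Section 4.2.3 handles separately:

```c
#include <math.h>
#include <stdint.h>

/* Stand-in for f(n) of step 6: floor(sqrt(.)) of a fixed point value with
 * 23 fraction bits, computed with double arithmetic for brevity.
 * Section 4.2.2 gives the iterative hardware algorithm actually used. */
static uint32_t fixed_point_sqrt23(uint32_t n)
{
    return (uint32_t)sqrt((double)n * (double)(1u << 23));
}

/* Hedged sketch of SQRT(x) for a positive, normal single precision word x. */
static uint32_t fp_sqrt_single(uint32_t x)
{
    uint32_t e = (x >> 23) & 0xFF;              /* step 1: biased exponent         */
    uint32_t n = (x & 0x7FFFFF) | (1u << 23);   /* step 2: mantissa with hidden 1  */
    e = e + 1;                                  /* step 3                          */
    if (e & 1)                                  /* step 4: if e is odd ...         */
        n <<= 1;                                /* step 5: ... pre-shift n         */
    n = fixed_point_sqrt23(n);                  /* step 6: n <- f(n)               */
    e = (e >> 1) + 63;                          /* step 7: new biased exponent     */
    return (e << 23) | (n & 0x7FFFFF);          /* pack with sign = 0, no rounding */
}
```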
4.2 Square Root Block Diagram
From SQRT(x), we derive the hardware architecture that implements the square root of floating point numbers, as shown in Fig.8.
Fig 8. Square root hardware architecture
4.2.1 Unpacking the input operand of square root
The block in Fig.9 separates the sign, exponent, hidden bit and mantissa fields of the input operand.
Fig 9. Unpacking the input operand of square root
4.2.2 Square root
The algorithm SQRT(n), which computes the square root of a fixed point number n, is shown below [3].
SQRT(n)   [n is a fixed point number, 1 <= n < 4]
1.  s ← (n >> num_bit_of_mantissa) - 1
2.  n ← n << 2
3.  a ← 1
4.  t ← 5
5.  while the loop has run fewer than num_bit_of_mantissa times
       // num_bit_of_mantissa = 23 for single precision
       // num_bit_of_mantissa = 52 for double precision
6.     s ← (s << 2) + (n >> num_bit_of_mantissa)
7.     n ← n << 2
8.     if s < t
9.        t ← t - 1
10.       a ← a << 1
11.    else
12.       s ← s - t
13.       t ← t + 1
14.       a ← (a << 1) + 1
15.    endif
16.    t ← (t << 1) + 1
    end loop
17. s ← s << 2
    if s > t
18.    a ← a + 1
    endif
    return a
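The following C model is a hedged software check of SQRT(n) as reconstructed above, not the paper's RTL; the width of the n register is assumed to be num_bit_of_mantissa + 2 bits, as implied by the "01" extension in Section 4.1:

```c
#include <stdint.h>
#include <stdio.h>

/* Software model of the SQRT(n) pseudo code: n is a fixed point value in
 * [1, 4) with 'frac' fraction bits (frac = 23 for single, 52 for double);
 * the result a is sqrt(n) in [1, 2) with the same number of fraction bits. */
static uint64_t sqrt_fixed(uint64_t n, int frac)
{
    uint64_t mask = ((uint64_t)1 << (frac + 2)) - 1;   /* model the (frac+2)-bit n register */
    uint64_t s = (n >> frac) - 1;                      /* step 1 */
    n = (n << 2) & mask;                               /* step 2 */
    uint64_t a = 1, t = 5;                             /* steps 3-4 */

    for (int i = 0; i < frac; i++) {                   /* step 5: frac iterations        */
        s = (s << 2) + (n >> frac);                    /* step 6                         */
        n = (n << 2) & mask;                           /* step 7                         */
        if (s < t) {                                   /* step 8: next root bit is 0     */
            t -= 1;  a <<= 1;
        } else {                                       /* steps 12-14: next root bit is 1 */
            s -= t;  t += 1;  a = (a << 1) + 1;
        }
        t = (t << 1) + 1;                              /* step 16                        */
    }
    s <<= 2;                                           /* step 17: round up once when   */
    if (s > t)                                         /* the remainder is large enough */
        a += 1;
    return a;
}

int main(void)
{
    /* 2.25 in [1,4) with 23 fraction bits; expect sqrt = 1.5 exactly. */
    uint64_t n = (uint64_t)(2.25 * (1 << 23));
    printf("%.7f\n", (double)sqrt_fixed(n, 23) / (1 << 23));
    return 0;
}
```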
From SQRT(n), we derive the hardware architecture that implements the square root of the mantissas m of single/double precision numbers, as shown in Fig.10.
Fig 10. Square root of mantissa
4.2.3 Rounding the output result of square root
The architecture for rounding the square root result is shown in Fig.11. The rounding signals are selected according to the precision:

lsb    = (precision == double) ? sqrt_mantissa[3]    : sqrt_mantissa[32]
sticky = (precision == double) ? |sqrt_mantissa[1:0] : |sqrt_mantissa[30:0]
round  = (precision == double) ? sqrt_mantissa[2]    : sqrt_mantissa[31]
Fig 11. Rounding the output result of square root
5. IMPLEMENTATION RESULTS AND DISCUSSION
The division and square root designs in this study are simulated in ModelSim 6.6d and synthesized in Altera Quartus II version 11.0sp1, mapped onto Altera's Stratix III FPGA. The testbenches used to verify our work are shown in Fig. 12 - Fig. 15. Our design, presented in detail in Section III and Section IV, is implemented, verified and compared with Altera's design [7]. It is difficult to compare precisely between our study and others, except the work in [7], because few studies give implementation results, and those that do target different structures (Xilinx or Altera). However, to validate our study, the comparison between our study and others is also presented.
The comparison of resources in Adaptive Look-Up Tables (ALUTs) between our work and others [2], [8] for the double precision case is given in Table 4. The equivalence between Xilinx and Altera structures is based on [9].

Table 4. Comparison of resources in ALUTs

ALUT          Divider   Square Root
This study    1644      697
[2]           1000      790
[8]           1325      725

The comparison of frequency, output latency and resources between our design and Altera's design [7] is also shown in Table 5. In this table, ALUTs, DLRs, and ALMs stand for Adaptive Look-Up Tables, Dedicated Logic Registers, and Adaptive Logic Modules, respectively; Output Latency is the number of clock cycles needed to finish the operation. All of the circuits in our design are much smaller than in Altera's design in all of the quantities ALUTs, DLRs, and ALMs. Our design also takes a smaller output latency than Altera's design. However, there are speed versus circuit size tradeoffs in the square root operations and in the division operation of double precision floating point, but not in the division operation of single precision floating point. Because Altera's design for this topic is not public, we can only explain the differences between our work and Altera's design as follows: all of the operations in this study use an iterative method; therefore, the larger the bit width, the more iterations are needed and thus the larger the latency of each operation.
To clarify the tradeoffs between our design and Altera's in the division and square root operations, a relative comparison is presented in Table 6. From this table, one important remark is that the division operation of single precision floating point in this study is better than in Altera's design in both circuit size and speed (smaller output latency and larger maximum frequency). For the square root operations and the division operation of double precision floating point, the corresponding remarks in Table 6 hold when only the floating point operation units are taken into account. In a whole system that includes the floating point operation unit, these remarks should be reconsidered if the maximum frequency of the system is smaller than that of Altera's design in Table 5.
Table 5. Comparison of resource and frequency between this study and Altera's design
Design               Operation   Precision   Output latency   ALUTs   DLRs    ALMs    fMAX (MHz)
This study           DIV         Single      31               813     345     537     307
This study           DIV         Double      60               1644    633     1154    255
This study           SQRT        Single      25               330     179     228     305
This study           SQRT        Double      54               697     360     504     215
Altera's design [7]  DIV         Single      33               3658    3374    2510    283
Altera's design [7]  DIV         Double      61               14191   13298   10501   264
Altera's design [7]  SQRT        Single      28               526     942     536     396
Altera's design [7]  SQRT        Double      57               —       —       2311    283
Fig 12. The result of the division 32-bit operation
Fig 13. The result of the division 64-bit operation
Fig 14. The result of the square-root 32-bit operation
Fig 15. The result of the square-root 64-bit operation
Table 6. Relative comparison between this study and Altera's design
This study / Altera's design (%)                                                Remark of this study compared to Altera's design
Operation   Precision   Output latency   ALUTs   DLRs    ALMs    fMAX           Circuit size   Speed    Latency
DIV         Single      93.9%            22.2%   10.2%   21.4%   108.5%         Smaller        Faster   Shorter
DIV         Double      98.4%            11.6%   4.8%    11.0%   96.6%          Smaller        Slower   Shorter
SQRT        Single      89.3%            62.7%   19.0%   42.5%   77.0%          Smaller        Slower   Shorter
SQRT        Double      94.7%            30.2%   9.4%    21.8%   76.0%          Smaller        Slower   Shorter

6. CONCLUSION
The details of the design and implementation of division and square root in single/double precision floating point on FPGA have been presented. The implementation details in size, latency and maximum frequency are given and compared to the megafunction of Altera's design. This study gives a design and implementation on FPGA with lower output latency and a much smaller size in ALUTs, DLRs, and ALMs than Altera's design. For the maximum frequency consideration, this study obtains a better result only in the division operation with single precision floating point. The choice between this study and Altera's design in a system should therefore be made based on the maximum frequency of the system and the FPGA size.
REFERENCES
1. Yee Jern Chong, Sri Parameswaran, "Configurable multimode embedded floating-point units for FPGAs", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, pp. 1-12, 2010.
2. Paschalakis, S., Lee, P., "Double Precision Floating-Point Arithmetic on FPGAs", In Proc. 2003 2nd IEEE International Conference on Field Programmable Technology (FPT '03), Tokyo, Japan, Dec. 15-17, pp. 352-358, 2003.
3. Hain, T.F., and Mercer, D.B., "Fast Floating-Point Square Root", Proceedings of the 2005 International Conference on Algorithmic Mathematics and Computer Science, Las Vegas, Nevada, USA, pp. 33-39, 2005.
4. Guillermo Marcus, Patricia Hinojosa, Alfonso Avila, and Juan Nolazco-Flores, "A fully synthesizable single-precision, floating-point adder/subtractor and multiplier in VHDL for general and educational use", Proceedings of the 5th IEEE International Caracas Conference on Devices, Circuits and Systems, Dominican Republic, Nov. 3-5, 2004.
5. Wang, X., Nelson, B.E., "Tradeoffs of Designing Floating-Point Division and Square Root on Virtex FPGAs", Proceedings of the 11th IEEE Symposium on Field-Programmable Custom Computing Machines, 2003.
6. UCBTest suite. Online, accessed June 04, 2012: www.netlib.org/fp/ucbtest.tgz
7. Floating-Point Megafunctions User Guide. Online, accessed June 04, 2012: www.altera.com/literature/ug/ug_altfp_mfug.pdf
8. Gerhard Lienhart, Andreas Kugel, Reinhard Männer, "Using Floating-Point Arithmetic on FPGAs to Accelerate Scientific N-Body Simulations", Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'02), 2002.
9. Online: http://www.altera.com/cgi-bin/device_compare.pl

Author's address: Ngo Van Thuyen - Email: [email protected]
University of Technical Education Ho Chi Minh City, 01 Vo Van Ngan Str., Thu Duc Dist., Ho Chi Minh City