PDF Low Complexity Distributed Arithmetic Based Pipelined VLSI ... - Ernet

To alleviate this problem, we used a single DA-based LMS adaptive filter with a new coefficient transfer scheme. This is based on the fact that when radix size becomes equal to the word length of decisions, the implementation of DA-based LMS adaptive filter can be made SA-less.

Introduction

Applications of ADF

System Identification
Channel Equalization
Noise Cancellation

If the ADF output y(n) approaches yu(n), then the ADF can model part of the unknown system. The desired primary input signal consists of the original speaker signal d̂(n) and the interfering noise component η(n).
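The system identification idea can be illustrated with a minimal sketch of a standard LMS filter. It is not the thesis architecture: the function name `lms_identify`, the step size, the noise-free setting and the uniform random input are all assumptions made for illustration.

```python
import random


def lms_identify(unknown, num_samples=2000, mu=0.1, seed=0):
    """Adapt FIR coefficients w so that the ADF output tracks the unknown system."""
    rng = random.Random(seed)
    n = len(unknown)
    w = [0.0] * n        # adaptive filter coefficients
    taps = [0.0] * n     # tapped delay line, newest sample first
    for _ in range(num_samples):
        taps = [rng.uniform(-1.0, 1.0)] + taps[:-1]
        yu = sum(h * x for h, x in zip(unknown, taps))   # unknown system output yu(n)
        y = sum(wi * x for wi, x in zip(w, taps))        # ADF output y(n)
        e = yu - y                                       # error signal
        w = [wi + mu * e * x for wi, x in zip(w, taps)]  # LMS coefficient update
    return w
```

Calling `lms_identify([0.5, -0.3, 0.2, 0.1])` should return coefficients close to the hypothetical unknown system, since there is no measurement noise in this sketch.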

Fig. 1.2: Block diagram of system identification configuration of an ADF.

LMS Algorithm

CLMS Algorithm

It is clear from the above discussion that the LMS algorithm either provides fast convergence rate or low steady-state error based on the choice of step size. The coefficients of ADF0 and ADF1 are updated based on the LMS criterion, according to

BLMS Algorithm

Complexity Issues and Related Research in LMS based Systems

Thus, there is a need to reduce the critical path delay to meet the desired data rate specifications. It can be noted that the location of matching delays (mD) can be changed to achieve the desired critical path.

Fig. 1.9: Block diagram of ADF based on DLMS algorithm.

Distributed Arithmetic

Note that the least significant bit (LSB) of the filter coefficients always forms the address lines to the LUT. To increase system speed, groups of coefficient bits can be used as address lines for LUTs.
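As a rough illustration of the bit-serial DA principle, the sketch below precomputes a LUT holding all partial sums of the input samples and addresses it with one bit slice of the coefficients per cycle, LSB first. The function names, the integer two's complement word format and the single undecomposed LUT are assumptions for illustration, not the architecture of Fig. 1.11.

```python
def build_lut(x):
    """LUT entry at address a is the sum of x[i] over every set bit i of a."""
    n = len(x)
    return [sum(x[i] for i in range(n) if (addr >> i) & 1)
            for addr in range(1 << n)]


def da_inner_product(w, x, wordlen):
    """Compute sum(w[i] * x[i]) bit-serially; w holds wordlen-bit two's complement integers."""
    lut = build_lut(x)
    mask = (1 << wordlen) - 1
    acc = 0
    for k in range(wordlen):                 # one clock cycle per coefficient bit
        addr = 0
        for i, wi in enumerate(w):
            addr |= (((wi & mask) >> k) & 1) << i   # k-th bit slice forms the address
        term = lut[addr] << k                # shift by the bit weight
        # The sign bit carries weight -2^(wordlen-1) in two's complement.
        acc += -term if k == wordlen - 1 else term
    return acc
```

For example, `da_inner_product([3, -2, 5, -7], [10, -4, 7, 1], 4)` takes four shift-accumulate cycles and needs no multiplier.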

Fig. 1.11: DA architecture for inner product computation of w = [w0, w1, w2, w3] and x = [x0, x1, x2, x3] with LUT decomposition.

DA based FIR Filter

TC-DA based FIR Filter

According to (1.31), it is clear that the number of clock cycles for the inner-product calculation is reduced by a factor γ. These are nothing more than the partial filter products of the input samples, which are shifted and accumulated over W clock cycles.
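The cycle reduction can be sketched for the radix-4 case, where two coefficient bits are consumed per clock cycle so that a W-bit word needs W/2 cycles. This is a hedged illustration only: the function name, the use of two lookups in one shared LUT per cycle and the integer word format are assumptions, not the structure defined by (1.31).

```python
def radix4_da_inner_product(w, x, wordlen):
    """Return (sum(w[i] * x[i]), cycles), consuming two coefficient bits per cycle."""
    assert wordlen % 2 == 0, "this sketch assumes an even word length"
    n = len(x)
    mask = (1 << wordlen) - 1
    lut = [sum(x[i] for i in range(n) if (a >> i) & 1) for a in range(1 << n)]
    acc = 0
    cycles = 0
    for rho in range(wordlen // 2):
        lo_addr = hi_addr = 0
        for i, wi in enumerate(w):
            pair = (wi & mask) >> (2 * rho)
            lo_addr |= (pair & 1) << i          # bit 2*rho of each coefficient
            hi_addr |= ((pair >> 1) & 1) << i   # bit 2*rho + 1 of each coefficient
        hi_term = 2 * lut[hi_addr]
        if rho == wordlen // 2 - 1:
            hi_term = -hi_term                  # the top bit is the two's complement sign bit
        acc += (lut[lo_addr] + hi_term) << (2 * rho)
        cycles += 1
    return acc, cycles
```

With `wordlen = 4` the loop runs twice instead of four times, at the cost of forming a wider partial product in each cycle.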

Fig. 1.13: TC-DA architecture of a 4th order FIR filter.

OBC-DA based FIR Filter

For the given set of input samples x̃(n−i), the term dk(n) can take 2^N combinations of the input samples, of which only one combination is selected at a time.
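A small sketch may help show why offset binary coding halves the LUT. Each coefficient bit is mapped to a digit of +1 or −1, so the 2^N signed sums of the input samples come in pairs that differ only in sign, and only 2^(N−1) of them need to be stored. The function name, the integer word format and the addressing convention (first coefficient bit acts as the sign selector) are assumptions for illustration.

```python
def obc_da_inner_product(w, x, wordlen):
    """Compute sum(w[i] * x[i]) with a half-size OBC LUT of signed sums of x."""
    n = len(x)
    mask = (1 << wordlen) - 1
    # Half LUT: x[0] always enters with +1; address bit i-1 selects +x[i] or -x[i].
    lut = [x[0] + sum(x[i] if (a >> (i - 1)) & 1 else -x[i] for i in range(1, n))
           for a in range(1 << (n - 1))]
    acc = 0
    for k in range(wordlen):
        bits = [((wi & mask) >> k) & 1 for wi in w]
        addr = 0
        for i in range(1, n):
            # Complement the address when bits[0] is 0, then negate the LUT output.
            addr |= (bits[i] ^ bits[0] ^ 1) << (i - 1)
        term = lut[addr] if bits[0] else -lut[addr]
        if k == wordlen - 1:
            term = -term            # sign-bit digit has inverted polarity
        acc += term << k
    acc -= sum(x)                   # constant offset term of the OBC expansion
    return acc // 2                 # acc equals twice the inner product
```

For a 4-tap example the LUT holds 8 words instead of 16, with the sign handled outside the table.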

Literature on DA based Implementations

Problem Formulation

Due to the limitation of standalone LMS ADF in various applications, a combination of LMS ADFs in parallel or in series or in block is usually used to achieve better performance. This motivates us to exploit possible advantages of DA in the realization of LMS ADFs with low complexity.

Organization of the Thesis

In Chapter 2, the complexity of LMS ADF is optimized using OBC, but it cannot be used to improve the convergence performance due to the presence of non-OBC terms. In Chapter 4, a low-complexity architecture of ADFE is presented for the channel equalization problem in 5G communication systems.

Introduction

Mathematical Formulation

It is clear from (2.3) that the filter coefficients require a single clock cycle for the update. It is worth noting that the number of clock cycles to update the filter coefficients is m + 1.

Proposed Scheme

Optimal LUT and SA Architectures

Design and Analysis of OBC-LUT Architectures
High Radix TC and OBC based LUTs
Architecture of SA Unit

The corresponding proposed radix-4 partial product generator (PPG) of TC and OBC is shown in Fig. The pipeline structure of the proposed MMP-CSFA based SA unit (PSAR) is shown in Fig.

Table 2.1: Truth Table for TC and OBC Radix-4 Partial Product Generators (inputs b i,2ρ+1 and b i,2ρ)

Architecture for Small Order Filter

Complexity Considerations of Coefficient Update Unit

The convergence performance of the proposed designs after removing non-OBC terms is shown in Fig. Second, the bit slices of coefficient increment terms are obtained before the filter coefficients are updated, as in (2.12).

Fig. 2.8: Circuit schematic of 4th order OBC-DA based LMS ADF.

Architecture for Large Order Filter

It consists of four 4th order DA base units (as shown in Fig. 2.8) whose outputs are added by two separate binary adder trees, corresponding to four (B + 1)-bit sum and carry words. For each kem+1, initial clock cycles of DA base unit γj(l) are calculated by activating the corresponding registers in CURA using Ej = 1 and fed to ISR unit.

Fig. 2.10: Circuit schematic of 16th order OBC-DA based LMS ADF with an ISR unit and a controller.

Performance Comparison

Computational Complexities

Hardware Complexities
Time Complexities

Like DA3-ADF and DA4-ADF, the CP0 of the proposed designs includes a computational delay of OBC-LUT. Interestingly, the proposed designs require a single clock cycle to produce the output like DA3-ADF and DA4-ADF, except for the proposed Structure-I.

Implementation Results

ASIC Synthesis
FPGA Synthesis

It is quite clear that the throughputs of the DA0-ADF, DA1-ADF and DA2-ADF models are significantly lower than those of the other models. The physical LUTs of the DA0-ADF, DA1-ADF and DA2-ADF models are mapped on the FPGA using the HDL primitive [80].

Conclusion

The RLS ADF offers a fast initial rate of convergence and a low steady-state error compared to the LMS ADF. LMS ADF with variable step size is a potential solution that achieves both fast initial convergence and low steady-state error [13].

Fig. 2.11: Comparison of adders complexity for the presented and existing DA based designs.

Mathematical Formulation

Thus, the coefficient update equation of pipeline CLMS ADF with two adaptation delays can be expressed as . Therefore, it can be concluded that the separate coefficient adjustment of ADF1 is not required.

Proposed Scheme

Architecture for Small Order Filter

The second adjustment delay is placed after ec(n−1) calculation to obtain ec(n−2) from ec(n−1), as shown in Fig. The internal scheme of the proposed CCU includes an equality check, a counter with enable (E) and clear (CLR) inputs and a comparator, as shown in Fig.

Architecture for Large Order Filter

Simulation study indicates that the time windows of undesired correlations are always shorter than the time window of the desired correlation (in steady state). It should be noted that the counter must be cleared for each new count using the CLR signal.

Fig. 3.6: Circuit schematic of 16th order LMS ADF based on TC-DA.

Determination of Pre-defined Time Window

To derive the predefined time window ζ and additional mean steady-state error performance ∆ξd2, three points are considered: 'A', 'B' and 'C', as shown in Fig. This is because the number of iterations would be increased to achieve lower steady-state errors.

Fig. 3.7: Assumed MSE curves for the presented design to determine ζ.

Performance Comparison

Computational Complexities

Convergence Performance

For example, the proposed filter used step size µ0 = 1/N in the initial adjustment period and step size µ1 = 2^(−p)/N in the steady state. According to the simulation results, the improvements in steady-state error performance for the proposed filter can be as high as 6 dB compared to DA3-ADF and DA4-ADF.
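The effect of using two step sizes can be sketched with a plain LMS loop that starts with µ0 = 1/N and later switches to µ1 = 2^(−p)/N. This is only an illustration under assumptions: the switch here happens at a fixed, hypothetical iteration `switch_at`, whereas the thesis decides the switch with its correlation-based control unit, and the ±1 input, noise level and function name are also invented for the example.

```python
import random


def two_step_lms(unknown, p=3, switch_at=500, num_samples=4000,
                 noise_std=0.05, seed=1):
    """LMS system identification with a large initial and a small final step size."""
    rng = random.Random(seed)
    n = len(unknown)
    mu0 = 1.0 / n               # initial adjustment period
    mu1 = 2.0 ** (-p) / n       # steady state
    w = [0.0] * n
    taps = [0.0] * n
    sq_err = []
    for t in range(num_samples):
        taps = [rng.choice((-1.0, 1.0))] + taps[:-1]
        d = sum(h * x for h, x in zip(unknown, taps)) + rng.gauss(0.0, noise_std)
        y = sum(wi * x for wi, x in zip(w, taps))
        e = d - y
        mu = mu0 if t < switch_at else mu1   # hypothetical fixed switching point
        w = [wi + mu * e * x for wi, x in zip(w, taps)]
        sq_err.append(e * e)
    return w, sq_err
```

Comparing the average of the last squared errors for `switch_at=500` against a run that never switches (`switch_at=num_samples`) gives a feel for the steady-state benefit of the smaller step size.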

ASIC Synthesis

However, using a smaller step size µ1 = 2^(−p)/N would reduce both the misadjustment and the steady-state error, but it makes the convergence slower. On the other hand, although the proposed architecture is pipelined, its convergence speed and minimum steady-state error turn out to be better because it is based on two different step sizes.

Conclusion

Table 3.1: Comparison of computational complexities between DA-based designs for Nth order filter, Lth order base unit and W-bit wordlength of filter coefficients

Columns: Design, Type, Adders, Registers, Shifters, LUT/MUX, Throughput

DA0-ADF [67], Non-Pip/Par†: +3M-2TM+(1K+2M+2M+TM+1M − 1)TM+ TA )]
DA1(a)-ADF [68], Non-Pip/Par†: 2M+N+12N+3L+1LM2LM-LUT1/[k1(TACC+TA)
DA1(b)-ADF [68], Non-Pip/Par?†: 3M+N+32M+3N+1LM+NM0-XOR2L−1M-LUT1/[k1(TACC+TM+TA)
DA2-ADF [69], Non-Pip/Par?‡: 4M+3M+3N+2M + 2NM0-XOR2LM-LUT1/[k2(TACC+(L+1)TM+2TA)]
DA3-ADF [70], Pip∗/Par†: (3,2L−1+1)M3M(1+2L−1)+N + 2∼2LM2LM-MUX1/[W(LTM+TFA+TD)]
DA4-ADF [71], Pip∗/Par†: M(2+2L−1)+N(3+2L)M+2N+4LM2LM-MUX1 / [W(LTM+TFA+TX+TD)]
Proposed, Pip∗/Par†¶: M(3+2L−2)+N0(4+2L−1)M+NLM(2L−1)M1/[W(LTM) +TA+TFA+TX +2N0+4(1+L)M-MUX+Td)]

Pip∗: pipelined architecture with two adaptation delays; Par: parallel LUT; †: concurrent SA and CUU operations; ‡: concurrent SA, CUU and LUT operations; ?: OBC based product sharing. The designs in [67–69] have a controller, whereas the proposed design involves a CCU consisting of an equality check, a counter and a comparator; N = LM, L: base unit order, M: number of base units; B denotes the wordlength of both input samples and coefficients; L−W=N0=N0=N; k0 = 2L+max(W,2L−1)+log2M, k1 = 2L−1+W+log2M+1, k2 = 2L/2+max(W,2L/2−1)+log2M+1; TACC, TM, TA, TFA and TD are the delays of a look-up table, a 2-to-1 multiplexer, an adder, a full adder and a register, respectively.

However, the direct realization of the FB filter in DFE using DA would increase the critical path, since the SA unit contributes significant time in the feedback loop.

Fig. 3.9: (a) Finite-precision simulation of the presented DA based design, slow-ADF, fast-ADF and basic-CLMS ADF [11] for 128th order, FP: floating-point and FXP: fixed-point (b) MSE learning curves of DA0 [67], DA1 [68], DA2 [69], DA3 [70], DA4 [71]

Mathematical Formulation and Background

Therefore, it is important to design a trial-decision FB filter using OBC-DA as it offers low implementation complexity. This was addressed by the same authors in [33] by precomputing and storing the remaining FB filter coefficients in a second LUT as shown in the figure.

Fig. 4.2: Architecture of conventional ADFE for Nb-th order FB filter.

Proposed Scheme

Transformation of LUT Contents
Proposed Architecture
Design of High-Throughput Architecture

Use of Inverted Multiplexers
Use of Buffers/Inverters
Use of Retiming Technique

Coefficient Update Unit

By similar arguments, the iteration bound of the proposed unfolded architecture for 16-QAM would be 11TM/3. The update schedule of DLUT from time n to n+1 for 8th order FB filter is shown in Fig.

Fig. 4.6: Transformation of LUT contents of stage-I and stage-II for 8th order FB filter.

Performance Comparison

Computational Complexities

Hardware Complexity
Time Complexity

Similarly, the number of clock cycles required for the update of DLUT in adaptive TSP-DFE and R-DFE designs would be 2^(Nb−P) and 2^Nb, respectively. In contrast, the number of clock cycles required for the design in [39] depends on the buffer size and the slow-down factor.

Error Performance

Convergence Performance
BER Performance

In contrast, the design in [39] is associated with a slow-down factor whose value greatly affects the BER result. In addition, this design also needs extra clock cycles for system initialization, which further degrades BER performance.

Implementation Results

ASIC Synthesis
FPGA Synthesis

For example, if the BER value is fixed at 10^−3 and P varies from 3 to 4, then almost 2 dB more SNR is required for the Rayleigh fading channel, while it is only 1.2 dB for the AWGN channel. Compared with R-DFE and TSP-DFE, the proposed design provides a better BER value by eliminating the effect of the first P coefficients of FB filter in parallel due to E-MUX at stage-I, as shown in Figure 1.

Conclusion

Table 4.1: Comparison of computational complexities between DFE designs for Nf-th order FF filter and Nb-th order FB filter with constellation size M

Columns: Design, Type, Adders, Multiplexers, Registers, Critical Path/Throughput, Latency, LUT/MULT

R-DFE† [29], non-A: (M)Nb(M)Nb−1Nblog2 MTMlog2M00
R-DFE† [29], A: 2(M)Nb+12(M)Nb−2 +(M −1)NbNblog2M1/[k0TMlog2M]02Nb+M
PP-DFE†? [33], non-A: 2P+M +(Nb−P+1)2P+M−1(Nb+P)log2MTFA∼TMULP(Nb−P+ 1)log2M
TSP-DFE† [33], non-A: 2(M)Nb/22(M)Nb/2−23Nblog2M/2TA+TS Nb/2+1+TMlog2MNb/22(Nb+M)/2+1
TSP-DFE† [33], A: 3 (M)Nb/2+23(M)Nb/2 +(M−1)Nb/2(3Nblog2M)/21/[k1(2TA+TS N/2+1+log2MTM)]Nb/23(2) (Nb+M)/2
DFE† [34], A: (M)Nb(M)Nb(M)Nb+NbJlog2(Nb+1)TMβ+γ0
DFE† [35], non-A: (M)Nb~MNbNbNb2 + (M)Nblog2(Nb+1)TM/(P+log2N−1)R−10
DFE? [38], non-A: ∼[(Nb+1)log2M]2−Nblog2MJ0[(Nb+1)TA+TS ]γ∼(Nblog2M)2
DFE? [39], A: ∼2[(Nblog2M)2+1]−Nblog2M1/[k2((N+1)TA+TS)]β+γ∼2(Nblog2M)2
DFE† [40], non-A: (Nblog2M)2/2(M−1)Nb2 /2(Nblog2M)3/6NbTA+TS+TMlog2MQ−1−
Proposed†, non-A: 2(M)Nb/2−1+12(M )Nb/2−1(3Nb/2+2)log2MTA+0.5TS Nb/4+1+TMlog2MNb/2+12(Nb+M)/2
Proposed†, A: 3(M)Nb/2−1+33(M)Nb /2−1−1 +(M−1)Nb/2(3Nb/2+2)log2M1/[k3(1.5Ta+0.5Ts Nb/4+1+TMlog2M)]Nb/2+13(2)( Nb+M)/2−1

†: multiplier-less architecture; ?: multiplier-based architecture; A: adaptive, P: speed-up factor, J: unfolding (or parallelization) factor, R: increment factor, Q: iteration factor; β: slow-down factor; γ: system initialization clock cycles, for simplicity γ = 0; KM: input buffer size; KM=2MK with K=Nf+Nb+1, Throughput = Clock rate / Processing time per sample, Clock rate = 1 / Critical path, k0=Nf+α02Nb+1; k1=Nf+P+α12Nb−P+1; k3=Nf+P+α22Nb−P−1+2; with α0=2M, α1=2M/2 and α2=2M/2−1; k2=J0KM; with J0=pJ, where p denotes the extra parallelization involved in DFE? [38] and DFE? [39]. TM, TA, TMUL and TFA are the computational delays of a 2-to-1 multiplexer, an adder, a multiplier and a carry-save full adder.

In addition to the listed hardware complexities, the proposed non-adaptive DFE and ADFE require 2((M−1)Nb/2−1) and 3((M−1)Nb/2−1) 1-bit XOR gates, respectively, while the proposed ADFE involves a conditional scaler and a barrel shifter for updating the DLUT contents and the FB filter coefficients, respectively. Note that the designs [34,39] are based on the same principle and involve large buffers. The complexities are estimated for the I-channel of the M-QAM system and must be multiplied by a factor of two to determine the overall complexity.

As explained in Chapter 1, the ADF is the basic building block of a noise canceller, where it estimates the noise to be cancelled.

Fig. 4.14: Comparison of critical paths between proposed and existing designs for 16-QAM.

Mathematical Formulation

As a result, the LUT size can be halved, since the remaining half of the combinations are generated by external XOR gates. It can be noted from (5.20) that the content at the lower-half even address location for p = 0, q = 0 is the two's complement of the content at the upper-half odd address location for p = 1, q = 1.

Fig. 5.2: Block diagram of pipelined ADF based on BLMS algorithm with one adaptation delay.

Proposed Scheme

Filter Block Update Strategy
Proposed Architecture

LUT Update Scheme
Architecture of Sub-Filter Unit
Architectures of Error Computation Unit and Coefficient Update Unit

Computational Complexities
Noise Reduction Performance
ASIC Synthesis

In contrast, the proposed design takes only 41 clock cycles (8 clock cycles for the LUT0 and LUT1 update, 32 clock cycles in the SA unit and 1 clock cycle for the block error calculation and the update of the external registers). From the results, it is found that the proposed filter works well at certain frequencies.

Fig. 5.4: Block update scheme for the presented DA based BLMS ADF (only LUTs are shown; subscripts 0 and 1 indicate even and odd components, respectively) of 6th order filter and block length of two

Conclusion

Although the complexity of pipelined LMS adaptive filter was optimized using OBC scheme in Chapter 2, non-OBC terms were produced at the output. In each iteration, the correlation between the adjacent errors was compared with the predefined time window.

Table 5.2: Comparison of Adders, LUT size and Noise Reduction of Presented and Existing Design [47]

Suggestions for Future Research

Bishop, “A variable step (VS) adaptive filter algorithm,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. Parker, “Block implementation of adaptive digital filters,” IEEE Transactions on Acoustics, Speech and Signal Processing, vol.

Block diagram of an ADF

Block diagram of system identification configuration of an ADF

Block diagram of channel equalization configuration of an ADF

Block diagram of conventional ADFE

Block diagram of adaptive noise cancellation configuration of an ADF

Circuit schematic of an ADF based on conventional LMS algorithm

Block diagram of ADF based on CLMS algorithm

Block diagram of ADF based on BLMS algorithm

Block diagram of ADF based on DLMS algorithm

Block diagram of pipelined LMS ADF with two adaptation delays

Architecture of pipelined LMS ADF based on OBC-DA

Design, analysis and comparison of 4th order OBC-LUT I, II, III architectures

Neuvo, “The maximum sampling rate of digital filters under hardware speed limitations,” IEEE Transactions on Circuits and Systems, vol. Färber et al., “The 5G candidate waveform race: a comparison of complexity and performance,” EURASIP Journal on Wireless Communications and Networking, vol.

Design, analysis and comparison of radix-4 TC and OBC partial product generators

Radix-2 pipelined MMP-CSFA based SA unit for OBC-DA

Radix-4 pipelined MMP-CSFA based SA unit for OBC-DA

Circuit schematic of 4th order OBC-DA based LMS ADF

MSE learning curves of adaptive equalization problem for the presented and existing

Circuit schematic of 16th order OBC-DA based LMS ADF with an ISR unit and a

Comparison of adders complexity for the presented and existing DA based designs

Comparison of registers complexity for the presented and existing DA based designs

Comparison of multiplexers, XOR gates and LUT words complexities for the presented

Throughput curves for the presented and existing DA based designs

Thus, it is clear that the number of clock cycles to update the contents of DLUT is significantly reduced for the proposed design. The corresponding noise reduction performance results for the proposed design are illustrated in Fig.

Block diagram of pipelined CLMS ADF with two adaptation delays

Pipelined CSFA based SA unit for TC-DA

Circuit schematic of 4th order S-PLUT

Circuit schematic of 4th order LMS ADF based on TC-DA

The block diagram of the proposed BLMS ADF for filter order = 16 and block length = 4 is shown in Fig. Also, the proposed filter maintains an estimate of the input noise from time to time as shown in the figure.

Circuit schematic of 16th order LMS ADF based on TC-DA

Assumed MSE curves for the presented design to determine ζ

Variation of ζ with respect to initial error for 32nd order filter and 8, 16-bit wordlengths

Block diagram of OFDM-QAM with TEQ

Architecture of conventional ADFE for Nb-th order FB filter

Architecture of R-DFE for 3rd order FB filter