due to coefficient update operation. Moreover, the filtering and coefficient updating operations are mutually coupled, which requires that the partial products of the filter coefficients be recalculated before each filtering operation. In the past, several authors have addressed this problem using approximations to the standard LMS algorithm [67–71]. Allred et al. [67] suggested a technique that employs an external LUT for storing the partial products of input samples. They showed that, by proper selection of system parameters, the throughput can be made independent of the filter order.
However, the use of two LUTs made the whole system intensive in both computation time and area. Guo and DeBrunner [68] addressed this problem by using a single LUT for storing the partial products of input samples, while the adaptation of the filter coefficients was similar to the conventional LMS criterion. Considerable savings in area and time were achieved in both the filtering and coefficient updating operations. Later, Surya and Rafi [69] proposed a novel design based on the framework of [67], storing the partial products of filter coefficients and input samples in two separate LUTs.
Further, they showed that storing the most recent sample in an external register allows the LUT to be decomposed into two smaller LUTs. Unlike the LUT based realizations discussed above, it is also possible to realize the LMS ADF without a LUT [72]. Meher and Park [70] proposed the first LUT-less design for the LMS ADF, based on the DLMS algorithm. With this scheme, the throughput of the LMS ADF was improved compared to [67, 68], but at the cost of increased hardware complexity, especially for higher-order DA base units. Later, in [71], the same authors proposed a new pipelined architecture for the LMS ADF using the frameworks of [68, 70]. It reduces both the LUT update time of [68] and the complexity of [70] through a novel parallel LUT structure.
Specifically, the LUT is used to generate the partial products of input samples in parallel using hardware elements. Further, the convergence properties are affected by the selected step-size and by the pipelined nature of the structure. In order to increase the speed and reduce the area complexity of the system, a binary carry-save full adder (CSFA) based SA unit was employed. However, it was found that the CSFA based SA unit contains carry loops, which may limit the maximum operating speed [71].
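The LUT and SA terminology used in the designs above can be made concrete with a small software sketch. The code below is a generic illustration of DA, not the architecture of any of the cited works: it stores all partial sums of the input samples in a LUT and then forms the inner product by one LUT read per coefficient bit followed by a shift and an accumulation, which is what the SA unit is taken to perform here. The function names `build_lut` and `da_inner_product`, the integer data, and the two's-complement coefficient format are assumptions made for illustration only.

```python
def build_lut(samples):
    """Return a table where entry a holds the sum of samples[k]
    over every index k whose bit is set in address a."""
    n = len(samples)
    lut = [0] * (1 << n)              # 2**n entries for n taps
    for addr in range(1, 1 << n):
        low = addr & -addr            # lowest set bit of the address
        k = low.bit_length() - 1      # tap index of that bit
        lut[addr] = lut[addr ^ low] + samples[k]
    return lut


def da_inner_product(coeffs, samples, bits):
    """Compute sum(c * x) using LUT reads and shift-accumulate only.

    coeffs are signed integers that must fit in `bits`-bit
    two's complement; samples are integers.
    """
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    if any(not lo <= c <= hi for c in coeffs):
        raise ValueError("coefficient does not fit in the given word length")
    lut = build_lut(samples)
    acc = 0
    for b in range(bits):
        # Address: bit b of every coefficient, one address bit per tap.
        addr = 0
        for k, c in enumerate(coeffs):
            addr |= ((c >> b) & 1) << k
        term = lut[addr] << b         # weight the partial sum by 2**b
        # The most significant bit carries negative weight in two's complement.
        acc += -term if b == bits - 1 else term
    return acc
```

For a 4-tap filter with 4-bit coefficients, `da_inner_product([3, -2, 5, -7], [10, -4, 7, 1], 4)` agrees with the directly computed inner product. The table built by `build_lut` has 2^N entries for N taps, so its size grows exponentially with the filter order.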
Mohanty and Meher [47] proposed a DA based BLMS ADF in which a novel LUT sharing technique is used for the computation of both the filtering and coefficient updating operations. It offers low area and low power for the implementation of the BLMS ADF. Later, Mohanty et al. [48] suggested a low-complexity BLMS ADF that achieves better area and power figures, but at the cost of increased computation time. Hence, the new challenge would be to reduce the computational complexity of the pipelined LMS
ADF and its variants. Since DA offers computational savings in the realization of the LMS ADF, we feel that there is still scope to realize such filters with low area, low power, and better throughput for higher-order filter implementations.
1.6.1 Problem Formulation
Owing to the ubiquitous applications of LMS ADFs, their computational complexity has been increasing rapidly with increasing levels of integration. Thus, the computational complexity poses a major concern for the VLSI implementation of LMS ADFs, especially in view of meeting the requirements of a given application. In general, the computational complexity of an algorithm is expressed in terms of hardware and time complexities. In an LMS ADF, multipliers primarily contribute to the hardware complexity, while the critical path delay determines the time complexity. As the order of the ADF increases, both the number of multipliers and the critical path delay go up, which makes real-time operation of the ADF difficult. Pipelining is a commonly used technique to reduce the critical path delay; however, it slows down the convergence rate and increases the steady-state error. In order to improve the convergence performance of the ADF, additional hardware is required. Moreover, due to the limitations of a standalone LMS ADF in various applications, a combination of LMS ADFs in parallel, in series, or in block form is normally used to achieve better performance. For instance, the combination of two LMS ADFs in parallel provides a fast convergence rate and low steady-state error in system identification; the combination of two LMS ADFs in series achieves better performance in channel equalization; and a block of LMS ADFs offers superior performance in noise cancellation scenarios. As a consequence, the hardware complexity increases further. Thus, there is a need for an alternative approach to reduce the overall computational complexity. Distributed arithmetic (DA) is a preferred method for efficient implementation of ADFs, as it can eliminate the multipliers from LMS ADFs. However, the complexity of a DA based LMS ADF is governed by the size of the LUT used to store the filter partial products.
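The effect of pipelining on adaptation can be explored with a short simulation. The sketch below assumes that a pipelined LMS ADF is modelled by a delayed-update form of LMS, in which the coefficient update at time n uses the error and input vector from `delay` samples earlier; setting `delay` to zero gives the standard LMS recursion. The function name `dlms_identify`, the noise-free system identification setup, and the uniformly distributed input are illustrative assumptions and are not taken from the cited designs.

```python
import random


def dlms_identify(true_weights, mu, delay, n_samples, seed=0):
    """Identify an FIR system with a delayed-update LMS filter.

    Returns (weights, squared_errors). delay=0 is standard LMS.
    """
    rng = random.Random(seed)
    n = len(true_weights)
    w = [0.0] * n                 # adaptive coefficients
    x = [0.0] * n                 # tap-delay line, newest sample first
    pending = []                  # (error, input vector) pairs awaiting use
    sq_err = []
    for _ in range(n_samples):
        # Shift a new input sample into the delay line (creates a new list).
        x = [rng.uniform(-1.0, 1.0)] + x[:-1]
        d = sum(h * xi for h, xi in zip(true_weights, x))   # desired output
        y = sum(wi * xi for wi, xi in zip(w, x))            # filter output
        e = d - y
        sq_err.append(e * e)
        pending.append((e, x))
        # Update only once a pair that is `delay` samples old is available.
        if len(pending) > delay:
            e_old, x_old = pending.pop(0)
            w = [wi + mu * e_old * xi for wi, xi in zip(w, x_old)]
    return w, sq_err
```

Calling `dlms_identify` with the same `true_weights`, `mu`, and `seed` but different values of `delay` yields squared-error sequences that can be compared directly. Each output sample costs N multiplications for filtering and roughly N more for the update, which is the multiplier count that DA aims to remove.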
Identifying redundancies in the partial products could be a possible solution for reducing the complexity of a DA based LMS ADF. The complexity reduction of a pipelined ADF using DA, with a focus on improving its convergence performance without hardware overhead, has not been studied in the literature. This motivates us to exploit the possible benefits of DA in the realization of low-complexity LMS ADFs. As a proof of concept, two case studies are considered: one on channel equalization for a 5G communication system, and the other on noise cancellation for in-ear headphones.