VOLUME: 07, Issue 07, Paper id-IJIERM-VII-VII, September 2020
AN ANALYTICAL RESEARCH ON LOW-COST HIGH-PERFORMANCE VLSI ARCHITECTURE SYSTEM
Gautam Kumar
Research Scholar, Laxmi Devi Institute of Engineering and Technology, Alwar RTU Kota, Rajasthan
Sandeep Kumar Dinkar
Associate Professor, Laxmi Devi Institute of Engineering and Technology, Alwar RTU Kota, Rajasthan
Abstract:- The multiplier receives and outputs data in binary representation and uses only a one-level carry-save adder (CSA) to avoid carry propagation at each addition operation. This CSA is also used to perform operand precomputation and format conversion from the carry-save format to the binary representation, leading to a low hardware cost and a short critical path delay at the expense of extra clock cycles for completing one modular multiplication. To overcome this weakness, a configurable CSA (CCSA), which can act as one full adder or two serial half adders, is proposed to reduce the extra clock cycles for operand precomputation and format conversion by half. A mechanism that can detect and skip the unnecessary carry-save addition operations in the one-level CCSA architecture while maintaining the short critical path delay is developed. The extra clock cycles for operand precomputation and format conversion can be hidden and high throughput can be obtained.
This paper discusses semi-carry-save (SCS) based Montgomery modular multiplication (SCS-MM2) with high-speed performance. We propose a modified SCS-based Montgomery modular multiplication (SCS-MM2) with a reversible carry-save adder (RCSA) built from Peres gates, so that the performance can be increased, and we present its simulation and synthesis results. Previously, the radix-2 Montgomery modular multiplication (MM) architecture was implemented for basic MM, full-carry-save Montgomery modular multiplication (FCS-MM), and the basic SCS-MM1. The proposed radix-2 modified SCS-MM2 is a high-performance architecture, and its results are shown for a 128-bit operand length.
1. INTRODUCTION
1.1 Modular Multiplication Algorithms
1.1.1 Montgomery Multiplication
Fig. 1 shows the radix-2 version of the Montgomery MM algorithm (denoted as the MM algorithm). The Montgomery modular product of A and B can be obtained as $S = A \times B \times R^{-1} \pmod{N}$, where $R^{-1}$ is the inverse of R modulo N. That is, $R \times R^{-1} \equiv 1 \pmod{N}$. Note that the notation $X_i$ in Fig. 1 denotes the ith bit of X in binary representation. In addition, the notation $X_{i:j}$ indicates the segment of X from the ith bit to the jth bit. Because the convergence range of S in the MM algorithm is $0 \le S < 2N$, an additional operation $S = S - N$ is required to remove the oversized residue if $S \ge N$. To eliminate the final comparison and subtraction in step 6 of Fig. 1, Walter changed the number of iterations and the value of R to $k+2$ and $2^{k+2} \bmod N$, respectively. Nevertheless, the long carry propagation for the very large operand addition still restricts the performance of the MM algorithm.
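The recurrence described above can be sketched in software as follows. This is a minimal Python model written for illustration, not the hardware design of this paper; the function names and the use of plain integers are assumptions. The first function runs k iterations and applies the final subtraction, and the second follows Walter's variant with k+2 iterations and no final subtraction.

```python
def montgomery_mm(a, b, n, k):
    """Radix-2 Montgomery product a*b*2^(-k) mod n for odd n < 2^k and a, b < n."""
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1             # i-th bit of A
        q_i = (s + a_i * b) & 1        # makes the next sum even (n is odd)
        s = (s + a_i * b + q_i * n) >> 1
    if s >= n:                         # result lies in [0, 2n): one subtraction suffices
        s -= n
    return s


def montgomery_mm_walter(a, b, n, k):
    """Variant with k+2 iterations and R = 2^(k+2); inputs and output stay below 2n."""
    s = 0
    for i in range(k + 2):
        a_i = (a >> i) & 1
        q_i = (s + a_i * b) & 1
        s = (s + a_i * b + q_i * n) >> 1
    return s                           # no final comparison or subtraction
```

In both functions the addition `s + a_i * b + q_i * n` is a full-width carry-propagate addition, which is the operation that limits clock speed in hardware for large k.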
Algorithm SCS-based MM: SCS-based Montgomery multiplication
Inputs: A, B, N (modulus)
Output: S[k+2]
1. SS[0] = 0; SC[0] = 0;
2. for i = 0 to k+1 {
3.   q_i = (SS[i]_0 + SC[i]_0 + A_i × B_0) mod 2;
4.   (SS[i+1], SC[i+1]) = (SS[i] + SC[i] + A_i × B + q_i × N) / 2;
5. }
6. S[k+2] = SS[k+2] + SC[k+2];
7. return S[k+2];
Fig. 1 SCS-based Montgomery multiplication algorithm
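As an illustration of the algorithm in Fig. 1, the following Python sketch keeps the intermediate result in carry-save form (SS, SC) and uses two chained 3-to-2 compressors per iteration, in the spirit of a two-level CSA. The helper names `csa` and `scs_mm` are assumptions made for this sketch, and Python integers stand in for the k-bit registers.

```python
def csa(x, y, z):
    """3-to-2 carry-save adder: returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z                                  # bitwise sum without carries
    c = ((x & y) | (x & z) | (y & z)) << 1         # carries, shifted to their weight
    return s, c


def scs_mm(a, b, n, k):
    """SCS-based Montgomery multiplication, k+2 iterations, odd n < 2^k, a, b < 2n."""
    ss, sc = 0, 0
    for i in range(k + 2):
        a_i = (a >> i) & 1
        # Step 3: q_i depends only on the least significant bits.
        q_i = ((ss & 1) + (sc & 1) + a_i * (b & 1)) % 2
        # Step 4: two CSA levels add A_i*B and q_i*N without carry propagation.
        s1, c1 = csa(ss, sc, a_i * b)
        s2, c2 = csa(s1, c1, q_i * n)
        # Both s2 and c2 have a zero LSB here, so shifting each halves the sum exactly.
        ss, sc = s2 >> 1, c2 >> 1
    # Step 6: format conversion with one carry-propagate addition.
    return ss + sc
```

Only the last line performs a carry-propagate addition; every loop iteration uses bitwise operations whose hardware delay does not grow with the operand length.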
Fig. 2 SCS-MM-1 multiplier
1.1.2 FCS-Based Montgomery Multiplication
To avoid the format conversion, FCS-based Montgomery multiplication maintains A, B, and S in the carry-save representations (AS, AC), (BS, BC), and (SS, SC), respectively. McIvor et al. [9] proposed two FCS-based Montgomery multipliers, denoted as the FCS-MM-1 and FCS-MM-2 multipliers, composed of one five-to-two (three-level) and one four-to-two (two-level) CSA architecture, respectively. The algorithm and architecture of the FCS-MM-1 multiplier are shown in Figs. 3 and 4, respectively. The barrel register full adder (BRFA)
Algorithm FCS-MM-1: FCS-based Montgomery multiplication
Inputs: AS, AC, BS, BC, N (modulus)
Outputs: SS[k+2], SC[k+2]
1. SS[0] = 0; SC[0] = 0;
2. for i = 0 to k+1 {
3.   q_i = (SS[i]_0 + SC[i]_0 + A_i × (BS_0 + BC_0)) mod 2;
4.   (SS[i+1], SC[i+1]) = (SS[i] + SC[i] + A_i × (BS + BC) + q_i × N) / 2;
5. }
6. return SS[k+2], SC[k+2];
Fig. 3 FCS-MM-1 Montgomery multiplication algorithm
Fig. 4 FCS-MM-1 multiplier
in Fig. 4 consists of two shift registers for storing AS and AC, a full adder (FA), and a flip-flop (FF). For more details about the BRFA, please refer to [9] and [10]. On the other hand, the FCS-MM-2 multiplier proposed in [9] adds up BS, BC, and N into DS and DC at the beginning of each MM. Therefore, the depth of the CSA tree can be reduced from three to two levels. Nevertheless, to remove one level of the CSA tree, the FCS-MM-2 multiplier needs two additional 4-to-1 multiplexers addressed by A_i and q_i and two more registers to store DS and DC. As a result, the critical path of the FCS-MM-2 multiplier may be somewhat reduced, at the price of a significant increase in hardware area when compared with the FCS-MM-1 multiplier.
1.2 Proposed System Using SCS-MM-2
To avoid the long carry propagation, the intermediate results of the shifting modular addition can be kept in the carry-save representation (SS, SC), as shown in Fig. 1. Note that the number of iterations in Fig. 1 has been changed from k to k+2 to remove the final comparison and subtraction. However, the format conversion from the carry-save format of the final modular product into its binary format is needed, as shown in step 6 of Fig. 1.
Fig. 2 shows the architecture of a previously proposed SCS-based MM algorithm (denoted as the SCS-MM-1 multiplier), composed of one two-level CSA architecture and one format converter, where the dashed line denotes a 1-bit signal. In [5], a 32-bit CPA with multiplexers and registers (denoted as CPA_FC), which adds two 32-bit inputs and generates a 32-bit output at every clock cycle, was adopted for the format conversion. Therefore, the 32-bit CPA_FC takes 32 clock cycles to complete the format conversion of a 1024-bit SCS-based Montgomery multiplication. The extra CPA_FC probably enlarges the area and the critical path of the SCS-MM-1 multiplier. Prior works precomputed D = B + N so that the computation of A_i × B + q_i × N in step 4 of Fig. 1 can be simplified into one selection operation. One of the operands 0, N, B, and D is chosen if (A_i, q_i) = (0, 0), (0, 1), (1, 0), and (1, 1), respectively.
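The selection described above can be illustrated with a short Python sketch. The function names are hypothetical; the point is only that a 4-to-1 multiplexer addressed by (A_i, q_i) returns the same value as the two multiplications and the addition it replaces.

```python
def select_operand(a_i, q_i, b, n, d):
    """4-to-1 multiplexer: (A_i, q_i) = (0,0)->0, (0,1)->N, (1,0)->B, (1,1)->D."""
    return (0, n, b, d)[(a_i << 1) | q_i]


def step_addend(a_i, q_i, b, n):
    """Reference value A_i*B + q_i*N that step 4 would otherwise compute."""
    return a_i * b + q_i * n


def precompute_d(b, n):
    """D = B + N is computed once per modular multiplication."""
    return b + n
```

Because D is fixed for a whole multiplication, the per-iteration work reduces to a table lookup followed by a single three-input carry-save addition.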
Fig. 5 SCS-MM-2 multiplier
On the other hand, Zhang et al. [8] reused the two-level CSA architecture to perform the format conversion so that the CPA_FC can be removed. That is, S[k+2] = SS[k+2] + SC[k+2] in step 6 of Fig. 1 is replaced with the repeated carry-save addition operation (SS[k+2], SC[k+2]) = SS[k+2] + SC[k+2] until SC[k+2] = 0.
Fig. 5 shows the architecture of this Montgomery multiplier (denoted as the SCS-MM-2 multiplier). Note that the select signals of multiplexers M1 and M2, generated by the control part, are not drawn in Fig. 5 for the sake of simplicity. However, the extra clock cycles for format conversion depend on the longest carry propagation chain in SS[k+2] + SC[k+2], and about k/2 clock cycles are required in the worst case because a two-level CSA architecture is adopted.
3. PROPOSED SYSTEM:
3.1 Critical Path Delay Reduction
The carry-propagation addition operations of B + N and the format conversion are performed by the one-level CSA architecture of the MSCS-MM multiplier by repeatedly executing the carry-save addition (SS, SC) = SS + SC + 0 until SC = 0. In addition, we precompute A_i and q_i in iteration i−1 (this is explained more clearly in Section III-C) so that they can be used to immediately select the desired input operand from 0, N, B, and D through the multiplexer M3 in iteration i. In this way, the critical path delay of the MSCS-MM multiplier can be reduced to $T_{\mathrm{MUX4}} + T_{\mathrm{FA}}$. However, in addition to performing the three-input carry-save additions k+2 times, many extra clock cycles are required to perform B + N and the format conversion via the one-level CSA architecture, because they must be performed once in every MM.
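The repeated carry-save addition (SS, SC) = SS + SC + 0 can be sketched as below. This is an illustrative Python model with a hypothetical function name; each loop pass stands for one clock cycle of a one-level CSA whose third input is tied to zero.

```python
def resolve_carry_save(ss, sc):
    """Repeat (SS, SC) = SS + SC + 0 until SC == 0; returns (binary value, cycles used)."""
    cycles = 0
    while sc != 0:
        # With the third input at zero the full adder reduces to XOR (sum) and AND (carry).
        ss, sc = ss ^ sc, (ss & sc) << 1
        cycles += 1
    return ss, cycles


def precompute_b_plus_n(b, n):
    """D = B + N obtained with the same loop, by loading B and N as the sum and carry words."""
    d, _cycles = resolve_carry_save(b, n)
    return d
```

The number of passes depends on the longest carry chain in the operands, which is why the text counts these additions as extra clock cycles whose number varies with the data.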
4. EQUIPMENT REQUIREMENTS
4.1 VLSI and Systems
These advantages of integrated circuits translate into advantages at the system level:
• Smaller physical size: Smallness is often an advantage in itself; consider portable televisions or handheld cellular telephones.
• Lower power consumption: Replacing a handful of standard parts with a single chip reduces total power consumption. Reducing power consumption has a ripple effect on the rest of the system: a smaller, cheaper power supply can be used; since less power consumption means less heat, a fan may no longer be necessary; a simpler cabinet with less electromagnetic shielding may be feasible, too.
• Reduced cost: Reducing the number of components, the power supply requirements, cabinet costs, and so on will inevitably reduce system cost.
The ripple effect of integration is such that the cost of a system built from custom ICs can be lower, even though the individual ICs cost more than the standard parts they replace. Understanding why integrated circuit technology has such a profound influence on the design of digital systems requires understanding both the technology of IC manufacturing and the economics of ICs and digital systems.
4.2 Mask-Driven Manufacturing
Integrated circuit manufacturing technology is remarkably versatile. While there are several manufacturing processes for different circuit types (CMOS, bipolar, and so on), a manufacturing line can make any circuit of that type simply by changing a few basic tools called masks. For example, a single CMOS manufacturing plant can make both microprocessors and microwave oven controllers by changing the masks that form the patterns of wires and transistors on the chips.
Silicon wafers are the raw material of IC manufacturing. The fabrication process forms patterns on the wafer that create wires and transistors. A series of identical chips is patterned onto the wafer (with some space reserved for test circuit structures, which allow manufacturing to measure the results of the manufacturing process).
The IC manufacturing process is efficient because we can produce many identical chips by processing a single wafer. By changing the masks that determine what patterns are laid down on the chip, we determine the digital circuit that will be created.
The IC fabrication line is a generic manufacturing line: we can quickly retool the line to make large quantities of a new kind of chip, using the same processing steps used for the line's previous product.
4.3 Circuits and Layouts
We could build a breadboard circuit out of standard parts. To build it on an IC fabrication line, we must go one step further and design the layout, or the patterns on the masks. The rectangular shapes in the layout (often drawn as a sketch called a stick diagram) form transistors and wires that conform to the circuit in the schematic.
Creating layouts is very time-consuming and very important: the size of the layout determines the cost to manufacture the circuit, and the shapes of elements in the layout determine the speed of the circuit as well. During manufacturing, a photolithographic (photographic printing) process is used to transfer the layout patterns from the masks to the wafer. The patterns left by the mask are used to selectively change the wafer: impurities are added at selected locations in the wafer, and insulating and conducting materials are added on top of the wafer as well.
4.4 Manufacturing Defects
Because no manufacturing process is perfect, some of the chips on the wafer may not work. Since at least one defect is almost sure to occur on each wafer, wafers are cut into smaller, working chips; the largest chip that can reasonably be manufactured today is 1.5 to 2 cm on a side, while wafers are moving from 30 to 45 cm in diameter. Each chip is individually tested; the ones that pass the test are kept after the wafer is diced into chips. The working chips are placed in the packages familiar to digital designers.
In some packages, tiny wires connect the chip to the package's pins while the package body protects the chip from handling and the elements; in others, solder bumps directly connect the chip to the package.
Integrated circuit manufacturing is a powerful technology for two reasons: all circuits can be made out of a few types of transistors and wires; and any combination of wires and transistors can be built on a single fabrication line just by changing the masks that determine the pattern of components on the chip. Integrated circuits run fast because the circuits are small.
4.5 Cost of Manufacturing
IC manufacturing plants are extremely expensive. A single plant costs as much as $4 billion. Given that a new, state-of-the-art manufacturing process is developed at regular intervals, that is a sizeable investment. The investment makes sense because a single plant can manufacture so many chips and can easily be switched to make different types of chips.
In the early years of the integrated circuits business, companies focused on manufacturing large quantities of a few standard parts. These parts are commodities: one 80 ns, 256 Mb dynamic RAM is more or less the same as any other, regardless of the manufacturer.
Companies concentrated on commodity parts in part because manufacturing processes were less well understood, and manufacturing variations are easier to track when the same part is being fabricated day after day.
4.6 Types of Chips
The preponderance of standard parts pushed the problems of building customized systems back to the board-level designers who used the standard parts.
4.6.1 Application-Specific Integrated Circuits (ASICs)
Rather than build a system out of standard parts, designers can now create a single chip for their particular application. Because the chip is specialized, the functions of several standard parts can often be squeezed into a single chip, reducing system size, power, heat, and cost. Application-specific ICs are possible because of computer tools that help humans design chips much more quickly.
4.6.2 Systems-on-Chips (SoCs)
Fabrication technology has advanced to the point that we can put a complete system on a single chip. For example, a single-chip computer can include a CPU, bus, I/O devices, and memory. SoCs allow systems to be made at much lower cost than the equivalent board-level system. SoCs can also offer higher performance and lower power than board-level equivalents because on-chip connections are more efficient than chip-to-chip connections.
4.7 Field-Programmable Gate Arrays (FPGA)
A field-programmable gate array (FPGA) is a block of programmable logic that can implement multi-level logic functions. FPGAs are most commonly used as separate commodity chips that can be programmed to implement large functions.
However, small blocks of FPGA logic can be useful components on-chip, allowing the user of the chip to customize part of the chip's logical function. An FPGA block must implement both combinational logic functions and interconnect to be able to construct multi-level logic functions. There are several different technologies for programming FPGAs, but most logic processes are unlikely to implement antifuses or similar hard programming technologies, so we concentrate on SRAM-programmed FPGAs.
5. TOOLS
5.1 ModelSim
Basic Steps for Simulation
This section provides further detail on each step in the process of simulating a design using ModelSim.
Step 1: Collecting Files and Mapping Libraries
Files needed to run ModelSim on a design:
• Design files (VHDL, Verilog, and/or SystemC), including the stimulus for the design
• Libraries, both working and resource
• modelsim.ini (automatically created by the library mapping command)
5.2 Introduction to XILINX ISE
This tool can be used to create, implement, simulate, and synthesize Verilog designs for implementation on FPGA chips.
ISE: Integrated Software Environment
• Environment for the development and test of digital system designs targeted to FPGA or CPLD
• Integrated collection of tools accessible through a GUI
• Based on a logic synthesis engine (XST: Xilinx Synthesis Technology)
5.2.1 Implementation
Synthesis (XST)
Produces a netlist file starting from an HDL description.
Translate (NGDBuild)
Converts all input design netlists and then writes the results into a single merged file that describes logic and constraints.
Mapping (MAP)
Maps the logic onto device components.
Takes a netlist and groups the logical elements into CLBs and IOBs (components of the FPGA).
Place and Route (PAR)
Places FPGA cells and connects the cells.
Bit Stream Generation
5.3 Introduction to FPGA
FPGA stands for Field Programmable Gate Array, a device that contains an array of logic modules, I/O modules, and routing tracks (programmable interconnect). An FPGA can be configured by the end user to implement specific circuitry. Speeds were up to 100 MHz, but at present they reach the GHz range.
Main applications are DSP, FPGA-based computers, logic emulation, ASIC, and ASSP. FPGAs are mainly programmed through SRAM (static random access memory). SRAM is volatile, and the main advantage of SRAM programming technology is reconfigurability. Issues in FPGA technology are the complexity of the logic element, clock support, I/O support, and interconnections (routing).
5.4 FPGA Design Flow
An FPGA contains a two-dimensional array of logic blocks and interconnections between the logic blocks. Both the logic blocks and the interconnects are programmable. Logic blocks are programmed to implement a desired function, and the interconnects are programmed using the switch boxes to connect the logic blocks.
To be more specific, if we want to implement a complex design (a CPU, for instance), the design is divided into small sub-functions and each sub-function is implemented using one logic block. Then, to obtain the desired design (CPU), all the sub-functions implemented in logic blocks must be connected, and this is done by programming the interconnects.
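The decomposition into programmable logic blocks and interconnect can be illustrated with a toy Python model. It assumes that each logic block is a small lookup table filled from configuration bits, which is a common organization for SRAM-programmed FPGAs but is not stated in this paper; the class and function names are hypothetical. A 4-to-1 multiplexer is split into three 2-to-1 sub-functions, each placed in one block, and ordinary function calls play the role of the programmed interconnect.

```python
class LogicBlock:
    """A logic block modeled as a lookup table loaded from configuration bits."""

    def __init__(self, truth_table):
        self.truth_table = list(truth_table)

    def evaluate(self, *inputs):
        # Input 0 is the least significant address bit of the table.
        index = sum((bit & 1) << pos for pos, bit in enumerate(inputs))
        return self.truth_table[index]


def program_block(num_inputs, func):
    """'Program' a block by storing func evaluated on every input pattern."""
    table = []
    for index in range(1 << num_inputs):
        bits = [(index >> pos) & 1 for pos in range(num_inputs)]
        table.append(func(*bits) & 1)
    return LogicBlock(table)


def mux2_function(sel, d0, d1):
    """Sub-function placed in each block: a 2-to-1 multiplexer."""
    return d1 if sel else d0


# Three identically programmed blocks, one per sub-function.
MUX_LOW = program_block(3, mux2_function)
MUX_HIGH = program_block(3, mux2_function)
MUX_OUT = program_block(3, mux2_function)


def mux4(s1, s0, d0, d1, d2, d3):
    """The 'interconnect': wires the outputs of two blocks into the third."""
    low = MUX_LOW.evaluate(s0, d0, d1)
    high = MUX_HIGH.evaluate(s0, d2, d3)
    return MUX_OUT.evaluate(s1, low, high)
```

Changing the tables or the wiring changes the implemented circuit without changing the blocks themselves, which is the sense in which both parts are programmable.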
6. RESULTS
6.1 Proposed Advanced Encryption Standard
AES is based on a design principle called a substitution-permutation network, a combination of both substitution and permutation, and is fast in both software and hardware. Unlike its predecessor DES, AES does not use a Feistel network. AES is a variant of Rijndael that has a fixed block size of 128 bits and a key size of 128, 192, or 256 bits. By contrast, the Rijndael specification per se allows block and key sizes that may be any multiple of 32 bits, both with a minimum of 128 and a maximum of 256 bits.
AES operates on a 4×4 column-major order matrix of bytes, termed the state, although some versions of Rijndael have a larger block size and have additional columns in the state. Most AES calculations are done in a special finite field. The key size used for an AES cipher specifies the number of repetitions of transformation rounds that convert the input, called the plaintext, into the final output, called the ciphertext. The numbers of cycles of repetition are as follows:
10 cycles of repetition for 128-bit keys.
12 cycles of repetition for 192-bit keys.
14 cycles of repetition for 256-bit keys.
Each round consists of several processing steps, each containing four similar but different stages, including one that depends on the encryption key itself [9]. A set of reverse rounds is applied to transform the ciphertext back into the original plaintext using the same encryption key.
6.2 High-Level Description of the Algorithm
Key Expansions
Round keys are derived from the cipher key using Rijndael's key schedule. It requires a separate 128-bit round key block for each round plus one more.
Initial Round
Add Round Key—each byte of the state is combined with a block of the round key using bitwise xor.
Rounds
Sub Bytes—a non-linear substitution step where each byte is replaced with another according to a lookup table.
Shift Rows—a transposition step where the last three rows of the state are shifted cyclically a certain number of steps.
Mix Columns—a mixing operation which operates on the columns of the state, combining the four bytes in each column.
Add Round Key
Final Round (no Mix Columns)
Sub Bytes
Shift Rows
Add Round Key.
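The round structure listed above can be sketched in Python as follows. This is a partial illustration only: the SubBytes and MixColumns transformations are omitted, the state is assumed to be a list of 16 bytes in column-major order, and all names are chosen for this sketch.

```python
ROUNDS_BY_KEY_BITS = {128: 10, 192: 12, 256: 14}


def add_round_key(state, round_key):
    """AddRoundKey: XOR each state byte with the matching round-key byte."""
    return [s ^ k for s, k in zip(state, round_key)]


def shift_rows(state):
    """ShiftRows on a column-major state: row r is rotated left by r positions."""
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def round_sequence(key_bits):
    """Ordered list of step names for one encryption with the given key size."""
    rounds = ROUNDS_BY_KEY_BITS[key_bits]
    steps = ["AddRoundKey"]                                   # initial round
    for _ in range(rounds - 1):                               # full rounds
        steps += ["SubBytes", "ShiftRows", "MixColumns", "AddRoundKey"]
    steps += ["SubBytes", "ShiftRows", "AddRoundKey"]         # final round, no MixColumns
    return steps
```

Counting the AddRoundKey entries returned by `round_sequence` gives the number of rounds plus one, which matches the key schedule requirement of one round key per round plus one more.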
6.3 Simulation Result
Fig 6
6.4 Synthesis Results
The developed design is simulated and its functionality is verified. Once the functional verification is done, the RTL model is taken to the synthesis process using the Xilinx ISE tool.
In the synthesis process, the RTL model is converted to a gate-level netlist mapped to a specific technology library. Many different devices of the Spartan 3E family are available in the Xilinx ISE tool. To synthesize this design, the device "XC3S500E" has been chosen, with the package "FG320" and the device speed grade "-4".
The design is synthesized and its results are analyzed as follows.
6.5 RTL Schematic
Fig 7 RTL Schematic with SCS-MM2
6.6 Technology Schematic
Fig 8 Technology Schematic for SCS-MM2
The main advantage of the proposed system is an increase in the speed of the algorithm. This is achieved by implementing the full adder using Peres gates.
The full adder logic is realized with two Peres gates. Peres gates are reversible gates. The first Peres gate acts as a half adder, and so does the second. These gates produce three garbage outputs, which are not required and are ignored here.
The performance of the RCSA is better than that of the CSA.
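The construction of a full adder from two Peres gates can be checked with a short Python sketch. It assumes the standard Peres gate definition (A, B, C) to (A, A XOR B, AB XOR C) and one common way of cascading two gates; the wiring in the synthesized design of this paper may differ, and the function names are illustrative.

```python
def peres_gate(a, b, c):
    """Peres gate: maps (a, b, c) to (a, a XOR b, (a AND b) XOR c)."""
    return a, a ^ b, (a & b) ^ c


def peres_full_adder(a, b, cin):
    """Full adder from two cascaded Peres gates; returns (sum, carry_out)."""
    # First gate with a constant 0 input behaves as a half adder on a and b.
    _unused1, half_sum, half_carry = peres_gate(a, b, 0)
    # Second gate adds the carry-in to the half sum and merges the two carries.
    _unused2, total, carry_out = peres_gate(half_sum, cin, half_carry)
    return total, carry_out
```

The outputs bound to `_unused1` and `_unused2` are not needed for the addition and are simply discarded in this model.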
6.7 Design Summary
Fig 9 Design Summary
6.8 Timing Report
Speed Grade: -4
Minimum period: 25.764 ns (Maximum Frequency: 38.814 MHz)
Minimum input arrival time before clock: 35.797 ns
Maximum output required time after clock: 11.408 ns
Maximum combinational path delay: 11.219 ns
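As a consistency check on the figures above, the reported maximum frequency is the reciprocal of the reported minimum clock period:

```latex
f_{\max} = \frac{1}{T_{\min}} = \frac{1}{25.764\ \text{ns}} \approx 38.814\ \text{MHz}
```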
7. CONCLUSION
This paper discusses semi-carry-save (SCS) based Montgomery modular multiplication (SCS-MM2) with high-speed performance. We propose a modified SCS-based Montgomery modular multiplication (SCS-MM2) with a reversible carry-save adder (RCSA) built from Peres gates, so that the performance can be increased, and we present its simulation and synthesis results. Previously, the radix-2 Montgomery modular multiplication (MM) architecture was implemented for basic MM, full-carry-save Montgomery modular multiplication (FCS-MM), and the basic SCS-MM1. The proposed radix-2 modified SCS-MM2 is a high-performance architecture, and its results are shown for a 128-bit operand length.
REFERENCES
1. R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Commun. ACM, vol. 21, no. 2, pp. 120–126, Feb. 1978.
2. V. S. Miller, "Use of elliptic curves in cryptography," in Advances in Cryptology. Berlin, Germany: Springer-Verlag, 1986, pp. 417–426.
3. N. Koblitz, "Elliptic curve cryptosystems," Math. Comput., vol. 48, no. 177, pp. 203–209, 1987.
4. P. L. Montgomery, "Modular multiplication without trial division," Math. Comput., vol. 44, no. 170, pp. 519–521, Apr. 1985.
5. Y. S. Kim, W. S. Kang, and J. R. Choi, "Asynchronous implementation of 1024-bit modular processor for RSA cryptosystem," in Proc. 2nd IEEE Asia-Pacific Conf. ASIC, Aug. 2000, pp. 187–190.
6. V. Bunimov, M. Schimmler, and B. Tolg, "A complexity-effective version of Montgomery's algorithm," in Proc. Workshop Complex. Effective Designs, May 2002.
7. H. Zhengbing, R. M. Al Shboul, and V. P. Shirochin, "An efficient architecture of 1024-bits cryptoprocessor for RSA cryptosystem based on modified Montgomery's algorithm," in Proc. 4th IEEE Int. Workshop Intell. Data Acquisition Adv. Comput. Syst., Sep. 2007, pp. 643–646.
8. Y.-Y. Zhang, Z. Li, L. Yang, and S.-W. Zhang, "An efficient CSA architecture for Montgomery modular multiplication," Microprocessors Microsyst., vol. 31, no. 7, pp. 456–459, Nov. 2007.
9. C. McIvor, M. McLoone, and J. V. McCanny, "Modified Montgomery modular multiplication and RSA exponentiation techniques," IEE Proc.-Comput. Digit. Techn., vol. 151, no. 6, pp. 402–408, Nov. 2004.
10. S.-R. Kuang, J.-P. Wang, K.-C. Chang, and H.-W. Hsu, "Energy-efficient high-throughput Montgomery modular multipliers for RSA cryptosystems," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 11, pp. 1999–2009, Nov. 2013.
11. J. C. Neto, A. F. Tenca, and W. V. Ruggiero, "A parallel k-partition method to perform Montgomery multiplication," in Proc. IEEE Int. Conf. Appl.-Specific Syst., Archit., Processors, Sep. 2011, pp. 251–254.
12. J. Han, S. Wang, W. Huang, Z. Yu, and X. Zeng, "Parallelization of radix-2 Montgomery multiplication on multicore platform," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 21, no. 12, pp. 2325–2330, Dec. 2013.
13. P. Amberg, N. Pinckney, and D. M. Harris, "Parallel high-radix Montgomery multipliers," in Proc. 42nd Asilomar Conf. Signals, Syst., Comput., Oct. 2008, pp. 772–776.