SERIAL VS. PARALLEL ELLIPTIC CURVE CRYPTOPROCESSOR DESIGNS

Adnan Abdul-Aziz Gutub¹, Abdul-Aziz Tabakh², Ayed Al-Qahtani² and Alaaeldin Amin²

¹Umm Al-Qura University (UQU) - Makkah 21955, Saudi Arabia

²King Fahd University of Petroleum and Minerals (KFUPM) - Dhahran 31261, Saudi Arabia

ABSTRACT

Point conversion in Elliptic Curve Cryptography (ECC) arithmetic suffers from a costly division operation. Several methods have been proposed to replace this division with multiple multiplication steps using a chosen projective coordinate system.

This work adopts the Aladdin modulo multiplier algorithm (an improvement on the Montgomery modulo multiplier) to build and compare serial and parallel ECC multiplier hardware. The investigation focuses on the tradeoffs between time and area, showing the benefits of the different ECC designs, i.e. those using serial and parallel Aladdin modulo multipliers.

KEYWORDS

Elliptic curve cryptography, serial & parallel implementation, modulo multiplier architecture, Aladdin multiplier, projective coordinates, data security.

1. INTRODUCTION

Cryptography systems have been growing in importance recently as a method for improving data security. Most public-key cryptosystems (such as RSA, ElGamal, and elliptic curve cryptosystems) (Gutub & Khan 2012, Gutub 2010, Gutub 2010-IJS, Tawalbeh et al. 2010) use modular multiplication in encryption and decryption operations with a large modulus (ranging from 100 to 1000 bits). Elliptic curve cryptography (ECC) has proven to offer high security with the lowest number of modulus bits, leading to faster and simpler computations compared to the others (Blake et al. 1999, Gutub et al. 2011, Tawalbeh et al. 2009). The most complex arithmetic operation within ECC is its point conversion, which is based on division. Several methods have been proposed to convert this division operation into multiple multiplication steps using a chosen projective coordinate system (Gutub et al. 2007, Gutub & Ibrahim 2003). In this work, we adopt an improved algorithm for modulo multiplication, named the Aladdin modulo multiplier, to implement serial and parallel ECC hardware.

ECC arithmetic consists of several elliptic curve point doubling and elliptic curve point adding operations (Blake et al. 1999, Gutub & Tahhan 2008). Traditionally, the ECC doubling and adding operations are calculated in sequence; this can be improved using specific parallelism (Gutub 2003, Gutub & Ibrahim 2003, Gutub et al. 2011, Gutub & Khan 2012), which is further enhanced in this work by adopting Aladdin multipliers. The parallel algorithms proposed in (Gutub et al. 2007, Gutub & Ibrahim 2003) are evaluated and compared to this proposed work using VHDL. The investigation focuses on the performance tradeoffs between time (speed) and area (space) of these methods. The results show how parallelism can be exploited to boost the speed of the ECC operation.

The paper is organized as follows. Section 2 gives a brief introduction to ECC. Section 3 describes the Aladdin multiplier as the improved basic block adopted to perform modulo multiplication in this ECC design. Parallel multiplication within ECC is introduced in Section 4. Section 5 outlines the complete ECC architecture that was designed and tested. Section 6 details the experimental results and findings. Finally, Section 7 concludes this research work.

2. ELLIPTIC CURVE CRYPTOGRAPHY (ECC)

Elliptic curve cryptography (ECC) is an approach to public-key cryptography based on the algebraic structure of elliptic curves over finite fields GF(p) or GF(2^k); the scope of this research is GF(p). A GF(p) elliptic curve is a plane curve defined by an equation of the form y² = x³ + ax + b. The set of points on such a curve (i.e. all solutions of the equation together with a point at infinity) can be shown to form an abelian group, with the point at infinity as the identity element (Blake et al. 1999). If the coordinates x and y are chosen from a large finite field GF(p), the solutions form a finite abelian group. The discrete logarithm problem on such elliptic curve groups is believed to be more difficult than the corresponding problem in the underlying finite field. Thus, keys in elliptic curve cryptography can be chosen to be shorter than RSA keys for a comparable level of security (Blake et al. 1999).

There are many ways to apply elliptic curves for encryption/decryption purposes (Gutub 2003). In its most basic form, users randomly choose a base point (x,y) lying on the elliptic curve E. The plaintext (the original message to be encrypted) is coded into an elliptic curve point (xm,ym). Each user selects a private key n and computes the public key P = n(x,y). For example, user A's private key is nA and the corresponding public key is PA = nA(x,y).

To encrypt and send the message point (xm,ym) to user A, the sender chooses a random integer k and generates the cipher text: Cm = { k(x,y), (xm,ym) + kPA }.

The cipher text pair of points uses A's public key, so only user A can recover the plaintext, using the private key nA. To decrypt the cipher text Cm, the first point in the pair, k(x,y), is multiplied by A's private key to get the point nA(k(x,y)). This point is then subtracted from the second point of Cm, and the result is the plaintext point (xm,ym). The complete decryption operation can be written as:

((xm,ym) + kPA) - nA(k(x,y)) = (xm,ym) + k(nA(x,y)) - nA(k(x,y)) = (xm,ym)
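The encryption and decryption steps above can be sketched on a toy curve. The curve y² = x³ + 2x + 2 over GF(17), the base point (5,1), the key values and all function names below are assumptions chosen for illustration only; they do not come from the design described in this paper, and a real system would use a modulus of 160 bits or more.

```python
# Toy parameters (assumed for illustration): y^2 = x^3 + 2x + 2 over GF(17).
P_MOD, A_COEF = 17, 2
BASE = (5, 1)   # base point (x, y)
INF = None      # point at infinity (group identity)

def ec_add(p1, p2):
    """Affine point addition (or doubling when p1 == p2) on the toy curve."""
    if p1 is INF:
        return p2
    if p2 is INF:
        return p1
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF  # P + (-P) is the identity
    if p1 == p2:
        lam = (3 * x1 * x1 + A_COEF) * pow((2 * y1) % P_MOD, -1, P_MOD)
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD)
    lam %= P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)

def ec_neg(p):
    """Negative of a point: (x, -y)."""
    return INF if p is INF else (p[0], (-p[1]) % P_MOD)

def ec_mul(n, p):
    """Compute n*p by scanning the bits of n from the most significant one."""
    q = INF
    for bit in bin(n)[2:]:
        q = ec_add(q, q)
        if bit == "1":
            q = ec_add(q, p)
    return q

def encrypt(msg_point, pub_a, k):
    """Cm = { k(x,y), (xm,ym) + k*PA }."""
    return (ec_mul(k, BASE), ec_add(msg_point, ec_mul(k, pub_a)))

def decrypt(cipher, n_a):
    """(xm,ym) = second point - nA * first point."""
    c1, c2 = cipher
    return ec_add(c2, ec_neg(ec_mul(n_a, c1)))

n_a = 7                      # user A's private key (hypothetical value)
pub_a = ec_mul(n_a, BASE)    # user A's public key PA = nA(x,y)
message = (10, 6)            # a point on the toy curve standing in for (xm,ym)
cipher = encrypt(message, pub_a, k=5)
print(decrypt(cipher, n_a) == message)
```

Both `encrypt` and `decrypt` spend nearly all of their time inside `ec_mul`, which is why the scalar multiplication is the operation targeted by the hardware.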

The most time-consuming operation in the encryption and decryption procedure is computing multiples of the base point (x,y). The algorithm used for this is presented in the next subsection.

2.1 Point Addition and Point Doubling

The ECC algorithm used for calculating nP from P is the binary method, since it is known to be efficient and practical to implement in hardware (Blake et al. 1999). This binary method algorithm is shown below:

Define k: the number of bits in n; n_i: the ith bit of n.

Input: P (a point on the elliptic curve).

Output: Q = nP (another point on the curve).

1. if n_(k-1) = 1 then Q := P else Q := 0;

2. for i = k-2 down to 0

3. { Q := Q+Q;

4. if n_i = 1 then Q := Q+P; }

5. return Q;

Basically, the binary method algorithm scans the bits of n from the most significant to the least significant and doubles the point Q once per loop iteration, i.e. k-1 times in total. Whenever a particular bit of n is found to be one, an extra operation, Q+P, is needed.

As can be seen from the above binary algorithm, adding two elliptic curve points and doubling a point are the most basic operations in each ECC iteration.
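A minimal sketch of the binary method, written over an abstract group so that the control flow is separate from the curve arithmetic. The function name, the callback parameters and the use of plain integers under addition as a stand-in group are assumptions for illustration; with integers, nP is ordinary multiplication, which makes the result easy to check.

```python
def binary_method(n, p, add, double, identity):
    """Left-to-right binary method.

    Returns (n*p, number_of_doublings, number_of_additions).
    `add`, `double` and `identity` describe the group being used.
    """
    bits = bin(n)[2:]                       # n_(k-1) ... n_0
    q = p if bits[0] == "1" else identity   # step 1
    doublings = additions = 0
    for bit in bits[1:]:                    # step 2: i = k-2 down to 0
        q = double(q)                       # step 3: Q := Q+Q
        doublings += 1
        if bit == "1":                      # step 4: Q := Q+P
            q = add(q, p)
            additions += 1
    return q, doublings, additions          # step 5

# Stand-in group: integers under addition, so n*P is plain multiplication.
result, n_double, n_add = binary_method(
    0b101101, 7, add=lambda u, v: u + v, double=lambda u: 2 * u, identity=0
)
print(result, n_double, n_add)
```

For a k-bit scalar the loop always performs k-1 doublings, while the number of additions depends on how many of the remaining bits are one.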

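The double of a point in ECC is calculated as shown below:

$(x_1, y_1) + (x_1, y_1) = (x_3, y_3)$, where $y_1 \neq 0$:

$\lambda = (3x_1^2 + a)/(2y_1), \quad x_3 = \lambda^2 - 2x_1, \quad y_3 = \lambda(x_1 - x_3) - y_1$

The addition of two ECC points is performed as below:

$(x_1, y_1) + (x_2, y_2) = (x_3, y_3)$, where $x_1 \neq x_2$:

$\lambda = (y_2 - y_1)/(x_2 - x_1), \quad x_3 = \lambda^2 - x_1 - x_2, \quad y_3 = \lambda(x_1 - x_3) - y_1$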

Note that adding two points or doubling a point over an elliptic curve requires an inversion, i.e. to compute λ. This inversion complexity can be reduced (Blake et al. 1999) using projective coordinates, as discussed next.
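As a worked illustration of where the inversion appears, consider doubling the point (5, 1) on the toy curve y² = x³ + 2x + 2 over GF(17). This curve and point are assumptions chosen only to keep the numbers small; they are not parameters of the design in this paper.

```latex
\begin{align*}
\lambda &= \frac{3x_1^2 + a}{2y_1} = \frac{3\cdot 5^2 + 2}{2\cdot 1}
         = 77 \cdot 2^{-1} \equiv 9 \cdot 9 = 81 \equiv 13 \pmod{17},\\
x_3 &= \lambda^2 - 2x_1 = 169 - 10 = 159 \equiv 6 \pmod{17},\\
y_3 &= \lambda(x_1 - x_3) - y_1 = 13\,(5 - 6) - 1 = -14 \equiv 3 \pmod{17}.
\end{align*}
```

The only step that is not a multiplication, addition or subtraction is the computation of 2⁻¹ ≡ 9 (mod 17); for a 160-bit modulus this modular inversion is the costly operation that projective coordinates avoid.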

2.2 Projective Coordinates

The inversion complexity is reduced by projecting the coordinates (x,y) into (X,Y,Z), where x=X/Z and y=Y/Z (homogeneous coordinates) (Blake et al. 1999). The procedure for the projective point addition P+Q (of two elliptic curve points) is shown in Fig. 1. Similarly, the formulas for projective point doubling are shown in Fig. 2. The squaring calculation is assumed to be very similar to the multiplication computation. Both are counted as M (the number of multiplication operations), as they dominate the operation complexity.
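To make the idea concrete, the sketch below doubles a point in homogeneous coordinates without any field inversion and converts back to affine form only once at the end. The formulas are the standard textbook ones obtained by substituting x=X/Z and y=Y/Z into the affine doubling equations; they are an assumption here and are not necessarily arranged in the same way as the formulas of Fig. 2. The function names and the toy curve over GF(17) are also illustrative.

```python
def projective_double(X1, Y1, Z1, a, p):
    """Double (X1:Y1:Z1), where x = X/Z and y = Y/Z, using only
    multiplications, additions and subtractions modulo p."""
    w = (a * Z1 * Z1 + 3 * X1 * X1) % p   # numerator of lambda, scaled by Z^2
    s = (Y1 * Z1) % p                     # half of lambda's denominator
    r = (Y1 * s) % p
    b = (X1 * r) % p
    h = (w * w - 8 * b) % p
    X3 = (2 * h * s) % p
    Y3 = (w * (4 * b - h) - 8 * r * r) % p
    Z3 = (8 * s * s * s) % p
    return X3, Y3, Z3

def to_affine(X, Y, Z, p):
    """Convert back to (x, y); this is the single inversion that remains."""
    z_inv = pow(Z, -1, p)
    return (X * z_inv) % p, (Y * z_inv) % p

# Toy check on y^2 = x^3 + 2x + 2 over GF(17): doubling the point (5, 1).
print(to_affine(*projective_double(5, 1, 1, a=2, p=17), p=17))
```

Any representative (X:Y:Z) of the same affine point gives the same result after conversion, so intermediate points in a long scalar multiplication never need to be normalized.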

3. ALADDIN MODULO MULTIPLIER

Modulo multiplication is essential for the modulo exponentiation used in several public-key security algorithms such as Diffie-Hellman, ElGamal and RSA. It is also an important operation for elliptic curve point addition and doubling in ECC.

One efficient way to perform modulo multiplication is the new Aladdin algorithm for binary radix-2 modulo multiplication, as detailed by Amin in a paper submitted to IEEE Trans. Comp.

The radix-2 Aladdin modulo multiplier shown in Algorithm-1 was developed and implemented as an improvement on the well-known Montgomery modulo multiplier, presented by Montgomery (1985). Table 1 briefly lists the benefits of the Aladdin multiplier over the Montgomery multiplier. In general, the modulo multiplier is required to compute P = A·B mod N, where A, B and N are n-bit unsigned integers. The most significant bit (MSB) of N must be 1, thus N must satisfy 2^(n-1) ≤ N ≤ 2^n - 1.

Figure 1. Projective point addition of P+Q, with P=(X1,Y1,Z1), Q=(X2,Y2,Z2) and P+Q=(X3,Y3,Z3), where P ≠ ±Q

Figure 2. Projective point doubling of P+P, with P=(X1,Y1,Z1) and P+P=(X3,Y3,Z3)

Table 1. Montgomery vs. Aladdin Modulo Multipliers

Feature | Montgomery | Aladdin
Pre-computations | Excessive | Minor
Single Modulo Multiplication | Impractical | Very Practical
Produces Quotient | No | Yes
Modulus | Must be ODD | No Restrictions
Modulo Exponentiation | O(n²) | O(n²) but with 15% higher proportionality constant

The objective of the scale step is to keep the running product P within range so that it still fits after the shift step of the next iteration. This means that, after scaling, the running product P should be in the range -2^k ≤ P ≤ 2^k - 1.

This algorithm performs the modular reduction during the multiplication. It deals with signed numbers in 2's complement representation for addition and subtraction. Therefore, all the calculations are carried out using n+2 bits in order to accommodate the shift operation and the sign bit (since the original numbers are unsigned). The modular reduction is done in the addition operation and in the scaling operation, where N or a multiple of it is added to or subtracted from the partial product. The choice of the scaling factors is described thoroughly by Amin in the paper submitted to IEEE Trans. Comp. Note that the initialization step is done every time a modulo multiplication is performed, since it depends on A and N, where A changes frequently but N does not.

For efficiency reasons, the redundant radix-2 Aladdin modulo multiplier shown in Algorithm-2 has also been presented in the paper submitted to IEEE Trans. Comp. by Amin. Using a Carry Save Adder (CSA) results in an O(1) delay per iteration, compared to O(n) per iteration with a Carry Propagate Adder (CPA). The CSA is therefore a significant enhancement, since the delay of the modulo multiplication decreases from O(n²) using a CPA to O(n) using a CSA.
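The reason a CSA step costs O(1) is that every bit position is handled by an independent full adder, with no carry rippling between positions. The sketch below models this on Python integers; the function names are illustrative assumptions, and the running value is kept as a pair (PS, PC) with value PS + 2·PC, matching the assimilation step of Algorithm-2.

```python
def csa(a, b, c):
    """3:2 carry-save compression of three non-negative integers.

    Each bit position is computed independently of the others, so in
    hardware the delay does not grow with the word length.
    Invariant: a + b + c == s + 2 * carry.
    """
    s = a ^ b ^ c                           # per-bit sum
    carry = (a & b) | (a & c) | (b & c)     # per-bit majority (carry out)
    return s, carry

def accumulate(operands):
    """Add many operands in carry-save form, with one CPA at the end."""
    ps, pc = 0, 0                           # running value is ps + 2*pc
    for x in operands:
        ps, pc = csa(ps, pc << 1, x)
    return ps + (pc << 1)                   # single carry-propagate addition

print(accumulate([13, 7, 250, 99]))
```

Only the final `ps + (pc << 1)` needs a full carry-propagate addition, which is why the n loop iterations of the multiplier contribute O(n) rather than O(n²) delay.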

I. Initialize
   P ← 0
   W ← A-N
   IF W ≥ 0 Then W ← W-N

II. Loop
   For i = k-1 down to 0 do
      a. Shift:  P ← 2P
      b. Add:    IF P_(k+1) = 1 Then P ← P + b_i·A
                 Else P ← P + b_i·W
      c. Scale:  Case P_(k+1)P_k is
                 c.1 00: P ← P (no scaling)
                 c.2 01: P ← P-2N
                 c.3 10: P ← P+2N
                 c.4 11: P ← P (no scaling)
                 End Case
   End For

III. Correct
   IF P ≥ N Then P ← P-N
   Else while P < 0 do P ← P+N
   Return(P)

Algorithm 1. Radix-2 Aladdin Modulo Multiplier

Algorithm-1 has the following parameters:

· A : the multiplicand, a k-bit unsigned integer.

· B : the multiplier, a k-bit unsigned integer.

· N : the modulus, a k-bit unsigned integer in the range 2^(k-1) ≤ N ≤ 2^k - 1. This means that its MSB must be 1, which is also needed by the security algorithm.

· P : the result, represented in a (k+2)-bit signed 2's complement format. Of the 2 additional bits, the most significant one is the sign bit, and the other one is needed to accommodate the shift-left operation by one bit. P is in the range -2^(k+1) ≤ P ≤ 2^(k+1) - 1.

· W : the N-conjugate of A, i.e. the pre-computed negative (k+2)-bit quantity of smallest magnitude such that W ≡ A (mod N).
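The following is a behavioural sketch of Algorithm-1 under stated assumptions: Python's unbounded integers stand in for the (k+2)-bit 2's complement register, so the sign bit P_(k+1) is tested as `p < 0` and the two top bits are tested through the equivalent value ranges. The function name and the internal range check are illustrative and are not part of the original hardware description.

```python
def aladdin_modmul(a, b, n, k):
    """Radix-2 Aladdin modulo multiplication sketch: returns a*b mod n
    for k-bit unsigned a, b and a k-bit modulus n whose MSB is 1."""
    assert (1 << (k - 1)) <= n < (1 << k), "MSB of the modulus must be 1"
    assert 0 <= a < (1 << k) and 0 <= b < (1 << k)

    # I. Initialize: W is negative and congruent to A modulo N.
    p = 0
    w = a - n
    if w >= 0:
        w -= n

    # II. Loop over the bits of B, most significant bit first.
    for i in range(k - 1, -1, -1):
        b_i = (b >> i) & 1
        p = 2 * p                       # a. shift
        if p < 0:                       # b. add: sign bit set, add +A
            p += b_i * a
        else:                           #    sign bit clear, add negative W
            p += b_i * w
        if p >= (1 << k):               # c. scale: top bits 01
            p -= 2 * n
        elif p < -(1 << k):             #    top bits 10
            p += 2 * n
        # Top bits 00 or 11 need no scaling; P now fits the next shift.
        assert -(1 << k) <= p < (1 << k)

    # III. Correct: bring P into the range [0, N).
    if p >= n:
        p -= n
    else:
        while p < 0:
            p += n
    return p

print(aladdin_modmul(23, 29, 31, 5) == (23 * 29) % 31)
```

Because A and W are congruent modulo N, either choice in the add step leaves P congruent to the partial product; the choice is made only to pull P back toward zero so that it stays within the register width.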

This implementation uses one CSA (as shown in Fig. 3) together with pre-calculation, correction, and control unit components. The pre-calculation module performs the initialization operations indicated in Algorithm-2. For that, we used a subtractor to subtract N from A, a 2-by-1 multiplexer to choose the input of the subtractor, a register to hold the value of W, and two complementors that give the 2's complements of W and N, to be used in the subtractions of the data path. The subtractor is built using a CPA (with full adders), and each complementor is built using half adders, where a one is added to the least significant bit of the 1's complement and the carry is propagated through all the bits. The correction module performs the correction steps required by the algorithm, using a CPA to assimilate the values of the partial sum and the partial carry. A subtractor is used to reduce the output, and two consecutive CPAs perform up to two additions if the value of P is negative. Three 2-by-1 multiplexers choose the appropriate result among the four candidates, i.e. the result without any addition or subtraction, the result of the subtraction, the result with one addition, and the result with two additions.

The control unit is implemented as a finite state machine. It consists of 6 states that are repeated over the multiplication operations: idle, reset, initial1, initial2, add, and scale. In the reset state, the internal registers are reset and the inputs are latched. Then, in initial1, the first pre-calculation operation is performed (i.e. W=A-N). In initial2, the subtraction is done again if needed. The add and scale states are then repeated k times, where k is the number of bits in the inputs. Finally, the idle state performs the correction step and generates a signal indicating that the result is valid. The result is kept on the output lines as long as the multiplier is not used (idle).

I. Initialize
   PS, PC ← 0
   W ← A-N
   IF W ≥ 0 Then W ← W-N

II. Loop
   For i = k-1 down to 0 do
      a. Shift:  Q[2:0] = PS[k+1:k-1] + PC[k+1:k-1]   -- 3-bit CPA
                 PS ← 4PS
                 PC ← 4PC
      b. Add:    IF Q2 = 1 Then (PS,PC) ← (PS,PC) + b_i·A
                 Else (PS,PC) ← (PS,PC) + b_i·W
      c. Scale:  Z[3:0] = PS[k+1:k-1] + PC[k+1:k-1]   -- 3-bit CPA
                 Case Z2Z1Z0 is
                 c.1 000: (PS,PC) ← (PS,PC) (no scaling)
                 c.2 001: (PS,PC) ← (PS,PC)-N
                 c.3 010: (PS,PC) ← (PS,PC)-2N
                 c.4 011: IF Z3 = 0 Then (PS,PC) ← (PS,PC)-2N
                          Else (PS,PC) ← (PS,PC)+2N
                 c.5 100: (PS,PC) ← (PS,PC)+2N
                 c.6 101: (PS,PC) ← (PS,PC)+N
                 c.7 110: (PS,PC) ← (PS,PC)+N
                 c.8 111: (PS,PC) ← (PS,PC) (no scaling)
                 End Case
   End For

III. Assimilate
   P ← (PS+2PC)   -- carry propagate addition

IV. Correct
   IF P ≥ N Then P ← P-N
   Else while P < 0 do P ← P+N
   Return(P)

Algorithm 2. Redundant Radix-2 Aladdin Modulo Multiplier

4. PARALLEL MULTIPLICATIONS WITHIN ECC

In (Gutub et al. 2007), the authors proposed a practical procedure to parallelize both the multiplication and the point conversion levels of computation. The number of multiplication and addition operations varies according to the projective coordinate system that is used. In (Gutub & Ibrahim 2003), the authors proposed another way to perform these multiplication and addition operations in parallel in order to reduce the latency. They counted the number of operations in two projective coordinate systems and showed that the system with more multiplication operations can complete its computation faster, since some of its operations can be performed in parallel.

In this work, we use the same basic idea but with a different multiplier, i.e. the Aladdin modulo multiplier. The ECC parallel multiplication procedures implemented here follow the projective coordinate dataflow graphs for adding two elliptic curve points and for doubling a point, shown in Fig. 4 and Fig. 5, respectively. The chosen projection of (x,y) to (X/Z,Y/Z) exposes an extra level of parallelism within point addition and doubling in the right-to-left add-and-double algorithm, which speeds up the procedure. Observe in Fig. 4 that each EC point addition needs four multiplication steps with full utilization of all four multipliers. Similarly, the EC point doubling shown in Fig. 5 needs three multiplication steps with full utilization of all multipliers, which is a highly efficient use of the hardware.

The authors of (Gutub et al. 2007) also claimed that the best security and speed are achieved with full parallelization, which can be attained with parallel multipliers in hardware. However, no implementation of this complex multiplier design was carried out to confirm its promised speed and security. Covering this gap is part of the scope of this ECC hardware research.

5. ARCHITECTURE OF THE HARDWARE

The architecture of the new processor is outlined in Fig. 6. This architecture is used to implement ECC point addition and point doubling operations based on the projective coordinate forms discussed before. Unlike existing designs, which use a single multiplier, the new architecture uses four multipliers to meet the high data rate demands of applications.

It is found that four multipliers are sufficient to exploit the full parallelism inherent in projective coordinates. As can be seen from Section 4, the critical path of the point addition dataflow diagram is 4 GF(p) multiplications, and that of point doubling is 3. Here the time of GF(p) addition and subtraction is ignored, since it is very small compared to multiplication. Therefore, the lower bound on the computation time of one iteration of the nP calculation that needs both a doubling and an addition is 7 GF(p) multiplications. It can easily be seen that performing four multiplications in parallel meets this lower bound. Furthermore, the utilization of the four multipliers is high, and all four multipliers are used in every step.

In addition to the Aladdin modulo multipliers discussed in Section 3, the design needs a modulo adder and a modulo shifter. The main task of the modulo adder is to add two integers and then perform a reduction if the result is outside the allowed range. It consists of two CPAs and a subtractor. The first CPA adds or subtracts the inputs according to the indicated operation. Its output is then fed to the other CPA and to the subtractor in order to scale the result up or down as needed. Based on the sign bit of the original sum and the sign bit of the reduced sum, the final result is selected by a 4-by-1 multiplexer. We used two CPAs in order to reduce the number of clock cycles needed to perform this operation.

Figure 3. ECC Carry Save Adder (CSA) data path

Figure 4. EC Point addition

Figure 5. EC Point doubling

The task of the shifter is to shift the input to the left by one bit (i.e. multiply the input by 2) and then scale the output if required. It is built using a subtractor and a 2-by-1 multiplexer. The subtractor is used to scale the shifted input if it exceeds the allowed range. The select input of the multiplexer is the sign bit of the reduced shifted input. If the reduced value is non-negative, it is chosen; otherwise, the unreduced shifted value is chosen.
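The behaviour of the modulo adder and the modulo shifter can be sketched as follows. This is a functional model only, assuming that the operands are already reduced to the range [0, N); the function names are illustrative, and the selection logic is written as plain comparisons rather than as the multiplexer wiring of the actual data path.

```python
def mod_add_sub(x, y, n, subtract=False):
    """Modulo adder sketch for operands in [0, n)."""
    s = x - y if subtract else x + y   # first CPA: raw sum or difference
    scaled_up = s + n                  # second CPA: candidate when s < 0
    scaled_down = s - n                # subtractor: candidate when s >= n
    if s < 0:                          # sign of the raw result
        return scaled_up
    if scaled_down >= 0:               # sign of the reduced result
        return scaled_down
    return s

def mod_double(x, n):
    """Modulo shifter sketch: 2*x reduced into [0, n) for x in [0, n)."""
    shifted = x << 1                   # shift left by one bit
    reduced = shifted - n              # subtractor
    return reduced if reduced >= 0 else shifted   # 2-by-1 multiplexer

print(mod_add_sub(12, 9, 17), mod_add_sub(3, 9, 17, subtract=True), mod_double(12, 17))
```

Computing both scaled candidates alongside the raw result mirrors the use of two CPAs and a subtractor working side by side, so that the final value only has to be selected rather than computed in a further step.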

Figure 6. Outline of the processor architecture

The execution time for the adder and the shifter is negligible compared to the multiplier. Registers are needed to hold the intermediate values and each register is associated with a multiplexer (8-by-1 Mux) to choose the appropriate input for that register among 6 functional components (4-multipliers, adder and shifter) and the external input (to load X,Y and Z for point doubling or addition). Each input of the functional units is associated with a multiplexer (16-by-1 Mux) to select the input among 10 registers.

6. EXPERIMENT RESULTS & ANALYSIS

The proposed parallelized algorithm was expected to be faster than the serial multiplication one. The speed of the algorithm is found to be significantly affected by the type of multiplier used. The study compares the area and speed of the parallel architecture with those of the serial architecture, assuming that the same multiplier, i.e. the Aladdin multiplier, is used in both.

The parallel architecture was modeled in VHDL to obtain fair and independent estimates of area and speed. The size of the computation parameters (the bus size) is set to 160 bits. The experiment shows that the maximum clock frequency of our implementation is 20.547 MHz, i.e. a minimum clock period of 48.669 ns. The total size of the design, including the data path and the control unit, is equivalent to 394,472 gates. The control unit consists of 1,419 gates, which is less than 1% of the overall design size, while the rest is the data path (344,876 gates for the data path and 48,177 gates for buffering, as reported for the VHDL model). The control unit can work at a higher clock rate (541.859 MHz), which means that the data path limits the speed of the design. Improving the speed of the data path can therefore reduce the latency. The parallel ECC processor that uses four multipliers requires ten registers to keep the intermediate values. Each register requires a multiplexer to select the appropriate input. For each input of the functional components (multipliers, adder and shifter), a multiplexer is needed to choose the input among the available register values (R0 to R9).

To compare this design with the serial ECC processor (which performs the multiplication operations serially), only the data path is considered to change, since it occupies the largest portion of the area. The difference in the depth of the combinational path in the data path is ignored, and the clock rate difference is neglected. The major comparison factors are the number of iterations and the area needed for the design (in terms of gate count). The parallel design needs ten registers (10 for point addition and 8 for point doubling), while the serial processor needs only eight registers (8 for point addition and 5 for point doubling). Table 2 lists the area of both designs and their components (as gate counts).

Note that the serial design contains 104,605 gates, which is about 30% of the area of the parallel design, i.e. a saving of roughly 70%. Observe how the multipliers occupy the largest portion of each design (almost 35% in the serial design and 42% in the parallel design), which indicates that reducing the multiplier area will influence the overall area significantly.

Table 2. Gate count of every component

Component | Gates per component | Serial count | Serial gates | Parallel count | Parallel gates
Multiplier | 36,279 | 1 | 36,279 | 4 | 145,116
Adder | 9,570 | 1 | 9,570 | 1 | 9,570
Shifter | 2,862 | 1 | 2,862 | 1 | 2,862
Register | 1,283 | 8 | 10,264 | 10 | 12,830
Mux 4x1 | 2,400 | 8 | 19,200 | 0 | 0
Mux 8x1 | 5,286 | 5 | 26,430 | 10 | 52,860
Mux 16x1 | 11,058 | 0 | 0 | 11 | 121,638
Total gate count | | | 104,605 | | 344,876
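The totals and percentages quoted for Table 2 can be reproduced from the per-component gate counts. The short script below is a bookkeeping sketch; the dictionary and function names are illustrative, while the numbers are taken directly from the table.

```python
# Gates per component, as listed in Table 2.
GATES = {"Multiplier": 36279, "Adder": 9570, "Shifter": 2862,
         "Register": 1283, "Mux 4x1": 2400, "Mux 8x1": 5286,
         "Mux 16x1": 11058}

# Number of instances of each component in the two designs.
SERIAL = {"Multiplier": 1, "Adder": 1, "Shifter": 1, "Register": 8,
          "Mux 4x1": 8, "Mux 8x1": 5, "Mux 16x1": 0}
PARALLEL = {"Multiplier": 4, "Adder": 1, "Shifter": 1, "Register": 10,
            "Mux 4x1": 0, "Mux 8x1": 10, "Mux 16x1": 11}

def total_gates(counts):
    """Total data path gate count for one design."""
    return sum(GATES[name] * qty for name, qty in counts.items())

def multiplier_share(counts):
    """Fraction of the data path occupied by the multipliers."""
    return GATES["Multiplier"] * counts["Multiplier"] / total_gates(counts)

serial_total = total_gates(SERIAL)
parallel_total = total_gates(PARALLEL)
print(serial_total, parallel_total)
print(f"serial/parallel area ratio: {serial_total / parallel_total:.2f}")
print(f"multiplier share: {multiplier_share(SERIAL):.2f} serial, "
      f"{multiplier_share(PARALLEL):.2f} parallel")
```

In the parallel design the multiplexers (52,860 + 121,638 gates) occupy even more area than the four multipliers, so the routing between registers and functional units is also a meaningful target for area reduction.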

The impact of parallelism on speed is evaluated based on the critical path, which determines the clock rate. The time needed for addition and shifting operations is neglected, since it is negligible compared to multiplication. In general, the time estimate is counted as the number of multiplication steps performed by the processor.

In the encryption/decryption process, point doubling is always performed, while point addition is performed only half of the time on average, as in (Gutub et al. 2007, Gutub & Ibrahim 2003). This leads to the cost analysis shown in Table 3. Observe that the serial design is better when area is the dominating factor, as in embedded devices with limited area and energy, while the parallel design gives better performance when speed is the main factor in choosing the design.

Table 3. Cost Performance Analysis

Focus | Figure of Merit | Serial Design | Parallel Design
Area and time equally | AT | 2,039,797.5 | 1,724,380
Time | AT² | 39,776,051.25 | 8,621,900
Area | A²T | 2.13373×10¹¹ | 5.94697×10¹¹
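The entries of Table 3 can be reproduced from the gate counts of Table 2 and an average time per iteration. The time values used below, T = 19.5 multiplication steps for the serial design and T = 5 for the parallel design, are not stated explicitly in the text; they are inferred here by dividing the tabulated AT values by the corresponding areas. The parallel value agrees with 3 doubling steps plus half of the 4 addition steps. The function name is illustrative.

```python
def figures_of_merit(area, time):
    """Return the three cost figures used in Table 3."""
    return {
        "AT": area * time,           # area and time weighted equally
        "AT2": area * time ** 2,     # emphasis on time
        "A2T": area ** 2 * time,     # emphasis on area
    }

# Areas from Table 2; times inferred from Table 3 (see the note above).
serial = figures_of_merit(area=104605, time=19.5)
parallel = figures_of_merit(area=344876, time=5)

for name in ("AT", "AT2", "A2T"):
    better = "serial" if serial[name] < parallel[name] else "parallel"
    print(f"{name}: serial={serial[name]:.6g} parallel={parallel[name]:.6g} "
          f"-> {better} design is better")
```

A lower figure is better, so the parallel design wins on AT and AT², while the serial design wins only when area is weighted most heavily (A²T).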

7. CONCLUSION

Parallel and serial ECC multiplication hardware designs are modeled using the improved Aladdin modulo multiplier algorithm. The study compared the designs through a VHDL implementation that investigated the tradeoffs between time and area. The analysis showed how exploiting parallelism can boost speed at the cost of roughly 3.3 times the hardware area (the serial design occupies about 30% of the area of the parallel design). The work also showed that the multipliers are worth further investigation, since they occupy the largest portion of the design, i.e. 35% of the serial design and 42% of the parallel design are consumed by multipliers. Parallelization requires more hardware (more registers and multipliers) to give the high performance needed in today's applications.

ACKNOWLEDGEMENT

Thanks to King Fahd University of Petroleum and Minerals (KFUPM) for partially hosting this research.

Thanks also to Umm Al-Qura University (UQU) for collaborative moral support toward these achievements.

REFERENCES

Amin, A., Generalized algorithm for binary modulo multiplication and multiplication-division. submitted to IEEE Transactions on Computers.

Blake, Seroussi and Smart, 1999. Elliptic Curves in Cryptography. Cambridge University Press: New York.

Gutub, A., 2003. Fast Elliptic Curve Cryptographic Processor Architecture Based On Three Parallel GF(2^k) Bit Level Pipelined Digit Serial Multipliers. IEEE 10th International Conference on Electronics, Circuits and Systems (ICECS), Sharjah, UAE, pp. 72-75.

Gutub, A. and Ibrahim, M., 2003. High Radix Parallel Architecture For GF(p) Elliptic Curve Processor. IEEE Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong.

Gutub, Ibrahim, and Al-Somani, 2007. Parallelizing GF(P) Elliptic Curve Cryptography Computations for Security and Speed. IEEE International Symposium on Signal Processing and its Applications (ISSPA), Sharjah, UAE.

Gutub, A. and Tahhan, H., 2008. Efficient Adders to Speedup Modular Multiplication for Cryptography. 5th IEEE International Workshop on Signal Processing and its Applications (WoSPA), University of Sharjah, Sharjah, UAE.

Gutub, A., 2010. Remodeling of Elliptic Curve Cryptography Scalar Multiplication Architecture using Parallel Jacobian Coordinate System. International Journal of Computer Science and Security (IJCSS), Vol.4, No.4, pp. 409-425.

Gutub, A., 2010. Preference of Efficient Architectures for GF(p) Elliptic Curve Crypto Operations using Multiple Parallel Multipliers. International Journal of Security (IJS), Vol.4, No.4, pp. 46 – 63.

Gutub, El-Shafe, and Aabed, 2011. Implementation of a pipelined modular multiplier architecture for GF(p) elliptic curve cryptography computation. Kuwait Journal of Science and Engineering (KJSE), Vol. 38, No. 2B, pp. 125-153.

Gutub, A. and Khan, F., 2012. Hybrid Crypto Hardware Utilizing Symmetric-Key & Public-Key Cryptosystems. International Conference on Advanced Computer Science Applications and Technologies (ACSAT), Palace of the Golden Horses, Kuala Lumpur, Malaysia.

Montgomery, 1985. Modular multiplication without trial division. Mathematics of computation, Vol. 44, No. 170, pp. 519-521.

Mano and Kime, 2001. Logic and Computer Design Fundamentals. 2nd ed., Prentice Hall.

Tawalbeh, L., Swedan, S. and Gutub, A., 2009. Efficient Modular Squaring Algorithms for Hardware Implementation in GF(p). Information Security Journal: A Global Perspective, Taylor & Francis, Vol. 18, No. 3, pp. 131–138.

Tawalbeh, Mohammad, and Gutub, 2010. Efficient FPGA Implementation of a Programmable Architecture for GF(p) Elliptic Curve Crypto Computations. Journal of Signal Processing Systems, Springer, Vol. 59, No. 3, pp. 233-244.