Efficient pipelined multistream AES CCMP architecture for wireless LAN


Haeyoung Rha and Hae-wook Choi

Electrical Engineering and Computer Science

Korea Advanced Institute of Science and Technology (KAIST), Taejon, 305-701, Republic of Korea

{miroandi, hwchoinew}@kaist.ac.kr

Abstract: Widespread WLANs use the Advanced Encryption Standard in Counter with Cipher Block Chaining Message Authentication Code Protocol (AES CCMP) for security services. As the next-generation WLAN adopts multiuser MIMO technology, multiple AES blocks are needed because this technology supports the transmission of multiple packets in parallel. To minimize the cost increase due to multiple AES blocks, we propose a pipelined AES architecture which processes multiple packets in parallel. By using a pipelined AES architecture instead of multiple AES blocks, efficiency is improved greatly since the pipelined architecture enables sharing of the AES core. To find the optimum number of stages, we implemented pipelined architectures with various numbers of stages. The results show that a 3-stage pipeline is the most efficient: it achieves a throughput of 4.34 Gbps and an efficiency of 2.39 Mbps/slice at an implementation cost of 1,817 slices and 0 BRAM on a Xilinx Virtex 5. This efficiency results from the pipelined architecture's high operating frequency and resource sharing.

Keywords: AES, AES-CCMP, IEEE 802.11i, multiuser MIMO, pipeline

I. INTRODUCTION

Over the past few years, wireless communication has become widespread. To offer secure communication, cryptographic algorithms have been adopted, and the Advanced Encryption Standard (AES) [8] has been widely used in wireless communication. In the IEEE 802.11i standard for Wireless Local Area Networks (WLAN), AES CCMP is adopted to provide both privacy and data integrity. AES CCMP combines the AES counter mode (CTR) with the AES cipher block chaining message authentication code (CBC-MAC). The increasing demand for faster access has led to increasing demand for a higher throughput AES. Next-generation WLAN adopts multiuser MIMO (MU-MIMO) technology, which transmits multiple packets for multiple users at the same time.

To transmit packets securely, an AES CCMP hardware block per packet has been used. Using multiple AES CCMP blocks causes a cost increase due to the number of used AES CCMP blocks. Thus, both high throughput and efficiency should be considered in choosing AES CCMP architecture.

An iterative AES algorithm can be implemented as an unfolded loop structure or as a fully folded structure. An unfolded structure gives higher throughput, and a folded iterative structure gives lower area [10]. Traditionally, an unfolded structure with pipelining has been used to increase throughput in AES architectures. The work in [1] reaches over 30 Gbps through a subpipelining structure. However, due to the feedback nature of AES CBC mode, pipelining AES CCM is not efficient [6]. Instead, an iterative architecture has been used [2-7]. There are two ways to implement the iterative CCMP architecture. The first is to operate two AES blocks in parallel, one in CTR mode and one in CBC mode [3]. The second is to operate a single AES block sequentially, one mode at a time [4].

A separate AES block shows higher throughput at the cost of area.

In this work, we implemented a pipelined AES CCM hardware architecture based on a folded iterative architecture. Unlike the traditional iterative architecture of AES CCMP, a pipelined architecture becomes efficient when multiple AES CCMP encryptions are performed in one AES core, which is the case in next-generation WLAN because it supports the simultaneous transmission of multiple packets. In this architecture, resources are shared because multiple packets are processed by one core. Moreover, since pipelining shortens the critical path, the operating frequency can be increased. However, the efficiency gain saturates, so we implemented several architectures to find the most efficient one.

In this paper, the AES encryption algorithm and the AES CCMP mode are explained in section II. The proposed architectures and implementations are discussed in detail in section III. Four different numbers of pipeline stages were implemented, and a comparison of the architectures is given in section IV. Finally, the conclusion follows.

II. OVERVIEW OF AES CCMP PROTOCOL

The Rijndael algorithm has been selected as the new AES algorithm by the National Institute of Standards and Technology (NIST) [8]. Also, the AES algorithm [11] in the CCM mode was proposed to be implemented in the IEEE 802.11i standard which defines security protocols for WLAN networks. Since AES CCM mode uses only AES encryption, the AES encryption algorithm and CCM mode will be summarized in the following subsections.

A. AES encryption

Figure 1. Authentication with AES CBC-MAC mode

Figure 2. Encryption with AES CTR mode

AES is a symmetric block cipher that encrypts or decrypts a 128-bit data block using a cipher key with a length of 128, 192, or 256 bits, which results in 10, 12, or 14 rounds of operation, respectively. The 128-bit key is the most commonly used. A 128-bit data block is arranged in a 4 by 4 byte matrix, and in each round the block is transformed by the following four basic transformations. In the last round, only three transformations are executed; the MixColumns transformation is omitted. The arithmetic operations, addition and multiplication, are over the finite field GF(2⁸).

• SubBytes transformation is applied to the state of matrix for the nonlinearity. Each input byte of the state matrix is substituted in two steps. First, each input state is replaced with its multiplicative inverse in GF(2⁸) and followed by an affine transformation. There are two ways to implement SubBytes transformation, using a lookup table or calculating logically.

• ShiftRows transformation is used for inter column diffusion. Cyclic shift rows of a data block array using 0, 1, 2, and 3-byte offsets to the left. Since the operation contains no calculations, it can be simply implemented by routing appropriately from SubBytes transformation outputs to MixColumns inputs.

• MixColumns transformation diffuses each element. The four elements of each column are considered as a polynomial over GF(2⁸) and multiplied modulo x⁴ + 1 by the fixed polynomial c(x), given by c(x) = (03)x³ + (01)x² + (01)x + (02). This calculation uses Galois field multiplication, which can be done by subtracting (XORing) the reduction polynomial x⁸ + x⁴ + x³ + x + 1 whenever a product no longer fits in 8 bits; for multiplication by 2, this is the case when the input byte is 128 or larger.

• AddRoundKey transformation is addition (XOR in GF(2⁸)), the final operation to be performed in each round. The output of MixColumns is exclusive ORed with a round key generated from the key scheduler.

To provide round keys, the key scheduler generates a key for each round from the original key. In the first round, the original key is used to generate the round key, and in the remaining rounds the generated round key is fed back to generate the next round key. The key scheduler takes a 4-word key as input and produces a linear array of 44 words.
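The two-step SubBytes description above (multiplicative inverse in GF(2⁸), then an affine transformation) can be sketched in software as follows. This is a minimal illustration of the "calculating logically" option, not the paper's hardware design; the helper names `gf_mul`, `gf_inverse`, `rotl8`, and `sub_byte` are hypothetical, and the affine step is written in the commonly used rotate-and-XOR form with the constant 0x63 from the AES specification.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a          # add (XOR) the current multiple of a
        carry = a & 0x80
        a = (a << 1) & 0xFF       # multiply a by x
        if carry:
            a ^= 0x1B             # reduce by the AES polynomial
        b >>= 1
    return product


def gf_inverse(a: int) -> int:
    """Multiplicative inverse computed as a^254; 0 maps to 0 by convention."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def rotl8(x: int, n: int) -> int:
    """Rotate an 8-bit value left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sub_byte(a: int) -> int:
    """SubBytes for one byte: inverse in GF(2^8), then the affine transformation."""
    inv = gf_inverse(a)
    return inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63


# The lookup-table variant is simply the same function evaluated for all 256 inputs.
SBOX = [sub_byte(x) for x in range(256)]
```

Precomputing `SBOX` corresponds to the lookup-table option: the same mapping is stored as 256 bytes instead of being computed for every input.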

B. AES CCM protocol

AES CCMP uses the AES encryption algorithm to generate encrypted data and authentication data at the same time. Traditionally, encryption and authentication were provided by two different cryptographic primitives, but the CCM protocol supports both security services with a single algorithm, AES. It uses only the AES encryption algorithm with a fixed 128-bit key.

The CCM protocol is composed of CTR mode for confidentiality and CBC-MAC mode for authentication.

AES in CBC mode is used to generate the authentication data, as shown in Fig. 1. The Message Integrity Code (MIC) computation begins by deriving the initialization vector (IV) from the packet header and encrypting it with AES. The AES encryption result is then exclusive ORed with the next block, and this value serves as the IV for the next block. This procedure is repeated for the rest of the packet. For encryption, 8 bytes of the CBC mode result are appended to the end of the packet as the MIC value.

The MIC value is used to verify the received packet. During the decryption procedure, the MIC value is calculated. If the calculated MIC value and the attached MIC value do not match, then the decrypted packet fails to ensure data integrity.
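The CBC-MAC chaining described above can be sketched as follows. This is an illustration of the chaining structure only: `toy_block_encrypt` is a hypothetical stand-in built from a truncated keyed hash and is not AES, the IV block is passed in rather than derived from a real MAC header, and further details of the CCM specification are left out.

```python
import hashlib

BLOCK = 16  # bytes in one 128-bit block


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for AES-128 block encryption (NOT AES): a truncated keyed hash."""
    assert len(block) == BLOCK
    return hashlib.sha256(key + block).digest()[:BLOCK]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise exclusive OR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_mac(key: bytes, iv_block: bytes, data: bytes,
            encrypt=toy_block_encrypt, mic_len: int = 8) -> bytes:
    """Encrypt the IV, then repeatedly XOR in the next block and encrypt again."""
    if len(data) % BLOCK:
        data += bytes(BLOCK - len(data) % BLOCK)   # zero-pad the last block
    state = encrypt(key, iv_block)
    for offset in range(0, len(data), BLOCK):
        state = encrypt(key, xor_bytes(state, data[offset:offset + BLOCK]))
    return state[:mic_len]                         # 8-byte MIC value


key = bytes(range(16))
iv = bytes(16)
mic = cbc_mac(key, iv, b"example payload that spans more than one block")
```

Because each encryption needs the previous result, the blocks of one packet cannot overlap in a pipeline; this is the feedback dependency that the rest of the paper works around.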

To generate the encrypted data, AES with CTR mode is used as shown in Fig. 2. In this mode, a counter value is encrypted. The procedure begins with AES calculation of the counter value which is generated from the flags, nonce, and length of the payload field. The encrypted counter values are exclusive ORed with the first block of data. Then, the counter value is updated and CTR repeats the previous steps for the next 128-bit block until the final block is processed. The last block is zero padded if it is smaller than 128 bits, and the calculated AES is truncated depending on the last block size.
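A minimal sketch of the CTR procedure described above is given below. The block function `toy_block_encrypt` is again a hypothetical stand-in (a truncated keyed hash, not AES), and the counter is modeled as a plain 128-bit integer rather than being assembled from the flags, nonce, and length fields.

```python
import hashlib

BLOCK = 16  # bytes in one 128-bit block


def toy_block_encrypt(key: bytes, block: bytes) -> bytes:
    """Stand-in for AES-128 block encryption (NOT AES): a truncated keyed hash."""
    assert len(block) == BLOCK
    return hashlib.sha256(key + block).digest()[:BLOCK]


def ctr_crypt(key: bytes, initial_counter: int, data: bytes,
              encrypt=toy_block_encrypt) -> bytes:
    """Encrypt successive counter values and XOR the result onto each data block."""
    out = bytearray()
    for index, offset in enumerate(range(0, len(data), BLOCK)):
        counter = (initial_counter + index) % (1 << 128)
        keystream = encrypt(key, counter.to_bytes(BLOCK, "big"))
        chunk = data[offset:offset + BLOCK]
        # zip() stops at the shorter input, so the keystream is truncated
        # to the size of a short final block.
        out += bytes(c ^ k for c, k in zip(chunk, keystream))
    return bytes(out)


key = bytes(range(16))
plaintext = b"thirty-seven bytes of example payload"
ciphertext = ctr_crypt(key, 1, plaintext)
recovered = ctr_crypt(key, 1, ciphertext)   # the same operation decrypts
```

Since every counter block is independent of the others, CTR mode has no feedback path and pipelines naturally, unlike CBC-MAC.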

As explained above, the CBC-MAC and CTR modes use the same AES encryption core. In some architectures, these two modes are performed in different AES blocks [3]. If lower throughput is acceptable, the two modes share an AES block and operate it sequentially, one block at a time [4]. CTR and CBC can also be pipelined in a single AES block, but this architecture is rarely used.

Figure 3. Architecture of 4 stage pipelined AES CCMP

Figure 4. Datapath of 4 pipelined AES (a) and datapath of key scheduler (b). Panels: (a) Pipelined AES encryption datapath; (b) Pipelined key scheduler

III. SYSTEM ARCHITECTURE AND IMPLEMENTATION

To achieve higher throughput and efficiency, various architectures have been adopted for AES. Pipelining achieves high throughput when applied to CTR mode and ECB mode, but the feedback nature of CBC mode prevents the use of a pipelined architecture because each block must wait for the result of the previous one [6]. However, if multiple packets are processed in parallel, a pipelined architecture can be applied to CBC mode efficiently.

In the proposed architecture, multiple packets are pipelined even in CBC mode. Inserting registers in the critical path raises the operating frequency at the cost of a longer latency for each packet. Area is substantially reduced since multiple packets share one AES block instead of using multiple AES blocks. As a result, total throughput and efficiency increase with a reasonable decrease in per-packet throughput.

As the number of pipeline stages increases, the total throughput and efficiency increase, but they saturate at some point. Moreover, the decreased per-packet throughput may no longer meet the throughput required for a single packet. Thus, the number of pipeline stages should be considered when choosing an AES CCM architecture. In this paper, we implemented pipelined architectures with one to four stages. A detailed description of the architecture is given in this section.

A. System architecture

Various architecture parameters should be considered in architecture design. These include the number of pipeline stages, resource sharing, such as the sharing of AES CTR module and AES CBC module, or MAC header analyzer sharing. Implementation results for various numbers of pipeline stages and analysis will be given in section IV.

Sharing an AES core for CTR mode and CBC-MAC mode was considered, but implementation results showed that this is inefficient. This is because mode-sharing architecture needs two separate key schedulers and dummy registers for two AES cores operating in CTR mode and CBC, respectively.

The structure of the designed architecture is shown in Fig. 3. As shown in this block diagram, the 4-stage pipelined AES CCM system consists of 4 MAC header analyzers, a buffer, and an AES CCM core. The MAC header analyzer is not shared, while the buffer and the AES CCM core are shared, so the number of MAC header analyzers equals the number of streams to be handled. The MAC header analyzer generates the nonce, MIC IV, and MIC headers from the received MAC header and arranges them in one-block units. The received payload is also organized in one-block units, and the last block is zero padded if needed. The outputs of the four MAC header analyzers are arbitrated and saved in one dual-port memory buffer.

To support pipelined operation, received packets are arranged so that their blocks are interleaved in order. The clock frequency of the AES CCMP core is 1, 2, 3, or 4 times that of the rest of the implementation, depending on the number of stages.
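The effect of interleaving on the CBC feedback dependency can be illustrated with a small scheduling model. This is an abstract sketch, not a description of the actual buffer logic: it assumes one block enters the pipeline per slot in round-robin order over the streams, and that a result becomes available `latency` slots after its block was issued. The function names are hypothetical.

```python
def interleave_schedule(num_streams: int, blocks_per_stream: int):
    """Issue order as (slot, stream, block_index), round-robin over the streams."""
    schedule = []
    slot = 0
    for block_index in range(blocks_per_stream):
        for stream in range(num_streams):
            schedule.append((slot, stream, block_index))
            slot += 1
    return schedule


def feedback_ready(schedule, latency: int) -> bool:
    """Check that each block is issued at least `latency` slots after the
    previous block of the same stream, so its CBC input is already available."""
    last_issue = {}
    for slot, stream, _ in schedule:
        if stream in last_issue and slot - last_issue[stream] < latency:
            return False
        last_issue[stream] = slot
    return True


four_streams_ok = feedback_ready(interleave_schedule(4, 3), latency=4)
one_stream_ok = feedback_ready(interleave_schedule(1, 3), latency=4)
```

In this model, four interleaved streams keep a 4-slot pipeline full without violating the CBC dependency, while a single stream would have to stall between consecutive blocks.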

B. 4-stage pipelined AES CCMP core

Since the pipelined architectures share the same basic structure and differ only in the number of stages, we describe the 4-stage pipelined AES CCM core. The AES CCM core consists of a key scheduler, two pipelined AES cores, and an AES controller. A block diagram of the datapath of AES encryption and the key scheduler is shown in Fig. 4. One AES core can encrypt 4 blocks at a time, and the key scheduler can likewise generate round keys from 4 original keys. To share hardware resources, a single round of AES is implemented and reused for every round.

The SubBytes transformation can be implemented in hardware in two ways: as a 256-byte lookup table with an 8-bit input and an 8-bit output, or by calculating the transformation logically. Since SubBytes is the most complex part of a round, it is implemented in several stages in pipelined architectures. AES CCM implementations, on the other hand, use a lookup table because of the timing overhead, and the table is usually mapped to the dedicated embedded memory of the FPGA. We use the SubBytes lookup table to finish a round within 4 clock cycles and to obtain a higher operating frequency on the FPGA. More importantly, implementing the table as ROM or RAM in an ASIC increases area cost and routing delay because of the distance between the logic and the ROM or RAM. Thus, mapping to logic is more appropriate for the SubBytes implementation.

Figure 5. Timing diagram of 4 pipelined AES core

The MixColumns transformation requires multiplication and addition in GF(2⁸). AES encryption requires constant multiplication by 2 and 3 in GF(2⁸), as described in section II. Since 3 times a polynomial can be obtained by adding 2 times the polynomial to the original polynomial, only multiplication by 2 needs to be implemented.
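The observation that only multiplication by 2 is needed can be sketched for one state column as follows. This is a software illustration under the standard AES definitions, not the paper's datapath; `xtime` and `mix_column` are hypothetical helper names.

```python
def xtime(a: int) -> int:
    """Multiply a byte by 2 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    if a & 0x100:          # the product no longer fits in 8 bits
        a ^= 0x11B         # subtract (XOR) the reduction polynomial
    return a & 0xFF


def mix_column(column):
    """MixColumns for one 4-byte column, using only xtime and XOR."""
    a0, a1, a2, a3 = column
    d0, d1, d2, d3 = (xtime(x) for x in column)      # 2 * a_i
    t0, t1, t2, t3 = d0 ^ a0, d1 ^ a1, d2 ^ a2, d3 ^ a3  # 3 * a_i = 2 * a_i + a_i
    return [
        d0 ^ t1 ^ a2 ^ a3,
        a0 ^ d1 ^ t2 ^ a3,
        a0 ^ a1 ^ d2 ^ t3,
        t0 ^ a1 ^ a2 ^ d3,
    ]


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

The first line of `mix_column` after unpacking corresponds to the multiply-by-2 stage, and the remaining XORs correspond to the additions that complete the column.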

In this pipelined structure, the MixColumns transformation has three stages. In the first stage, 2 times the polynomial is obtained, and in the next two stages the rest of the computation is done, as shown in Fig. 4 (a). AddRoundKey is performed in the third stage.

There are two critical paths in this architecture. The first is the feedback path from the CBC output to the CBC input. The second occurs during decryption, where the CTR-decrypted result is needed to calculate the CBC input. Fig. 5 shows the encryption timing diagram of the proposed architecture. To minimize the number of registers, the timing of data input and data output is carefully arranged. A block of encrypted data is obtained after 40 clock cycles.

C. Registering in pipelined AES CCMP core

Register insertion depends on the delay of the critical path. For the 1-stage pipeline architecture, only reg2 and reg6 in Fig. 4 are used. The other registers are not inserted, and the critical path is one round of AES encryption. To shorten the critical path in the 2-stage pipelined architecture, reg4 and reg8 are inserted. In this architecture, the critical path runs from reg4 to reg2, the SBox output. In the 3-stage pipelined architecture, reg1 and reg5 are inserted, and the SBox read path is the critical path. To shorten the critical path further, reg3 and reg7 are inserted in the 4-stage pipelined structure.

IV. IMPLEMENTATION RESULT AND COMPARISON

In this section, the implementation results of the proposed design are given and compared with previous AES CCM mode architectures. Since there are many cost parameters and the implementations target different devices, the comparison focuses only on area efficiency.

A. Implementation result

The proposed hardware was designed in Verilog HDL, and simulation was carried out with ModelSim 6.6 for functional verification. The designed AES CCM architecture was mapped to a Xilinx FPGA to compare the results with other works. The design was synthesized with Synplify Pro and placed and routed in the Xilinx ISE 9.2 design suite. The target device was a Virtex5 xc5vlx110ff676, speed grade -1. Table I shows the implementation results for the various pipelined AES CCMP implementations on Xilinx Virtex5. For the implementation cost comparison, used resources, clock frequency, throughput, and efficiency are considered. As seen in Table I, the 3-stage pipelined architecture is more efficient than the 1-, 2-, and 4-stage pipelined architectures. In this table, throughput is obtained as the processed block size multiplied by the operating clock frequency and divided by the number of clock cycles. Efficiency is the ratio of throughput to area. Area means the number of occupied slices; embedded memory is not included. The most efficient architecture, the 3-stage pipeline, operates at 339 MHz and achieves a throughput of 4.34 Gbps on Virtex5.

B. Comparison of different stage pipelined architectures

The most efficient architecture is the 3-stage pipelined architecture, as shown in Table I. As the number of pipeline stages increases, the operating frequency increases while the area of the AES CCM core increases only slightly. This leads to increased efficiency, but the operating frequency saturates at 3 stages. This is because the critical path in the 3-stage pipeline architecture, the SBox lookup table read path, cannot be divided into further stages. Thus, inserting more registers does not increase the operating speed much. Rather, the 4-stage pipeline architecture is less efficient because of the area added by the extra registers.

In Table I, the total area increase is mainly due to the increase in dual-port memory size, since the dual-port memory is mapped to flip-flops.

TABLE I. EFFICIENCY IN RELATION TO THE NUMBER OF PIPELINE STAGES

Target device      Stage  Clock freq. (MHz)  Occupied slices  Throughput (Gbps)  Efficiency (Mbps/slice)
xc5vlx110ff676-1   1      213                1809             2.72               1.5
xc5vlx110ff676-1   2      312                1809             4.00               2.21
xc5vlx110ff676-1   3      339                1817             4.34               2.39
xc5vlx110ff676-1   4      357                2155             4.57               2.12
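The throughput and efficiency columns of Table I can be checked with a few lines of arithmetic. The sketch below assumes a 128-bit block and 10 clock cycles per block; the cycle count is not stated explicitly for every configuration and is inferred here from the 10 rounds of AES-128 and from the Table I figures, so it should be read as an assumption. The function names are hypothetical.

```python
BLOCK_BITS = 128        # processed block size in bits
CYCLES_PER_BLOCK = 10   # assumption: inferred from AES-128 rounds and Table I

# (stages, clock frequency in MHz, occupied slices), copied from Table I
TABLE_I = [(1, 213, 1809), (2, 312, 1809), (3, 339, 1817), (4, 357, 2155)]


def throughput_gbps(clock_mhz: float) -> float:
    """Throughput = block size x clock frequency / clock cycles per block."""
    return BLOCK_BITS * clock_mhz / CYCLES_PER_BLOCK / 1000.0


def efficiency_mbps_per_slice(clock_mhz: float, slices: int) -> float:
    """Efficiency = throughput / occupied slices."""
    return throughput_gbps(clock_mhz) * 1000.0 / slices


results = {
    stages: (throughput_gbps(mhz), efficiency_mbps_per_slice(mhz, slices))
    for stages, mhz, slices in TABLE_I
}
best_stage = max(results, key=lambda stages: results[stages][1])

for stages, (gbps, eff) in results.items():
    print(f"{stages}-stage: {gbps:.2f} Gbps, {eff:.2f} Mbps/slice")
```

Under these assumptions the computed values agree with Table I to within 0.01, and the 3-stage configuration has the highest efficiency even though the 4-stage configuration has the highest raw throughput.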

C. Comparison with references

Table II shows relevant results from related works. References [2] to [4] report architectures with one AES CCM core, and the architecture in [5] has multiple cores. Since the proposed architecture can process multiple packets in parallel, it occupies more slices than the architectures in [2]-[4]. However, the operating frequency is several times higher than in previous works due to the reduced critical path of the pipelined architecture. Although it is difficult to compare results across different types of FPGAs, the efficiency of the proposed architecture compares well with those reported in previous works. The efficiency of the proposed architecture is 2.39 Mbps/slice on Xilinx Virtex5, while the efficiencies reported in previous works are between 0.25 and 0.802. The proposed architecture on Virtex5 shows especially high efficiency, even though block memory in the FPGA is not used.

TABLE II. IMPLEMENTATION RESULTS OF AES CCM ARCHITECTURE

Work       Device             Slices       BRAM    Clock freq. (MHz)  Throughput (Gbps)  Efficiency (Mbps/slice)
[2]        Xc2vp20fg676       6766 CLB     0       148                1.722              0.25
[3]        Xc3s700afg400      1803         4 ROMs  105                0.588              0.32
[4]        xc5vlx50ff676      2378 (LUT)   10      127                1.637              0.68 (Mbps/LUT)
[5]        Virtex4            21832        37      193                17.518             0.802
This work  xc5vlx110ff676-1   1817         0       339                4.34               2.39

V. CONCLUSION

High-throughput encryption has been a very important issue. In addition, increasing efficiency is becoming more important because the next-generation WLAN increases the number of simultaneous packets due to the adoption of multiuser MIMO technology. Since an increase in the number of AES blocks leads to a cost increase, efficiency should be considered seriously. In this paper, pipelining is introduced in AES CCM hardware for communication. Traditionally, pipelining has not been considered applicable to AES CCM mode. By adopting a pipelined architecture for multiple streams, a CCMP AES cipher for IEEE 802.11i with a high operating frequency, high total throughput, and an efficient design is implemented. In this work, 4 different numbers of pipeline stages were implemented. The results showed that the 3-stage pipelined architecture was the most efficient, operating at 339 MHz with a throughput of 4.34 Gbps; 1,817 slices without BRAM were used and 2.39 Mbps/slice was achieved.

REFERENCES

[1] I. Hammad, K. El-Sankary, and E. El-Masry, “High-Speed AES Encryptor With Efficient Merging Techniques,” IEEE Embedded Systems Letters, vol 2, pp. 67-71, 2010.

[2] C. Sivakumar and A. Velmurugan, “High Speed VLSI Design CCMP AES Cipher for WLAN (IEEE 802.11i),” IEEE - ICSCN 2007, MIT Campus, Anna University, Chennai, India. Feb. 22-24, pp.398-403, 2007.

[3] J. Ji, S. Jung, E. Jun, and J. Lim, “Efficient Sequential Architecture for the AES CCM Mode in the 802.16e Standard,” 2009 Second International Conference, pp.253-256, 2009.

[4] I. Algredo-Badillo, C. Feregrino-Uribe, R. Cumplido, and M. Sandoval, “FPGA Implementation Cost and Performance Evaluation of the IEEE 802.16 and IEEE 802.11i Security Architectures Based on AES-CCM,” 2008 5th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE 2008), pp.304-309, 2008.

[5] S.A. Hoseini, B. Khodabandeloo, M. Jelodari Mamaghani, P. Teymoori, and N. Yazdani, “High throughput low power CCMP architecture for very high speed wireless LANs,” Computer Architecture and Digital Systems (CADS), 2010 15th CSI International Symposium, pp.59-65, Sept. 2010.

[6] F. Rodriguez-Henriquez, N.A. Saqib, A. Díaz Pérez, and C.K. Koc, “Cryptographic algorithm on reconfigurable hardware,” Springer, chapter 9, 2006.

[7] D. Whiting, R. Housley, and N. Ferguson, "Counter with CBC-MAC (CCM)," RFC 3610, Internet Eng. Task Force, Sept. 2003.

[8] FIPS Publication 197, “Advanced Encryption Standard,” U.S. Doc/NIST, 2001.

[9] IEEE 802.11w, “Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications Amendment 4: Protected Management Frame,” Sept. 2009.

[10] R.Chaves, G. Kuzmanov, S. Vassiliadis, L. Sousa, “Reconfigurable memory based AES co-processor”, Parallel and Distributed Processing Symposium, 2006.

[11] W. Stallings, “Cryptography and Network Security,” Pearson, chapter 5, 2006.