
Hardware-assisted estimation of entropy norm for high-speed network traffic

Yu-Kuen Lai, Theophilus Wellem and Hui-Ping You

The computation of the entropy of a high-speed data stream in a one-pass fashion is crucial to many network security applications. Motivated by the work of Lall et al., this study examines the design trade-off between processing speed and accuracy for estimating the entropy norm. The proposed scheme leverages the Count Sketch, which needs only constant memory access for counter update and point query operations. With a bounded relative error and a constant memory access cycle, the design can process incoming traffic with a throughput of 30 Gbit/s.

Introduction: Entropy, which measures the degree of concentration and dispersion of a given feature space, is utilised as an important indicator of changes in network traffic behaviour [1]. Accordingly, the entropy-based method is regarded as an important building block in various anomaly detection [2] and traffic monitoring systems [3, 4]. Owing to the high speed of Internet traffic, tracking all packets [5] to enable complex analysis in real time is difficult. Consequently, as the speed of networks is rapidly increasing beyond today's 100 Gbit/s throughput, advanced methodologies [6] are required to reduce the processing and storage costs incurred. This work extends the work of Lall et al. [5] by using the point query of Count Sketch [7] in place of the exact counting procedure for entropy norm estimation. Without an unpredictable memory access cycle due to hash collision resolution for pointer lookup and extra tag storage, the proposed design can process very high-speed network traffic for entropy computation.

Background: A data stream $\phi = (a_1, a_2, \ldots, a_m)$ is a massive sequence of $m$ elements. The $t$th element, $a_t = (k, u)$, comprises a key $k$ and an update $u$. The element can be regarded as a packet of the Internet protocol (IP) traffic. As an example, the key $k \in [2^{104}]$ can be used to represent a 5-tuple traffic flow ID consisting of source and destination IP addresses, source and destination ports and a protocol number. The update $u$ represents either packet length or packet count ($u = 1$).

This Letter is concerned with measuring the entropy of the traffic stream based on specified flow IDs (keys) in the packet header. The entropy is defined as in (1), where $m$ is the total number of packets in the stream and $f_k$ is the total number of packets for a given flow ID $k \in [n]$

$$H = -\sum_{k=1}^{n} \frac{f_k}{m} \log \frac{f_k}{m} \qquad (1)$$

$$\phantom{H} = \log(m) - \frac{1}{m} \sum_{k=1}^{n} f_k \log(f_k) \qquad (2)$$
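As a small numerical check of the step from (1) to (2), the sketch below computes the entropy both ways on a toy stream. The stream contents and function names are hypothetical, and the natural logarithm is an assumption, since the Letter does not state the base.

```python
import math
from collections import Counter


def entropy_direct(counts, m):
    """Equation (1): H = -sum_k (f_k / m) * log(f_k / m)."""
    return -sum((f / m) * math.log(f / m) for f in counts.values())


def entropy_via_norm(counts, m):
    """Equation (2): H = log(m) - S / m, with S = sum_k f_k * log(f_k)."""
    s = sum(f * math.log(f) for f in counts.values())
    return math.log(m) - s / m


# Hypothetical toy stream of flow IDs (illustrative only, not from the Letter).
stream = ["a", "b", "a", "c", "a", "b", "d", "a"]
counts = Counter(stream)      # f_k for every key k
m = len(stream)               # total number of packets

h_direct = entropy_direct(counts, m)
h_norm = entropy_via_norm(counts, m)
print(h_direct, h_norm)
```

Both forms give the same value up to floating-point rounding, so only $m$ and the sum $\sum_k f_k \log f_k$ have to be tracked to recover $H$.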

Equation (1) can be further formulated as $H = \log(m) - S/m$, where $S \triangleq \sum_{k=1}^{n} f_k \log(f_k)$ is the 'entropy norm'. Chakrabarti et al. [8] and Lall et al. [5] developed an $(\varepsilon, \delta)$-approximation scheme to estimate the entropy norm. Motivated by the work of Alon et al. [9], the scheme selects the $y$th item (containing the key $k_y$) uniformly at random from the incoming data stream of length $m$. A counter $C$ keeps an exact count of the items from that position onward that match the key $k_y$. As shown in Algorithm 1, the estimated entropy norm $S$ can be computed directly from the value of the counter as $S = m(C \log C - (C-1)\log(C-1))$. To further constrain the error of the estimated entropy norm, a $p \times q$ array of counters $C_{ij}$ is used. The counter array is arranged in $p := 2\log(1/\delta)$ groups. For each group, the mean value is computed from $q := 32 \log m/\varepsilon^2$ counters. Then, the value of $\mathrm{median}_{1 \le i \le p}\left(\frac{1}{q}\sum_{j=1}^{q} S_{ij}\right)$ over the $p$ groups is taken as the final entropy norm.

Algorithm 1 Basic estimator for entropy norm estimation [5, 8]

A data stream $\phi = (a_1, a_2, \ldots, a_m)$ has $m$ items, where the $t$th item $a_t(k)$ consists of a key $k \in [n]$.

1. Choose a number $y$ uniformly at random from $\{1, 2, \ldots, m\}$.
2. Maintain a counter $C = |\{r : a_r(k) = a_y(k),\ y \le r \le m\}|$.
3. Output $S = m(C \log C - (C-1)\log(C-1))$
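The basic estimator and the median-of-means combination over $p$ groups of $q$ estimators can be sketched as follows. This is an offline illustration that rescans a stored list for every sample, not the one-pass hardware procedure; the function names, the toy stream and the small values of `p` and `q` are assumptions chosen for readability, and the natural logarithm is assumed.

```python
import math
import random
import statistics


def basic_estimate(stream, y):
    """Algorithm 1 for a fixed 1-based position y (exact counting)."""
    key = stream[y - 1]
    # C counts the items equal to the sampled key from position y to the end.
    c = sum(1 for item in stream[y - 1:] if item == key)
    m = len(stream)
    # (C - 1) * log(C - 1) is taken as 0 when C == 1.
    tail = (c - 1) * math.log(c - 1) if c > 1 else 0.0
    return m * (c * math.log(c) - tail)


def entropy_norm_estimate(stream, p, q, rng):
    """Median over p groups of the mean of q basic estimators."""
    m = len(stream)
    group_means = []
    for _ in range(p):
        samples = [basic_estimate(stream, rng.randint(1, m)) for _ in range(q)]
        group_means.append(sum(samples) / q)
    return statistics.median(group_means)


# Hypothetical toy stream: 10 keys, 100 packets each (not from the Letter).
stream = [i % 10 for i in range(1000)]
exact_s = 10 * 100 * math.log(100)          # S = sum_k f_k * log(f_k)
rng = random.Random(1)
estimate = entropy_norm_estimate(stream, p=4, q=200, rng=rng)
print(exact_s, estimate)
```

Averaged over all positions $y$, the basic estimator equals $S$ exactly, because the terms $C \log C - (C-1)\log(C-1)$ telescope for each key; the grouping into means and a median only reduces the variance of a finite sample.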

Motivation: To track the number of all incoming packets that match the selected keys, extra labels and pointers have to be kept in the memory for key matching to resolve hash collisions. These extra data may occupy up to 200 bits of memory [5]. Moreover, packets with the same key may be selected at various locations in the stream. The number of these packets is then recorded in different corresponding locations of $C_{ij}$. Since key matching and counter updating operations must be conducted at wire speed for all packets in the incoming traffic, a series of unpredictable memory read and write accesses on the counter array $C_{ij}$ becomes the critical bottleneck of the measurement system.

Fig. 1 Block diagram of proposed system based on Count Sketch for entropy norm computation on NetFPGA platform
(Blocks shown: MAC RxQ inputs; user data path with 2-universal hash functions $h_t$(key) and $g_t$(key); Count Sketch $CS_{ij}$ ($e \times t$), updated for all packets; uniform sampling of ($p \times q$) packets with sketch query; counter array $C_{ij}$; PCI Express; host operating system (ioctl) with entropy computation)

System design: The main innovation of the proposed scheme is the tracking of packet counts without the need for keeping extra labels and pointers in the memory by using the Count Sketch [7] data structure. Count Sketch is an $(\varepsilon, \delta)$-approximation algorithm that exploits the probabilistic properties of a universal class of hash functions to guarantee the accurate estimation of various attributes of a data stream. Count Sketch is compact and can summarise large numbers of data elements rapidly without keeping extra stateful information.

Algorithm 2 Count Sketch

Initialise:

1 Set $CS_{ij} = 0$, where $1 \le i \le e$, $1 \le j \le t$, $e := 3/\varepsilon^2$ and $t := O(\log(1/\delta))$;
2 Choose $t$ independent hash functions $h_t : [n] \to [e]$ from a 2-universal family;
3 Choose $t$ independent hash functions $g_t : [n] \to \{+1, -1\}$ from a 2-universal family;

Update: (item $a = (k, u)$ consists of a 'key $k$' and an 'update $u$')

1 for $j = 1$ to $t$ do
    $CS_{h_j(k),j} = CS_{h_j(k),j} + u \cdot g_j(k)$;

Query:

1 On a query of an item $a$ (consists of a 'key $k$'), report $\hat{f}_a = \mathrm{median}_{1 \le j \le t}\ g_j(k) \cdot CS_{h_j(k),j}$.
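A minimal software sketch of the update and query steps of Algorithm 2 is given below. The class name, the seed handling and the Carter-Wegman style hash $(ak + b) \bmod P$ are assumptions made for illustration; the Letter does not describe the hash construction used in hardware. The table is stored row by row, so `table[j][h_j(k)]` plays the role of $CS_{h_j(k),j}$.

```python
import random
import statistics


class CountSketch:
    """Illustrative Count Sketch with t rows (depth) of e counters (width)."""

    PRIME = (1 << 61) - 1  # Mersenne prime, larger than any 32 bit key

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row for the bucket hash h_j and the sign hash g_j.
        self.h = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                  for _ in range(depth)]
        self.g = [(rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
                  for _ in range(depth)]

    def _bucket(self, j, key):
        a, b = self.h[j]
        return ((a * key + b) % self.PRIME) % self.width

    def _sign(self, j, key):
        a, b = self.g[j]
        return 1 if ((a * key + b) % self.PRIME) & 1 else -1

    def update(self, key, u=1):
        """Add the signed update u to one counter in every row."""
        for j in range(self.depth):
            self.table[j][self._bucket(j, key)] += u * self._sign(j, key)

    def query(self, key):
        """Median over the rows of sign * counter, as in the Query step."""
        return statistics.median(
            self._sign(j, key) * self.table[j][self._bucket(j, key)]
            for j in range(self.depth)
        )


# Usage: three rows of 32 k counters, matching the sketch size used in the Letter.
cs = CountSketch(width=32 * 1024, depth=3, seed=7)
for _ in range(5):
    cs.update(42)          # five packets of a hypothetical flow with key 42
print(cs.query(42))
```

Every update and query touches exactly one counter per row, which is why the memory access pattern stays constant regardless of collisions; collisions affect only the accuracy of the estimate.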

As shown in Algorithm 2, the Count Sketch is constructed based on a two-dimensional array of counters $CS_{ij}$, $1 \le i \le e$, $1 \le j \le t$. Hash functions $h_t$ and $g_t$ from a 2-universal family are used to hash the key of each incoming packet. The point query of Count Sketch provides the estimated packet count accumulated so far for a given key. With high probability (at least $1 - \delta$), the error of the estimated point query $\hat{f}_k$ is bounded by the value of $\varepsilon\sqrt{F_2}$ [7], where $F_2$ represents the second frequency moment of the packet stream

$$\mathrm{Prob}\left[\,|\hat{f}_k - f_k| \le \varepsilon\sqrt{F_2}\,\right] \ge (1 - \delta) \qquad (3)$$

The proposed entropy computation system, presented in Fig. 1, comprises two major parts. The hardware part, consisting of the Count Sketch data structure $CS_{ij}$, the counter array $C_{ij}$ and sampling modules, is responsible for tracking packet counts at line rate. For each arriving packet, sketch updates must be performed at wire speed. When an incoming packet has been selected, Count Sketch is immediately queried. The result of the query is an $(\varepsilon, \delta)$-approximation of the current packet count. The estimated packet count $\hat{f}_a$ is placed in the counter array $C_{ij}$ for the estimator $S$. We argue that the point query from Count Sketch can mimic the behaviour that is described in Algorithm 1, because the arrival sequence of packets in the data stream in a previous observation period does not influence the entropy norm computation. With respect to the software part, the host central processing unit computes the final entropy value based on the counter array $C_{ij}$ that is sent from the PCI Express bus after every observation period.
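The order-independence argument can be checked numerically. In the sketch below, exact running counts stand in for Count Sketch point queries (an assumption made to isolate the ordering question from the sketch error), and the estimator is averaged over every possible sampling position. Counting matches from the sampled position onward (suffix, as in Algorithm 1) and counting matches up to and including it (prefix, as a query on a sketch updated so far would) give the same average, namely $S$. Function names and the toy stream are hypothetical.

```python
import math
import random
from collections import Counter


def x_term(c, m):
    """Estimator value m * (C log C - (C - 1) log(C - 1)) for a count C."""
    tail = (c - 1) * math.log(c - 1) if c > 1 else 0.0
    return m * (c * math.log(c) - tail)


def mean_estimator(stream, use_prefix):
    """Average the estimator over all sampling positions y = 1..m."""
    m = len(stream)
    # Walking the stream backwards turns a running count into a suffix count.
    items = stream if use_prefix else list(reversed(stream))
    seen = Counter()
    total = 0.0
    for key in items:
        seen[key] += 1                 # count including the current position
        total += x_term(seen[key], m)
    return total / m


# Hypothetical skewed toy stream (not from the Letter).
rng = random.Random(3)
stream = [rng.choice([1, 1, 1, 1, 2, 2, 3, 4]) for _ in range(500)]
exact_s = sum(f * math.log(f) for f in Counter(stream).values())

prefix_mean = mean_estimator(stream, use_prefix=True)
suffix_mean = mean_estimator(stream, use_prefix=False)
print(exact_s, prefix_mean, suffix_mean)
```

For each key the counts $1, 2, \ldots, f_k$ are each visited once in either direction, so the sum telescopes to $f_k \log f_k$ in both cases.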

Fig. 2 Normalised entropy obtained using Count Sketch data structure on synthetic data sets (m = 1 000 000) with different Zipf parameters
Count Sketch consists of 3 × 32 k entries
Each entry has 32 bit words
Total size of Count Sketch is 384 kbytes
(Axes: Zipf parameter, 0.10 to 2.50, against entropy, 0 to 1.0; series: Lall et al., CCFC, UniSamp, exact)

Fig. 3 Relative estimated entropy error based on differently sized Count Sketches for CAIDA anonymous Internet trace [11]
(Axes: number of entries for Count Sketch table, 4 k to 256 k, against error rate, 0% to 5.0%; reference: Lall et al.; p = 4, q = 5120)

Fig. 4 Relative errors of estimated entropy for various Zipf parameters using proposed Count Sketch scheme
(Axes: Zipf parameter, 0.10 to 2.50, against error rate, 0% to 15.0%; series: Lall et al., CCFC, UniSamp)

Performance analysis: The proposed scheme is verified using synthetic data sets that are generated using a Zipfian distribution [10]. As shown in Fig. 2, the simulation is based on a Zipf parameter from 0.1 to 2.5. The synthetic data set comprises 1 000 000 packets with key $k \in [2^{32}]$. A simulation based on the CAIDA anonymised Internet trace [11] is also carried out. The trace, which is collected from CAIDA's equinix-sanjose high-speed Internet backbone, contains over 30 million packets and 448 809 unique IPv4 source addresses. The total size of the memory that is used in the system for Count Sketch (384 kbytes) and the entropy norm estimator array $C_{ij}$ (40 kbytes) is 424 kbytes.

Count Sketch is composed of three counter arrays, each with 32 k entries. The estimator counter array $C_{ij}$ has 4 × 5120 counters. Each counter entry has 32 bit words. The proposed scheme can estimate the entropy value with limited error on synthetic data of various Zipf parameters. For Zipf parameters below 1.25, the Count Sketch approach has a higher estimation error than the method of Lall et al., mainly because of the point query error of Count Sketch in a lightly skewed distribution. The point query error of Count Sketch can be reduced by adding more entries to the sketch. As presented in Fig. 3, the relative estimated entropy error is below 1.5% with more than 32 k entries in Count Sketch. For Zipf parameters above 1.5, as displayed in Fig. 4, the proposed method achieves a lower error because the scheme of Lall et al. must store packet counts of many identical keys in the estimator counter array $C_{ij}$.

The system hardware is implemented in a Xilinx Virtex-5 field programmable gate array (FPGA) on a NetFPGA-10G platform. The critical path is located in the 2-universal hash function and Count Sketch update modules. The system takes three clock cycles to hash the 32 bit incoming key and update the Count Sketch. Therefore, the proposed system can process traffic at a throughput of 30 Gbit/s with a 160 MHz system clock, assuming the worst-case scenario of a minimum-sized Ethernet frame with an inter-frame gap and preamble of 96 and 64 bit times, respectively.
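A back-of-the-envelope check of the throughput figure can be written as below. It assumes that one packet is accepted every three clock cycles with no further pipelining and that the minimum Ethernet frame is 64 bytes; the Letter does not give its own derivation, so the variable names and this accounting are illustrative only.

```python
# Values stated in the Letter.
CLOCK_HZ = 160e6            # system clock
CYCLES_PER_PACKET = 3       # hash the key and update the Count Sketch
PREAMBLE_BITS = 64          # preamble, in bit times
IFG_BITS = 96               # inter-frame gap, in bit times

# Assumption: minimum-sized Ethernet frame of 64 bytes.
MIN_FRAME_BITS = 64 * 8

packets_per_second = CLOCK_HZ / CYCLES_PER_PACKET
wire_bits_per_packet = MIN_FRAME_BITS + PREAMBLE_BITS + IFG_BITS
max_line_rate_bps = packets_per_second * wire_bits_per_packet

print(f"{packets_per_second / 1e6:.2f} Mpackets/s")
print(f"{max_line_rate_bps / 1e9:.2f} Gbit/s of minimum-sized frames on the wire")
```

Under these assumptions the update path sustains about 53.3 million packets per second, which corresponds to roughly 35.8 Gbit/s of back-to-back minimum-sized frames and is therefore consistent with the stated 30 Gbit/s.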

Conclusion: This Letter proposes a scheme that is based on the Count Sketch data structure for estimating the entropy norm. Simulations of entropy computation are performed using synthetic and real-world traces. The scheme uses only a small amount of memory (424 kbytes) to estimate the entropy of Internet traffic [11] accurately with a bounded error (below 1.5%). The design of the system supports straightforward hardware implementation. Hence, it is capable of processing very high-speed network streams without unpredictable memory access cycles that are associated with multiple pointer lookup and extra tag storage. The hardware implementation is confirmed using a NetFPGA-10G platform with entropy computation at a throughput of 30 Gbit/s.

Acknowledgments: The authors thank the anonymous reviewers for valuable comments and suggestions. This work is supported in part by the Taiwan National Science Council under contract nos. MOST 103-2221-E-033-030, MOST 103-2632-E-033-001, NSC 102-2221-E-033-031, NSC 101-3113-P-033-003 and NSC 101-2221-E-033-072.

© The Institution of Engineering and Technology 2014

2 July 2014

doi: 10.1049/el.2014.2377

One or more of the Figures in this Letter are available in colour online.

Yu-Kuen Lai, Theophilus Wellem and Hui-Ping You (Department of Electrical Engineering, Chung Yuan Christian University, Zhongli City, Taiwan)

E-mail: [email protected]

References

1 Lakhina, A., Crovella, M., and Diot, C.: 'Mining anomalies using traffic feature distributions'. Proc. 2005 Conf. Applications, Technologies, Architectures, and Protocols for Computer Communications, SIGCOMM'05, ACM, New York, NY, USA, 2005, pp. 217–228

2 Nychis, G., Sekar, V., Andersen, D.G., Kim, H., and Zhang, H.: 'An empirical evaluation of entropy-based traffic anomaly detection'. Proc. 8th ACM SIGCOMM Conf. Internet Measurement, ACM, Vouliagmeni, Greece, 2008, pp. 151–156

3 Bartos, V., Zadnik, M., and Cejka, T.: 'Nemea: framework for stream-wise analysis of network traffic'. CESNET Technical Report, 2013

4 Sekar, V., Reiter, M.K., and Zhang, H.: 'A case for a RISC architecture for network flow monitoring'. Technical Report, CMU-CS-09-125

5 Lall, A., Sekar, V., Ogihara, M., Xu, J.J., and Zhang, H.: 'Data streaming algorithms for estimating entropy of network traffic'. ACM SIGMETRICS, 2006, pp. 145–156

6 Hermsmeyer, C., Song, H., Schlenk, R., Gemelli, R., and Bunse, S.: 'Towards 100 G packet processing: challenges and technologies', Bell Labs Tech. J., 2009, 14, (2), pp. 57–79

7 Charikar, M., Chen, K., and Farach-Colton, M.: 'Finding frequent items in data streams', Theor. Comput. Sci., 2004, 312, (1), pp. 3–15

8 Chakrabarti, A., Do Ba, K., and Muthukrishnan, S.: 'Estimating entropy and entropy norm on data streams'. Proc. 23rd Annual Conf. Theoretical Aspects of Computer Science, STACS'06, Berlin, Heidelberg, 2006, pp. 196–205

9 Alon, N., Matias, Y., and Szegedy, M.: 'The space complexity of approximating the frequency moments'. Proc. 28th Annual ACM Symp. Theory of Computing, STOC'96, New York, NY, USA, 1996, pp. 20–29

10 Cormode, G.: 'MassDAL public code bank: Sketches, frequent items, changes (Deltoids)', Massive Data Analysis Lab.

11 CAIDA: 'The CAIDA UCSD anonymized internet traces 2012 equinix-sanjose.dira.20120119-130000.utc.anon.pcap.gz', 2012
