Superspreader Detection System on NetFPGA Platform
Theophilus Wellem†,‡, Guan-Wei Li, Yu-Kuen Lai
Dept. of Electronic Engineering†, Dept. of Electrical Engineering, Chung Yuan Christian University, Chung-Li 32023, Taiwan
{g10202604, g10179006, ylai}@cycu.edu.tw
Dept. of Informatics‡
Satya Wacana Christian University, Salatiga 50711, Indonesia
[email protected]
ABSTRACT
We propose a system to detect superspreaders based on a combination of an FM sketch, a Bloom filter, and a hash table. It first eliminates sources that are not potential superspreaders, then counts the number of connections to distinct destinations (fan-out) for the remaining sources. The proposed system is implemented on the NetFPGA platform. Our experimental results show that the system can detect superspreaders with high fan-out and estimate their fan-outs accurately using a small amount of memory.
Categories and Subject Descriptors
C.2.3 [Network operation]: Network monitoring
General Terms
Algorithms, Design, Performance
Keywords
Superspreader; Bloom filter; FM sketch; NetFPGA
1. INTRODUCTION
Detecting sources (or hosts) that communicate with a large number of distinct destinations (also known as superspreaders or super sources) is an important ingredient of a network anomaly detection system. A large number of connections in a short period of time can indicate a port scan attack, a DDoS attack, or worm propagation. Several works using sampling and data streaming algorithms have been proposed to detect superspreaders and estimate the number of distinct destinations they connect to (fan-out) [1, 2, 3, 4]. Most of them combine sampling, hashing, Bloom filters, bitmap counting, and statistical estimation methods to detect and identify the superspreaders.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s). Copyright is held by the author/owner(s).
ANCS’14, October 20–21, 2014, Los Angeles, CA, USA. ACM 978-1-4503-2839-5/14/10.
http://dx.doi.org/10.1145/2658260.2661766.
Zhao et al. [2] proposed a data streaming algorithm and flow sampling techniques to detect super sources and super destinations. The algorithm uses an online streaming module (OSM) and an identity sampling module to process incoming traffic in parallel. The OSM encodes the fan-out information for each source IP, and the identity sampling module captures the potential super source candidates. An estimation algorithm then obtains the estimated fan-out of these candidates from the OSM. The drawback of this scheme is that the detectable fan-out of a source is limited to about 266 (m ln m) due to "saturation" of a column of the bitmap array. Therefore, the method in [2] can only be used for small fan-outs, which is impractical because the fan-out can exceed 10,000 in a 60-second measurement interval on high-speed networks.
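The saturation figure quoted for [2] can be checked numerically. The short sketch below evaluates m ln m under the assumption that each bitmap column holds m = 64 bits; the text does not state m, and 64 is chosen here only because it reproduces the quoted value of about 266. The function name is hypothetical.

```python
import math

def saturation_fanout(m: int) -> float:
    """Evaluate m * ln(m), the saturation limit quoted in the text for an m-bit column."""
    return m * math.log(m)

# Assumption: m = 64 bits per column (not stated in the text).
print(round(saturation_fanout(64)))  # 266
```

Under this assumption, a source with a fan-out well above a few hundred sets essentially every bit in its column, so larger fan-outs can no longer be told apart.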
Kamiyama et al. [3] proposed superspreader identification using flow sampling. The scheme uses a Bloom filter to check whether a flow is new, and a hash table to store each source IP address and its flow counter. When the counter value of a source exceeds a predefined threshold, the source is identified as a superspreader. Their scheme uses chaining for hash collision resolution, with the length of each linked list limited to four in order to bound the processing time. However, it also works only for a small fan-out threshold and small numbers of sources and flows. For higher fan-outs and larger numbers of sources and flows, the memory space needed can be very large.
Motivated by the work in [2] and [3], we propose a superspreader detection system that operates in a streaming fashion using a combination of an FM sketch, a Bloom filter, and a hash table.
2. DESIGN AND ARCHITECTURE
In general, the number of superspreaders in one measurement interval is usually small compared to regular flows. The main idea of our proposed scheme is to first filter out sources with low fan-out, and count only the fan-out of potential superspreaders. This saves memory space and processing time, so the fan-out can be counted more accurately. Fig. 1 shows the block diagram of the system. The architecture of the superspreader detection system consists of hardware and software parts. The hardware communicates with the host software via the NetFPGA register system. All data structures reside in memory. To filter out those sources that are not superspreaders, we use a bitmap data structure. The bitmap data structure is an m × n bit array of FM sketches. When a packet arrives, the flow ID ({s, d}) and the source s are hashed to the bitmap. For each row (i.e., the source hashed to that row, assuming no collision), we can estimate the number of distinct packets (or distinct destinations) using [5]

$$\hat{n}_i = \frac{1}{\varphi}\, 2^{R_i}, \quad i = 1, 2, \ldots, m,$$

where $\varphi = 0.775351$ and $R_i$ is the position of the leftmost '0' in row i.
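The bitmap update and the estimator above can be sketched in software as follows. This is a minimal illustration, not the NetFPGA implementation: it assumes that the source selects the row, that the flow ID selects the bit through the index of the lowest set bit of its hash (the usual FM construction), and that bit positions are counted from 0. The class name, helper names, and the use of SHA-256 as the hash are all hypothetical choices.

```python
import hashlib

# Value of phi as quoted in the text; Flajolet and Martin [5] report about 0.77351.
PHI = 0.775351

def hash64(key: str, seed: int = 0) -> int:
    """64-bit hash built from SHA-256 (a stand-in for the hardware hash units)."""
    digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def rho(x: int, n_bits: int) -> int:
    """Index of the least-significant 1 bit of x, capped at n_bits - 1."""
    if x == 0:
        return n_bits - 1
    return min((x & -x).bit_length() - 1, n_bits - 1)

class FMBitmap:
    """m x n bit array; each row is one FM sketch stored as an integer."""

    def __init__(self, m_rows: int, n_bits: int = 32):
        self.m, self.n = m_rows, n_bits
        self.rows = [0] * m_rows

    def update(self, src: str, dst: str) -> None:
        i = hash64(src, seed=1) % self.m                  # row chosen by the source
        j = rho(hash64(f"{src}>{dst}", seed=2), self.n)   # bit chosen by the flow ID
        self.rows[i] |= 1 << j

    def R(self, src: str) -> int:
        """Position of the first 0 bit in the source's row, counting from bit 0."""
        row = self.rows[hash64(src, seed=1) % self.m]
        r = 0
        while r < self.n and (row >> r) & 1:
            r += 1
        return r

    def estimate(self, src: str) -> float:
        """Fan-out estimate 2^R / phi for the row the source hashes to."""
        return (2 ** self.R(src)) / PHI
```

Repeated packets of the same flow set the same bit, so only distinct flows influence $R_i$. A single row gives a coarse, power-of-two estimate, which is enough for thresholding.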
Figure 1: Block diagram of superspreader module (blocks: packet input, header parser, hash units, bitmap, Bloom filter, hash table, control unit; a PCIe interface carries settings from and statistics to the host software).
The Bloom filter is used to identify new flows, and the hash table maintains a fan-out count for each source.
The superspreader detection is executed for each measurement interval. At the beginning of a measurement interval, the Bloom filter, bitmap array, and hash table entries are set to zero. Upon each packet arrival, the bitmap data structure is updated. After one interval, the $R_i$ values of the bitmap are used for filtering in the subsequent interval. In the subsequent interval, when a source s is hashed to row i, the corresponding $R_i$ value is compared to a predefined threshold, T. If it is less than T, then s is filtered out; otherwise, the flow ID ({s, d}) is checked against the Bloom filter. If it is a new flow, it is added to the Bloom filter, then the source s is hashed to the hash table and its fan-out counter is updated. Currently, no collision resolution is implemented for the hash table. Finally, at the end of the measurement interval, the values from the hash table are used to report the superspreaders and their corresponding fan-outs.
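The per-interval procedure described above can be sketched in software as follows. This is an illustrative model under several assumptions, not the hardware design: the hash function (SHA-256), the class and method names, the one-byte-per-bit Bloom filter, and the choice to store only a counter per hash bucket (looked up by source when reporting) are all hypothetical.

```python
import hashlib

def h(key: str, seed: int) -> int:
    """64-bit hash from SHA-256 (stand-in for the hardware hash units)."""
    return int.from_bytes(hashlib.sha256(f"{seed}:{key}".encode()).digest()[:8], "big")

class Detector:
    def __init__(self, m_rows, n_bits, bf_bits, ht_buckets, T, k=3):
        self.m, self.n, self.T, self.k = m_rows, n_bits, T, k
        self.bf_bits, self.ht_buckets = bf_bits, ht_buckets
        self.prev_R = [0] * m_rows        # R_i values from the previous interval
        self._reset()

    def _reset(self):
        """Zero the bitmap, Bloom filter, and hash table at the start of an interval."""
        self.bitmap = [0] * self.m
        self.bloom = bytearray(self.bf_bits)   # one byte per bit, for clarity
        self.counters = [0] * self.ht_buckets

    def _first_zero(self, row):
        r = 0
        while r < self.n and (row >> r) & 1:
            r += 1
        return r

    def packet(self, src, dst):
        flow = f"{src}>{dst}"
        i = h(src, 1) % self.m
        x = h(flow, 2)
        j = min((x & -x).bit_length() - 1, self.n - 1) if x else self.n - 1
        self.bitmap[i] |= 1 << j               # the bitmap is updated on every packet
        if self.prev_R[i] < self.T:
            return                             # source filtered out
        idx = [h(flow, 10 + s) % self.bf_bits for s in range(self.k)]
        if all(self.bloom[b] for b in idx):
            return                             # flow already seen (or a Bloom false positive)
        for b in idx:
            self.bloom[b] = 1
        self.counters[h(src, 3) % self.ht_buckets] += 1   # no collision resolution

    def end_interval(self):
        """Return this interval's counters and carry R_i over to the next interval."""
        counts = list(self.counters)
        self.prev_R = [self._first_zero(row) for row in self.bitmap]
        self._reset()
        return counts

    def fanout(self, counts, src):
        return counts[h(src, 3) % self.ht_buckets]
```

In this model the first interval only populates the bitmap, so nothing is counted until the second interval. Sources that collide in a hash bucket share a counter, which mirrors the missing collision resolution noted in the text.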
In this system, we do not sample or save any source IP address to get the superspreader candidates. Instead, we rely on the incoming source IP addresses of the subsequent measurement interval, because superspreaders tend to maintain their connections for a while. This saves memory space, especially when implemented on a platform with limited hardware resources.
3. EVALUATION
We evaluate our system using traces available from MAWI and CAIDA. The numbers of distinct sources and distinct {s, d} flows in the MAWI trace (15 minutes) are 15,106 and 48,761, respectively. The CAIDA trace has 28,744,877 packets, 440,771 distinct sources, and 1,044,618 distinct {s, d} flows, with a duration of 56 seconds. A superspreader is defined as a source whose fan-out is greater than or equal to 50 [3]. The FNR (false negative ratio) is the number of superspreaders that are not identified divided by the number of actual superspreaders. The FPR (false positive ratio) is the number of non-superspreaders incorrectly identified as superspreaders divided by the number of actual superspreaders. For the results shown in Table 1, the Bloom filter and hash table sizes are 48 KB and 16 KB, respectively. As shown, our results have larger FNR and FPR than those presented by Kamiyama et al. [3]. The large FPR is due to collisions in the hash table, since no collision resolution is implemented.
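The two ratios defined above can be computed as in the following sketch. The function name and the input format (a dictionary of true fan-outs and a set of reported sources) are hypothetical. Note that, following the definition in the text, the FPR is normalized by the number of actual superspreaders rather than by the number of non-superspreaders.

```python
def fnr_fpr(actual_fanout: dict, reported: set, threshold: int = 50):
    """Return (FNR, FPR) as defined in the text.

    actual_fanout maps each source to its true fan-out;
    reported is the set of sources flagged as superspreaders.
    """
    actual = {s for s, f in actual_fanout.items() if f >= threshold}
    if not actual:
        raise ValueError("no actual superspreaders; ratios are undefined")
    missed = actual - reported           # superspreaders that were not identified
    false_alarms = reported - actual     # non-superspreaders that were flagged
    # Both ratios share the same denominator, per the definition in the text.
    return len(missed) / len(actual), len(false_alarms) / len(actual)

truth = {"a": 120, "b": 50, "c": 49, "d": 3}
print(fnr_fpr(truth, {"a", "c"}))  # (0.5, 0.5)
```

In the example, sources a and b are actual superspreaders; b is missed and c is wrongly flagged, so both ratios equal 1/2.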
Table 1: Experiment results
                    Kamiyama et al. [3]   Proposed
Trace               NLANR                 CAIDA
Interval            10 secs.              10 secs.
# of flows          63,638                284,314
# of sources        16,455                156,172
# of spreaders      139                   271
Memory (BF+HT)      64 KB                 64 KB      512 KB
FNR                 0.06                  0.21       0.09
FPR                 7.5e-4                0.50       0.04

Table 2: Identification and counting accuracy
Trace    BF size   FNR    FPR    Est. error   Total memory
MAWI     32 KB     0.06   0      2.24%        128 KB
CAIDA    256 KB    0.20   0.21   8.75%        768 KB
CAIDA    512 KB    0.16   0.18   5.17%        1.25 MB

For the experiment results shown in Table 2, we set the bitmap size to 16384 × 32 bits (64 KB). The Bloom filter size is varied, with three hash functions (k = 3). The hash table has 8192 buckets, each holding a 32-bit counter (32 KB). With a 32 KB Bloom filter, the total memory size is 128 KB. For the CAIDA trace, we increase the total memory size (768 KB and 1.25 MB) to achieve more accurate results, since the numbers of flows and sources per interval are larger. The total memory used in the experiments fits into the memory available on the NetFPGA board. Compared to the results in [2] and [3], our preliminary results are promising. For the MAWI trace (Table 2), the system can detect superspreaders and estimate larger fan-outs (more than 5000) with an average estimation error below 5% using only 128 KB of memory.
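The 128 KB total for the MAWI configuration follows directly from the stated sizes, as the short calculation below shows. The variable names are hypothetical. The text does not give a per-structure breakdown for the larger CAIDA totals, so only the MAWI case is worked here.

```python
KB = 1024

bitmap_bytes = 16384 * 32 // 8      # 16384 rows x 32 bits   = 64 KB
hash_table_bytes = 8192 * 32 // 8   # 8192 x 32-bit counters = 32 KB
bloom_bytes = 32 * KB               # Bloom filter size used for MAWI (Table 2)

total_bytes = bitmap_bytes + hash_table_bytes + bloom_bytes
print(bitmap_bytes // KB, hash_table_bytes // KB, total_bytes // KB)  # 64 32 128
```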
4. CONCLUSION AND FUTURE WORK
In contrast to previous approaches, our system can detect superspreaders with a higher fan-out threshold and estimate their fan-outs with an average estimation error below 10%. We seek to further improve the accuracy by applying collision resolution techniques to the hash table. We will also investigate various factors that affect the measurement accuracy, such as the measurement interval, threshold, bitmap size, Bloom filter size, and hash table size.
Acknowledgments
This research was funded in part by the National Science Council, Taiwan, under contract numbers NSC 102-2221-E-033-031, MOST 103-2221-E-033-030, and MOST 103-2632-E-033-001.
5. REFERENCES
[1] S. Venkataraman, D. Song, P. B. Gibbons, and A. Blum. New streaming algorithms for fast detection of superspreaders. In Proceedings of Network and Distributed System Security Symposium (NDSS’05), pp. 149–166, 2005.
[2] Q. Zhao, A. Kumar, and J. Xu. Joint data streaming and sampling techniques for detection of super sources and destinations. In 5th ACM SIGCOMM IMC 2005, pp. 77–90, 2005.
[3] N. Kamiyama, T. Mori, and R. Kawahara. Simple and adaptive identification of superspreaders by flow sampling. In IEEE INFOCOM 2007, pp. 2481–2485, 2007.
[4] X. Guan, P. Wang, and T. Qin. A new data streaming method for locating hosts with large connection degree. In IEEE GLOBECOM 2009, pp. 1–6, Nov. 2009.
[5] P. Flajolet and G. Martin. Probabilistic counting algorithms for database applications. Journal of Computer and System Sciences, 31(2):182–209, Sep. 1985.