Design and implementation of a Data Stream Management System

Due to the technological advancements of the 21st century, the previous decade has witnessed a tremendous increase in the amount of data being exchanged, stored and processed across the globe. For example, consider a router that is part of the Internet and sits one level ahead of a popular web server. If the blacklist B and the data stream S were standard tables in an RDBMS, an SQL query of the following type could be used to obtain the number of requests from each of the blacklisted IP addresses.

In a typical DBMS, the query is processed assuming the size of the data is known before the calculation (e.g. the GROUP BY clause). Furthermore, the result set depends on the current state of the DBMS and does not change once it is returned to the user. Having seen the need for a data stream management system, we now present a formal definition and properties of data streams that we will use throughout the remainder of the dissertation.

In an aggregate model, each stream element contains a range value for a specified value in the signal domain S. In a cash register model, each stream element contains a nonnegative value for a specified domain value.

Challenges

The domain value represented by the tuple is incremented by the value contained in the tuple. The turnstile model is a generalization of the cash register model and relaxes the assumption that V_i ≥ 0. In a reset model, each element in the stream replaces the earlier value of a particular domain value with the new value.
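The difference between the cash register, turnstile and reset models can be sketched as a single update function. This is only an illustration of the definitions above; the function name `apply_update` and the string labels for the models are hypothetical and do not come from the system described here.

```python
from collections import defaultdict


def apply_update(state, key, value, model):
    """Apply one stream element (key, value) to the per-key state."""
    if model == "cash_register":
        # Cash register: only nonnegative increments are allowed.
        if value < 0:
            raise ValueError("cash register model requires value >= 0")
        state[key] += value
    elif model == "turnstile":
        # Turnstile: same as cash register, but the increment may be negative.
        state[key] += value
    elif model == "reset":
        # Reset: the new value replaces the earlier one.
        state[key] = value
    else:
        raise ValueError(f"unknown model: {model}")
    return state


elements = [("a", 3), ("b", 1), ("a", 2)]

cash = defaultdict(int)
reset = defaultdict(int)
for k, v in elements:
    apply_update(cash, k, v, "cash_register")
    apply_update(reset, k, v, "reset")

turnstile = defaultdict(int)
for k, v in [("a", 3), ("a", -1)]:
    apply_update(turnstile, k, v, "turnstile")
```

With the same input, the cash register state accumulates the values per key, while the reset state keeps only the most recent value for each key.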

Organization

Count-Min Sketch
Point Query
Range Query
Heavy Hitters

We present performance results showing the time taken to answer a query, the time taken to process a new element in the stream, the space used to answer the query and the accuracy of the returned result, for comparison. We also provide a brief summary of the systems that currently exist for managing these high-speed data streams. Since the entire stream of data cannot be stored, a summary of the entire stream or of a selected subset is stored and used for the purpose of analysis.

Load Shedding - The process of dropping a set of data items or a portion of the stream during periods of overload.
Sketching - The process of randomly projecting a subset of features from the data stream.

As mentioned above, because only a summary of the data stream is stored, algorithms that use CM sketches also return the query result as an (ε, δ) approximation, i.e. the answer is within an error governed by ε with probability at least 1 − δ.

When a new element of the stream $a_t = (i_t, c_t)$ is encountered at time $t$, the value $B(i_t)$ is increased by $c_t$. The algorithm ensures that none of the real heavy hitters is missed among the values it returns as heavy hitters.

Figure 2.1: Count-Min Sketch update: $\mathrm{count}[h_j(i_t), j] \leftarrow \mathrm{count}[h_j(i_t), j] + c_t$
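The update rule in Figure 2.1, together with a point query, can be sketched as follows. This is a minimal illustration and not the DBSMS implementation: the class name, the hash family, and the sizing rule (width ⌈e/ε⌉ and depth ⌈ln(1/δ)⌉, the usual Count-Min choice) are assumptions that are not taken from the text.

```python
import math
import random


class CountMinSketch:
    """Illustrative Count-Min sketch sized from (eps, delta)."""

    def __init__(self, eps, delta, seed=0):
        # Assumed sizing rule: width = ceil(e / eps), depth = ceil(ln(1 / delta)).
        self.width = math.ceil(math.e / eps)
        self.depth = max(1, math.ceil(math.log(1.0 / delta)))
        self.prime = (1 << 61) - 1
        rng = random.Random(seed)
        # One (a, b) pair per row for a simple linear hash (an assumption).
        self.hashes = [
            (rng.randrange(1, self.prime), rng.randrange(0, self.prime))
            for _ in range(self.depth)
        ]
        # Stored as count[j][h_j(i)]; the figure writes the indices the other way round.
        self.count = [[0] * self.width for _ in range(self.depth)]

    def _h(self, j, item):
        a, b = self.hashes[j]
        return ((a * hash(item) + b) % self.prime) % self.width

    def update(self, item, c=1):
        # count[h_j(i_t), j] <- count[h_j(i_t), j] + c_t for every row j.
        for j in range(self.depth):
            self.count[j][self._h(j, item)] += c

    def point_query(self, item):
        # The minimum over rows never underestimates when all c_t >= 0.
        return min(self.count[j][self._h(j, item)] for j in range(self.depth))


cms = CountMinSketch(eps=0.01, delta=0.01)
truth = {}
for t in range(1000):
    key = t % 50
    cms.update(key, 2)
    truth[key] = truth.get(key, 0) + 2
```

In the cash register model, `cms.point_query(k)` is never smaller than the true total for `k`; it can only overestimate because of hash collisions.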

STREAM

The database stream management system (DBSMS) is motivated by the Stanford Stream Data Manager (STREAM), a system for running continuous queries on multiple continuous data streams as described in Section 2.2. DBSMS attempts to integrate a data stream management system (DSMS) and a traditional database management system (DBMS). The STREAM system supports a declarative query language and copes with high data rates and query loads by providing approximate answers when resources are limited.

The DBSMS system will extend STREAM with features such as user-controlled approximation, support for persistence, integration of user-defined approximation algorithms into STREAM, mapping of given queries into approximation operators, and multi-query optimization. STREAM can be used as the data stream management system (DSMS) and any relational database can be used as the database management system (DBMS). The DBSMS system supports a declarative query language [see 3.3] based on an extended version of the Continuous Query Language (CQL).

In this architecture, the user can register a query corresponding to the STREAM system and can also register a stream. The user can plug in any approximation algorithm as an operator to answer certain types of queries with user-desired accuracy. Finally, after the query is validated syntactically and semantically, it is passed to the DBSMS system, which determines the type of query: CQL, SQL or User Defined Algorithm query.

In this system, the user has the option to specify approximation algorithms and their equivalent query templates. Using such a template, we can generate a query plan and replace similar query plans with the matching approximation operator. DBSMS also maintains a summary of streams and queries so that stream statistics such as the minimum key and the L1 norm can be queried.

The implementation plan and query mapping from one type to another are discussed in the following sections.

Types of queries

One-Time query

A one-time query is a query that is issued at any time after the start of the stream on which it is registered. It is executed against the current state of DBSMS and the result set is static rather than a stream. One-time queries are further divided into two types depending on whether the query is performed on a relation or a stream.

Continuous query

Query language

Registering a stream

Registering a query

Querying the result

Query execution and conversion from one type of query to another type

Query processing structure

Project goals

Query mapping into approximation operators

Streaming statistics queries

An equivalent approximation algorithm is an approximate point query on the stream with a value of 1 for each key. An equivalent exact algorithm for maintaining the stream length is to maintain a single variable that counts the number of elements that have arrived.

Streaming aggregate queries

Streaming data queries

Equivalent query: select key, sum(value) from relation group by key having sum(value) >= value;
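The exact computation expressed by this query can be sketched in plain Python: sum the values per key and keep the keys whose total reaches a threshold. The function name `exact_heavy_hitters`, the parameter `threshold` and the sample rows are hypothetical; the sketch shows only the exact answer that an approximation operator would be compared against.

```python
from collections import defaultdict


def exact_heavy_hitters(relation, threshold):
    """Return {key: total} for keys whose summed value is >= threshold."""
    totals = defaultdict(int)
    for key, value in relation:
        totals[key] += value  # group by key, sum(value)
    # The filter on the aggregated total plays the role of the final clause.
    return {k: s for k, s in totals.items() if s >= threshold}


rows = [("10.0.0.1", 4), ("10.0.0.2", 1), ("10.0.0.1", 3)]
result = exact_heavy_hitters(rows, threshold=5)
```

Unlike a sketch-based operator, this keeps one counter per distinct key, so its memory use grows with the number of keys in the stream.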

Integrating streaming algorithms into the streaming database

User controlled approximation

Execution structure of user defined algorithm

Streams collection

Queries collection

StreamSource and OperatorQueueThread communication structure

An input query can be received from the input terminal (InputConsoleQueryReader) provided by the system or from the port on which the system is listening. Using this object, the Input component records the type of query and the operation to perform.

Input component

Register request
Action request
QueryResult request
Show request

A register request is divided into two types: stream register request and query register request.

Configuration component

Stream component

Stream Element
Stream source
Operator queue thread
Streams collection
Queries collection

Stream source is responsible for generating and distributing stream elements to all registered operators. By using this property the derived class will be notified when an operation was performed in the parent class. Since persistence is one of the most crucial properties of the system, it will be used to synchronize the system with the secondary storage after two classes.
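The way a stream source hands elements to its registered operators can be sketched with one queue and one worker thread per operator. The class names StreamSource and OperatorQueueThread are taken from the text, but every method, attribute and the stop sentinel below are assumptions made for illustration, not the system's actual interfaces.

```python
import queue
import threading

_STOP = object()  # hypothetical sentinel that tells a worker to finish


class OperatorQueueThread(threading.Thread):
    """Worker that feeds queued stream elements to a single operator."""

    def __init__(self, operator):
        super().__init__(daemon=True)
        self.operator = operator
        self.inbox = queue.Queue()

    def run(self):
        while True:
            element = self.inbox.get()
            if element is _STOP:
                break
            self.operator(element)


class StreamSource:
    """Distributes every generated element to all registered operators."""

    def __init__(self):
        self.workers = []

    def register(self, operator):
        worker = OperatorQueueThread(operator)
        worker.start()
        self.workers.append(worker)
        return worker

    def emit(self, element):
        for worker in self.workers:
            worker.inbox.put(element)

    def close(self):
        # Ask every worker to stop, then wait until its queue is drained.
        for worker in self.workers:
            worker.inbox.put(_STOP)
        for worker in self.workers:
            worker.join()


seen_a, seen_b = [], []
source = StreamSource()
source.register(seen_a.append)
source.register(seen_b.append)
for element in [("k1", 1), ("k2", 5)]:
    source.emit(element)
source.close()
```

Giving each operator its own queue means a slow operator delays only its own processing and does not block the source or the other operators.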

Algorithm component

Process to plug in user defined stream source

Process to plug in user defined algorithm operator

Various experiments are performed to determine the behavior of the system and to compare its performance against a well-known database. In DBSMS, the insertion time varies between queries because the storage structures are different for different queries. Compared to the other algorithms, the harder algorithms take more time because they perform more operations when inserting into their data structures.

Almost every algorithm takes constant time, because the average number of operations to be performed is the same for streams of different sizes. In DBSMS, statistics take much less time because they are calculated in advance, whereas in databases they are recalculated each time. Compared to a DBMS, the query takes less time because it is answered approximately using dyadic ranges, as opposed to giving an exact answer in databases.
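The decomposition of a range into dyadic ranges can be sketched as follows. The function name `dyadic_cover` and the (level, index) encoding are hypothetical. In the usual construction, which is assumed here and not confirmed by the text, each returned dyadic range is answered by one approximate point query, so a range over a large domain needs only a logarithmic number of lookups.

```python
def dyadic_cover(lo, hi):
    """Split the inclusive integer range [lo, hi], lo >= 0, into dyadic ranges.

    Each pair (level, index) stands for the range
    [index * 2**level, (index + 1) * 2**level - 1].
    """
    cover = []
    while lo <= hi:
        level = 0
        # Grow the block while it stays aligned at lo and fits inside [lo, hi].
        while lo % (2 ** (level + 1)) == 0 and lo + 2 ** (level + 1) - 1 <= hi:
            level += 1
        cover.append((level, lo >> level))
        lo += 2 ** level
    return cover


pieces = dyadic_cover(3, 10)
covered = [
    x
    for level, index in pieces
    for x in range(index * 2 ** level, (index + 1) * 2 ** level)
]
```

For the range [3, 10] this yields four pieces, [3, 3], [4, 7], [8, 9] and [10, 10], which together cover every value exactly once.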

So far in this report, we have presented a design for a generic database stream management system. We also presented our system implementation, where we introduced user error control and partially addressed the challenges of query mapping, integration of approximation algorithms into the system, and support for persistence. We now present some open questions and other challenges that need to be addressed in system design.

Although error control has been provided to the user, we still have to deal with the problem of error propagation. Until now, error handling has only been provided at the level of a single query that maps to exactly one approximation operator. An algorithm implemented by the operator automatically creates memory structures to answer queries based on ε and δ received as input from the user.

However, if a query is mapped to more than one approximation operator, or if a complex query includes the results of simpler queries, the system must calculate the error propagation between the different operators. The user provides the full error for each query and not at the level of the operators to which it is mapped. The system should thus be able to calculate the error values ε1 and ε2 to be used with the point and range query operators to ensure that the total error ε is satisfied.

Figure 6.1: Insertion times of DBMS vs DBSMS

Query mapping and Feasibility

The user must enter an SQL-type query with the desired precision, and the system, using a set of well-defined rules, must be able to transform it into simpler queries that can be mapped to the existing approximation operators. For example, as seen above, queries with the following template should be automatically identified as point queries and mapped to the point query approximation operator. Such a form of mapping is highly desirable but is not yet incorporated into the system.

Also, since we reduce a multidimensional tuple to a two-dimensional tuple and store only the reduced tuple, we may not be able to answer queries that use other properties of the discarded part of the tuple. This leads us to introduce a check of whether a query over a stream is actually feasible or not. Such checking of query feasibility does not currently occur and must be introduced in the system.

Multi-Query Optimization

The input component defines the structure of the input that is sent to the DBSMS engine.
REGISTER_REQUEST : Request to register an entity in the system
QUERY_RESULT : To query the system
This module also consists of system checkpoint properties and all system properties.

The stream source is responsible for generating the stream and distributing its elements to all registered operators.