The only part of this chapter that is essential is Example 2.3, which introduces the running example that we use throughout much of the book. You will notice that the publisher of the third edition of the book is different from the first two editions.
Distributed Data Processing
It is possible to achieve integration without centralization, and this is exactly what distributed database technology strives to achieve. Control of the execution of various tasks can be distributed rather than centralized in a single computer system.
What is a Distributed Database System?
This is not true in distributed computing systems, where heterogeneity of operating systems as well as hardware is quite common. A DDBS is also not a system where, despite the existence of a network, the database resides at only one node of the network (Figure 1.3).
Data Delivery Alternatives
In the push-only mode of data delivery, the transfer of data from servers to clients is initiated by a server push in the absence of any specific request from clients. In the pull-only mode, by contrast, data is transferred from servers to clients on demand, whenever a client issues a request.
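The contrast between the two delivery modes can be sketched in a few lines of Python; the Server and Client classes below are illustrative, not part of any system described in this chapter:

```python
# Minimal sketch of push vs. pull data delivery (illustrative names).
class Client:
    def __init__(self):
        self.cache = {}

    def receive(self, item, value):
        self.cache[item] = value

class Server:
    def __init__(self, data):
        self.data = data          # item -> value
        self.subscribers = []     # clients registered for push delivery

    def subscribe(self, client):
        self.subscribers.append(client)

    def update(self, item, value):
        self.data[item] = value
        # Push mode: the server initiates the transfer, unprompted.
        for c in self.subscribers:
            c.receive(item, value)

    def request(self, item):
        # Pull mode: the transfer happens only when a client asks.
        return self.data.get(item)

server = Server({"stock:ACME": 10})
c = Client()
server.subscribe(c)
server.update("stock:ACME", 12)      # pushed to c without any request
print(c.cache["stock:ACME"])         # 12
print(server.request("stock:ACME"))  # 12, pulled on demand
```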
Promises of DDBSs
- Transparent Management of Distributed and Replicated Data
- Reliability Through Distributed Transactions
- Improved Performance
- Easier System Expansion
The DBMS is responsible for performing all necessary translations from the operating system to the higher-level user interface. In the case of a distributed database, this means that some data may at times be inaccessible, but, with due care, users can be shielded from such details.
Complications Introduced by Distribution
Design Issues
- Distributed Database Design
- Distributed Directory Management
- Distributed Query Processing
- Distributed Concurrency Control
- Distributed Deadlock Management
- Reliability of Distributed DBMS
- Replication
- Relationship among Problems
- Additional Issues
Concurrency control involves synchronizing accesses to the distributed database so that the integrity of the database is maintained. The relationship between the components is shown in Figure 1.7. The design of distributed databases affects many areas.
Distributed DBMS Architecture
- ANSI/SPARC Architecture
- A Generic Centralized DBMS Architecture
- Architectural Models for Distributed DBMSs
- Autonomy
- Distribution
- Heterogeneity
- Architectural Alternatives
- Client/Server Systems
- Peer-to-Peer Systems
- Multidatabase System Architecture
It is the "real world" view of the enterprise that is modeled in the database [Yormark, 1977]. Network transparency, on the other hand, is supported by the definition of the global conceptual schema.
Bibliographic Notes
As expected, these date back to the early days of the introduction of distributed DBMS technology. Much of the discussion on architectural patterns for multi-DBMS is from [Özsu and Barker, 1990].
Overview of Relational DBMS
Relational Database Concepts
The key of a relation schema is the minimum nonempty subset of its attributes such that the values of the attributes that make up the key uniquely identify each tuple of the relation. The number of attributes of a relation defines its degree, while the number of tuples of the relation defines its cardinality.
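As a minimal illustration, the degree, the cardinality, and the key property can be checked directly on a small instance of the EMP relation; the sample tuples below are made up for this sketch:

```python
# A small instance of EMP(ENO, ENAME, TITLE); tuples are illustrative.
EMP = [
    ("E1", "J. Doe",   "Elect. Eng."),
    ("E2", "M. Smith", "Syst. Anal."),
    ("E3", "A. Lee",   "Elect. Eng."),
]
attributes = ("ENO", "ENAME", "TITLE")

degree = len(attributes)   # number of attributes of the relation
cardinality = len(EMP)     # number of tuples of the relation

def is_key(relation, positions):
    """The key's values must uniquely identify each tuple."""
    projected = [tuple(t[i] for i in positions) for t in relation]
    return len(set(projected)) == len(relation)

print(degree, cardinality)   # 3 3
print(is_key(EMP, [0]))      # True: ENO identifies each tuple
print(is_key(EMP, [2]))      # False: two tuples share the same TITLE
```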
Normalization
This is because EMP's key contains the PNO attribute and null values cannot be part of the key. Example 2.3. The following set of relation schemas is normalized into BCNF with respect to the functional dependencies defined on the relations.
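Whether a given functional dependency actually holds can be checked on a relation instance; the sketch below is an instance-level test only (a true functional dependency is a schema-level constraint), and the PAY tuples are illustrative:

```python
# Checks whether a functional dependency X -> Y holds in a relation
# instance: equal X-values must always map to equal Y-values.
def fd_holds(relation, lhs, rhs):
    seen = {}
    for t in relation:
        x = tuple(t[a] for a in lhs)
        y = tuple(t[a] for a in rhs)
        if x in seen and seen[x] != y:
            return False   # same X-values, different Y-values: violation
        seen[x] = y
    return True

# Illustrative PAY instance for the dependency TITLE -> SAL.
PAY = [
    {"TITLE": "Elect. Eng.", "SAL": 40000},
    {"TITLE": "Syst. Anal.", "SAL": 34000},
    {"TITLE": "Elect. Eng.", "SAL": 40000},
]
print(fd_holds(PAY, ["TITLE"], ["SAL"]))  # True
```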
Relational Data Languages
If the relative order of attributes is not important, it is necessary to replace the second part of the definition with the sentence "the corresponding attributes of the two relations must be defined over the same domain." Correspondence is rather loosely defined here. Example 2.4. Consider the relation EMP shown in Figure 2.3. The result of selecting the tuples for electrical engineers is shown in Figure 2.4. The intersection of two relations R and S (R ∩ S) consists of the set of all tuples contained in both R and S.
Π ENO,PNO (ASG) ⋈ PNO PROJ
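This algebraic expression can be evaluated mechanically; the following sketch implements projection (with duplicate elimination) and an equi-join over a single common attribute, using made-up ASG and PROJ tuples:

```python
# Set-semantics projection: keep each attribute combination once.
def project(relation, attrs):
    seen, out = set(), []
    for t in relation:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

# Equi-join of r and s over one common attribute.
def join(r, s, attr):
    return [{**tr, **ts} for tr in r for ts in s if tr[attr] == ts[attr]]

ASG = [{"ENO": "E1", "PNO": "P1", "RESP": "Manager", "DUR": 12}]
PROJ = [{"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000}]

# Evaluates the expression above: project ASG on ENO,PNO, then join PROJ.
result = join(project(ASG, ["ENO", "PNO"]), PROJ, "PNO")
print(result)
# [{'ENO': 'E1', 'PNO': 'P1', 'PNAME': 'Instrumentation', 'BUDGET': 150000}]
```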
Review of Computer Networks
- Types of Networks
- Communication Schemes
- Data Communication Concepts
- Communication Protocols
Some of the hosts can be connected directly to the switches (using fiber optic, coaxial cable, or copper wire) and some via wireless base stations. For example, telephone lines can carry data in analog form between the home and the central office; the rest of the telephone network is now digital, and even the home-to-central-office link is becoming digital with voice over IP (VoIP) technology. Each of the intranets that are part of the Internet can use its own preferred protocol, so the computers on that network implement that particular protocol (e.g., the token ring mechanism and the CSMA/CD technique described above are examples of this type of protocol).
Bibliographic Notes
Along the second, behavioral, dimension of the access model, it is possible to identify two alternatives. Unfortunately, it is hard to find many real-life distributed applications that would be classified as static. Incidentally, it is along this dimension that the relationship between distributed database design and query processing is established (see Figure 1.7).
Top-Down Design Process
The objective in this phase, which is the focus of this chapter, is to design local conceptual schemas (LCSs) by distributing the entities over the sites of the distributed system. The reason for dividing the distribution design into two steps is to better deal with the complexity of the problem. The result is a form of feedback, which may require backing up to one of the previous steps in the design.
Distribution Design Issues
- Reasons for Fragmentation
- Fragmentation Alternatives
- Degree of Fragmentation
- Correctness Rules of Fragmentation
- Allocation Alternatives
- Information Requirements
The reconstructibility of the relation from its fragments ensures the preservation of the constraints defined on the data in the form of dependencies. Assuming that the database is properly fragmented, a decision must be made to allocate the fragments to different locations in the network. Therefore, the replication decision is a trade-off that depends on the ratio of read-only queries to update queries.
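That trade-off can be caricatured as a simple ratio test; the threshold below is an arbitrary illustrative value, not a figure from any allocation model in this chapter:

```python
# Replication decision as a read/update ratio test (threshold is an
# illustrative assumption, not a value from the book).
def should_replicate(read_only_queries, update_queries, threshold=2.0):
    """Replicate when reads dominate updates enough to offset the
    cost of keeping multiple copies mutually consistent."""
    if update_queries == 0:
        return True   # nothing ever changes: copies are pure benefit
    return read_only_queries / update_queries >= threshold

print(should_replicate(100, 10))  # True: read-heavy, replication pays off
print(should_replicate(10, 100))  # False: update-heavy, copies cost more
```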
Fragmentation
- Horizontal Fragmentation
- Vertical Fragmentation
Example 3.6. Consider the relation SALARY in Figure 3.3. The following are some of the possible simple predicates that can be defined on SALARY. We must therefore refer to the conditions imposed on the definition of the global affinity measure AM, where CA(0, k) = 0. The second complication concerns the location of the block of attributes that must form one fragment.
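Horizontal fragmentation by simple predicates can be written out directly; the SAL cut-off of 30000 mirrors the kind of predicate used in these examples, while the tuples themselves are made up:

```python
# Horizontal fragmentation of SALARY by simple predicates on SAL.
SALARY = [
    {"TITLE": "Elect. Eng.", "SAL": 40000},
    {"TITLE": "Syst. Anal.", "SAL": 34000},
    {"TITLE": "Mech. Eng.",  "SAL": 27000},
    {"TITLE": "Programmer",  "SAL": 24000},
]

p1 = lambda t: t["SAL"] <= 30000   # simple predicate SAL <= 30000
p2 = lambda t: t["SAL"] > 30000    # simple predicate SAL > 30000

SAL1 = [t for t in SALARY if p1(t)]   # selection by p1
SAL2 = [t for t in SALARY if p2(t)]   # selection by p2

# Completeness and disjointness: every tuple lands in exactly one fragment.
assert len(SAL1) + len(SAL2) == len(SALARY)
print(len(SAL1), len(SAL2))  # 2 2
```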
{PNO, BUDGET}
The alternative solution is to recursively apply the binary partitioning algorithm to each of the fragments obtained during the previous iteration. The idea behind shifting is to move the block of attributes that should form a group to the upper left corner of the matrix, where it can be easily identified. With the addition of the shift operation, the complexity of the partitioning algorithm increases by a factor of n and becomes O(n²).
{PNO, PNAME, LOC}
Hybrid Fragmentation
In most cases, a simple horizontal or vertical fragmentation of a database schema will not be sufficient to satisfy the requirements of user applications. In this case, a vertical fragmentation may be followed by a horizontal one, or vice versa, producing a tree-structured partitioning (Figure 3.19). In the case of horizontal fragmentation, the process must stop when each fragment consists of only one tuple, whereas the termination point for vertical fragmentation is one attribute per fragment.
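A hybrid fragmentation of this kind can be sketched as a vertical split followed by a horizontal one; the PROJ tuples and the location predicate below are illustrative:

```python
# Hybrid fragmentation sketch: vertical first, then horizontal on one
# of the vertical fragments (tuples and predicates are illustrative).
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation", "BUDGET": 150000, "LOC": "Montreal"},
    {"PNO": "P2", "PNAME": "DB Develop.",     "BUDGET": 135000, "LOC": "New York"},
    {"PNO": "P3", "PNAME": "CAD/CAM",         "BUDGET": 250000, "LOC": "New York"},
]

def vfrag(rel, attrs):
    # Vertical fragment: project on attrs; the key PNO is kept in every
    # fragment so that the relation can be reconstructed by join.
    return [{a: t[a] for a in attrs} for t in rel]

PROJ1 = vfrag(PROJ, ["PNO", "BUDGET"])        # {PNO, BUDGET}
PROJ2 = vfrag(PROJ, ["PNO", "PNAME", "LOC"])  # {PNO, PNAME, LOC}

# Horizontal fragmentation of PROJ2 by location, yielding a tree.
PROJ2_1 = [t for t in PROJ2 if t["LOC"] == "Montreal"]
PROJ2_2 = [t for t in PROJ2 if t["LOC"] == "New York"]

print(len(PROJ2_1), len(PROJ2_2))  # 1 2
```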
Allocation
- Allocation Problem
- Information Requirements
- Allocation Model
- Solution Methods
Minimal cost. The cost function consists of the cost of storing each fragment Fi at a site Sj, the cost of querying Fi at site Sj, the cost of updating Fi at all sites where it is stored, and the cost of data communication. The second term of the objective function calculates the total cost of storing all duplicate copies of the fragment. In the remainder of this section we extend the components of this model based on the information requirements discussed in Section 3.4.2. The decision variable is xij, which is set to 1 if fragment Fi is stored at site Sj, and to 0 otherwise.
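The storage term of such a cost function, together with the decision variable, can be written out concretely; all fragment sizes and unit costs below are illustrative:

```python
# Storage-cost term of the allocation model, under the decision
# variable x[(f, s)] = 1 if fragment f is stored at site s
# (all figures are illustrative).
size = {"F1": 100, "F2": 200}          # fragment sizes
unit_cost = {"S1": 1.0, "S2": 0.5}     # storage cost per unit at each site
x = {("F1", "S1"): 1, ("F1", "S2"): 1, # F1 replicated at both sites
     ("F2", "S1"): 0, ("F2", "S2"): 1} # F2 stored only at S2

# Sum over all fragment/site pairs that actually store a copy.
storage_cost = sum(size[f] * unit_cost[s] * x[(f, s)]
                   for f in size for s in unit_cost)
print(storage_cost)  # 100*1.0 + 100*0.5 + 200*0.5 = 250.0
```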
Data Directory
On the other hand, keeping the directory up to date would be much more difficult, as multiple copies would need to be updated. Therefore, the choice must depend on the environment in which the system operates and must be made by balancing factors such as response time requirements, directory size, machine capacity at the sites, reliability requirements, and directory volatility (i.e., the amount of change that the database experiences, which would cause the directory to change).
Conclusion
However, a closer look reveals that isolating the two steps actually adds to the complexity of the allocation models. It would be more promising to formulate a methodology that better reflects the interdependence of the fragmentation and distribution decisions. Materialization refers to the reorganization of the distributed database to reflect the changes required for the redesign step.
Bibliographic Notes
Use the bond energy and vertical partitioning algorithms to obtain a vertical fragmentation of the set of attributes in A. Database integration, and the related problem of querying multiple databases (see Chapter 9), is only one part of the more general interoperability problem. The difference is that the GCS may not be the union of the local conceptual schemas (LCSs).
Bottom-Up Design Methodology
A mapping schema defines how to map the elements of each LCS to the elements of the GCS (Section 4.4). Example 4.1. To facilitate our discussion of global schema design in multidatabase systems, we will use an example that is an extension of the engineering database we have used throughout the book. Example 4.2. The mapping of the E-R model to the relational model is given in Figure 4.6. Note that we have renamed some of the attributes to ensure name uniqueness.
Schema Matching
- Schema Heterogeneity
- Linguistic Matching Approaches
- Constraint-based Matching Approaches
- Learning-based Matching
- Combined Matching Approaches
In the schema of Figure 4.4, however, the client is represented as an attribute of the PROJ entity. In structure-based approaches, structural similarities in the two schemas can be exploited to determine the similarity of schema elements. The similarity of two subtrees is determined recursively based on the similarity of the nodes in the subtrees.
Schema Integration
There is no implied priority for the integration order of schemas, and trade-offs, such as the best representation for data elements or the most understandable structure, can be made between all schemas rather than between a few. Iterative integration (Figure 4.12b) provides more flexibility (typically more information is available) and is more general (the number of schemas can be varied depending on the integrator's preferences). They reduce the potential integration complexity and lead to automation techniques, as the number of schemas to consider at each step is more manageable.
Schema Mapping
- Mapping Creation
- Mapping Maintenance
If there are multiple covers (a candidate set may participate in multiple covers), they are ranked in ascending order of the number of candidate sets in the cover. The last step of the algorithm builds a query qkj for each of the candidate sets in the cover selected in the previous step. The FROM clause contains all source relations referenced in rik and in the join paths determined in step 2 of the algorithm.
Data Cleaning
Many of the data-level operators compare individual tuples of two relations (from the same or different schemas) and decide whether they represent the same fact. There have been proposals for special techniques to deal with this effectively in data cleaning (e.g., [Chaudhuri et al., 2003]). Given the large amount of data to be handled, data-level cleaning is expensive and efficiency is a significant issue.
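A tuple-level matcher can be sketched with a generic string-similarity measure; the threshold and the use of Python's difflib here are illustrative choices, not the techniques proposed in the cited work:

```python
# Tuple-level matching sketch: decide whether two tuples represent the
# same real-world fact via average attribute-wise string similarity.
from difflib import SequenceMatcher

def same_fact(t1, t2, threshold=0.8):
    # Compare corresponding attributes and average their similarities.
    scores = [SequenceMatcher(None, str(a), str(b)).ratio()
              for a, b in zip(t1, t2)]
    return sum(scores) / len(scores) >= threshold

# A near-duplicate (trailing period dropped) is matched...
print(same_fact(("J. Doe", "Elect. Eng."), ("J. Doe", "Elect. Eng")))     # True
# ...while clearly different tuples are not.
print(same_fact(("J. Doe", "Elect. Eng."), ("M. Smith", "Syst. Anal."))) # False
```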
Conclusion
Bibliographic Notes
For each of the following queries, discuss which of the three local schemas (Football Player, Actor, and Politician) contribute to the global query result. For each of the following queries, discuss which of the three LCSs are needed for the global query result. For each of the five matches, indicate which of the following matching approaches is likely to identify the match.
View Management
- Views in Centralized DBMSs
- Views in Distributed DBMSs
- Maintenance of Materialized Views
The only effect of this statement is to store the view definition in the catalog. A view can be updated only if updates to the view can be propagated to the underlying relations without ambiguity. The main advantages of immediate refresh are that the view is always consistent with the underlying data and that read-only queries can be fast.
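Immediate refresh of a materialized view can be sketched as follows; the class and relation names are illustrative, and only insertions are handled in this sketch:

```python
# Materialized view with immediate refresh: every base-relation insert
# is propagated to the stored view at once, so the view is always
# consistent and read-only queries stay fast (illustrative names).
class MaterializedView:
    def __init__(self, base, predicate):
        self.predicate = predicate
        self.rows = [t for t in base if predicate(t)]  # materialized result

    def on_insert(self, t):
        # Immediate refresh: maintain the view as part of the update,
        # rather than recomputing it lazily at query time.
        if self.predicate(t):
            self.rows.append(t)

EMP = [{"ENO": "E1", "TITLE": "Syst. Anal."}]
SYSAN = MaterializedView(EMP, lambda t: t["TITLE"] == "Syst. Anal.")

EMP.append({"ENO": "E2", "TITLE": "Syst. Anal."})
SYSAN.on_insert(EMP[-1])   # propagated immediately with the update
print(len(SYSAN.rows))     # 2
```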