CS545Why.ppt 376KB Jun 23 2011 12:31:08 PM

(1)

Sep-01

CS545 Intro

1

September 2001

Gio Wiederhold

Stanford University

www-db.stanford.edu/people/gio.html

CS545 intro

(2)

Abstract

The distinction between storing data in files and in databases is that databases are intended to be used by multiple programs and types of users.

Databases have been available in various forms since 1958.

The major paper defining database functionality in a formal sense is due to Ted Codd, of IBM, published in 1970.

Information is created by applying knowledge (encoded as programs or rules) to collected data and messages received.

Data and computation resources are provided by a variety of suppliers, public and private. The number of potential suppliers and their autonomy also create information overload.

To cope with these issues novel intermediate services are needed, opening up new opportunities. Many traditional relationships among consumers and vendors will change.

The autonomy of the suppliers causes heterogeneity and inconsistencies. The semantics of diverse sources are captured by their ontologies, the collection of terms and their relationships as used in the domain of discourse for the source. When sources are to be related we rely on their ontologies to make the linkages. Creating a sound

(3)

Outline

• Motivation and Functions needed
• Early Inventions
• Architecture
• Formal basis
• Breadth of applicability
• Unsolved problems

(4)

Files versus Databases

Files: provide input and output for a program (transient)

• Devices: paper tape (ASCII), cards, magnetic tapes
• Examples:
1. FORTRAN: tapes 1-5 input, 5 standard in (80-column cards); tapes 6-7 output, 6 print (120 cols), 7 punch (80 cols); still visible in files, IBM VM OS
2. UNIX: standard in > standard out
3. Data-processing: in > > out = in > > out = in > > out ....

Databases: storage (persistent, reliable, random access)

• Enabled by disk technology, starting in 1960 (5MB)
• Many users, i.e., many (small) programs
• Example:

(5)

Files

• Files: a means for programs to store data for later use
– The initial program determines
1. what data are being stored (all? – memory dump [LISP])
2. how it is being stored – structure and format
3. when it is being stored and available
– Successor programs must follow these decisions
• often the successor program is another invocation of the initial program
• Problems
– One program requires a different structure than another: BOMP
– Data must be available rapidly, incrementally:
• class assignments
• seat reservations
• library checkout

(6)

Databases

• Data are intended to be used by many programs
– Often small – transactions
– Various subsets of all the relevant data
– Structural transformations: Bill-of-Materials Programs:
Input program: records parts being delivered (Supplier :> parts)
Output program: records parts being consumed (Products :> parts)
Inventory

(7)

BoMPs are common

• Supplier Parts Product-Assemblies

• Clinical-labs Observations Patient-Records

• Employees Salary & Tasks Productivity

• Accidents Reports Failure-Analysis

• Flights Seats Passengers

• Classes Grades Student-Performance

• . . .

Two directions / hierarchies needed for data access:

Data sources

Data consumption

Solutions?

(8)

Design Problem & Solutions

Conceptual model
• Supplier program:
– Use a hierarchy: supplier, parts supplied (1:n)
• Consumer program:
– Use a hierarchy: consumer, parts used (1:m)
Actual solution in memory: Matrix. If it exceeds memory, then either supplier or consumer part accesses become costly.
Actual solution beyond memory:
1. redundant transformed data
2. pointer and index structures

[Figure: matrix with supplier columns s1, s2, s3, ... sn and consumer rows c1, c2, c3, ... cm]
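The second solution on this slide (pointer and index structures) can be sketched as follows. This is only an illustrative sketch, not the lecture's implementation: the class `PartsStore`, its method names, and the sample suppliers, consumers, and parts are all hypothetical. It keeps one access path per direction, so that neither the supplier program nor the consumer program has to scan the full matrix.

```python
from collections import defaultdict


class PartsStore:
    """Sparse supplier/consumer data with one access path per direction."""

    def __init__(self):
        self.parts_by_supplier = defaultdict(set)  # supplier -> parts supplied (1:n)
        self.parts_by_consumer = defaultdict(set)  # consumer -> parts used (1:m)
        self.suppliers_by_part = defaultdict(set)  # inverted index: part -> suppliers

    def record_delivery(self, supplier, part):
        # The input program updates both the hierarchy and the inverted index.
        self.parts_by_supplier[supplier].add(part)
        self.suppliers_by_part[part].add(supplier)

    def record_consumption(self, consumer, part):
        self.parts_by_consumer[consumer].add(part)

    def suppliers_for(self, consumer):
        # Follow the consumer hierarchy, then the inverted index.
        found = set()
        for part in self.parts_by_consumer.get(consumer, ()):
            found |= self.suppliers_by_part.get(part, set())
        return found


store = PartsStore()
store.record_delivery("s1", "bolt")
store.record_delivery("s2", "bolt")
store.record_delivery("s2", "gear")
store.record_consumption("c1", "bolt")
store.record_consumption("c2", "gear")
print(sorted(store.suppliers_for("c1")))
```

The price of the extra access path is visible in `record_delivery`: every update has to maintain two structures, which is the redundancy the slide refers to.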

(9)

Factors influencing design

• Size --- memories are getting bigger, problems too
• Density of matrix:
– suppliers supply only some parts, overlapping
– products consume only some parts, overlapping
• Performance requirements:
– supplier response can be less critical
– airline seats made available versus seats being sold
– laboratory data obtained versus patient records needed
• Usage patterns:

(10)

DBMSs

Database Management Systems

• Collection of the software needed to manage databases
• Components:
– Storage management – intertwined with the operating system
– Query and update processor – uses the schema
– Schema interpreter and compiler
– Transaction management and concurrency control/protection – also jointly with OS
– Logger for backup
– Recovery programs
• Large, complex, not all features always needed

(11)

Inventions – 1 – Data Description
• Schemas [McGee, 1958]: program independence
= A symbolic description of each column, to be interpreted by update and retrieval programs as well as users
– Allows programs to use subsets
– Allows columns to be added without affecting current programs
• Compilation of Schemas [1975]
= Avoids interpretation cost
– Requires keeping track of last update for auto-recompile
• Views [Chamberlin et al., 1976]: bounded schemas
= Database administrator defines schema subset for user roles
– Can be compiled for fast execution
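A minimal sketch of why a schema gives program independence, assuming a toy table held as Python tuples. The column names, rows, and the helper `project` are hypothetical and only illustrate the idea: a program names the columns it wants, the schema resolves them to positions, and adding a column leaves the program's result unchanged.

```python
SCHEMA = ["part_no", "name", "supplier"]  # hypothetical column names

ROWS = [
    (101, "bolt", "s1"),
    (102, "gear", "s2"),
]


def project(schema, rows, wanted):
    """Return only the wanted columns, located by name through the schema."""
    positions = [schema.index(column) for column in wanted]
    return [tuple(row[p] for p in positions) for row in rows]


# An existing program uses a subset of the columns.
before = project(SCHEMA, ROWS, ["name", "part_no"])

# A column is added later; the existing program is not changed.
SCHEMA_V2 = SCHEMA + ["price"]
ROWS_V2 = [row + (price,) for row, price in zip(ROWS, (0.10, 2.50))]
after = project(SCHEMA_V2, ROWS_V2, ["name", "part_no"])

print(before == after)
```

Here the schema is interpreted on every call. Compiling it would amount to computing `positions` once and recomputing only when the schema changes.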

(12)

Inventions – 2 – Access Trees
• Indexes [Landauer 1963]: balanced trees
= Efficient ancillary access path
– Requires updating to stay current
• Multiple Indexes [DavisLin 1965]: multi-attribute-based access
= Multiple ancillary access paths
– Allows access by multiple paths
– Requires much updating to stay current
• B-trees [Bayer, 1972]: index updateability
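The trade-off on this slide (fast ancillary access paths that must be updated to stay current) can be illustrated with a small sketch. It is not a B-tree: as a stated simplification, a sorted Python list searched with the standard `bisect` module stands in for a balanced tree. The names `SortedIndex`, `add_record`, and the sample records are hypothetical.

```python
import bisect


class SortedIndex:
    """Ancillary access path: sorted (key, row_id) pairs, searched by bisection."""

    def __init__(self):
        self._entries = []

    def insert(self, key, row_id):
        # Every data update must also update the index to keep it current.
        bisect.insort(self._entries, (key, row_id))

    def lookup(self, key):
        # (key,) sorts before every (key, row_id), so this finds the first match.
        i = bisect.bisect_left(self._entries, (key,))
        row_ids = []
        while i < len(self._entries) and self._entries[i][0] == key:
            row_ids.append(self._entries[i][1])
            i += 1
        return row_ids


records = []                 # the data file; a row id is the list position
name_index = SortedIndex()   # first access path
dept_index = SortedIndex()   # second access path: twice the update work


def add_record(name, dept):
    row_id = len(records)
    records.append({"name": name, "dept": dept})
    name_index.insert(name, row_id)
    dept_index.insert(dept, row_id)
    return row_id


add_record("Ada", "CS")
add_record("Bob", "EE")
add_record("Cy", "CS")
print([records[i]["name"] for i in dept_index.lookup("CS")])
```

Inserting into a Python list shifts elements, so updates here cost linear time. Keeping updates cheap as the index grows is the updateability point credited to B-trees on the slide.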

(13)

Inventions – 3 – Structures
• Hierarchical Structures [IMS, 1963]: dense data structures
= Trees mapped to sequential structures for fast access to sparse data
– Fast access when many related values are needed
– Costly to update, often done periodically
– Must be combined with trees for multiple-access paths
• Triple storage [Feldman, 1969]: arbitrary structures
= All data represented by object-attribute-value entries
– High cost when many related values are needed
Note that these two conflict – in today's database
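A minimal sketch of triple storage, assuming an in-memory Python set of (object, attribute, value) tuples. The helper names and sample data are hypothetical. It shows both sides of the trade-off: any attribute can be attached to any object without a schema change, but reassembling all values of one object requires touching many entries.

```python
triples = set()


def put(obj, attribute, value):
    """Store one object-attribute-value entry; no schema is consulted."""
    triples.add((obj, attribute, value))


def get_record(obj):
    """Reassemble one object's values; this unindexed sketch scans every triple."""
    return {a: v for (o, a, v) in triples if o == obj}


put("part101", "name", "bolt")
put("part101", "supplier", "s1")
put("part102", "name", "gear")
put("part101", "color", "grey")   # a new attribute needs no structural change

print(get_record("part101"))
```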

(14)

Inventions – 4 – Model Foodfight
• Relational Model [Codd 1970]
= Tabular model, with an algebraic set of operations, normalization
– Formalization enabled understanding, dissemination
– No inter-relation semantics; these are specified when the query is made
– Later constraints were added, implicitly defining keys, connections
• Hierarchical (also applied to one view of BOMPs)
= Describes hierarchical connections among data records, no algebra
– An attempt to describe earlier, simple implementations in model terms
• Network – generalization of BOMP
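The "algebraic set of operations" of the relational model can be sketched with three small functions over lists of dictionaries. This is an illustrative toy, not a DBMS; the relation names `supplies` and `uses` and their contents are hypothetical. Note that the connection between the two relations is not stored anywhere: it is stated only when the query (the join on `part`) is made.

```python
def select(relation, predicate):
    """Keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """Keep only the named attributes, dropping duplicate rows."""
    result = []
    for row in relation:
        reduced = {a: row[a] for a in attributes}
        if reduced not in result:
            result.append(reduced)
    return result


def join(left, right, attribute):
    """Combine rows from both relations that agree on the attribute."""
    return [{**l, **r} for l in left for r in right if l[attribute] == r[attribute]]


supplies = [
    {"supplier": "s1", "part": "bolt"},
    {"supplier": "s2", "part": "gear"},
]
uses = [
    {"product": "c1", "part": "bolt"},
    {"product": "c1", "part": "gear"},
    {"product": "c2", "part": "gear"},
]

linked = join(supplies, uses, "part")
c1_suppliers = project(select(linked, lambda r: r["product"] == "c1"), ["supplier"])
print(c1_suppliers)
```

Because each operation takes relations and returns a relation, expressions can be rearranged freely, which is what makes the optimization mentioned on the next slide possible.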

(15)

Why did the relational model win?

• Relational Model DBMSes: Sequel, QUEL, SQL
– Formality – allowed essential optimization algorithms
– Restrictions – such as normalization, provide guidance
– Teachability – exposed principles:
• can't teach only from examples
– DBMS independence – safety blanket for mission-critical users
• But implementations added features
• Use least common set of features?
– Hard to enforce once a system has been bought
• Few suppliers remain {Oracle, IBM, MS, MySQL}
• ER model [Chen, 1976]
= Focuses on design, can be mapped to multiple implementations
– Few tools for direct translation

(16)

Databases and the Web

• HTML presentation: HyperText Markup Language
= Data are transformed for human consumption, external refs
– Often hierarchical – object-oriented view
– If there was a schema, it is now hidden
• XML presentation
= Schema data is embedded
– Much flexibility
– Much more space when entries are small
– Requires interpretation for viewing, e.g., by XSLT
• RDF: Resource Description Framework
= Triple representation: object-attribute-value
– Great flexibility
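The space cost of embedding schema information in every entry can be seen in a small comparison, sketched here with the Python standard library (`xml.etree.ElementTree`, `csv`). The sample rows and the element names `parts` and `item` are hypothetical; the point is only that XML repeats the column names for each entry, while a tabular form states them once.

```python
import csv
import io
import xml.etree.ElementTree as ET

rows = [{"part": "bolt", "qty": "4"}, {"part": "gear", "qty": "1"}]


def to_xml(rows):
    """Each value is wrapped in a tag naming its column: the schema is embedded."""
    root = ET.Element("parts")
    for row in rows:
        item = ET.SubElement(root, "item")
        for name, value in row.items():
            ET.SubElement(item, name).text = value
    return ET.tostring(root, encoding="unicode")


def to_csv(rows):
    """Column names appear once, in the header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["part", "qty"])
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


xml_text = to_xml(rows)
csv_text = to_csv(rows)
print(len(xml_text), len(csv_text))
```

In exchange, the XML text is self-describing: a reader with no prior schema can still recover which value belongs to which field.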

(17)

Information overload

Data starvation

• More databases
– public & corporate
• Faster communication
– digital
– packeting: TCP-IP, ATM
• World-wide connectivity
– Internet & Intranets
– world-wide web
• Disintermediation

(18)

Change in Supply vs Demand

What information consumes is rather obvious, it consumes the attention of its recipients.

Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.

(19)

Making data relevant

• Data reduction

• Data abstraction

– Level changing
– Summarization
– Exception search

– Level change to integrate with other data sources
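Two of the abstraction steps listed above, summarization and exception search, can be sketched in a few lines. The readings, the grouping key, and the normal range used below are hypothetical values chosen for illustration; the lecture does not prescribe any particular method.

```python
# Hypothetical (patient, lab value) observations.
readings = [("p1", 50), ("p1", 54), ("p2", 48), ("p2", 98), ("p3", 50)]


def summarize(rows):
    """Summarization: replace many values per key by their average."""
    groups = {}
    for key, value in rows:
        groups.setdefault(key, []).append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


def exceptions(rows, low, high):
    """Exception search: keep only the values outside the expected range."""
    return [(key, value) for key, value in rows if not (low <= value <= high)]


print(summarize(readings))
print(exceptions(readings, low=40, high=60))
```

Both functions reduce the volume the recipient has to attend to: the first by raising the level of abstraction, the second by passing on only what is unusual.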

(20)

Data and Knowledge

Information is created at the confluence of data -- the state &

(21)

Transforming Data to Information

Application Layer
Mediation Layer
Foundation Layer: data and simulation resources

(22)

Functions inside Mediation
Selection
Summarize
Transform
Integration
Heterogeneous resources

(23)

Function of Mediation

Apply domain-specific specialist knowledge to add value
• to locate data sources
• to convert for consistency
• to integrate from diverse sources
• to describe data for processing
• to abstract for insight / models
• to extrapolate to new situations
• to summarize for presentation
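Three of the functions listed above (convert for consistency, integrate from diverse sources, summarize for presentation) can be sketched as a tiny mediator. Everything here is a hypothetical illustration: the two sources, their field names, and the unit difference are assumptions, and a real mediator would encode far more domain knowledge.

```python
# Two hypothetical autonomous sources reporting the same kind of data differently.
SOURCE_A = [{"site": "north", "temp_f": 68.0}, {"site": "south", "temp_f": 86.0}]
SOURCE_B = [{"location": "east", "temp_c": 25.0}]


def convert_a(row):
    """Convert for consistency: Fahrenheit to Celsius, common field names."""
    return {"site": row["site"], "temp_c": (row["temp_f"] - 32.0) * 5.0 / 9.0}


def convert_b(row):
    """Convert for consistency: rename 'location' to the common 'site'."""
    return {"site": row["location"], "temp_c": row["temp_c"]}


def mediate():
    # Integrate from diverse sources into one consistent collection.
    integrated = [convert_a(r) for r in SOURCE_A] + [convert_b(r) for r in SOURCE_B]
    # Summarize for presentation to the application layer.
    warmest = max(integrated, key=lambda r: r["temp_c"])
    return {"sites": len(integrated), "warmest": warmest["site"]}


print(mediate())
```

The knowledge that source A reports Fahrenheit and calls its key `site` lives only in `convert_a`, so a change in that source touches one small function rather than every application.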

(24)

Environmental Restoration at INEL
Undoing 50 years of messes
[Figure: mediation architecture; labels: MQL [ISX], MSL [Stanford], OQL [ODMG], QEM mediator, CORBA, other mediators, QEM, OEM]
Idaho National Engineering Laboratory, June 1998
LOCKHEED MARTIN ISX - Stanford Univ.

(25)

From Schemas to Ontologies

Ontologies allow communication among partners in enterprises (rarely in machine-readable form)
Relationships determine meaning - parent, school, company
Databases use ontologies during design in their E-R diagrams (implicitly) and to represent the leaf nodes in their schemas.
Variable and Class names in Software
Knowledge-bases use term ontologies (often explicitly), add class definition (to hold instances),

(26)

Ontology: components

We represent the contents and structure of a language by its ontology:
• a set of well-defined terms, which delimit the domain of discourse
• relationships among those terms, chosen from a limited set

(27)

Heterogeneity among Domains

If interoperation involves distinct domains, mismatch ensues
• Autonomy conflicts with consistency,
– Local Needs have Priority,
– Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems
• Representation and Access Conventions

(28)

Unsolved problem in Interoperation

Common assumption in assembling and integrating distributed information resources
• The language used by the resources is the same
• Sublanguages used by the resources are subsets of a globally consistent language
This assumption is provably false.
Working towards the goal of global consistency is
1. naïve -- the goal cannot be achieved

(29)

Large Ontologies: good or bad?



Have all the Knowledge together
+ simple for customers of KBs
– hard for owners of KBs, must synchronize with many others
– in the limit -- everybody must be globally consistent
Large KB will cover multiple / all domains
– created by a committee -- slow
– maintained by a committee – costly to impossible
Differences in level of abstraction -- efficiency
homeowner: nail

(30)

Evolution of mediation

[Figure: diagram with node labels W1-W3, D1, D2, D4-D6, I1, I2, M1, M2, A1, A2, A4, A5]

(31)

Definition*

A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications.

It should be small and simple, so that it can be maintained by one expert or, at most, a small and coherent group of experts.

(32)

Interfaces

Application - Mediator: {OQL, KQML, ...}
Mediator - Data sources: {SQL, TQL, XML, … }
Data - real world: {sensors, clerks, … }
Human - Computer: {x-widgets, HTML}

(33)

An Integration Architecture

Client

Application

business reports

portfolios for each company

stock market prices

Wrapper

Ticker

Tape

Dialog

(34)

Status of Mediation Technology

Today
• Handcrafted
• Expert consults with programmer
• Programmer codes the knowledge needed
• Resource changes require advice, program update
Future
• Generated from models
• Domain Expert maintains models
• Specification determines functions
• Resource changes

(35)

A mediator is not static software:

Knowledge ages

Application Interface

Resource Interfaces

Owner / Creator Maintainer

Lessor - Seller Advertiser

Changes of user needs

Domain changes

Resource changes Models, programs,

rules, caches, . . .

(36)

Domain Specialization

• Knowledge Acquisition (20% effort) &
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
Empowerment: autonomously maintainable

(37)

Roles

Computer Scientists
• Provide tools
– adaptation
– integration
– matching
– composing
• Assess Standards
• Assure scalability
Domain Experts
• Learn to use the tools
• Select resources
• Assess their value
• Rank their quality
• Resolve semantics
• Get client feedback

(38)

Mediation Research Topics

• Mediator management and maintenance

• Representation of knowledge and customer models

• Balancing dynamic and warehouse solutions

• Formalization of semantic heterogeneities
– many levels and types
– roles for wrappers vs. mediators vs. applications
– scalability by partitioning -- make it simple!
– Domain Ontologies --- tools, validation, . . .

• Effect of object paradigm and method-based access

• Service and business models

(39)

Integration Science
Artificial Intelligence: knowledge mgmt, domain expertise, uncertainty
Systems Engineering: analysis, documentation, costing
Databases: access, storage, algebras
Long Range Science Vision
Integration Methods

(40)

Fat versus thin mediators

• Too broad: hard to maintain, needs a committee
• Too thin: insufficient added value
• Too fat: hard to compose
• Too narrow: few customers
Axes: domain scope, service scope

(41)

Maintenance is good for you

[Figure: relative annual maintenance cost (0-100%) versus lifetime in years (1-13) for automobile, hardware, and software; depreciation = 1 / lifetime]

(42)

Client-Server Architecture

Client system

data and simulation resources
Fast build of clients by resource reuse

(43)

Systems with Mediators

Applications . . . .

Mediators . . . .

Data Resources . . .

(44)

Growth through

Reuse

New Application

Prior & Revised

Mediators

Extended Data

Resources

(45)

Linear O(n) Cost of Growth -- now O(n^2)
• Data changes only affect some mediators; only in their domain
• Mediators can
1. supply old information to n-1 prior applications
2. provide better information to the new application
3. be partially or completely reused
• New applications, using the new data, can be developed and inserted dynamically
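One way to read the contrast between linear and quadratic growth is to count the interfaces that must be built and maintained. The counting model below is an assumption made for illustration, not something the slide states: without mediation every application is connected directly to every resource, while with a single mediator each application and each resource needs only one connection. The function names are hypothetical.

```python
def direct_links(applications, resources):
    """Every application maintains its own interface to every resource."""
    return applications * resources


def mediated_links(applications, resources):
    """Each application and each resource connects once, to the mediator."""
    return applications + resources


for n in (2, 10, 50):
    print(n, direct_links(n, n), mediated_links(n, n))
```

With n applications and n resources the direct count is n^2 and the mediated count is 2n, so adding one more application costs n new interfaces in the first case and one in the second.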




(46)

Assigning maintenance responsibility

a. Source data quality – supplier database, files, or web pages
b. Interface to the source – wrapper, supplier or vendor for supplier
c. Source selection – expert specialist in mediator
d. Source quality assessment – customer input to mediator
e. Semantic interoperation – specialist group providing input to the mediator
f. Consistency and metadata information – mediator service operation or warehouse
g. Informal, pragmatic integration – client services with customer input
h. User presentation formats – client services with customer input

Services

Sources