Sep-01
CS545 Intro
1
September 2001
Gio Wiederhold
Stanford University
www-db.stanford.edu/people/gio.html
CS545 intro
Abstract
The distinction between storing data in files and in databases is that databases are intended to be used by multiple programs and multiple types of users.
Databases have been available in various forms since 1958.
The major paper defining database functionality in a formal sense is due to Ted Codd, of IBM, published in 1970.
Information is created by applying knowledge (encoded as programs or rules) to collected data and messages received.
Data and computation resources are provided by a variety of suppliers, public and private. The number of potential suppliers and their autonomy also create information overload.
To cope with these issues, novel intermediate services are needed, opening up new opportunities. Many traditional relationships among consumers and vendors will change.
The autonomy of the suppliers causes heterogeneity and inconsistencies. The semantics of diverse sources are captured by their ontologies: the collections of terms and their relationships as used in the domain of discourse for each source. When sources are to be related, we rely on their ontologies to make the linkages. Creating a sound
Outline
• Motivation and Functions needed
• Early Inventions
• Architecture
• Formal basis
• Breadth of applicability
• Unsolved problems
Files versus Databases
Files: provide input and output for a program (transient)
• Devices: Paper tape (ASCII), Cards, Magnetic Tapes
• Examples:
1. FORTRAN: tapes 1-5 input, 5 standard in (80 column cards); tapes 6-7 output, 6 print (120 cols), 7 punch (80 cols); still visible in files, IBM VM OS
2. UNIX: standard in > standard out
3. Data-processing: in > > out = in > > out = in > > out ....
Databases: storage (persistent, reliable, random access)
• Enabled by disk technology, starting in 1960 (5MB)
• Many users, i.e., many (small) programs
• Example:
Files
• Files: a means for programs to store data for later use
– The initial program determines
1. what data are being stored (all? – memory dump [LISP])
2. how it is being stored – structure and format
3. when it is being stored and available
– successor programs must follow these decisions
• often the successor program is another invocation of the initial program
• Problems
– One program requires a different structure than another: BOMP
– Data must be available rapidly, incrementally:
• Class-assignments
• seat reservations
• library checkout
Databases
• Data are intended to be used by many programs
– Often small – transactions
– Various subsets of all the relevant data
– Structural transformations: Bill-of-Materials Programs:
Input program
Records parts being delivered
Supplier :> parts
Output program
Records parts being consumed
Products :> parts
Inventory
BoMPs are common
• Supplier Parts Product-Assemblies
• Clinical-labs Observations Patient-Records
• Employees Salary & Tasks Productivity
• Accidents Reports Failure-Analysis
• Flights Seats Passengers
• Classes Grades Student-Performance
• . . .
Two directions / hierarchies needed for data access:
Data sources
Data consumption
Solutions?
Design Problem & Solutions
Conceptual model
• Supplier program:
– Use a hierarchy: supplier, parts supplied (1:n)
• Consumer program:
– Use a hierarchy: consumer, parts used (1:m)
Actual solution in memory: Matrix: if it exceeds memory then either supplier or consumer part accesses become costly
Actual solution beyond memory:
1. redundant transformed data
2. pointer and index structures
[Figure: matrix with supplier columns s1, s2, s3, ... sn and consumer rows c1, c2, c3, ... cm]
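The "pointer and index structures" option can be sketched in a few lines. This is only an illustration under assumed data: the supplier, consumer, and part names are hypothetical, and in-memory dictionaries stand in for whatever storage structures a real system would use.

```python
from collections import defaultdict

# Hypothetical delivery and usage records; all names are illustrative only.
deliveries = [("s1", "bolt"), ("s1", "nut"), ("s2", "bolt"), ("s3", "gear")]
usages = [("c1", "bolt"), ("c1", "gear"), ("c2", "nut")]

def build_index(pairs):
    """Map each owner (supplier or consumer) to its parts,
    and each part back to its owners."""
    forward, inverse = defaultdict(set), defaultdict(set)
    for owner, part in pairs:
        forward[owner].add(part)
        inverse[part].add(owner)
    return forward, inverse

parts_of_supplier, suppliers_of_part = build_index(deliveries)
parts_of_consumer, consumers_of_part = build_index(usages)

def suppliers_for_consumer(consumer):
    """Cross the two hierarchies: which suppliers feed a given consumer?"""
    found = set()
    for part in parts_of_consumer[consumer]:
        found |= suppliers_of_part[part]
    return found
```

Both directions of access stay cheap because each hierarchy has its own index; the price is that every new record updates two structures.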
Factors influencing design
• Size --- memories are getting bigger, problems too
• Density of matrix:
– suppliers supply only some parts, overlapping
– products consume only some parts, overlapping
• Performance requirements:
– supplier response can be less critical
– airline seats made available versus seats being sold
– laboratory data obtained versus patient records needed
• Usage patterns:
DBMSs
Database Management Systems
• Collection of the software needed to manage databases
• Components:
– Storage management – intertwined with the operating systems
– Query and update processor – uses the schema
– Schema interpreter and compiler
– Transaction management and concurrency control/protection – also jointly with OS
– Logger for backup
– Recovery programs
• Large, complex, not all features always needed
Inventions – 1 - Data Description
• Schemas [McGee, 1958]: program independence
= A symbolic description of each column, to be interpreted by update and retrieval programs as well as users
– Allows programs to use subsets
– Allows columns to be added without affecting current programs
• Compilation of Schemas [1975]
= avoids interpretation cost
– requires keeping track of last update for auto-recompile
• Views [Chamberlin et al., 1976]: Bounded schemas
= Database administrator defines schema subset for user roles
– Can be compiled for fast execution
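A minimal sketch of program independence through a schema, assuming a hypothetical parts table. The column names, the `fetch` helper, and the `clerk` role are all invented for illustration and do not come from any particular DBMS.

```python
# Hypothetical schema: a symbolic description of each column.
schema = ["part_no", "name", "qty"]
rows = [(101, "bolt", 500), (102, "nut", 750)]

def fetch(rows, schema, wanted):
    """Retrieve a subset of columns by name, using the schema to locate them."""
    positions = [schema.index(col) for col in wanted]
    return [tuple(row[p] for p in positions) for row in rows]

def stock_report(rows, schema):
    """A program that only knows about two of the columns."""
    return fetch(rows, schema, ["name", "qty"])

before = stock_report(rows, schema)

# Adding a column changes the schema and the stored rows, not the program.
schema2 = schema + ["supplier"]
rows2 = [row + ("s1",) for row in rows]
after = stock_report(rows2, schema2)

# A view: a named schema subset defined for a user role.
views = {"clerk": ["part_no", "name"]}
clerk_rows = fetch(rows2, schema2, views["clerk"])
```

Here the schema is interpreted on every call; compiling it would amount to computing `positions` once and recomputing only when the schema changes.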
Inventions – 2 – access trees
• Indexes [Landauer 1963]: balanced trees
= Efficient ancillary access path
– Requires updating to stay current
• Multiple Indexes [Davis, Lin 1965]: multi-attribute-based access
= Multiple ancillary access paths
– Allows access by multiple paths
– Requires much updating to stay current
• B-trees [Bayer, 1972]: Index updateability
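The trade-off between multiple access paths and update cost can be sketched as follows. This is an assumption-laden illustration: the class name `IndexedTable` is hypothetical, and hash maps stand in for the balanced trees or B-trees named above, so ordered range access is not shown.

```python
from collections import defaultdict

class IndexedTable:
    """Records in a list, plus one ancillary index per chosen attribute."""

    def __init__(self, indexed_attrs):
        self.records = []
        self.indexes = {attr: defaultdict(list) for attr in indexed_attrs}
        self.index_updates = 0  # counts index maintenance work

    def insert(self, record):
        pos = len(self.records)
        self.records.append(record)
        # Every index must be updated to stay current.
        for attr, index in self.indexes.items():
            index[record[attr]].append(pos)
            self.index_updates += 1

    def lookup(self, attr, value):
        """Follow one ancillary access path instead of scanning all records."""
        return [self.records[p] for p in self.indexes[attr].get(value, [])]

table = IndexedTable(["supplier", "part"])
table.insert({"supplier": "s1", "part": "bolt", "qty": 500})
table.insert({"supplier": "s2", "part": "bolt", "qty": 200})
table.insert({"supplier": "s1", "part": "nut", "qty": 750})
bolts = table.lookup("part", "bolt")
```

Three inserts into a table with two indexes cost six index updates, which is the "requires much updating" point in miniature.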
Inventions – 3 - structures
• Hierarchical Structures [IMS, 1963]: Dense data structures
= Trees mapped to sequential structures for fast access to sparse data
– Fast access when many related values are needed
– Costly to update, often done periodically
– Must be combined with trees for multiple-access paths
• Triple storage [Feldman, 1969]: Arbitrary structures
= All data represented by object-attribute-value entries
– High cost when many related values are needed
Note that these two conflict – in today's database
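A small sketch of triple storage, assuming hypothetical part data. It is meant only to show why arbitrary structure is easy and why gathering many related values is costly; it does not reproduce any specific historical system.

```python
# Hypothetical object-attribute-value entries.
triples = [
    ("p101", "name", "bolt"),
    ("p101", "qty", 500),
    ("p101", "supplier", "s1"),
    ("p102", "name", "nut"),
]

def value_of(obj, attr):
    """Return the first value stored for (obj, attr), or None."""
    for o, a, v in triples:
        if o == obj and a == attr:
            return v
    return None

def record_of(obj):
    """Assemble all attributes of one object; this scans every triple."""
    return {a: v for o, a, v in triples if o == obj}

# Arbitrary structure: a new attribute needs no schema change.
triples.append(("p102", "color", "zinc"))
```

Reassembling a full record touches the whole store unless extra indexes are added, whereas a dense hierarchical record would deliver the same values in one sequential read.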
Inventions – 4 – model foodfight
• Relational Model [Codd 1970]
= tabular model, with an algebraic set of operations, normalization
– Formalization enabled understanding, dissemination
– No inter-relation semantics, specified when query is made
– Later constraints were added, implicitly defining keys, connections
• Hierarchical (also applied to one view of BOMPs)
= describe hierarchical connections among data records, no algebra
– An attempt to describe earlier, simple implementations in model terms
• Network – generalization of BOMP
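To make the algebraic side of the relational model concrete, here is a minimal sketch of selection, projection, and natural join over relations held as lists of dictionaries. The relation and attribute names are hypothetical, and the implementation favors clarity over efficiency.

```python
def select(relation, predicate):
    """Keep the rows that satisfy the predicate."""
    return [row for row in relation if predicate(row)]

def project(relation, attrs):
    """Keep only the named attributes, dropping duplicate rows."""
    seen, out = set(), []
    for row in relation:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out

def natural_join(r, s):
    """Combine rows that agree on all attributes the two relations share."""
    if not r or not s:
        return []
    common = [a for a in r[0] if a in s[0]]
    return [{**x, **y} for x in r for y in s
            if all(x[a] == y[a] for a in common)]

supplies = [{"supplier": "s1", "part": "bolt"},
            {"supplier": "s2", "part": "bolt"},
            {"supplier": "s3", "part": "gear"}]
uses = [{"product": "c1", "part": "bolt"},
        {"product": "c1", "part": "gear"},
        {"product": "c2", "part": "nut"}]

# The connection between the two relations is stated only in the query.
c1_parts = select(uses, lambda row: row["product"] == "c1")
c1_suppliers = project(natural_join(c1_parts, supplies), ["supplier"])
```

Nothing in `supplies` or `uses` records how they relate; the join on `part` is chosen when the query is written, which is the "no inter-relation semantics" point above.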
Why did the relational model win?
• Relational Model DBMSes: Sequel, QUEL, SQL
– Formality – allowed essential optimization algorithms
– Restrictions – such as normalization, provide guidance
– Teachability – exposed principles:
• can't teach only from examples
– DBMS independence – safety blanket for mission-critical users
• But implementations added features
• Use least common set of features?
– Hard to enforce once a system has been bought
• Few suppliers remain {Oracle, IBM, MS, MySQL}
• ER model [Chen, 1976]
= Focuses on design, can be mapped to multiple implementations
– Few tools for direct translation
Databases and the Web
• HTML presentation: HyperText Markup Language
= Data are transformed for human consumption, external refs
– Often hierarchical – object-oriented view
– If there was a schema, it is now hidden
• XML presentation
= Schema data is embedded
– Much flexibility
– Much more space when entries are small
– Requires an interpreter for viewing, such as XSLT
• RDF: Resource Description Framework
= Triple representation: object-attribute-value
– Great flexibility
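The space cost of embedding schema data in every entry can be seen with a small comparison. This sketch uses Python's standard `xml.etree.ElementTree`; the element names `parts` and `part` and the sample rows are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows with small entries.
rows = [{"part_no": "101", "qty": "5"}, {"part_no": "102", "qty": "7"}]

def to_delimited(rows):
    """Schema stated once in a header line, values in delimited rows."""
    cols = list(rows[0])
    lines = [",".join(cols)] + [",".join(r[c] for c in cols) for r in rows]
    return "\n".join(lines)

def to_xml(rows):
    """Schema embedded: every value carries its own tag."""
    root = ET.Element("parts")
    for r in rows:
        item = ET.SubElement(root, "part")
        for col, val in r.items():
            ET.SubElement(item, col).text = val
    return ET.tostring(root, encoding="unicode")

flat, tagged = to_delimited(rows), to_xml(rows)
overhead = len(tagged) / len(flat)
```

The tagged form repeats each column name twice per value, so the ratio grows as the values themselves get shorter. In exchange, each entry is self-describing and rows need not share the same columns.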
Information overload
Data starvation
• More databases
– public & corporate
• Faster communication
– digital
– packeting: TCP-IP, ATM
• World-wide connectivity
– Internet & Intranets
– world-wide web
• Disintermediation
Change in Supply vs Demand
What information consumes is rather obvious, it consumes the attention of its recipients.
Hence a wealth of information creates a poverty of attention, and a need to allocate that attention efficiently among the overabundance of information sources that might consume it.
Making data relevant
• Data reduction
• Data abstraction
– Level changing
– Summarization
– Exception search
– Level change to integrate with other data sources
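Summarization and exception search can be illustrated with a few lines. The readings, group names, and the threshold of 120 are hypothetical values chosen only for the sketch.

```python
from statistics import mean

# Hypothetical detailed readings, grouped by ward.
readings = {"ward_a": [98, 100, 102], "ward_b": [97, 140, 99]}

def summarize(groups):
    """Summarization: replace each group of values by its mean."""
    return {name: mean(vals) for name, vals in groups.items()}

def exceptions(groups, limit):
    """Exception search: report only the values above the limit."""
    return [(name, v) for name, vals in groups.items()
            for v in vals if v > limit]

summary = summarize(readings)
alerts = exceptions(readings, limit=120)
```

The summary alone hides the outlier in `ward_b`, which is why exception search is listed next to summarization and does not follow from it.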
Data and Knowledge
Information is created at the confluence of data -- the state &
Transforming Data to Information
Application Layer
Mediation Layer
Foundation Layer
data and simulation resources
Functions inside Mediation
Selection
Summarize
Transform
Integration
Heterogeneous resources
Function of Mediation
Apply Domain-specific Specialist Knowledge to add value
• to locate data sources
• to convert for consistency
• to integrate from diverse sources
• to describe data for processing
• to abstract for insight / models
• to extrapolate to new situations
• to summarize for presentation
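A few of these functions (convert for consistency, integrate, summarize) can be sketched with two hypothetical price sources. The source layouts, the wrapper functions, and the `mediate` function are all assumptions for illustration, not an actual mediator design.

```python
# Hypothetical sources that disagree on field names and units.
source_a = [{"item": "bolt", "price_dollars": 2.0},
            {"item": "gear", "price_dollars": 10.0}]
source_b = [{"article": "bolt", "price_cents": 150}]

def wrap_a(rows):
    """Wrapper for source A: rename fields into the common form."""
    return [{"item": r["item"], "price": r["price_dollars"], "source": "a"}
            for r in rows]

def wrap_b(rows):
    """Wrapper for source B: rename fields and convert cents to dollars."""
    return [{"item": r["article"], "price": r["price_cents"] / 100, "source": "b"}
            for r in rows]

def mediate(item):
    """Integrate the converted rows and summarize the offers for one item."""
    rows = [r for r in wrap_a(source_a) + wrap_b(source_b) if r["item"] == item]
    if not rows:
        return None
    best = min(rows, key=lambda r: r["price"])
    return {"item": item, "offers": len(rows),
            "best_price": best["price"], "best_source": best["source"]}

bolt_summary = mediate("bolt")
```

The domain knowledge lives in the wrappers and in the choice of summary (lowest price); an application calling `mediate` sees neither the sources nor their conventions.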
Environmental Restoration at INEL
Undoing 50 years of messes ….
[Figure: mediator architecture. Labels: MQL [ISX], MSL [Stanford], OQL [ODMG]; QEM mediator; CORBA; other mediators; OEM]
Idaho National Engineering Laboratory, June 1998. LOCKHEED MARTIN ISX - Stanford Univ.
From Schemas to Ontologies
Ontologies allow communication among partners in enterprises (rarely in machine-readable form)
Relationships determine meaning - parent, school, company
Databases use ontologies during design in their E-R diagrams (implicitly) and to represent the leaf nodes in their schemas.
Variable and Class names in Software
Knowledge-bases use term ontologies (often explicitly), add class definition (to hold instances),
Ontology: components
We represent the contents and structure of a language by its ontology:
• a set of well-defined terms, which delimit the domain of discourse
• relationships among those terms, chosen from a limited set
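The two components can be sketched as a tiny data structure. The class name `Ontology`, the relationship set `is_a` / `part_of`, and the hardware terms are hypothetical choices made for illustration.

```python
# Assumed limited set of relationship types.
ALLOWED_RELATIONSHIPS = {"is_a", "part_of"}

class Ontology:
    """A set of terms plus relationships drawn from a limited set."""

    def __init__(self):
        self.terms = set()
        self.links = set()

    def add_term(self, term):
        self.terms.add(term)

    def relate(self, subject, relationship, obj):
        if relationship not in ALLOWED_RELATIONSHIPS:
            raise ValueError(f"unknown relationship: {relationship}")
        if subject not in self.terms or obj not in self.terms:
            raise KeyError("both terms must be defined first")
        self.links.add((subject, relationship, obj))

    def related(self, subject, relationship):
        """Terms reachable from subject by one link of the given type."""
        return {o for s, r, o in self.links
                if s == subject and r == relationship}

hardware = Ontology()
for term in ("fastener", "nail", "bolt"):
    hardware.add_term(term)
hardware.relate("nail", "is_a", "fastener")
hardware.relate("bolt", "is_a", "fastener")
```

Rejecting undefined terms is what lets the term set delimit the domain of discourse; rejecting unknown relationship types keeps the structure simple enough to process uniformly.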
Heterogeneity among Domains
If interoperation involves distinct domains, mismatch ensues
• Autonomy conflicts with consistency,
– Local Needs have Priority,
– Outside uses are a Byproduct
Heterogeneity must be addressed
• Platform and Operating Systems
• Representation and Access Conventions
Unsolved problem in Interoperation
Common assumption in assembling and integrating distributed information resources
• The language used by the resources is the same
• Sublanguages used by the resources are subsets of a globally consistent language
This assumption is provably false.
Working towards the goal of global consistency is
1. naïve -- the goal cannot be achieved
Large Ontologies: good or bad?
Have all the Knowledge together
+ simple for customers of KBs
– hard for owners of KBs, must synchronize with many others
– in the limit -- everybody must be globally consistent
Large KB will cover multiple / all domains
created by a committee -- slow
maintained by a committee – costly to impossible
Differences in level of abstraction -- efficiency
homeowner: nail
Evolution of mediation
[Figure: network of nodes labeled W1, W2, W3; D1, D2, D4, D5, D6; I1, I2; M1, M2; A1, A2, A4, A5]
Definition*
A mediator is a software module that exploits encoded knowledge about certain sets or subsets of data to create information for a higher layer of applications.
It should be small and simple, so that it can be maintained by one expert or, at most, a small and coherent group of experts.
Interfaces
Application / Mediator: {OQL, KQML, ...}
Mediator / Data sources: {SQL, TQL, XML, … }
Data / real world: {sensors, clerks, … }
Human / Computer: {x-widgets, HTML}
An Integration Architecture
Client Application
business reports
portfolios for each company
stock market prices
Wrapper
Wrapper
Ticker Tape
Dialog
Status of Mediation Technology
Today
• Handcrafted
• Expert consults with programmer
• Programmer codes the knowledge needed
• Resource changes require advice, program update
Future
• Generated from models
• Domain Expert maintains models
• Specification determines functions
• Resource changes
A mediator is not static software:
Knowledge ages
Application Interface
Resource Interfaces
Owner / Creator Maintainer
Lessor - Seller Advertiser
Changes of user needs
Domain changes
Resource changes Models, programs,
rules, caches, . . .
Domain Specialization
• Knowledge Acquisition (20% effort)
&
• Knowledge Maintenance (80% effort *)
to be performed by
• Domain specialists
• Professional organizations
• Field teams of modest size
Empowerment
autonomously maintainable
Roles
Computer Scientists
• Provide tools
– adaptation
– integration
– matching
– composing
• Assess Standards
• Assure scalability
Domain Experts
• Learn to use the tools
• Select resources
• Assess their value
• Rank their quality
• Resolve semantics
• Get client feedback
Mediation Research Topics
• Mediator management and maintenance
• Representation of knowledge and customer models
• Balancing dynamic and warehouse solutions
• Formalization of semantic heterogeneities
– many levels and types
– roles for wrappers vs. mediators vs. applications
– scalability by partitioning -- make it simple!
– Domain Ontologies --- tools, validation, . . .
• Effect of object paradigm and method-based access
• Service and business models
Integration Science
Artificial Intelligence: knowledge mgmt, domain expertise, uncertainty
Systems Engineering: analysis, documentation, costing
Databases: access, storage, algebras
Long Range Science Vision
Integration Methods
Fat versus thin mediators
• Too broad: hard to maintain, needs a committee
• Too thin: insufficient added value
• Too fat: hard to compose
• Too narrow: few customers
[Figure axes: domain scope, service scope]
Maintenance is good for you
[Figure: relative annual maintenance cost and depreciation (= 1 / lifetime) for automobile, hardware, and software; vertical axis 0 to 100%, horizontal axis lifetime in years, 1 to 13]
Client-Server Architecture
Client system
data and simulation resources
Fast build of clients by resource reuse
Systems with Mediators
Applications . . . .
Mediators . . . .
Data Resources . . .
Growth through Reuse
New Application
Prior & Revised Mediators
Extended Data Resources
Linear O(n) Cost of Growth -- now O(n^2)
• Data changes only affect some mediators; only in their domain
• Mediators can
1. supply old information to n-1 prior applications
2. provide better information to the new application
3. be partially or completely reused
• New applications, using the new data, can be developed and inserted dynamically
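One possible reading of the linear versus quadratic claim is an interface count. The sketch below assumes, hypothetically, that without mediators every application is connected directly to every resource, and that with a mediation layer each application and each resource needs only one connection into that layer.

```python
def direct_interfaces(apps, resources):
    """Assumed worst case: each application talks to each resource."""
    return apps * resources

def mediated_interfaces(apps, resources):
    """Assumed mediated case: one connection per application and per resource."""
    return apps + resources

def cost_of_next_application(apps, resources):
    """Extra interfaces needed to add one more application, both ways."""
    direct = direct_interfaces(apps + 1, resources) - direct_interfaces(apps, resources)
    mediated = mediated_interfaces(apps + 1, resources) - mediated_interfaces(apps, resources)
    return direct, mediated

growth = [(n, direct_interfaces(n, n), mediated_interfaces(n, n))
          for n in (1, 10, 100)]
```

Under these assumptions, a system with n applications and n resources has n^2 direct interfaces but only 2n mediated ones, and each new application adds a constant amount of work instead of an amount proportional to the number of resources.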
Assigning maintenance responsibility
a. Source data quality – supplier database, files, or web pages
b. Interface to the source – wrapper, supplier or vendor for supplier
c. Source selection – expert specialist in mediator
d. Source quality assessment – customer input to mediator
e. Semantic interoperation – specialist group providing input to the mediator
f. Consistency and metadata information – mediator service operation or warehouse
g. Informal, pragmatic integration – client services with customer input
h. User presentation formats – client services with customer input
Services
Sources