COP 5725 Fall 2012
Database Management Systems
Data Warehousing and Data Mining
• Warehousing
• OLAP
• Data Mining
Introduction
• Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.
• The emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise; the data is fairly static.
OLTP
• Most database operations involve On-Line Transaction Processing (OLTP).
– Short, simple, frequent queries and/or modifications, each involving a small number of tuples.
– Examples: answering queries from a Web interface, sales at cash registers, selling airline tickets.
OLAP
• Of increasing importance are On-Line Analytical Processing (OLAP) queries.
– Few, but complex, queries; they may run for hours.
OLAP Examples
1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer.
2. Analysts at Wal-Mart look for items with increasing sales in some region.
Common Architecture
• Databases at store branches handle OLTP queries.
• Local store databases are copied to a central warehouse overnight.
• Analysts use the warehouse for OLAP queries.
Three Complementary Trends
• Data Warehousing: Consolidate data from many sources in one large repository.
– Loading, periodic synchronization of replicas.
– Semantic integration.
• OLAP:
– Complex SQL queries and views.
– Queries based on spreadsheet-style operations and a “multidimensional” view of data.
– Interactive and “online” queries.
• Data Mining: Exploratory search for interesting trends and anomalies.
Data Warehousing
• Integrated data spanning long time periods, often augmented with summary information.
• Several gigabytes to terabytes common.
• Interactive response times expected for complex queries; ad-hoc updates uncommon.
Warehousing Issues
• Semantic Integration: When getting data from multiple sources, must eliminate mismatches, e.g., different currencies, schemas.
• Heterogeneous Sources: Must access data from a variety of source formats and repositories.
• Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data.
• Metadata Management: Must keep track of source, loading time, and other information for all data in the warehouse.
Multidimensional Data Model
• Collection of numeric measures, which depend on a set of dimensions.
– E.g., measure Sales, dimensions Product (pid), Location (locid), and Time (timeid).
MOLAP vs. ROLAP systems
• MOLAP systems: Multidimensional data are stored physically in a (disk-resident, persistent) multidimensional array.
• ROLAP systems: Multidimensional data are stored as relations.
– The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
– E.g., Sales(pid, locid, timeid, sales) or Sales(bar, beer, drinker, time, price).
MOLAP and Data Cubes
• Keys of dimension tables are the dimensions of a hypercube.
– Example: for the Sales(bar, beer, drinker, time, price) data, the four dimensions are bar, beer, drinker, and time.
• Dependent attributes (e.g., price) appear at the points of the cube.
Visualization - Data Cubes
(Figure: a cube with axes bar, beer, and drinker, plus time, whose cells hold the measure price.)
Data Cube Marginals
• The data cube also includes aggregation (typically SUM) along the margins of the cube.
• The marginals include aggregations over one dimension, two dimensions, etc.
Visualization - Data Cube w/ Aggregation
(Figure: the same cube with marginal sums of price along the bar and beer dimensions.)
ROLAP and Star Schemas
• A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales. Often “insert-only.”
2. Dimension tables: smaller, generally static information about the entities involved in the facts.
Example: Star Schema
• Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged.
• The fact table is a relation:
Sales(bar, beer, drinker, day, time, price)
Example, Continued
• The dimension tables include information about the bar, beer, and drinker “dimensions”:
Bars(bar, addr, license)
Beers(beer, manf)
Visualization – Star Schema
(Figure: the fact table Sales at the center, connected to the dimension tables Bars, Beers, Drinkers, and Time, etc.)
Dimensions and Dependent
Attributes
• Two classes of fact-table attributes:
1. Dimension attributes: the key of a dimension table.
2. Dependent attributes: a value determined by the dimension attributes of the tuple.
Dimension Hierarchies
• For each dimension, the set of values can be organized in a hierarchy:
– PRODUCT: pname → category
– TIME: date → week; date → month → quarter → year
– LOCATION: city → state → country
OLAP Queries
• Influenced by SQL and by spreadsheets.
• A common operation is to aggregate a measure over one or more dimensions.
– Find total sales.
– Find total sales for each city, or for each state.
– Find top five products ranked by total sales.
• Roll-up: Aggregating at different levels of a dimension hierarchy.
OLAP Queries
• Drill-down: The inverse of roll-up.
– E.g., given total sales by state, can drill down to get total sales by city.
– E.g., can also drill down on a different dimension to get total sales by product for each year/quarter, etc.
• Pivoting: Aggregation on selected dimensions.
– E.g., pivoting on Location and Time yields a cross-tabulation of total sales by state and by year.
• Slicing and Dicing: Equality and range selections on one or more dimensions.
Comparison with
SQL Queries
• The cross-tabulation obtained by pivoting on Location and Time can also be computed with a collection of SQL queries:

SELECT SUM(S.sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
GROUP BY T.year, L.state

SELECT SUM(S.sales)
FROM Sales S, Times T
WHERE S.timeid=T.timeid
GROUP BY T.year

SELECT SUM(S.sales)
FROM Sales S, Locations L
WHERE S.locid=L.locid
GROUP BY L.state
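These three queries can be run as-is with Python's built-in sqlite3 module; the table contents below are hypothetical sample rows, chosen only to make the roll-ups visible.

```python
# A runnable sketch of the pivot queries above, using sqlite3 in memory;
# the Times/Locations/Sales rows are hypothetical sample data.
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
CREATE TABLE Times (timeid INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE Locations (locid INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE Sales (timeid INTEGER, locid INTEGER, sales INTEGER);
INSERT INTO Times VALUES (1, 2011), (2, 2012);
INSERT INTO Locations VALUES (10, 'FL'), (20, 'WI');
INSERT INTO Sales VALUES (1, 10, 100), (1, 20, 50), (2, 10, 30);
""")

# Finest granularity: total sales per (year, state) cell of the cross-tab.
by_year_state = c.execute("""
    SELECT T.year, L.state, SUM(S.sales)
    FROM Sales S, Times T, Locations L
    WHERE S.timeid = T.timeid AND S.locid = L.locid
    GROUP BY T.year, L.state""").fetchall()

# Roll up over state: the row totals of the cross-tab.
by_year = c.execute("""
    SELECT T.year, SUM(S.sales)
    FROM Sales S, Times T
    WHERE S.timeid = T.timeid
    GROUP BY T.year""").fetchall()

# Roll up over year: the column totals.
by_state = c.execute("""
    SELECT L.state, SUM(S.sales)
    FROM Sales S, Locations L
    WHERE S.locid = L.locid
    GROUP BY L.state""").fetchall()
```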
The CUBE Operator
• Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of the dimensions.
• CUBE pid, locid, timeid BY SUM Sales
– Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid};
– each roll-up corresponds to an SQL query of the form:
SELECT SUM(S.sales) FROM Sales S
GROUP BY grouping-list
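The CUBE ... BY notation above is not standard SQL, but the 2^k roll-ups it stands for can be generated mechanically. A sketch with sqlite3 over a hypothetical Sales fact table:

```python
# Enumerate all 2^3 = 8 subsets of {pid, locid, timeid} and issue one
# GROUP BY roll-up per subset; the Sales rows are hypothetical.
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE Sales (pid, locid, timeid, sales)")
c.executemany("INSERT INTO Sales VALUES (?,?,?,?)",
              [(1, 1, 1, 10), (1, 2, 1, 20), (2, 1, 2, 5)])

dims = ["pid", "locid", "timeid"]
cube = {}
for r in range(len(dims) + 1):
    for subset in combinations(dims, r):
        if subset:
            cols = ", ".join(subset)
            sql = f"SELECT {cols}, SUM(sales) FROM Sales GROUP BY {cols}"
        else:
            sql = "SELECT SUM(sales) FROM Sales"  # the grand total
        cube[subset] = c.execute(sql).fetchall()

# cube now holds one result set per roll-up; cube[()] is the grand total.
```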
Design Issues
• Fact table in BCNF; dimension tables un-normalized.
– Dimension tables are small; updates/inserts/deletes are rare. So, anomalies less important than query performance.
• This kind of schema is very common in OLAP
applications, and is called a star schema; computing the join of all these relations is called a star join.
(Figure: star schema with the fact table SALES at the center, joined to dimension tables such as TIMES; the measure shown is price.)
ROLAP Techniques
1. New indexing techniques: bitmap indexes, join indexes.
2. Array representations and compression.
3. Pre-computation of aggregations (i.e., materialized views), etc.
We will cover bitmap indexes and materialized views in more detail.
Bitmap Index
• For each key value of a dimension table, create a bit-vector telling which tuples of the fact table have that value.
• Many queries can be answered using bit-vector ops!
(Figure: bit-vectors for the key values M and F of a gender attribute.)
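A minimal sketch of the idea, assuming a hypothetical fact table with a gender column; a plain Python integer serves as each bit-vector:

```python
# Hypothetical fact-table rows: (drinker, gender).
rows = [("Alice", "F"), ("Bob", "M"), ("Carol", "F"), ("Dave", "M"), ("Eve", "F")]

# One bit-vector per distinct key value: bit i is set iff row i has that value.
bitmap = {}
for i, (_, gender) in enumerate(rows):
    bitmap[gender] = bitmap.get(gender, 0) | (1 << i)

# bitmap["F"] == 0b10101 (rows 0, 2, 4); bitmap["M"] == 0b1010 (rows 1, 3).
# Queries become bit-vector operations, e.g. counting matching tuples:
females = bin(bitmap["F"]).count("1")
```

Selections on several dimensions combine with bitwise AND/OR of the vectors, which is why bitmap indexes work well for OLAP-style predicates.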
Example OLAP Query
• Often, OLAP queries begin with a “star join”: the natural join of the fact table with all or most of the dimension tables.
• Example:
SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars
NATURAL JOIN Beers
WHERE addr = ’Palo Alto’ AND manf = ’Anheuser-Busch’
GROUP BY bar, beer;
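The star join above can be tried directly in sqlite3; the rows below are hypothetical, chosen so that exactly one bar/beer pair survives the filter:

```python
# Run the example star join in an in-memory database with made-up rows.
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
CREATE TABLE Sales (bar, beer, drinker, time, price);
CREATE TABLE Bars  (bar, addr, license);
CREATE TABLE Beers (beer, manf);
INSERT INTO Bars VALUES ('Joe''s', 'Palo Alto', 'full'),
                        ('Sue''s', 'Berkeley', 'full');
INSERT INTO Beers VALUES ('Bud', 'Anheuser-Busch'), ('Anchor', 'Anchor');
INSERT INTO Sales VALUES ('Joe''s', 'Bud', 'Dan', '19:00', 3.0),
                         ('Joe''s', 'Bud', 'Ann', '20:00', 3.5),
                         ('Sue''s', 'Bud', 'Dan', '21:00', 3.25);
""")

# NATURAL JOIN matches Sales-Bars on bar, then the result-Beers on beer.
result = c.execute("""
    SELECT bar, beer, SUM(price)
    FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
    WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
    GROUP BY bar, beer""").fetchall()
```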
Materialized Views
• Store the answers to several useful queries (views) in the warehouse itself.
• A direct execution of an OLAP query from the fact and the dimension tables could take too long (even with bitmap indexes).
• If we create a materialized view that matches the query, we can answer the query much faster.
Materialized Views (cont.)
• A view whose tuples are stored in the database is said to be materialized.
– Provides fast access, like a (very high-level) cache.
– Need to maintain the view as the underlying tables change.
– Ideally, we want incremental view maintenance algorithms.
Example OLAP Query
SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars
NATURAL JOIN Beers
WHERE addr = ’Palo Alto’ AND manf = ’Anheuser-Busch’
GROUP BY bar, beer;
Example --- Continued
• Here is a materialized view that could help:
CREATE VIEW BABMS(bar, addr,
beer, manf, sales) AS
SELECT bar, addr, beer, manf,
SUM(price) sales
FROM Sales NATURAL JOIN Bars
NATURAL JOIN Beers
GROUP BY bar, addr, beer, manf;
Example --- Concluded
• Here’s our query using the materialized view BABMS:
SELECT bar, beer, sales
FROM BABMS
WHERE addr = ’Palo Alto’ AND manf = ’Anheuser-Busch’;
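sqlite3 has no MATERIALIZED VIEW statement, so this sketch simulates BABMS with CREATE TABLE ... AS SELECT over hypothetical rows; the point is that the OLAP query then reads a small precomputed table with no join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
CREATE TABLE Sales (bar, beer, drinker, time, price);
CREATE TABLE Bars  (bar, addr, license);
CREATE TABLE Beers (beer, manf);
INSERT INTO Bars VALUES ('Joe''s', 'Palo Alto', 'full');
INSERT INTO Beers VALUES ('Bud', 'Anheuser-Busch');
INSERT INTO Sales VALUES ('Joe''s', 'Bud', 'Dan', '19:00', 3.0),
                         ('Joe''s', 'Bud', 'Ann', '20:00', 3.5);
""")

# "Materialize" the view: store its tuples as an ordinary table.
c.execute("""
    CREATE TABLE BABMS AS
    SELECT bar, addr, beer, manf, SUM(price) AS sales
    FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
    GROUP BY bar, addr, beer, manf""")

# The query now scans only the small precomputed table.
answer = c.execute("""
    SELECT bar, beer, sales FROM BABMS
    WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'""").fetchall()
```

In a real warehouse the table would also need to be refreshed as Sales changes, which is exactly the view-maintenance problem mentioned above.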
Views in DW for OLAP queries
• OLAP queries are typically aggregate queries.
– Precomputation is essential for interactive response times.
– The CUBE is in fact a collection of aggregate queries, and precomputation is especially important.
– There is lots of work on what is best to precompute given a limited amount of space to store precomputed results.
• Warehouses can be thought of as a collection of periodically updated tables and periodically maintained views.
View Modification (Evaluate On Demand)
View:
CREATE VIEW RegionalSales(category, sales, state)
AS SELECT P.category, S.sales, L.state
FROM Products P, Sales S, Locations L
WHERE P.pid=S.pid AND S.locid=L.locid

Query:
SELECT R.category, R.state, SUM(R.sales)
FROM RegionalSales AS R GROUP BY R.category, R.state

Modified query:
SELECT R.category, R.state, SUM(R.sales)
FROM (SELECT P.category, S.sales, L.state
FROM Products P, Sales S, Locations L
WHERE P.pid=S.pid AND S.locid=L.locid) AS R
GROUP BY R.category, R.state
View Materialization (Precomputation)
• Suppose we precompute RegionalSales and store it with a clustered B+ tree index on [category, state, sales].
– Then, the previous query can be answered by an index-only scan.
SELECT R.state, SUM(R.sales)
FROM RegionalSales R
WHERE R.category = ’Laptop’
GROUP BY R.state

SELECT R.category, SUM(R.sales)
FROM RegionalSales R
WHERE R.state = ’Wisconsin’
GROUP BY R.category
Index on precomputed view is great!
Data Mining Definition
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Example pattern (Census Bureau Data):
Data Mining Definition (Cont.)
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Why Use Data Mining Today?
Human analysis skills are inadequate:
– Volume and dimensionality of the data
– High data growth rate
– 3V in Big Data: Volume, Velocity, Variety
Availability of:
– Data
– Storage
– Computational power
– Off-the-shelf software
Preprocessing and Mining
Original Data → (Data Integration) → Target Data → (Preprocessing) → Preprocessed Data → (Model Construction) → Patterns → Knowledge
Example Application: Sports
IBM Advanced Scout analyzes NBA game statistics:
– Shots blocked
– Assists
– Fouls
Advanced Scout
• Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that “When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots.”
• Pattern is interesting:
Data Mining Techniques
• Supervised learning
– Classification and regression
• Unsupervised learning
– Clustering
• Dependency modeling
– Associations, summarization, causality
• Outlier and deviation detection
Market Basket Analysis (Example
Dependency Modeling)
• Consider a shopping cart filled with several items.
• Market basket analysis tries to answer the following questions:
– Who makes purchases?
– What do customers buy together?
Market Basket Analysis
Given:
• A database of customer transactions
• Each transaction is a set of items
• Example: transaction with TID 111 contains items {Pen, Ink, Milk, Juice}
Market Basket Analysis (Contd.)
• Co-occurrences
– 80% of all customers purchase items X, Y, and Z together.
• Association rules
– 60% of all customers who purchase X and Y also buy Z.
• Sequential patterns
Confidence and Support
We prune the set of all possible association rules using two interestingness measures:
• Confidence of a rule:
– X → Y has confidence c if P(Y|X) = c
• Support of a rule:
– X → Y has support s if P(X,Y) = s
We can also define:
• Support of an itemset (a co-occurrence) XY:
– the fraction of transactions that contain both X and Y, i.e., P(X,Y)
Example
Examples:
• {Pen} => {Milk}: support 75%, confidence 75%
• {Ink} => {Pen}: support 75%, confidence 100%
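Support and confidence follow directly from their definitions. The transaction contents below (beyond TID 111, which the earlier slide gives) are hypothetical, chosen to reproduce the percentages above:

```python
# Hypothetical transaction database; TID 111 matches the earlier slide.
transactions = {
    111: {"Pen", "Ink", "Milk", "Juice"},
    112: {"Pen", "Ink", "Milk"},
    113: {"Pen", "Milk"},
    114: {"Pen", "Ink", "Juice"},
}

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions.values() if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs) = support(lhs u rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"Pen", "Milk"}))       # 0.75 : {Pen} => {Milk} has support 75%
print(confidence({"Pen"}, {"Milk"}))  # 0.75 : and confidence 75%
print(confidence({"Ink"}, {"Pen"}))   # 1.0  : {Ink} => {Pen} has confidence 100%
```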
Market Basket Analysis:
Applications
• Sample Applications
– Direct marketing
– Floor/shelf planning
– Web site layout
– Cross-selling
Frequent Itemset Algorithms
• Applications
• More abstract problem
Applications of Frequent Itemsets
• Market basket analysis
• Association rules
• Classification (especially: text, rare classes)
• Seeds for construction of Bayesian networks
• Collaborative filtering
Problem
Abstract:
• A set of items {1, 2, …, k}
• A database of transactions (itemsets) D = {T1, T2, …, Tn}, Tj ⊆ {1, 2, …, k}
GOAL: Find all itemsets that appear in at least x transactions.
(“appear in” == “are subsets of”; I ⊆ T: T supports I)
For an itemset I, the number of transactions it appears in is called the support of I.
Concrete:
• I = {milk, bread, cheese, …}
• D = { {milk, bread, cheese}, {bread, cheese, juice}, … }
GOAL: Find all itemsets that appear in at least 1000 transactions.
Problem (Contd.)
Definitions:
• An itemset is frequent if it is a subset of at least x transactions. (FI.)
• An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.)
GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).
Obvious relationship: MFI ⊆ FI
Example:
D = { {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }
Minimum support x = 3
Support({1,2}) = 4, so {1,2} is frequent.
{1,2,3} is maximally frequent.
All maximal frequent itemsets: {1,2,3}
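The example can be checked by brute force: enumerate every candidate itemset over {1,2,3,4}, count supports, and keep the frequent and maximally frequent ones.

```python
# Brute-force FI/MFI computation for the example database, x = 3.
from itertools import chain, combinations

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
x = 3  # minimum support

def support(I):
    """Number of transactions in D that contain itemset I."""
    return sum(1 for T in D if I <= T)

items = set().union(*D)
# All non-empty itemsets over the item universe.
candidates = [set(s) for s in chain.from_iterable(
    combinations(sorted(items), r) for r in range(1, len(items) + 1))]

FI = [I for I in candidates if support(I) >= x]      # frequent itemsets
MFI = [I for I in FI if not any(I < J for J in FI)]  # no frequent proper superset
```

This confirms Support({1,2}) = 4 and that {1,2,3} is the only maximally frequent itemset; brute force is exponential in k, which is what the lattice-search algorithms below avoid.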
The Itemset Lattice
(Figure: the lattice of all itemsets over {1,2,3,4}, from {} at the top, through the singletons, pairs, and triples, down to {1,2,3,4}.)
Frequent Itemsets
(Figure: the same lattice with a border separating the frequent itemsets, in the upper part, from the infrequent itemsets below.)
Breadth First Search: 1-Itemsets
(Figure: the lattice with the 1-itemsets counted; frequent and infrequent singletons are marked.)
The Apriori Principle: every superset of an infrequent itemset is itself infrequent.
Breadth First Search: 2-Itemsets
(Figure: the 2-itemsets currently examined; pairs with an infrequent subset are pruned, the rest are still unknown. Legend: infrequent, frequent, currently examined, don’t know.)
Breadth First Search: 3-Itemsets
(Figure: the 3-itemsets currently examined; by the Apriori Principle, any 3-itemset with an infrequent subset is pruned. Legend: infrequent, frequent, currently examined.)
Breadth First Search: Remarks
• We prune infrequent itemsets and avoid counting them.
• To find a frequent itemset with k items, we count all 2^k of its subsets along the way.
• Breadth-first search is the basis of the A-Priori algorithm. Next, we show how to implement A-Priori in SQL.
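The level-wise search just described can be sketched in plain Python; this is an illustrative in-memory implementation (not the SQL version), reusing the earlier example database:

```python
# Level-wise A-Priori: candidates at level k+1 are built only from frequent
# k-itemsets, pruning every superset of an infrequent itemset.
from itertools import combinations

def apriori(D, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    items = sorted(set().union(*D))
    level = {frozenset([i]) for i in items}  # level 1: candidate single items
    frequent = {}
    while level:
        # Count the support of the surviving candidates at this level.
        counts = {I: sum(1 for T in D if I <= T) for I in level}
        level_freq = {I: c for I, c in counts.items() if c >= minsup}
        frequent.update(level_freq)
        # Generate (k+1)-candidates from frequent k-itemsets, keeping only
        # those whose k-subsets are all frequent (the Apriori pruning step).
        keys = list(level_freq)
        level = set()
        for a, b in combinations(keys, 2):
            cand = a | b
            if len(cand) == len(a) + 1 and all(
                    frozenset(s) in level_freq
                    for s in combinations(cand, len(a))):
                level.add(cand)
    return frequent

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
result = apriori(D, 3)
```

On this data the search never counts any itemset containing the infrequent item 4 beyond level 1, which is exactly the pruning the slides illustrate.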
Finding Frequent Pairs
• The simplest case is when we only want to find “frequent pairs” of items.
• Assume data is in a relation Baskets(basket, item).
• The support threshold s is the minimum number of baskets in which a pair must appear to be considered frequent.
Frequent Pairs in SQL
SELECT b1.item, b2.item
FROM Baskets b1, Baskets b2
WHERE b1.basket = b2.basket
AND b1.item < b2.item
GROUP BY b1.item, b2.item
HAVING COUNT(*) >= s;
Look for two Baskets tuples with the same basket and different items. The first item must precede the second, so we don’t count the same pair twice.
Create a group for each pair of items that appears in at least one basket. Throw away pairs of items that do not appear at least s times.
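The pair-counting query runs verbatim in sqlite3; the basket rows below are hypothetical, with support threshold s = 2:

```python
# Self-join of Baskets with itself to count item pairs, as in the slide.
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE Baskets (basket, item)")
c.executemany("INSERT INTO Baskets VALUES (?, ?)", [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "juice"),
    (3, "bread"), (3, "juice"),
])

s = 2  # support threshold
pairs = c.execute("""
    SELECT b1.item, b2.item
    FROM Baskets b1, Baskets b2
    WHERE b1.basket = b2.basket
      AND b1.item < b2.item
    GROUP BY b1.item, b2.item
    HAVING COUNT(*) >= ?""", (s,)).fetchall()
# Frequent pairs here: (bread, juice) and (bread, milk), each in 2 baskets.
```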
A-Priori Trick --- (1)
• A straightforward implementation involves a join of a huge Baskets relation with itself.
• The a-priori algorithm speeds the query by recognizing that a pair of items {i, j} cannot have support s unless both {i} and {j} do.
A-Priori Trick --- (2)
• Use a materialized view to hold only information about frequent items:
INSERT INTO Baskets1(basket, item)
SELECT * FROM Baskets
WHERE item IN (
SELECT item FROM Baskets
GROUP BY item
HAVING COUNT(*) >= s
);
Items that appear in at least s baskets.
A-Priori Algorithm
1. Materialize the view Baskets1.
2. Run the obvious query, but on Baskets1 instead of Baskets.
• Computing Baskets1 is cheap, since it doesn’t involve a join.
• Baskets1 probably has many fewer tuples than Baskets.
Two Observations
• Monotone function: if x >= y, then f(x) >= f(y).
• Antimonotone function: if x >= y, then f(x) <= f(y).
• Apriori can be applied to any constraint P that is antimonotone (e.g., support > const).
– Start from the empty set.
– Prune supersets of sets that do not satisfy P.
• Apriori can also be applied to a monotone constraint Q (e.g., sum > const).
– Start from the set of all items instead of the empty set.
– Prune subsets of sets that do not satisfy Q.
Negative Pruning an Antimonotone P
(Figure sequence: the lattice explored top-down from {}; once a currently examined itemset fails P, all of its supersets are pruned. Legend: frequent, infrequent, currently examined, don’t know.)
Negative Pruning a Monotone Q
(Figure: the lattice explored bottom-up; once a currently examined itemset fails Q, all of its subsets are pruned. Legend: satisfies Q, doesn’t satisfy Q, currently examined, don’t know.)
The New Problem
New Goal:
• Given constraints P and Q, with P antimonotone (e.g., support) and Q monotone (e.g., a statistical constraint),
• find all itemsets that satisfy both P and Q.
Recent solutions:
Conceptual Illustration of Problem
(Figure: in the lattice between {} and the full itemset D, the sets satisfying P form a downward-closed region (all subsets satisfy P), the sets satisfying Q form an upward-closed region (all supersets satisfy Q), and the answer is their intersection: the sets satisfying both P and Q.)
Summary
• Decision support is an emerging, rapidly growing subarea of databases.
• It involves the creation of large, consolidated data repositories called data warehouses.
• Warehouses are exploited using sophisticated analysis techniques: complex SQL queries and OLAP “multidimensional” queries (influenced by both SQL and spreadsheets).
• New techniques for database design, indexing, and view maintenance are needed.
Summary (Cont.)
• Data Mining
– Supervised
– Unsupervised
– Dependency modeling
– Outlier detection
– Trend analysis and prediction