COP 5725 Fall 2012
Database Management Systems
Data Warehousing and Data Mining
• Warehousing
• OLAP
• Data Mining
Introduction
• Increasingly, organizations are analyzing current and historical data to identify useful patterns and support business strategies.
• The emphasis is on complex, interactive, exploratory analysis of very large datasets created by integrating data from across all parts of an enterprise; the data is fairly static.
OLTP
• Most database operations involve On-Line Transaction Processing (OLTP).
– Short, simple, frequent queries and/or modifications, each involving a small number of tuples.
– Examples: answering queries from a Web interface, sales at cash registers, selling airline tickets.
OLAP
• Of increasing importance are On-Line Analytical Processing (OLAP) queries.
– Few, but complex, queries; they may run for hours.
OLAP Examples
1. Amazon analyzes purchases by its customers to come up with an individual screen with products of likely interest to the customer.
2. Analysts at Wal-Mart look for items with increasing sales in some region.
Common Architecture
• Databases at store branches handle OLTP queries.
• Local store databases are copied to a central warehouse overnight.
• Analysts use the warehouse for OLAP queries.
Three Complementary Trends
• Data Warehousing: Consolidate data from many sources in one large repository.
– Loading, periodic synchronization of replicas.
– Semantic integration.
• OLAP:
– Complex SQL queries and views.
– Queries based on spreadsheet-style operations and a “multidimensional” view of data.
– Interactive and “online” queries.
• Data Mining: Exploratory search for interesting trends and anomalies.
Data Warehousing
• Integrated data spanning long time periods, often augmented with summary information.
• Several gigabytes to terabytes common.
• Interactive response times expected for complex queries; ad-hoc updates uncommon.
Warehousing Issues
• Semantic Integration: When getting data from multiple sources, must eliminate mismatches, e.g., different currencies, schemas.
• Heterogeneous Sources: Must access data from a variety of source formats and repositories.
• Load, Refresh, Purge: Must load data, periodically refresh it, and purge too-old data.
• Metadata Management: Must keep track of source, loading time, and other information for all data in the warehouse.
Multidimensional Data Model
• Collection of numeric measures, which depend on a set of dimensions.
– E.g., measure Sales, dimensions Product (pid), Location (locid), and Time (timeid).
MOLAP vs. ROLAP systems
• MOLAP systems: Multidimensional data are stored physically in a (disk-resident, persistent) multidimensional array.
• ROLAP systems: Multidimensional data are stored as relations.
– The main relation, which relates dimensions to a measure, is called the fact table. Each dimension can have additional attributes and an associated dimension table.
– E.g., Sales(pid, locid, timeid, sales) or Sales(bar, beer, drinker, time, price).
MOLAP and Data Cubes
• Keys of dimension tables are the dimensions of a hypercube.
– Example: for the Sales(bar, beer, drinker, time, price) data, the four dimensions are bar, beer, drinker, and time.
• Dependent attributes (e.g., price) appear at the points of the cube.
Visualization - Data Cubes
(Figure: a cube with axes bar, beer, and drinker, plus time, whose cells hold the measure price.)
Data Cube Marginals
• The data cube also includes aggregation (typically SUM) along the margins of the cube.
• The marginals include aggregations over one dimension, two dimensions, etc.
Visualization - Data Cube w/ Aggregation
(Figure: the same cube with marginal sums of price along the bar and beer dimensions.)
ROLAP and Star Schemas
• A star schema is a common organization for data at a warehouse. It consists of:
1. Fact table: a very large accumulation of facts such as sales. Often “insert-only.”
2. Dimension tables: smaller, generally static information about the entities involved in the facts.
Example: Star Schema
• Suppose we want to record in a warehouse information about every beer sale: the bar, the brand of beer, the drinker who bought the beer, the day, the time, and the price charged.
• The fact table is a relation:
Sales(bar, beer, drinker, day, time, price)
Example, Continued
• The dimension tables include information about the bar, beer, and drinker “dimensions”:
Bars(bar, addr, license)
Beers(beer, manf)
Visualization – Star Schema
(Figure: the fact table Sales at the center, connected to the dimension tables Bars, Beers, Drinkers, and Time, etc.)
Dimensions and Dependent
Attributes
• Two classes of fact-table attributes:
1. Dimension attributes: the key of a dimension table.
2. Dependent attributes: a value determined by the dimension attributes of the tuple.
Dimension Hierarchies
• For each dimension, the set of values can be organized in a hierarchy:
– PRODUCT: pname → category
– TIME: date → week; date → month → quarter → year
– LOCATION: city → state → country
OLAP Queries
• Influenced by SQL and by spreadsheets.
• A common operation is to aggregate a measure over one or more dimensions.
– Find total sales.
– Find total sales for each city, or for each state.
– Find top five products ranked by total sales.
• Roll-up: Aggregating at different levels of a dimension hierarchy.
OLAP Queries
• Drill-down: The inverse of roll-up.
– E.g., given total sales by state, can drill down to get total sales by city.
– E.g., can also drill down on a different dimension to get total sales by product for each year/quarter, etc.
• Pivoting: Aggregation on selected dimensions.
– E.g., pivoting on Location and Time yields a cross-tabulation of total sales by state and by year.
• Slicing and Dicing: Equality and range selections on one or more dimensions.
Comparison with
SQL Queries
• The cross-tabulation obtained by pivoting on Location and Time can also be computed with a collection of SQL queries:

SELECT SUM(S.sales)
FROM Sales S, Times T, Locations L
WHERE S.timeid=T.timeid AND S.locid=L.locid
GROUP BY T.year, L.state

SELECT SUM(S.sales)
FROM Sales S, Times T
WHERE S.timeid=T.timeid
GROUP BY T.year

SELECT SUM(S.sales)
FROM Sales S, Locations L
WHERE S.locid=L.locid
GROUP BY L.state
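These three queries can be run as-is with Python's built-in sqlite3 module; the table contents below are hypothetical sample rows, chosen only to make the roll-ups visible.

```python
# A runnable sketch of the pivot queries above, using sqlite3 in memory;
# the Times/Locations/Sales rows are hypothetical sample data.
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
CREATE TABLE Times (timeid INTEGER PRIMARY KEY, year INTEGER);
CREATE TABLE Locations (locid INTEGER PRIMARY KEY, state TEXT);
CREATE TABLE Sales (timeid INTEGER, locid INTEGER, sales INTEGER);
INSERT INTO Times VALUES (1, 2011), (2, 2012);
INSERT INTO Locations VALUES (10, 'FL'), (20, 'WI');
INSERT INTO Sales VALUES (1, 10, 100), (1, 20, 50), (2, 10, 30);
""")

# Finest granularity: total sales per (year, state) cell of the cross-tab.
by_year_state = c.execute("""
    SELECT T.year, L.state, SUM(S.sales)
    FROM Sales S, Times T, Locations L
    WHERE S.timeid = T.timeid AND S.locid = L.locid
    GROUP BY T.year, L.state""").fetchall()

# Roll up over state: the row totals of the cross-tab.
by_year = c.execute("""
    SELECT T.year, SUM(S.sales)
    FROM Sales S, Times T
    WHERE S.timeid = T.timeid
    GROUP BY T.year""").fetchall()

# Roll up over year: the column totals.
by_state = c.execute("""
    SELECT L.state, SUM(S.sales)
    FROM Sales S, Locations L
    WHERE S.locid = L.locid
    GROUP BY L.state""").fetchall()
```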
The CUBE Operator
• Generalizing the previous example, if there are k dimensions, we have 2^k possible SQL GROUP BY queries that can be generated through pivoting on a subset of the dimensions.
• CUBE pid, locid, timeid BY SUM Sales
– Equivalent to rolling up Sales on all eight subsets of the set {pid, locid, timeid};
– each roll-up corresponds to an SQL query of the form:
SELECT SUM(S.sales) FROM Sales S
GROUP BY grouping-list
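The CUBE ... BY notation above is not standard SQL, but the 2^k roll-ups it stands for can be generated mechanically. A sketch with sqlite3 over a hypothetical Sales fact table:

```python
# Enumerate all 2^3 = 8 subsets of {pid, locid, timeid} and issue one
# GROUP BY roll-up per subset; the Sales rows are hypothetical.
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE Sales (pid, locid, timeid, sales)")
c.executemany("INSERT INTO Sales VALUES (?,?,?,?)",
              [(1, 1, 1, 10), (1, 2, 1, 20), (2, 1, 2, 5)])

dims = ["pid", "locid", "timeid"]
cube = {}
for r in range(len(dims) + 1):
    for subset in combinations(dims, r):
        if subset:
            cols = ", ".join(subset)
            sql = f"SELECT {cols}, SUM(sales) FROM Sales GROUP BY {cols}"
        else:
            sql = "SELECT SUM(sales) FROM Sales"  # the grand total
        cube[subset] = c.execute(sql).fetchall()

# cube now holds one result set per roll-up; cube[()] is the grand total.
```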
Design Issues
• Fact table in BCNF; dimension tables un-normalized.
– Dimension tables are small; updates/inserts/deletes are rare. So, anomalies less important than query performance.
• This kind of schema is very common in OLAP
applications, and is called a star schema; computing the join of all these relations is called a star join.
(Figure: star schema with the fact table SALES at the center, joined to dimension tables such as TIMES; the measure shown is price.)
ROLAP Techniques
1. New indexing techniques: bitmap indexes, join indexes.
2. Array representations and compression.
3. Pre-computation of aggregations (i.e., materialized views), etc.
We will cover bitmap indexes and materialized views in more detail.
Bitmap Index
• For each key value of a dimension table, create a bit-vector telling which tuples of the fact table have that value.
• Many queries can be answered using bit-vector ops!
(Figure: bit-vectors for the key values M and F of a gender attribute.)
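A minimal sketch of the idea, assuming a hypothetical fact table with a gender column; a plain Python integer serves as each bit-vector:

```python
# Hypothetical fact-table rows: (drinker, gender).
rows = [("Alice", "F"), ("Bob", "M"), ("Carol", "F"), ("Dave", "M"), ("Eve", "F")]

# One bit-vector per distinct key value: bit i is set iff row i has that value.
bitmap = {}
for i, (_, gender) in enumerate(rows):
    bitmap[gender] = bitmap.get(gender, 0) | (1 << i)

# bitmap["F"] == 0b10101 (rows 0, 2, 4); bitmap["M"] == 0b1010 (rows 1, 3).
# Queries become bit-vector operations, e.g. counting matching tuples:
females = bin(bitmap["F"]).count("1")
```

Selections on several dimensions combine with bitwise AND/OR of the vectors, which is why bitmap indexes work well for OLAP-style predicates.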
Example OLAP Query
• Often, OLAP queries begin with a “star join”: the natural join of the fact table with all or most of the dimension tables.
• Example:
SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars
NATURAL JOIN Beers
WHERE addr = ’Palo Alto’ AND manf = ’Anheuser-Busch’
GROUP BY bar, beer;
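The star join above can be tried directly in sqlite3; the rows below are hypothetical, chosen so that exactly one bar/beer pair survives the filter:

```python
# Run the example star join in an in-memory database with made-up rows.
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
CREATE TABLE Sales (bar, beer, drinker, time, price);
CREATE TABLE Bars  (bar, addr, license);
CREATE TABLE Beers (beer, manf);
INSERT INTO Bars VALUES ('Joe''s', 'Palo Alto', 'full'),
                        ('Sue''s', 'Berkeley', 'full');
INSERT INTO Beers VALUES ('Bud', 'Anheuser-Busch'), ('Anchor', 'Anchor');
INSERT INTO Sales VALUES ('Joe''s', 'Bud', 'Dan', '19:00', 3.0),
                         ('Joe''s', 'Bud', 'Ann', '20:00', 3.5),
                         ('Sue''s', 'Bud', 'Dan', '21:00', 3.25);
""")

# NATURAL JOIN matches Sales-Bars on bar, then the result-Beers on beer.
result = c.execute("""
    SELECT bar, beer, SUM(price)
    FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
    WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'
    GROUP BY bar, beer""").fetchall()
```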
Materialized Views
• Store the answers to several useful queries (views) in the warehouse itself.
• A direct execution of an OLAP query from the fact and the dimension tables could take too long (even with bitmap indexes).
• If we create a materialized view that matches the query, we can answer the query much faster.
Materialized Views (cont.)
• A view whose tuples are stored in the database is said to be materialized.
– Provides fast access, like a (very high-level) cache.
– Need to maintain the view as the underlying tables change.
– Ideally, we want incremental view maintenance algorithms.
Example OLAP Query
SELECT bar, beer, SUM(price)
FROM Sales NATURAL JOIN Bars
NATURAL JOIN Beers
WHERE addr = ’Palo Alto’ AND manf = ’Anheuser-Busch’
GROUP BY bar, beer;
Example --- Continued
• Here is a materialized view that could help:
CREATE VIEW BABMS(bar, addr,
beer, manf, sales) AS
SELECT bar, addr, beer, manf,
SUM(price) sales
FROM Sales NATURAL JOIN Bars
NATURAL JOIN Beers
GROUP BY bar, addr, beer, manf;
Example --- Concluded
• Here’s our query using the materialized view BABMS:
SELECT bar, beer, sales
FROM BABMS
WHERE addr = ’Palo Alto’ AND manf = ’Anheuser-Busch’;
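sqlite3 has no MATERIALIZED VIEW statement, so this sketch simulates BABMS with CREATE TABLE ... AS SELECT over hypothetical rows; the point is that the OLAP query then reads a small precomputed table with no join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.executescript("""
CREATE TABLE Sales (bar, beer, drinker, time, price);
CREATE TABLE Bars  (bar, addr, license);
CREATE TABLE Beers (beer, manf);
INSERT INTO Bars VALUES ('Joe''s', 'Palo Alto', 'full');
INSERT INTO Beers VALUES ('Bud', 'Anheuser-Busch');
INSERT INTO Sales VALUES ('Joe''s', 'Bud', 'Dan', '19:00', 3.0),
                         ('Joe''s', 'Bud', 'Ann', '20:00', 3.5);
""")

# "Materialize" the view: store its tuples as an ordinary table.
c.execute("""
    CREATE TABLE BABMS AS
    SELECT bar, addr, beer, manf, SUM(price) AS sales
    FROM Sales NATURAL JOIN Bars NATURAL JOIN Beers
    GROUP BY bar, addr, beer, manf""")

# The query now scans only the small precomputed table.
answer = c.execute("""
    SELECT bar, beer, sales FROM BABMS
    WHERE addr = 'Palo Alto' AND manf = 'Anheuser-Busch'""").fetchall()
```

In a real warehouse the table would also need to be refreshed as Sales changes, which is exactly the view-maintenance problem mentioned above.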
Views in DW for OLAP queries
• OLAP queries are typically aggregate queries.
– Precomputation is essential for interactive response times.
– The CUBE is in fact a collection of aggregate queries, and precomputation is especially important.
– There is lots of work on what is best to precompute given a limited amount of space to store precomputed results.
• Warehouses can be thought of as a collection of periodically updated tables and periodically maintained views.
View Modification (Evaluate On Demand)
View:
CREATE VIEW RegionalSales(category, sales, state)
AS SELECT P.category, S.sales, L.state
FROM Products P, Sales S, Locations L
WHERE P.pid=S.pid AND S.locid=L.locid

Query:
SELECT R.category, R.state, SUM(R.sales)
FROM RegionalSales AS R GROUP BY R.category, R.state

Modified query:
SELECT R.category, R.state, SUM(R.sales)
FROM (SELECT P.category, S.sales, L.state
FROM Products P, Sales S, Locations L
WHERE P.pid=S.pid AND S.locid=L.locid) AS R
GROUP BY R.category, R.state
View Materialization (Precomputation)
• Suppose we precompute RegionalSales and store it with a clustered B+ tree index on [category, state, sales].
– Then, the previous query can be answered by an index-only scan.
SELECT R.state, SUM(R.sales)
FROM RegionalSales R
WHERE R.category = ’Laptop’
GROUP BY R.state

SELECT R.category, SUM(R.sales)
FROM RegionalSales R
WHERE R.state = ’Wisconsin’
GROUP BY R.category
Index on precomputed view is great!
Data Mining Definition
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Example pattern (Census Bureau Data):
Data Mining Definition (Cont.)
Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data.
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Why Use Data Mining Today?
Human analysis skills are inadequate:
– Volume and dimensionality of the data
– High data growth rate
– 3V in Big Data: Volume, Velocity, Variety
Availability of:
– Data
– Storage
– Computational power
– Off-the-shelf software
Preprocessing and Mining
Original Data → (Data Integration) → Target Data → (Preprocessing) → Preprocessed Data → (Model Construction) → Patterns → Knowledge
Example Application: Sports
IBM Advanced Scout analyzes NBA game statistics:
– Shots blocked
– Assists
– Fouls
Advanced Scout
• Example pattern: An analysis of the data from a game played between the New York Knicks and the Charlotte Hornets revealed that “When Glenn Rice played the shooting guard position, he shot 5/6 (83%) on jump shots.”
• Pattern is interesting:
Data Mining Techniques
• Supervised learning
– Classification and regression
• Unsupervised learning
– Clustering
• Dependency modeling
– Associations, summarization, causality
• Outlier and deviation detection
Market Basket Analysis (Example
Dependency Modeling)
• Consider a shopping cart filled with several items.
• Market basket analysis tries to answer the following questions:
– Who makes purchases?
– What do customers buy together?
Market Basket Analysis
Given:
• A database of customer transactions
• Each transaction is a set of items
• Example: transaction with TID 111 contains items {Pen, Ink, Milk, Juice}
Market Basket Analysis (Contd.)
• Co-occurrences
– 80% of all customers purchase items X, Y, and Z together.
• Association rules
– 60% of all customers who purchase X and Y also buy Z.
• Sequential patterns
Confidence and Support
We prune the set of all possible association rules using two interestingness measures:
• Confidence of a rule:
– X → Y has confidence c if P(Y|X) = c
• Support of a rule:
– X → Y has support s if P(X,Y) = s
We can also define:
• Support of an itemset (a co-occurrence) XY:
– the fraction of transactions that contain both X and Y, i.e., P(X,Y)
Example
Examples:
• {Pen} => {Milk}: support 75%, confidence 75%
• {Ink} => {Pen}: support 75%, confidence 100%
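Support and confidence follow directly from their definitions. The transaction contents below (beyond TID 111, which the earlier slide gives) are hypothetical, chosen to reproduce the percentages above:

```python
# Hypothetical transaction database; TID 111 matches the earlier slide.
transactions = {
    111: {"Pen", "Ink", "Milk", "Juice"},
    112: {"Pen", "Ink", "Milk"},
    113: {"Pen", "Milk"},
    114: {"Pen", "Ink", "Juice"},
}

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions.values() if itemset <= t) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs) = support(lhs u rhs) / support(lhs)."""
    return support(lhs | rhs) / support(lhs)

print(support({"Pen", "Milk"}))       # 0.75 : {Pen} => {Milk} has support 75%
print(confidence({"Pen"}, {"Milk"}))  # 0.75 : and confidence 75%
print(confidence({"Ink"}, {"Pen"}))   # 1.0  : {Ink} => {Pen} has confidence 100%
```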
Market Basket Analysis:
Applications
• Sample Applications
– Direct marketing
– Floor/shelf planning
– Web site layout
– Cross-selling
Frequent Itemset Algorithms
• Applications
• More abstract problem
Applications of Frequent Itemsets
• Market basket analysis
• Association rules
• Classification (especially: text, rare classes)
• Seeds for construction of Bayesian networks
• Collaborative filtering
Problem
Abstract:
• A set of items {1, 2, …, k}
• A database of transactions (itemsets) D = {T1, T2, …, Tn}, Tj ⊆ {1, 2, …, k}
GOAL: Find all itemsets that appear in at least x transactions.
(“appear in” == “are subsets of”; I ⊆ T: T supports I)
For an itemset I, the number of transactions it appears in is called the support of I.
Concrete:
• I = {milk, bread, cheese, …}
• D = { {milk, bread, cheese}, {bread, cheese, juice}, … }
GOAL: Find all itemsets that appear in at least 1000 transactions.
Problem (Contd.)
Definitions:
• An itemset is frequent if it is a subset of at least x transactions. (FI.)
• An itemset is maximally frequent if it is frequent and it does not have a frequent superset. (MFI.)
GOAL: Given x, find all frequent (maximally frequent) itemsets (to be stored in the FI (MFI)).
Obvious relationship: MFI ⊆ FI
Example:
D = { {1,2,3}, {1,2,3}, {1,2,3}, {1,2,4} }
Minimum support x = 3
Support({1,2}) = 4, so {1,2} is frequent.
{1,2,3} is maximally frequent.
All maximal frequent itemsets: {1,2,3}
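The example can be checked by brute force: enumerate every candidate itemset over {1,2,3,4}, count supports, and keep the frequent and maximally frequent ones.

```python
# Brute-force FI/MFI computation for the example database, x = 3.
from itertools import chain, combinations

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
x = 3  # minimum support

def support(I):
    """Number of transactions in D that contain itemset I."""
    return sum(1 for T in D if I <= T)

items = set().union(*D)
# All non-empty itemsets over the item universe.
candidates = [set(s) for s in chain.from_iterable(
    combinations(sorted(items), r) for r in range(1, len(items) + 1))]

FI = [I for I in candidates if support(I) >= x]      # frequent itemsets
MFI = [I for I in FI if not any(I < J for J in FI)]  # no frequent proper superset
```

This confirms Support({1,2}) = 4 and that {1,2,3} is the only maximally frequent itemset; brute force is exponential in k, which is what the lattice-search algorithms below avoid.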
The Itemset Lattice
(Figure: the lattice of all itemsets over {1,2,3,4}, from {} at the top, through the singletons, pairs, and triples, down to {1,2,3,4}.)
Frequent Itemsets
(Figure: the same lattice with a border separating the frequent itemsets, in the upper part, from the infrequent itemsets below.)
Breadth First Search: 1-Itemsets
(Figure: the lattice with the 1-itemsets counted; frequent and infrequent singletons are marked.)
The Apriori Principle: every superset of an infrequent itemset is itself infrequent.
Breadth First Search: 2-Itemsets
(Figure: the 2-itemsets currently examined; pairs with an infrequent subset are pruned, the rest are still unknown. Legend: infrequent, frequent, currently examined, don’t know.)
Breadth First Search: 3-Itemsets
(Figure: the 3-itemsets currently examined; by the Apriori Principle, any 3-itemset with an infrequent subset is pruned. Legend: infrequent, frequent, currently examined.)
Breadth First Search: Remarks
• We prune infrequent itemsets and avoid counting them.
• To find a frequent itemset with k items, we count all 2^k of its subsets along the way.
• Breadth-first search is the basis of the A-Priori algorithm. Next, we show how to implement A-Priori in SQL.
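The level-wise search just described can be sketched in plain Python; this is an illustrative in-memory implementation (not the SQL version), reusing the earlier example database:

```python
# Level-wise A-Priori: candidates at level k+1 are built only from frequent
# k-itemsets, pruning every superset of an infrequent itemset.
from itertools import combinations

def apriori(D, minsup):
    """Return {itemset: support} for every itemset with support >= minsup."""
    items = sorted(set().union(*D))
    level = {frozenset([i]) for i in items}  # level 1: candidate single items
    frequent = {}
    while level:
        # Count the support of the surviving candidates at this level.
        counts = {I: sum(1 for T in D if I <= T) for I in level}
        level_freq = {I: c for I, c in counts.items() if c >= minsup}
        frequent.update(level_freq)
        # Generate (k+1)-candidates from frequent k-itemsets, keeping only
        # those whose k-subsets are all frequent (the Apriori pruning step).
        keys = list(level_freq)
        level = set()
        for a, b in combinations(keys, 2):
            cand = a | b
            if len(cand) == len(a) + 1 and all(
                    frozenset(s) in level_freq
                    for s in combinations(cand, len(a))):
                level.add(cand)
    return frequent

D = [{1, 2, 3}, {1, 2, 3}, {1, 2, 3}, {1, 2, 4}]
result = apriori(D, 3)
```

On this data the search never counts any itemset containing the infrequent item 4 beyond level 1, which is exactly the pruning the slides illustrate.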
Finding Frequent Pairs
• The simplest case is when we only want to find “frequent pairs” of items.
• Assume data is in a relation Baskets(basket, item).
• The support threshold s is the minimum number of baskets in which a pair must appear to be considered frequent.
Frequent Pairs in SQL
SELECT b1.item, b2.item
FROM Baskets b1, Baskets b2
WHERE b1.basket = b2.basket
AND b1.item < b2.item
GROUP BY b1.item, b2.item
HAVING COUNT(*) >= s;
Look for two Baskets tuples with the same basket and different items. The first item must precede the second, so we don’t count the same pair twice.
Create a group for each pair of items that appears in at least one basket. Throw away pairs of items that do not appear at least s times.
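The pair-counting query runs verbatim in sqlite3; the basket rows below are hypothetical, with support threshold s = 2:

```python
# Self-join of Baskets with itself to count item pairs, as in the slide.
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE Baskets (basket, item)")
c.executemany("INSERT INTO Baskets VALUES (?, ?)", [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "milk"), (2, "juice"),
    (3, "bread"), (3, "juice"),
])

s = 2  # support threshold
pairs = c.execute("""
    SELECT b1.item, b2.item
    FROM Baskets b1, Baskets b2
    WHERE b1.basket = b2.basket
      AND b1.item < b2.item
    GROUP BY b1.item, b2.item
    HAVING COUNT(*) >= ?""", (s,)).fetchall()
# Frequent pairs here: (bread, juice) and (bread, milk), each in 2 baskets.
```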
A-Priori Trick --- (1)
• A straightforward implementation involves a join of a huge Baskets relation with itself.
• The a-priori algorithm speeds the query by recognizing that a pair of items {i, j} cannot have support s unless both {i} and {j} do.
A-Priori Trick --- (2)
• Use a materialized view to hold only information about frequent items:
INSERT INTO Baskets1(basket, item)
SELECT * FROM Baskets
WHERE item IN (
SELECT item FROM Baskets
GROUP BY item
HAVING COUNT(*) >= s
);
Items that appear in at least s baskets.
A-Priori Algorithm
1. Materialize the view Baskets1.
2. Run the obvious query, but on Baskets1 instead of Baskets.
• Computing Baskets1 is cheap, since it doesn’t involve a join.
• Baskets1 probably has many fewer tuples than Baskets.
Two Observations
• Monotone function: if x >= y, then f(x) >= f(y).
• Antimonotone function: if x >= y, then f(x) <= f(y).
• Apriori can be applied to any constraint P that is antimonotone (e.g., support > const).
– Start from the empty set.
– Prune supersets of sets that do not satisfy P.
• Apriori can also be applied to a monotone constraint Q (e.g., sum > const).
– Start from the set of all items instead of the empty set.
– Prune subsets of sets that do not satisfy Q.
Negative Pruning an Antimonotone P
(Figure sequence: the lattice explored top-down from {}; once a currently examined itemset fails P, all of its supersets are pruned. Legend: frequent, infrequent, currently examined, don’t know.)
Negative Pruning a Monotone Q
(Figure: the lattice explored bottom-up; once a currently examined itemset fails Q, all of its subsets are pruned. Legend: satisfies Q, doesn’t satisfy Q, currently examined, don’t know.)
The New Problem
New Goal:
• Given constraints P and Q, with P antimonotone (e.g., support) and Q monotone (e.g., a statistical constraint),
• find all itemsets that satisfy both P and Q.
Recent solutions:
Conceptual Illustration of Problem
(Figure: in the lattice between {} and the full itemset D, the sets satisfying P form a downward-closed region (all subsets satisfy P), the sets satisfying Q form an upward-closed region (all supersets satisfy Q), and the answer is their intersection: the sets satisfying both P and Q.)
Summary
• Decision support is an emerging, rapidly growing subarea of databases.
• It involves the creation of large, consolidated data repositories called data warehouses.
• Warehouses are exploited using sophisticated analysis techniques: complex SQL queries and OLAP “multidimensional” queries (influenced by both SQL and spreadsheets).
• New techniques for database design, indexing, and view maintenance are needed.
Summary (Cont.)
• Data Mining
– Supervised
– Unsupervised
– Dependency modeling
– Outlier detection
– Trend analysis and prediction