Database Management System

(1)

Database Management System

Lecture 4

Database Design – Normalization and View

* Some materials adapted from R. Ramakrishnan, J. Gehrke and Shawn Bowers

(2)

Today’s Agenda

• Normalization

• View

2 Database Management System

(3)

Normalization

(4)

Normalization

• Process or replacing a table with two or more tables

Database Management System 4

EID Name Dept DeptName

A01 Joshua 12 CS

A12 Bean 10 HR

A13 Bean 12 CS

A03 Kevin 12 CS

EID Name Dept

A01 Joshua 12

A12 Bean 10

A13 Bean 12

A03 Kevin 12

Vs.

DeptID DeptName

10 CS

12 HR

EmpDept

Emp Dept

Which schema is better?

Why?

(5)

Normalization Issues

• The EmpDept schema combines two different concepts

• Employee information, together with

• Department information

• To join or not to join that is the question

• If we separate the two concepts we could save space but some queries would run slower (Joins)

• If we combine the two ideas we have redundancy but some queries would run faster (no Joins)

• So we have a tradeoff ...

• Redundancy has a side effect: “anomalies”

(6)

Types of Anomalies

• “Update Anomaly”: If the CS department changes its name, we must change multiple rows in EmpDept

• “Insertion Anomaly”: If a department has no employees, where do we store its id and name?

• “Deletion Anomaly”: If A12 quits, the information about the HR department will be lost

• These are in addition to redundancy in general

• For example, the department name is stored multiple times

A01 Joshua 12 CS

A12 Bean 10 HR

A13 Bean 12 CS

A03 Kevin 12 CS

EmpDept

(7)

Using NULL Values

• Using NULL values can help insertion and deletion anomalies

• But NULL values have their own issues

• They make aggregate operators harder to use

• Not always clear what NULL means

• May need outer joins instead of ordinary joins

• In this case, EID is a primary care, and so it cannot contain a NULL value!

• They don’t address update anomalies or redundancy issues

A01 Joshua 12 CS

NULL NULL 10 HR

A13 Bean 12 CS

A03 Kevin 12 CS

EmpDept

(8)

Decomposition

• Normalization involves decomposing (partitioning) the table into separate tables

• Check to see if redundancy still exists (... repeat)

• The key to understanding when and how to decompose schemas is through ...

“functional dependencies”

• which generalizes the notion of keys

A01 Joshua 12

A12 Bean 10

A13 Bean 12

A03 Kevin 12

DeptID DeptName

10 CS

12 HR

Emp Dept

(9)

Keys

• Because EID is a key:

• If two rows have the same EID value, then they have the same value for every other attribute

• Thus given an EID value, the other values are “determined”

• A Key is like a “function”:

• f : EID → Name × Dept × DeptName – E.g., f(A01) = <Joshua, 12, CS>

• Recall functions always return the same value for a given value

A01 Joshua 12 CS

A12 Bean 10 HR

A13 Bean 12 CS

A03 Kevin 12 CS

EmpDept

(10)

Functional dependencies

• We say that EID “functionally determines” all other attributes

• This relationship among attributes is called a “Functional Dependency” (FD)

• We write FDs as:

EID → Name, Dept, DeptName or

EID → Name, EID → Dept, EID → DeptName

A01 Joshua 12 CS

A12 Bean 10 HR

A13 Bean 12 CS

A03 Kevin 12 CS

(11)

FDs that are not implied by keys

• Is Name → Dept a functional dependency?

• No, e.g., <Bean, 10> and <Bean, 12>

• Is Dept → DeptName a functional dependency?

• Yes in this table it is

• In general, it would be expected that departments only have one name

A01 Joshua 12 CS

A12 Bean 10 HR

A13 Bean 12 CS

A03 Kevin 12 CS

(12)

Functional Dependencies

• For sets A and B of attributes in a relation, we say that A (functionally) determines B ... or A → B is a Functional Dependency (FD)

• if whenever two rows agree on A they also agree on B

• An FD defines a function in the “mathematical sense”

• There are two special kinds of FDs:

• “Key FDs” of the form X →A where X contains a key (X is called a superkey)

• “Trivial FDs” of the form A → B such that A ⊇ B

• ... e.g., (Name, Dept) → Dept

• these are boring but become important later

(13)

Functional Dependencies

• Functional dependencies, like keys, are based on the semantics of the application

• Likely functional dependencies:

• ssn → name

• account → balance

• Unlikely functional dependencies:

• date → trasactionid

• checkamt -> checknumber

(14)

Enforcing Functional Dependencies

• For the table

• There is an FD from dept → deptname

• Although eid is the key for this table ...

• ... is it still possible for there to be two names for the same department?

• YES!

Emp(eid, name, dept, deptname)

(15)

Every Key Implies a Set of FDs

• For the table

• We have the following FDs based on ssn being a key:

• eid → name

• eid → dept

• eid → deptname

• Each key implies a set of functional dependencies from the key to the non- key attributes

Emp(eid, name, dept, deptname) `

(16)

Functional Dependencies and Keys

• Given a table R with attributes a and b together forming a key, the following FDs are implied

• Given R(a, b, c, d, e)

• Which we can also write as ab → cde

ab → c

ab → d

ab → e

(17)

Functional Dependencies May Suggest Keys

• If we know these FDs:

• then ssn is a key for a table with these attributes:

ssn → name ssn → hiredate ssn → phone

Employee(ssn, name, hiredate, phone)

(18)

What are the key and non-trivial FDs?

• Which of these will be enforced?

Customer(CustID, Address, City, Zip, State)

Enrollment(StdntID, ClassID, Grade, InstrID, StdntName, InstrName)

(19)

Non-Trivial Functional Dependencies

• The FDs that are not enforced by the DBMS lead to both redundancy and anomalies (only keys are enforced)

• Not all redundancy is covered by FDs

• name stored redundantly, and same employee can have more than one name

• Cannot be determined from the instance (instead, based on application semantics)

• We can determine what is not an FD

• DB data mining approaches infer “FDs” (i.e., association rules)

Emp(ssn, name, salary, birthdate)

Employee(ssn, name, address)

(20)

Example Decomposition based on FDs

• For this table

• We can move the non-trivial FDs into their own table with dnum as the key:

• The Emp table becomes:

• ... and Emp.dept is now a foreign key to Dept.dnum

Emp(ssn, name, birthdate, address, dnum, dname, dmgr)

Dept(dnum, dname, dmgr)

Emp(ssn, name, birthdate, address, dept)

(21)

Normalization based on FDs

• Identify all all the FDs

• FDs implied by the keys

• FDs not implied by the keys (the “troublesome” ones)

• Generate one or more new tables from the FDs not implied by the keys

• Each new tables should only have FDs implied by the key

• Remove the attributes from original table that are functionally dependent on “troublesome” FDs

• Specify appropriate foreign keys to these new tables

(22)

Reasoning about Functional Dependencies

• Two natural FDs are

• EID → DeptID and DeptID → DeptName

• These two FDs imply EID → DeptName

• If two tuples agree on EID, then by EID → DeptID they agree on DeptID ...

• ... and if they agree on DeptID, then by DeptID → DeptName they agree on DeptName

• The set of FDs implied by a given set F of FDs is called the closure of F ...

which is denoted F⁺

EmpDept(EID, Name, DeptID, DeptName)

(23)

Armstrong’s Axioms

• The closure F+ of F can be computed using these axioms

• Reflexivity(재귀): If X⊇Y, then X→Y

• Augmentation(부가): If X → Y, then XZ → YZ for any Z

• Transitivity(이행): If X→Y and Y→Z then X→Z

• Repeatedly applying these rules to F until we no longer produce any new FDs results in a sound and complete inference procedure ...

• Soundness

• Only FDs in F⁺ are generated when applied to FDs in F

• Completeness

• Repeated application of these rules will generate all FDs in F⁺

(24)

Finding Keys

• We can determine if a set of attributes X is a key for s relation R by computing X⁺ as follows

• Let the set of attributes of R be A

• X is a key for R if and only if X⁺ = A

Compute X⁺ from X let X⁺= {X}

repeat until there is no change in X⁺ {

if Y → Z is an FD and Y ⊆ X⁺ Then X⁺ = X⁺∪ Z

}

return X⁺

(25)

Example

• Given the schema R(A, B, C, D, E) such that

• Find the keys of this schema, besides A ...

• Start with BC → A as one example

• BC determines A is given

• A →ABCDE because A is a key

• BC → ABCDE by transitivity

• Thus, BC is a key!

• You should understand the axioms and the algorithm ...

they will come in handy when normalizing

BC → A DE → C

(26)

Redundancy and Functional Dependencies

• Example schema

• Note that every non-key FD is associated with some redundancy

• Our game plan is to use non-key and non-trivial FDs to decompose any relation into a form that has no redundancy ...

• ... resulting in a so-called “Normal Form”

EmpDept(EID, Name, Dept, DeptName)

Assigned(EmptID, JobID, EmpName, Percent)

Enrollment(StdntID, ClassID, Grade, InstrID, StdntName, InstrName)

(27)

Boyce-Codd Normal Form (BCNF)

• A relation is in “Boyce-Codd Normal Form” if all of its FDs are either

• Trivial FDs (e.g., AB → A) or

• Key FDs

• Which (if any) of these relations is in BCNF?

EmpDept(EID, Name, Dept, DeptName)

Assigned(EmptID, JobID, EmpName, Percent)

Enrollment(StdntID, ClassID, Grade, InstrID, StdntName, InstrName)

(28)

BCNF and Redundancy

• BCNF relations have no redundancy cause by FDs

• A relation has redundancy if there is an FD between attributes

• ... and there can be repeated entries of data for those attributes

• For example, consider

• if the relation is in BCNF, then the FD must be a key FD, and so DeptID must be a key

• implying that any pair such as <12, CS> can appear only once!

DeptID DeptName

12 CS

10 HR

12 CS

(29)

Decomposition into BCNF

• An algorithm for decomposing a relation R with attributes A into a collection of BCNF relations

if R is not in BCNF and X → Y is a non-key FD then decompose R into A – Y and XY

if A – Y and/or XY is not in BCNF then

recursively apply step 1 (to A – Y and/or XY)

(30)

Example

• First use the non-key FD StdntID → StdntName

• ... which gives the decomposition

• Now use the non-key FD ClassID → InstrID

• ... which gives the decomposition

• All relations are now in BCNF!

Enrollment(StdntID, ClassID, Grade, InstrID, StdntName)

Enrollment(StdntID, ClassID, Grade, InstrID) Student(StdntID, StdntName)

Enrollment(StdntID, ClassID, Grade) ClassInstructor(ClassID, InstrID)

Student(StdntID, StdntName)

(31)

Another Example

• Given the schema

• and assuming FDs

• ... lets Decompose it into BCNF relations

Loans(BranchID, LoanID, Amount, Assets, CustID, CustName)

BranchID → Assets CustID → CustName

Loans(BranchID, LoanID, Amount, CustID) Customer(CustID, CustName)

Branch(BranchID, Assets)

– Loans.BranchID REFERENCES Branch.BranchID

– Loans.CustID REFERENCES Customer.CustID

(32)

Lossless Decomposition

• Some decompositions may lose information content

• For example, lets say we decomposed:

• into

• a row (223, A) in StudentGrade implies student 223 received an A in some course

• and a row (421, A) in ClassGrade means that some student received an A in course 421

• but now we have no way to recreate the original table!

• This decomposition is “Lossy”

Enroll(StdntID, ClassID, Grade)

StudentGrade(StdntID, Grade)

ClassGrade(ClassID, Grade)

(33)

Lossless Decomposition

• A decomposition of a schema with FDs F into attribute sets X and Y is

“lossless” if for every instance R that satisfies F:

• That is, we can recover R from the natural join of the decomposed versions of R

R = πX(R) ⋈ πY ( R )

(34)

Example of a Lossless Decomposition

A01 Joshua 12 CS

A12 Bean 10 HR

A13 Bean 12 CS

A03 Kevin 12 CS

EmpDept

A01 Joshua 12

A12 Bean 10

A13 Bean 12

A03 Kevin 12

X = EID, Name, Dept

Dept DeptName

12 CS

10 HR

12 CS

Y = Dept, DeptName

π

_X

(R) π

_Y

(R)

π

_X

(R) ⋈ π

_Y

(R) = R

R

(35)

Example of a Lossy Decomposition

SID ClassID Grade

123 cs223 A

456 cs421 A

Enroll

SID Grade

123 A

456 A

X = SID, Grade

ClassID Grade

cs223 A

cs421 A

Y = ClassID, Grade

π

_X

(R) π

_Y

(R) π

_X

(R) ⋈ π

_Y

(R) ≠ R R

SID ClassID Grade

123 cs223 A

456 cs223 A

123 cs421 A

456 cs421 A

(36)

Producing Only Lossless Decompositions

• We only want to produce lossless decompositions

• This is easy to guarantee:

• The decomposition of R with respect to FDs F into attributes sets A1 and A2 is lossless if and only if A1 ∩ A2 contains a key for either A1 or A2

• If they have a key in common, they can be joined back together – Note that {StdntID, Grade} ∩ {ClassID, Grade} = {Grade}

• See page 620 in the text

• This implies that the BCNF decomposition algorithm produces only lossless decompositions

• In this case F includes the FD X→Y and the decomposition is A1 = A – Y and A2 = X∪Y

• Therefore A1 ∩ A2 = X is a key for X∪A

(37)

Producing Only Lossless Decompositions

• Given the schema R(S, C, G) with FD SC → G

• Is the decomposition into R1(S, G) and R2(C, G) lossless or lossy? Why?

• Take the intersection of the two sets {S, G} ∩ {C, G} = G

• Then determine if G is a key for either table

• That is, does G → C?

• NO

• Does G → S?

• NO

• Therefore, this decomposition is lossy!

(38)

Dependency Preserving Decompositions

• Decompositions should also preserve FDs

• For example

• Addr, City, State → Zip

• Zip → State

• Consider this decomposition

• Although this is BCNF, it does not preserve the FD

• Addr, City, State → Zip

• Here are some values

Emp(EID,Addr, City, State, Zip)

Emp(EID,Addr, City, Zip) ZipState(Zip, State)

<123, 111 W 1st, Spokane, 99999> <99999,WA>

<456, 111 W 1st, Spokane, 00000> <00000,WA>

(39)

Dependency Preserving Decompositions

• Let R be a schema with FDs F and X, Y sets of attributes in R

• A dependency A→ B is in X if all attributes of A and all attributes of B are in X

• The projection FX of dependencies F on attributes X is the closure of the FDs in X

• The decomposition of R into schemas with attributes X and Y is “dependency preserving” if (F_X ∪ F_Y)⁺ = F⁺

(40)

Example

• Consider Emp(Addr, City, State, Zip) with

• If we decompose Emp so that X = {Addr, City, Zip} and Y = {Zip, State} what are the projections FX and FY?

• Is X,Y a dependency preserving decomposition?

• No ... (Zip → State)⁺ does not contain Addr,City,State → Zip and so it can never recreate F⁺

FX = ∅

FY = {Zip → State}

(Addr,City,State→Zip not in X, Zip→State not in X)

(Zip→State is in Y)

F = { Addr, City, State → Zip, Zip → State }

(41)

Third Normal Form (3NF)

• Some schemas do not have both a lossless and dependency preserving composition into BCNF schemas

• Every schema has has a lossless dependency preserving decomposition into 3NF ...

• A schema R with FDs F is in 3NF if for every X→Y in F either:

• X→Y is a trivial FD (i.e., X ⊇ Y)

• X→Y is a key FD (i.e., X is a superkey

• Y is a part of some key for R

Definition of BCNF

(42)

Third Normal Form (3NF)

• In other words, 3NF allows FDs that only partially (i.e., do not fully) depend on the key ...

• For Emp(Addr, City, State, Zip) with

• the keys are: (Addr, City, State) and (Addr, City, Zip)

• Although there is no decomposition of this relation into BCNF ...

• This relation is in 3NF!

F = { Addr, City, State → Zip, Zip → State }

(43)

Wrapping up

• Almost all schemas can be decomposed into BCNF schemas that preserve all FDs

• But every once in a while we get a schema like the previous one

• So, if we do not have an ideal decomposition (lossless, dependency

preserving) into BCNF, we can decompose into 3NF and have a lossless and dependency-preserving schema

• But with some minor redundancy

(44)

View

(45)

Views

• A “view” is a query that is stored in the database and that acts as a “virtual”

table

• For example:

• Views can be used just like base tables within another query or in another view

CREATE VIEW astudents AS SELECT *

FROM Students WHERE gpa > 3.0;

SELECT *

FROM astudentsWHERE age > 20;

(46)

Implementing Views

• The DBMS expands (i.e., rewrites) your query to include the view definition

• • This query is expanded to

SELECT ClassID

FROM astudent S, enrollment E WHERE S.StdntID = E.StdntID

SELECT ClassID

FROM (SELECT * FROM student WHERE gpa >= 4.0) AS S, enrollment E

WHERE S.StdntID = E.StdntID;

(47)

Views for Security

• For a base table:

• This view gives a “secure” version of the student relation

• Here, using the view we avoid exposing the SSN, Telephone,Email,etc.

Student(StdntID, SSN, Name,Address,Telephone, Email, ...)

CREATE VIEW sstudent AS

SELECT StdntID, Name,Address

FROM Student;

(48)

Views for Integration

• Different companies might have different but similar “parts” databases

• We can combine these parts DBs into a single version using a view definition

• For instance, if company 1 uses pounds and company 2 uses kilograms for part weights:

PartsCo1(PartID, weight, ...) PartsCo2(PartID, weight, ... )

CREATE VIEW Part AS

(SELECT PartID, 2.2066*weight, ...

FROM PartsCo1) UNION

(SELECT PartID, weight, ...

FROM PartsCo2);

(49)

View Update Problem

• Views cannot always be updated unambiguously

• For example, for

• And views

• How do we change the GPA of CS majors from 3.5 to 3.6 using majorgpa?

• How do we delete a row (e.g., <jim, cpsc>) from stddept?

Students(stdntid,gpa,deptid,...)

Department(deptid,dname,office,head,...)

CREATE VIEW majorgpa AS SELECT major, AVG(gpa) FROM Students

GROUP BY major

CREATE VIEW stddept AS SELECT stdnNd, dname

FROM Students JOIN Department USING (depNd)

(50)

View Update Problem

• A view can in general be updated if

• It is defined over a single base table

• It uses only selection and projection

• It does not use aggregates, group by

• It does not use DISTINCT

• It does not use set operations (UNION, INTERSECT, MINUS)

• Different products provide different support for views, especially w.r.t updates

• Many more details not discussed here

(51)

Data Independence

• Multiple levels of abstraction support data independence

• Changes isolated to their “levels”

• This is very desirable since things change often!

External View:

What application programmers see Logical View:

The conceptual or logical relations Phyisical View:

Optimized/normalized relations including indexes

Phyisical Storage (on disk(s) ...)

External Schema

A External Schema

A

Logical Schema

Physical Schema

(52)

For Next Week

• Review – Quiz on the material

• Ch. 19 to 19.6

• Reading assignments

• Ch. 19 to 19.6

• Be sure you understand

• Keys, Functional Dependencies, and Boyce-Codd Normal Form (FD)

• Normalization, BCNF, 3NF