SCHEMA REFINEMENT
CHAPTER 5
• Conceptual database design gives us a set of relation schemas and integrity constraints
(ICs) that can be regarded as a good starting
point for the final database design.
• This initial design must be refined by taking the ICs into account more fully than is
possible with just the ER model constructs,
and also by considering performance criteria.
Introduction to Schema Refinement
• We now present an overview of the problems that schema refinement is intended to address and a refinement approach based on decompositions.
• Redundant storage of information is the root cause of these problems.
• Although decomposition can eliminate redundancy, it can lead to problems of its own and should be used with caution.
1. Problems Caused by Redundancy
• Redundant storage
• Update anomalies
• Insertion anomalies
Hourly_Emps (SSN, Name, Lot, Rating, Hourly_wages, Hours_worked)
SSN | Name | Lot | Rating | Hourly_wages | Hours_worked
2. Decompositions
• The problems arising from redundancy can be solved by replacing a relation with a collection of smaller relations.
• A decomposition of a relation schema R consists of replacing the relation schema by two (or more) relation schemas that each contain a subset of the attributes of R and together include all attributes of R.
• Hourly_Emps2 (SSN, Name, Lot, Rating, Hours_worked)
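The idea of a decomposition can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the second relation Wages (Rating, Hourly_wages) and the premise that Rating determines Hourly_wages are not shown on this slide, and the sample rows are made up.

```python
# Hypothetical sample rows (illustrative values only).
# Columns: SSN, Name, Lot, Rating, Hourly_wages, Hours_worked
hourly_emps = [
    ("111-11-1111", "Ann", 48, 8, 10, 40),
    ("222-22-2222", "Bob", 22, 8, 10, 30),
    ("333-33-3333", "Cid", 35, 5, 7, 30),
]

def decompose(rows):
    """Split Hourly_Emps into Hourly_Emps2 and an assumed Wages(Rating, Hourly_wages)."""
    emps2 = {(ssn, name, lot, rating, hours)
             for ssn, name, lot, rating, _, hours in rows}
    wages = {(rating, wage) for _, _, _, rating, wage, _ in rows}
    return emps2, wages

def rejoin(emps2, wages):
    """Natural join of the two smaller relations on Rating."""
    return {
        (ssn, name, lot, rating, wage, hours)
        for ssn, name, lot, rating, hours in emps2
        for w_rating, wage in wages
        if rating == w_rating
    }

emps2, wages = decompose(hourly_emps)
```

In this sketch the wage for rating 8 is stored once in `wages` instead of once per employee, and `rejoin(emps2, wages)` returns the original rows.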
Problems related to Decomposition
• Unless we are careful, decomposing a relation schema can create more problems than it solves. We need to ask two questions repeatedly:
1) Is there reason to decompose a relation?
• To answer this question, several normal forms have been proposed for relations.
2) What problems (if any) does the decomposition cause?
• With respect to the second question, two properties of decompositions are of particular interest. The lossless-join property enables us to recover any instance of the decomposed relation from corresponding instances of the smaller relations.
• The dependency-preservation property enables us to enforce any constraint on the original relation by simply enforcing some constraints on each of the smaller relations.
Functional Dependencies
• A functional dependency (FD) is a kind of IC that generalizes the concept of a key.
• Let R be a relation schema and let X and Y be nonempty sets of attributes in R. An instance r of R satisfies the FD X → Y if the following holds for every pair of tuples t1 and t2 in r: if t1.X = t2.X, then t1.Y = t2.Y.
An instance that satisfies AB → C:
A  B  C  D
a1 b1 c1 d1
a1 b1 c1 d2
a1 b2 c2 d1
a2 b1 c3 d1
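The FD definition can be checked mechanically on a single instance. The following is a minimal sketch; the function name `satisfies_fd` and the dictionary encoding of tuples are choices made here for illustration. Note that passing the check on one instance does not prove that the FD holds on the schema.

```python
def satisfies_fd(rows, lhs, rhs):
    """Return True if every pair of rows that agrees on lhs also agrees on rhs."""
    seen = {}  # maps each lhs value to the first rhs value seen with it
    for row in rows:
        x = tuple(row[a] for a in lhs)
        y = tuple(row[a] for a in rhs)
        if seen.setdefault(x, y) != y:
            return False
    return True

# The four-tuple instance over ABCD from the slide.
instance = [
    {"A": "a1", "B": "b1", "C": "c1", "D": "d1"},
    {"A": "a1", "B": "b1", "C": "c1", "D": "d2"},
    {"A": "a1", "B": "b2", "C": "c2", "D": "d1"},
    {"A": "a2", "B": "b1", "C": "c3", "D": "d1"},
]
```

Here `satisfies_fd(instance, ["A", "B"], ["C"])` is true, while `satisfies_fd(instance, ["A"], ["B"])` is false because a1 appears with both b1 and b2.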
Closure of a Set of FDs
• We say that an FD f is implied by a given set F
of FDs if f holds on every relation instance that satisfies all dependencies in F; that is, f holds whenever all FDs in F hold.
• The set of all FDs implied by a given set F of FDs is called the closure of F, denoted by F+.
Armstrong’s Axioms
Here X, Y, and Z denote sets of attributes of relation R:
• Reflexivity: If X ⊇ Y, then X → Y.
• Augmentation: If X → Y, then XZ → YZ for any Z.
• Transitivity: If X → Y and Y → Z, then X → Z.
• Union: If X → Y and X → Z, then X → YZ.
• Decomposition: If X → YZ, then X → Y and X → Z.
Contracts (contractid, supplierid, projectid, deptid, partid, qty, value)
• This can be denoted as CSJDPQV.
• The meaning of a tuple is that the contract with contractid C is an agreement that supplier S will supply Q items of part P to project J associated with department D; the value of this contract is V.
• The following ICs are known to hold:
1. The contract id C is a key: C → CSJDPQV
2. A project purchases a given part using a single contract: JP → C
3. A department purchases at most one part from a supplier: SD → P
Some additional FDs hold in the closure of the set of given FDs:
• From JP → C, C → CSJDPQV, and transitivity: JP → CSJDPQV
• From SD → P and augmentation: SDJ → JP
• From SDJ → JP, JP → CSJDPQV, and transitivity: SDJ → CSJDPQV
• From C → CSJDPQV using decomposition: C → C, C → S, C → J, etc.
Attribute Closure
• If we just want to check whether a given dependency, say, X → Y, is in the closure of a set F of FDs, we can do so efficiently without computing F+.
• We first compute the attribute closure X+ with respect to F, which is the set of attributes A such that X → A can be inferred using Armstrong’s Axioms. We then check whether Y ⊆ X+.
closure = X
repeat until there is no change: {
    if there is an FD V → W in F such that V ⊆ closure,
    then set closure = closure ∪ W
}
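The attribute-closure loop can be written as a short Python sketch. The function names and the encoding of each attribute as a single letter are assumptions made for illustration; the FD set is the one used in the Contracts example (C → CSJDPQV, JP → C, SD → P).

```python
def attribute_closure(attrs, fds):
    """Compute X+ for the attribute set attrs under the FDs in fds."""
    closure = set(attrs)
    changed = True
    while changed:  # repeat until there is no change
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def implies(fds, lhs, rhs):
    """Return True if lhs -> rhs is in the closure of fds."""
    return set(rhs) <= attribute_closure(lhs, fds)

# Each FD is a (left side, right side) pair; each letter is one attribute.
contracts_fds = [("C", "CSJDPQV"), ("JP", "C"), ("SD", "P")]
```

With these FDs, `attribute_closure("JP", contracts_fds)` contains every attribute of CSJDPQV, so JP → CSJDPQV is implied, whereas the closure of SD is only {S, D, P}.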
Definitions
• We already know the definitions of key, candidate key, and primary key.
• Superkey: A superkey of a relation schema R = {A1, A2, …, An} is a set of attributes S ⊆ R with the property that no two distinct tuples t1 and t2 in any legal relation state r of R will have t1[S] = t2[S].
In the above example, Marks is fully functionally dependent on {STUDENT#, COURSE#} and not on any proper subset of it. That is, Marks cannot be determined by STUDENT# or COURSE# alone; it can be determined only by STUDENT# and COURSE# together.
In the above relation, CourseName, IName, and Room# are partially dependent on the composite attribute {STUDENT#, COURSE#} because COURSE# alone determines them.
In the above example, Room# depends on IName, and IName in turn depends on COURSE#. Hence Room# transitively depends on COURSE#.
Similarly, Grade depends on Marks, and Marks in turn depends on {STUDENT#, COURSE#}; hence Grade is transitively dependent on {STUDENT#, COURSE#}.
Normal Forms
• First Normal Form (1NF)
– Atomic values
• Second Normal Form (2NF), Third Normal Form (3NF), and Boyce-Codd Normal Form (BCNF)
– based on primary keys
• Fourth Normal Form (4NF)
– based on keys, multi-valued dependencies
• Fifth Normal Form (5NF )
Levels of Normalization
Each higher level is a subset of the lower level
(Figure: nested levels of normalization, with DKNF as the highest level.)
First Normal Form (1NF)
• Historically, it is designed to disallow
– Composite attributes
– Multivalued attributes
– Or the combination of both
• All the values need to be atomic.
ISBN | Title | AuName | AuPhone | PubName | PubPhone | Price
0-321-32132-1 | Balloon | Sleepy, … | … | Small House | 714-000-0000 | $34.00
0-55-123456-9 | Main Street | Jones, Smith | 123-333-3333, 654-223-3455 | Small House | 714-000-0000 | $22.95
0-123-45678-0 | Ulysses | Joyce | 666-666-6666 | Alpha Press | 999-999-9999 | $34.00
1-22-233700-0 | Visual Basic | Roman | 444-444-4444 | Big House | 123-456-7890 | $25.00

ISBN | Title | PubName | PubPhone | Price
0-321-32132-1 | Balloon | Small House | 714-000-0000 | $34.00
0-55-123456-9 | Main Street | Small House | 714-000-0000 | $22.95
0-123-45678-0 | Ulysses | Alpha Press | 999-999-9999 | $34.00
1-22-233700-0 | Visual Basic | Big House | 123-456-7890 | $25.00

ISBN | AuName | AuPhone
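The step from the multivalued book table to the two 1NF tables can be sketched as follows. The function name `to_1nf` and the list encoding of AuName and AuPhone are illustrative choices, and pairing each author with a phone number by position is an assumption about how the original table should be read.

```python
def to_1nf(books):
    """Split multivalued author columns into a separate (ISBN, AuName, AuPhone) table."""
    book_rows, author_rows = [], []
    for b in books:
        book_rows.append((b["ISBN"], b["Title"], b["PubName"], b["PubPhone"], b["Price"]))
        # Assumption: the i-th author name goes with the i-th phone number.
        for name, phone in zip(b["AuName"], b["AuPhone"]):
            author_rows.append((b["ISBN"], name, phone))
    return book_rows, author_rows

books = [
    {"ISBN": "0-55-123456-9", "Title": "Main Street",
     "AuName": ["Jones", "Smith"],
     "AuPhone": ["123-333-3333", "654-223-3455"],
     "PubName": "Small House", "PubPhone": "714-000-0000", "Price": "$22.95"},
    {"ISBN": "0-123-45678-0", "Title": "Ulysses",
     "AuName": ["Joyce"], "AuPhone": ["666-666-6666"],
     "PubName": "Alpha Press", "PubPhone": "999-999-9999", "Price": "$34.00"},
]

book_rows, author_rows = to_1nf(books)
```

Every value in `book_rows` and `author_rows` is atomic, and the ISBN column links each author row back to its book.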
Second Normal Form (2NF)
• fd1 and fd4 are partial functional dependencies. Normalize to:
– Emp (eno, ename, title, bdate, salary, supereno, dno)
– WorksOn (eno, pno, resp, hours)
Old Scheme {Studio, Movie, Budget, Studio_City}
1. Key {Studio, Movie}
2. {Studio, Movie} → {Budget}
3. {Studio} → {Studio_City}
4. Studio_City is not a part of a key
5. Studio_City functionally depends on Studio, which is a proper subset of the key
New Scheme {Studio, Movie, Budget}
New Scheme {Studio, Studio_City}
Scheme {City, Street, HouseNumber, HouseColor, CityPopulation}
1. Key {City, Street, HouseNumber}
2. {City, Street, HouseNumber} → {HouseColor}
3. {City} → {CityPopulation}
4. CityPopulation does not belong to any key
5. CityPopulation is functionally dependent on City, which is a proper subset of the key
New Scheme {City, Street, HouseNumber, HouseColor}
New Scheme {City, CityPopulation}
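The partial dependencies that block 2NF in these examples can be found with a small search over proper subsets of the key. This is a sketch, not a general 2NF test: it assumes the given key is the only candidate key, and the function names are made up for illustration.

```python
from itertools import combinations

def attribute_closure(attrs, fds):
    """Compute the closure of attrs under fds (pairs of attribute sets)."""
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def partial_dependencies(key, attributes, fds):
    """Map each proper subset of the key to the non-key attributes it determines."""
    non_key = set(attributes) - set(key)
    found = {}
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            determined = attribute_closure(subset, fds) & non_key
            if determined:
                found[frozenset(subset)] = determined
    return found

studio_attrs = {"Studio", "Movie", "Budget", "Studio_City"}
studio_key = {"Studio", "Movie"}
studio_fds = [({"Studio", "Movie"}, {"Budget"}), ({"Studio"}, {"Studio_City"})]
```

For the studio scheme the result is a single entry, {Studio} determining {Studio_City}, which is the dependency that the decomposition moves into its own relation.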
Third Normal Form (3NF)
• Third normal form (3NF) is based on the concept of transitive dependency.
A functional dependency X → Y in a relation schema R is a transitive dependency if there is a set of attributes Z that is neither a candidate key nor a subset of any key of R, and both X → Z and Z → Y hold.
Let R be a relation schema, F be the set of FDs given to hold over R, X be a subset of the attributes of R, and A be an attribute of R.
R is in third normal form if, for every FD X → A in F, one of the following statements is true:
• A ∈ X; that is, it is a trivial FD, or
• X is a superkey, or
• A is part of some key for R.
fd2 results in a transitive dependency eno →
Scheme {Title, PubID, PageCount, Price}
1. Key {Title, PubID}
2. {Title, PubID} → {PageCount}
3. {PageCount} → {Price}
4. Both Price and PageCount depend on a key, hence 2NF
5. Transitively {Title, PubID} → {Price}, hence not in 3NF
New Scheme {PubID, PageCount, Price}
New Scheme {Title, PubID, PageCount}
Scheme {BuildingID, Contractor, Fee}
1. Primary Key {BuildingID}
2. {BuildingID} → {Contractor}
3. {Contractor} → {Fee}
4. {BuildingID} → {Fee}
5. Fee transitively depends on the BuildingID
6. Both Contractor and Fee depend on the entire key, hence 2NF
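The 3NF condition (for every FD X → A, either the FD is trivial, X is a superkey, or A is part of some key) can be tested by brute force on small schemas. The sketch below is illustrative only: the function names are made up, and candidate keys are found by trying every attribute subset, which is practical only for a handful of attributes.

```python
from itertools import combinations

def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def candidate_keys(attributes, fds):
    """Return all minimal attribute sets whose closure is the whole schema."""
    attributes = set(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            s = set(combo)
            if any(k <= s for k in keys):
                continue  # contains a smaller key, so it is not minimal
            if attribute_closure(s, fds) == attributes:
                keys.append(s)
    return keys

def violations_3nf(attributes, fds):
    """List (X, A) pairs where X -> A breaks all three 3NF conditions."""
    prime = set().union(*candidate_keys(attributes, fds))
    bad = []
    for lhs, rhs in fds:
        is_superkey = attribute_closure(lhs, fds) == set(attributes)
        for a in sorted(set(rhs) - set(lhs)):  # skip the trivial part
            if not is_superkey and a not in prime:
                bad.append((frozenset(lhs), a))
    return bad

building_attrs = {"BuildingID", "Contractor", "Fee"}
building_fds = [({"BuildingID"}, {"Contractor"}),
                ({"Contractor"}, {"Fee"}),
                ({"BuildingID"}, {"Fee"})]
```

For the building scheme, the only violation reported is Contractor → Fee: Contractor is not a superkey and Fee is not part of any key.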
• Most 3NF relations are also BCNF relations.
• A 3NF relation can fail to be in BCNF only if all of the following hold:
– The candidate keys in the relation are composite keys (they are not single attributes),
– There is more than one candidate key in the relation, and
– The keys are not disjoint, that is, some attributes in the keys are common.
Boyce-Codd Normal Form (BCNF)
• Let R be a relation schema, F be the set of FDs given to hold over R, X be a subset of the attributes of R, and A be an attribute of R. R is in Boyce-Codd normal form if, for every FD X → A in F, one of the following statements is true:
– A ∈ X; that is, it is a trivial FD, or
– X is a superkey.
• The difference between 3NF and BCNF is that 3NF allows a FD X → Y to remain in the relation if X is a superkey or Y is a prime attribute. BCNF only
allows this FD if X is a superkey.
• Thus, BCNF is more restrictive than 3NF.
BCNF versus 3NF
• We can decompose to BCNF but sometimes we do not want to if we lose a FD.
• The decision to use 3NF or BCNF depends on the amount of redundancy we are willing to accept and the willingness to lose a functional dependency.
• Note that we can always preserve the lossless-join property (recovery) with a BCNF decomposition, but we do not always get dependency preservation.
An example of not having dependency preservation with BCNF:
Scheme {City, Street, ZipCode}
1. Key1 {City, Street}
2. Key2 {ZipCode, Street}
3. No non-key attribute, hence 3NF
4. {City, Street} → {ZipCode}
5. {ZipCode} → {City}
6. Dependency between attributes belonging to a key
New Scheme1 {ZipCode, Street}
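The BCNF condition for this scheme can be checked with the same closure idea: an FD violates BCNF when it is nontrivial and its left side is not a superkey. The following sketch uses hypothetical function names and encodes each FD as a pair of attribute tuples.

```python
def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def bcnf_violations(attributes, fds):
    """Return the nontrivial FDs in fds whose left side is not a superkey."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs)
        and attribute_closure(lhs, fds) != set(attributes)
    ]

address_attrs = {"City", "Street", "ZipCode"}
address_fds = [(("City", "Street"), ("ZipCode",)),
               (("ZipCode",), ("City",))]
```

Here `bcnf_violations(address_attrs, address_fds)` returns only {ZipCode} → {City}: ZipCode alone does not determine Street, so it is not a superkey, even though the scheme is in 3NF.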
• Consider the relation schema LOTS1A shown in the figure, which describes land for sale in various countries. Suppose that there are two candidate keys: PROPERTY_ID# and {COUNTRY_NAME, LOT#}; that is, lot numbers are unique only within each country, but PROPERTY_ID numbers are unique across countries.
• Suppose that we have thousands of lots in the relation but the lots are from only two countries: Nepal and Sri Lanka.
• Suppose also that lot sizes in Nepal are only 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0 acres, whereas lot sizes in Sri Lanka are restricted to 1.1, 1.2, ... , 1.9, and 2.0 acres.
• In such a situation we would have the additional functional dependency FD3: AREA → COUNTRY_NAME.
• If we add this to the other dependencies, the relation schema LOTS1A is still in 3NF because COUNTRY_NAME is a prime attribute.
• The area of a lot determines the country, as specified by FD3, and this can be represented by 16 tuples in a separate relation R(AREA, COUNTRY_NAME), since there are only 16 possible AREA values. This representation reduces the redundancy of repeating the same information in the thousands of LOTS1A tuples.
This decomposition loses the functional dependency
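Consider a relation R with attributes ABC that is decomposed into AB and BC, with F = {A → B, B → C, C → A}.
The closure of F contains all dependencies in F plus A → C, B → A, and C → B.
Consequently, F_AB also contains B → A, and F_BC contains C → B. Therefore F_AB ∪ F_BC contains A → B, B → C, B → A, and C → B.
The closure of the dependencies in F_AB and F_BC now includes C → A, so the decomposition preserves C → A.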
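The argument above can be replayed in code: project F onto each smaller schema, take the union of the projections, and test whether every original FD follows from that union. This is a brute-force sketch with made-up function names; it enumerates all attribute subsets, so it only suits small schemas, and each letter stands for one attribute.

```python
from itertools import combinations

def attribute_closure(attrs, fds):
    closure = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= closure and not set(rhs) <= closure:
                closure |= set(rhs)
                changed = True
    return closure

def project(fds, sub):
    """Projection of fds onto sub: X -> (X+ within sub) for each nonempty X in sub."""
    sub = set(sub)
    projected = []
    for size in range(1, len(sub) + 1):
        for combo in combinations(sorted(sub), size):
            rhs = (attribute_closure(combo, fds) & sub) - set(combo)
            if rhs:
                projected.append((set(combo), rhs))
    return projected

def preserves_dependencies(fds, parts):
    """True if every FD in fds follows from the union of the projections."""
    union = [fd for part in parts for fd in project(fds, part)]
    return all(set(rhs) <= attribute_closure(lhs, union) for lhs, rhs in fds)

F = [("A", "B"), ("B", "C"), ("C", "A")]
```

`preserves_dependencies(F, ["AB", "BC"])` is true because C → A follows from C → B and B → A. By contrast, with C, S, Z standing for City, Street, ZipCode, `preserves_dependencies([("CS", "Z"), ("Z", "C")], ["ZS", "ZC"])` is false: the FD CS → Z cannot be enforced on either smaller relation.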
Multivalued Dependencies
• Suppose that we have a relation with
attributes course, teacher, and book, which we denote as CTB.
• The meaning of a tuple is that teacher T can teach course C, and book B is a
recommended text for the course.
• There are no FDs; the key is CTB.
• However, the recommended texts for a course are independent of the instructor.
Course      Teacher  Book
Physics101  Green    Mechanics
Physics101  Green    Optics
Physics101  Brown    Mechanics
Physics101  Brown    Optics
Math301     Green    Mechanics
Math301     Green    Vectors
Math301     Green    Geometry
• The schema is in BCNF.
• There is redundancy in the schema.
• The fact that Green can teach Physics101 is recorded once per recommended text for the course.
• Similarly, the fact that Optics is a text for Physics101 is recorded once per potential teacher.
• The redundancy can be eliminated by decomposing CTB into CT and CB.
• The redundancy in this example is due to the constraint that the texts for a course are independent of the instructors, which cannot be expressed in terms of FDs.
• Let R be a relation schema and let X and Y be subsets of the attributes of R. Intuitively, the multivalued dependency (MVD) X →→ Y is said to hold over R if, in every legal instance r of R, each X value is associated with a set of Y values and this set is independent of the values in the other attributes.
• Formally, if the MVD X →→ Y holds over R and Z = R − XY, the following must be true for every legal instance r of R: if t1 ∈ r, t2 ∈ r, and t1.X = t2.X, then there must be some t3 ∈ r such that t1.XY = t3.XY and t2.Z = t3.Z.
• If we are given the first two tuples and told that the MVD X →→ Y holds over this relation, we can infer that the relation instance must also contain the third tuple.
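The formal MVD condition translates directly into a check over one instance: for every pair t1, t2 that agrees on X, a tuple combining t1's XY values with t2's Z values must be present. The sketch below uses a hypothetical function name and the CTB instance from the course, teacher, and book example.

```python
def satisfies_mvd(rows, schema, x, y):
    """Check the MVD x ->> y on one instance; schema lists the column names in order."""
    x_idx = [schema.index(a) for a in x]
    xy_idx = [i for i, a in enumerate(schema) if a in x or a in y]
    z_idx = [i for i, a in enumerate(schema) if a not in x and a not in y]

    def pick(row, idx):
        return tuple(row[i] for i in idx)

    present = {(pick(r, xy_idx), pick(r, z_idx)) for r in rows}
    for t1 in rows:
        for t2 in rows:
            if pick(t1, x_idx) == pick(t2, x_idx):
                # Some t3 with t3.XY = t1.XY and t3.Z = t2.Z must exist.
                if (pick(t1, xy_idx), pick(t2, z_idx)) not in present:
                    return False
    return True

schema = ["course", "teacher", "book"]
ctb = [
    ("Physics101", "Green", "Mechanics"),
    ("Physics101", "Green", "Optics"),
    ("Physics101", "Brown", "Mechanics"),
    ("Physics101", "Brown", "Optics"),
    ("Math301", "Green", "Mechanics"),
    ("Math301", "Green", "Vectors"),
    ("Math301", "Green", "Geometry"),
]
```

On this instance `satisfies_mvd(ctb, schema, ["course"], ["teacher"])` is true, while the check for teacher →→ book fails because Green is paired with Optics only for Physics101, not for Math301.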
Fourth Normal Form
• Fourth Normal Form (4NF) is a direct generalization of BCNF. Let R be a relation schema, X and Y be nonempty subsets of the attributes of R, and F be a set of dependencies that includes both FDs and MVDs. R is said to be in Fourth Normal Form (4NF) if, for every MVD X →→ Y that holds over R, one of the following statements is true:
• Y ⊆ X or XY = R, or
• X is a superkey.
• The relation CTB is not in 4NF because C →→ T is a nontrivial MVD and C is not a key.
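Decomposing CTB into CT and CB removes the redundancy without losing information. The sketch below illustrates this on the example instance; the helper names are made up, and the join recovers the original instance only because C →→ T holds on it.

```python
ctb = {
    ("Physics101", "Green", "Mechanics"),
    ("Physics101", "Green", "Optics"),
    ("Physics101", "Brown", "Mechanics"),
    ("Physics101", "Brown", "Optics"),
    ("Math301", "Green", "Mechanics"),
    ("Math301", "Green", "Vectors"),
    ("Math301", "Green", "Geometry"),
}

def decompose_ctb(rows):
    """Project CTB onto CT (course, teacher) and CB (course, book)."""
    ct = {(c, t) for c, t, _ in rows}
    cb = {(c, b) for c, _, b in rows}
    return ct, cb

def join_ct_cb(ct, cb):
    """Natural join of CT and CB on the course attribute."""
    return {(c, t, b) for c, t in ct for c2, b in cb if c == c2}

ct, cb = decompose_ctb(ctb)
```

CT holds 3 tuples and CB holds 5, each fact is stored exactly once, and `join_ct_cb(ct, cb)` gives back the 7 original tuples.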