Normalization is an important part of database design because that process can reduce or eliminate the potential for insert, update, or delete anomalies.
Within a database environment, normalization also helps maintain integrity by preventing the update anomalies and data inconsistencies that can result from duplicated data.
To provide examples of these anomalies and to see how we can resolve them, let’s consider some data in a real estate agency scenario shown in Figure 2.1.
We consider this data to be in list form because all of the data is contained within one list, and a particular row contains the data for a property, the realtor assigned to that property, and the realtor office. For brevity and to focus on the concepts, we only include some pertinent data items and not an exhaustive list of data that one would actually see in a more detailed listing scenario.
FIGURE 2.1 Data in list form for a real estate agency.
Each row in this table (or relation, using the relational database model) represents the listing of a property for sale. For each row, we can obtain some basic information about the property itself (address, number of bedrooms, area or size of the interior, and price). Each row also yields information about the realtor (first name, office address and city location, and phone number for the realtor).
In the relational database model, the relation for this data set can be described with the structure or definition given in Figure 2.2.
Listing (RealtorName, OfficeAdr, OfficeCity, Phone, PropAdr, PropCity, NBeds, Area, Price)
FIGURE 2.2 Original relation structure or definition of Listing.
In this definition, Listing is the name of the relation, the relation attributes are comma-separated within parentheses, and the attribute(s) that define the primary key are underlined. Here the primary key of Listing is the set of attributes (Phone, PropAdr, PropCity), so that, given a set of values for each of those attributes, we can derive a unique row or property listing and its listing realtor. For example, given the values (“555-1111,” “997 George,” “CityA”), we can uniquely derive the information in the third row of the Listing table.
While this relation can be used to store and retrieve data about a particular property and its realtor, this relation does present database integrity concerns.
Because certain data is duplicated (such as RealtorName, OfficeAdr, PropAdr, and Price, to name a few), updates can introduce modification anomalies and inconsistencies. As an example, suppose realtor Penny updates her office address from “137 Main” to “417 Main.” Unless that address change is applied to every occurrence of the previous address, the previous address will coexist in the database along with the new address. This coexistence of different values for the same data is a data inconsistency, where a certain data retrieval operation may yield one address, but another data retrieval operation may yield a different address for the same realtor. To see how this may happen, suppose that OfficeAdr is changed to “417 Main” for only the second row of the relation (the row with a PropAdr value of “1565 State Rd”), as shown in Figure 2.3. A subsequent data retrieval with primary key values of (“555-1111,” “1565 State Rd,” “CityB”) for (Phone, PropAdr, PropCity) yields the new OfficeAdr value of “417 Main.” But a data retrieval with primary key values (“555-1111,” “17 Highland,” “CityA”) yields the previous OfficeAdr value of “137 Main.”
RealtorName  OfficeAdr  OfficeCity  Phone     PropAdr        PropCity  NBeds  Area  Price
Penny        137 Main   CityA       555-1111  17 Highland    CityA     3      2000  220000
Penny        417 Main   CityA       555-1111  1565 State Rd  CityB     4      2900  290000
Penny        137 Main   CityA       555-1111  997 George     CityA     4      2200  240000
Penny        137 Main   CityA       555-1111  123 Big Lane   CityA     8      5000  750000
Bob          455 Oak    CityB       555-2222  5 Lighthouse   CityB     4      2000  230000
Bob          455 Oak    CityB       555-2222  190 Brown      CityC     2      1700  140000
Bob          455 Oak    CityB       555-2222  123 Big Lane   CityA     8      5000  719000
FIGURE 2.3 Real estate data with two data inconsistencies.
Another example of a data inconsistency exists with the “123 Big Lane” property. Suppose Bob reduces the price of that property to $719,000, also as shown in Figure 2.3. Because that property is listed with multiple realtors, a change to that property’s data (such as a price change) must be applied to each realtor’s listing; otherwise the old price and new price will coexist in the database, resulting in another data inconsistency. The only way we can resolve such database integrity issues is either to ensure that all duplicate occurrences of a data value are updated (which may be tricky to accomplish) or, better, to normalize the relations that hold the data.
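The two inconsistencies described above can be reproduced with a short sketch using Python's built-in sqlite3 module. The rows are an assumption: Figure 2.1 is not reproduced in this text, so they are reconstructed from Figures 2.3 and 2.8 as they would appear before any change. The update statements deliberately touch only one row each, as in Figure 2.3.

```python
import sqlite3

# Assumed starting data, reconstructed from Figures 2.3 and 2.8.
ROWS = [
    ("Penny", "137 Main", "CityA", "555-1111", "17 Highland", "CityA", 3, 2000, 220000),
    ("Penny", "137 Main", "CityA", "555-1111", "1565 State Rd", "CityB", 4, 2900, 290000),
    ("Penny", "137 Main", "CityA", "555-1111", "997 George", "CityA", 4, 2200, 240000),
    ("Penny", "137 Main", "CityA", "555-1111", "123 Big Lane", "CityA", 8, 5000, 750000),
    ("Bob", "455 Oak", "CityB", "555-2222", "5 Lighthouse", "CityB", 4, 2000, 230000),
    ("Bob", "455 Oak", "CityB", "555-2222", "190 Brown", "CityC", 2, 1700, 140000),
    ("Bob", "455 Oak", "CityB", "555-2222", "123 Big Lane", "CityA", 8, 5000, 750000),
]

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Listing (
        RealtorName TEXT, OfficeAdr TEXT, OfficeCity TEXT, Phone TEXT,
        PropAdr TEXT, PropCity TEXT, NBeds INTEGER, Area INTEGER, Price INTEGER,
        PRIMARY KEY (Phone, PropAdr, PropCity)
    )
""")
conn.executemany("INSERT INTO Listing VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", ROWS)

# Incomplete update: the new office address reaches only one of Penny's rows.
conn.execute(
    "UPDATE Listing SET OfficeAdr = '417 Main' "
    "WHERE Phone = '555-1111' AND PropAdr = '1565 State Rd' AND PropCity = 'CityB'"
)
# Bob lowers the price on his own listing only.
conn.execute(
    "UPDATE Listing SET Price = 719000 "
    "WHERE Phone = '555-2222' AND PropAdr = '123 Big Lane' AND PropCity = 'CityA'"
)

# Both the old and the new value now coexist for the same realtor and property.
office_addresses = sorted(
    row[0] for row in conn.execute(
        "SELECT DISTINCT OfficeAdr FROM Listing WHERE Phone = '555-1111'"
    )
)
big_lane_prices = sorted(
    row[0] for row in conn.execute(
        "SELECT DISTINCT Price FROM Listing "
        "WHERE PropAdr = '123 Big Lane' AND PropCity = 'CityA'"
    )
)
print(office_addresses)  # ['137 Main', '417 Main']
print(big_lane_prices)   # [719000, 750000]
```

Each query returns two values where a consistent database would return one, which is the inconsistency the text describes.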
To help eliminate duplication and the possibility of data inconsistencies, we can apply database normalization techniques that split tables as necessary to create new tables and new relationships between those tables. The normalization technique we demonstrate is based on the use of functional dependencies to achieve Boyce-Codd Normal Form (BCNF), which can eliminate most forms of duplication.
The concept of a functional dependency is a way to represent which attributes can derive a unique set of other attributes. Given a set of values for the attributes on the left side of a functional dependency, we can derive a unique set of values for the attributes on the right side of that functional dependency.
Because the primary key (Phone, PropAdr, PropCity) can derive a unique row in Listing, the functional dependency that expresses this is given in Figure 2.4.
(Phone, PropAdr, PropCity) -> (RealtorName, OfficeAdr, OfficeCity, NBeds, Area, Price)
FIGURE 2.4 Functional dependency for original Listing table based on the primary key.
Here, given a set of values for (Phone, PropAdr, PropCity), we can derive a unique set of values for (RealtorName, OfficeAdr, OfficeCity, NBeds, Area, Price), which in this case derives a unique row in the Listing relation.
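As a rough illustration of what "derive a unique set of values" means, the sketch below checks whether a functional dependency is violated by a set of rows. The helper name fd_holds is hypothetical, and the rows are an assumption reconstructed from Figures 2.3 and 2.8 (the unmodified data). Note that such a check can only refute a dependency: passing on sample data does not prove the dependency holds in general, since a functional dependency is a statement about the domain rather than about one data set.

```python
COLUMNS = ["RealtorName", "OfficeAdr", "OfficeCity", "Phone",
           "PropAdr", "PropCity", "NBeds", "Area", "Price"]

# Assumed data, reconstructed from Figures 2.3 and 2.8 (before any changes).
ROWS = [
    ("Penny", "137 Main", "CityA", "555-1111", "17 Highland", "CityA", 3, 2000, 220000),
    ("Penny", "137 Main", "CityA", "555-1111", "1565 State Rd", "CityB", 4, 2900, 290000),
    ("Penny", "137 Main", "CityA", "555-1111", "997 George", "CityA", 4, 2200, 240000),
    ("Penny", "137 Main", "CityA", "555-1111", "123 Big Lane", "CityA", 8, 5000, 750000),
    ("Bob", "455 Oak", "CityB", "555-2222", "5 Lighthouse", "CityB", 4, 2000, 230000),
    ("Bob", "455 Oak", "CityB", "555-2222", "190 Brown", "CityC", 2, 1700, 140000),
    ("Bob", "455 Oak", "CityB", "555-2222", "123 Big Lane", "CityA", 8, 5000, 750000),
]
rows = [dict(zip(COLUMNS, values)) for values in ROWS]


def fd_holds(rows, lhs, rhs):
    """Return True if rows that agree on lhs always agree on rhs (hypothetical helper)."""
    seen = {}
    for row in rows:
        left = tuple(row[a] for a in lhs)
        right = tuple(row[a] for a in rhs)
        # Remember the first right-side value seen for each left-side value;
        # any later disagreement violates the dependency.
        if seen.setdefault(left, right) != right:
            return False
    return True


key = ["Phone", "PropAdr", "PropCity"]
rest = ["RealtorName", "OfficeAdr", "OfficeCity", "NBeds", "Area", "Price"]

print(fd_holds(rows, key, rest))                # True: the dependency of Figure 2.4
print(fd_holds(rows, ["PropCity"], ["Price"]))  # False: CityA has several prices
```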
The process of normalization (for our purposes to BCNF) is based on functional dependencies and candidate keys. Like a primary key, a candidate key consists of one or more attributes that can derive a unique row in a table.
A table can have only one primary key but can actually have multiple candidate keys. If a table has only one candidate key, that key becomes the table’s primary key. On the other hand, if a table has multiple candidate keys, one is chosen to be the table’s primary key, and the others exist as alternate retrieval keys that can still derive a unique row.
To achieve BCNF normalization, for each functional dependency, the attributes that compose the left side of that functional dependency must be a primary or candidate key to some table. If that is not the case, then we split a table into two tables to fulfill that goal. The specific table to split and how we split that table is based on the left and right sides of that functional dependency.
To illustrate the concept of BCNF normalization and see how it can reduce data duplication and resolve the potential of data inconsistencies, let’s consider the relational form of the Listing table and the functional dependencies of those attributes. We already established the table definition for Listing in Figure 2.2 and the functional dependency based on its primary key in Figure 2.4.
Suppose we note that the functional dependencies given in Figure 2.5 also exist. Combined with the functional dependency given in Figure 2.4, we are now considering three functional dependencies in total.
Phone -> (RealtorName, OfficeAdr, OfficeCity)
(PropAdr, PropCity) -> (NBeds, Area, Price)
FIGURE 2.5 Other functional dependencies for original Listing table.
Note that we do not need to consider the actual table data for this normalization task, although the data may be considered in deriving the functional dependencies themselves. However, we will consider the table data by removing duplicate rows that may appear after splitting tables, as we will soon see.
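Because the key question can be answered from the functional dependencies alone, one common way to test whether a set of attributes derives every attribute of a relation is to compute its attribute closure. This technique is not named in the text above; it is offered here only as a hedged sketch, and the function names closure and derives_all are hypothetical.

```python
ALL_ATTRS = {"RealtorName", "OfficeAdr", "OfficeCity", "Phone",
             "PropAdr", "PropCity", "NBeds", "Area", "Price"}

# The three functional dependencies of Figures 2.4 and 2.5 as (left, right) pairs.
FDS = [
    ({"Phone", "PropAdr", "PropCity"},
     {"RealtorName", "OfficeAdr", "OfficeCity", "NBeds", "Area", "Price"}),
    ({"Phone"}, {"RealtorName", "OfficeAdr", "OfficeCity"}),
    ({"PropAdr", "PropCity"}, {"NBeds", "Area", "Price"}),
]


def closure(attrs, fds):
    """Return every attribute derivable from attrs using the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already have the whole left side, we gain the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def derives_all(attrs, fds, all_attrs):
    """True if attrs determines every attribute, i.e. it can identify a unique row."""
    return closure(attrs, fds) == set(all_attrs)


print(derives_all({"Phone", "PropAdr", "PropCity"}, FDS, ALL_ATTRS))  # True
print(derives_all({"Phone"}, FDS, ALL_ATTRS))                         # False
print(sorted(closure({"Phone"}, FDS)))
# ['OfficeAdr', 'OfficeCity', 'Phone', 'RealtorName']
```

Under these dependencies, Phone alone reaches only the realtor attributes, which is consistent with the observation that Phone by itself is not a key of Listing.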
The process of normalization to BCNF is shown in Figure 2.6. The main idea is to examine every functional dependency and determine whether to split a table. If a table is to be split, steps 1, 2, and 3 define how to split the table based on the functional dependency. The last step (3b) establishes a relationship between the table that was split and the new table.
For each functional dependency F:
    If the left side of F itself is not a candidate key to some table then
        Step 1: Create a new table T that consists only of the attributes in both sides of F.
        Step 2: Set the primary key of T as the attributes in F’s left side.
        Step 3: In the existing table E that already had the attributes of F:
            a. Remove the attributes of F’s right side that are not part of E’s primary key.
            b. Set the attributes in F’s left side to be a foreign key into T.
FIGURE 2.6 Normalization process to BCNF.
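The process in Figure 2.6 can be sketched in Python on table definitions alone. This is a simplified illustration under stated assumptions, not a general BCNF decomposition algorithm: each table records only its primary key (no other candidate keys), the name of each new table is supplied by the caller, and the dictionary layout and the function name normalize are hypothetical.

```python
def normalize(tables, fds):
    """Apply the steps of Figure 2.6 to table definitions (simplified sketch).

    tables: {name: {"attrs": [...], "pk": (...), "fks": {}}}
    fds:    list of (new_table_name, left_side, right_side)
    """
    for new_name, lhs, rhs in fds:
        # Skip F when its left side is already the key of some table.
        if any(set(t["pk"]) == set(lhs) for t in tables.values()):
            continue
        # E: the existing table that already holds every attribute of F.
        e = next(t for t in tables.values()
                 if set(lhs) | set(rhs) <= set(t["attrs"]))
        # Steps 1 and 2: new table T with F's attributes, keyed on F's left side.
        t_new = {"attrs": list(lhs) + list(rhs), "pk": tuple(lhs), "fks": {}}
        # Step 3a: drop F's right-side attributes from E unless part of E's key.
        e["attrs"] = [a for a in e["attrs"] if a not in rhs or a in e["pk"]]
        # Step 3b: F's left side in E becomes a foreign key into T.
        e["fks"][tuple(lhs)] = new_name
        tables[new_name] = t_new
    return tables


tables = {
    "Listing": {
        "attrs": ["RealtorName", "OfficeAdr", "OfficeCity", "Phone",
                  "PropAdr", "PropCity", "NBeds", "Area", "Price"],
        "pk": ("Phone", "PropAdr", "PropCity"),
        "fks": {},
    }
}
fds = [
    (None, ("Phone", "PropAdr", "PropCity"),
     ("RealtorName", "OfficeAdr", "OfficeCity", "NBeds", "Area", "Price")),
    ("Realtor", ("Phone",), ("RealtorName", "OfficeAdr", "OfficeCity")),
    ("Property", ("PropAdr", "PropCity"), ("NBeds", "Area", "Price")),
]

result = normalize(tables, fds)
for name, table in result.items():
    print(name, table["attrs"], "PK:", table["pk"], "FKs:", table["fks"])
```

Run on the Listing definition and the three functional dependencies of Figures 2.4 and 2.5, the sketch leaves Listing with only (Phone, PropAdr, PropCity) and creates one table keyed on Phone and one keyed on (PropAdr, PropCity).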
Looking at the first functional dependency, shown in Figure 2.4, we confirm whether the left side (Phone, PropAdr, PropCity) is itself a primary or candidate key to some table. Because that is the primary key of Listing, we do not need to go any further with that first functional dependency and proceed with the second.
Looking at the second functional dependency, the first in Figure 2.5, we confirm whether the left side, Phone, is itself a candidate key to some table. Note that while Phone is part of the primary key in Listing, Phone by itself is not a primary key or candidate key, so we proceed with the three steps to split Listing.
In step 1, we first create a new table that consists only of the attributes Phone, RealtorName, OfficeAdr, and OfficeCity. Let’s call this new table Realtor. In step 2, we then set Phone to be the primary key of Realtor. Finally, for step 3, in the Listing table we remove RealtorName, OfficeAdr, and OfficeCity. We leave Phone in Listing but set Phone as a foreign key in Listing that relates to the primary key Phone in Realtor. After being split because of the second functional dependency, the result is the two tables defined in Figure 2.7.
Listing (Phone, PropAdr, PropCity, NBeds, Area, Price)
Realtor (RealtorName, OfficeAdr, OfficeCity, Phone)
FIGURE 2.7 Original Listing table split into two tables.
Note that in the Listing table, Phone is both underlined and italicized. This means that Phone is part of the primary key, and is itself a foreign key into Realtor. Figure 2.8 shows how the split tables appear with their data at this time.
Because, by definition in the relational database model, a table does not have duplicate rows (that is, rows with the same value in each corresponding column), we remove any duplicate rows that may exist after splitting a table.
In this example, after the split the Realtor table contains five duplicate rows (seven rows, only two of which are distinct), and after removing them the Realtor table is reduced to that shown in Figure 2.9.
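Splitting a table and then removing duplicate rows amounts to a projection onto the new table's attributes. The sketch below is a minimal illustration; the helper name project is hypothetical, and the rows are an assumption reconstructed from Figures 2.3 and 2.8 (the unmodified data).

```python
COLUMNS = ["RealtorName", "OfficeAdr", "OfficeCity", "Phone",
           "PropAdr", "PropCity", "NBeds", "Area", "Price"]

# Assumed data, reconstructed from Figures 2.3 and 2.8 (before any changes).
ROWS = [
    ("Penny", "137 Main", "CityA", "555-1111", "17 Highland", "CityA", 3, 2000, 220000),
    ("Penny", "137 Main", "CityA", "555-1111", "1565 State Rd", "CityB", 4, 2900, 290000),
    ("Penny", "137 Main", "CityA", "555-1111", "997 George", "CityA", 4, 2200, 240000),
    ("Penny", "137 Main", "CityA", "555-1111", "123 Big Lane", "CityA", 8, 5000, 750000),
    ("Bob", "455 Oak", "CityB", "555-2222", "5 Lighthouse", "CityB", 4, 2000, 230000),
    ("Bob", "455 Oak", "CityB", "555-2222", "190 Brown", "CityC", 2, 1700, 140000),
    ("Bob", "455 Oak", "CityB", "555-2222", "123 Big Lane", "CityA", 8, 5000, 750000),
]
rows = [dict(zip(COLUMNS, values)) for values in ROWS]


def project(rows, attrs):
    """Keep only attrs from each row and drop duplicates, preserving first-seen order."""
    projected = (tuple(row[a] for a in attrs) for row in rows)
    # dict.fromkeys keeps the first occurrence of each tuple and discards repeats.
    return list(dict.fromkeys(projected))


realtor = project(rows, ["RealtorName", "OfficeAdr", "OfficeCity", "Phone"])
duplicates_removed = len(rows) - len(realtor)

for row in realtor:
    print(row)
print(duplicates_removed)  # 5
```

The seven original rows collapse to two distinct realtor rows, matching the contents of Figure 2.9.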
Now on to the last functional dependency. We confirm whether the left side (PropAdr, PropCity) is itself a primary or candidate key to some table. It is not, so we likewise proceed with the three steps to split the Listing table. In step 1, we first create a new table that consists only of the attributes PropAdr, PropCity, NBeds, Area and Price. Let’s call this new table Property. In step 2, we then set (PropAdr, PropCity) to be the primary key of Property. Finally, for step 3, in the Listing table we remove NBeds, Area, and Price. We leave (PropAdr, PropCity) in Listing but set it as a foreign key in Listing that relates to Property. The result is now the following three tables, whose definitions are given in Figure 2.10.
Listing (Phone, PropAdr, PropCity)
Realtor (RealtorName, OfficeAdr, OfficeCity, Phone)
Property (PropAdr, PropCity, NBeds, Area, Price)
FIGURE 2.10 Resulting table definitions after normalization of real estate data.
The tables with their data are shown in Figure 2.11.
Listing
Phone     PropAdr        PropCity  NBeds  Area  Price
555-1111  17 Highland    CityA     3      2000  220000
555-1111  1565 State Rd  CityB     4      2900  290000
555-1111  997 George     CityA     4      2200  240000
555-1111  123 Big Lane   CityA     8      5000  750000
555-2222  5 Lighthouse   CityB     4      2000  230000
555-2222  190 Brown      CityC     2      1700  140000
555-2222  123 Big Lane   CityA     8      5000  750000
Realtor
RealtorName  OfficeAdr  OfficeCity  Phone
Penny        137 Main   CityA       555-1111
Penny        137 Main   CityA       555-1111
Penny        137 Main   CityA       555-1111
Penny        137 Main   CityA       555-1111
Bob          455 Oak    CityB       555-2222
Bob          455 Oak    CityB       555-2222
Bob          455 Oak    CityB       555-2222
FIGURE 2.8 Split tables with their data.
Realtor
RealtorName  OfficeAdr  OfficeCity  Phone
Penny        137 Main   CityA       555-1111
Bob          455 Oak    CityB       555-2222
FIGURE 2.9 Realtor table with duplicate rows eliminated.
Because we have analyzed all the stated functional dependencies and the left side of each is now a candidate key to some table, we have established these tables to be in BCNF. Hence, duplicated data has been reduced, and the corresponding data inconsistencies have been resolved. For example, if Penny now updates her office address from “137 Main” to “417 Main,” that involves changing the data in exactly one location (in the Realtor table). The office address is not duplicated anywhere else, so there is now no possibility of a data inconsistency involving the new address and previous address. In a similar manner, if we were to now change the price of the “123 Big Lane” property, that change is applied to exactly one location (in the Property table) and there is now no possible data inconsistency involving the new price and previous price.
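The single-location updates described above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the column types and the SQL declarations are illustrative choices, and the Property and Listing rows are derived from the data in Figure 2.8 rather than copied from Figure 2.11, which is not reproduced in this text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.executescript("""
    CREATE TABLE Realtor (
        RealtorName TEXT, OfficeAdr TEXT, OfficeCity TEXT,
        Phone TEXT PRIMARY KEY
    );
    CREATE TABLE Property (
        PropAdr TEXT, PropCity TEXT, NBeds INTEGER, Area INTEGER, Price INTEGER,
        PRIMARY KEY (PropAdr, PropCity)
    );
    CREATE TABLE Listing (
        Phone TEXT REFERENCES Realtor (Phone),
        PropAdr TEXT, PropCity TEXT,
        PRIMARY KEY (Phone, PropAdr, PropCity),
        FOREIGN KEY (PropAdr, PropCity) REFERENCES Property (PropAdr, PropCity)
    );
""")
conn.executemany("INSERT INTO Realtor VALUES (?, ?, ?, ?)", [
    ("Penny", "137 Main", "CityA", "555-1111"),
    ("Bob", "455 Oak", "CityB", "555-2222"),
])
conn.executemany("INSERT INTO Property VALUES (?, ?, ?, ?, ?)", [
    ("17 Highland", "CityA", 3, 2000, 220000),
    ("1565 State Rd", "CityB", 4, 2900, 290000),
    ("997 George", "CityA", 4, 2200, 240000),
    ("123 Big Lane", "CityA", 8, 5000, 750000),
    ("5 Lighthouse", "CityB", 4, 2000, 230000),
    ("190 Brown", "CityC", 2, 1700, 140000),
])
conn.executemany("INSERT INTO Listing VALUES (?, ?, ?)", [
    ("555-1111", "17 Highland", "CityA"),
    ("555-1111", "1565 State Rd", "CityB"),
    ("555-1111", "997 George", "CityA"),
    ("555-1111", "123 Big Lane", "CityA"),
    ("555-2222", "5 Lighthouse", "CityB"),
    ("555-2222", "190 Brown", "CityC"),
    ("555-2222", "123 Big Lane", "CityA"),
])

# Each change now touches exactly one row.
address_update = conn.execute(
    "UPDATE Realtor SET OfficeAdr = '417 Main' WHERE Phone = '555-1111'"
)
price_update = conn.execute(
    "UPDATE Property SET Price = 719000 "
    "WHERE PropAdr = '123 Big Lane' AND PropCity = 'CityA'"
)
rows_changed = (address_update.rowcount, price_update.rowcount)

# Every listing sees the same, current values through the joins.
penny_addresses = sorted({
    row[0] for row in conn.execute(
        "SELECT r.OfficeAdr FROM Listing l JOIN Realtor r ON r.Phone = l.Phone "
        "WHERE l.Phone = '555-1111'"
    )
})
big_lane = sorted(conn.execute(
    "SELECT l.Phone, p.Price FROM Listing l "
    "JOIN Property p ON p.PropAdr = l.PropAdr AND p.PropCity = l.PropCity "
    "WHERE l.PropAdr = '123 Big Lane' AND l.PropCity = 'CityA'"
).fetchall())

print(rows_changed)     # (1, 1)
print(penny_addresses)  # ['417 Main']
print(big_lane)         # [('555-1111', 719000), ('555-2222', 719000)]
```

Both realtors who list “123 Big Lane” see the same price because the price is stored once in Property and reached through the foreign key, rather than copied into each listing.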