Anyone with a modest to advanced background in SQL can benefit from the insights it contains. SQL and Relational Theory: How to Write Accurate SQL Code and related trademarks are trademarks of O'Reilly Media, Inc.
And the thesis of this book is that using SQL relationally is the discipline you need. So, if you don't want to hang yourself, you need to understand relational theory (what it is and why); you need to know about SQL's departures from that theory; and you need to know how to avoid the problems they can cause.
Second, as advertised, I will talk about theory, but it is an article of faith for me that theory is practical. Moreover, that theory is not only practical, but also fundamental, straightforward, simple, useful, and can be fun (as I hope to demonstrate in the course of this book).
The class was taught by Toon Koppelaars and was based on the book he co-wrote with Lex de Haan (see Appendix G in this book) and it was also very good. First, design theory as such has never really had much to do with the main message of the book, anyway; second, the appendix was becoming so extensive that it threatened to overwhelm the rest of the text.
Chapter 1
You need to know what you need to know (if you know what I mean); in particular, you should make sure that you have the necessary prerequisites to understand the material that will come in later chapters. So if you're a database professional, you should know the relational model, because the relational model is the foundation (or a large part of the foundation, anyway) of the database field in particular.
Throughout this book I use the term operator; so for example I refer to “=” (equality comparison), “:=” (assignment), “+” (addition), DISTINCT, JOIN, SUM, GROUP BY (etc., etc.) all as operators specifically. Most of the SQL features discussed in this book were present in SQL:1992, and often in earlier versions.
For example, suppose you know Oracle; indeed, suppose you are an expert in Oracle. But if you know the underlying principles - in other words, if you know the relational model - then you have knowledge and skills that will be transferable: knowledge and skills that you will be able to apply in any environment and that will never become obsolete.
To begin with (and this point is crucial!), every relation has at least one candidate key.4 A candidate key is simply a unique identifier; in other words, it is a combination of attributes, often but not always a "combination" consisting of just one attribute, such that every tuple in the relation has a unique value for that combination. 4 Strictly speaking, this sentence should read "Each relvar has at least one candidate key" (see section "Relations vs.
"target key"); so, for example, the departments-and-employees database would violate the referential integrity rule if it included an EMP tuple with a DNO value of D2, say no DEPT tuple with the same DNO value. So the referential integrity rule simply spells out the semantics of foreign keys; the name "referential integrity" comes from the fact that a foreign key value can be considered a reference to the tuple with the same value for the corresponding target key.
Now any number of operators can be defined that fit the simple definition of "one or more relations in, exactly one relation out." Here I will briefly describe what are usually considered the original operators (essentially those defined by Codd in his earliest papers);8 I will provide more detail in Chapter 6, and in Chapter 7 I will describe a number of additional operators. Note: This operator is also known variously as Cartesian product (sometimes more specifically extended or expanded Cartesian product), cross product, cross join, and Cartesian join; in fact, it's really just a special case of join, as we'll see in Chapter 6.
Returns a relation containing all possible tuples that are a combination of two tuples, one from each of the two specified relations, such that the two tuples contributing to any given result tuple have a common value for the common attributes of the two relations (and that common value appears just once, not twice, in that result tuple). The algebra and the calculus are equivalent and interchangeable, in the sense that for every algebraic expression there is a logically equivalent calculus expression and vice versa.
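The join definition given above can be illustrated with a minimal Python sketch. It assumes relations are modeled as sets of frozensets of (attribute, value) pairs; the function name `natural_join` and the sample DEPT and EMP data are hypothetical.

```python
def tup(**attrs):
    """Model a tuple as a frozenset of (attribute, value) pairs (an assumption)."""
    return frozenset(attrs.items())

def natural_join(r, s):
    """Join two relations, each given as a set of frozenset tuples."""
    result = set()
    for t1 in r:
        for t2 in s:
            d1, d2 = dict(t1), dict(t2)
            common = d1.keys() & d2.keys()
            # Keep the pair only if the tuples agree on every common attribute.
            if all(d1[a] == d2[a] for a in common):
                # Merging the dicts keeps each common attribute once, not twice.
                result.add(frozenset({**d1, **d2}.items()))
    return result

# Hypothetical sample data.
DEPT = {tup(DNO="D1", DNAME="Marketing"), tup(DNO="D2", DNAME="Development")}
EMP = {tup(ENO="E1", DNO="D1"), tup(ENO="E2", DNO="D1"), tup(ENO="E3", DNO="D2")}

joined = natural_join(DEPT, EMP)
```

When the two relations have no attribute names in common, the agreement test is vacuously satisfied and the same function returns the Cartesian product, which is one way to see why product is a special case of join.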
However, they do not need to know how the relations are physically stored in the computer, or how individual data values are physically encoded, or what indexes or other physical access paths exist; all of that is part of the implementation, not part of the model. However, they do not need to know how joins are physically implemented, or what expression transformations take place under the covers, or what indexes or other physical access paths are used, or what I/O operations occur; this is all part of the implementation, not part of the model.
Physical data independence (not a great term, by the way, but we seem to be stuck with it) means that we have the freedom to make changes in the way the data is physically stored and accessed without having to make corresponding changes in the way the data is perceived by the user. In other words, a data model in the second sense is simply a logical, and perhaps somewhat abstract, database design.
For example, the projection of the suppliers relation of Fig. 1.3 on attribute CITY produces the result shown on the left below and not the one on the right. (Note in particular that the table on the right has no double underline; this is because it has no key, and therefore no primary key a fortiori.)
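The duplicate elimination that projection performs can be sketched in a few lines of Python. The relation is modeled as a set of frozensets of (attribute, value) pairs; the helper names and the sample supplier values are assumptions made for illustration.

```python
def tup(**attrs):
    """Model a tuple as a frozenset of (attribute, value) pairs (an assumption)."""
    return frozenset(attrs.items())

def project(r, attrs):
    """Project relation r on the given attribute names."""
    # Building a set discards duplicate tuples automatically.
    return {frozenset((a, v) for a, v in t if a in attrs) for t in r}

# Illustrative sample values for the suppliers relation.
S = {
    tup(SNO="S1", SNAME="Smith", STATUS=20, CITY="London"),
    tup(SNO="S2", SNAME="Jones", STATUS=10, CITY="Paris"),
    tup(SNO="S3", SNAME="Blake", STATUS=30, CITY="Paris"),
    tup(SNO="S4", SNAME="Clark", STATUS=20, CITY="London"),
    tup(SNO="S5", SNAME="Adams", STATUS=30, CITY="Athens"),
}

cities = project(S, {"CITY"})
```

Five supplier tuples go in, but the result holds one tuple per distinct city, because the result is itself a relation and so cannot contain duplicates.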
It is that simple representation on paper that makes relational systems easy to use and understand, and makes it easy to reason about the way such systems behave. Unfortunately, it is also the case that the simple representation in question suggests some things that are not true (for example, that there is a top-to-bottom order in the tuples).
In particular, it categorically does not say that base relations are physically stored and views are not. The only requirement is that there must be some mapping between whatever is physically stored and those base relations, so that those base relations can somehow be obtained when needed (conceptually, anyway).
That figure shows three relation values: namely, the relation values that happen to exist in the database at some particular time. However, if we were to look at the database at a different time, we would likely see three different relation values appearing in their place.
Indeed, different representations of the same value can appear at any number of different locations in time and space, meaning, loosely, that any number of different variables (see the definition that follows) can have the same value, at the same time or different times. Furthermore, variables, unlike values, can be updated; that is, the current value of the variable can be replaced by another value. (In fact, that is what "variable" means: to be a variable is to be updatable, and to be updatable is to be a variable; equivalently, to be a variable is to be assignable to, and to be assignable to is to be a variable.)
What are some of the specific points of difference between such pictures and the corresponding relations? 1.9 (Try this exercise without looking back at the body of the chapter.) What relations does the suppliers-and-parts database contain?
These questions are deliberately not answered in the body of the chapter, and this exercise can best serve as a basis for group discussion.
To drive a car, you don't need to know what's going on under the hood - all you need to know is how to steer, how to shift gears, etc. It's true that you might drive better if you have some understanding of what goes on under the hood, but you don't need to know.
It is truly disturbing, in the relational context above all others (where precision of thought and articulation was always a key goal), to find such horribly sloppy wording. It is an all too common mistake to think that the relational model is all about structure and to forget about the operators.
Note the need for the preliminary DELETE; also note that, loosely speaking, anything could happen between that DELETE and the subsequent INSERT, while in the relational case there is no sense that there is anything “between the DELETE and the INSERT” (the assignment is a semantically atomic operation). As for the question “Can all relational assignments be expressed in terms of INSERT and/or DELETE and/or UPDATE?”, the answer is yes (although we don't need UPDATE as such at all).
Chapter 2
For example, throughout this book I will assume that attribute STATUS of the suppliers relvar S is defined to be of type INTEGER. Under that assumption, any relation that is a possible value for relvar S must also have a STATUS attribute of type INTEGER.
I will start with the relational model as originally defined by Codd; so until further notice I will use the term domain, not type. Similarly, I will assume that the part number (PNO) attributes in the P and SP relvars are also of user-defined type (or domain) with the same name, PNO.
Clearly, the expressions P.WEIGHT = SP.QTY and P.WEIGHT - SP.QTY = 0 mean essentially the same thing. Note that the comparison P.WEIGHT = SP.QTY falls into this special category, but the comparison P.WEIGHT - SP.QTY = 0 does not.
Note: Implicit type conversion, as shown in the examples above, is often called coercion in the literature. Continuing with the example: another operator we need to define when we define a type like SNO or PNO is what is generally called the THE_ operator, which - in this case -.
Relation R1 in that figure is a reduced version of the shipments relation from our running example; it shows that certain suppliers supply certain parts, and it contains a tuple for each relevant {SNO,PNO} combination. Now suppose we replace R1 with R2, which shows that certain suppliers supply certain groups of parts (the attribute PNO in R2 is what some authors would call multivalued, and the values of this attribute are groups of part numbers).
More precisely, it does not support what would be its analogue of RVAs, table-valued columns. Oddly enough, however, it does support columns whose values are arrays, and columns whose values are rows, and even columns whose values are "multisets of rows" - where a multiset, also known as a bag, is like a set, except that it allows duplicates.9 Columns whose values are multisets of rows thus look a bit like "table-valued columns"; however, they are not table-valued columns, because the values they contain cannot be operated on by regular SQL table operators and thus are not regular SQL table values, by definition.
The fact that parameters in particular are declared to be of some type touches on an issue I mentioned but haven't properly discussed yet: namely, the fact that each type has a set of operators associated with it for operating on values and variables of that type, where to say that the operator Op is "associated with" type T essentially just means that the operator Op has a parameter of type T.11 For example, integers have the usual arithmetic operators; dates and times have special calendar arithmetic operators; XML documents have so-called "XPath" and "XQuery" operators; relations have the operators of the relational algebra; and every type has the assignment (":=") and equality comparison ("=") operators. Note that, by definition, values and variables of a given type T can only be manipulated using the operators associated with that type T.
The value to be returned is a point and is obtained by invoking the POINT selector; this call has two arguments corresponding to the X and Y coordinates of the point to be returned. 12 This paragraph incidentally touches on another important logical distinction: that between arguments and parameters (see Exercise 2.5 at the end of the chapter).
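The distinction between arguments and parameters, and the role of a selector, can be shown with a small Python sketch. Everything here is an assumption for illustration: the class `Point` stands in for a user-defined POINT type, calling `Point(x, y)` plays the part of the POINT selector, and `reflect` is a hypothetical operator that negates both coordinates.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Point:
    """Calling Point(x, y) plays the role of the POINT selector."""
    x: float
    y: float

def reflect(p: Point) -> Point:  # p is the parameter, declared to be of type Point
    """Return the point obtained by negating both coordinates of p."""
    return Point(-p.x, -p.y)     # a selector call with two arguments

# Point(1.0, 2.5) is the argument substituted for parameter p in this call.
q = reflect(Point(1.0, 2.5))
```

The parameter `p` exists once, in the definition; a different argument can be supplied on every invocation.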
A generated type is a type obtained by invoking a type generator (in the example, the type generator is specifically RELATION). You can think of a type generator as a special kind of operator; it is special because (a) it returns a type instead of a value, and (b) it is called at compile time instead of runtime.
Note that BINARY does not mean binary numbers, it means bit string (or perhaps more precisely byte string, since the associated length specification gives the corresponding length in octets).16 16 True bit string types - BIT(n) and BIT VARYING(n), where n was the length in bits - were introduced in SQL:1992 but dropped again in SQL:2003.
You can use them if you want, but don't make the mistake of thinking they are real relational domains (i.e., types). It is now a generally accepted principle in computer science that coercions are best avoided because they are error-prone.
Note: Throughout this book, I use the term row expression to mean either a row subquery or a row selector call (where row selector is my preferred term for what SQL calls a row value constructor; see Chapter 3); in other words, I use row expression to mean any expression that denotes a row, just as I use table expression to mean any expression that denotes a table. Finally, SQL also uses the term constraint in a very special sense in relation to character strings.
It is important to note, however, that such expressions are not allowed in integrity constraints (see Chapter 8), presumably because they could cause unpredictable success or failure of updates. The expression TUPLE {..} here, as you recall, is an invocation of the TUPLE type generator.
Rather, relations can have attributes of any type whatsoever (except as noted in a moment); the relational model nowhere prescribes exactly what those types should be, and in fact they can be as complex as you like (they can even be relation types). In other words, the question of what types are supported is orthogonal to the question of support for the relational model itself.
What is the significance of the fact that relvar P (for example) is of a certain relation type? For those that are, state the type of the result; for the others, give an expression that will achieve what appears to be the desired effect.
The difference is most immediately apparent in the fact that SQL supports both the CREATE TYPE statement and the CREATE DOMAIN statement. To a first approximation, CREATE TYPE can be thought of as the SQL counterpart to the TYPE statement from Tutorial D, which I'll discuss in Chapter 8 (although there are many, many differences between the two, not all of which are trivial).
A system-defined (or built-in) type is one that is available for use once the system is installed (it “comes in the same box the system comes in”). 26 A much more detailed discussion of the logical difference between foreign keys and references can be found in the article.
The reason, as I'm sure you know (and as was actually mentioned in Chapter 1), is that (a) we can extend it to apply to relvars as well as relations, and then (b) we can define a series of "higher" normal forms for relvars that turn out to be important in database design. It used to be defined to mean that each tuple had to contain a single "atomic" value at each attribute position.
Thus, all that the operation of "defining a type" (the TYPE statement, in the case of Tutorial D; see Chapter 8) really does is introduce a name by which that set of values can be referenced. Similarly, dropping a type doesn't actually drop the corresponding values, just the name introduced by the corresponding "define type" operation.
Chapter 3
As we saw in Chapter 1, there is a logical difference between a thing and a picture of a thing, and that difference can be very important. For example, tuples have no left-to-right ordering to their attributes, and so the following is an equally good (bad?) picture of the same tuple:
Obviously, this picture represents a set, so the order in which the attributes are displayed is arbitrary. Therefore, like all values, it must be returned by some sort of selector call (obviously a tuple selector call if the value is a tuple).
It follows that the empty heading is a valid heading, and thus that a tuple with an empty set of components is a valid tuple (although drawing pictures of such a tuple on paper is a bit difficult, and I'm not even going to try). But it's worth explaining the semantics of tuple equality in detail, because so much in the relational model depends on it.
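Tuple equality, the absence of attribute ordering, and the empty tuple can all be illustrated with a short Python sketch. It assumes a tuple is modeled as a frozenset of (attribute, value) pairs; this captures attribute names and values but not attribute types, so it is a simplification. The helper name `tup` is hypothetical.

```python
def tup(**attrs):
    """Model a tuple as a set of (attribute, value) pairs; sets have no ordering."""
    return frozenset(attrs.items())

t1 = tup(SNO="S1", SNAME="Smith", STATUS=20, CITY="London")
t2 = tup(CITY="London", STATUS=20, SNO="S1", SNAME="Smith")  # same pairs, other order
t3 = tup(SNO="S1", SNAME="Smith", STATUS=30, CITY="London")

same = (t1 == t2)       # equal: same attributes, same values
different = (t1 == t3)  # not equal: the STATUS values differ
empty = tup()           # the tuple with an empty set of components
```

Two tuples are equal exactly when they have the same attributes and, attribute by attribute, the same values; the order in which the components are written plays no part.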
Note carefully in the expanded form of this example that the two individual comparisons in the WHERE clause are ORed, not ANDed. But it is appropriate to include at least this brief mention of SQL row assignment here.
H is the relation heading (or simply the heading) for r, and the degree and attributes of H and the cardinality of B are the degree, attributes, and cardinality of r, respectively. In short, each tuple in a relation represents an n-ary relationship, in the ordinary natural language sense of that term, interrelating a set of n values (one such value for each tuple attribute); the complete set of tuples in a given relation represents the complete set of such relationships existing at a given time; and mathematically speaking, that set of tuples is a relation.
Note that, for any given relation type, there is exactly one empty relation of that type, but empty relations of different types are not the same, precisely because they are of different types. Tutorial D uses syntax of the form TUPLE FROM rx for this purpose, where rx is any expression denoting a relation of cardinality one. For example, this could be the expression RELATION {TUPLE {SNO 'S1', PNO 'P1', QTY 300}}, which is essentially a relation selector call (in fact, it's a literal).
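A rough Python analogue of TUPLE FROM is sketched below. It is not Tutorial D; the function name `tuple_from` and the modeling of a relation as a set of frozensets of (attribute, value) pairs are assumptions for illustration.

```python
def tuple_from(r):
    """Extract the single tuple from a relation of cardinality one."""
    if len(r) != 1:
        raise ValueError("TUPLE FROM requires a relation of cardinality one")
    (t,) = r  # unpack the one and only element of the set
    return t

# Analogue of RELATION {TUPLE {SNO 'S1', PNO 'P1', QTY 300}}
rx = {frozenset({("SNO", "S1"), ("PNO", "P1"), ("QTY", 300)})}

t = tuple_from(rx)
qty = dict(t)["QTY"]
```

The cardinality check matters: applied to an empty relation, or to one with two or more tuples, the extraction has no sensible result and the sketch raises an error instead.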
In other words, the comparison (which is a Boolean expression) means: “The set of supplier cities is equal to the set of part cities” (and of course it returns TRUE or FALSE). 8 In other words, the expression IS_EMPTY(r) is logically equivalent to both of the following: (a) r = r WHERE FALSE; (b) r{}.
Now I explained in Chapter 2 that SQL actually has nothing analogous to the concept of a relation type; instead, an SQL table is just a collection of rows, where (a) the rows are of a particular row type and (b) the collection is (in general) a bag, and not necessarily a set. It follows that SQL doesn't really have anything analogous to the RELATION type generator either, although as we know from Chapter 2 it does support other type generators including ROW, ARRAY, and MULTISET.
Believe it or not, the operator that turns a table into a bag of rows is called TABLE. 11 Moreover, the standard does not guarantee that a single column in each of these two bags of rows resulting from the two TABLE calls in the example has a prescribed column name (see Note 12, which follows).
The only case where it is impossible to follow the preceding recommendation is when two columns in the same table both represent the same kind of information. As a result, some column renaming will sometimes have to be done, as in the following join example (note the specification "ENO AS MNO" in the third line): SELECT ENO , MNO FROM EMP ) AS temp1 NATURAL JOIN.
3.5 (This is essentially a repeat of Exercise 1.8 from Chapter 1, but you should now be able to give a more detailed answer.) There are many differences between a relation and a table. Give an example of (a) a tuple with a tuple valued attribute (TVA), (b) a tuple with a relation valued attribute (RVA).
Note the lack of column names (or rather field names, to use the official SQL term) and the reliance on left-to-right order in these SQL statements. By the way, the fact that there are no parentheses surrounding the list of row value constructor calls is not an error.
The intended meaning is: Course CNO can be taught by every teacher TNO in TEACHER (and no other teachers) and uses every textbook XNO in TEXT (and no other textbooks). As for a relation with an RVA such that there is no relation without an RVA that represents exactly the same information, a simple example can be obtained from Figure 1.
3.9 (Note: You may want to come back and look at this answer again after you read Chapter 10.) We need the concept of relations in general before we can have the concept of relations of degree zero in particular. The second represents a single-row SQL table, where that row consists of four “fields”.
Chapter 4
By definition, the topics in question are SQL topics, not relational ones; therefore, in this chapter I will use SQL terminology rather than that of the relational model (at least for the most part). A detailed analysis of this entire problem, including design aspects, can be found in the paper "Double Trouble, Double Trouble" (see Appendix G).
The objective in question is explicitness; that is, the meaning of the data in the database should be as clear and unambiguous as possible (since databases are supposed to be suitable for sharing among a wide variety of different users and applications). In fact, duplicates can be considered to violate one of the most fundamental principles of the relational model: i.e., The Information Principle (discussed in Appendix A).
So the first point to note is that the twelve different formulations produce nine different results: different, that is, in the degree of duplication. By the way, I do not claim that the twelve different formulations and the nine different results are the only possible ones; indeed, they generally are not.
Well, here are the results produced by the same twelve formulations against the revised version of the database in Fig. Here are two possible formulations of the query "Get supplier numbers for suppliers who supply at least one part" on our usual suppliers and parts database (and note that this time the input tables contain absolutely no duplicates):
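To make the point concrete, the sketch below runs two formulations of that query through SQLite from Python. The two SELECT statements are plausible formulations chosen for this illustration and are not necessarily the ones the text has in mind; the table contents are likewise hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE S  (SNO TEXT PRIMARY KEY);
    CREATE TABLE SP (SNO TEXT, PNO TEXT, PRIMARY KEY (SNO, PNO));
    INSERT INTO S  VALUES ('S1'), ('S2'), ('S3');
    INSERT INTO SP VALUES ('S1','P1'), ('S1','P2'), ('S2','P1');
""")

# Formulation 1: membership test; each qualifying supplier appears once.
q1 = con.execute(
    "SELECT SNO FROM S WHERE SNO IN (SELECT SNO FROM SP) ORDER BY SNO"
).fetchall()

# Formulation 2: join; a supplier appears once per shipment.
q2 = con.execute(
    "SELECT SNO FROM S NATURAL JOIN SP ORDER BY SNO"
).fetchall()

# Adding DISTINCT makes the second formulation agree with the first.
q3 = con.execute(
    "SELECT DISTINCT SNO FROM S NATURAL JOIN SP ORDER BY SNO"
).fetchall()
```

Neither input table contains duplicates, yet the second formulation returns supplier S1 twice; only with DISTINCT do the two formulations give the same result.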
One wrote: "Those who really know SQL well will be shocked at the thought of coding SELECT DISTINCT by default." Well, I would like to politely suggest that (a) those who are. 5 The implication is that SELECT DISTINCT may take longer to execute than SELECT ALL, even if that DISTINCT is effectively a "no op". Well, it may be so; I do not want to labor the point; I will simply note that the reason for these.
The essential point I'm trying to make is that certain Boolean expressions - and thus certain queries in particular - can produce results that are correct in terms of three-valued logic, but not correct in the real world. Note carefully that the shading in this image where the CITY value should be for part P1 means nothing; there is conceptually nothing—not even a string of spaces or the empty string—in that place (which means that the "tuple" for part P1 is not really a tuple, a point I'll return to towards the end of this section).
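A query of this kind can be reproduced with SQLite from Python. The sketch below is an illustration constructed for this purpose, with hypothetical table contents: supplier S1 is in London and the city for part P1 is null.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE S (SNO TEXT, CITY TEXT);
    CREATE TABLE P (PNO TEXT, CITY TEXT);
    INSERT INTO S VALUES ('S1', 'London');
    INSERT INTO P VALUES ('P1', NULL);      -- the city for part P1 is null
""")

rows = con.execute("""
    SELECT S.SNO, P.PNO
    FROM   S, P
    WHERE  S.CITY <> P.CITY
    OR     P.CITY <> 'Paris'
""").fetchall()
```

In the real world, part P1 is either in Paris or it is not. If it is, its city differs from London; if it is not, the second comparison holds. Either way the pair (S1, P1) belongs in the answer. Under three-valued logic, however, both comparisons evaluate to UNKNOWN, so the WHERE clause does not evaluate to TRUE and the row is not returned.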
To all of the above, I can't resist adding that although SQL supports 3VL, and although it does support the UNKNOWN keyword, this keyword does not denote a value of type BOOLEAN, unlike the TRUE and FALSE keywords, in SQL. This is just one of the many shortcomings in SQL 3VL support; there are many, many others, but most of them are beyond the scope of this book.
For similar reasons, do not use the MATCH option on foreign key constraints, and do not use IS [NOT] DISTINCT FROM. Note: If you are not familiar with COALESCE, let me briefly elaborate on the last of these recommendations.
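For readers who have not met COALESCE, here is a small sketch using SQLite from Python. COALESCE returns its first non-null argument; the table contents and the replacement string 'City unknown' are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE S (SNO TEXT PRIMARY KEY, CITY TEXT);
    INSERT INTO S VALUES ('S1', 'London'), ('S2', NULL);
""")

# COALESCE replaces the null with a non-null value in the result.
rows = con.execute(
    "SELECT SNO, COALESCE(CITY, 'City unknown') AS CITY FROM S ORDER BY SNO"
).fetchall()
```

The point of the technique is that the result delivered to the user contains no nulls, even though the underlying table does.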
And even in the simple character string case, an argument can be made that the result misrepresents the semantics of the situation (does 'zero' really represent a partial number?). Nulls and 3VL are supposed to be a solution to the "missing information" problem - but I believe I've shown that, to the extent that they can be considered a "solution" at all, it's a disastrously bad one.
Note that (a) cursor X allows updates, (b) the table visible through cursor X allows duplicates, but (c) the underlying table SP does not (allow duplicates, that is). Then, in general, it is not possible to tell which particular row of table SP was deleted by this operation.
11 If {SNO,PNO} is the primary key for shipments, then the columns SP.SNO and SP.PNO cannot allow null values without violating the entity integrity rule. Incidentally, the very fact that the entity integrity rule is supposed to apply only to primary keys and not to keys in general seems to me to be an additional reason to view this rule with suspicion.
A detailed analysis of this whole issue can be found in the paper "Double Trouble, Double Trouble" (see Appendix G). The paper "Theory of Bags: An Investigative Tutorial" (see Appendix G) goes into detail about such issues; here let me just say that if we adopt the SQL definitions then the law certainly does not apply.
Speaking for myself, therefore, no, I don't think nulls "occur naturally in the real world". 14 Note that the dyadic tables are presented here in a slightly different style than that used in the body of the chapter.
To clarify the point: It is very natural to assume that expressions that are tautologies in 2VL are also tautologies in 3VL, but this is not necessarily the case. Because in SQL, believe it or not, two nulls do not "compare equal" for join, but they do "compare equal" for intersection.
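Both observations can be checked against a real SQL implementation. The sketch below uses SQLite from Python with two hypothetical one-column tables, each containing a single null; other SQL products are expected to behave the same way on these points, but only SQLite is exercised here.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE A (X INTEGER);
    CREATE TABLE B (X INTEGER);
    INSERT INTO A VALUES (NULL);
    INSERT INTO B VALUES (NULL);
""")

# Join: the comparison A.X = B.X is UNKNOWN for two nulls, so no row qualifies.
joined = con.execute("SELECT A.X FROM A JOIN B ON A.X = B.X").fetchall()

# Intersection: the two nulls are treated as matching, so one row is returned.
intersected = con.execute("SELECT X FROM A INTERSECT SELECT X FROM B").fetchall()

# "p OR NOT p" is a tautology in 2VL, but here it is UNKNOWN, not TRUE.
tautology_hits = con.execute(
    "SELECT COUNT(*) FROM A WHERE X = X OR NOT (X = X)"
).fetchone()[0]
```

The last query shows the first point: an expression that can never fail in two-valued logic fails to select the row when X is null.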
“[They] can be used interchangeably to mean exactly the same thing.” But of course null does not always mean "the third truth value". In fact, the keyword NULL cannot usually be used in place of the keyword UNKNOWN, even if UNKNOWN is the intended meaning (see c. and f. in the answer to the last part of the exercise below).
Chapter 5
Now it's time to take a closer look at that difference; more specifically, it is time to take a closer look at issues that are particularly relevant to relvars as opposed to relations. Warning: Unfortunately, you may find the SQL parts of the discussion that follows a bit confusing, because SQL does not clearly distinguish between the two concepts - as you know, it uses the same term, table, to mean sometimes a table value and sometimes a table variable.
INSERT inserts a set of tuples into the target relvar; DELETE deletes a set of tuples from the target relvar; and UPDATE updates a set of tuples in the target relvar. All that talk really means is that the set of tuples we're updating has cardinality one.
Now, it is easy to see that this particular assignment is logically equivalent to the following DELETE statement: Alternatively, we can say it's shorthand for this (either way, it amounts to the same thing):
It follows that an attempt to use I_DELETE to delete a tuple that does not exist - more generally, an attempt to use I_DELETE when the relation denoted by rx is not wholly included in the relation denoted by R - will fail. Note: Now that I have introduced D_INSERT and I_DELETE, please understand that discussions elsewhere in this book that refer to INSERT and DELETE operations in Tutorial D should, for simplicity, be taken as applying to D_INSERT and I_DELETE operations as well, wherever the sense demands it.
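The behavior of D_INSERT and I_DELETE can be sketched with Python sets. This is an analogue, not Tutorial D: a relvar is modeled as a Python variable holding a set, tuples are simplified to plain Python tuples, and the function names `d_insert` and `i_delete` are assumptions for this sketch. Each update is expressed as an assignment of a new relation value to the variable.

```python
def d_insert(target, rx):
    """Disjoint INSERT: fail if any tuple of rx is already in target."""
    if target & rx:
        raise ValueError("D_INSERT: some tuples already exist in the target")
    return target | rx

def i_delete(target, rx):
    """Included DELETE: fail unless every tuple of rx is in target."""
    if not rx <= target:
        raise ValueError("I_DELETE: some tuples do not exist in the target")
    return target - rx

# Hypothetical shipments, written as (SNO, PNO, QTY).
SP = {("S1", "P1", 300), ("S1", "P2", 200)}

SP = d_insert(SP, {("S2", "P1", 300)})   # succeeds: the new tuple is not present
SP = i_delete(SP, {("S1", "P2", 200)})   # succeeds: the tuple is present

try:
    SP = i_delete(SP, {("S9", "P9", 1)})  # fails: no such tuple
    failed = False
except ValueError:
    failed = True
```

Ordinary INSERT and DELETE, by contrast, would quietly accept an already present tuple or a missing one; the stricter forms turn those situations into errors.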
As I'm sure you realize, this is basically the definition of the assignment operation. {SNO,CITY}, which is certainly a subset of the heading of S that has the uniqueness property.
Second, in the case of base relvars in particular, it is common, as mentioned in Chapter 1, to single out one key as the primary key (and any other keys for the relvar in question are then sometimes said to be alternate keys). A given base table can have any number of UNIQUE specifications, but at most one PRIMARY KEY specification.
As an aside, I note that the relational model as originally formulated required that foreign keys correspond not just to a particular key, but very specifically to the primary key of the referenced relvar. Answer: The one that is implicit in the process of checking the foreign key constraint. Recall that tuples must certainly be of the same type if they are to be tested for equality, and 'same type' means that they must have the same attributes, and so.
Such operations can be specified as part of either an ON DELETE clause or an ON UPDATE clause. 8 In case you're wondering about the SQL terminology here, ON DELETE CASCADE is a "referential triggered action" and CASCADE itself is a "referential action".
Supplier S1 is under contract, is named Smith, has status 20, and is located in London.
In other words, types give us our vocabulary (the things we can talk about) and relations give us the ability to say things about the things we can talk about. Nothing else, that is, except the things logically implied by the things we can say explicitly.
This is also the reason, again in my opinion, that other data models are simply not up to the same scale. In contrast, in SQL INSERT is defined in terms of UNION ALL and there is nothing like D_INSERT.
A key is a set of attributes and the empty set is a legal set; so we can define an empty key as a key where the pertinent set of attributes is empty. However, at that stage I did not discuss the logical difference between relations and relvars; and in this chapter we saw that keys generally apply to relvars, not relations.
Note that if the row r appears exactly n times in temp (n ≥ 0), it also appears exactly n times in T. Then the effect of the specified DELETE statement is to assign the result of the expression. As for the Tutorial D analogs of the two SQL statements: Well, note first that in order for such analogs to exist, it is necessary to assume that table SS does not allow duplicates, because "duplicate-allowing relvars" are not supported in Tutorial D (in fact, they are a contradiction in terms).
Here is a definition: Let X be a subset of the heading of relvar R; then X is a subkey of R if and only if there exists a key K of R such that X is a subset of K. And X is a proper subkey of R if it is a subkey of R that is not a key of R. A non-trivial functional dependency is one for which the right-hand side is not a subset of the left-hand side.
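The subkey definition can be checked mechanically once the keys of a relvar are known. The Python sketch below assumes the keys are supplied as a set of frozensets of attribute names; the function names and the choice of the shipments relvar SP, with its single key {SNO,PNO}, are illustrative assumptions.

```python
def is_subkey(x, keys):
    """X is a subkey if it is a subset of at least one key."""
    return any(x <= k for k in keys)

def is_proper_subkey(x, keys):
    """X is a proper subkey if it is a subkey but not itself a key."""
    return is_subkey(x, keys) and x not in keys

# Assumed: relvar SP has exactly one key, {SNO, PNO}.
sp_keys = {frozenset({"SNO", "PNO"})}

checks = {
    "{}": is_proper_subkey(frozenset(), sp_keys),
    "{SNO}": is_proper_subkey(frozenset({"SNO"}), sp_keys),
    "{SNO,PNO}": is_proper_subkey(frozenset({"SNO", "PNO"}), sp_keys),
    "{QTY}": is_subkey(frozenset({"QTY"}), sp_keys),
}
```

Note that the empty set of attributes is a subkey of every relvar that has at least one key, since the empty set is a subset of every set.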
FOREIGN KEY ( MAJOR_PNO ) REFERENCES P ( PNO ) ON DELETE CASCADE , FOREIGN KEY ( MINOR_PNO ) REFERENCES P ( PNO ) ON DELETE RESTRICT. Note: In this example, the two foreign keys in table PP both refer to the same key in table P.
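The effect of the two referential actions can be observed with SQLite from Python. The sketch below is an assumption-laden illustration: the column types, the primary key chosen for PP, and the sample rows are all hypothetical, and SQLite enforces foreign keys only after the PRAGMA shown.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when enabled
con.executescript("""
    CREATE TABLE P  (PNO TEXT PRIMARY KEY);
    CREATE TABLE PP (
        MAJOR_PNO TEXT NOT NULL,
        MINOR_PNO TEXT NOT NULL,
        PRIMARY KEY (MAJOR_PNO, MINOR_PNO),
        FOREIGN KEY (MAJOR_PNO) REFERENCES P (PNO) ON DELETE CASCADE,
        FOREIGN KEY (MINOR_PNO) REFERENCES P (PNO) ON DELETE RESTRICT
    );
    INSERT INTO P  VALUES ('P1'), ('P2'), ('P3');
    INSERT INTO PP VALUES ('P1', 'P2'), ('P3', 'P1');
""")

# P2 is referenced as a minor part, so RESTRICT rejects the delete.
try:
    con.execute("DELETE FROM P WHERE PNO = 'P2'")
    restricted = False
except sqlite3.IntegrityError:
    restricted = True

# P3 is referenced only as a major part, so the delete cascades to PP.
con.execute("DELETE FROM P WHERE PNO = 'P3'")
remaining = con.execute("SELECT MAJOR_PNO, MINOR_PNO FROM PP").fetchall()
```

Deleting a part that appears as a minor part is refused outright, while deleting a part that appears only as a major part silently removes the matching PP rows as well.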
But what the predicate is for a particular relvar is in the mind of the definer of that relvar (and also in the user's mind, I believe). As a trivial example, the relation denoted by S{CITY}, the projection of the current value of relvar S onto {CITY}, is not the value of any relvar in the suppliers and parts database.
Chapter 6
This is the first of two chapters on the operators of the relational algebra; it discusses the original operators (that is, those briefly described in Chapter 1) in some depth, and also examines some ancillary but important issues, e.g., the importance of proper attribute (or column) naming. Third, I gave concise descriptions in Chapter 1 of what I there called the "original operators" (restrict, project, product, union, intersection, difference, and join); however, I am now able to define those operators, and others, much more carefully.
Sometimes it is required that the attributes (or rather columns) in question have the same name – and then the correspondence is sometimes recorded explicitly, sometimes implicitly. And regardless of whether the columns in question should have the same name, sometimes those columns should be of the same type, and sometimes they shouldn't.
In contrast, any operator that produces a result that is not a relation is, by definition, not a relational operator.6 For example, any operator that produces an ordered result is not a relational operator (see the discussion of ORDER BY in the next chapter). Obviously, once a given join tuple is formed, the system can immediately test that tuple against the condition PNAME > SNAME (P.PNAME > S.SNAME in the SQL version) to see if it belongs in the final output, discarding it if not.7 Thus, the intermediate result that is the output from the join may never exist as a fully materialized relation in its own right.
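The idea that the join result need never be materialized can be illustrated with Python generators. This is only a sketch of the general technique, not a description of how any particular DBMS works; the function names and the sample data are assumptions, and the join is taken to be on the common attribute CITY.

```python
def join_on_city(suppliers, parts):
    """Yield joined tuples one at a time instead of building the whole join."""
    for s in suppliers:
        for p in parts:
            if s["CITY"] == p["CITY"]:
                yield {**s, **p}

def restrict(tuples, condition):
    """Pass along only the tuples that satisfy the condition."""
    return (t for t in tuples if condition(t))

# Hypothetical sample data.
S = [
    {"SNO": "S1", "SNAME": "Smith", "CITY": "London"},
    {"SNO": "S2", "SNAME": "Jones", "CITY": "Paris"},
]
P = [
    {"PNO": "P1", "PNAME": "Nut", "CITY": "London"},
    {"PNO": "P2", "PNAME": "Bolt", "CITY": "Paris"},
    {"PNO": "P3", "PNAME": "Washer", "CITY": "London"},
]

# Each join tuple is tested against PNAME > SNAME as soon as it is produced.
results = list(restrict(join_on_city(S, P), lambda t: t["PNAME"] > t["SNAME"]))
```

Because both functions are lazy, each joined tuple is created, tested, and either kept or discarded before the next one is formed; the full join exists only conceptually.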
My SQL examples in this chapter and the next (indeed throughout the rest of this book) will all follow this discipline. It's actually pretty hard to explain how references to the names P and S in the WHERE and SELECT clauses (and possibly elsewhere in the parent expression) can even make sense relative to the result of the FROM clause.