Models of Virtual Data Structure
5.2.7 Text
It is not clear how to classify a data element consisting of natural-language text, such as an abstract in a bibliographic record, the text of an article in a full-text newspaper file, or even the response to a questionnaire item that adds, at the end of the list of choices: “Other (specify).”
Text could be considered a scalar string variable of very great length.
While this is not very practical, there are information systems, particularly older ones, in which the text component could be displayed but not searched. Thus, large as it is, the entire text is treated as a single variable or field. More modern systems will allow searching within a text variable for a particular substring of characters, say the occurrence of STEROID within a text assumed to be dealing with athletics.
A text could also be considered an array of words. This means we would ignore the syntax connecting them, but doing so considerably simplifies the mechanics of searching. Most retrieval systems, in effect, do this, by making a separate file (array of tuples) of words that occur in records of the database and the numbers of the records in which they occur. The new file is sorted alphabetically and, as noted in Chapter 1, is called an inverted file. One is illustrated in Fig. 5.4.
It is now quite easy to find where (in which records) a particular word occurred.
This approach does not take any of the syntax of the original text into consideration. Expanding on an earlier example, suppose a text were to include the statement “This text is about bicycles rather than automobiles,” which denies that automobiles is the subject. If we were to search for all records “about” automobiles, by which we literally mean all that contain the word AUTOMOBILES in any context, we would retrieve this text. This is the most commonly used method in commercial database operations today.
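The inverted file described above can be sketched in a few lines of Python. This is a minimal illustration and not the layout of any particular retrieval system: the three records, the function name build_inverted_file, and the rule that a word is any run of letters are all assumptions made for the example.

```python
import re
from collections import defaultdict


def build_inverted_file(records):
    """Map each word to the sorted list of record numbers that contain it."""
    postings = defaultdict(set)
    for record_number, text in records.items():
        # Assumption: a "word" is any run of letters, compared in lower case.
        for word in re.findall(r"[a-z]+", text.lower()):
            postings[word].add(record_number)
    # Sort the entries alphabetically by word, as the inverted file is sorted.
    return {word: sorted(postings[word]) for word in sorted(postings)}


# Hypothetical three-record database, invented for illustration.
records = {
    1: "This text is about bicycles rather than automobiles.",
    2: "Automobiles require regular maintenance.",
    3: "Bicycles are popular in cities.",
}

inverted_file = build_inverted_file(records)
print(inverted_file["automobiles"])  # [1, 2]
print(inverted_file["bicycles"])     # [1, 3]
```

Record 1 is returned for AUTOMOBILES even though its text denies that automobiles are the subject, which is the limitation just described: the lookup records only that the word occurs, not how it is used.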
Finally, we could treat a text as a structure, made up of a series of words with a syntax relating them to each other or to the entity they describe. This would be a great advantage to the searcher, because he or she could then use the meaning conveyed by the combination of syntax and vocabulary. The problem, of course, is that the syntax of natural language is too complex and variable for this purpose. We would have to considerably restrict the syntax or the interpretation of the text to use this method.
Text: Fourscore and seven years ago, our fathers brought forth on this continent a new nation, conceived in liberty and dedicated to the proposition that all men are created equal.
Words and position number in order of appearance: fourscore 1; and 2; seven 3; years 4; ago 5; our 6; fathers 7; brought 8; forth 9; on 10; this 11; continent 12; a 13; new 14; nation 15; conceived 16; in 17; liberty 18; and 19; dedicated 20; to 21; the 22; proposition 23; that 24; all 25; men 26; are 27; created 28; equal 29
Words and position numbers in alphabetic order: a 13; ago 5; all 25; and 2,19; are 27; brought 8; conceived 16; continent 12; created 28; dedicated 20; equal 29; fathers 7; forth 9; fourscore 1; in 17; liberty 18; men 26; nation 15; new 14; on 10; our 6; proposition 23; seven 3; that 24; the 22; this 11; to 21; years 4
Figure 5.4
An inverted file: shown are a short text, the list of words in order of occurrence, with the sequential word number appended, and the same word list sorted into alphabetic order.
The occurrence order of a word within a file enables a user to search for the phrase new nation rather than merely new and nation occurring anywhere with respect to each other because the location of the words can be seen to be adjacent and in the desired order.
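The phrase search mentioned in the caption of Figure 5.4 can be sketched as follows, using the same sentence as the figure. This is only an illustrative sketch: the function names build_positional_index and find_phrase are hypothetical, and words are assumed to be runs of letters compared in lower case.

```python
import re
from collections import defaultdict

TEXT = (
    "Fourscore and seven years ago, our fathers brought forth on this "
    "continent a new nation, conceived in liberty and dedicated to the "
    "proposition that all men are created equal."
)


def build_positional_index(text):
    """Map each word to the list of positions (counted from 1) where it occurs."""
    index = defaultdict(list)
    words = re.findall(r"[a-z]+", text.lower())
    for position, word in enumerate(words, start=1):
        index[word].append(position)
    # Alphabetic order, as in the right-hand list of Figure 5.4.
    return dict(sorted(index.items()))


def find_phrase(index, phrase):
    """Return the starting positions at which the words of phrase occur
    adjacent to each other and in the given order."""
    words = phrase.lower().split()
    if not words:
        return []
    return [
        start
        for start in index.get(words[0], [])
        if all(
            start + offset in index.get(word, [])
            for offset, word in enumerate(words[1:], start=1)
        )
    ]


index = build_positional_index(TEXT)
print(index["and"])                      # [2, 19]
print(find_phrase(index, "new nation"))  # [14]
print(find_phrase(index, "nation new"))  # []
```

Both words occur in the text in either query, but only the first finds them adjacent and in the requested order.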
The uncertainty about how to search text, and how to interpret it if we do search it, represents the principal difference between database management systems such as Paradox or Access and information retrieval systems such as Alta Vista, LEXIS, or MEDLINE. The latter three systems use a great deal of text. Their design is based on two assumptions. The first is that users will not know exactly how to phrase a question because they cannot know how words were used in every text to be searched; hence, they will not know exactly how to phrase a particular topic. The second assumption is that the users will not know how much information on the desired topic is in the file; therefore, they do not know how much is likely to be found, possibly more or less than desired. Probing, analysis, and revision are necessary. By contrast, searching for the number of red sedans on hand at an automobile agency means using a precise syntax and vocabulary, knowing that this information will be explicitly represented in the file and once only, whether the inventory level is 0 or 100.
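The red-sedan query can be made concrete with a small sketch using Python's built-in sqlite3 module. The table name inventory, its columns, and the stock figures are all hypothetical, chosen only to show what "precise syntax and vocabulary" looks like in practice.

```python
import sqlite3

# In-memory database with a hypothetical inventory table for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory ("
    " body_style TEXT, color TEXT, on_hand INTEGER,"
    " PRIMARY KEY (body_style, color))"
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [("sedan", "red", 4), ("sedan", "blue", 0), ("coupe", "red", 2)],
)

# The vocabulary is fixed in advance: 'sedan' and 'red' must match exactly.
row = conn.execute(
    "SELECT on_hand FROM inventory WHERE body_style = ? AND color = ?",
    ("sedan", "red"),
).fetchone()
red_sedans = row[0]
print(red_sedans)  # 4
```

The primary key on (body_style, color) is one way to guarantee that each combination is represented once only, so the query has exactly one answer and no probing or revision is needed.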
5.3 Common Structural Models
A data model, in our usage, is primarily a mental construct. It is descriptive of the structure of data, not necessarily its meaning, although the two concepts are not entirely separable because both involve the relationship of information elements to each other. We present here a review of the four major data models currently in use. They represent data structures as visualized by a user, who may be a person querying a database, or an applications programmer writing a program to process data in a database. Both must understand the logical structure and the relationships among data elements. They do not necessarily need to know the physical structure used inside the computer, so long as programs exist that can access data in accordance with the structures these users visualize. The values of the data elements and the rules governing them are also separate, but not entirely so. We discuss physical data structure in Chapter 6.
Record location and sequence are almost invariably based upon a key which consists of one or more attribute values that are within the record.
Sorting a file of student records by last name alone would result in many records with the same key. Therefore, a secondary key of first name is almost always used and perhaps even a tertiary key of date of birth to get as close as possible to having a unique key for each record. If there should be more than one student with the same last name, first name, and date of birth, their records would be in random order.
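Sorting on a primary, secondary, and tertiary key can be sketched with a tuple-valued sort key. The Student record type and the sample names and dates below are invented for illustration.

```python
from datetime import date
from typing import NamedTuple


class Student(NamedTuple):
    last_name: str
    first_name: str
    date_of_birth: date


# Hypothetical student records, deliberately out of order.
students = [
    Student("Smith", "John", date(2001, 5, 17)),
    Student("Jones", "Mary", date(2000, 1, 3)),
    Student("Smith", "Ann", date(2002, 9, 30)),
    Student("Smith", "John", date(1999, 11, 8)),
]


def sort_key(student):
    # Primary key: last name; secondary: first name; tertiary: date of birth.
    return (student.last_name, student.first_name, student.date_of_birth)


ordered = sorted(students, key=sort_key)
for student in ordered:
    print(student.last_name, student.first_name, student.date_of_birth)
```

Tuples compare element by element, so the first name is consulted only when last names are equal, and the date of birth only when both names are equal. Python's sort is stable, so two records with identical values for all three keys simply keep their input order, which is arbitrary with respect to the key.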