Text - Models of Virtual Data Structure - Text Information Retrieval Systems

Models of Virtual Data Structure

5.2.7 Text

It is not clear how to classify a data element consisting of natural-language text, such as an abstract in a bibliographic record, the text of an article in a full-text newspaper file, or even the response to a questionnaire item that adds, at the end of the list of choices: “Other (specify).”

Text could be considered a scalar string variable of very great length.

While this is not very practical, there are information systems, particularly older ones, in which the text component could be displayed but not searched. Thus, large as it is, the entire text is treated as a single variable or field. More modern systems will allow searching within a text variable for a particular substring of characters, say the occurrence of STEROID within a text assumed to be dealing with athletics.
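As a minimal sketch of substring searching within a single text field, the following Python function scans a text for every occurrence of a term. The function name `find_substring`, the sample abstract, and the choice to ignore letter case are assumptions made for illustration, not features of any particular retrieval system.

```python
def find_substring(text: str, term: str) -> list[int]:
    """Return the starting offset of every occurrence of term in text.

    Matching ignores letter case (an assumption of this sketch), so a
    query for STEROID also matches "steroid" and "Steroids".
    """
    haystack = text.lower()
    needle = term.lower()
    offsets = []
    start = haystack.find(needle)
    while start != -1:
        offsets.append(start)
        # Resume the scan one character past the previous hit.
        start = haystack.find(needle, start + 1)
    return offsets


# Hypothetical text field from a record assumed to deal with athletics.
abstract = "The panel reviewed steroid testing. Anabolic steroids remain banned."
hits = find_substring(abstract, "STEROID")
print(hits)
```

Note that the whole text is scanned on every query, and that the match is on characters, not words: the term is found inside "steroids" as well.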

A text could also be considered an array of words. This means we would ignore the syntax connecting them, but doing so considerably simplifies the mechanics of searching. Most retrieval systems, in effect, do this, by making a separate file (array of tuples) of words that occur in records of the database and the numbers of the records in which they occur. The new file is sorted alphabetically and, as noted in Chapter 1, is called an inverted file. One is illustrated in Fig. 5.4.

It is now quite easy to find where (in which records) a particular word occurred.

This approach does not take any of the syntax of the original text into consideration. Expanding on an earlier example, suppose a text were to include the statement “This text is about bicycles rather than automobiles,” which denies that automobiles is the subject. If we were to search for all records “about” automobiles, by which we literally mean all that contain the word AUTOMOBILES in any context, we would retrieve this text. Word-based searching of this kind is the most commonly used method in commercial database operations today.
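The inverted file described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the design of any particular system: the three sample records, the function names `build_inverted_file` and `lookup`, and the rule that a word is any run of letters are all hypothetical.

```python
import bisect
import re
from collections import defaultdict


def build_inverted_file(records: dict[int, str]) -> list[tuple[str, list[int]]]:
    """Build an alphabetically sorted array of (word, record numbers) tuples."""
    postings = defaultdict(set)
    for record_number, text in records.items():
        # Assumption: a word is any run of letters; case is ignored.
        for word in re.findall(r"[a-z]+", text.lower()):
            postings[word].add(record_number)
    return [(word, sorted(numbers)) for word, numbers in sorted(postings.items())]


def lookup(inverted_file: list[tuple[str, list[int]]], word: str) -> list[int]:
    """Return the numbers of the records that contain word, via binary search."""
    key = word.lower()
    # The one-element tuple (key,) sorts just before any (key, [...]) entry.
    i = bisect.bisect_left(inverted_file, (key,))
    if i < len(inverted_file) and inverted_file[i][0] == key:
        return inverted_file[i][1]
    return []


# Hypothetical records; record 1 denies that automobiles is its subject.
records = {
    1: "This text is about bicycles rather than automobiles.",
    2: "Automobiles and trucks on hand at the agency.",
    3: "Bicycles for commuting.",
}
inverted_file = build_inverted_file(records)
print(lookup(inverted_file, "AUTOMOBILES"))
```

Record 1 is retrieved along with record 2, because the index records only which words occur, not what the sentence says about them.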

Finally, we could treat a text as a structure, made up of a series of words with a syntax relating them to each other or to the entity they describe. This would be a great advantage to the searcher, because he or she could then use the meaning conveyed by the combination of syntax and vocabulary. The problem, of course, is that the syntax of natural language is too complex and variable for this purpose. We would have to considerably restrict the syntax or the interpretation of the text to use this method.

Text: Fourscore and seven years ago, our fathers brought forth on this continent a new nation, conceived in liberty and dedicated to the proposition that all men are created equal.

Words and position numbers in order of appearance: fourscore 1; and 2; seven 3; years 4; ago 5; our 6; fathers 7; brought 8; forth 9; on 10; this 11; continent 12; a 13; new 14; nation 15; conceived 16; in 17; liberty 18; and 19; dedicated 20; to 21; the 22; proposition 23; that 24; all 25; men 26; are 27; created 28; equal 29.

Words and position numbers in alphabetic order: a 13; ago 5; all 25; and 2, 19; are 27; brought 8; conceived 16; continent 12; created 28; dedicated 20; equal 29; fathers 7; forth 9; fourscore 1; in 17; liberty 18; men 26; nation 15; new 14; on 10; our 6; proposition 23; seven 3; that 24; the 22; this 11; to 21; years 4.

Figure 5.4

An inverted file: shown are a short text, the list of words in order of occurrence, with the sequential word number appended, and the same word list sorted into alphabetic order.

The occurrence order of a word within a file enables a user to search for the phrase new nation rather than merely new and nation occurring anywhere with respect to each other because the location of the words can be seen to be adjacent and in the desired order.
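The phrase search that Fig. 5.4 makes possible can be sketched as follows. The code rebuilds the word-position list for the same short text and then checks that the words of a phrase occupy consecutive positions. The function names `positional_index` and `phrase_positions` are hypothetical, and treating a word as any run of letters is an assumption of this sketch.

```python
import re
from collections import defaultdict

TEXT = (
    "Fourscore and seven years ago, our fathers brought forth on this "
    "continent a new nation, conceived in liberty and dedicated to the "
    "proposition that all men are created equal."
)


def positional_index(text: str) -> dict[str, list[int]]:
    """Map each word to its position numbers, with words in alphabetic order."""
    positions = defaultdict(list)
    words = re.findall(r"[a-z]+", text.lower())
    for number, word in enumerate(words, start=1):
        positions[word].append(number)
    return dict(sorted(positions.items()))


def phrase_positions(index: dict[str, list[int]], phrase: str) -> list[int]:
    """Return each position at which the words of phrase occur consecutively."""
    words = phrase.lower().split()
    if not words:
        return []
    return [
        start
        for start in index.get(words[0], [])
        # Word k of the phrase must sit exactly k positions after the first.
        if all(start + k in index.get(word, []) for k, word in enumerate(words))
    ]


index = positional_index(TEXT)
print(index["and"])
print(phrase_positions(index, "new nation"))
print(phrase_positions(index, "nation new"))
```

The last call finds nothing: both words occur, but not adjacent and in that order.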

The uncertainty about how to search text, and how to interpret it if we do search it, represents the principal difference between database management systems such as Paradox or Access and information retrieval systems such as Alta Vista, LEXIS, or MEDLINE. The latter three systems use a great deal of text. Their design is based on two assumptions. The first is that users will not know exactly how to phrase a question about a particular topic, because they cannot know how words were used in every text to be searched. The second assumption is that the users will not know how much information on the desired topic is in the file; therefore, they do not know how much is likely to be found, possibly more or less than desired. Probing, analysis, and revision are necessary. By contrast, searching for the number of red sedans on hand at an automobile agency means using a precise syntax and vocabulary, knowing that this information will be explicitly represented in the file and once only, whether the inventory level is 0 or 100.
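For contrast, the red-sedan question can be sketched as an exact-match query. The inventory rows and the field names `body_style`, `color`, and `on_hand` are hypothetical; the point is only that the vocabulary is fixed in advance and each combination is represented once, even when the count is 0.

```python
# Hypothetical inventory file: one row per (body style, color) combination.
inventory = [
    {"body_style": "sedan", "color": "red", "on_hand": 0},
    {"body_style": "sedan", "color": "blue", "on_hand": 4},
    {"body_style": "coupe", "color": "red", "on_hand": 2},
]


def on_hand(rows: list[dict], body_style: str, color: str) -> int:
    """Return the stock level for one exact body style and color."""
    matches = [
        row["on_hand"]
        for row in rows
        if row["body_style"] == body_style and row["color"] == color
    ]
    # The combination is assumed to appear once only in the file.
    if len(matches) != 1:
        raise LookupError(f"expected one row for {color} {body_style}")
    return matches[0]


red_sedans = on_hand(inventory, "sedan", "red")
print(red_sedans)
```

No probing or revision is needed here: the query either matches the single row or reveals an error in the file.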

5.3 Common Structural Models

A data model, in our usage, is primarily a mental construct. It is descriptive of the structure of data, not necessarily its meaning, although the two concepts are not entirely separable because both involve the relationship of information elements to each other. We present here a review of the four major data models currently in use. They represent data structures as visualized by a user, who may be a person querying a database, or an applications programmer writing a program to process data in a database. Both must understand the logical structure and the relationships among data elements. They do not necessarily need to know the physical structure used inside the computer, so long as programs exist that can access data in accordance with the structures these users visualize. The values of the data elements and the rules governing them are also separate, but not entirely so. We discuss physical data structure in Chapter 6.

Record location and sequence are almost invariably based upon a key which consists of one or more attribute values that are within the record.

Sorting a file of student records by last name alone would result in many records with the same key. Therefore, a secondary key of first name is almost always used and perhaps even a tertiary key of date of birth to get as close as possible to having a unique key for each record. If there should be more than one student with the same last name, first name, and date of birth, their records would be in random order.
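A minimal sketch of sorting on a primary, secondary, and tertiary key follows. The `Student` record layout and the sample names are hypothetical; the composite key is simply a tuple of the three attribute values, compared left to right.

```python
from datetime import date
from typing import NamedTuple


class Student(NamedTuple):
    last_name: str
    first_name: str
    birth_date: date


def sort_key(student: Student) -> tuple[str, str, date]:
    """Composite key: last name, then first name, then date of birth."""
    return (student.last_name, student.first_name, student.birth_date)


# Hypothetical student file in arbitrary order.
students = [
    Student("Smith", "Maria", date(2003, 1, 9)),
    Student("Smith", "John", date(2004, 11, 2)),
    Student("Nguyen", "Anh", date(2004, 5, 17)),
    Student("Smith", "John", date(2003, 3, 30)),
]

ordered = sorted(students, key=sort_key)
for student in ordered:
    print(student.last_name, student.first_name, student.birth_date)
```

The two records for John Smith are separated only by the tertiary key. Python's `sorted` is stable, so records whose three key values are all equal would keep their input order, which is arbitrary with respect to the key.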
