The Physical Structure of Data
6.4.5 Direct-Access Structures
The single most common disadvantage of all the other methods considered thus far is the time it takes to find where a record with a known key is stored. A direct-access structure uses a family of techniques known as hashing or mapping (Knott, 1975; Knuth, 1998) to transform a key value into an address.
Figure 6.16 shows the basic method. A common method of address computation is to divide the key by a prime number and use the remainder as an address. If the key is alphanumeric, it can be treated as a binary number, or the nonnumeric characters can be converted to digits first and the division then carried out.
For example, if the key were 12,345 and the prime number 31, the division would give a quotient of 398 with a remainder of 7. If the key included a letter, e.g., 1234E, the letter would first be converted into its sequential position number in the alphabet, 5. The 7 is used as the address or relative address. Note that the address computed uses fewer digits than the key, else we would not need the computation. This method can map any set of keys into 31 address locations. In another method, the key is squared and the required number of digits is extracted from the middle of the product. For example, the square of 12,345 is 152,399,025. One might extract digits 4 and 5 (counting from the right), yielding an address of 99. The number of digits used must be consistent with the size of the memory into which the keys are mapped.

Figure 6.12
Deleting a record from a list: to delete the record with key 324, follow the chain of pointers until that key is reached. Set that record's pointer to the next available empty record (3). The first data record value remains 2, but the Next available space, in the directory, is set to 1. It is not necessary to remove the data from record 1; when a new record arrives, the old information will simply be written over. One panel of the figure shows the state of the file after deleting the record in position 1.

Figure 6.13
Multilist structure: there is a main file record for each hospital patient. Each points to the first relevant record in one of the other files which contains specific information about the patient. A patient record in a subordinate file may point to another record about that same patient, and so on. All these records could be physically interfiled in one large file. The number of attributes in use and the number of values occurring for patients may vary considerably from person to person. Legend: H, Hematology File; P, Pharmacy File; D, Dietary File; R, Radiology File.

Figure 6.14
A balanced binary tree: the directory points to the record holding key value 119. Thereafter, each record has two pointers, shown below the record number, one to the next lower valued key, one to the next higher valued key. A horizontal line indicates an end of the tree. Only keys are shown here, not other record content.

Figure 6.15
Two trees combined: a second set of pointers in each record enables the tree to serve to connect records based on two different keys, one shown as numeric, one alphabetic. The numeric pointers have been omitted but would be the same as in Fig. 6.14. The dashed line arrows represent alphabetic pointers. Note that the records are not in the same sequence in the alphabetic tree as the numeric. The record with numeric key 1959 has no lower-valued key. The one with numeric key 84 ends a branch: it has neither lower nor higher next keys, but there are both lower and higher next keys for the alphabetic key BCD in that record.

Figure 6.16
Hashing or address mapping: the key of a record being searched for, or to be stored, is divided by a prime number equal to the amount of addressing space. The remainder is the actual or relative storage location. In the example illustrated, the search key 12345 is divided by 31, giving 398 with a remainder of 7, so the hashed address is 7.
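The two address computations just described can be sketched in a few lines of Python. This is a minimal illustration, not a production hashing routine; the function names, the default divisor of 31, and the choice of digit positions are assumptions taken from the worked example in the text.

```python
def letters_to_digits(key: str) -> int:
    """Replace each letter by its position in the alphabet (A=1 ... Z=26)."""
    parts = []
    for ch in key:
        if ch.isdigit():
            parts.append(ch)
        else:
            parts.append(str(ord(ch.upper()) - ord("A") + 1))
    return int("".join(parts))


def division_hash(key: int, divisor: int = 31) -> int:
    """Division method: the remainder serves as the (relative) address."""
    return key % divisor


def mid_square_hash(key: int, low: int = 4, count: int = 2) -> int:
    """Mid-square method: square the key, then take `count` digits starting
    at digit position `low`, counting from the right (1-based)."""
    squared = key * key
    return (squared // 10 ** (low - 1)) % 10 ** count


print(division_hash(12345))                       # 7
print(division_hash(letters_to_digits("1234E")))  # 7, since 1234E becomes 12345
print(mid_square_hash(12345))                     # digits 4 and 5 of the square
```

Python integers are exact, so the square is computed without rounding; in a language with fixed-width integers the product of two large keys could overflow and would need a wider type.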
In some cases, it would be possible to set aside a location for every possible key value; then the key instantly translates into a location. It could be done with a social security number by setting aside “only” a billion locations (ssn is a 9-digit number). For an employer with 9,000 employees whose records are to be sorted on this key, this is an impractical solution. Instead, this employer would surely prefer to be able to do a simple computation on the ssn and compute a 4-digit address from it, thereby limiting required memory to locations numbered 0 to 9,999. An employer with 100 employees could afford to use a sequential search, simplifying the search program.
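One way the employer's "simple computation" might look is sketched below. The divisor 9973 (the largest prime below 10,000) and the function name are assumptions made for illustration; the text does not prescribe a particular computation.

```python
TABLE_SIZE = 9973  # assumed divisor: largest prime below 10,000


def ssn_address(ssn: str) -> int:
    """Map a 9-digit ssn (hyphens optional) to an address of at most 4 digits."""
    digits = ssn.replace("-", "")
    if len(digits) != 9 or not digits.isdigit():
        raise ValueError("expected a 9-digit ssn")
    return int(digits) % TABLE_SIZE


# A billion possible keys are folded into addresses 0 through 9,972.
print(ssn_address("000-00-9974"))  # 1
```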
Hashing or mapping can be used to decide both where to store a new record and where to find one already stored. It is extremely fast. It uses high-speed computation instead of slow disk access to an index. If used for a file that needs only one search key and has relatively low volatility, it is probably the best of methods. But it has two major drawbacks. First, note that key 76, with a divisor of 13, hashes to 11, key 77 hashes to 12, and 78, to 0. Thus, successively occurring keys (76, 77, 78) will not normally be stored in adjacent locations; hence, in doing a sequential search of the file, it cannot be known where a record identified only as “next” will be stored. We must have the key. Variations have been developed to alleviate this problem (Garg and Gotlieb, 1986).
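The sketch below uses the same computation both to store and to find a record, and shows where the successive keys 76, 77, and 78 land. It is an illustration only: the names `store` and `find` are hypothetical, the file is modeled as an in-memory list, and the possibility of two keys sharing an address is deliberately ignored here.

```python
DIVISOR = 13
slots = [None] * DIVISOR  # one storage location per address 0..12


def address(key: int) -> int:
    return key % DIVISOR


def store(key: int, record: str) -> None:
    # Assumption: no two keys share an address (not true in general).
    slots[address(key)] = (key, record)


def find(key: int):
    entry = slots[address(key)]
    if entry is not None and entry[0] == key:
        return entry[1]
    return None


for k in (76, 77, 78):
    store(k, f"record {k}")

addresses = [address(k) for k in (76, 77, 78)]
print(addresses)  # [11, 12, 0]
print(find(77))   # record 77
```

Key 78 wraps around to location 0, so reading the locations in order does not visit the records in key order.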
The second and usually more important disadvantage of hashing is that more than one key can hash to the same value. Key 76 with a divisor of 13 hashes to 11, as do 89, 167, etc. Since two records cannot be placed at the same location, this attractive method of finding locations seems flawed. The reader is invited to ponder this problem. A solution is discussed in Section 6.6.2.
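The collision problem can be made visible by grouping keys according to the address they hash to. The sketch below only detects collisions; it does not resolve them. The function name is a hypothetical one chosen for illustration.

```python
from collections import defaultdict


def colliding_keys(keys, divisor):
    """Return {address: [keys]} for every address claimed by more than one key."""
    by_address = defaultdict(list)
    for key in keys:
        by_address[key % divisor].append(key)
    return {addr: ks for addr, ks in by_address.items() if len(ks) > 1}


print(colliding_keys([76, 77, 78, 89, 167], 13))  # {11: [76, 89, 167]}
```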
6.5 Parsing of Data Elements
In a personnel file for which an index is to be created on the attribute ssn, we would expect to include in this index each ssn and the number of the record in which it occurred. If, on the other hand, there were a text attribute in the record, say a narrative summary of the employee’s job history, then we would hardly expect to index this by including the complete text of the summary in the index. That would be useless, because we would never expect a searcher to use a complete summary as one long search key.
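An index on a short attribute such as ssn might be sketched as a simple mapping from attribute value to record number. The personnel records below are invented for illustration, and numbering records from 1 is an assumption.

```python
def build_index(records, attribute):
    """Map each value of `attribute` to the number of the record holding it."""
    index = {}
    for number, record in enumerate(records, start=1):
        index[record[attribute]] = number
    return index


# Hypothetical personnel file.
personnel = [
    {"ssn": "111223333", "summary": "Hired as a clerk; later promoted to supervisor."},
    {"ssn": "444556666", "summary": "Joined as an analyst in the records department."},
]

ssn_index = build_index(personnel, "ssn")
print(ssn_index["444556666"])  # 2
```

The same function applied to the summary attribute would produce entries keyed by whole paragraphs, which is exactly the kind of index no searcher could use.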
With the key-word method of indexing text, the question arises of which words to use. In this chapter, we are not so concerned about which actual words to choose as about deciding how to organize them. One possible selection rule is to select every word in the text. This has the advantage of mechanical simplicity, but results in selection of many words that common sense suggests cannot make a meaningful contribution to a good index, hence fills memory with useless entries. We can improve on this approach by omitting those words found on a stop list, typically a list of very common words generally deemed to lend no significance to identification of the subject matter of a text (Fox, 1990). Such words, in English, include the, of, an, and the like. The Dialog Corporation, with one of the largest collections of online files in the world, uses only AN, AND, BY, FOR, FROM, OF, THE, TO, and WITH in its stop list, because so many seemingly innocuous words may have meaning to some users. The word A, e.g., would be dropped by most of us, but it identifies a vitamin of importance to others.
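Selecting every word except those on a stop list can be sketched as follows, using the nine-word Dialog list quoted above. The tokenizing rule (runs of letters and digits, folded to upper case) and the sample sentence are assumptions made for this illustration.

```python
import re

DIALOG_STOP_LIST = {"AN", "AND", "BY", "FOR", "FROM", "OF", "THE", "TO", "WITH"}


def key_words(text, stop_list=DIALOG_STOP_LIST):
    """Return every word of `text`, in order, that is not on the stop list."""
    words = re.findall(r"[A-Za-z0-9]+", text.upper())
    return [w for w in words if w not in stop_list]


print(key_words("The role of vitamin A in the diet"))
# ['ROLE', 'VITAMIN', 'A', 'IN', 'DIET']
```

Because A and IN are not on this short list, both survive as index terms; a longer stop list would discard them, and with them the reference to vitamin A.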
Another approach altogether is for a human indexer to read the text and select the most meaningful words that describe the subject. This method can be counted on to give good descriptive material, but adds to the cost and time of preparing the index. Automatic selection of key words by computer is possible but has never worked well enough to be successfully used commercially. By this we mean selection of keywords as an index to be used as a surrogate for searching. This is different from using user-supplied key words to find relevant documents. Also, indexers do not always agree on which words are most meaningful (Sievert & Andrews, 1991). Searchers, then, have some guessing to do.
Indexing may use the complete content of an attribute whose value requires relatively little memory (such as ssn), or it may not use the complete content if the attribute value takes up a great deal of space and is such that no user is ever likely to use the entire value as a search key.
Consider a person’s name when used in an author field of a bibliographic record. Certainly it seems clear that it is desirable to index bibliographic records by author, but in what form should the name appear in the index? Should we list only the last names? Also list the first names if we have them? Should the names be indexed in some strict format? If using only initials, is it necessary to use periods after each?
Subject headings also raise problems because these are usually syntactic statements of several words, and there are several for each bibliographic record. Should the index entry consist of the complete set of subject headings used in a record, as a unit, together with record number? Should each subject heading be separately entered? Should each word in each subject heading be indexed separately?
There are two basic mechanical ways of parsing a syntactic expression to create an index: by word or by phrase. There can also be a combination of the two or, of course, no index at all.
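The difference between the two ways of parsing can be sketched with a pair of small functions applied to subject headings. The sample records, the function names, and the folding of terms to upper case are assumptions for illustration, not a description of any particular retrieval system.

```python
def parse_by_phrase(headings):
    """Each complete subject heading becomes one index term."""
    return [h.strip().upper() for h in headings]


def parse_by_word(headings):
    """Each word of each subject heading becomes a separate index term."""
    return [w.upper() for h in headings for w in h.split()]


def build_index(records, parse):
    """Map each term to the set of record numbers in which it occurs."""
    index = {}
    for number, headings in records.items():
        for term in parse(headings):
            index.setdefault(term, set()).add(number)
    return index


# Hypothetical bibliographic records: record number -> subject headings.
records = {
    1: ["Information retrieval", "Data structures"],
    2: ["Information theory"],
}

word_index = build_index(records, parse_by_word)
phrase_index = build_index(records, parse_by_phrase)

print(sorted(word_index["INFORMATION"]))  # [1, 2]
print(sorted(phrase_index))
# ['DATA STRUCTURES', 'INFORMATION RETRIEVAL', 'INFORMATION THEORY']
```

Parsing by word lets a searcher find both records with the single term INFORMATION, while parsing by phrase requires the heading to be entered exactly as it was assigned.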