Pre-Computer IR Systems - Text Information Retrieval Systems

Introduction

1.6.2 Pre-Computer IR Systems

Until the 1950s, the content of documents was described almost universally by applying subject headings or classification codes. A subject heading is a short description of a subject, such as France, History and Middle Ages. Mortimer Taube, founder of the company Documentation Incorporated, is generally credited with adapting an idea that dated back to the 1930s (Taube, 1953–1965): a set of single words or short phrases, which he called Uniterms, would be used to describe the document’s content. Instead of the subject heading given above, we might have indexed a document by three separate terms:

France, History, and Middle Ages. The difference is that the subject heading is one syntactic expression and someone searching on France in the Middle Ages, or History of the Middle Ages, might not find that heading. Taube saw that the subject heading contained three separate concepts (France, History, and Middle Ages), which had been pre-coordinated, or pre-formed, into a syntactic unit. His idea of entering each term separately into an index allowed the searcher to look for any combination of the terms of interest. This came to be known as post-coordination, where the terms are associated after indexing, at search time, as the needs of the search dictate. The older method of combining index terms in a syntactic statement came to be known as pre-coordination.
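The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration: the document numbers 101 and 102 and the second heading are hypothetical, and splitting a heading on commas is an assumption made to keep the example short, not a description of how headings were actually analyzed.

```python
# Pre-coordinated index: each document is filed under one fixed heading.
# Document numbers and the second heading are hypothetical.
pre_coordinated = {
    "France, History, Middle Ages": {101},
    "England, History, Middle Ages": {102},
}

# Post-coordinated index: every term of a heading is entered separately.
post_coordinated = {}
for heading, docs in pre_coordinated.items():
    for term in heading.split(", "):
        post_coordinated.setdefault(term, set()).update(docs)


def search_pre(heading):
    """Look up one exact heading; any other wording finds nothing."""
    return pre_coordinated.get(heading, set())


def search_post(*terms):
    """Combine, at search time, whichever terms the searcher chooses."""
    postings = [post_coordinated.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()
```

With these definitions, `search_pre("France, Middle Ages")` returns an empty set because no heading has exactly that wording, whereas `search_post("France", "Middle Ages")` finds document 101 and `search_post("History", "Middle Ages")` finds both documents.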

Two methods of mechanical searching of cards soon became popular (Casey et al., 1958). Taube used cards, each with a single-subject heading, called a Uniterm, as shown in Fig. 1.5. The information on the cards was arranged in 10 columns for the posting of item numbers by their rightmost digit. This facilitated manual comparisons. In a serial file of document records, to search for multiple concepts (say the subject LUNAR and the subject EXCURSION) we would find the cards corresponding to these two concepts. We would then compare the record numbers listed to see if there were any in common. If so, the corresponding documents satisfied the request. In the figure, it can be seen that documents numbered 241, 44, and 17 are found in both cards, hence these satisfy the search requirements. Finding these document numbers was a manual task using post-coordinate indexing and a method of mechanical searching.
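The card comparison can be sketched in Python using the posting numbers shown on the two cards of Fig. 1.5. The function names are illustrative only; the column-by-column comparison mirrors the manual procedure, in which only numbers sharing a rightmost digit ever need to be compared.

```python
# Posting numbers as they appear on the two Uniterm cards of Fig. 1.5.
EXCURSION = [90, 241, 52, 63, 34, 25, 66, 17, 58, 49,
             130, 281, 92, 83, 44, 75, 86, 57, 88, 119,
             640, 122, 93, 104, 115, 146, 97, 158, 139,
             157, 178, 199, 207, 248, 269]
LUNAR = [110, 181, 12, 73, 44, 15, 46, 7, 28, 39,
         430, 241, 42, 94, 85, 76, 17, 78, 79,
         870, 761, 602, 124, 95, 126, 87, 118, 109,
         901, 982, 194, 165, 136, 147, 168, 179]


def post_to_card(doc_numbers):
    """Arrange document numbers in 10 columns by their rightmost digit."""
    columns = {digit: [] for digit in range(10)}
    for number in doc_numbers:
        columns[number % 10].append(number)
    return columns


def coordinate(card_a, card_b):
    """Compare two cards column by column and return the shared numbers."""
    common = []
    for digit in range(10):
        common.extend(n for n in card_a[digit] if n in card_b[digit])
    return sorted(common)


matches = coordinate(post_to_card(EXCURSION), post_to_card(LUNAR))
print(matches)  # [17, 44, 241]
```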

W. E. Batten reported a similar method in 1948 at the Royal Society’s Scientific Information Conference, which may have influenced Taube (Batten, 1951, pp. 169–181). Again there was a card for each term, but the document numbers were represented by coordinates of a point on the card rather than written explicitly. The Batten card was a carefully laid out grid, rather like a piece of graph paper on card stock. If a term appeared in a document, a hole was drilled at an appropriate location to represent that document’s accession or serial number. If we wished to search for DOCUMENT and RETRIEVAL, we pulled those two concept cards and placed them over one another on a light box. The numbers corresponding to the positions where light shone through indicated documents indexed on both concepts. In Fig. 1.6 document number 1234 contains both the illustrated terms. The number of concepts that could be represented this way was very large, but even with very large cards and very fine grids, the system was limited as to the file size it could handle. Such an approach came to be called an optical coincidence system.
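A minimal sketch of optical coincidence follows. It assumes a 100 by 100 grid, so that document number 1234 falls 12 units from the left edge and 34 units up from the bottom, as in Fig. 1.6. The grid size, the function names, and every document number other than 1234 are assumptions made for illustration.

```python
GRID = 100  # assumed grid: 100 x 100 positions, document numbers 0..9999


def hole_position(doc_number):
    """Map a document number to (units from left edge, units from bottom)."""
    if not 0 <= doc_number < GRID * GRID:
        raise ValueError("document number exceeds the card's capacity")
    return divmod(doc_number, GRID)


def drill(doc_numbers):
    """Represent a term card as the set of its drilled hole positions."""
    return {hole_position(number) for number in doc_numbers}


def overlay(*cards):
    """Light passes only where every stacked card has a hole."""
    lit = set.intersection(*cards)
    return sorted(x * GRID + y for x, y in lit)


# Hypothetical postings; only 1234 is taken from the figure.
document_card = drill([1234, 2207, 5150])
retrieval_card = drill([88, 1234, 9001])

print(hole_position(1234))                      # (12, 34)
print(overlay(document_card, retrieval_card))   # [1234]
```

The `ValueError` branch reflects the limitation noted above: the number of term cards is unbounded, but the number of documents is capped by the number of grid positions on a card.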

Figure 1.5

EXCURSION 43871: 90, 241*, 52, 63, 34, 25, 66, 17*, 58, 49, 130, 281, 92, 83, 44*, 75, 86, 57, 88, 119, 640, 122, 93, 104, 115, 146, 97, 158, 139, 157, 178, 199, 207, 248, 269

LUNAR 12345: 110, 181, 12, 73, 44*, 15, 46, 7, 28, 39, 430, 241*, 42, 94, 85, 76, 17*, 78, 79, 870, 761, 602, 124, 95, 126, 87, 118, 109, 901, 982, 194, 165, 136, 147, 168, 179

Cards used with the Uniterm system: each card represents one term, shown as both a word or phrase and a number, and the accession numbers of all documents to which that Uniterm pertains. The document numbers are listed in 10 columns, according to the low-order digit of the number. This facilitates manual scanning. The starred numbers indicate those in common to both terms.

The second type of card device was the edge-notched punch card, first reported by Calvin Mooers (1951) as the mechanical core of his Zatocode system. The card took the opposite approach to that of Taube and Batten: it represented a document, not a term. A description of the item was typed on the body of the card. The edge of the card had a series of holes, each labeled with a number which could represent a single descriptor. This was Mooers’ term for a multi-word precoordinated phrase describing an indexed concept. The holes for the numbers were notched out with a punch to carry out the coding of the assigned concepts, as in Fig. 1.7. The first part of the figure shows detail of a similar card system. The presence of a value was indicated by punching out the edge of the card. A needle was then run through a deck of such cards in the hole position representing the concept sought, and the deck vibrated. This caused cards that had been notched to represent the desired concepts to drop out (or, equivalently, those not notched were lifted). It is from this action that the expression false drop originated: a card that dropped but was not really relevant was a false drop. The second part of Fig. 1.7 shows Mooers’ Zatocard (Perry et al., 1956, pp. 52–53). Descriptors were encoded using several numeric hole positions each. An unusual feature was that descriptor codes did not have to be unique. An occasional false drop could result from this, but in the days of relatively small databases, it was assumed these could easily be detected and dropped from consideration. Logically, this process is similar to Batten’s light coincidence.
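How overlapping descriptor codes lead to a false drop can be sketched as follows. The descriptors, their hole positions, and the three documents are all hypothetical and are not Mooers’ actual code assignments; the sketch only assumes, as described above, that each descriptor occupies several hole positions and that a card drops when it is notched at every needled position.

```python
# Hypothetical descriptor codes: several hole positions per descriptor.
# The codes overlap, so notches from different descriptors can combine
# to imitate a descriptor that was never assigned to the card.
CODES = {
    "chemistry": {1, 4, 7},
    "metals": {2, 5, 7},
    "corrosion": {3, 6, 8},
    "alloys": {1, 2, 5},
}


def notch(descriptors):
    """Return the hole positions notched out on one document card."""
    notched = set()
    for descriptor in descriptors:
        notched |= CODES[descriptor]
    return notched


def needle_sort(deck, descriptor):
    """Return the cards that drop when the descriptor's holes are needled."""
    needles = CODES[descriptor]
    return [name for name, notched in deck.items() if needles <= notched]


# Descriptors actually assigned to each (hypothetical) document.
assigned = {
    "doc A": ["chemistry", "metals"],
    "doc B": ["alloys"],
    "doc C": ["corrosion"],
}
deck = {name: notch(descriptors) for name, descriptors in assigned.items()}

dropped = needle_sort(deck, "alloys")
false_drops = [name for name in dropped if "alloys" not in assigned[name]]
print(dropped)      # ['doc A', 'doc B']
print(false_drops)  # ['doc A']
```

Here doc A was never indexed under alloys, but the notches for chemistry and metals together cover positions 1, 2, and 5, so the card drops anyway and has to be discarded by inspection.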

Figure 1.6

(Figure labels: Term No. 12345 RETRIEVAL; Term No. 8723 DOCUMENT; 12 units; 34 units.)

Optical coincidence system: as in the Uniterm system, there is a card for each term, but the numbers of the documents to which the term pertains are represented as small holes in the card or surface at coordinates corresponding to the document number, e.g. document number 1234 will be represented as a hole at a location 12 units to the right of the left edge and 34 units up from the bottom.

Both these systems and Taube’s Uniterm cards were intended to effectuate the coordination of concepts at the time a search was carried out rather than at the time of indexing. Each requires significant effort at the time of indexing so that greater speed and flexibility will be available at the time of search. Each uses the algebra of sets, developed by George Boole (1958), to create sets of items during a search, and they are clear precursors of the automated retrieval systems that followed.
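The set algebra involved can be illustrated with Python’s built-in set operators. The LUNAR and EXCURSION postings below are a subset of the numbers on the cards of Fig. 1.5; the ORBIT postings are hypothetical, added only so that an exclusion can be shown.

```python
# Postings per term; ORBIT is a hypothetical third term.
postings = {
    "LUNAR": {7, 17, 44, 241},
    "EXCURSION": {17, 44, 90, 241},
    "ORBIT": {7, 44, 300},
}

# AND: documents indexed under both terms (set intersection).
both = postings["LUNAR"] & postings["EXCURSION"]

# OR: documents indexed under either term (set union).
either = postings["LUNAR"] | postings["EXCURSION"]

# NOT: documents under both terms but not under a third (set difference).
both_not_orbit = both - postings["ORBIT"]

print(sorted(both))            # [17, 44, 241]
print(sorted(either))          # [7, 17, 44, 90, 241]
print(sorted(both_not_orbit))  # [17, 241]
```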

From the document Text Information Retrieval Systems (pages 42-45)