

From the document Text Information Retrieval Systems (pages 110-117)

Attribute Content and Values

4.3.5 Transformation of Sound

usually each element representing the gray-scale value (0 = black, 255 = white).

One technique, called global, is to use a photograph as the query, comparing it with stored ones in a database by matching all the pixels without regard for what part of a face is being compared at any pixel. Potentially better is a method called local. It makes use of the facial features that have been detected and the distances and angles between them, somewhat as in fingerprint matching. But people change, shave or do not shave, dye their hair, turn grey, scowl or smile, and they are photographed under different lighting and posture conditions. The problem is not fully solved (Bruce, 1988; Chellappa et al., 1995; Eigenface Tutorial, 2006; Harmon, 1973; Hjelmas and Low, 2001).

retrieval, but is intended to enable a searcher to find a small number of possibilities quickly. Figure 4.5 shows the encoding of the opening bars of God Save the Queen (or America).
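The up or down coding shown in Figure 4.5 can be sketched in a few lines. This is a minimal illustration, not the procedure of any particular system: it assumes the melody is already available as a list of numeric pitches (MIDI note numbers here, with the tune assumed to be in G major), and the function name `parsons_code` is invented for this example. Each note after the first is coded U, D, or R according to whether it is higher than, lower than, or the same as the note before it.

```python
def parsons_code(pitches):
    """Return the up/down/repeat contour code for a sequence of pitches."""
    if not pitches:
        return ""
    symbols = ["*"]  # the first note has no predecessor to compare with
    for previous, current in zip(pitches, pitches[1:]):
        if current > previous:
            symbols.append("U")  # up
        elif current < previous:
            symbols.append("D")  # down
        else:
            symbols.append("R")  # repeat
    return "".join(symbols)


# Opening of "God Save the Queen" as MIDI note numbers (assumed key: G major):
# G G A | F# G A | B B C | B A G
opening = [67, 67, 69, 66, 67, 69, 71, 71, 72, 71, 69, 67]
print(parsons_code(opening))  # *RUDUUURUDDD
```

The result matches the encoding given in Figure 4.5 apart from the space, which the figure uses only to group the symbols. Because only the direction of each step is kept, many different melodies share a code, which is why such a code narrows a search to a few candidates rather than identifying a single tune.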

Several systems bypass the need for the searcher to encode music. Instead, the user provides the musical sounds themselves (Fig. 4.6). One such system is the New Zealand Digital Melody Index, MELDEX. According to its authors,

“[i]t accepts acoustic input from the user, transcribes it into the ordinary music notation that we learn at school, then searches a database for tunes that contain the sung pattern or patterns similar to it.” (McNab et al., 1997).

Other direct music input systems that can retrieve the identification of passages are the Humdrum **kern representation from Ohio State University (Everything You Need to Know About the “**kern” Representation, nd) and Themefinder (About Themefinder, nd). Several systems, such as Finale and Sibelius (Fogwall, 2001), primarily enable a composer to produce conventional musical notation using only the standard computer keyboard or by playing the music into the computer.

All these methods illustrate the same central point, which also applies to graphics and, less obviously, to all other forms of retrieval. Retrieval does not require that a unique match be found for a query. A surrogate record may be used instead of the original: fingerprint minutiae instead of the complete print image, up or down coding instead of the original notes, or title, author, or subject heading instead of a text. If the use of these surrogates saves enough storage space and search time, and produces few enough output records, then they are adequate.

Figure 4.4

The Barlow–Morgenstern music indexing method: the musical notes for a line of a Smetana opera are shown. Its encoding, after transcription, would be: DDAABB#.

Figure 4.5

The Parsons music indexing method. The musical notes for God Save the Queen (or America). Their encoding would be: *RUDUU URUDDD.

Figure 4.6

The opening line for cello of Beethoven’s Fifth Symphony, third movement. There are slight variations in the way this line is represented in the alphanumeric code in different papers by Brook and Gould. This is a compromise.

A more complete survey of modern musical retrieval systems is found in Downie (1997, 2003).

4.4 Uniqueness of Values

While we often want attribute values to reflect membership in a set or class, so as to associate entities with like attributes, we also sometimes have the opposite need: to separate each entity from all others. Particularly, when the entities of a file are people, it is convenient, important, and may even be legally necessary to clearly distinguish each person from all the others. In the United States and Canada there is no universal identifying number, a number such that each citizen or resident has one and only one number, which no one else has. There is the U.S. social security number and the Canadian social insurance number, but these were not intended as universal identifiers and their issuing agencies prefer that they not be so used. If we had such a system then it could be used to advantage by anyone with a database of people, whether employers’ personnel files, bank deposit records, credit card accounts, school, police, medical, or tax records. There would be no need for each database owner to create a new numbering system for the people represented, and, if necessary, it would be quite easy to find information about a person in someone else’s database, because the one necessary key would be known.

The very universality that would be so convenient for record keepers is found abhorrent by many people in North America because it takes away some degree of anonymity and therefore of protection from prying or from errors in interpretation. We all make mistakes, whether it be an unpaid bill, a poor grade in school, or a youthful scrape with the police. The inability of information seekers such as future employers or motor vehicle agencies to readily find this embarrassing information is a protection for us. As long as this is so, the opportunity exists to start over. Not every case is so dramatic as that of Benjamin Franklin, who ran away from an indentured apprenticeship in Boston to a spectacularly productive new life in Philadelphia. But most of us, if we have had a dispute with a department store over a bill, do not want an allegation of credit unreliability to be made available to every other business we might wish to deal with. This is essentially the reason behind laws that prohibit release of the names of juveniles arrested by the police or even convicted in the courts.

If there is going to be an identifying number, even for use within one organization (a student or employee number), it is worth some effort to prevent duplicate assignments and to reduce the likelihood of error in transcribing or transmitting the number. One way to do this is to include in the number a check digit. A simple example is to add a 10th digit to a 9-digit attribute. The 10th digit would be determined by adding together the nine digits and taking the low-order, or right-most, digit of the sum. The U.S. social security number is such a 9-digit field, usually written in the form 123-45-6789. The sum of this sample of digits is 45 and the units digit of the sum is 5. Hence, we might use the 10-digit number 123-45-6789-5 as a more reliable SSN. If a single digit is misread or mistyped, say as 124-45-6789-5, then the check digit is wrong (it would be computed as 6 here), and the existence (but not the location) of an error would be detected. We could also include the year of birth, or a code for it, as part of the identifying number, as done in Sweden (Westin, 1976, p. 261). This would reduce the likelihood of mistakenly using a 2-year-old’s number for a 60-year-old person. Date of birth is less useful in a military or school setting, because the ages of most of the people involved tend to fall within a relatively narrow range. However, one of the issues in identifier design (Secretary’s Advisory Committee on Automated Personal Data Systems, 1973, pp. 109–113) is whether or not the number or code should include any personal information at all about the identified person.
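The digit-sum check described above can be written out as a short sketch. The function names are invented for this illustration, and the hyphenated layout with the check digit as a final group is simply the form used in the example in the text.

```python
def check_digit(number: str) -> str:
    """Units digit of the sum of all digits in the number (hyphens are ignored)."""
    total = sum(int(ch) for ch in number if ch.isdigit())
    return str(total % 10)


def append_check_digit(number: str) -> str:
    """Return the number with its check digit added as a final group."""
    return f"{number}-{check_digit(number)}"


def is_consistent(number_with_check: str) -> bool:
    """True if the final group equals the check digit computed from the rest."""
    body, _, check = number_with_check.rpartition("-")
    return check_digit(body) == check


print(append_check_digit("123-45-6789"))  # 123-45-6789-5
print(is_consistent("123-45-6789-5"))     # True
print(is_consistent("124-45-6789-5"))     # False: the digits now sum to 46
print(is_consistent("213-45-6789-5"))     # True: a swap of two digits goes unnoticed
```

The last line shows a limit of this simple scheme: because addition does not depend on order, transposing two digits leaves the sum, and therefore the check digit, unchanged.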

The simplest method of assigning numbers is sequentially: the next entity for which we create a record gets the next number in numerical sequence. This is conventional when there are no privacy issues involved and no severe penalty for error in transcription (hence no need for redundancy). Libraries do this with acquisition numbers for books. The acquisition number contains no information about subject matter or author identification. It serves solely to distinguish one book from another, even two copies of the same work, and is used largely in accounting and inventory operations. This is contrasted with a call number, used as the basis for placing a book on a shelf in close physical proximity with works of a similar subject content, and composed of an indicator of subject matter and of the author’s last name.

4.5 Ambiguity of Attribute Values

While unique identifiers are used for some applications and indicators of class membership for others, in yet other cases there is uncertainty about values or meanings, causing confusion to a human reader or a computer program, or both. One source of ambiguity is semantic, that is, it concerns the meaning of symbols. Synonyms are two or more different symbols having essentially the same meaning. Homonyms or homographs are two or more symbols that either sound or appear alike but have different meanings. Anaphora (Liddy, 1990) are words, such as it, whose function it is to represent other words.

Synonymy is often context dependent. In general usage the verbs counsel and advise mean essentially the same thing, as do street and road. But we do not ask which street to take to get from Indianapolis to Chicago, and we retain legal counsel, not advise, when in trouble with the law.


Examples of homonyms are red (the color) and read (the past tense of the verb to read), and of homographs, pound (the verb) and pound (the monetary unit of Great Britain).

Consider the color word red and some related words: cerise, scarlet, and crimson. To an artist these may have quite different, or at least differentiable, denotations, which are well understood and can be readily communicated to other artists. To the average person, these words may all have the same denotation, or we may understand that there are “supposed” to be differences, but we really do not know what they are. Similarly, a news article might describe a person as imposing, which will not bring up the same image to all readers. Worse, if a user tries to retrieve news articles about imposing people he may not find any, or may find very few and be unable to comprehend the commonality of meaning of imposing in this set of articles. For this reason, in describing the content of books or journal articles or the like, libraries try to use standardized phrases or a controlled vocabulary to do so.

A controlled vocabulary is one for which some authority decides which words or codes are to be used and defines the meanings of these terms and the relationships among them. Although use of a controlled vocabulary cannot guarantee that each reader or cataloger will select the same terms to describe an item, at least each term can be explained as to assumed meaning and differentiated from other terms; hence, a controlled vocabulary causes each term to be unique in meaning, for those who understand the vocabulary and its documentation. We shall discuss means of achieving vocabulary control and some implications of its use later in this chapter.

The homonym or homograph problem is generally worse for computers than for humans because humans rely heavily on context to resolve ambiguity.

The very word mean has a number of different, unrelated meanings in common usage: it is a noun in statistics (average), a verb in linguistics (to denote or connote), and an adjective in describing an unpleasant personality (“a mean junk-yard dog”).

In IR, the existence of homographs suggests that the occurrence of a symbol in an attribute value may not mean that the entity actually has the attribute that could be inferred: a person of mean character versus a person of mean height versus a person whose example can mean much to us. Homographs, therefore, force us to consider different ways that (1) a concept or attribute we wish to search for might have been expressed in the records to be searched and (2) a value might occur that could appear to be what is wanted, but is not. In the first case, we would search for alternative values, e.g., AVERAGE, MEDIAN, or MODE. In the second, we would be aware that MEAN could retrieve completely irrelevant records. We might have to use other terms to establish context. If we want material about mean heights of people, we might want to look as well for such words as AVERAGE, STATISTICAL, or HEIGHT. Only AVERAGE is a synonym in this group. The other words help establish a context of statistics about persons’ bodies.
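The two cases just described can be illustrated with a toy search. Everything in this sketch is invented for the illustration: the four records, the tokenizing rule, and the `search` function, which accepts a group of alternative terms (any one may match) and a group of context terms (all must match).

```python
# Invented sample records; the keys are record identifiers.
records = {
    1: "the mean height of adult men in the survey",
    2: "a mean junk-yard dog guarded the lot",
    3: "average height statistics for school children",
    4: "what the example can mean to us",
}


def tokens(text):
    """Upper-case the text and split it into a set of words."""
    return set(text.upper().replace("-", " ").split())


def search(any_of, all_of=()):
    """Ids of records containing at least one term in any_of and every term in all_of."""
    hits = []
    for rec_id, text in records.items():
        words = tokens(text)
        if words & set(any_of) and set(all_of) <= words:
            hits.append(rec_id)
    return hits


# MEAN alone also retrieves the homograph uses in records 2 and 4.
print(search(["MEAN"]))                               # [1, 2, 4]
# A synonym widens the search; a context term removes the irrelevant records.
print(search(["MEAN", "AVERAGE"], all_of=["HEIGHT"]))  # [1, 3]
```

Adding AVERAGE as an alternative recovers record 3, which never uses the word mean, while requiring HEIGHT supplies the context that excludes the other senses.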

Although simplistic rules for the use of pronouns make them appear unambiguous, such is often not the case, and the resolution of anaphora is a problem of great difficulty in IR. In the sentence, “The dog wagged its tail,” the meaning of its is clear, and its use here is grammatically correct: it refers to the most recent noun. In, “Will everyone remove their hats,” the meaning of their is clear to most readers, but grammatically incorrect because it does not agree in number with the noun it actually represents. This could make it difficult to form a correct association by computer. In the sentence, “Books have illustrations, but their meanings may be unclear,” it is not at all clear whether their refers to books or illustrations. IR systems permitting natural-language queries tend to ignore anaphora.

Another form of ambiguity arises from syntax. The fact that a word occurs in a text, even aside from homographic considerations, does not mean that the text is about that word. For example, the word earthquake occurs in this book. It has been used in an example. This does not mean that the book should be said to be about earthquakes.

In summary, uncertainty can creep into recorded information or its interpretation, however carefully we design or edit. There can always be someone who interprets words differently or who does not read instructions. There can always be typographical errors, even if we do devise error-detecting techniques to find most of them. As we shall point out later, ambiguity in language can be a great benefit, although achieved at a cost. We cannot eliminate it. We must learn to live with it and even to take advantage of it.

4.6 Indexing of Text

In mathematics and computer science, indexing is a procedure or method for accessing information. In mathematics the notation x_i tells us that there is a sequence of values of the variable x, a one-dimensional array. The value of i identifies a specific element, and i is called an index. In computer science the concept of indexing is more general. An index may also be an array or a file whose elements point to elements of another file. If there is a file whose records are in order by SSN, there may be a separate file (see inverted files in Section 6.4.2), each record of which contains a name and an SSN and is in order by name. This, too, is an index.
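The two-file arrangement described above can be sketched with in-memory lists standing in for files. The records, names, and function names are all invented for this illustration; a binary search (Python's `bisect` module) stands in for whatever access method a real file system would provide.

```python
import bisect

# Main file: records kept in order by SSN (all values are invented).
main_file = [
    {"ssn": "111-22-3333", "name": "Morgan, Ada", "dept": "Sales"},
    {"ssn": "123-45-6789", "name": "Baker, Lee", "dept": "Finance"},
    {"ssn": "987-65-4321", "name": "Chen, Wei", "dept": "Research"},
]
ssn_keys = [rec["ssn"] for rec in main_file]

# Index file: (name, SSN) pairs kept in order by name.
name_index = sorted((rec["name"], rec["ssn"]) for rec in main_file)


def find_by_ssn(ssn):
    """Binary search of the main file on its ordering key."""
    pos = bisect.bisect_left(ssn_keys, ssn)
    if pos < len(ssn_keys) and ssn_keys[pos] == ssn:
        return main_file[pos]
    return None


def find_by_name(name):
    """Look the name up in the index, then follow its SSN into the main file."""
    pos = bisect.bisect_left(name_index, (name, ""))
    if pos < len(name_index) and name_index[pos][0] == name:
        return find_by_ssn(name_index[pos][1])
    return None


print(find_by_name("Chen, Wei")["dept"])  # Research
print(find_by_name("Nobody, Ann"))        # None
```

The index holds only the search key and a pointer (here the SSN), so the full records are stored once, in the main file, yet can be reached quickly by either attribute.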

Most professional books or textbooks have a subject index at the end and may have a separate author index as well, listing each author cited and the page on which the citation occurs. Even the table of contents of a book is, in a sense, an index. It lists the numbers and names of chapters and sections and the pages on which they begin, giving a good general sense of subject content.

In library and information science an index has a still broader meaning, as discussed in Section 1.6. To index a book, journal article, or technical report is to record the values of various attributes expected to be used as a basis for searching. If the attribute is subject, then this form of index functions like a book’s subject index. If the attribute is the author or a named person in the text, then the index functions like a book’s author index. Traditionally, a great deal of effort has gone into subject indexing, especially of journal literature and technical reports.

There are three types of subject-describing attributes: classification codes (see Section 4.2), subject headings, and individual words descriptive of a subject, often called key words. The term descriptor can be applied to any of these, but is most often used for the latter two.

Subject headings are taken from a prepared list of headings. In North America the most common such list, for books, is the Library of Congress Subject Headings (1998). Typically, two or three headings from this collection are made a part of a library catalog record for a book. For a journal article or report, there are usually separate sets of subject headings for each discipline or profession. There is the Thesaurus of ERIC Descriptors (1995) in education, Medical Subject Headings (MeSH) (2006) in medicine, and the Ei Thesaurus (2001) in engineering. The vocabulary and needs of each discipline may be so different that these sets of subject headings may bear little resemblance to each other, except structurally. An article on the effect of asbestos materials on the health of children in school buildings may be indexed from totally different points of view in the separate professions of medicine, education, and architecture.

At the end of World War II, a great quantity of scientific literature (produced as part of the Allied war effort and captured Axis files) had never been made public. In the 1950s the cold war triggered a spurt of scientific and technological activity, and documentation. During the period from 1945 to the 1960s, attention began to be paid to the needs of subject control, or indexing of the literature (Becker and Hayes, 1963; Bourne, 1963; Committee on Government Operations United States Senate, 1960; Herner, 1984; Weinberg and the President’s Science Advisory Committee, 1963). By some definitions, information science was equated with documentation, which, in turn, was largely devoted to methods of indexing.

Gradually, the emphasis began to shift from indexing toward retrieval, as the volume of material required not just the creation of descriptive records that could be published in book form (such as Chemical Abstracts or Index Medicus) but mechanical assistance in retrieval, as well. There is no sharp delineation in time between these emphases. Indeed, one of the landmark publications in IR was in 1945 (Bush, 1945); but nevertheless, not everyone agrees that the shift in emphasis has ever occurred.

As computers have become faster and less expensive, it has become feasible, successively, to include a natural language abstract as part of a bibliographic record (for display only); then to permit search of the abstract for any word or combination of words occurring in its text; then to allow use of word roots and specific sequences of words in a search; and to find words that are related to but not equal to words used in a query. Now we have many databases that contain and allow search of the full text of books, journals, or newspapers. In a search engine using the WWW as its database, it is possible to build the index using a restricted set of HTML tags, or to use all terms from the complete document.
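The choice between indexing a restricted set of HTML tags and indexing the complete document can be illustrated with a small sketch. It is not a description of how any actual search engine works: the sample page, the choice of `title` and `h1` as the restricted set, and the names `TermCollector` and `build_index` are all assumptions made for this example. It uses only `html.parser` from the Python standard library.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser


class TermCollector(HTMLParser):
    """Collect words from a page, optionally only those inside selected tags."""

    def __init__(self, tags=None):
        super().__init__()
        self.tags = set(tags) if tags else None  # None means index all text
        self.depth = 0  # how many selected tags are currently open
        self.terms = set()

    def handle_starttag(self, tag, attrs):
        if self.tags and tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if self.tags and tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.tags is None or self.depth > 0:
            self.terms.update(re.findall(r"[a-z0-9]+", data.lower()))


def build_index(pages, tags=None):
    """Map each term to the set of page identifiers in which it occurs."""
    index = defaultdict(set)
    for page_id, html in pages.items():
        collector = TermCollector(tags)
        collector.feed(html)
        collector.close()
        for term in collector.terms:
            index[term].add(page_id)
    return index


# One invented page, identified by a made-up key.
pages = {
    "page1": "<html><head><title>Music retrieval</title></head>"
             "<body><h1>Indexing</h1><p>Melody search by contour.</p></body></html>",
}

restricted = build_index(pages, tags={"title", "h1"})
full = build_index(pages)
print(sorted(restricted))  # ['indexing', 'music', 'retrieval']
print(sorted(full))
```

The restricted index is smaller and holds only terms the page author chose to emphasize, while the full index also finds words such as melody that occur only in the body text.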
