Ranked and Fuzzy Sets - Querying the Information Retrieval System


7.3.4 Ranked and Fuzzy Sets

broad set definition, and then successively reduce its size by adding more restrictive criteria. It is not necessary to name the set being reduced, because it is always the last one created, nor is it necessary to write the Boolean AND between set numbers because it is understood. For the search illustrated above, the searcher might have started with an even broader term, ANIMALS, reduced this to PETS (assumed to mean ANIMALS AND PETS), then to CATS OR DOGS (now giving us ANIMALS AND PETS AND (CATS OR DOGS)). Under this approach, if a user wants to take another path, say CANINE NUTRITION, it is necessary to start the search again. There is no objective measure of which approach is better. Searchers tend to learn one method, and then to prefer that one thereafter.
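The successive narrowing described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy inverted index `INDEX`, its record numbers, and the helper `narrow` are hypothetical and do not reproduce the command language of any particular IRS.

```python
# Hypothetical inverted index: term -> set of record numbers (made-up data).
INDEX = {
    "ANIMALS": {1, 2, 3, 4, 5, 6},
    "PETS": {2, 3, 4, 5, 7},
    "CATS": {3, 8},
    "DOGS": {4, 9},
}


def narrow(current, *terms):
    """OR the given terms together, then AND the result with the current set.

    Passing current=None starts a new search, so the first step simply
    returns the records matching the terms.
    """
    matches = set().union(*(INDEX.get(term, set()) for term in terms))
    return matches if current is None else current & matches


s = narrow(None, "ANIMALS")      # ANIMALS
s = narrow(s, "PETS")            # ANIMALS AND PETS
s = narrow(s, "CATS", "DOGS")    # ANIMALS AND PETS AND (CATS OR DOGS)
print(sorted(s))
```

Because each step only ever intersects with the last set created, a different path such as CANINE NUTRITION cannot be reached from `s`; the searcher has to begin again with `narrow(None, ...)`.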

The advantages of the Boolean query logic are that it is easy to write interpretive programs for it and that the logic is crisp and decisive: records are in a set or they are not. The disadvantage is a “soft” one: most people do not naturally think in Boolean terms; hence, the logic seems artificial, and it is sometimes difficult for users to express what they want or to understand what they get. Even though we may not know exactly how people think, the sense among researchers in the field tends to be that forcing users to work in terms of binary set-defining functions is not a popular approach (Bookstein, 1985; Croft and Thompson, 1987; Hildreth, 1983; Salton and McGill, 1983, pp. 118–155; Salton, 1989, pp. 345–361).

records D_i such that S(D_i × Q) = 1. This is a crisply defined set, each record of the database being clearly in the set or not in it.

If the function could take any value in the range between 0 and 1, then the set is fuzzy and the membership function can be written as

S(D_i × Q) → [0, 1]

which maps the records and query into the set of numbers lying between 0 and 1, inclusive. Then, no record (except those for which S is exactly 0 or 1) can be said to be clearly in or not in the retrieval set. Effectively, S no longer expresses whether or not D_i is in the set, but the degree or strength of the association of D_i with the set. Any user, at any time, can set a threshold value for S, which then determines the set membership. Each record has a rank, S, and the membership consists of all records whose rank is greater than or equal to the threshold.
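As a minimal sketch of how a threshold turns a fuzzy membership function back into a concrete retrieval set, consider the Python fragment below. The membership values in `SCORES` are invented for illustration, and `retrieval_set` is a hypothetical helper, not part of any system described here.

```python
def retrieval_set(scores, threshold):
    """Return the records whose membership value S is at least the threshold.

    scores maps a record identifier to S(D_i x Q), a number in [0, 1].
    The result is ordered from strongest to weakest association.
    """
    for record, s in scores.items():
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"membership value out of range for {record}: {s}")
    members = [record for record, s in scores.items() if s >= threshold]
    return sorted(members, key=lambda record: scores[record], reverse=True)


# Invented membership values for four records.
SCORES = {"D1": 1.0, "D2": 0.75, "D3": 0.4, "D4": 0.0}

crisp = retrieval_set(SCORES, 1.0)   # only records with S exactly 1
loose = retrieval_set(SCORES, 0.4)   # a more permissive threshold
print(crisp, loose)
```

Setting the threshold to 1 recovers the crisp Boolean set, while lowering it admits records with progressively weaker association.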

The use of ranking or measurement of the probability of a match, instead of binary Boolean logic, is receiving a great deal of attention in IR research (Noreault et al., 1977; Noreault et al., 1981; Ro, 1988; Robertson et al., 1986; Salton and McGill, 1983, pp. 146–151). Some of the most common ways of achieving ranking of records are described below.

1. Weighted terms: Membership in the set is not precisely defined. A user’s query might include a weight for each term, showing its relative importance. Then the user can select the n records with the highest total weight. For example, the query might state (TENNIS(.8) OR GOLF(.4)) AND CHAMPION(.6).

Probably, an IRS executing such a command would treat it first as an unweighted statement for formation of an initial set, then use the assigned weights to compute a rank for each record. Since CHAMPION is required here by the Boolean logic, the assignment of a weight to it is actually superfluous and represents one of the complications of this method. But if TENNIS and CHAMPION both appeared in a record, it would have a weight of .8 + .6 = 1.4. If the combination GOLF and CHAMPION appeared, the record would have a weight of .4 + .6 = 1.0. If both TENNIS and GOLF appeared with CHAMPION, the record might be construed to have either the weight .8 + .4 + .6 = 1.8 or, if we use only the higher weight in the OR expression, .8 + .6 = 1.4. A variant is to add the weight each time a word appears; if GOLF appeared six times in a record, its total contribution to record weight would be 2.4.

This method has the apparent advantage of allowing users to express variations in the relative importance of terms used, but the same criticism applies as to Boolean logic: it is not a “natural” way to communicate. Users would have to learn how to assign weights, by experience, and until they became adept, results might be worse than without weights.
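The arithmetic of the weighted-term example can be checked with a short Python sketch. It assumes, as the text suggests an IRS probably would, that the unweighted Boolean statement is tested first and the weights are applied only to records that pass. The function names, the `or_rule` and `per_occurrence` options, and the simple word-splitting rule are all assumptions made for this illustration.

```python
import re
from collections import Counter

# Weights from the example query (TENNIS(.8) OR GOLF(.4)) AND CHAMPION(.6).
WEIGHTS = {"TENNIS": 0.8, "GOLF": 0.4, "CHAMPION": 0.6}


def term_counts(text):
    """Count the occurrences of each word in a record, ignoring case."""
    return Counter(re.findall(r"[A-Z]+", text.upper()))


def matches(counts):
    """Unweighted Boolean test: (TENNIS OR GOLF) AND CHAMPION."""
    return (counts["TENNIS"] > 0 or counts["GOLF"] > 0) and counts["CHAMPION"] > 0


def score(counts, or_rule="sum", per_occurrence=False):
    """Return the record weight, or None if the record fails the Boolean test.

    or_rule="sum" adds both OR terms; or_rule="max" keeps only the higher one.
    per_occurrence=True adds a term's weight each time the word appears.
    """
    if not matches(counts):
        return None

    def weight(term):
        n = counts[term] if per_occurrence else min(counts[term], 1)
        return WEIGHTS[term] * n

    or_part = [weight("TENNIS"), weight("GOLF")]
    or_score = sum(or_part) if or_rule == "sum" else max(or_part)
    return or_score + weight("CHAMPION")


for text in ("tennis champion", "golf champion", "tennis golf champion"):
    counts = term_counts(text)
    print(f"{text}: sum={score(counts):.1f} max={score(counts, or_rule='max'):.1f}")
```

With `per_occurrence=True`, a record in which GOLF appears six times receives 2.4 from that term alone, matching the variant described above.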

2. Word lists: To avoid the requirement for the precision of weights, a user can simply be allowed to use a list of words as a query statement. A record’s score would be computed by adding 1 for each time any of the words appears in the record. The query could be TENNIS, GOLF, CHAMPION, CUT, TOURNAMENT, WINNER.


An article about the winner of a golf or tennis tournament is likely to have most of these words in it, probably some of them repeated several times. An article that mentions a champion of a political cause might have but a single instance of this one word, and few if any of the others, and so is unlikely to be highly ranked with respect to this query. Actually, the word list is simply a set of words with an implied OR between each pair, with equal weights. The method does not allow expression of the concept AND. Users are thereby saved the trouble of learning to express the logic but lose some precision of expression. A natural language query can be interpreted this way, by ignoring all syntax, perhaps eliminating common words, leaving only a list of unconnected words. The basic method can be used with an assumed AND, yielding a much smaller retrieval set.

Both these techniques can be interpreted as using fuzzy sets. In both cases, a large number of records might have a score greater than zero, and these form an initial set, but they may vary considerably on total score. Hence, the user makes the choice of a cut-off level.
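A word-list query and a user-chosen cut-off might be sketched as follows. The three sample records are invented, and the names `word_list_score` and `require_all` are hypothetical; `require_all=True` stands in for the assumed-AND variant mentioned above.

```python
import re
from collections import Counter

# The word list from the example query.
QUERY = ["TENNIS", "GOLF", "CHAMPION", "CUT", "TOURNAMENT", "WINNER"]


def word_list_score(text, query=QUERY, require_all=False):
    """Add 1 for every occurrence of any query word in the record.

    With require_all=True (an assumed AND), a record scores 0 unless
    every query word appears at least once.
    """
    counts = Counter(re.findall(r"[A-Z]+", text.upper()))
    if require_all and any(counts[word] == 0 for word in query):
        return 0
    return sum(counts[word] for word in query)


# Invented sample records.
RECORDS = {
    "sports": "The golf tournament winner made the cut; the champion beat a past winner.",
    "politics": "She is a champion of electoral reform.",
    "weather": "Rain is expected on Tuesday.",
}

scores = {name: word_list_score(text) for name, text in RECORDS.items()}
initial_set = sorted((n for n in scores if scores[n] > 0), key=lambda n: -scores[n])
cutoff = 2  # chosen by the user
final_set = [n for n in initial_set if scores[n] >= cutoff]
print(scores, initial_set, final_set)
```

The political article enters the initial set on the strength of a single word but falls below the cut-off, which is the behavior the text anticipates.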

3. Ranking Boolean sets: Ranking is not the same as set formation, and there is no reason why ranking must be based only on information contained in the command used to define the set. A set could be formed in a conventional manner, using a binary set membership function. Then a separate command could define the basis for ranking members of the set, such as providing a list of terms of particular interest to the user. Records could also be ordered on the frequency of occurrence of attribute values within a set, such as ranking in order by frequency of occurrence of an author’s name (see Sections 7.4.6 and 8.3.1). For example, a set might be formed based on the name of an author or institution, and then the records ranked on the basis of relevance to some subject-defining terms.
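The separation of set formation from ranking can be illustrated with a small Python sketch. The records, the author names SMITH and JONES, and the two helper functions are hypothetical; the relevance measure used here (a count of ranking-term occurrences) is just one simple choice among many.

```python
# Invented records: each has an author field and a short text field.
RECORDS = [
    {"id": 1, "author": "SMITH", "text": "vector space retrieval and term weighting"},
    {"id": 2, "author": "SMITH", "text": "automatic text processing"},
    {"id": 3, "author": "JONES", "text": "retrieval models and term weighting"},
    {"id": 4, "author": "SMITH", "text": "term weighting in retrieval and retrieval evaluation"},
]


def form_set(records, author):
    """Step 1: conventional binary set formation on the author attribute."""
    return [record for record in records if record["author"] == author]


def rank(members, terms):
    """Step 2: order an existing set by occurrences of the ranking terms."""
    def relevance(record):
        words = record["text"].lower().split()
        return sum(words.count(term) for term in terms)

    return sorted(members, key=relevance, reverse=True)


smith_set = form_set(RECORDS, "SMITH")
ranked = rank(smith_set, ["retrieval", "weighting"])
print([record["id"] for record in ranked])
```

The ranking terms play no part in deciding membership: record 3 mentions both of them but is excluded because it fails the author test, and record 2 stays in the set even though it matches neither.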

4. Ranking by link structure: The Web search engines all apparently use an inverted file structure and some variant of the vector space model, a weighted term method (see Section 10.3.1), to provide term weights for their indices, which are created by scanning the text of pages found by a Web crawling program. These indices can be used for Boolean searching, and are used in this way by some engines, particularly in what they term their advanced search modes.

While specific algorithms are proprietary, most ranking appears to be done by creating similarity measures between page and query word lists. Since the Web is extremely large, and free text searches tend to retrieve very large sets, order becomes exceedingly important because most users will not read very deeply into the retrieved set. Pages are ranked after the retrieved set is created, perhaps using the similarity value with the query alone or, more likely, using other methods such as the Google page-rank algorithm (Brin and Page, 1998), which uses a ranking based upon the number of times a page is the target of links from other pages. This utilizes information only available in the World Wide Web and is thus a departure from previous IR practice. It provides a quality ranking rather than a topical one. The number of times a link to a page has been clicked, that is to say followed, by a searcher may also be recorded and utilized as a measure of page quality for ranking purposes. For page sponsors that desire a high ranking, and are willing to pay for it, some engines will incorporate the amount paid into the ranking algorithm.
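The idea of ordering a retrieved set by how often each page is the target of links can be sketched as a plain in-link count. This is deliberately not the published page-rank algorithm, which also weights each link by the rank of the page it comes from, and it says nothing about what any engine actually does. The link graph `LINKS` and both function names are invented for the example.

```python
from collections import Counter

# Invented link graph: page -> pages it links to.
LINKS = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c", "b"],
}


def inlink_counts(links):
    """Count, for each page, how many other pages link to it."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):      # count each linking page once
            if target != source:         # ignore links from a page to itself
                counts[target] += 1
    return counts


def rank_retrieved(retrieved, links):
    """Order an already-retrieved set by in-link count, ties by page name."""
    counts = inlink_counts(links)
    return sorted(retrieved, key=lambda page: (-counts[page], page))


print(rank_retrieved(["a", "b", "d", "c"], LINKS))
```

Note that the query plays no part in this ordering, which is why the text describes link-based ranking as a measure of quality rather than of topic.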

From the document Text Information Retrieval Systems (pages 178-181)