
Class Membership: Binary, Probabilistic, or Fuzzy

In the document Text Information Retrieval Systems (pages 97-101)

Attribute Content and Values

4.2.3 Class Membership: Binary, Probabilistic, or Fuzzy

In Section 4.2.1 we introduced the complex question of uncertainty in the degree to which an entity has a particular attribute value (such that it falls in subject class C). The reality is that we must consider the extent to which members of a particular user group agree that it has the attribute, or the extent to which it measurably has the attribute. For example, we can assert that the home language of a family in the Province of Québec is French, with probability 0.83, according to Canadian Markets 1986, and the melting point of copper is 1083.0 ± 0.1 °C according to the McGraw-Hill Encyclopedia of Science and Technology. Both these variables may have different values if measured under different circumstances. Subject classification is based on group concurrence. Demographic values are based on measurement by survey. Physical measurement under controlled laboratory conditions, using universal standards for units and for methods of measuring, is generally accorded the highest level of acceptance by scientists. But even slight variations in materials used or test conditions can result in different values. Scientists must learn to cope with these differences in reports.

Figure 4.1

Forms of networks and hierarchies: the basic hierarchy (a) consists of a single “parent” with several subordinates. In a multilevel hierarchy (b) the parent of one element may be a subordinate of another, as A is to X. In a network structure (c), any element may “belong” to or be related to any other element. Panels: (a) basic hierarchy; (b) multi-level hierarchy; (c) network structure, in which members may be linked to other than a parent or child.

A vivid example of this is a famous graph showing the thermal conductivity of tungsten, at different temperatures, as measured by a large number of observers. The individual values were all plotted (Fig. 4.2) on a single set of axes and a smooth curve fitted to the data. The curve is very smooth and suggests that values along it are the most reliable. But there are dozens of points not on the curve that someone, under some circumstances, thought reliable.

Figure 4.2

A graph of physical measurements showing variations around a smooth average: this chart portrays a number of individual measurements of the thermal conductivity of tungsten, indicating how many variations there can be of even so well defined an attribute. (First published in Journal of Physical and Chemical Reference Data.)

Two approaches to membership classification, binary and fuzzy (or probabilistic), are described below.

1. Binary membership: When a person has been awarded a university degree, he or she may thereafter be classified as a graduate at the level of bachelor, master, or doctor. Barring fraudulent credentials, there is no doubt about the degree obtained, which is a different matter from what the person knows, although the former is often used as a measure of the latter. When a bank opens an account for a customer, it may be a demand savings account, checking account, or time deposit. The bank will determine the classification with the customer’s concurrence, and that is that. The decision may subsequently be changed, but is not subject to evaluation or review; the customer may have meant to open a different type of account, but there is no question about what type was opened. In these cases, we have examples of binary membership in a class or set. An entity is in a class or not. These are the only possibilities; there is no in-between possibility. Most bibliographic classification and indexing is of this form.

2. Fuzzy or probabilistic membership: When a library cataloger reads a 300-page book about IR, a decision has to be made about which class to place it in. There are at least two choices in the Library of Congress classification system, as we noted in Section 4.2.1. There is, of course, a quite different possibility, QA76 (Computer Science). The choice is a matter of opinion, or the choice may be stated as a matter of probabilities. We could say that the book is in class Z699 with probability 0.7 and in QA76 54 with probability 0.8. The probabilities are independent of each other and do not need to sum to 1.

Multiple classification is inconvenient as the basis for shelving books, since a given book can only be in one place at a time, and all of it, not seven-tenths of it, must be there. But the use of probabilities with subject headings might be quite helpful to a library user browsing in a catalog. The user would see that the cataloger was uncertain, and this provides the user with more, not less, information. While the book cannot be physically in two places at once, it can simultaneously have two applicable subject headings or classifications. Catalogers and users may have different opinions as to the probability or strength of membership of an entity in a class.

When we are uncertain whether or not an entity belongs in a particular set, we can create a fuzzy set, one whose boundaries, content, or definition are incompletely specified, or fuzzy (Kraft and Buell, 1983; Zadeh, 1965). This is an extension of the concept of probabilistic membership. We can do this by assigning the entity to the set and recording a measure of strength of membership, explicitly stated as an additional attribute.

If all book classifications were accompanied by probabilities, then we might do library searching by asking such questions as

“Get me all books that are in class C with probability greater than 0.5.” Such a statement means the user will accept books with even a somewhat tenuous relation with the given class.


“Get me all books in class C1 with probability greater than 0.9 or in class C2 with probability greater than 0.6.” This means we want to be quite sure about C1 but are willing to accept a relatively obscure connection with subject C2.
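The two example queries can be sketched in Python, assuming (hypothetically) that every catalog record stores a membership strength for each class it is assigned to. The catalog contents, the book titles, and the helper names `strength` and `in_class` are invented for illustration; Z699 and QA76 simply stand in for the classes C1 and C2 of the queries.

```python
# Hypothetical catalog: each book maps class labels to a membership strength
# in [0, 1]. The strengths are independent and need not sum to 1.
catalog = {
    "Book A": {"Z699": 0.7, "QA76": 0.8},
    "Book B": {"Z699": 0.95},
    "Book C": {"Z699": 0.4, "QA76": 0.65},
    "Book D": {"QA76": 0.3},
}


def strength(title, cls):
    """Return the membership strength of a book in a class (0.0 if unlisted)."""
    return catalog[title].get(cls, 0.0)


def in_class(cls, threshold):
    """Titles whose membership in `cls` is strictly greater than `threshold`."""
    return {title for title in catalog if strength(title, cls) > threshold}


# "All books that are in class C with probability greater than 0.5"
query_one = in_class("Z699", 0.5)

# "In class C1 with probability > 0.9 or in class C2 with probability > 0.6":
# the "or" becomes a set union of the two thresholded selections.
query_two = in_class("Z699", 0.9) | in_class("QA76", 0.6)
```

In this sketch, lowering a threshold widens the result set, which is the sense in which a user can choose how tenuous a connection to accept.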

If we list the possible classes in which an entity falls without assigning a probability and without implying certainty of membership, this is binary membership. This method is commonly used in manually assigning descriptors to articles in bibliographic files. As many as 10–20 terms may be used to describe the article, but there is no implication beyond the fact that the article is concerned with each term to some degree; no weights or degrees of certainty are assigned. In other words, the indexer may be implicitly acknowledging that variation in strength of association exists, but no attempt is made to quantify its extent.

Yet another variation is to have a multilevel indicator, or weight, showing the extent to which an article belongs in a subject class, e.g., that the term describes the article to a major or minor extent. For example, in the ERIC database the descriptors preceded by “*” are considered by the indexer to be major, e.g., * INFORMATION RETRIEVAL; * ONLINE SYSTEMS. The same descriptors, INFORMATION RETRIEVAL; ONLINE SYSTEMS, without the preceding “*” would be minor, i.e., relevant to the document, but not of primary importance. When terms representing content classes are selected by a computer program analyzing the text of a document, probabilities or weights based upon occurrence counts are often assigned. We shall meet fuzzy sets again when we discuss the logic of searching in Chapter 10.
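The major/minor convention can be read mechanically. The following is a minimal sketch, assuming the descriptors arrive as one semicolon-separated string with a leading “*” on major terms; that field layout and the function name `parse_descriptors` are assumptions made for illustration, not a description of any actual ERIC record format.

```python
def parse_descriptors(field):
    """Split a semicolon-separated descriptor field into (term, level) pairs.

    A leading "*" marks a major descriptor; anything else is minor.
    The field layout is an assumption made for this sketch.
    """
    parsed = []
    for raw in field.split(";"):
        term = raw.strip()
        if not term:
            continue  # skip empty pieces such as a trailing semicolon
        if term.startswith("*"):
            parsed.append((term[1:].strip(), "major"))
        else:
            parsed.append((term, "minor"))
    return parsed


# Hypothetical record: one major and one minor descriptor.
record = "* INFORMATION RETRIEVAL; ONLINE SYSTEMS"
descriptors = parse_descriptors(record)
major_terms = [term for term, level in descriptors if level == "major"]
```

A two-level indicator of this kind sits between binary membership (no weight at all) and a numeric strength of membership.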

4.3 Transformations of Values

Earlier, we examined similarity measures for names such as SMITH and SMYTHE or TSCHAIKOWSKY and CZAJKOWSKI. Similar questions are those of determining how close one numeric measure or one fingerprint is to another.

Some kinds of symbols may have so much variation that, as is the case with fingerprints, even two records of the same finger, taken at different times, may show some differences if we consider all detail. One way to compare or match them is to transform the original symbols into a higher order or more general symbol; that is, to abstract the important characteristics, which is what we do with the subject matter of a text when preparing an abstract or catalog record.
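One familiar transformation of this kind for surnames is the Soundex code, which reduces a name to its first letter plus three digits standing for groups of similar consonants. The text does not say which transformation the authors have in mind, so Soundex is used here only as an illustration; the sketch below follows the common American Soundex rules and handles only the letters A to Z.

```python
# Consonant groups of the common American Soundex scheme.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Return a four-character Soundex code, or "" if the name has no letters."""
    letters = [ch for ch in name.upper() if "A" <= ch <= "Z"]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate two letters with the same code
        code = CODES.get(ch, "")
        if code and code != previous:
            digits.append(code)
        previous = code  # a vowel resets this, so a repeated code counts again
    return (first + "".join(digits) + "000")[:4]


same_code = soundex("SMITH") == soundex("SMYTHE")
```

Under this sketch SMITH and SMYTHE receive the same code, while TSCHAIKOWSKY and CZAJKOWSKI do not, because Soundex keeps the first letter unchanged. Any such abstraction discards detail, and which detail is discarded determines which variants will match.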

With numeric data we most commonly do this by establishing a series of value ranges, or class intervals, such as reporting taxpayer incomes in such ranges as $0–9,999, 10,000–19,999, etc. Hierarchic codes that represent a sequence of nodes in a hierarchical structure can be truncated for comparison purposes. This is commonly done in libraries that use a truncated version of the classification code to inform users where to find books in the various classes.
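Both transformations are easy to sketch. The example below assumes non-negative whole-dollar incomes with a fixed interval width of 10,000, and hierarchic codes in which the leading characters identify the higher levels; the function names and the sample codes are illustrative only and carry no claim about real class meanings.

```python
def income_range(amount, width=10_000):
    """Return the (low, high) class interval containing a whole-dollar amount.

    Assumes a non-negative integer amount and equal-width intervals.
    """
    low = (amount // width) * width
    return low, low + width - 1


def truncate_code(code, length):
    """Keep only the leading `length` characters of a hierarchic code."""
    return code[:length]


# Two different incomes fall into the same class interval.
same_bracket = income_range(12_500) == income_range(19_999)

# Two different (illustrative) codes agree once truncated to four characters.
same_shelf_area = truncate_code("QA76.9", 4) == truncate_code("QA76.5", 4)
```

In both cases values that differ in detail compare as equal after the transformation, which is the purpose of mapping them to a more general symbol.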
