Relevance and Value - Data, Information, and Knowledge

Data, Information, and Knowledge

2.2.8 Relevance and Value

Much of text information retrieval is concerned with these two terms: relevance and value. Basically, it is not so much a matter of whether what is retrieved is true as whether it is of use to the searcher. We may be searching for the source of a known lie, or for the name of the current president of the XYZ Corporation. In the first case, we want the lie, to see exactly what it said. In the second case, we might easily have retrieved a five-year-old record with the name of the then president. Is that person still president? Which of five items retrieved from a request for an explanation of global warming is best? We discuss this issue in more detail in Chapter 16.

2.3 Metadata

Metadata refers to data about data, or information about information.

Typically, metadata is descriptive of the organization or content of a body of data, such as a record or database. One of the oldest forms is a library catalog, an entry in which tells its reader where another information item is found (location) and something of its content and source. Location information is typically in the form of a classification and an author code indicating where on a shelf a book is stored. The card may also tell in which portion of the library the book is to be found. Content is described by the same subject classification, possibly additional subject classification codes (perhaps both Library of Congress and Dewey codes) and subject headings. Source information includes such attributes as publisher, date, and author name. If the item described is a mathematical table, relevant metadata may tell about the format of the data, number of significant digits, source of information, perhaps even the mathematical formula used to calculate the values and the value ranges in which the formula provides certain levels of precision. There is no ordained set of metadata. Whatever will help a user know how to find, use, or interpret the indicated information may be included.

As libraries have increasingly become depositories for large numbers of computer databases as well as conventional books, periodicals, and graphic images, it becomes increasingly important for them to be able to tell prospective users what is in these databases, what the data look like, and even what they mean. Hence, the equivalent of the old card catalog has grown in size and comprehensiveness.

In a sense, as simple a device as the thumb index of a dictionary constitutes metadata. It tells where entries beginning with a given letter begin and end. A library catalog card entry describing a book’s location or subject content clearly qualifies. Is a review of that book also metadata? Is biblical exegesis? Each is a discussion about some other information item. But they may be original items in themselves. Later in this volume we describe data structures that use indexes, then indexes to the indexes, possibly continuing for several levels. These, collectively, are metadata.

The U.S. Library of Congress developed a method for describing the structure of data records to enable libraries to exchange catalog records among institutions which may have different software and file structures. Called Machine Readable Cataloging (MARC), it was a standard for creating files that could be translated from or to almost any database system (Avram, 1969). Much of the information in a MARC record is metadata. Actually, it combines metadata with the data content.

In summary, metadata tells about an information item. It may tell what elements are present, such as author or title. It may give descriptive information that is not explicitly present in the item, such as the language of a text. Texts rarely say, “The language herein is English.” Or, as noted above, metadata may describe mathematical properties of data, such as precision of elements.
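As a rough illustration of this summary, the sketch below models a catalog entry in Python. The class name, field names, and helper function are assumptions made for this example and do not follow any cataloging standard. The bibliographic values come from the sample bibliography in Figure 2.1, and the language value is supplied by hand, since the text itself rarely states it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CatalogEntry:
    """Hypothetical catalog entry: it describes an item but is not the item."""
    title: str
    author: str
    publisher: str
    year: int
    shelf_location: Optional[str] = None   # e.g. classification plus author code
    subject_headings: List[str] = field(default_factory=list)
    language: Optional[str] = None         # rarely stated in the text itself


def present_elements(entry):
    """Return the names of the metadata elements that actually have a value."""
    return [name for name, value in vars(entry).items() if value not in (None, [])]


entry = CatalogEntry(
    title="Text Information Retrieval Systems",
    author="Meadow, Charles T.",
    publisher="Academic Press",
    year=1992,
    language="English",
)
print(present_elements(entry))
```

Elements left empty, such as the shelf location here, simply have no value; nothing requires every entry to carry every element.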

The answer to the question raised earlier about where metadata begins and ends is that it does not matter. There is no need to make hard and fast rules about whether book reviews are data or metadata. What is important is that we recognize the need for metadata and learn how to use it effectively.

In the new world of electronic publishing, files are stored on a server and transmitted to a client computer. These files may contain text, representations of sound, or graphics. The file will also contain information about layout, type fonts, and colors. All this is needed by a receiving computer in order for it to be able to display the records appropriately, possibly with different software than the sending computer used.


The Standard Generalized Markup Language (SGML) was developed to aid in exchanging text files (Goldfarb, 1990). It allows a user to define various elements of a document, and these definitions can then be used to locate information or to decide how to display the elements. A derivative of SGML has been developed for use with World Wide Web files, called the Hypertext Markup Language, or HTML (Graham, 1995). HTML is primarily used to control displays.

Figure 2.1 shows a small sample of text marked up with HTML. The metadata is the set of tags, which denote the role of an information element, its display color, type size, or other such facts. The figure shows the beginning of a bibliography that appeared in a page of a Web site. Some of the tags are standard. Users can create their own. For example, there is a tag for title that instructs an interpreting program how to display the text that follows. It could also be used by a search program to identify words that occur in the title of an article.

The author of a text is not always tagged as such, but any using organization could create such a tag, specify how authors’ names are to be displayed, and make authors’ names available to search programs. This sample shows only a limited number of usages and is intended only to provide a sense of what the language is like.
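To make the point about search programs concrete, here is a minimal sketch of how a program might use the title tag to find the words that occur in a document's title. It relies on Python's standard html.parser module. The class and function names are invented for this example, and the sample string is a shortened version of the markup in Figure 2.1.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text found between <title> and </title>."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only text inside the title element is kept; body text is ignored.
        if self.in_title:
            self.title_parts.append(data)


def title_words(html_text):
    """Return the lowercased words of the document title."""
    parser = TitleExtractor()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.title_parts).lower().split()


sample = (
    "<html><head><title>Charles T. Meadow Personal Bibliography</title></head>"
    "<body>PUBLICATIONS, BOOKS:</body></html>"
)
words = title_words(sample)
print(words)
```

The same tag that tells a browser how to display the title is what lets the search program restrict its attention to title words.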

<html>
<head>
<title>Charles T. Meadow Personal Bibliography</title></head>
<body bgcolor="#ffffff" text="#000000" link="#008080" vlink="#ff0000" alink="#ffff00">
PUBLICATIONS, BOOKS:
<ul><li>Boyce, Bert R.; Meadow, Charles T., and Kraft, Donald H. Measurement in Information Science. San Diego: <a href="http://www.apnet.com">Academic Press</a>, 1994.
<li>Meadow, Charles T. Text Information Retrieval Systems. San Diego: <a href="http://www.apnet.com">Academic Press</a>, 1992.

Figure 2.1

Sample of an HTML-marked text: the symbol <html> denotes the start of a text; <head> indicates the start of a text element called header and </head> ends that element. <li> starts a new list item, <p> marks a paragraph boundary, and <a href=> introduces a link to a related Web site, in this case that of the publisher of the book cited.

Attempts are being made to develop metadata standards, i.e., to specify what attributes of a document should be made explicit. One such standard is the Dublin Core (named after Dublin, OH, where it was mainly developed). This is a continuing attempt to develop a standard set of metadata attributes to be used for describing a document in a library context (Weibel, 1997). The core consists of 15 data elements, not all of which will have a value for each document (The Dublin Core, 1998). They are shown in Table 2.1.

Other contexts might want to emphasize somewhat different attributes, such as reliability of source, security classification, or type of paper or binding.
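A Dublin Core description can be pictured as a record with one slot per element, many of which may be empty. The sketch below is only an illustration of that idea. The element names are copied from Table 2.1; the function name and the use of None for a missing value are assumptions of this example, not part of the standard.

```python
# The 15 element names as listed in Table 2.1.
DUBLIN_CORE_ELEMENTS = [
    "Title", "Author/creator", "Subject/key words", "Description", "Publisher",
    "Other contributor", "Date", "Resource type", "Format",
    "Resource identification", "Source", "Language", "Relation", "Coverage",
    "Rights management",
]


def make_record(known):
    """Build a record with all 15 elements; those not supplied are None."""
    unknown = set(known) - set(DUBLIN_CORE_ELEMENTS)
    if unknown:
        raise KeyError(f"not Dublin Core elements: {sorted(unknown)}")
    return {name: known.get(name) for name in DUBLIN_CORE_ELEMENTS}


record = make_record({
    "Title": "Text Information Retrieval Systems",
    "Author/creator": "Meadow, Charles T.",
    "Publisher": "Academic Press",
    "Date": "1992",
})
filled = [name for name, value in record.items() if value is not None]
print(filled)
```

A context that cares about other attributes, such as security classification, would need its own element list; this sketch rejects such names rather than silently accepting them.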

2.4 Knowledge Base

A database, however true or valid its content, is a collection of data that can be searched by a retrieval program as directed by a user. A knowledge base (KB) in this context is a set of information used by a person or program to perform a function. In IR, a KB may be used (Teskey, 1989) by the retrieval software to:

1. Select the database to be searched;

2. Assist users in composition or modification of a search;

3. Parse data (messages) received from the database, meaning to decompose a message into its components and identify the syntactic role of each;

4. Interpret, or assist the user to interpret, output; and

5. Make decisions or assist in deciding whether to accept retrieved output or revise a query.

Information retrieved from the database or provided directly by the user can be used by the system as knowledge. If a user places high weight or importance on certain subject-describing terms, then this information can be used by a program to improve retrieval results. Note that the knowledge in this case is knowledge of the importance of terms, not of the existence of the terms. It is also possible for a program, given a set of search terms, to find other terms likely to be useful. This could be based on enough knowledge of a subject field to make word associations within the context of that field: to know, e.g., that file and database are closely related terms in the context of IR or computer science, where either could be used for the other, but not in the context of carpentry.
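The idea of context-dependent term associations can be sketched as a small lookup structure. Everything in the example below is hypothetical: the association table is made up for illustration (the carpentry entry in particular is an invented example, not taken from any real thesaurus), and a working system would draw on a far larger knowledge base.

```python
# Hypothetical knowledge base: for each subject context, terms that may be
# substituted for one another. The entries are invented for illustration.
RELATED_TERMS = {
    "information retrieval": {"file": {"database"}, "database": {"file"}},
    "carpentry": {"file": {"rasp"}},
}


def expand_query(terms, context, kb=RELATED_TERMS):
    """Add terms that the KB associates with the query terms in this context."""
    associations = kb.get(context, {})
    expanded = list(terms)
    for term in terms:
        for related in sorted(associations.get(term, ())):
            if related not in expanded:
                expanded.append(related)
    return expanded


print(expand_query(["file"], "information retrieval"))
print(expand_query(["file"], "carpentry"))
```

The same search term yields different expansions depending on the context supplied, which is the knowledge the program is using.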


Table 2.1

Elements of the Dublin Core

Title               Other contributor         Source
Author/creator      Date                      Language
Subject/key words   Resource type             Relation
Description         Format                    Coverage
Publisher           Resource identification   Rights management

Note: The core consists of 15 data elements used to describe a document expected to be stored and retrieved electronically. Not all are of equal value for each document.

Source: Dublin Core (1998).


Alternatively, relatedness can be based entirely on statistical characteristics of a text or set of texts.
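One simple statistical characteristic is how often two words occur in the same text. The sketch below counts such co-occurrences and treats words that appear together in at least two texts as related. The stop-word list, the threshold, and the three sample texts are all assumptions chosen to keep the example small; they are not drawn from the source.

```python
from collections import Counter
from itertools import combinations

# Hypothetical stop-word list, kept tiny for the example.
STOP_WORDS = {"the", "a", "is", "in", "each", "has"}


def cooccurrence_counts(texts):
    """Count, for each pair of words, the number of texts containing both."""
    counts = Counter()
    for text in texts:
        words = sorted(set(text.lower().split()) - STOP_WORDS)
        for pair in combinations(words, 2):
            counts[pair] += 1
    return counts


def related_terms(term, texts, min_count=2):
    """Return words that share at least min_count texts with the given term."""
    related = []
    for (a, b), n in cooccurrence_counts(texts).items():
        if n >= min_count and term in (a, b):
            related.append(b if a == term else a)
    return sorted(related)


texts = [
    "the database file is indexed",
    "each file in the database has records",
    "the carpenter used a file",
]
print(related_terms("file", texts))
```

No subject knowledge is involved here: the association between file and database emerges only because the two words appear together in more than one text.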

In brief, when a program retrieves data or information from a database or an online user, and uses that information as the basis for a subsequent query, then that information is knowledge by our usual definition. It changes the state of a system (the information retrieval system), or it is the basis for action (terminating a search or revising and continuing it). A query might ask for the name of the CEO of each company marketing a given type of product and earning revenues of at least $X. The next query can use the retrieved names in a request for information on their addresses, or universities attended, or directorships held. (Such a search might be done by a fund-raising group within a university.) In this second query, the names are now treated as factual information or knowledge; that is, we are no longer questioning whether these are the names of CEOs because the KB has told us they are.

There is no special form or structure of a KB. It can have any form that can be used by an interpreting program. The list of names of CEOs we just postulated would have acquired, at least temporarily, the status of a “model of the world,” which has been proposed as the distinction between information and knowledge (Teskey, 1989). Probably the most common forms are frames and rules. A frame is essentially a record that describes a context or entity. Rules are typically stated as in programming languages:

IF TITLE CONTAINS “EXECUTIVE” THEN RELEVANCE(RECORD_NUMBER) = 1.
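A rule of this kind translates almost directly into code. In the sketch below, each frame is represented as a dictionary and the rule is a function applied to every frame. The sample frames, the function names, and the choice of 0 as the relevance of a record that fails the test are assumptions of this example; the rule as stated says only what happens when the condition holds.

```python
def title_contains_executive(frame):
    """Condition part of the rule: does the TITLE field contain EXECUTIVE?"""
    return "EXECUTIVE" in frame.get("TITLE", "").upper()


def apply_rule(frames):
    """Set relevance to 1 where the rule fires, and (by assumption) 0 elsewhere."""
    relevance = {}
    for record_number, frame in frames.items():
        relevance[record_number] = 1 if title_contains_executive(frame) else 0
    return relevance


# Hypothetical frames: each is a record describing one entity.
frames = {
    101: {"NAME": "A. Example", "TITLE": "Chief Executive Officer"},
    102: {"NAME": "B. Example", "TITLE": "Staff Engineer"},
}
relevance = apply_rule(frames)
print(relevance)
```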

Where does the knowledge in a KB come from? People who build these are coming to be called knowledge engineers. The content of a KB may come from experts in a field in which the program operates (for example, epidemiology or securities investing). It is the knowledge engineer’s task to find the information and put it in a form useful to the software. At times, software designers and producers do not seek expert knowledge of a subject domain or of user behavior. Probably all readers have seen examples of software in which the developing programmer’s assumptions about its use have been substituted for knowledge of how the intended users will react to it.

The term knowledge base is relatively new, but “smart” programs have been in existence for some time. Programs that can play chess or checkers were available as early as the 1950s, the first decade in which computers were a commercial product (Samuelson, 1959). “Intelligent” information retrieval dates to the 1960s (Salton, 1980). In these early days, the knowledge used by the programs tended to be incorporated into the program, not maintained as a separate database for use by the program.

Suppose a company wants to find its most valuable employees for a management development program. It could take any of several approaches. Before it can find these people, or more accurately, before it can find their records in the personnel database, the company must define its meaning of valuable. Here are some approaches to the combined problem of defining and finding records.


1. Look for all those earning more than $100,000 per year in salary. This is very easy but overly simplistic. The decision to pay the person that much has already been made. Now, you are in effect looking for those who deserve it or who do not yet earn at that level but have the potential to do so.

2. Look for all those earning a salary above $90,000 and either having a title including the term MANAGER or VICE PRESIDENT or, if in the SALES DEPARTMENT, having recently exceeded sales quota by at least 10%. This begins to suggest that the selection criteria are not easily fixed, that perhaps it would be worth having a separate program or set of rules to define the current meaning of most valuable employees.

3. Maintain a copy of the corporate organizational structure, coded in such a way that the x% highest ranking people can be readily identified, and maintain a payroll file so that the top y% of earners in all categories can be identified, and maintain a list of accomplishments such as patents or management performance awards, and maintain a history of each employee from which the speed of previous promotions can be computed, and so on. By this time, we are recognizing that selection is a complex process, likely to change with time, and is something we may want personnel specialists, not programmers, to be in charge of defining. So we separate the set of rules for finding valuable people, the knowledge base, from the programming of software that will use the knowledge base to search the database. When these more complex criteria are used, the data are not likely to be available in neat, well-tagged files, but scattered through many text fields.
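The separation described in the third approach can be sketched by keeping the selection criteria in a data structure that personnel specialists could edit, apart from the program that applies them. The example below encodes one reading of the second approach: a salary above the threshold, combined with either a qualifying title or a sales quota exceeded by the stated margin. The field names, the employee records, and that reading of the criteria are all assumptions of this sketch.

```python
# Knowledge base: selection criteria held as data, separate from the program.
KNOWLEDGE_BASE = {
    "min_salary": 90_000,
    "title_terms": ["MANAGER", "VICE PRESIDENT"],
    "sales_department": "SALES",
    "min_quota_excess": 0.10,
}


def is_valuable(employee, kb):
    """Apply the current criteria in the KB to one personnel record."""
    if employee["salary"] <= kb["min_salary"]:
        return False
    title = employee["title"].upper()
    has_title = any(term in title for term in kb["title_terms"])
    beat_quota = (
        employee["department"].upper() == kb["sales_department"]
        and employee.get("quota_excess", 0.0) >= kb["min_quota_excess"]
    )
    return has_title or beat_quota


def find_valuable(employees, kb):
    """Search the (hypothetical) personnel database using the KB."""
    return [e["name"] for e in employees if is_valuable(e, kb)]


employees = [
    {"name": "Employee A", "salary": 95_000, "title": "Regional Manager",
     "department": "Operations"},
    {"name": "Employee B", "salary": 92_000, "title": "Account Representative",
     "department": "Sales", "quota_excess": 0.15},
    {"name": "Employee C", "salary": 120_000, "title": "Analyst",
     "department": "Finance"},
    {"name": "Employee D", "salary": 80_000, "title": "Vice President",
     "department": "Sales", "quota_excess": 0.30},
]
print(find_valuable(employees, KNOWLEDGE_BASE))
```

Changing the meaning of valuable then means editing KNOWLEDGE_BASE, not rewriting find_valuable, which is the division of labor the text argues for.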

2.5 Credence, Justified Belief, and Point of View

It seems clear that in everyday use of the words, “higher” appellations like information, knowledge, meaningful, and wisdom are used for texts that we understand and find useful, while data implies some degree of lack of proof of validity or value or that some processing is needed to bring out the information, like developing photographic film. It also seems clear that these higher designations are not inherent in the data but are a function of the beholder, because surely different people will understand and trust messages differently. What is data to one may be wisdom to another. What is wisdom to one may be a falsehood to another. If a program, rather than a human, is the beholder, and if it has been designed to “believe” certain data, then those data become its knowledge. Do computers believe? Most of us might say no, but they do accept some data and act upon or reject others. Practitioners of propaganda, whether of the commercial advertising or political persuasion type, know well that people can be conditioned to believe certain messages regardless of the content of the messages.

Why does it matter? There are still those who believe that any computer output is “right.” Others will say that errors in output are the fault of the computer, not the person who provided the data or who wrote the program. In information retrieval, users must understand what can be considered true, correct, or valid and what merely contains words that were requested.

When a program retrieves a record because it contains a particular key word, there is no implication that the program values the record for any reason other than the simple fact that it contains the specified word. When a program uses a record of its own KB to learn how to parse a database attribute, it “believes” this record: it bases its actions on the record’s content without further verification. That is credence, even for a machine.

Similarly, a user searching in a library catalog, having found a card for a book on the desired subject, really knows only that the book’s classification code as assigned by the library matches the desired one. Using his or her personal KB, the user is able to make use of words in the title, the name of author or publisher, citations, published book reviews, or friends’ recommendations to establish a higher level of potential value than is implied by the subject code.

Perhaps the user will accept the information in the classification schedule, or apply specialized knowledge to reject it. A user not familiar with the subject field being searched is probably more likely to accept this authority than is an expert in the field whose more detailed knowledge may lead him or her to question the classification. Experienced library users come to learn that catalogers do differ among themselves, and hence that a subject classification perhaps should not be treated as knowledge; it is data, subject to verification of its truth to each user.

It has been reported (this may be apocryphal, but the story makes an interesting point) that during the Cold War computers at the U.S.–Canadian North American Air Defense Command had more than once indicated a ballistic missile attack to be in progress against North America. The human operators of the information systems did not credit this computer-evaluated data. They would not act on it as if it were established, justified truth, because it did not coincide with their perception of world conditions. In other words, a world war is unlikely to start in the absence of high international tension. This kind of tension is not readily measurable by a computer, but can be perceived by intelligent, informed people. There was the possibility, of course, that there was an attack, but it was unintended. At any rate, the information system users were reputedly unwilling to take action on the basis of computer messages in which they could not place much credence.

A common criticism of early bibliographic IRS among inexperienced users was that they did not retrieve information—they only retrieved citations to information. Citations are metadata. This complaint had some validity because the bibliographic records are not usually what the searcher is ultimately looking for. On the other hand, if the searcher is looking for the correct spelling of an author’s name, the proper citation for a known book, or its location in the library, then the record does have information, not merely data, and will usually be believed.

If the object of a search is to find out how to produce room-temperature nuclear fusion, then the searcher has to resolve two questions after retrieving citation records: (1) Are the articles “pointed to” really on the desired subject, i.e., do


From the document Text Information Retrieval Systems (pages 62-72)