
Interpretation and Execution of Query Statements

From the document Text Information Retrieval Systems (pages 196-200)

8.1 Problems of Query Language Interpretation

In Chapter 7, we described the languages and logic of information retrieval systems from the point of view of a user. Here, we begin our look inside the IRS to understand how commands are interpreted and executed.

Earlier, in Chapter 2, we discussed some of the problems of language in general and the opinions of scholars in the field that words do not have fixed meanings, established before use. Meaning is to some extent determined by context, even in artificial languages, but the problems attendant upon imprecise meanings are much worse in natural language. Natural language, moreover, has inexact syntax, and this adds to the ambiguity of words. The process of analyzing a statement in a language and breaking it into its constituent elements or roles is known as parsing, some aspects of which were discussed in Chapter 6.

Knowing the role of a word helps us understand its meaning. For example, here are some almost classic ambiguous sentences in English. In each, it is not possible to know the exact meaning of all the words because there is more than one possible way to parse the sentence, hence more than one possible meaning. These are presented without the context usually needed to help resolve the ambiguity.

At the physics conference, she gave her paper on time. Was this a paper about the subject of time, or a paper on an unstated subject that was delivered as scheduled?

Time flies like an arrow. On first reading this seems to be unambiguously saying that time passes quickly. That interpretation makes time the subject and flies the verb. But suppose a computer, lacking human experience, is interpreting the sentence. It could decide that flies is the subject, in which case time could be an adjective, denoting a kind of fly. Now, one need not be an entomologist to know (or suspect) that there is no such thing as a time fly. This ambiguity might be resolved with the aid of a thesaurus, but a thesaurus of the general language is hardly likely to list every species of every insect. Can the computer know what kinds of flies exist?

At any rate, if the subject is flies, then the verb could be like, and the whole thing could mean that this kind of fly likes to eat arrows. Yet a third interpretation is that the sentence says one should measure the speed of flies in the same manner as one measures the speed of an arrow.
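The three readings can be reproduced mechanically. The following is a minimal sketch, not a real parser: the lexicon, the part-of-speech labels, and the three sentence patterns are all assumptions made up for this illustration. It assigns every allowed part of speech to every word and keeps the combinations that fit a known pattern.

```python
from itertools import product

# Toy lexicon (an assumption for illustration): every part of speech
# each word is allowed to take.
LEXICON = {
    "time": {"NOUN", "VERB", "ADJ"},
    "flies": {"NOUN", "VERB"},
    "like": {"VERB", "PREP"},
    "an": {"DET"},
    "arrow": {"NOUN"},
}

# Toy sentence patterns standing in for a grammar, each with a gloss.
PATTERNS = {
    ("NOUN", "VERB", "PREP", "DET", "NOUN"): "time passes the way an arrow does",
    ("ADJ", "NOUN", "VERB", "DET", "NOUN"): "flies of the 'time' kind are fond of an arrow",
    ("VERB", "NOUN", "PREP", "DET", "NOUN"): "measure flies as you would measure an arrow",
}


def readings(sentence):
    """Return every (tags, gloss) pair whose tag sequence fits a pattern."""
    words = sentence.lower().rstrip(".").split()
    options = [sorted(LEXICON[word]) for word in words]
    found = []
    for tags in product(*options):  # try every combination of word roles
        if tags in PATTERNS:
            found.append((tags, PATTERNS[tags]))
    return found


for tags, gloss in readings("Time flies like an arrow."):
    print(" ".join(tags), "->", gloss)
```

Nothing in the sketch prefers one reading over another; choosing among them is exactly the knowledge a program lacks without context.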

He ate the cake. Here is a sentence without much apparent variation in syntax, but the meaning can vary according to the emphasis placed on the spoken words. As the emphasis shifts to different words of the sentence, the meanings may vary in these ways (the stressed word is shown in quotation marks):

"He" ate the cake. That male person, or animal, is the one who ate the cake.

He "ate" the cake. Among the many possible things that male might do with cake, he chose to consume it.

He ate the "cake." Of all things that he might have eaten, he chose the cake.

Emphasis can be shown in print with italics or quotation marks, but at least as often it is not shown typographically. Then, of course, there are idiomatic expressions whose meaning bears no apparent relation to the meanings of the individual words. That takes the cake means that some thing or activity is highly rated and has nothing to do with food, although this expression apparently has its origin in the awarding of a cake as a contest prize.

These, then, are some of the problems. Let us examine some of the procedures for handling them.

8.1.1 Parsing Command Language

Many subsidized systems and most Web search engines simply assume that any entry is a set of search terms to be combined by a default Boolean operation.

If they have an advanced searching module, the parsing discussion below will normally apply, as it does in traditional remote and CD-ROM systems. Thus our examples are mostly from Dialog, a service of Dialog Information Services.

The usual parsing method is to make a left-to-right scan of a statement, looking at each word or code for occurrences of strings of characters that can “legally” occur in a given position. This logic applies to most of the classic IRS but to few modern Web-based search engines.

Most command language syntax calls for a command to be followed by one or more arguments, attributes, or parameters. The command is stated as an imperative verb, such as SELECT. The parsing program would begin at the left end of the statement and look for a substring that constitutes a valid command in the language. The arguments may be meaningful only in the context of a specific command. TYPE1/5/8 tells Dialog to type record 8 of set 1, using format 5, but EXPAND1/5/8 will cause Dialog to look for the string 1/5/8 in its inverted file.

The language of BRS (founded in the United States, now operated in Waterloo, Ontario, Canada by Open Text Corporation) uses modes. A command causes the IRS to enter a mode, and thereafter query statements are interpreted in the context of that mode. The set-forming command FIND needs no verb because it is understood. This causes the system to enter search mode, and the string PRINT following the mode statement would be interpreted as a search term, not a command. It is as if a full command . . . FIND PRINT had been issued. A new mode statement, . . . PRINT, say, would change the mode from search to printing, hence the defaults would change to those of the PRINT mode.
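Mode-dependent interpretation can be pictured as a small state machine. The sketch below is an illustration only and does not reproduce actual BRS syntax: the `..` marker for a mode statement, the two mode names, and the class name ModalInterpreter are assumptions.

```python
MODES = {"FIND", "PRINT"}  # assumed mode names, for illustration only
MODE_PREFIX = ".."         # hypothetical marker that flags a mode statement


class ModalInterpreter:
    """Interpret each input line in the context of the current mode."""

    def __init__(self):
        self.mode = "FIND"  # assume the session starts in search mode

    def interpret(self, line):
        text = line.strip()
        if text.startswith(MODE_PREFIX):
            # A mode statement: switch mode, then treat any remainder
            # of the line as input to the new mode.
            body = text[len(MODE_PREFIX):].strip()
            head, _, rest = body.partition(" ")
            if head.upper() not in MODES:
                raise ValueError(f"unknown mode: {head}")
            self.mode = head.upper()
            text = rest.strip()
            if not text:
                return (self.mode, None)
        # Anything else is data for the current mode, even if it
        # happens to spell the name of another mode.
        return (self.mode, text)


irs = ModalInterpreter()
print(irs.interpret("..FIND"))   # enter search mode
print(irs.interpret("PRINT"))    # PRINT is a search term here
print(irs.interpret("..PRINT"))  # now switch to print mode
```

The second call shows the point of the passage: once search mode is active, the word PRINT is passed through as a search term rather than recognized as a command.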

Recognizing the command can be done by looking for the first blank character following the first non-blank or for the end of the command string (computer code for the ENTER key), then treating everything between as the potential command. However, many languages “forgive” the absence of a space between the command and its argument. Therefore, the most likely way to locate the command in the input string of characters is to start at the beginning and look for any substring that matches a legal command. Search in descending order of command or abbreviation length: SELECT (6 characters), EXPAND (6), PRINT (5), TYPE (4), PAGE (4), PR (2), E (1), T, P, . . . . This assures that an initial substring of PR, if not followed by INT, will be matched with the abbreviation PR, rather than P. The list must always be so ordered that the full command precedes its abbreviation, which is not normally much of a problem.
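The longest-first search can be sketched in a few lines. This is a minimal illustration, not Dialog's implementation: the command table holds only the names listed above, and the function name split_command is an assumption.

```python
# Commands and abbreviations named in the text; a real table would be larger.
COMMANDS = ["SELECT", "EXPAND", "PRINT", "TYPE", "PAGE", "PR", "E", "T", "P"]


def split_command(statement, commands=COMMANDS):
    """Return (command, argument) using a longest-match-first prefix search."""
    text = statement.lstrip()
    upper = text.upper()
    # Sorting by descending length guarantees that a full command is
    # tried before any abbreviation that is a prefix of it.
    for name in sorted(commands, key=len, reverse=True):
        if upper.startswith(name):
            return name, text[len(name):].strip()
    return None, text  # no legal command at the start of the statement


print(split_command("PRINT 1/5/8"))  # full command wins over PR and P
print(split_command("PR 1/5/8"))     # PR wins over P
print(split_command("SELECTCAT"))    # a missing space is forgiven
```

Because matching is by prefix, no blank is needed between command and argument, which is the forgiving behavior described above.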

Once the command is identified, the program would use a table (part of a knowledge base) to tell it what kind of argument or parameter list follows this particular command. Possible argument forms include: single term, sequence of terms unmarked as a phrase, single equivalence statement, Boolean combination of equivalence statements. Following SELECT, e.g., the rest of the statement might be NAPOLEON or AUTHOR = SMITH, J.C. Another sequence following SELECT could be S1 OR AUTHOR = SMITH, J.C., where S1 denotes a previously defined set number. Yet another possibility is that the first character after the command SELECT is a left parenthesis, as in SELECT (SUBJECT = CAT OR SUBJECT = DOG) AND AUTHOR = TERHUNE, A.P. (Remember that most text retrieval systems do not require quotation marks around a string constant, while most database management systems do.)

In the statement SELECT AUTHOR = SMITH, J.C., the usage AUTHOR = SMITH, J.C. is an attribute-value statement, as is just DOG, but in the latter case the attribute name and equivalence symbol are understood, or revert to default values. This is a de facto standard in text-retrieval systems. The default may be called a basic index and apply to terms from a title, abstract, or descriptor field. The exact choice of attributes to constitute the default basic index may vary with the database in use.
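The fallback to a default attribute can be sketched as follows. This is an illustration under stated assumptions: the equivalence symbol is taken to be `=`, the attribute table holds only AUTHOR and SUBJECT, and the label "BASIC INDEX" and the function name parse_equivalence are invented for the example.

```python
KNOWN_ATTRIBUTES = {"AUTHOR", "SUBJECT"}  # assumed attribute table
DEFAULT_ATTRIBUTE = "BASIC INDEX"         # assumed label for the default


def parse_equivalence(argument):
    """Split 'ATTRIBUTE = value'; fall back to the default attribute."""
    name, symbol, value = argument.partition("=")
    if symbol and name.strip().upper() in KNOWN_ATTRIBUTES:
        return name.strip().upper(), value.strip()
    # No recognizable attribute name: the whole argument is the value,
    # and the attribute reverts to the default.
    return DEFAULT_ATTRIBUTE, argument.strip()


print(parse_equivalence("AUTHOR = SMITH, J.C."))
print(parse_equivalence("DOG"))
```

Which fields the default actually covers would be a property of the database in use, so in practice the default would be looked up rather than hard-coded.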

The usage S1 OR AUTHOR = SMITH, J.C. is a compound attribute-value statement, as is S1 AND (SUBJECT = CAT OR SUBJECT = DOG) AND AUTHOR = SMITH, A.B. The knowledge base must, in some manner, specify all the variations of such usages that are permitted in the language in use. In Dialog, if there has been no set 3 created, the usage SELECT S3 AND DOG will be interpreted as asking for the string S3 and the string DOG, both in the basic index. On occasion this is a source of confusion to the user because, e.g., S3 could be the model number of some device. In the long run, the omission of quotation marks probably saves more trouble than it creates.
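The set-number rule amounts to a lookup with a fallback. In this minimal sketch the function name resolve_operand and the representation of defined sets as a Python set of integers are assumptions; only the rule itself comes from the text.

```python
import re


def resolve_operand(token, defined_sets):
    """Treat 'S<number>' as a set reference only if that set exists."""
    match = re.fullmatch(r"[Ss](\d+)", token)
    if match and int(match.group(1)) in defined_sets:
        return ("SET", int(match.group(1)))
    # Otherwise the token is an ordinary search term, to be looked up
    # in the basic index like any other string.
    return ("TERM", token)


existing = {1, 2}  # sets created so far in this hypothetical session
print(resolve_operand("S1", existing))   # a defined set
print(resolve_operand("S3", existing))   # no set 3, so a search term
print(resolve_operand("DOG", existing))  # plain term
```

The same token therefore changes meaning as the session proceeds, which is the source of the user confusion noted above.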

In the Dialog language, both SELECT and EXPAND are valid commands. If the parser receives the statement SELECT EXPAND CAT, it would find a valid command in the first position within the statement and, because the context is now established, will not consider that EXPAND is also the name of a command. It will treat EXPAND as part of the argument of the SELECT command and will not consider the possibility that the user made a mistake but did not know how to correct it, so simply typed a new command following the previous incomplete one.

A number of similar statements that appear ambiguous to a human reader are not so to a program because it takes the query statement apart in a prescribed order and never notices any ambiguities:

SSELECT CAT (SELECT misspelled, with two leading S's) is interpreted as SS (abbreviation for the two-word command SELECT SETS) followed by the argument ELECT CAT. There is no legal full command at the start of this statement; hence, an abbreviation is sought and found, and of the two possible abbreviations, SS and S, it selects the longer. What follows is neither a set number nor a parenthetical expression, so it is treated as a search term, even though it was intended as part of the command.

Further, since the argument consists of a phrase of two or more words with no attribute designated, Dialog takes it to imply the descriptor attribute because in that system a descriptor, even though a phrase, is treated as a single character string for searching purposes.

SELECTION is treated as the command SELECT followed by the argument ION. Nearly all Internet search engines, ORBIT, and the National Library of Medicine’s MEDLINE assume that the absence of a command implies the command FIND, that is, that the system is in Find mode, essentially the same as Dialog’s SELECT. MEDLINE would treat FINDER as a term, not a command. (ORBIT, an early IRS, is now part of Questel·Orbit, Inc.)

S ELECTION is treated by Dialog as the abbreviation S followed by the argument ELECTION.

P1 /TI,AU/1-5 is a complex-looking statement. The argument fits the pattern for a PRINT command and specifies the set (1), the format (attributes TI for title and AU for author), and the record numbers 1–5 of the designated set. The problem, though, is that P is the abbreviation for the PAGE command, which is contextually dependent, i.e., PAGE is only valid if a preceding command had produced a multipage display (or was another PAGE). This is a common and easy mistake for users to make. The left-to-right scanning method will completely misunderstand it: to Dialog this is a PAGE command, to a user it looks like a PRINT command.
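One way to picture the context dependence is a validity check applied after the command is matched. The sketch below is hypothetical and does not describe how Dialog responds: the abbreviation table, the function name interpret, and the flag multipage_display_pending are assumptions.

```python
# Only the two abbreviations at issue; a real table would hold many more.
ABBREVIATIONS = {"P": "PAGE", "PR": "PRINT"}


def interpret(statement, multipage_display_pending):
    """Match the command left to right, then check it against context."""
    text = statement.strip().upper()
    for abbrev in sorted(ABBREVIATIONS, key=len, reverse=True):
        if text.startswith(abbrev):
            command = ABBREVIATIONS[abbrev]
            argument = text[len(abbrev):].strip()
            break
    else:
        return ("ERROR", "no command recognized")
    # PAGE only makes sense if the previous command left more to show.
    if command == "PAGE" and not multipage_display_pending:
        return ("ERROR", "PAGE is not valid in this context")
    return (command, argument)


print(interpret("P1 /TI,AU/1-5", multipage_display_pending=False))
print(interpret("PR1 /TI,AU/1-5", multipage_display_pending=False))
```

The check can reject the statement, but it cannot tell that PRINT was intended; a left-to-right scan has already committed to PAGE by the time the argument is examined.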

In the last example, a right-to-left scan might be able to do better, i.e., one that first recognizes a range of what would be taken to be record numbers,
