Querying the Information Retrieval System
7.3.3 Boolean Query Logic
Historically the first and still the most common method used for express- ing micro logic in query statements is Boolean algebra. This involves specifying the operations to be performed on sets that have been defined by relationship statements or previous set operations. The sets are all subsets of a universe of dis- course, which in this case is the database being searched. Sets within this universe of discourse are combined using the Boolean operators defined below and illus- trated in Fig.7.2:
1. AND—Given sets 1 and 2, a third set can be defined that includes only records that are in both sets 1 and 2. Set 1 may contain records with date > 881014,
156
7 Querying the Information Retrieval System1
1
2
3
2 3 4 5 6 7
Figure 7.2
Venn diagram showing meanings of Boolean operations; regions 1, 2, and 3 represent three sets, or concepts. Regions 4, 5, and 6 are the intersections of sets 1-2, 2-3, and 1-3, respec- tively. Region 7 is the intersection of all sets 1, 2, and 3. Set 1 OR 2 is anything in region 1 or 2 or both. Set 1 NOT 2 is region 1 less region 4. Set 2 EXCLUSIVE OR 3 is region 2 and 3 but not anything in region 6.
Ch007.qxd 11/20/2006 10:07 AM Page 156
and set 2, those with subject ⫽TENNIS. Then if set 3 is defined as SET1 AND SET2, it would contain records with both the desired date range and the desired subject.
This is called the logical product, intersection, or conjunction of the first two sets.
2. OR—Given two sets a third can be formed from records that are in either the first or the second or both. This is written SET1 OR SET2, and is called the logical sum, union, or disjunction of the first two sets. ORis almost always used in the inclusive sense, meaning records that are in set 1 or set 2 or both.
3. NOT—While ANDand OR are binary operators in the sense that their operations require at least two sets within the universe of discourse, NOT is a unary operator, meaning that it can be applied to a single set within the universe of discourse to specify all elements other than those in the set to which it is applied. If the single set is A, then NOT A is known as the complement of A.
Normally in a retrieval context, NOT is considered to mean AND NOT, and is treated as a binary operator. Thus if records are wanted about WIMBLEDONbut not about TENNIS, the logic is that set 1 is to contain records with WIMBLEDON; set 2, records without TENNIS; and set 3, their intersection, records containing
WIMBLEDONbut not TENNIS. Symbolically, SET3 ⫽SET1 AND NOT SET2. Set 3 is called the logical difference of sets 1 and 2.
4. EXCLUSIVE OR—Found only occasionally in operational systems, this operator forms a new set containing members that are in one set or another, but not both. It may be expressed as the union of two sets from which the intersec- tion has been excluded, or (S1 OR S2) NOT(S1 AND S2). An example of its use might be a search for medical patients who are taking drug A or drug B but not both, because a research project wants statistics on side effects, but wants to exclude the complications of interaction between these two drugs. If allowed in a query language, its symbol is likely to be XOR.
The not operator is easily misused. A record does not really have to be about Wimbledon as a geographic entity to mention that name. (She played bet- ter in the U.S. Open than last year at Wimbledon.) The use of NOTlogic can be sub- tle. The searcher may be interested in the sociology of the Wimbledon area of London and not the sport of tennis, but refusal to accept occurrence of the word
TENNIS in a record would also exclude one that says something like, “But Wimbledon is not just the home of a tennis stadium, it also . . .” NOTis practi- cal to use where the values of the attributes are controlled and thus have very low ambiguity for the records under consideration. If a record has a language field, NOTmight be used to exclude certain languages. Symbolically: SET1 NOT LA⫽JAPANESE.
This example brings up a point worth some expansion. The fact that the term Japanese appears in the text of the record is no indication that the record is in the Japanese language. Likewise a person’s name in the text of a record is no indication that the person is the author of the record, or even a prime topic of the record. Thus records have normally been structured into fields where stored information has a specified role. A field label may be appended or prefixed at the time indexing and record creation takes place. Some fields (normally those
7.3 Query Logic
157
considered to have a high likelihood of indicating record topicality) will be searched by the search engine when no fields are specified by the query (the search default). These tend to be fields such as the title or abstract, or fields con- taining indexing terms. Other fields tend to be searched only if the field label is specified; for example, author, language, or publication date.
This is only effective when a common controlled record structure is uti- lized in the search engine’s databases. Since the records of the Web have only the common structure of Hyper-Text Markup Language, and indication of a term’s role is dependent on the proclivities of each individual Web page author, one is largely searching the Web as free text.
Normally, experienced searchers do not expect to get the exact results they want on the first try, especially when searching records on the basis of words contained in a natural-language text. Therefore, their approach is to use a series of search statements or queries, beginning as probes and gradually improving on earlier results. Such users define sets, find out their size, perhaps browse through their contents, and then define a new set based on a modification of an earlier one. Sometimes it is necessary to back up, to return to a set other than the most recent one. To accommodate this approach to searching, the set number may be stated in a set definition statement, as if it were an attribute-value statement.
For example, suppose we are looking for information about the nutri- tional value of prepared foods for pet cats and dogs. It is worth a diversion here to point out that use of and in the preceding sentence has the meaning of or in Boolean logic. “Cats and dogs” in colloquial usage means, “either cats or dogs or both,” but a rigidly literal interpretation might be “what is in common to, or a member of both sets of, cats and dogs.” This is one of the reasons it can be difficult to teach Boolean logic applied to natural language. Now, the search might start with:
SELECT PET OR PETS (Creating Set 1, say of 52,000 records) Next, try
SELECT DOGS OR CATS (Creating Set 2 of 8000 records, seemingly better than the 52,000 of the first try)
SELECT FOOD AND (Set 3, 5000 records)
NUTRITION
SELECT S2 AND S3 (Set 4, 3 records)
But this last may seem too small, so the searcher goes back to the more general
SELECT S1 AND S3 (Set 5, 27 records)
Of course, using fictitious set sizes is quite artificial, but this example does convey the value of being able to return to any previously defined set and resume the search from that point.
Some IRS use the logic of automatically ANDing a subsequent query state- ment to the most recent one. Searchers using such a system would start with a
158
7 Querying the Information Retrieval SystemCh007.qxd 11/20/2006 10:07 AM Page 158
broad set definition, and then successively reduce its size by adding more restric- tive criteria. It is not necessary to name the set being reduced, because it is always the last one created, nor is it necessary to write the Boolean AND between set numbers because it is understood. For the search illustrated above, the searcher might have started with an even broader term, ANIMALS, reduced this to PETS
(assumed to mean ANIMALS AND PETS), then to CATS OR DOGS(now giving us
ANIMALS AND PETS AND(CATS OR DOGS) ). Under this approach, if a user wants to take another path, say CANINE NUTRITION, it is necessary to start the search again. There is no objective measure of which approach is better. Searchers tend to learn one method, and then to prefer that one thereafter.
The advantages of the Boolean query logic are that it is easy to write inter- pretive programs for it and that the logic is crisp and decisive: records are in a set or they are not. The disadvantage is a “soft” one: most people do not naturally think in Boolean terms; hence, the logic seems artificial, and it is sometimes difficult for users to express what they want or to understand what they get. Even though we may not know exactly how people think, the sense among researchers in the field tends to be that forcing users to work in terms of binary set-defining functions is not a popular approach (Bookstein, 1985; Croft and Thompson, 1987; Hildreth, 1983; Salton and McGill, 1983, pp. 118–155; Salton, 1989, pp. 345–361).