Inverted files - The Resource Description Framework (RDF)


Panel 10.6 Inverted files

An inverted file is a list of the words in a set of documents and their locations within those documents. Here is a small part of an inverted file.

Word      Document   Location
abacus    3          94
          19         7
          19         212
actor     2          66
          19         200
          29         45
aspen     5          43
atoll     11         3
          34         40

This inverted file shows that the word "abacus" is word 94 in document 3, and words 7 and 212 in document 19; the word "actor" is word 66 in document 2, word 200 in document 19, and word 45 in document 29; and so on. The list of locations for a given word is called an inverted list.

An inverted file can be used to search a set of documents to find every occurrence of a single search term. In the example above, a search for the word "actor" would look in the inverted file and find that the word appears in documents 2, 19, and 29. A simple reference to an inverted file is typically a fast operation for a computer.
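The structure described in this panel can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular search system: the three-document collection, the function names build_inverted_file and documents_containing, and the whitespace tokenization are all assumptions made for the example, and stop words are not removed.

```python
from collections import defaultdict


def build_inverted_file(documents):
    """Map each word to a list of (document number, word position) pairs."""
    inverted = defaultdict(list)
    for doc_id in sorted(documents):
        # Assumption: words are separated by whitespace; positions start at 1.
        words = documents[doc_id].lower().split()
        for position, word in enumerate(words, start=1):
            inverted[word].append((doc_id, position))
    return dict(inverted)


def documents_containing(inverted_file, term):
    """Return the sorted document numbers in which a single term occurs."""
    return sorted({doc_id for doc_id, _ in inverted_file.get(term, [])})


# Hypothetical three-document collection, invented for illustration.
documents = {
    1: "the actor bought an abacus",
    2: "an atoll is a ring of coral",
    3: "the abacus belonged to the actor",
}

inverted_file = build_inverted_file(documents)
print(inverted_file["abacus"])                        # inverted list for "abacus"
print(documents_containing(inverted_file, "actor"))   # documents containing "actor"
```

A single-term search is then one dictionary lookup, which is why a simple reference to an inverted file is fast.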

Most inverted lists contain the location of the word within the document. This is important for displaying the result of searches, particularly with long documents. The section of the document can be displayed prominently with the search terms highlighted.

Since inverted files contain every word in a set of documents, except stop words, they are large. For typical digital library materials, the inverted file may approach half the total size of all the documents, even after compression. Thus, at the cost of storage space, an inverted file provides a fast way to find every occurrence of a single word in a collection of documents. Most methods of information retrieval use inverted files.

Boolean searching

Panel 10.6 describes inverted files, the basic computational method that is used to compare the search terms against a collection of textual documents. Boolean queries consist of two or more search terms, related by logical operators such as and, or, or not. Consider the query "abacus and actor" applied to the inverted file in Panel 10.6.

The query includes two search terms separated by a Boolean operator. The first stage in carrying out this query is to read the inverted lists for "abacus" (documents 3 and 19) and for "actor" (documents 2, 19, and 29). The next stage is to compare the two lists for documents that are in both lists. Both words are in document 19, which is the only document that satisfies the query. When the inverted lists are short, Boolean searches with a few search terms are almost as fast as simple queries, but the computational requirements increase dramatically with large collections of information and complex queries.
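The two stages of the query "abacus and actor" can be sketched as follows. The inverted lists are copied from Panel 10.6; the function names document_list and intersect are assumptions made for this example, and the merge of two sorted lists is one common way to find shared documents, not necessarily the one used by any given system.

```python
def document_list(inverted_list):
    """Collapse (document, location) pairs to sorted, distinct document numbers."""
    return sorted({doc for doc, _ in inverted_list})


def intersect(a, b):
    """Walk two sorted lists in step, keeping the values found in both."""
    i = j = 0
    shared = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            shared.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return shared


# Inverted lists for two words, taken from Panel 10.6.
panel = {
    "abacus": [(3, 94), (19, 7), (19, 212)],
    "actor": [(2, 66), (19, 200), (29, 45)],
}

# Stage 1: read the inverted lists. Stage 2: compare them.
abacus_docs = document_list(panel["abacus"])   # [3, 19]
actor_docs = document_list(panel["actor"])     # [2, 19, 29]
result = intersect(abacus_docs, actor_docs)
print(result)  # [19]
```

The work done by the merge grows with the combined length of the two lists, which gives a sense of why long lists and many-term queries cost more.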

Inverted files can be used to extend the basic concepts of Boolean searching. Since the locations of words within documents are recorded in the inverted lists, they can be used for searches that specify the relative position of two words, such as a search for the word "West" followed by "Virginia". They can also be used for truncation, to search for words that begin with certain letters. In many search systems, a search for "comp?" will search for all words that begin with the four letters "comp". This will find the related words "compute", "computing", "computer", "computers", and "computation". Unfortunately, this approach will not distinguish unrelated words that begin with the same letters, such as "company".
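Both extensions can be sketched against a small inverted file. The data below is hypothetical, and the function names followed_by and truncation_search are assumptions made for this example; a production system would normally keep the word list sorted rather than scan every word for a prefix.

```python
def followed_by(inverted_file, first, second):
    """Return documents in which `second` occurs immediately after `first`."""
    second_locations = set(inverted_file.get(second, []))
    return sorted({
        doc
        for doc, position in inverted_file.get(first, [])
        if (doc, position + 1) in second_locations
    })


def truncation_search(inverted_file, prefix):
    """Return every indexed word that begins with `prefix`."""
    return sorted(word for word in inverted_file if word.startswith(prefix))


# Hypothetical inverted file: word -> list of (document, location) pairs.
inverted_file = {
    "west": [(1, 4), (2, 10)],
    "virginia": [(1, 5), (2, 30)],
    "compute": [(3, 1)],
    "computer": [(3, 8)],
    "company": [(4, 2)],
}

# Only document 1 has "virginia" in the position right after "west".
print(followed_by(inverted_file, "west", "virginia"))

# The prefix match also returns the unrelated word "company".
print(truncation_search(inverted_file, "comp"))
```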

Ranking closeness of match

Boolean searching is a powerful tool, but it finds exact matches only. A search for "library" will miss "libraries"; "John Smith" and "J. Smith" are not treated as the same name. Yet everybody recognizes that these are similar. A range of techniques address such difficulties.

The modern approach is not to attempt to match documents exactly against a query but to define some measure of similarity between a query and each document.

Suppose that the total number of different words in a set of documents is n. A given document can be represented by a vector in n-dimensional space. If the document contains a given word, the vector has value 1 in the corresponding dimension, otherwise 0. A query can also be represented by a vector in the same space. The closeness with which a document matches a query is measured by how close these two vectors are to each other. This might be measured by the angle between these two vectors in n-dimensional space. Once these measures have been calculated for every document, the results can be ranked from the best match to the least good. Several ranking techniques are variants of the same general concepts. A variety of probabilistic methods make use of the statistical distribution of the words in the collection. These methods derive from the observation that the exact words chosen by an author to describe a topic or by a user to express a query were chosen from a set of possibilities, but that other words might be equally appropriate.
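The vector-space measure described above can be sketched with 0/1 vectors and the cosine of the angle between them (a larger cosine means a smaller angle and so a closer match). The documents, the query, and the function names binary_vector, cosine, and rank are assumptions made for this example; real systems usually weight terms rather than use plain 0/1 values.

```python
import math


def binary_vector(text, vocabulary):
    """Return 1 for each vocabulary word present in the text, otherwise 0."""
    words = set(text.lower().split())
    return [1 if term in words else 0 for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(query, documents):
    """Return (score, document number) pairs, best match first."""
    # The n dimensions are the distinct words of the collection.
    vocabulary = sorted({w for text in documents.values() for w in text.lower().split()})
    query_vector = binary_vector(query, vocabulary)
    scores = [
        (cosine(query_vector, binary_vector(text, vocabulary)), doc_id)
        for doc_id, text in documents.items()
    ]
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical collection and query, invented for illustration.
documents = {
    1: "digital library search",
    2: "library of congress",
    3: "coral atoll",
}
ranked = rank("digital library", documents)
for score, doc_id in ranked:
    print(doc_id, round(score, 3))
```

Unlike a Boolean "and", document 2 still receives a non-zero score even though it contains only one of the two query words.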

Natural language processing and computational linguistics

The words in a document are not simply random strings of characters. They are words in a language, such as English, arranged into phrases, sentences, and paragraphs.

Natural language processing is the branch of computer science that uses computers to interpret and manipulate words as part of a language. The spelling checkers that are used with word processors are a well-known application. They use methods of natural language processing to suggest alternative spellings for words that they do not recognize.

Computational linguistics deals with grammar and linguistics. One of the achievements of computational linguistics has been to develop computer programs that can parse almost any sentence with good accuracy. A parser analyzes the structure of sentences. It categorizes words by part of speech (verb, noun, adjective, etc.), groups them into phrases and clauses, and identifies the structural elements (subject, verb, object, etc.). For this purpose, linguists have been required to refine their understanding of grammar, recognizing far more subtleties than were contained in traditional grammars. Considerable research in information retrieval has been carried out using noun phrases. In many contexts, the content of a sentence can be found by extracting the nouns and noun phrases and searching on them. This work has not been restricted to English, but has been carried out for many languages.

Parsing requires an understanding of the morphology of words, that is, variants derived from the same stem, such as plurals (library, libraries) and verb forms (look, looks, looked). For information retrieval, it is often effective to reduce morphological variants to a common stem and to use the stem as a search term. This is called stemming. Stemming is more effective than truncation since it separates words with totally different meanings, such as "computer" from "company", while recognizing that "computer" and "computing" are morphological variants from the same stem. In English, where the stem is almost always at the beginning of the word, stemming can be carried out by truncating words and perhaps making adjustments to the final few letters. In other languages, such as German, it is also necessary to trim at the beginning of words.
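The idea of truncating a word and adjusting its final letters can be illustrated with a toy suffix-stripping function. This is only a sketch under strong assumptions: the suffix list and the minimum stem length are invented for the examples in this section, the function name toy_stem is hypothetical, and the rules will stem many other English words wrongly. It is not one of the published stemming algorithms.

```python
# Hypothetical suffix rules, tried in order: (suffix to remove, replacement).
SUFFIX_RULES = [
    ("ies", "y"),    # libraries -> library
    ("ation", ""),   # computation -> comput
    ("ers", ""),     # computers -> comput
    ("ing", ""),     # computing -> comput
    ("ed", ""),      # looked -> look
    ("er", ""),      # computer -> comput
    ("s", ""),       # looks -> look
    ("e", ""),       # compute -> comput
]

MIN_STEM_LENGTH = 3  # assumption: never cut a word down to fewer than 3 letters


def toy_stem(word):
    """Apply the first matching suffix rule, or return the word unchanged."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        stem = word[: len(word) - len(suffix)]
        if word.endswith(suffix) and len(stem) >= MIN_STEM_LENGTH:
            return stem + replacement
    return word


for w in ["compute", "computing", "computer", "computers", "computation", "company"]:
    print(w, "->", toy_stem(w))
```

With these rules the five related words share the stem "comput", while "company" is left unchanged and so is not confused with them, which truncation on "comp" cannot achieve.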

Computational linguists have developed a range of dictionaries and other tools, such as lexicons and thesauruses, that are designed for natural language processing. A lexicon contains information about words, their morphological variants, and their grammatical usage. A thesaurus relates words by meaning. Some of these tools are general purpose; others are tied to specific disciplines. Two were described earlier in this chapter: the Art and Architecture Thesaurus and the MeSH headings for medicine. Linguistic tools can greatly augment information retrieval. By recognizing words as more than random strings of characters, they can recognize synonyms ("car" and "automobile"), relate a general term and a particular instance ("science" and "chemistry"), or a technical term and the vernacular ("cranium" and "brain"). The creation of a lexicon or thesaurus is a major undertaking and is never complete. Languages change continually, especially the terminology of fields in which there is active research.

User interfaces and information retrieval systems

Information retrieval systems depend for their effectiveness on the user making best use of the tools provided. When the user is a trained medical librarian or a lawyer whose legal education included training in search systems, these objectives are usually met. Untrained users typically do much less well at formulating queries and understanding the results.

A feature of the vector-space and probabilistic methods of information retrieval is that they are most effective with long queries. An interesting experiment is to use a very long query, perhaps an abstract from a document. Using this as a query is equivalent to asking the system to find documents that match the abstract. Many modern search systems are remarkably effective when given such an opportunity, but methods that are based on vector space or linguistic techniques require a worthwhile query to display their full power.

Statistics of the queries that are used in practice show that most queries consist of a single word, which is a disappointment to the developers of powerful retrieval systems. One reason for these short queries is that many users made their first searches on Boolean systems, where the only results found are exact matches, so that a long query usually finds no matches. These early systems had another characteristic that encouraged short queries: when faced with a long or complex query, their performance deteriorated terribly. Users learned to keep their queries short. Habits that were developed with these systems have been retained even with more modern systems.

However, the tendency of users to supply short queries is more entrenched than can be explained by these historical factors, or by the tiny input boxes sometimes provided. The pattern is repeated almost everywhere. People appear to be inhibited from using long queries. Another unfortunate characteristic of users, which is widely observed, is that few people read even the simplest instructions. Digital libraries are failing to train their users in effective searching, and users do not take advantage of the potential of the systems that they use.

Evaluation

Information retrieval has a long tradition of performance evaluation. Two long-standing criteria are precision and recall. Each refers to the results from carrying out a single search on a given body of information. The result of such a search is a set of hits. Ideally every hit would be relevant to the original query, and every relevant item in the body of information would be found. In practice, it usually happens that some of the hits are irrelevant and that some relevant items are missed by the search.

• Precision is the percentage of the hits that are relevant, the extent to which the set of hits retrieved by a query satisfies the requirement that generated the query.

• Recall is the percentage of the relevant items that are found by the query, the extent to which the query found all the items that satisfy the requirement.

Suppose that, in a collection of 10,000 documents, 50 are on a specific topic. An ideal search would find these 50 documents and reject all others. An actual search identifies 25 documents, of which 20 prove to be relevant and 5 are on other topics. In this instance, the precision is 20 out of 25, or 0.8 (80 percent). The recall is 20 out of 50, or 0.4 (40 percent).
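The arithmetic of this example can be checked with a short sketch. The document numbers are hypothetical, chosen only so that the counts match the example (50 relevant documents, 25 hits, 20 of them relevant); the function name precision_recall is also an assumption.

```python
def precision_recall(hits, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    hits = set(hits)
    relevant = set(relevant)
    relevant_hits = hits & relevant
    precision = len(relevant_hits) / len(hits) if hits else 0.0
    recall = len(relevant_hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical document numbers: documents 1-50 are the relevant ones.
relevant = set(range(1, 51))
# The search finds 20 relevant documents and 5 on other topics.
hits = set(range(1, 21)) | set(range(9001, 9006))

precision, recall = precision_recall(hits, relevant)
print(precision, recall)  # 0.8 0.4
```

Note that the function needs the full set of relevant documents to compute recall, which is exactly the information that is hard to obtain for a large collection.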

Precision is much easier to measure than recall. To calculate precision, a knowledgeable person looks at each document that is identified and decides whether it is relevant. In the example, only the 25 documents that are found need to be examined. Recall is difficult to measure, since there is no way to know all the items in a collection that satisfy a specific query other than to go systematically through the entire collection, looking at every object to decide whether it fits the criteria. In this example, all 10,000 documents must be examined. With large numbers of documents, this is an imposing task.

From the document Digital Libraries (pages 160-164)