Introduction
1.6.6 The World Wide Web
We will begin our discussion of the Web with a second quotation from Vannevar Bush, who, speaking of traditional retrieval structures, suggested in 1945 that a model for retrieval closer to that of the human brain might be most effective.
The human mind does not work in this way. It operates by association. With one item in its grasp, it snaps instantly to the next that is suggested by the association of thoughts, in accordance with some intricate Web of trails carried by the cells of the brain. It has other characteristics, of course, trails that are not frequently followed are prone to fade, items that are not fully permanent, memory that is transitory. Yet the speed of action, the intricacy of trails, the detail of mental pictures, is awe-inspiring beyond all else in nature. (Bush, 1945, p. 106)
The WWW or “Web” is an attempt to use a hypertext model, an approach not unlike that suggested by Bush. The simplest definition of hypertext is “non-sequential writing” (Nelson, 1988). The thought is that as we pursue a text we find topics in need of expansion or explanation. The author can provide links or pointers to the location of related material and these will appear with the material, not separately, as in a bibliography at the end of the work. The reader may then proceed forward sequentially or move to the linked location for further information. These links are like citations embedded in a text but the difference is that the reader of hypertext may “go to” the cited work immediately, at any time.
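The mechanism just described, a pointer embedded in the text that the reader may follow at once and later return from, can be sketched in a few lines of Python. The document names, text, and link labels here are hypothetical, chosen only to illustrate the idea.

```python
# A tiny hypothetical hypertext: each document holds its text and
# named links pointing at other documents in the collection.
DOCS = {
    "intro":  {"text": "Retrieval by association, as Bush proposed.",
               "links": {"memex": "bush45"}},
    "bush45": {"text": "As We May Think (1945), on the memex.",
               "links": {}},
}

def follow(doc_name, link_name):
    """Jump to the document a named link points to, as a reader
    following a hypertext link (or a citation) would."""
    return DOCS[DOCS[doc_name]["links"][link_name]]

# The reader of "intro" can go to the cited work immediately ...
target = follow("intro", "memex")
# ... and, having reviewed it, simply return to "intro" and read on.
```

The essential point is that the link is part of the text itself, not a separate bibliography entry, so the reader's path through the collection is chosen at reading time.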
At the completion of a review of this material the reader may return to the initial text or follow further links in other directions. The model can be used in a conventional book, but is more effective in automated files where the physical page shuffling can be handled automatically. The resulting linking structure is Web-like.
If we assume items of information on computers at widely separate locations with such a linking structure in place, as well as a telecommunication network that allows interaction among the various computers, we extend the hypertext model worldwide, and approach the associative structure that Bush suggested.
The Internet provides the telecommunications network upon which such a structure can be implemented. The initial work was carried out by the Advanced Research Projects Agency of the U.S. Department of Defense (ARPA/DARPA). The first recorded description of what might be achieved was in a series of memos written in 1962 by J.C.R. Licklider of MIT, the first head of the computer research program at DARPA. He envisioned a “Galactic Network,” a globally interconnected set of networks through which anyone could access data and programs from any site (Leiner et al., 2003). Such an entity would allow use of computer resources physically remote from the person with the need, and not require the costly duplication of such resources locally. The first paper on packet switching theory was published by Leonard Kleinrock (1961). He convinced DARPA of the theoretical feasibility of communications using packets rather than circuits.
In 1967, DARPA produced a plan for the ARPANET, and by the end of 1969, four host computers were connected together into the initial network.
The idea of open-architecture networking was first introduced by R.E. Kahn shortly after he arrived at DARPA in 1972. In 1973, he began the Internet
research program there. He and Vint Cerf were instrumental in the design of the Transmission Control Program (TCP), the packet-switching control program at the heart of today’s Internet (Leiner et al., 2003).
ARPANET was demonstrated in 1972, the same year that electronic mail was introduced over the network.
In Geneva, Switzerland, Tim Berners-Lee, a physicist at the Centre Européen de Recherche Nucléaire (CERN), created the WWW in 1991 by using a hypertext model (Berners-Lee, 1996). Using an existing document markup language, the Standard Generalized Markup Language (SGML), with some extensions, he was able to code documents with embedded Internet addresses that could be read by a program called a browser, which copied the hypertext documents from servers with Internet addresses. Not only text but also embedded graphics could be transmitted and viewed at any site with a browser.
From the original four hosts in 1969, the Internet has grown to over 100 in 1977, over 28,000 in 1987, nearly 20,000,000 in 1997, and currently over 350 million active hosts (Zakon, 2005). This is impressive growth indeed. The Web has become a mainstream communication channel, with significant advertising, news, and commercial sales enterprises making use of a medium originally thought to be for the exchange of scientific information among scholars. The growing use of broadband cable connections in homes and offices can only improve the speed of transmission and the growth of the medium. We thus have an interactive retrieval system whose boundaries have no predictable limits, rapidly becoming part of the economics and culture of the modern world.
The Internet is a communication medium; the Web is a means of linking documents together. Initially, the main type of link was to the source of some related information, the equivalent of a footnote in a printed article or book. A link could also point to related work by the same author or publisher. A third major innovation was the Web crawler, a computer program that could systematically connect to Web sites, index what was found there, and add the result to a gigantic index of, now, billions of documents. The word document here initially meant the text of an article from a scientific journal, a newspaper, a judicial finding, or a personal statement not otherwise published. Today, what is found at a Web site can be not only these things but also a computer program capable of such tasks as searching for information, eliciting information from a user, performing calculations, or processing graphic images.
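The crawler's basic cycle, fetch a page, index its words, and follow its links, can be sketched as follows. To keep the example self-contained, a small in-memory collection of hypothetical pages stands in for live Web sites, so no network connection is involved.

```python
from collections import deque

# Hypothetical "web": URL -> (page text, outgoing links).
PAGES = {
    "http://example.org/a": ("bush memex trails",
                             ["http://example.org/b"]),
    "http://example.org/b": ("hypertext links browser",
                             ["http://example.org/a", "http://example.org/c"]),
    "http://example.org/c": ("packet switching network", []),
}

def crawl(seed):
    """Breadth-first crawl from a seed URL, building an inverted
    index mapping each word to the set of URLs containing it."""
    index, seen, frontier = {}, set(), deque([seed])
    while frontier:
        url = frontier.popleft()
        if url in seen or url not in PAGES:
            continue                      # already indexed, or unreachable
        seen.add(url)
        text, links = PAGES[url]
        for word in text.split():
            index.setdefault(word, set()).add(url)
        frontier.extend(links)            # follow the page's links
    return index

index = crawl("http://example.org/a")
```

A real crawler would fetch each URL over HTTP, parse the links out of the markup, and observe politeness conventions toward the sites it visits, but the index-building loop is the same in outline.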
The sites that contain programs that search other sites for information requested by a user have come to be called search engines (SE). They are text IRS of a new order. What is the difference between these kinds of searching programs and the earlier IRS? Briefly, the differences are:
1. An IR system deals with databases of known structure. For example, if it is a library catalog, the search program knows what fields are contained (title, author, etc.) and where within a record each of these fields may be found. Typically, an SE knows neither the content nor the structure of a record it finds, only the words contained, although more extensive use of SGML and the like could improve on this.
2. An IR system typically is linked to one or more databases, each of which has some common attribute, usually subject matter, but possibly events in a time period or court proceedings. The managers of the IR system select with care the databases to which they wish to connect. Therefore, it can offer its users more assurance than an SE can that records found are likely to be relevant to a search request. The SE applies minimal screening to what it finds and leaves it to the searcher to make a final determination of relevance.
3. To counter the disadvantage just described, SEs almost always provide some form of relevance ranking and present findings to searchers in decreasing order of assumed relevance. The ranking methods normally are based on statistical relations between words in a query and words in a text, the anchor words of in-links to the text, the location of the words in the text, and the number of in-links to the site from other sites.
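As a minimal illustration of point 3, the sketch below ranks a few hypothetical documents by just one of the statistical signals mentioned, the overlap between query words and text words; real engines combine this kind of evidence with word positions, anchor text, and in-link counts.

```python
def rank(query, docs):
    """Order document names by a crude relevance score: the number
    of distinct query words each document's text contains."""
    q_words = set(query.lower().split())
    scores = {name: len(q_words & set(text.lower().split()))
              for name, text in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

docs = {
    "d1": "packet switching on the ARPANET",
    "d2": "hypertext links and the browser",
    "d3": "a history of the hypertext browser",
}
order = rank("hypertext browser", docs)   # d2 and d3 outrank d1
```

Even this crude score lets the system present its best guesses first, which is exactly the compensation for minimal screening that point 3 describes.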
A more detailed analysis of IR–SE system differences is found in Chapter 15. We have come a long way from China Lake in terms of computer power and communications technology, but the basic principles of retrieval system design and operation have not significantly changed; rather, they have adapted to the new environment. It is our hope to set forth these principles in the following chapters.
Recommended Reading
These are somewhat aged, but still of interest.
Debons, A., Horne, and Cronenweth, S. (1988). Information Science: An Integrated View. G.K. Hall, Boston.
Fenichel, C. H., and Hogan, T. H. (1990). Online Searching: A Primer. (3rd ed.) Learned Information, Inc., Marlton, NJ.
Hawkins, D. T. (1976). Online Information Retrieval Bibliography. Updated and published annually in Online Review since 1976.
Salton, G. (1989). Automatic Text Processing. Addison-Wesley, Reading, MA.
Sparck Jones, K. (1981). Information Retrieval Experiment. Butterworths, London.
Vickery, B., and Vickery, A. (1987). Information Science in Theory and Practice. Butterworths, London.
These are newer.
Berry, M. W., ed. (2004). Survey of Text Mining: Clustering, Classification, and Retrieval. Springer, New York.
Berry, M. W., and Browne, M. (1999). Understanding Search Engines: Mathematical Modeling and Text Retrieval. Society for Industrial and Applied Mathematics, Philadelphia.
Chowdhury, G. G. (2003). Introduction to Modern Information Retrieval (2nd ed.). Neal-Schuman Publishers, New York.
Frakes, W. B., and Baeza-Yates, R. (eds.) (1992). Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ. (Revised version, 1998).
Grossman, D. A., and Frieder, O. (2004). Information Retrieval: Algorithms and Heuristics. Springer, Dordrecht, The Netherlands.
Korfhage, R. R. (1997). Information Storage and Retrieval. Wiley, New York.
Manning, C. D., and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
van Rijsbergen, C. J. (2004). The Geometry of Information Retrieval. Cambridge University Press, Cambridge.
Zakon, R. H. (1993). Hobbes’ Internet Timeline. http://info.isoc.org/guest/zakon/Internet/History/HIT.html