
At any moment, a crawler has millions of unexplored URLs, but little information on which to base its choice among them. Possible criteria include currency, how many other pages link to a URL, whether it is a home page or a page deep within a hierarchy, whether it references a CGI script, and so on.
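As a minimal sketch of how such criteria might be combined, the following Python function scores candidate URLs. The weights, the inbound-link counts, and the example URLs are illustrative assumptions, not any real crawler's policy.

```python
# Hypothetical scoring heuristic for choosing which URL to crawl next.
# The weights and criteria are illustrative assumptions only.
from urllib.parse import urlparse

def crawl_priority(url, inbound_links):
    score = inbound_links                    # heavily linked-to pages first
    path_parts = [p for p in urlparse(url).path.split("/") if p]
    score -= len(path_parts)                 # prefer home pages to deep pages
    if "cgi" in url or "?" in url:
        score -= 10                          # deprioritize CGI/script output
    return score

frontier = [
    ("http://www.example.edu/", 120),
    ("http://www.example.edu/a/b/c/page.html", 3),
    ("http://www.example.edu/cgi-bin/query?x=1", 40),
]
frontier.sort(key=lambda item: -crawl_priority(*item))
print([url for url, _ in frontier])          # home page comes first
```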

The biggest challenges concern indexing. Web crawlers rely on automatic indexing methods to build their indexes and to create the records that are presented to users, a topic discussed in Chapter 10. The programs face automatic indexing at its most basic: millions of pages, created by thousands of people, with different concepts of how information should be structured. Typical web pages provide meager clues for automatic indexing. Some creators and publishers are even deliberately misleading: they fill their pages with terms that are likely to be requested by users, hoping that their pages will be ranked highly against common search queries. Without better structured pages or systematic metadata, the quality of the indexing records will never be high, but they are adequate for simple retrieval.
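The core of such automatic indexing can be illustrated with a small sketch: build an inverted index that maps each term to the pages containing it. The page texts here are hypothetical, and real systems add far more (stemming, stop words, field weights).

```python
# Minimal inverted index: term -> set of pages containing that term.
import re
from collections import defaultdict

pages = {  # hypothetical page texts
    "page1": "Stanford University home page",
    "page2": "University library catalog search",
}

index = defaultdict(set)
for url, text in pages.items():
    for term in re.findall(r"[a-z]+", text.lower()):
        index[term].add(url)

print(sorted(index["university"]))  # ['page1', 'page2']
```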

Searching the index

The web search programs allow users to search the index, using information retrieval methods of the kind described in Chapter 10. The indexes are organized for efficient searching by large numbers of simultaneous users. Since the index records themselves are of low quality and the users are likely to be untrained, the search programs follow the strategy of identifying all records that even vaguely match the query and supplying them to the user in some ranked order.
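A minimal sketch of this loose-matching strategy, continuing the hypothetical inverted index above: return every page that matches any query term, ranked by how many terms it matches. The ranking here is deliberately crude; real systems weight terms in many more ways.

```python
# Rank pages by the number of query terms they match.
index = {  # hypothetical inverted index: term -> pages containing it
    "stanford": {"page1", "page3"},
    "university": {"page1", "page2"},
    "library": {"page2"},
}

def search(query):
    scores = {}
    for term in query.lower().split():
        for page in index.get(term, set()):
            scores[page] = scores.get(page, 0) + 1
    return sorted(scores, key=lambda page: -scores[page])

print(search("stanford university"))  # 'page1' matches both terms
```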

Most users of web search programs would agree that they are remarkable programs, but they have several significant difficulties. The ranking algorithms have little information on which to base their decisions. As a result, the programs may give high ranks to pages of marginal value; important materials may be far down the list and trivial items at the top. The index programs have difficulty recognizing duplicates; although they attempt to group similar items, and similar items tend to rank together, the programs often return long lists of almost identical items. One interesting approach to ranking is to use link counts. Panel 11.1 describes Google, a search system that has used this approach. It is particularly effective in finding introductory or overview material on a topic.

Panel 11.1

Page ranks and Google

As an illustration of Google's ranking, a search for "Stanford University" returned the ten pages listed below. Most people would agree that this is a good list of high-ranking pages that refer to Stanford University.

Stanford University Homepage (www.stanford.edu/)

Stanford University Medical Center (www-med.stanford.edu/)

Stanford University Libraries & Information Resources (www-sul.stanford.edu/)

Stanford Law School (www-leland.stanford.edu/group/law/)

Stanford Graduate School of Business (www-gsb.stanford.edu/)

Stanford University School of Earth Sciences (pangea.stanford.edu/)

SUL: Copyright & Fair Use (fairuse.stanford.edu/)

Computer Graphics at Stanford University (www-graphics.stanford.edu/)

SUMMIT (Stanford University) Home Page (summit.stanford.edu/)

Stanford Medical Informatics (camis.stanford.edu/)

The basic method used by Google is simple. A web page to which many other pages provide links is given a higher rank than a page with fewer links. Moreover, links from high-ranking pages are given greater weight than links from other pages. Since web pages around the world have links to the home page of the Stanford Law School, this page has a high rank. In turn, it links to about a dozen other pages, such as the university's home page, which gain rank from being referenced by a high-ranking page.

Calculating the page ranks is an elegant computational challenge. To understand the basic concept, imagine a huge matrix listing every page on the web and identifying every page that links to it. Initially, every page is ranked equally. New ranks are then calculated, based on the number of links to each page, weighted according to the rank of the linking pages and inversely proportional to the number of links from each. These ranks are used for another iteration, and the process is continued until the calculation converges.
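A minimal sketch of this iterative calculation in Python follows. The damping factor, the iteration count, and the miniature set of links are illustrative assumptions; they stand in for the refinements described below, not for Google's actual implementation.

```python
# Iterative page-rank calculation, as described above. Each page passes
# its current rank, divided equally among its outgoing links, to the
# pages it links to; the process repeats until the ranks settle.

def page_rank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # start with equal ranks
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            targets = outgoing or pages           # a page with no links
            share = rank[page] / len(targets)     # spreads its rank evenly
            for target in targets:
                new_rank[target] += damping * share
        rank = new_rank
    return rank

# Hypothetical miniature web: "law" is linked to from everywhere,
# so it ends up with the highest rank.
links = {
    "home": ["law", "library"],
    "library": ["home", "law"],
    "personal": ["law"],
    "law": ["home"],
}
for page, r in sorted(page_rank(links).items(), key=lambda kv: -kv[1]):
    print(f"{page}: {r:.3f}")
```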

The actual computation is a refinement of this approach. In 1998, Google had a set of about 25 million pages, selected by a process derived from the ranks of the pages that link to them. The program has weighting factors to account for pages with no links, or groups of pages that link only to each other. It rejects pages that are generated dynamically by CGI scripts. A sidelight on the power of modern computers is that the system was able to gather, index, and rank these pages in five days, using only standard workstation computers.

The use of links to generate page ranks is clearly a powerful tool. It helps solve two problems that bedevil web search programs: since they cannot index every page on the web simultaneously, which pages should they index first, and how should they rank the pages found by simple queries so that the most useful come to the top?

Web search programs have other weaknesses. Currency is one. The crawlers are continually exploring the web. Eventually, almost everything will be found, but important materials may not be indexed until months after they are published on the web. Conversely, the programs do an indifferent job of checking whether materials have been withdrawn, so many of the index entries refer to items that no longer exist or have moved.

Another threat to the effectiveness of web indexing is that a web crawler can index material only if it can access it directly. If a web page is protected by some form of authentication, or if it is an interface to a database or a digital library collection, the indexer will know nothing about the resources behind the interface. As more and more web pages become interfaces controlled by Java programs or other scripts, high-quality information is increasingly missed by the indexes.

These problems are significant but should not be over-emphasized. The proof lies in the practice. Experienced users of the web can usually find the information that they want. They use a combination of tools, guided by experience, often trying several web search services. The programs are far from perfect, but they are remarkably good, and their use is free.

Business issues

A fascinating aspect of the web search services is their business model. Most of the programs had roots in research groups, but they rapidly became commercial companies. Chapter 6 noted that, initially, some of these organizations tried to require users to pay a subscription, but Lycos, which was developed by a researcher at Carnegie Mellon University, was determined to provide public, no-cost searching. The others were forced to follow. Not charging for the basic service has had a profound impact on the Internet and on the companies. Their search for revenue has led to aggressive attempts to build advertising. They have moved rapidly into related markets, such as licensing their software to other organizations that want to build indexes of their own web sites.

A less desirable aspect of this business model is that the companies have limited incentive to maintain a comprehensive index. At first, the indexing programs aimed to index the entire web. As the web has grown larger and the search programs have become commercial ventures, comprehensiveness has become secondary to improvements in interfaces and ancillary services. Building a really high-quality index of the Internet, and keeping it up to date, requires a considerable investment. Most of the companies are content to do a reasonable job; with stronger incentives, their indexes would be better.

Federated digital libraries

The tension that Figure 11.1 illustrates between functionality and cost of adoption has no single correct answer. Sometimes the appropriate decision for digital libraries is to select simple technology and strive for broad but shallow interoperability. At other times, the wise decision is to select technology from the top right of the figure, with great functionality but correspondingly high costs; since the costs are high, only highly motivated libraries will adopt the methods, but they will gain greater functionality.

The term federated digital library describes a group of organizations, working together formally or informally, that agree to support a set of common services and standards, thus providing interoperability among their members. In a federation, the partners may have very different systems, so long as they support the agreed set of services. They need to agree both on technical standards and on policies, including financial arrangements, intellectual property, security, and privacy.

Research at the University of Illinois, Urbana-Champaign provides a revealing example of the difficulties of interoperability. During 1994-98, as part of the Digital Libraries Initiative, a team based at the Grainger Engineering Library set out to build a federated library of journal articles from several leading science publishers. Since each publisher planned to make its journals available with SGML mark-up, this appeared to be an opportunity to build a federation: the university would provide central services, such as searching, while the collections would be maintained by the publishers. This turned out to be difficult. A basic problem was incompatibility in the way that the publishers use SGML. Each has its own Document Type Definition (DTD). The university was forced to go to enormous lengths to reconcile the semantics of the DTDs, both to extract indexing information and to build a coherent user interface.

This problem proved to be so complex that the university resorted to copying the information from the publishers' computers onto a single system and converting it to a common DTD. If a respected university research group encountered such difficulties with a relatively coherent body of information, it is not surprising that others face the same problems. Panel 11.2 describes this work in more detail.
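To make the nature of the problem concrete, here is a minimal sketch of the kind of reconciliation involved: two publishers tag the same concepts with different element names, and a mapping table converts both into a common vocabulary. The element names, the mappings, and the use of XML rather than full SGML are hypothetical simplifications of what the Illinois team actually faced.

```python
# Map publisher-specific element names onto a common vocabulary.
import xml.etree.ElementTree as ET

# Hypothetical per-publisher mappings: local element -> common element.
MAPPINGS = {
    "publisherA": {"art-title": "title", "auth": "author"},
    "publisherB": {"ti": "title", "creator": "author"},
}

def to_common_dtd(xml_text, publisher):
    """Rename a publisher's elements to the common names."""
    mapping = MAPPINGS[publisher]
    root = ET.fromstring(xml_text)
    for element in root.iter():
        element.tag = mapping.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")

article = "<article><art-title>Digital Libraries</art-title><auth>Arms</auth></article>"
print(to_common_dtd(article, "publisherA"))
# <article><title>Digital Libraries</title><author>Arms</author></article>
```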

Panel 11.2

The University of Illinois federated library of journal articles
