History and Background
Tim Berners-Lee wrote the initial proposal for the World Wide Web in 1989, and developed it online in 1991 by using a hypertext model (Berners-Lee, 1989, 1996).
The World Wide Web was developed to allow people to collaborate on projects; it began at CERN, the European Particle Physics Laboratory in Geneva, Switzerland, and expanded across nations and disciplines. Berners-Lee (1996) defined the components of the Web: the boundless information world, the address system (URI), a network protocol (HTTP), a markup language (HTML), a body of data, and the client-server architecture of the Web. The creation in 1993 of Mosaic, a graphical Web interface that was the precursor of Netscape, enabled millions of people to access the Web easily. Since then, the increase in Web resources has been phenomenal, and Web search engines have become essential tools for navigating those resources.
The emergence of the Web signifies the era of end users. In IR history, this is the first time that millions of users have been able to search for online information themselves without help from intermediaries. Nielsen//NetRatings (Sullivan, 2006), a global leader in Internet media and market research, reported that the volume of Internet search queries grew to more than 5.1 billion by October 2005; the top five search engines are Google, Yahoo!, MSN, AOL, and Ask Jeeves.
Montgomery and Faloutsos (2001) analyzed data collected from Internet users from 1997 to 1999 and found that Internet usage had grown dramatically. However, the way users interacted with the Web remained the same, and their viewing habits did not change despite changes in Web size and content. Hills and Argyle (2003) surveyed 220 adults to assess the frequency and location of their use of Internet services. The results showed that getting information in general was the second most popular service used by participants. One hundred seventy of the participants searched the Web, and the mean frequency of use was 3.27 (between “sometimes” and “frequently”).
According to Fox (2002), 85% of American Internet users have used search engines to find information. On a typical day, men (33%) and college students (39%) are more likely to use a search engine than women (25%) and high school graduates (20%). Search engines are the most popular tools for finding health, government, and religious information. According to the 2004 Digital Future Report (USC Annenberg School, Center for the Digital Future, 2004), the most popular Internet activities included Web surfing or browsing (ranked 2nd), finding hobby information (ranked 4th), finding entertainment information (ranked 5th), finding medical information (ranked 7th), and finding travel information (ranked 8th). About 77.2% of users used the Internet for Web surfing and browsing. The results of this study are comparable to the earlier data.
Definitions and Types of Web Search Engines
Search engines include crawler-based engines, human-powered directories, and hybrid search engines. Search engines in general can be classified into four types:
1. Web directories are hierarchically organized indexes that guide users in browsing through lists of Web sites by category or subject, such as Yahoo! Directory (http://www.dir.yahoo.com).
2. Search engines create a database of sites using robots or spiders, and they assist users in searching for information, such as Google (http://www.google.com).
3. Meta-search engines query multiple search engines simultaneously and return a complete set of hits, such as MetaCrawler (http://www.metacrawler.com).
Interactive IR in Web Search Engine Environments
4. Specialized search engines create a database of sites on a specific topic using robots or spiders, such as Diseases, Disorders, and related topics (www.mic.ki.se/Diseases/index.html).
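The meta-search idea in item 3 can be illustrated with a minimal sketch. It assumes each underlying engine has already returned a ranked list of URLs, and it merges those lists with reciprocal rank fusion, a common fusion technique chosen here only for illustration; it is not a description of how MetaCrawler or any other real meta-search engine combines results. The function name `merge_results`, the constant `k`, and the example URLs are all hypothetical.

```python
from collections import defaultdict


def merge_results(ranked_lists, k=60):
    """Fuse several ranked URL lists with reciprocal rank fusion (RRF).

    Each URL earns 1 / (k + rank) from every list it appears in, so URLs
    ranked highly by several engines rise to the top of the merged list.
    """
    scores = defaultdict(float)
    for results in ranked_lists:
        seen = set()
        for rank, url in enumerate(results, start=1):
            if url in seen:  # count a URL only once per engine
                continue
            seen.add(url)
            scores[url] += 1.0 / (k + rank)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical ranked hits from two engines for the same query.
engine_a = ["http://a.example", "http://b.example", "http://c.example"]
engine_b = ["http://b.example", "http://d.example", "http://a.example"]

merged = merge_results([engine_a, engine_b])
```

Because `http://b.example` is ranked near the top by both engines, it comes first in `merged`, ahead of pages that only one engine returned.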
Sullivan (2004) provided a guide for users to choose major Web search engines based on their reputation and usage. The top choices are Google for its comprehensive coverage and great relevancy, Yahoo! for its excellent search results and oldest directory, and Ask Jeeves for its smart search. Those strongly considered are AllTheWeb.com for its customizability, AOL Search for AOL users, HotBot for easy access to three major crawler-based search engines, and Teoma for its relevancy and “Refine” feature. Although the top choices of search engines have changed over time, the criteria for ranking Web search engines are still relevant.
A search engine represents one type of IR system and has a mechanism similar to that of an IR system. Liddy (2001) summarized four essential modules of a search engine: a document processor, a query processor, a search and matching function, and a ranking capability. The ranking capability is based on term frequency, location of terms, link analysis, popularity, date of publication, length, proximity of query terms, and proper nouns. Popularity tends to yield relevant results. For example, Google’s PageRank technology determines relevance based on how frequently a site is linked to by other sites. Arasu, Cho, Garcia-Molina, Paepcke, and Raghavan (2001) identified the following modules for a search engine: crawlers, crawler control, indexer module, collection analysis module, utility index, query engine, and ranking. Compared with Liddy’s modules, Arasu et al. added a crawler module that extracts URLs from the retrieved pages and sends this information to the crawler control module, which determines which links to visit next. Crawlers continue visiting the Web until local resources are exhausted.
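The crawling and link-based ranking described above can be sketched on a toy example. The code below is a simplified, textbook-style illustration under stated assumptions, not the architecture of Arasu et al. or Google’s production PageRank: the “Web” is a hypothetical in-memory dictionary (`TOY_WEB`), the crawler control is a plain breadth-first queue, the `max_pages` budget stands in for limited local resources, and the damping factor of 0.85 is the commonly quoted default.

```python
from collections import deque

# Hypothetical in-memory "Web": each page maps to the pages it links to.
TOY_WEB = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],  # never reached when crawling from "A"
}


def fetch_links(page):
    """Stand-in for downloading a page and extracting its URLs."""
    return TOY_WEB.get(page, [])


def crawl(seed, max_pages):
    """Breadth-first crawl that stops when the page budget is used up."""
    frontier = deque([seed])  # crawler control: which pages to visit next
    graph = {}
    while frontier and len(graph) < max_pages:
        page = frontier.popleft()
        if page in graph:
            continue
        links = fetch_links(page)
        graph[page] = links
        frontier.extend(link for link in links if link not in graph)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over the crawled graph."""
    pages = list(graph)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, links in graph.items():
            targets = [t for t in links if t in graph]  # ignore uncrawled pages
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no usable links spreads its rank evenly.
                for other in pages:
                    new_rank[other] += damping * rank[page] / n
        rank = new_rank
    return rank


crawled = crawl("A", max_pages=10)
ranks = pagerank(crawled)
```

In this toy graph, page `C` receives links from both `A` and `B`, so it ends up with the highest rank, which mirrors the idea that frequently linked pages are treated as more popular.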
Current Developments
Web search engines have seen several new advancements in recent years. First, there is a trend toward developing personalized searching tools on the Web. Notess (2006) noted that search engines have recently begun exploring personalized searching. These personalized search engines offer such features as saving URLs, archiving pages, organizing saved results into folders, blocking specific sites, and recording a search history. Although search history had been offered in online databases for decades, among Web search engines a search history feature was first introduced by A9, the Amazon-owned search engine, in April 2004. The search history feature is also available at Ask Jeeves, Google, Yahoo!, and several other search engines. To use this feature, a searcher usually needs to establish a free account and log in. It is a useful tool for searchers to track their own searches and understand their search patterns. However, it also raises the issue of privacy. Question answering (QA) systems are developed to satisfy users who want answers directly instead of browsing the documents in which the answers to their queries are embedded (Bar-Ilan, 2004).
Second, visual media account for a large portion of Web content, but very few search engines allow users to search effectively for images. According to Lew (2000), Web search engines such as Webseek, PictoSeek, and ImageRover apply the query-by-similar-images paradigm. By applying the query-by-icons paradigm, Lew and his colleagues developed a prototype system named ImageScape to search visual media on the Web. The main difference between the two paradigms is that query-by-icons allows users to state their queries in their own language and specify the importance of local pictorial features. The system enables users to search for images via keywords, semantic icons, and user-drawn sketches. O’Leary (2006) introduced blinkx.tv, a search engine that can search Web audio and video (AV) content. Blinkx.tv automatically reads AV content and creates text metadata that can be searched and browsed. For each item, the blinkx technology generates a text record consisting of a title, short description, date, source, and, for most videos, a short video clip or thumbnail image. For the time being, it indexes AV content on only 41 news, entertainment, and informational Web sites.
Third, researchers have worked on best practices and designs for new Web search engines and interfaces to facilitate users’ interactions with search engines. The one-size-fits-all approach of Web search engines cannot satisfy diverse user needs. Rose (2006) suggested designing different interfaces or different forms of interaction to match different search goals. The interface needs to facilitate the selection of contexts for the search as well as support iterative task processes. Users interact with the Internet via searching, browsing, and monitoring. Based on the nature of these interactions, Beale (2006) designed and implemented a system called Mitsukeru to support browsing behaviors. It employs an agent-based system to model the user’s behavior and determine the interaction context. The system consists of three parts: determining the current browsing context, determining the relevance of future pages, and communicating with users. Jones, Buchanan, Cheng, and Jain (2006) explored a relaxed Web searching style that asynchronously combined an offline handheld computer and an online desktop personal computer. Users can enter search terms on the offline handheld computer. All the captured queries are sent to a search engine when the handheld computer is connected to the PC. The search results can be distributed in different ways depending on the device.
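The capture-then-sync workflow explored by Jones et al. can be pictured with a small sketch. It is only an illustration of the general idea of queuing queries while offline and submitting them once a connection exists; the class `OfflineQueryQueue`, its methods, and the stand-in `fake_search` function are hypothetical and do not reproduce the authors’ system.

```python
class OfflineQueryQueue:
    """Holds queries entered offline until they can be sent to a search engine."""

    def __init__(self):
        self._pending = []

    def capture(self, query):
        """Record a query typed on the offline device; blank input is ignored."""
        query = query.strip()
        if query:
            self._pending.append(query)

    def sync(self, search):
        """Send every pending query through `search` and empty the queue."""
        results = {query: search(query) for query in self._pending}
        self._pending.clear()
        return results

    def __len__(self):
        return len(self._pending)


def fake_search(query):
    """Stand-in for a real search engine call (an assumption for this sketch)."""
    return ["result for " + query]


queue = OfflineQueryQueue()
queue.capture("web search engines")
queue.capture("   ")  # ignored: nothing was typed
queue.capture("image retrieval")

pending_before_sync = len(queue)
results = queue.sync(fake_search)  # performed once the device is connected
```

Passing the search function into `sync` keeps the queue independent of any particular engine, so the same captured queries could be routed differently depending on the device that receives the results.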
Fourth, technology development focuses on results presentation. In contrast to the list of retrieved results that general search engines return, Grokker sorts search results into subject categories (O’Leary, 2005). Grokker offers users an opportunity to explore the different aspects of a complex topic and examine all of the Web sites related to a particular aspect of a subject. Grokker is an interface to Yahoo! recently created by Groxis and Yahoo!. It provides a visual representation of the categories and subcategories of retrieved results and enables easy browsing among them.
Fifth, another new development in Web search engines is the emergence of a new breed of “community” search engines, sites such as Clipmarks where users share search results among themselves. According to Broida (2005), communities of knowledgeable, interested people can identify relevant sites with greater accuracy than a search engine. Moreover, users can save time and effort by building their own work on other users’ work.
Sixth, Web search engines offer services beyond searching Web sites. Many Web search engines have extended their services from Web search to desktop search applications. Google, Ask Jeeves, HotBot Desktop, Yahoo!, and AOL all offer their versions of desktop applications, either as a stand-alone system or as an Internet Explorer add-on (Pace, 2005). Rupley (2005) reported that Google offered the following new services: 1) allowing users to search within the text of books (http://print.google.com), and 2) enabling scientists and academic researchers to search across peer-reviewed papers, books, abstracts, and more (Google Scholar, http://scholar.google.com). Google Scholar has the potential to become the world’s most exhaustive academic library.
Challenges for Users
The Web is associated with “cognitive overload” and “disorientation” (Bilal, 2000, 2002). Web search engines are one type of IR system, and they use IR algorithms and techniques. However, IR algorithms were developed for relatively small and coherent collections, whereas Web materials are massive, less coherent, and change rapidly (Arasu et al., 2001). Sullivan (2005) reported how search engines have increased their sizes over time. AltaVista indexed the largest number of documents in December 1995, when it first became available. At the end of 1997, AltaVista and Northern Light hit the 150 million document mark. At the same time, AllTheWeb reached the 200 million mark. Google’s 500 million pages in June 2000 set a new record. After several years of competition with AllTheWeb and MSN, Google increased its index to 8 billion pages in November 2004 to compete with MSN’s increase to 5 billion. This growth leads to one of the most cited problems: users are not able to find information effectively (Kobayashi & Takeda, 2000). The huge size of search engines does not guarantee equal accessibility of information. Lawrence and Giles (1999) discussed the problem of accessibility of information on the Web. People cannot access all the information on the Web because no search engine indexes more than 16% of it, and it takes months before search engines index new pages. To make things worse, search engines index only a biased sample of sites based on links and popularity.
The emergence of the Internet has brought millions of users to search for information on the Web. Because of the simplicity of search engines’ interfaces, Web users carry the mental models they form while using Web search engines over to other types of information retrieval systems, such as OPACs and online systems. At the same time, users bring their mental models of one search engine to another even though each search engine has its own interface and search functions. Based on their study results, Wang, Hawk, and Tenopir (2000) concluded that there was little evidence that users changed their mental models from one search engine to another. Moreover, users did not change their search strategies. If a particular strategy did not work, they instead moved from one search engine to another.
Another challenge that users face is that they engage in low levels of interaction with search engines. Studies of commercial search engines show that users enter short queries, do not apply complicated search strategies, and do not use Boolean operators or advanced search features. Moreover, they view very few of the retrieved results. They expect Web search engines to act as humans do, and they communicate with systems the same way they communicate with humans. Moukdad and Large (2001) investigated users’ perceptions of the Web based on an analysis of the transaction logs of WebCrawler. They found extensive use of either single keywords or complete sentences, and the linguistic structure of the queries was similar to that of the human-human communication model, which cannot produce useful results in a human-computer communication environment. Their findings indicated that the Web search engine was approached as a human expert. It is therefore crucial to design more intelligent and interactive Web search engines.
Further, users cannot effectively interact with Web search engines to find relevant information, and they cannot effectively evaluate the retrieved information. They normally spend only a little time reviewing retrieved documents. More importantly, there is no quality control mechanism on the Web. It is a challenge for users to make judgments about information quality and authority on the Web. Henzinger, Motwani, and Silverstein (2002) discussed the challenges in Web search engines, one of the major problems being content quality on the Web. The Web contains low-quality, unreliable, and sometimes contradictory information. Henzinger et al. called for Web search engines to offer quality Web pages for all search requests. After reviewing a series of Web studies and conducting her own study, Rieh (2002) concluded that the Web environment, with its heterogeneous objects and diverse approaches to information organization, made this problem worse. Not all user groups challenged the quality and authority of the retrieved information. Children in particular blindly trusted information they retrieved on the Web. They need to be taught to challenge and question what they find there (Schacter, Chung, & Dorr, 1998).
Research Overview
Although research on Web search engines and their uses started only in the 1990s, quite a few review articles already provide overviews of various aspects of Web search engine research. Bar-Ilan (2004) comprehensively reviewed the literature about the use of Web search engines in information science research. This review concentrated on the following aspects of Web search engine research: 1) social perspectives (the ways users interact with Web search engines and the social effects of Web searching), 2) theoretical perspectives (the structure and dynamic nature of the Web, link analysis, Web impact factors, other bibliometric applications for the Web, and characterizing information on the Web), and 3) applications-centered perspectives (evaluation of search engines, improvements of existing tools, and new directions). Yang (2005) presented an overview of information retrieval on the Web emphasizing Web retrieval strategies. The review also covers studies on characteristics of the Web search environment, essential approaches in Web IR research, and the classification of Web documents. In his review, Large (2005) focused on Web use by children and teenagers, covering a national survey of access to and use of the Web; information-seeking behavior; design criteria; Web applications for education, leisure, and social interaction; Web content and personal safety in the Web environment; and future research agendas. He pointed out that, despite the increasing number of studies, more research is needed on children’s and teenagers’ information-seeking behavior on the Web, especially comparisons of their behavior with that of adults.
Some of the reviews concentrate on patterns of Web searching. For example, Jansen and Pooch (2001) reviewed Web-searching studies on query analysis mainly based on log analysis. They further compared traditional IR, OPAC, and Web search studies in terms of document collection size, number of queries in the data set, session length, query length, use of Boolean operators, failure rate, use of modifiers, and number of relevant documents viewed in a session. Spink (2003) provided an overview of research on Web searching from 1997 to 2002 focusing on large-scale Web data from commercial Web search engines. The overview covers the search topics, query usage patterns, and types of searches for different types of information. According to the review, while users still entered short queries across time, they did shift their searches from entertainment to e-commerce. She also noted the emergence of successive and multitasking searches in the Web environment.
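Two of the measures compared in such log-based studies, query length and the use of Boolean operators, can be computed with a few lines of code. The sketch below is a minimal illustration rather than the method of any cited study: it assumes a log that is simply a list of query strings, measures query length in whitespace-separated terms, and counts an operator only when AND, OR, or NOT appears in upper case. The function name `summarize_queries` and the sample queries are hypothetical.

```python
# Assumption: operators are recognized only when typed in upper case.
BOOLEAN_OPERATORS = {"AND", "OR", "NOT"}


def summarize_queries(queries):
    """Return the query count, mean terms per query, and share using Boolean operators."""
    total = len(queries)
    term_counts = []
    boolean_queries = 0
    for query in queries:
        terms = query.split()
        term_counts.append(len(terms))
        if any(term in BOOLEAN_OPERATORS for term in terms):
            boolean_queries += 1
    return {
        "queries": total,
        "mean_terms": sum(term_counts) / total if total else 0.0,
        "boolean_share": boolean_queries / total if total else 0.0,
    }


# Hypothetical log entries, invented for illustration only.
sample_log = [
    "cheap flights",
    "python",
    "jaguar AND car NOT animal",
    "weather in geneva today",
]

summary = summarize_queries(sample_log)
```

For this invented sample, the mean query length is 3.0 terms and one query in four uses a Boolean operator; real transaction logs would also need session boundaries to measure session length.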
Technology and techniques for Web search engine retrieval are another important aspect for review. Rasmussen (2003) summarized current research on indexing and ranking of Web search engines focusing on automated techniques for indexing and retrieval. Kobayashi and Takeda (2000) reviewed studies of the Internet and technology that are useful for information retrieval on the Web. The review was organized into three sections. The first section discussed three major components of the Internet: search engine ratings and features, information covered on the Internet, and the growth of users. The second section covered the tools for Web retrieval, which consisted of both traditional retrieval tools and new generation tools. The third section pointed out the future directions of Web retrieval. According to Kobayashi and Takeda, intelligent and adaptive Web services are the future direction.
Evaluation of Web search engines is an essential component of research on search engines. Oppenheim, Morris, McKnight, and Lowley (2000) reviewed the literature on the evaluation of Web search engines, mainly emphasizing methodologies for evaluation and the actual evaluation criteria. The problem for evaluation is that no standard tools have been developed for evaluating Web search engines.
Su (2003a) reviewed relevant literature from 1995-2000 for the development of a model of user evaluation of Web search engines. The proposed model focuses on performance measures associated with both users and systems and nonperformance characteristics related to users. She found there was a lack of evaluation from the end-user perspective.
Interaction Studies
Interaction studies in Web search engine environments can be classified into the following categories: (1) levels of user goals/tasks, (2) usage patterns: patterns of query formulation and reformulation, (3) patterns of multimedia IR, (4) information search behaviors/strategies of different user groups, (5) the impact of knowledge structure, (6) criteria for the evaluation of Web search engines, and (7) comparison with other online IR systems.