Exercise 5 - Think Data Structures

In WikiPhilosophy.java you’ll find a simple main method that shows how to use some of these pieces. Starting with this code, your job is to write a crawler that:

1. Takes a URL for a Wikipedia page, downloads it, and parses it.

2. Traverses the resulting DOM tree to find the first valid link. I’ll explain what “valid” means below.

3. If the page has no links, or if the first link is a page we have already seen, the program should indicate failure and exit.

4. If the link matches the URL of the Wikipedia page on philosophy, the program should indicate success and exit.

5. Otherwise it should go back to Step 1.

The program should build a List of the URLs it visits and display the results at the end (whether it succeeds or fails).
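The steps above can be sketched as a loop. This is only an outline under stated assumptions, not the solution to the exercise: the class name CrawlSketch and the firstLink parameter are hypothetical, and firstLink stands in for the real work of downloading a page, parsing it, and finding its first valid link (it returns null when a page has none).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public class CrawlSketch {

    /**
     * Follows first links starting at start and returns the URLs visited, in order.
     * firstLink is a hypothetical stand-in for "download, parse, find the first
     * valid link"; it returns null if the page has no valid link.
     */
    public static List<String> crawl(String start, String target,
                                     Function<String, String> firstLink) {
        List<String> visited = new ArrayList<>();
        String url = start;
        while (true) {
            visited.add(url);
            if (url.equals(target)) {
                System.out.println("Success: reached " + target);
                break;
            }
            String next = firstLink.apply(url);
            if (next == null) {
                System.out.println("Failure: no valid link on " + url);
                break;
            }
            if (visited.contains(next)) {
                System.out.println("Failure: already seen " + next);
                break;
            }
            url = next;
        }
        // Display the results whether we succeeded or failed.
        for (String v : visited) {
            System.out.println(v);
        }
        return visited;
    }
}
```

Passing the link finder in as a function makes it possible to try the loop on a small in-memory map of pages before pointing it at Wikipedia.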

So what should we consider a “valid” link? You have some choices here. Various versions of the “Getting to Philosophy” conjecture use slightly different rules, but here are some options:

1. The link should be in the content text of the page, not in a sidebar or boxout.

2. It should not be in italics or in parentheses.

3. You should skip external links, links to the current page, and red links.

4. In some versions, you should skip a link if the text starts with an upper-case letter.

You don’t have to enforce all of these rules, but we recommend that you at least handle parentheses, italics, and links to the current page.

If you feel like you have enough information to get started, go ahead. Or you might want to read these hints:

1. As you traverse the tree, the two kinds of Node you will need to deal with are TextNode and Element. If you find an Element, you will probably have to typecast it to access the tag and other information.

2. When you find an Element that contains a link, you can check whether it is in italics by following parent links up the tree. If there is an <i> or <em> tag in the parent chain, the link is in italics.

3. To check whether a link is in parentheses, you will have to scan through the text as you traverse the tree and keep track of opening and closing parentheses (ideally your solution should be able to handle nested parentheses (like this)).

4. If you start from the Java page, you should get to Philosophy after following seven links, unless something has changed since I ran the code.
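As a small illustration of the third hint, here is one way the parenthesis bookkeeping might look. It is a sketch, not the intended solution: the class ParenTracker and its methods are hypothetical names, and it assumes you call scan on the text of each TextNode in the order you encounter it during the traversal.

```java
public class ParenTracker {
    // Number of parentheses currently open; more than 0 means we are inside.
    private int depth = 0;

    /** Updates the nesting depth using the parentheses in one chunk of text. */
    public void scan(String text) {
        for (char c : text.toCharArray()) {
            if (c == '(') {
                depth++;
            } else if (c == ')' && depth > 0) {
                // Ignore a stray closing parenthesis rather than go negative.
                depth--;
            }
        }
    }

    /** Returns true if the text scanned so far leaves a parenthesis open. */
    public boolean insideParens() {
        return depth > 0;
    }
}
```

Keeping a count rather than a boolean flag is what lets this handle nested parentheses: a link is outside parentheses only when the count is back to zero.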

OK, that’s all the help you’re going to get. Now it’s up to you. Have fun!

Chapter 8 Indexer

At this point we have built a basic Web crawler; the next piece we will work on is the index. In the context of web search, an index is a data structure that makes it possible to look up a search term and find the pages where that term appears. In addition, we would like to know how many times the search term appears on each page, which will help identify the pages most relevant to the term.

For example, if a user submits the search terms “Java” and “programming”, we would look up both search terms and get two sets of pages. Pages with the word “Java” would include pages about the island of Java, the nickname for coffee, and the programming language. Pages with the word “programming” would include pages about different programming languages, as well as other uses of the word. By selecting pages with both terms, we hope to eliminate irrelevant pages and find the ones about Java programming.

Now that we understand what the index is and what operations it performs, we can design a data structure to represent it.

8.1 Data structure selection

The fundamental operation of the index is a lookup; specifically, we need the ability to look up a term and find all pages that contain it. The simplest implementation would be a collection of pages. Given a search term, we could iterate through the contents of the pages and select the ones that contain the search term. But the run time would be proportional to the total number of words on all the pages, which would be way too slow.

A better alternative is a map, which is a data structure that represents a collection of key-value pairs and provides a fast way to look up a key and find the corresponding value. For example, the first map we’ll construct is a TermCounter, which maps from each search term to the number of times it appears in a page. The keys are the search terms and the values are the counts (also called “frequencies”).

Java provides an interface called Map that specifies the methods a map should provide; the most important are:

get(key): This method looks up a key and returns the corresponding value.

put(key, value): This method adds a new key-value pair to the Map, or if the key is already in the map, it replaces the value associated with key.
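To make get and put concrete, here is a minimal sketch of counting terms with a Map. It is not the TermCounter class itself: the class name TermCountSketch, the method countTerms, and the choice to lower-case the text and split on non-word characters are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class TermCountSketch {

    /** Maps each term in text to the number of times it appears. */
    public static Map<String, Integer> countTerms(String text) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: terms are runs of word characters, compared case-insensitively.
        for (String term : text.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) {
                continue; // split can yield an empty first element
            }
            Integer old = counts.get(term);               // null if the key is absent
            counts.put(term, old == null ? 1 : old + 1);  // adds or replaces the value
        }
        return counts;
    }
}
```

For the text "Java is fun. Java is fast.", the resulting map associates "java" with 2 and "fun" with 1, and get returns null for a term that never appears.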

Java provides several implementations of Map, including the two we will focus on, HashMap and TreeMap. In upcoming chapters, we’ll look at these implementations and analyze their performance.

In addition to the TermCounter, which maps from search terms to counts, we will define a class called Index, which maps from a search term to a collection of pages where it appears. And that raises the next question, which is how to represent a collection of pages. Again, if we think about the operations we want to perform, that guides our decision.

In this case, we’ll need to combine two or more collections and find the pages that appear in all of them. You might recognize this operation as set intersection: the intersection of two sets is the set of elements that appear in both.
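Here is a minimal sketch of set intersection in Java, assuming for illustration that each page is identified by a string such as its URL. The class IntersectSketch and the method intersection are hypothetical names; the sketch relies on retainAll from java.util.Set, which removes from a set every element that is not in the other collection.

```java
import java.util.HashSet;
import java.util.Set;

public class IntersectSketch {

    /** Returns a new set containing the elements that appear in both a and b. */
    public static <T> Set<T> intersection(Set<T> a, Set<T> b) {
        Set<T> result = new HashSet<>(a); // copy, so the arguments are not modified
        result.retainAll(b);              // keep only the elements also in b
        return result;
    }
}
```

If the pages containing “Java” and the pages containing “programming” are each stored in a set, calling intersection on the two yields the pages that contain both terms.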

As you might expect by now, Java provides a Set interface that defines the operations a set should perform. It doesn’t actually provide set intersection, but it provides methods that make it possible to implement intersection and other set operations efficiently. The core Set methods are:
