
Using the crawler4j web crawler


        out.println((pageList.size() + 1) + ": [" + url + "]");
        pageList.add(url);

        // Process page links
        Elements questions = doc.select("a[href]");
        for (Element link : questions) {
            if (link.attr("href").contains(urlLimiter)) {
                visitPage(link.attr("abs:href"));
            }
        }
    }

This approach only examines the links of pages that contain the topic text. Moving the for loop outside of the if statement would process the links of every page visited, whether or not it contains the topic.
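As a rough sketch of that rearrangement, assuming the enclosing visitPage method, the jsoup Document retrieval, and the topic string used by the preceding listing, the reordered body might look like this:

        // Sketch: the topic test now controls only counting and printing,
        // while the link loop runs for every page that is visited.
        // (Jsoup.connect and the topic field are assumed from the earlier example.)
        Document doc = Jsoup.connect(url).get();
        if (doc.text().contains(topic)) {
            out.println((pageList.size() + 1) + ": [" + url + "]");
            pageList.add(url);
        }
        // Process the links of every page, not just topic pages
        Elements questions = doc.select("a[href]");
        for (Element link : questions) {
            if (link.attr("href").contains(urlLimiter)) {
                visitPage(link.attr("abs:href"));
            }
        }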

The output of the original, topic-restricted version follows:

1: [https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly]
2: [https://en.wikipedia.org/wiki/Bishop_Rock_Lighthouse]
3: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=717634231#Lighthouse]
4: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=717634231]
5: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716622943]
6: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716622943]
7: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&oldid=716608512]
8: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716608512]
...
20: [https://en.wikipedia.org/w/index.php?title=Bishop_Rock,_Isles_of_Scilly&diff=prev&oldid=716603919]

In this example, we did not save the results of the crawl to an external source. Normally this is necessary; the results can be stored in a file or a database.
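As a minimal sketch, assuming pageList is the list of visited URLs built by the earlier example (the file name crawl-results.txt is hypothetical), the results could be written to a text file like this:

        // Requires java.nio.file.Files, java.nio.file.Paths,
        // java.nio.charset.StandardCharsets, and java.io.IOException.
        // pageList is assumed to be the List<String> of visited URLs.
        try {
            Files.write(Paths.get("crawl-results.txt"), pageList,
                    StandardCharsets.UTF_8);
        } catch (IOException ex) {
            ex.printStackTrace();
        }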

The crawler4j crawler is configured through a CrawlConfig instance. The following settings can be specified:

• Crawl storage folder: The location where crawl data is stored
• Number of crawlers: The number of threads used for the crawl
• Politeness delay: How many milliseconds to pause between requests
• Crawl depth: How deep the crawl will go
• Maximum number of pages to fetch: How many pages to fetch
• Binary data: Whether to crawl binary data such as PDF files

The basic class is shown here:

public class CrawlerController {

    public static void main(String[] args) throws Exception {
        int numberOfCrawlers = 2;
        CrawlConfig config = new CrawlConfig();
        String crawlStorageFolder = "data";

        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setPolitenessDelay(500);
        config.setMaxDepthOfCrawling(2);
        config.setMaxPagesToFetch(20);
        config.setIncludeBinaryContentInCrawling(false);
        ...
    }
}

Next, the CrawlController class is created and configured. Notice the RobotstxtConfig and RobotstxtServer classes used to handle robots.txt files. These files contain instructions intended to be read by a web crawler; they help a crawler do a better job by, for example, specifying which parts of a site should not be crawled. This is useful for auto-generated pages:

PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer =
    new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller =
    new CrawlController(config, pageFetcher, robotstxtServer);

The crawler needs to start at one or more pages. The addSeed method adds the starting pages. While we used the method only once here, it can be used as many times as needed:

controller.addSeed(
    "https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly");

The start method will begin the crawling process:

controller.start(SampleCrawler.class, numberOfCrawlers);

The SampleCrawler class contains two methods of interest: the shouldVisit method, which determines whether a page will be visited, and the visit method, which actually handles the page. We start with the class declaration and the declaration of a Java regular expression Pattern object. The pattern is one way of determining whether a page will be visited. In this declaration, common image file extensions are specified; URLs matching them will be ignored:

public class SampleCrawler extends WebCrawler {

    private static final Pattern IMAGE_EXTENSIONS =
        Pattern.compile(".*\\.(bmp|gif|jpg|png)$");
    ...
}

The shouldVisit method is passed a reference to the page where the URL was found, along with the URL itself. If the URL matches any of the image extensions, the method returns false and the page is ignored. In addition, the URL must start with https://en.wikipedia.org/wiki/. We added this to restrict our searches to the Wikipedia website:

public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    if (IMAGE_EXTENSIONS.matcher(href).matches()) {
        return false;
    }
    return href.startsWith("https://en.wikipedia.org/wiki/");
}

The visit method is passed a Page object representing the page being visited. In this implementation, only those pages containing the string shipping route will be processed. This further restricts the pages visited. When we find such a page, its URL, text, and text length are displayed:

public void visit(Page page) {
    String url = page.getWebURL().getURL();
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData =
            (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();
        if (text.contains("shipping route")) {
            out.println("\nURL: " + url);
            out.println("Text: " + text);
            out.println("Text length: " + text.length());
        }
    }
}

The following is the truncated output of the program when executed:

URL: https://en.wikipedia.org/wiki/Bishop_Rock,_Isles_of_Scilly

Text: Bishop Rock, Isles of Scilly...From Wikipedia, the free encyclopedia ... Jump to: ... navigation, search For the Bishop Rock in the Pacific Ocean, see Cortes Bank. Bishop Rock Bishop Rock Lighthouse (2005)

...

Text length: 14677

Notice that only one page was returned. This web crawler was able to identify and ignore previous versions of the main web page: the old-revision URLs use the https://en.wikipedia.org/w/index.php? form, so they fail the startsWith test in shouldVisit.

We could perform further processing, but this example provides some insight into how the API works. A significant amount of information can be obtained when visiting a page. In the example, we only used the URL and the length of the text. The following is a sample of other data that you may be interested in obtaining (a sketch of how to access it follows the list):

• URL path
• Parent URL
• Anchor
• HTML text
• Outgoing links
• Document ID
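As a rough sketch, and assuming the crawler4j Page, WebURL, and HtmlParseData types used above, these values could be retrieved inside the visit method roughly as follows; the getter names reflect the usual crawler4j API and should be checked against the version you are using:

    // Sketch of pulling additional data inside visit(Page page).
    WebURL webUrl = page.getWebURL();
    out.println("URL path: " + webUrl.getPath());
    out.println("Parent URL: " + webUrl.getParentUrl());
    out.println("Anchor: " + webUrl.getAnchor());
    out.println("Document ID: " + webUrl.getDocid());
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        out.println("HTML length: " + htmlParseData.getHtml().length());
        out.println("Outgoing links: "
            + htmlParseData.getOutgoingUrls().size());
    }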

