A Scraping Example

As an example, you’ll be taking a list of latitudes and longitudes for the capital cities of many countries in the world. The page that you’ll scrape is located at http://googlemapsbook.com/chapter5/scrape_me.html. It’s not the most challenging scraping example, but it will serve our purposes.

The first thing you need to do is use wget to retrieve a local copy of the page. From the shell, run the following command while in your working directory for this example:

wget http://googlemapsbook.com/chapter5/scrape_me.html

Tip

If you would prefer to snag this page live from the Web directly from within your code, then grab a snippet of the CURL code from Chapter 4’s geocoding web services examples. The only trick should be splitting up the result on the newlines to form an array of lines, instead of using fgets() to read each line in sequence.
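As a rough illustration of that tip, the sketch below downloads a page with PHP's cURL extension and splits the body into an array of lines. The function names fetch_page() and split_lines() are hypothetical and are not taken from Chapter 4. The regular expression is one possible way to cope with different line-ending styles.

```php
<?php
// Hypothetical helper: download a page with cURL and return its body as a string.
// Requires the cURL extension and network access.
function fetch_page($url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects, if any
    $body = curl_exec($ch);
    return $body === false ? '' : $body;
}

// Hypothetical helper: split a page into an array of lines,
// accepting \r\n, \r, and \n line endings.
function split_lines($html) {
    return preg_split('/\r\n|\r|\n/', $html);
}

// Possible usage:
// $lines = split_lines(fetch_page("http://googlemapsbook.com/chapter5/scrape_me.html"));
// foreach ($lines as $buffer) { /* same per-line logic as with fgets() */ }
```

Because the split already drops the line endings, the per-line comparisons no longer depend on a trailing newline, although calling trim() on each line is still a reasonable precaution against stray whitespace.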

Next, you need to do some analysis of the HTML of this page to decide what you can do with it. Listing 5-7 shows the important bits for our discussion.

Listing 5-7. Snippets of HTML from the Sample Scraping Page

(After about 10 lines of header HTML you'll find this...)

<!-- Content Body -->
<table border="1" width="100%">
<tr>
<td >Country</td>
<td >Capital City</td>
<td >Latitude</td>
<td >Longitude</td></tr>
<tr><td class="latlongtable">Afghanistan</td>
<td class="latlongtable">Kabul</td>
<td class="latlongtable">34.28N</td>
<td class="latlongtable">69.11E</td></tr>
<tr><td class="latlongtable">Albania</td>
<td class="latlongtable">Tirane</td>
<td class="latlongtable">41.18N</td>
<td class="latlongtable">19.49E</td></tr>
<tr><td class="latlongtable">Algeria</td>
<td class="latlongtable">Algiers</td>
<td class="latlongtable">36.42N</td>
<td class="latlongtable">03.08E</td></tr>

(and 190 countries later...)

C H A P T E R 5■ M A N I P U L AT I N G T H I R D - PA RT Y D ATA 114

<tr><td class="latlongtable">Zambia</td>
<td class="latlongtable">Lusaka</td>
<td class="latlongtable">15.28S</td>
<td class="latlongtable">28.16E</td></tr>
<tr><td class="latlongtable">Zimbabwe</td>
<td class="latlongtable">Harare</td>
<td class="latlongtable">17.43S</td>
<td class="latlongtable">31.02E</td>
</tr>
</table>
<!-- Content Body End -->

So how do you extract the information that you care about? The first thing is to find the patterns that you can exploit. In our case, we’re going to ignore all of the data that comes before the HTML comment <!-- Content Body --> and after the closing comment <!-- Content Body End -->. In between, we’ll care about only the lines where class="latlongtable" appears. We’re lucky that the data we care about is surrounded entirely by HTML and that PHP has a handy function to remove it: strip_tags(). The heaviest string mangling we need to do is determining the sign of the latitude and longitude measurements based on the N/S and E/W labels: south latitudes and west longitudes become negative. You can see the required code in Listing 5-8.

Listing 5-8. Screen Scraping Example

<?php
// Open the file and the database
// (the mysql_* functions were removed in PHP 7; newer versions need mysqli or PDO)
$handle = @fopen("scrape_me.html","r");
$conn = mysql_connect("localhost","username","password");
mysql_select_db("geocoding_experiment",$conn);

// Status flags and temporary variables
$in_main_table = false;
$count = 0;

if ($handle) {
    while (!feof($handle)) {
        $buffer = fgets($handle, 4096);

        // Look for "<!-- Content Body -->"
        if (trim($buffer) == "<!-- Content Body -->") {
            $in_main_table = true;
            continue;
        }

        // For each line that has "latlongtable" in it
        if ($in_main_table && strstr($buffer,'class="latlongtable"') !== false) {
            // Dig out the part we care about
            $interesting_data = trim(strip_tags($buffer));

            switch($count % 4) {
                case 0:
                    // Country Info
                    $city = array(); // reset
                    $city[0] = addslashes($interesting_data);
                    break;
                case 1:
                    // Capital City Info
                    $city[1] = addslashes($interesting_data);
                    break;
                case 2:
                    // Latitude Information (determine sign)
                    $latitude = substr($interesting_data,0,strlen($interesting_data)-1);
                    if (substr($interesting_data,-1,1) == 'S') $sign = "-";
                    else $sign = "";
                    $city[2] = $sign.$latitude;
                    break;
                case 3:
                    // Longitude Information (determine sign)
                    $longitude = substr($interesting_data,0,strlen($interesting_data)-1);
                    if (substr($interesting_data,-1,1) == 'W') $sign = "-";
                    else $sign = "";
                    $city[3] = $sign.$longitude;
                    echo implode(" ",$city)."<br />";

                    // Write to the database
                    $result = mysql_query("INSERT INTO capital_cities
                        (country,capital,lat,lng) VALUES ('".implode("','",$city)."')");
                    break;
            } // switch

            // Increment our counter
            $count++;
        } // if

        // Stop when we find "<!-- Content Body End -->"
        // (trim() removes the newline that fgets() leaves on the line)
        if (trim($buffer) == "<!-- Content Body End -->") break;
    } // while

    fclose($handle);
} // if
?>
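To try the parsing logic without a database, the same steps can be pulled into functions that work on an array of lines. The following is a minimal sketch rather than the book's code: signed_coordinate() and extract_rows() are hypothetical names, and array_chunk() stands in for the $count % 4 bookkeeping used in the listing.

```php
<?php
// Hypothetical helper: turn a label such as "15.28S" into "-15.28".
function signed_coordinate($label) {
    $label = trim($label);
    $hemisphere = strtoupper(substr($label, -1)); // last character: N, S, E, or W
    $magnitude = substr($label, 0, -1);           // everything before it
    // South latitudes and west longitudes are negative.
    $sign = ($hemisphere === 'S' || $hemisphere === 'W') ? '-' : '';
    return $sign . $magnitude;
}

// Hypothetical helper: collect the text of every "latlongtable" cell between
// the two Content Body comments and return [country, capital, lat, lng] rows.
function extract_rows(array $lines) {
    $in_main_table = false;
    $cells = array();
    foreach ($lines as $buffer) {
        $line = trim($buffer);
        if ($line === "<!-- Content Body -->") {
            $in_main_table = true;
            continue;
        }
        if ($line === "<!-- Content Body End -->") {
            break;
        }
        if ($in_main_table && strpos($line, 'class="latlongtable"') !== false) {
            $cells[] = trim(strip_tags($line));
        }
    }
    $rows = array();
    foreach (array_chunk($cells, 4) as $row) {
        if (count($row) < 4) continue; // skip an incomplete trailing row
        $row[2] = signed_coordinate($row[2]);
        $row[3] = signed_coordinate($row[3]);
        $rows[] = $row;
    }
    return $rows;
}

$sample = array(
    '<td class="latlongtable">Ignored</td>',
    '<!-- Content Body -->',
    '<tr><td class="latlongtable">Zambia</td>',
    '<td class="latlongtable">Lusaka</td>',
    '<td class="latlongtable">15.28S</td>',
    '<td class="latlongtable">28.16E</td></tr>',
    '<!-- Content Body End -->',
);
print_r(extract_rows($sample));
```

Keeping the coordinates as strings preserves leading zeros such as 03.08 until the database converts them to floats.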

You can store this information using a database table like the one in Listing 5-9.

Listing 5-9. SQL Database Structure for the Screen Scraping Example

CREATE TABLE capital_cities (
    uid int(11) NOT NULL auto_increment,
    country text NOT NULL,
    capital text NOT NULL,
    lat float NOT NULL default '0',
    lng float NOT NULL default '0',
    PRIMARY KEY (uid),
    KEY lat (lat,lng)
) ENGINE=MyISAM;
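On PHP versions where the mysql_* functions are no longer available, one alternative for filling this table is PDO with a prepared statement, which also removes the need for addslashes(). The sketch below swaps in a different technique from the one used in the listing. The function name insert_capital() is hypothetical, and the connection details in the usage comment simply reuse the placeholder credentials from Listing 5-8.

```php
<?php
// Hypothetical helper: insert one [country, capital, lat, lng] row through a
// prepared statement. The driver binds the values, so no manual escaping is needed.
function insert_capital(PDO $pdo, array $city) {
    $stmt = $pdo->prepare(
        "INSERT INTO capital_cities (country, capital, lat, lng) VALUES (?, ?, ?, ?)"
    );
    return $stmt->execute(array_values($city));
}

// Possible usage (assumes the pdo_mysql driver and the placeholder credentials):
// $pdo = new PDO("mysql:host=localhost;dbname=geocoding_experiment", "username", "password");
// $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// insert_capital($pdo, array("Zambia", "Lusaka", "-15.28", "28.16"));
```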

Note

We hereby explicitly grant permission to any person who has purchased this book to use the information contained in the body table of scrape_me.html for any purpose (commercial or otherwise), provided it is in conjunction with a map built on the Google Maps API and conforms to Google’s terms of service. We make no warranties about the accuracy of the information (in fact, there is one deliberate error) or its suitability for any purpose.