Exercise: Use Table Extraction, Try a Different Library

After scratching our heads at the perplexities illustrated by this PDF-to-text conversion, we went searching for alternatives to using pdfminer for table extraction. We came across pdftables, which is a presumed-defunct library (the last update from the original maintainers was more than two years ago).

We installed the necessary libraries, which can be done simply by running pip install pdftables and pip install requests. The maintainers didn’t keep all the documentation up to date, so certain examples in the documentation and README.md were blatantly broken. Despite that, we did find one “all in one” function we were able to use to get at our data:

from pdftables import get_tables

all_tables = get_tables(open('EN-FINAL Table 9.pdf', 'rb'))
print all_tables

Let’s start a new file for our code and run it (pdf_table_data.py). You should see a whirlwind of data that looks like the data we want to extract. You will notice not all of the headers convert perfectly, but it seems every line is contained in the all_tables variable. Let’s take a closer look to extract our headers, data columns, and notes.

You might have noticed all_tables is a list of lists (or a matrix). It holds every row, nested one level deeper than you might expect: a list of pages, each of which is a list of rows. This is probably a good idea in table extraction, because that’s essentially what a table is: columns and rows. The get_tables function returns each page as its own table, and each of those tables has a list of rows with a contained list of columns.
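To make that nesting concrete, here is a minimal sketch that uses made-up values in place of the real get_tables output. The sample_tables name and everything in it are hypothetical placeholders, and the code is written so that it also runs on Python 3, where print is a function.

```python
# Hypothetical stand-in for the get_tables result: pages -> rows -> cells.
# All values are placeholders, not figures from the PDF.
sample_tables = [
    [                                          # page 1
        ['Countries and areas', 'total', 'male'],
        ['Country A', '10', '11'],
    ],
    [                                          # page 2
        ['Countries and areas', 'total', 'male'],
        ['Country B', '-', '-'],
    ],
]

first_page = sample_tables[0]      # one page is a list of rows
second_row = first_page[1]         # one row is a list of cell strings
first_cell = second_row[0]         # one cell is a single string

print(len(sample_tables))          # number of pages
print(second_row)
print(first_cell)
```

Indexing twice, as in all_tables[0][:6], therefore means "the first six rows of the first page."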

Our first step is to find the titles we can use for our columns. Let’s try viewing the first few rows to see if we can identify the one containing our column headers:

print all_tables[0][:6]

Here we are just looking at the first page’s first six rows:

... [u'', u'', u'', u'', u'', u'', u'Birth', u'Female',

u'genital mutila', u'tion/cutting (%)+', u'Jus',

u'tification of', u'',

u'', u'E'], [u'', u'',

u'Child labour (%', u')+',

u'Child m', u'arriage (%)', u'registration', u'',

u'2002\u201320', u'12*',

u'wife', u'beating (%)', u'',

u'Violent disciplin', u'e (%)+ 9'],

[u'Countries and areas', u'total',

u'2005\u20132012*male', u'female',

u'2005married by 15', u'\u20132012*married by 18', u'(%)+ 2005\u20132012*total', u'prwomena',

u'evalencegirlsb',

Learning How to Solve Problems | 117

u'attitudessupport for thepracticec', u'2male',

u'005\u20132012*female', u'total',

u'2005\u20132012*male', u'female'],...

We can see the titles are included in the first three lists, and they are messy. However, we can also see from our print statement that the rows are actually fairly clean. If we manually set up our titles by comparing them to the PDF, as shown here, we might have a clean dataset:

headers = ['Country', 'Child Labor 2005-2012 (%) total',
           'Child Labor 2005-2012 (%) male',
           'Child Labor 2005-2012 (%) female',
           'Child Marriage 2005-2012 (%) married by 15',
           'Child Marriage 2005-2012 (%) married by 18',
           'Birth registration 2005-2012 (%)',
           'Female Genital mutilation 2002-2012 (prevalence), women',
           'Female Genital mutilation 2002-2012 (prevalence), girls',
           'Female Genital mutilation 2002-2012 (support)',
           'Justification of wife beating 2005-2012 (%) male',
           'Justification of wife beating 2005-2012 (%) female',
           'Violent discipline 2005-2012 (%) total',
           'Violent discipline 2005-2012 (%) male',
           'Violent discipline 2005-2012 (%) female']

for table in all_tables:
    for row in table:
        print zip(headers, row)

The headers list gathers all of our column titles, including the one for the country names, into one list. We can now zip this list with rows to have our data aligned.

The print line uses the built-in zip function to pair each header with the matching value in each row.
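As a small illustration of what zip does here, the sketch below pairs a shortened, made-up header list with made-up rows (short_headers, full_row, and short_row are hypothetical). One behavior worth knowing is that zip stops at the shorter input, so a row with fewer cells than headers quietly loses the trailing headers. On Python 3, zip returns an iterator, so the example wraps it in list().

```python
short_headers = ['Country', 'total', 'male']   # hypothetical, shortened header list
full_row = ['Country A', '10', '11']           # made-up values
short_row = ['Country B', '7']                 # one cell fewer than the headers

paired = list(zip(short_headers, full_row))
truncated = list(zip(short_headers, short_row))

print(paired)      # three (header, value) pairs
print(truncated)   # only two pairs: the 'male' header has no partner
```

This is one reason to check row lengths before trusting the zipped output.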

We can see from the output of our code that we have matches for some of the rows, but there are also many rows that are not country rows (similar to what we saw earlier when we found extra spaces and newlines in our table).

We want to programmatically solve this problem with some tests based on what we’ve learned so far. We know some of the countries span more than one row. We also know the file uses dashes (-) to show missing data, so completely empty rows are not actual data rows. We know from our previous print output that the data for each page starts at index 5 of that page’s rows. We also know the last row we care about is Zimbabwe.

Let’s combine our knowledge and see what we get:

for table in all_tables:
    for row in table[5:]:
        if row[2] == '':
            print row

Isolates only the rows for each page we want, meaning only the slice from the fifth index onward

If there is data that looks null, prints out the row to see what’s in that row
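A quick sketch of what the slice and the emptiness test select, using a made-up page (sample_page and its values are hypothetical). Note that a slice written as [5:] drops indexes 0 through 4, so the first row it keeps is the sixth one.

```python
# Made-up page: five header-like rows followed by three data-like rows.
sample_page = [
    ['h0', '', ''], ['h1', '', ''], ['h2', '', ''], ['h3', '', ''], ['h4', '', ''],
    ['Country A', '10', '11'],
    ['', '', ''],                  # blank row
    ['Country B', '7', '-'],
]

data_rows = sample_page[5:]                        # skips indexes 0 through 4
null_like = [row for row in data_rows if row[2] == '']

print(len(data_rows))
print(null_like)
```

A dash in the third cell does not count as empty, so only the truly blank row is picked up.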

When you run the code, you’ll see there are some random blank rows scattered throughout the list that aren’t part of the country names. Maybe this is the cause of our problems in the last script. Let’s try to just get the country names put together and skip other blank rows. Let’s also add in the test for Zimbabwe:

first_name = ''

for table in all_tables:
    for row in table[5:]:
        if row[0] == '':
            continue
        if row[2] == '':
            first_name = row[0]
            continue
        if row[0].startswith(' '):
            row[0] = '{} {}'.format(first_name, row[0])
        print zip(headers, row)
        if row[0] == 'Zimbabwe':
            break

If the data row has an empty string at index 0, it has no country name and is a blank row. The next line skips it using continue, which is a Python keyword that tells the for loop to go to the next iteration.

If the data row has an empty string at index 2, we know this is probably the first part of a country name. This line saves the first part of the name in a variable, first_name. The next line moves on to the next row of data.

If the data row starts with spaces, we know it’s the second part of a country name. We want to put the name back together again.

If we are right in our hypothesis, we can match these things by printing out the results for human review. This line prints out each iteration so we can see them.

When we reach Zimbabwe, this line breaks out of our inner for loop.
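One detail worth checking in a loop like this is that break leaves only the innermost for loop. After Zimbabwe, the outer loop would still move on to another page if one followed. The sketch below uses made-up pages (sample_pages is hypothetical) to show both keywords at work.

```python
# Made-up pages; each row is a one-cell list to keep the example short.
sample_pages = [
    [['Country A'], [''], ['Zimbabwe'], ['footer text']],
    [['next page row']],
]

seen = []
for page in sample_pages:
    for row in page:
        if row[0] == '':
            continue            # skip this row, keep looping over the page
        seen.append(row[0])
        if row[0] == 'Zimbabwe':
            break               # ends this page only; the outer loop continues

print(seen)
```

The 'footer text' row is never visited, but 'next page row' is. If stopping entirely matters, one option is to put the loops in a function and use return.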

Most of the data looks good, but we are still seeing some anomalies. Take a look here:

[('Country', u'80 THE STATE OF T'), ('Child Labor 2005-2012 (%) total', u'HE WOR'),

('Child Labor 2005-2012 (%) male', u'LD\u2019S CHILDRE'),

('Child Labor 2005-2012 (%) female', u'N 2014'),

('Child Marriage 2005-2012 (%) married by 15', u'IN NUMBER'), ('Child Marriage 2005-2012 (%) married by 18', u'S'),

('Birth registration 2005-2012 (%)', u''), ...

We see a number at the beginning of the section we thought was the country name; it looks like the page number from the report’s footer. Do you know any countries that have numbers in their name? We sure don’t!

Let’s put in a test for numbers and see if we can lose the bad data. We also noticed our two-line countries are not properly mapping. From the looks of it, the pdftables import autocorrected for spaces at the beginning of lines, so our test for leading spaces never matched. How kind! Now we should add in a test and see whether the previous row left us a first_name or not:

from pdftables import get_tables
import pprint

headers = ['Country', 'Child Labor 2005-2012 (%) total',
           'Child Labor 2005-2012 (%) male',
           'Child Labor 2005-2012 (%) female',
           'Child Marriage 2005-2012 (%) married by 15',
           'Child Marriage 2005-2012 (%) married by 18',
           'Birth registration 2005-2012 (%)',
           'Female Genital mutilation 2002-2012 (prevalence), women',
           'Female Genital mutilation 2002-2012 (prevalence), girls',
           'Female Genital mutilation 2002-2012 (support)',
           'Justification of wife beating 2005-2012 (%) male',
           'Justification of wife beating 2005-2012 (%) female',
           'Violent discipline 2005-2012 (%) total',
           'Violent discipline 2005-2012 (%) male',
           'Violent discipline 2005-2012 (%) female']

all_tables = get_tables(open('EN-FINAL Table 9.pdf', 'rb'))

first_name = False
final_data = []

for table in all_tables:
    for row in table[5:]:
        if row[0] == '' or row[0][0].isdigit():
            continue
        elif row[2] == '':
            first_name = row[0]
            continue
        if first_name:
            row[0] = u'{} {}'.format(first_name, row[0])
            first_name = False
        final_data.append(dict(zip(headers, row)))
        if row[0] == 'Zimbabwe':
            break

pprint.pprint(final_data)

² It does seem there are some active GitHub forks that may be maintained and supported. We encourage you to keep an eye on them for your PDF table parsing needs.

Manipulates the country name entry in the row if it has a first_name.

Sets first_name back to False, so our next iteration operates properly.
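The tests in the script above can be tried on made-up rows without the PDF. In this sketch, merge_country_rows and sample_rows are hypothetical names, the rows are three cells wide for brevity, and the values are placeholders rather than real figures.

```python
def merge_country_rows(rows):
    """Drop blank and page-footer rows, and join two-line country names."""
    first_name = False
    merged = []
    for row in rows:
        row = list(row)                        # copy so the input is not modified
        if row[0] == '' or row[0][0].isdigit():
            continue                           # blank row, or row starting with a digit
        elif row[2] == '':
            first_name = row[0]                # first half of a two-line name
            continue
        if first_name:
            row[0] = u'{} {}'.format(first_name, row[0])
            first_name = False                 # reset for the next country
        merged.append(row)
    return merged


sample_rows = [
    ['Country A', '10', '11'],
    ['', '', ''],                              # blank row
    ['80 PAGE FOOTER', 'TEXT', ''],            # starts with a digit
    ['Two-Line', '', ''],                      # first half of a name
    ['Country Name', '7', '8'],                # second half, with the data
]

result = merge_country_rows(sample_rows)
print(result)
```

Checking row[0] == '' first matters: it guarantees row[0][0] exists before isdigit() is called on it.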

We now have our completed import. You’ll need to further manipulate the data if you want to match the exact structure we had with our Excel import, but we were able to preserve the data in our rows from the PDF.
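One possible next step, sketched under assumptions: since the file marks missing data with dashes, a cleanup pass could turn those cells into None and numeric strings into numbers. The helper clean_cell and the sample_record values are hypothetical, and the exact dash character in the extracted text (a hyphen or an en dash) is an assumption to check against the real output.

```python
def clean_cell(value):
    """Return None for dash or empty cells, a float for numeric text, else the text."""
    text = value.strip()
    if text in (u'', u'-', u'\u2013'):     # assumed missing-data markers
        return None
    try:
        return float(text)
    except ValueError:
        return text                        # e.g. country names or footnoted values


# Made-up record in the same shape as one entry of final_data.
sample_record = {u'Country': u'Country A', u'total': u'10.3', u'male': u'-'}
cleaned = {key: clean_cell(val) for key, val in sample_record.items()}

print(cleaned)
```

Cells that carry footnote markers alongside a number would fall through to the text branch here, so they would need their own handling.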

pdftables is not actively supported, and its developers now offer only a paid service as its replacement. It’s dangerous to rely on unsupported code, and we can’t depend on pdftables to be around and functional forever.² Part of belonging to the open source community, however, is giving back, so we encourage you to find good projects and help out by contributing and publicizing them in the hopes that projects like pdftables stay open source and continue to grow and thrive.

Next, we’ll take a look at some other options for parsing PDF data, including cleaning it by hand.
