Parsing PDFs Using pdfminer - Data Wrangling with Python

Because PDFs are notoriously difficult to work with, we’ll be learning how to work through problems in our code and do some basic troubleshooting.

We want to start collecting the country names, because the country names are going to be the keys in our final dataset. If you open up your text file, you will find that the eighth line is the last line before the countries start. That line has the text “and areas”:


5 TABLE 9 CHILD PROTECTION
6
7 Countries
8 and areas
9 Afghanistan
10 Albania
11 Algeria
12 Andorra

If you look through the text document, you will see that this is a consistent pattern.

So, we want to create a variable that acts as an on/off switch to start and stop the collection process. It should turn on when our code hits the “and areas” line.

To accomplish this, we will update the for loop to include a Boolean variable, which is a True/False variable. We want to set our Boolean to True when we hit the “and areas” line:

country_line = False
for line in openfile:
    if line.startswith('and areas'):
        country_line = True

Sets country_line to False, because by default the line is not a country line.

Searches for a line that starts with “and areas”.

Sets country_line to True.
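As a minimal sketch of how the switch behaves, the following runs the same loop over a short, hypothetical list named sample_lines that stands in for openfile (an assumption; the real input is the text file converted from the PDF). It records the value of the switch after each line. print is called with parentheses so the snippet also runs under Python 3.

```python
# Hypothetical stand-in for the opened text file. Each item ends with a
# newline, as lines read from a real file would.
sample_lines = ['Countries \n', 'and areas \n', 'Afghanistan \n', 'Albania \n']

country_line = False
switch_history = []  # value of the switch after each line is processed

for line in sample_lines:
    if line.startswith('and areas'):
        country_line = True
    switch_history.append(country_line)

print(switch_history)
```

The switch stays off until the “and areas” line and, because nothing turns it off yet, stays on for every line after it.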

The next thing we need to find is when to set the Boolean back to False. Take a moment to look at the text file to identify the pattern. How do you know when the list of countries ends?

If you look at the following excerpt, you will notice there is a blank line:

45 China
46 Colombia
47 Comoros
48 Congo
49
50 total
51 10
52 12

But how does Python recognize a blank line? Add a line to your script to print out the Python representation of the line (see “Formatting Data” on page 162 for more on string formatting):

country_line = False
for line in openfile:
    if country_line:
        print '%r' % line
    if line.startswith('and areas'):
        country_line = True

If country_line is True, which it will be after the previous iteration of the for loop…

… then print out the Python representation of the line.

If you look at the output, you will notice that all the lines now have extra characters at the end:

45 'China \n'
46 'Colombia \n'
47 'Comoros \n'
48 'Congo \n'
49 '\n'
50 'total\n'
51 '10 \n'
52 '12 \n'
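The same effect can be reproduced without the PDF text. This small sketch uses a hypothetical three-item list, sample_lines, in place of the file (an assumption for illustration). It prints each line's Python representation and checks which line is exactly a lone newline.

```python
# Hypothetical sample: a country line, a blank line, and a marker line.
sample_lines = ['Congo \n', '\n', 'total\n']

representations = []
for line in sample_lines:
    text = '%r' % line  # %r formats the value using its repr()
    representations.append(text)
    print(text)

# A "blank" line in the file is a string holding only the newline character.
is_blank = [line == '\n' for line in sample_lines]
print(is_blank)
```

Printing with %r shows the trailing characters explicitly instead of rendering them as an invisible line break, which is why the blank line becomes visible.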

The \n is the newline character, which marks the end of a line. This is what we will now use as the marker to turn off the country_line variable. If country_line is set to True but the line is equal to \n, our code should set country_line to False, because this line marks the end of the country names:

country_line = False
for line in openfile:
    if country_line:
        print line
    if line.startswith('and areas'):
        country_line = True
    elif country_line:
        if line == '\n':
            country_line = False

If country_line is True, print out the line so we can see the country name. This comes before the “and areas” test, not after it, because we only want to print the actual country names, not the “and areas” line.

If country_line is True and the line is equal to a newline character, set country_line to False, because the list of countries has ended.
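To make the order of the tests concrete, here is a sketch that follows the same logic but appends to a list instead of printing. The sample_lines list is a hypothetical stand-in for openfile, and the final stripping step is an added assumption, not part of the code above.

```python
# Hypothetical stand-in for the opened text file.
sample_lines = ['Countries \n', 'and areas \n', 'Afghanistan \n',
                'Albania \n', '\n', 'total\n', '10 \n']

country_line = False
collected = []  # every line that the original code would print

for line in sample_lines:
    if country_line:
        collected.append(line)
    if line.startswith('and areas'):
        country_line = True
    elif country_line:
        if line == '\n':
            country_line = False

print(collected)

# Assumption: strip whitespace and drop the empty entry to get clean names.
countries = [item.strip() for item in collected if item.strip()]
print(countries)
```

Note that the blank line that ends the list is itself collected, because the switch is still on when that line is reached. It is easy to filter out afterwards, as the last step shows.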


¹ Your text editor likely has an option to turn on line numbers, and might even have a shortcut to “hop” to a particular line number. Try a Google search if it’s not obvious how to use these features.

Now, when we run our code, we get what looks like all the lines with countries returned. This will eventually turn into our country list. Now, let’s look for the markers for the data we want to collect and do the same thing. The data we are looking for is the child labor and child marriage figures. We will begin with child labor data; we need total, male, and female numbers. Let’s start with the total.

We will follow the same pattern to find total child labor:

1. Create an on/off switch using True/False.
2. Look for the starter marker to turn it on.
3. Look for the ending marker to turn it off.

If you look at the text, you will see the starting marker for the data is “total”. Look at line 50 in the text file you created to see the first instance:¹

45 China
46 Colombia
47 Comoros
48 Congo
49
50 total
51 10
52 12

The ending marker is once again a newline or \n, which you can see on line 71:

68 6
69 46
70 3
71
72 26 y
73 5

Let’s add this logic to our code and check the results using print:

country_line = total_line = False
for line in openfile:
    if country_line or total_line:
        print line
    if line.startswith('and areas'):
        country_line = True
    elif country_line:
        if line == '\n':
            country_line = False
    if line.startswith('total'):
        total_line = True
    elif total_line:
        if line == '\n':
            total_line = False

Sets total_line to False.

If country_line or total_line is set to True, outputs the line so we can see what data we have.

Checks where total_line starts and sets total_line to True. The code following this line follows the same construction we used for the country_line logic.
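The two switches can be checked on a small sample. This sketch keeps the same start and end tests but, as an assumption for illustration, stores stripped values in two separate lists (countries and totals) instead of printing. The sample_lines list is a hypothetical stand-in for openfile.

```python
# Hypothetical stand-in for the opened text file.
sample_lines = ['and areas \n', 'China \n', 'Congo \n', '\n',
                'total\n', '10 \n', '12 \n', '\n', '26 y \n']

country_line = total_line = False
countries = []
totals = []

for line in sample_lines:
    value = line.strip()
    # Skip the blank line that ends each section; keep everything else.
    if country_line and value:
        countries.append(value)
    if total_line and value:
        totals.append(value)

    if line.startswith('and areas'):
        country_line = True
    elif country_line:
        if line == '\n':
            country_line = False

    if line.startswith('total'):
        total_line = True
    elif total_line:
        if line == '\n':
            total_line = False

print(countries)
print(totals)
```

The line '26 y' is not collected, because the blank line before it has already turned total_line off.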

At this point, we have some code redundancy. We are repeating some of the same code, just with different variables or on and off switches. This opens up a conversation about how to create non-redundant code. In Python, we can use functions to perform repetitive actions. That is, instead of manually performing those sets of actions line by line in our code, we can put them in a function and call the function to do it for us. If we need to test each line of a PDF, we can use functions instead.

When first writing functions, figuring out where to place them can be confusing. You need to write the code for the function before you want to call the function. This way Python knows what the function is supposed to do.
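A short sketch of why placement matters, using two hypothetical helper functions (shout and not_defined_yet, neither of which is part of our script): a call placed after the definition works, while a call placed before it raises a NameError.

```python
def shout(text):
    """Hypothetical helper: return the text in upper case."""
    return text.upper()

# Works: shout was defined above this call.
result = shout('and areas')

# Fails: at this point Python has not yet seen not_defined_yet.
try:
    not_defined_yet('and areas')
    call_failed = False
except NameError:
    call_failed = True

def not_defined_yet(text):
    """Hypothetical helper, defined too late for the call above."""
    return text.lower()

print(result)
print(call_failed)
```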

We will name the function we are writing turn_on_off, and we will set it up to receive up to four arguments:

• line is the line we are evaluating.

• status is a Boolean (True or False) representing whether it is on or off.

• start is what we are looking for as the start of the section; this will trigger the on or True status.

• end is what we are looking for as the end of the section; this will trigger the off or False status.

Update your code and add the shell for the function above your for loop. Do not forget to add a description of what the function does, so that when you refer back to the function later, you don’t have to try to decipher it. These descriptions are called docstrings:


def turn_on_off(line, status, start, end='\n'):
    """
    This function checks to see if a line starts/ends with a certain
    value. If the line starts/ends with that value, the status is
    set to on/off (True/False).
    """
    return status

country_line = total_line = False

for line in openfile:
    ...

This line begins the function and will take up to four arguments. The first three, line, status, and start, are required arguments; that is, they must be provided because they have no default values. The last one, end, has a default value of a newline character, as that appears to be a pattern with our file. We can override the default value by passing a different value when we call our function.

Always write a description (or docstring) for your function, so you know what it does. It doesn’t have to be perfect. Just make sure you have something. You can always update it later.

The return statement is the proper way to exit a function. In this case, we are going to return status, which will be True or False.
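The shell above does not yet change status. One possible way to fill in the body, reusing the start and end tests from the earlier loop, is sketched below. This is an assumption about where the function is heading, not the finished version. The sample_lines list and the 'END\n' marker are hypothetical, and 'END\n' is there only to show the default value of end being overridden.

```python
def turn_on_off(line, status, start, end='\n'):
    """
    Sketch: turn the status on when the line starts with `start`,
    and off when the status is on and the line equals `end`.
    """
    if line.startswith(start):
        status = True
    elif status and line == end:
        status = False
    return status

# Hypothetical stand-in for the opened text file.
sample_lines = ['and areas \n', 'China \n', '\n',
                'total\n', '10 \n', 'END\n', '5 \n']

country_line = total_line = False
country_history = []
total_history = []

for line in sample_lines:
    # Uses the default end marker, a lone newline.
    country_line = turn_on_off(line, country_line, 'and areas')
    # Overrides the default with a hypothetical end marker.
    total_line = turn_on_off(line, total_line, 'total', end='END\n')
    country_history.append(country_line)
    total_history.append(total_line)

print(country_history)
print(total_history)
```

Because the function returns the new status, each call's result is assigned back to the matching switch variable, and one function serves both switches.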

From the document Data Wrangling with Python (pages 115-120)