Automating Your Data Checking - Data Wrangling with Python

elif line not in double_lined_countries:
    countries.append(clean(line))

We want the logic in the if country_line section because it is only relevant to country names.

If the previous_line is in the double_lined_countries list, this line joins the previous_line with the current line and assigns the combined lines to the line variable. join, as you can see, binds a list of strings together with the preceding string. This line uses a space as the joining character.

If the line is not in the double_lined_countries list, then the following line adds it to the countries list. Here, we utilize elif, which is Python’s way of saying else if. This is a nice tool to use if you want to include a different logic flow than if - else.
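As a minimal sketch of the join and elif behavior described above, the snippet below runs the same branching over three hard-coded sample lines. The clean function here is a hypothetical stand-in that only drops the newline and surrounding whitespace; the script's real clean function is defined elsewhere and may do more. The sample lines are made up for illustration.

```python
def clean(text):
    # Hypothetical stand-in for the script's clean function (an assumption):
    # drop the newline and any surrounding whitespace.
    return text.replace('\n', '').strip()


double_lined_countries = ['Bolivia (Plurinational \n']
sample_lines = ['Bolivia (Plurinational \n', '   State of) \n', 'Botswana \n']

countries = []
previous_line = ''
for line in sample_lines:
    if previous_line in double_lined_countries:
        # Second half of a split name: glue it to the first half with a space.
        line = ' '.join([clean(previous_line), clean(line)])
        countries.append(line)
    elif line not in double_lined_countries:
        # An ordinary single-line name.
        countries.append(clean(line))
    # A first half matches neither branch, so it is skipped until joined.
    previous_line = line

print(countries)
```

The first half of a split name is never appended on its own; it only reaches the list once the next line has been joined to it.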

If we run our script again, we see 'Bolivia (Plurinational State of)' is now combined. Now we need to make sure we have all the countries. We will do this manually because our dataset is small, but if you had a larger dataset, you would automate this.

We know this is likely not how Python sees it, so we need to print out the countries as Python sees them and add those to the list:

if country_line:
    print '%r' % line
    if previous_line in double_lined_countries:

Adds a print '%r' statement to output the Python representation.
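To see why the Python representation matters, here is a small sketch with one made-up line. It uses print() calls so that it also runs under Python 3; the script in the text uses the Python 2 print statement.

```python
line = 'Bolivia (Plurinational \n'

# Printing the string directly renders the newline as an actual line break,
# so the trailing space and newline are invisible.
print(line)

# Formatting with %r shows the quotes, the trailing space, and the escaped
# newline exactly as they must be typed into a list literal.
shown = '%r' % line
print(shown)
```

Byte escapes such as \xe2\x80\x99 in the script's output are the UTF-8 bytes of a curly apostrophe as Python 2 displays them for a byte string. Under Python 3, a text string would show the apostrophe character itself.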

Run your script to populate the double_lined_countries list with the Python representation:

double_lined_countries = [
    'Bolivia (Plurinational \n',
    'Democratic People\xe2\x80\x99s \n',
    'Democratic Republic \n',
    'Micronesia (Federated \n',
    #... uh oh.
]

We are missing Lao People’s Democratic Republic from our output, but it’s on two lines in the PDF. Let’s go back to the text version of the PDF and see what happened.

After looking at the text, can you identify the issue? Look at the turn_on_off function. How does that work in relation to how this text is written?

The problem turns out to be a blank line, or \n, right after the 'and areas' line we were looking for as a marker. If you look at the text file that we created, you will see the stray blank line on line number 1343:

...
1341 Countries
1342 and areas
1343
1344 Iceland
1345 India
1346 Indonesia
1347 Iran (Islamic Republic of)
...

That means our function didn’t work. There are multiple ways we could approach this problem. For this example, we could try adding in more logic to make sure our on/off code works as intended. When we start to collect countries, there should be at least one country collected before we turn off the country collection. If no countries have been collected, then we should not turn off the collecting action. We can also use the previous line as a way to solve this problem. We can test the previous line in our on/off function and ensure it’s not in a list of special lines.
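The first idea in the paragraph above (do not switch off until at least one country has been collected) could look roughly like the sketch below. The function name should_turn_off and its collected parameter are hypothetical; they are not part of the script in the text, which takes the previous-line approach instead.

```python
def should_turn_off(line, collected, end='\n'):
    """Return True only if the end marker appears after something was collected."""
    return line == end and len(collected) > 0


collected = []

# A blank line right after the marker: nothing collected yet, so keep going.
first_check = should_turn_off('\n', collected)

collected.append('Iceland')

# A blank line after at least one country: safe to stop collecting.
second_check = should_turn_off('\n', collected)

print(first_check)
print(second_check)
```

This variant only guards against a stray blank line at the very start of a section; a blank line in the middle of the country list would still end collection.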

In case we come across any other anomalies, let’s set up this special line:

def turn_on_off(line, status, start, prev_line, end='\n'):
    """
    This function checks to see if a line starts/ends with a certain
    value. If the line starts/ends with that value, the status is
    set to on/off (True/False) as long as the previous line isn't special.
    """
    if line.startswith(start):
        status = True
    elif status:
        if line == end and prev_line != 'and areas':
            status = False
    return status

If the line is equal to end and the previous line is not equal to 'and areas', then we can turn data collection off. Here, we are using !=, which is Python’s way of testing “not equal.” Similar to ==, != returns a Boolean value.
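The following sketch exercises the previous-line check on a few made-up lines modeled on the text file. One deliberate difference from the function in the text is labeled as an assumption: the sketch strips prev_line before comparing, because a line read from a file normally still carries its trailing newline and would otherwise never equal 'and areas'.

```python
def turn_on_off(line, status, start, prev_line, end='\n'):
    if line.startswith(start):
        status = True
    elif status:
        # Assumption: strip prev_line so trailing spaces or a newline
        # do not defeat the comparison with the marker text.
        if line == end and prev_line.strip() != 'and areas':
            status = False
    return status


sample_lines = ['Countries\n', 'and areas\n', '\n', 'Iceland\n', 'India\n',
                '\n', 'total\n']

country_line = False
previous_line = ''
collected = []
for line in sample_lines:
    if country_line and line.strip():
        collected.append(line.strip())
    country_line = turn_on_off(line, country_line, 'and areas', previous_line)
    previous_line = line

print(collected)
```

The blank line directly after the marker no longer ends collection, while the blank line after the last country still does.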

You will also need to update your code to pass the previous line:

country_line = turn_on_off(line, country_line, 'and areas', previous_line)
total_line = turn_on_off(line, total_line, 'total', previous_line)

Let’s go back to the original task we were working on: creating our list of double-lined countries so that we can make sure to collect both lines. We left off here:

double_lined_countries = [
    'Bolivia (Plurinational \n',
    'Democratic People\xe2\x80\x99s \n',
    'Democratic Republic \n',
]

Looking at the PDF, we see the next one is Lao People’s Democratic Republic. Let’s start adding from there by looking back at our script output:

double_lined_countries = [
    'Bolivia (Plurinational \n',
    'Democratic People\xe2\x80\x99s \n',
    'Democratic Republic \n',
    'Lao People\xe2\x80\x99s Democratic \n',
    'Micronesia (Federated \n',
    'Saint Vincent and \n',
    'The former Yugoslav \n',
    'United Republic \n',
    'Venezuela (Bolivarian \n',
]

If your list looks like the preceding list, when you run the script, you should have an output that pulls in the country names split over two lines. Make sure to add a print statement to the end of your script to view the country list:

import pprint

pprint.pprint(countries)

Parsing PDFs Using pdfminer | 111

Now that you have spent a bit of time with the country list, can you think of another approach to solve this issue? Take a look at a few of the second lines:

'   Republic of Korea \n'
'   Republic \n'
'   of the Congo \n'

What do they have in common? They start with spaces. Writing code to check whether a line begins with three spaces would be more efficient. However, taking the approach we did allowed us to discover we were losing part of our dataset as it was being collected. As your coding skills develop, you will learn to find different ways to approach the same problem and determine which works best.
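A rough sketch of that alternative is shown below. It assumes every continuation line begins with exactly the three-space indent mentioned above and that no first-line country name does; the helper name is_continuation and the sample lines are hypothetical.

```python
def is_continuation(line, indent='   '):
    # Assumption: the second half of a split name always starts with
    # three spaces, and ordinary country lines never do.
    return line.startswith(indent)


sample_lines = [
    'Democratic Republic \n',
    '   of the Congo \n',
    'Denmark \n',
]

countries = []
for line in sample_lines:
    text = line.strip()
    if is_continuation(line) and countries:
        # Append the continuation to the name collected just before it.
        countries[-1] = countries[-1] + ' ' + text
    else:
        countries.append(text)

print(countries)
```

With this approach there is no hand-maintained list of first halves to keep in sync with the PDF, at the cost of trusting the indentation to be consistent.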

Let’s check to see how our total numbers line up with our countries. Update the pprint statement to match the following:

import pprint

data = dict(zip(countries, totals))
pprint.pprint(data)

Zips the countries and totals lists together by calling zip(countries, totals). This pairs the items up as a list of tuples. We then change that list of tuples into a dictionary, or dict (for easier reading), by passing it to the dict function.

Prints out the data variable we just created.
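Here is a small illustration of the zip and dict steps with three made-up entries. Under Python 3, zip returns an iterator rather than a list, so the sketch wraps it in list() to make the pairs visible.

```python
countries = ['Afghanistan', 'Albania', 'Algeria']
totals = ['10', '12', '5 y']

# Pair each country with the total at the same position.
pairs = list(zip(countries, totals))
print(pairs)

# A list of two-item tuples converts directly into a dictionary.
data = dict(pairs)
print(data['Algeria'])

# zip stops at the shorter input, so a length mismatch is silently dropped.
short = dict(zip(countries, totals[:2]))
print(len(short))
```

Because zip pairs purely by position, one extra or missing item in either list shifts every later pairing, which is why comparing len(countries) with len(totals) is a cheap first check.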

What you will see returned is a dictionary where the country names are the keys and the totals are the values. This is not our final data format; we are just doing this to see our data so far. The result should look like this:

{'': '-',
 'Afghanistan': '10',
 'Albania': '12',
 'Algeria': '5 y',
 'Andorra': '-',
 'Angola': '24 x',
 ...
}

If you check this again alongside the PDF, you will notice it falls apart right at the point of the first country on a double line. The numbers pulled in are the ones from the Birth registration column:

{
 ...
 'Bolivia (Plurinational State of)': '',
 'Bosnia and Herzegovina': '37',
 'Botswana': '99',
 'Brazil': '99',
 ...
}

If you look at the text version of the PDF, you will notice there is a gap between the numbers when the country is on a double line:

6 46 3 26 y 5 9 y

In the same way we accounted for this in the country name collection, we need to account for this in data collection. If we have a blank space in our data lines, we need to make sure we don’t collect it—that way, we only collect the data that matches the countries we’ve been collecting. Update your code to the following:

for line in openfile:
    if country_line:
        print '%r' % line
        if previous_line in double_lined_countries:
            line = ' '.join([clean(previous_line), clean(line)])
            countries.append(line)
        elif line not in double_lined_countries:
            countries.append(clean(line))
    elif total_line:
        if len(line.replace('\n', '').strip()) > 0:
            totals.append(clean(line))
    country_line = turn_on_off(line, country_line, 'and areas', previous_line)
    total_line = turn_on_off(line, total_line, 'total', previous_line)
    previous_line = line

We know from experience the PDF uses newlines as blank lines. On this line, the code replaces newlines with nothing and strips whitespace to clean it. Then this code tests whether the string still has a length greater than zero. If so, we know there is data and we can add it to our data (totals) list.
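The blank-line test can be tried in isolation on a few made-up lines. The helper name has_data is hypothetical; it wraps the same expression used in the loop.

```python
def has_data(line):
    # Same test as in the loop: remove newlines, strip whitespace,
    # and check whether anything is left.
    return len(line.replace('\n', '').strip()) > 0


sample_lines = ['6 \n', '\n', '   \n', '26 y \n']
totals = [line.strip() for line in sample_lines if has_data(line)]

print(totals)
```

Since strip() already removes newlines along with other whitespace, the shorter test `if line.strip():` gives the same result.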

After running our updated code, things fall apart again at our first double line. This time we are pulling in those Birth registration numbers again, aligned with our first double-lined country. All the following values are also incorrect. Let’s go back to the text file and figure out what is happening. If you look at the numbers in that column in the PDF, you can find the pattern in the text version of the PDF starting on line number 1251:

1250
1251 total
1252 –
1253 5 x
1254 26
1255 –
1256 –

If you look closely, you will notice the title of the Birth registration column ends in total:

266 Birth
267 registration
268 (%)+
269 2005–2012*
270 total
271 37
272 99

Right now the function collecting totals is looking for total, so this column is getting picked up before we even get to the next line of countries. We also see that the Violent discipline (%) column has a label for total with a blank line above it. This follows the same pattern as the total we want to collect.
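One quick way to confirm this diagnosis is to list every line that would switch the totals collector on. The sketch below does that for a small hand-made list modeled on the excerpts above; it is not the real text file, and the variable names are illustrative.

```python
sample_lines = [
    'Birth\n', 'registration\n', '(%)+\n', '2005-2012*\n',
    'total\n', '37\n', '99\n',
    '\n',
    'total\n', '-\n', '5 x\n',
]

# Every index where a line would trigger turn_on_off with start='total'.
marker_positions = [
    index for index, line in enumerate(sample_lines)
    if line.startswith('total')
]

print(marker_positions)
```

More than one position means the 'total' marker alone cannot identify which column is being collected.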

Encountering back-to-back bugs likely means the problem exists in the larger logic you’ve constructed. Because we started our script using these on/off switches, to fix the underlying problem we would need to rework the logic there. We’d need to figure out how to best determine the right column, maybe by collecting column names and sorting them. We might need to determine a way to see if the “page” has changed. If we keep quick-fixing the solution, we will likely run into more errors.

Only spend as much time on a script as you think you need to invest. If you are trying to build a sustainable process you can run on a large dataset multiple times over a long period, you are going to want to take the time to carefully consider all of the steps.

This is the process of programming: write code, debug, write code, debug. No matter how experienced a computer programmer you are, there will be moments when you introduce errors in your code. When learning to code, these moments can be demoralizing. You might think, “Why isn’t this working for me? I must not be good at this.” But that’s not true; programming takes practice, like anything else.

At this point, it’s clear our current process isn’t working. Based on what we now know about the text file, we can tell we began with a false notion that the file defines the beginning and end of each section using text. We could begin again with this file, starting at a new point; however, we want to explore some other ways of problem solving to fix the errors and get the data we want.

From Data Wrangling with Python (pages 127-133)