Arguments with Default Values Always Come Last

def turn_on_off(line, status, start, end='\n'):
    """
    This function checks to see if a line starts/ends with a certain
    value. If the line starts/ends with that value, the status is
    set to on/off (True/False).
    """
    return status

country_line = total_line = False

for line in openfile:
    ...

This line begins the function, which takes up to four arguments. The first three, line, status, and start, are required arguments; that is, they must be provided because they have no default values. The last one, end, has a default value of a newline character, as that appears to be a pattern in our file. We can override the default value by passing a different value when we call our function.

Always write a description (or docstring) for your function, so you know what it does. It doesn’t have to be perfect. Just make sure you have something. You can always update it later.

The return statement is the proper way to exit a function. In this case, we are going to return status, which will be True or False.

If we wanted status to have a default value as well, here are two possible ways to update the function definition:

def turn_on_off(line, start, end='\n', status=False):

def turn_on_off(line, start, status=False, end='\n'):

The status argument will have to be moved after the required arguments. When calling the new function, we can use defaults for end and status, or we can override them:

turn_on_off(line, 'and areas')

turn_on_off(line, 'and areas', end='\n\n', status=country_line)

If you accidentally list arguments with defaults before required arguments, Python will throw an error: SyntaxError: non-default argument follows default argument. You don’t need to memorize this, but be aware so if you see the error, you recognize what it’s referencing.
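One way to see this error without breaking a script is to compile a deliberately invalid definition from a string. The following is a minimal sketch written for Python 3; the function name bad is made up for illustration, and the exact wording of the message differs between Python versions, so the sketch only captures it rather than comparing it.

```python
# A deliberately invalid definition: the required argument `start`
# comes after `end`, which has a default value.
bad_source = "def bad(line, end='\\n', start):\n    pass\n"

try:
    compile(bad_source, "<example>", "exec")
    error_message = None
except SyntaxError as err:
    # The message text varies by Python version, so we just keep it.
    error_message = err.msg

print(error_message)
```

Because the problem is caught when the code is compiled, a script containing such a definition fails before any of its lines run.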

Now, let’s move the code from our for loop into the function. We want to replicate the logic we had with country_line in our new turn_on_off function:

def turn_on_off(line, status, start, end='\n'):
    """
    This function checks to see if a line starts/ends with a certain
    value. If the line starts/ends with that value, the status is
    set to on/off (True/False).
    """
    if line.startswith(start):
        status = True
    elif status:
        if line == end:
            status = False
    return status

Replaces what we are searching for on the starting line with the start variable.

Replaces the end text we used with the end variable.

Returns the status based on the same logic (end means False, start means True).
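To check how the status flips, the function can be run against a handful of lines. The sketch below uses made-up sample lines in place of the real file, so the values are assumptions for illustration only.

```python
def turn_on_off(line, status, start, end='\n'):
    """Turn the status on at a line starting with `start`, off at `end`."""
    if line.startswith(start):
        status = True
    elif status:
        if line == end:
            status = False
    return status

# Made-up lines standing in for the converted PDF text.
sample_lines = ['Table 9\n', 'and areas \n', 'Afghanistan \n',
                'Albania \n', '\n', 'total \n']

status = False
history = []
for line in sample_lines:
    status = turn_on_off(line, status, 'and areas')
    history.append(status)

print(history)
```

The status turns on at the line that starts with 'and areas', stays on for the following lines, and turns off again at the first line that is only a newline.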

Let’s now call the function in our for loop, and check out what our script looks like all together thus far:

pdf_txt = 'en-final-table9.txt'
openfile = open(pdf_txt, "r")

def turn_on_off(line, status, start, end='\n'):
    """
    This function checks to see if a line starts/ends with a certain
    value. If the line starts/ends with that value, the status is
    set to on/off (True/False).
    """
    if line.startswith(start):
        status = True
    elif status:
        if line == end:
            status = False
    return status

country_line = total_line = False

for line in openfile:
    if country_line or total_line:
        print '%r' % line

    country_line = turn_on_off(line, country_line, 'and areas')
    total_line = turn_on_off(line, total_line, 'total')

Parsing PDFs Using pdfminer | 103

In Python syntax, a series of = symbols means we are assigning each of the variables listed to the final value. This line assigns both country_line and total_line to False.

We still want to keep track of our lines and the data they hold while a status is on. For this, we are employing or. A Python or says if one or the other is true, do the following. This line says if either country_line or total_line is True, print the line.

This calls the function for countries. The country_line variable catches the returned status that the function outputs and updates it for the next iteration of the for loop.

This calls the function for totals. It works the same as the previous line for country names.
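Chained assignment binds every name to the same object. That is harmless for values such as False, but it matters for mutable values such as lists. The following sketch, written for Python 3 with made-up variable names, shows the difference.

```python
# Chained assignment: both names start out bound to False.
country_line = total_line = False
country_line = True          # rebinding one name leaves the other untouched
print(country_line, total_line)

# With a mutable value, both names share a single list.
shared_a = shared_b = []
shared_a.append('Afghanistan')
print(shared_b)

# Separate assignments create two independent lists.
countries, totals = [], []
countries.append('Afghanistan')
print(totals)
```

This is one reason to create lists on separate lines rather than with a chained assignment.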

Let’s start to store our countries and totals in lists. Then we will take those lists and turn them into a dictionary, where the country will be the key and the total will be the value. This will help us troubleshoot to see if we need to clean up our data.

Here’s the code to create the two lists:

countries = []
totals = []
country_line = total_line = False

for line in openfile:
    if country_line:
        countries.append(line)
    elif total_line:
        totals.append(line)

    country_line = turn_on_off(line, country_line, 'and areas')
    total_line = turn_on_off(line, total_line, 'total')

Creates empty countries list.

Creates empty totals list.

Note that we’ve removed the if country_line or total_line statement. We will break this out separately below.

If it’s a country line, this line adds the country to the country list.

This line collects totals, the same as we did with countries.

We are going to combine the totals and countries by “zipping” the two datasets. The zip function takes one item from each list and pairs them together, in order, stopping when the shorter list runs out of items. We can then convert the zipped pairs to a dictionary by passing them to the dict function.

Add the following to the end of the script:

import pprint
test_data = dict(zip(countries, totals))
pprint.pprint(test_data)

Imports the pprint library. This prints complex data structures in a way that makes them easy to read.

Creates a variable called test_data, which will be the countries and totals zipped together and then turned into a dictionary.

Passes test_data to the pprint.pprint() function to pretty print our data.
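Two behaviors of zip and dict are worth keeping in mind when reading the output: zip stops at the shorter list, and a dictionary keeps only the last value for a repeated key. The sketch below uses made-up values and Python 3 syntax to show both.

```python
import pprint

# Made-up values for illustration; the third country has no matching total.
countries = ['Afghanistan', 'Albania', 'Algeria']
totals = ['10', '12']

pairs = list(zip(countries, totals))   # list() is needed in Python 3, where zip is lazy
test_data = dict(pairs)
pprint.pprint(test_data)

# Repeated keys collapse: the last value wins.
duplicates = dict(zip(['', 'Andorra', ''], ['49', '-', '70']))
print(duplicates)
```

If the two lists get out of step, countries can be paired with the wrong totals without any error being raised, so it helps to skim the result by hand.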

If you run the script now, you will get a dictionary that looks like this:

{'\n': '49 \n',
 ' \n': '\xe2\x80\x93 \n',
 ' Republic of Korea \n': '70 \n',
 ' Republic of) \n': '\xe2\x80\x93 \n',
 ' State of) \n': '37 \n',
 ' of the Congo \n': '\xe2\x80\x93 \n',
 ' the Grenadines \n': '60 \n',
 'Afghanistan \n': '10 \n',
 'Albania \n': '12 \n',
 'Algeria \n': '5 y \n',
 'Andorra \n': '\xe2\x80\x93 \n',
 'Angola \n': '24 x \n',
 'Antigua and Barbuda \n': '\xe2\x80\x93 \n',
 'Argentina \n': '7 y \n',
 'Armenia \n': '4 \n',
 'Australia \n': '\xe2\x80\x93 \n',
 ...

At this point, we are going to do some cleaning. This will be explained in greater detail in Chapter 7. For now, we need to clean up our strings, as they are very hard to read. We are going to do this by creating a function to clean up each line. Place this function above the for loop, near your other function:

def clean(line):
    """
    Cleans line breaks, spaces, and special characters from our line.
    """
    line = line.strip('\n').strip()
    line = line.replace('\xe2\x80\x93', '-')
    line = line.replace('\xe2\x80\x99', '\'')
    return line

Strips \n off of the line and reassigns the output to line, so now line holds the cleaned version.

Replaces special character encodings.

Returns our newly cleaned string.
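The byte sequences '\xe2\x80\x93' and '\xe2\x80\x99' are the UTF-8 encodings of an en dash and a right single quotation mark. The following is a minimal sketch of the same cleaning step for Python 3, assuming the file has been read as UTF-8 text so that those characters appear as '\u2013' and '\u2019'. The name clean_text is hypothetical and used only here.

```python
def clean_text(line):
    """Strip surrounding whitespace and replace two typographic characters."""
    line = line.strip()                    # also removes the trailing '\n'
    line = line.replace('\u2013', '-')     # en dash -> hyphen
    line = line.replace('\u2019', "'")     # curly apostrophe -> straight apostrophe
    return line

# The UTF-8 bytes shown in the chapter decode to these two characters.
assert b'\xe2\x80\x93'.decode('utf-8') == '\u2013'
assert b'\xe2\x80\x99'.decode('utf-8') == '\u2019'

print(repr(clean_text(' Republic of Korea \n')))
print(repr(clean_text('\u2013 \n')))
```

A single call to strip() with no arguments removes newlines as well as spaces, so the sketch does not need the separate strip('\n').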

In the cleaning we just did, we could combine method calls like this:

line = line.strip('\n').strip().replace(
    '\xe2\x80\x93', '-').replace('\xe2\x80\x99', '\'')

However, well-formatted Python code lines should be no longer than 79 characters, the limit suggested by PEP 8, Python’s style guide. This is a recommendation, not a rule, but keeping your lines restricted in length allows your code to be more readable.
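When a chain of method calls does get long, wrapping the whole expression in parentheses lets it span several lines. The sketch below is written for Python 3 text strings (an assumption; the chapter's own code works on byte strings) and checks that both forms give the same result.

```python
line = ' Andorra \n'

# One long chained expression on a single line...
one_line = line.strip('\n').strip().replace('\u2013', '-').replace('\u2019', "'")

# ...and the same chain wrapped in parentheses, one call per line.
wrapped = (
    line.strip('\n')
        .strip()
        .replace('\u2013', '-')
        .replace('\u2019', "'")
)

print(one_line == wrapped)
```

Whether the wrapped chain or three separate assignments reads better is a matter of taste; both stay within the recommended line length.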

Let’s apply the clean function in our for loop:

for line in openfile:
    if country_line:
        countries.append(clean(line))
    elif total_line:
        totals.append(clean(line))

Now if we run our script, we get something that looks closer to what we are aiming for:

{'Afghanistan': '10',
 'Albania': '12',
 'Algeria': '5 y',
 'Andorra': '-',
 'Angola': '24 x',
 'Antigua and Barbuda': '-',
 'Argentina': '7 y',
 'Armenia': '4',
 'Australia': '-',
 'Austria': '-',
 'Azerbaijan': '7 y',
 ...

If you skim the list, you will see our approach is not adequately parsing all of the data. We need to figure out why this is happening.

It looks like countries with names spread over more than one line are separated into two records. You can see this with Bolivia: we have records reading 'Bolivia (Plurinational': '', and 'State of)': '26 y',.

The PDF itself can be used as a visual reference to show you how the data should be organized. You can see these lines in the PDF, as shown in Figure 5-2.

Figure 5-2. Bolivia in the PDF

PDFs can be rabbit holes. Each PDF you process will require its own finesse. Because we are only parsing this PDF once, we are doing a lot of hand-checking. If this were a PDF we needed to parse on a regular basis, we would want to closely identify patterns over time and programmatically account for those, along with building checks and testing our code to ensure our import is accurate.

There are a couple of ways to approach this problem. We could try to create a placeholder to check for blank total lines and combine those with the following data lines. Another solution is to keep track of which countries have records spanning more than one line. We will try the second approach, as our dataset isn’t too large.


We will create a list of the first lines for each multiline country record and use this list to check each line in our script. You will want to put this list before your for loop. Often, reference items are put near the top of the script so they are easy to find and change as needed.

Let’s add 'Bolivia (Plurinational' to a list of double-lined countries:

double_lined_countries = [
    'Bolivia (Plurinational',
]

Now we need to update our for loop to check if the previous line is in the double_lined_countries list, and, if so, combine that line with the current line. To do so, we will create a previous_line variable. Then, we will populate the previous_line variable at the end of the for loop. Only then will we be able to combine the rows when the code hits the next iteration of the loop:

countries = []
totals = []
country_line = total_line = False
previous_line = ''

for line in openfile:
    if country_line:
        countries.append(clean(line))
    elif total_line:
        totals.append(clean(line))

    country_line = turn_on_off(line, country_line, 'and areas')
    total_line = turn_on_off(line, total_line, 'total')
    previous_line = line

Creates the previous_line variable and sets it to an empty string.

Populates the previous_line variable with the current line at the end of the for loop.

Now we have a previous_line variable and we can check to see if the previous_line is in double_lined_countries so we know when to combine the previous line with the current line. Furthermore, we will want to add the new combined line to the countries list. We also want to make sure we do not add the line to the countries list if the first part of the name is in the double_lined_countries list.

Let’s update our code to reflect these changes:

    if country_line:
        # Compare cleaned text: the raw lines still end with a space and '\n'.
        if clean(previous_line) in double_lined_countries:
            line = ' '.join([clean(previous_line), clean(line)])
            countries.append(line)
        elif clean(line) not in double_lined_countries:
            countries.append(clean(line))

We want the logic in the if country_line section because it is only relevant to country names.

If the previous_line is in the double_lined_countries list, this line joins the previous_line with the current line and assigns the combined lines to the line variable. join, as you can see, binds a list of strings together with the preceding string. This line uses a space as the joining character.

If the line is not in the double_lined_countries list, then the following line adds it to the countries list. Here, we utilize elif, which is Python’s way of saying else if. This is a nice tool to use if you want to include a different logic flow than if-else.
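The merging logic can be tried on its own with a few sample lines. The following is a sketch in Python 3 under two assumptions: the sample lines are made up, and the comparison is done on cleaned text because the raw lines shown earlier end in a space and a newline. The clean function here is a simplified stand-in.

```python
def clean(line):
    """Simplified stand-in for the chapter's clean function."""
    return line.strip()

double_lined_countries = ['Bolivia (Plurinational']

# Made-up lines in the shape the converted PDF text appears to have.
sample_lines = ['Bhutan \n', 'Bolivia (Plurinational \n',
                'State of) \n', 'Botswana \n']

countries = []
previous_line = ''
for line in sample_lines:
    if clean(previous_line) in double_lined_countries:
        # Second half of a split name: join it to the first half.
        countries.append(' '.join([clean(previous_line), clean(line)]))
    elif clean(line) not in double_lined_countries:
        # Ordinary single-line name.
        countries.append(clean(line))
    # First halves are skipped here and picked up on the next iteration.
    previous_line = line

print(countries)
```

The first half of a split name is never appended by itself; it only reaches the list once the following line has been joined to it.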

If we run our script again, we see 'Bolivia (Plurinational State of)' is now combined. Now we need to make sure we have all the countries. We will do this manually because our dataset is small, but if you had a larger dataset, you would automate this.

Source: the book Data Wrangling with Python (pages 120-127)