What Is Indexing? - Data Wrangling with Python

As you’ll recall from Chapter 3, indexing is a way to pull an item out of a set of objects, such as a list. In the case of the Excel file we are parsing, when we pass i to sheet.row_values(), the method row_values uses i as an index. Let’s practice indexing in the Python interpreter.

Create a sample list:

x = ['cat', 'dog', 'fish', 'monkey', 'snake']

To pull out the second item, you can refer to the item by adding an index, as shown here:

>>> x[2]

'fish'

If this isn’t the result you expected, remember that Python starts counting at 0. So, to get the second item as humans would identify it, we have to use the number 1:

>>> x[1]

'dog'

Getting Started with Parsing | 83

You can also take a negative index:

>>> x[-2]

'monkey'

What is the difference in behavior between positive and negative indexes? You can see that one counts from the beginning (positive) while the other counts from the end (negative).
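One way to see how the two relate, as a small sketch: for a list of length n, the negative index -k refers to the same item as the positive index n - k. The variable names below are only for illustration.

```python
x = ['cat', 'dog', 'fish', 'monkey', 'snake']

n = len(x)  # 5 items, so the valid positive indexes are 0 through 4

# A negative index -k points at the same item as the positive index n - k.
second_from_end = x[-2]
same_item = x[n - 2]

print(second_from_end)  # monkey
print(same_item)        # monkey

# -1 is the last item and -n is the first.
last_item = x[-1]
first_item = x[-n]
```

Going past either end, for example x[5] or x[-6] on this list, raises an IndexError.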

Slicing is another useful practice related to indexing. Slicing allows you to take a “slice” out of another list or iterable object. For example:

>>> x[1:4]

['dog', 'fish', 'monkey']

Notice that, as with ranges, the slice starts at the first number, but the second number is read as “up to, but not including.”
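The comparison with ranges can be made concrete. In this sketch, the slice x[1:4] collects the items at exactly the indexes that range(1, 4) produces; via_range and via_slice are just illustrative names.

```python
x = ['cat', 'dog', 'fish', 'monkey', 'snake']

# range(1, 4) yields the indexes 1, 2, 3: it stops before 4.
indexes = list(range(1, 4))

# Picking those indexes one by one gives the same items as the slice.
via_range = [x[i] for i in indexes]
via_slice = x[1:4]

print(via_slice)

# When both numbers fall inside the list, the slice has (end - start) items.
slice_length = len(via_slice)
```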

If you leave out the first number, the slice starts at the beginning of the list; if you leave out the last number, the slice goes to the end. Here are a few examples:

>>> x[2:]

['fish', 'monkey', 'snake']

>>> x[-2:]

['monkey', 'snake']

>>> x[:2]

['cat', 'dog']

>>> x[:-2]

['cat', 'dog', 'fish']

Slicing on other iterable objects works the same way as with lists.
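As a quick sketch of that claim, here is the same slice syntax applied to a string and a tuple. Strictly speaking, slicing applies to sequence types such as these; the sample values are chosen only for illustration.

```python
word = 'monkey'                     # a string is a sequence of characters
animals = ('cat', 'dog', 'fish')    # a tuple

first_three = word[:3]    # 'mon'
last_two = word[-2:]      # 'ey'
middle = word[1:4]        # 'onk'

# Slicing returns the same type it was given: a tuple slice is a tuple.
tail = animals[1:]        # ('dog', 'fish')

print(first_three)
print(tail)
```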

Let’s add a dictionary to our code, and then pull out the country name from each row and add it as a key to our dictionary.

Update your for loop to reflect this:

count = 0
data = {}
for i in xrange(sheet.nrows):
    if count < 10:
        if i >= 14:
            row = sheet.row_values(i)
            country = row[1]
            data[country] = {}
            count += 1
print data

data = {} creates an empty dictionary to store our data.

row[1] pulls out the country from each row we iterate over.

data[country] adds the country as a key to the data dictionary. We set the value to another dictionary, because that is where we are going to store our data in the following steps.

print data outputs the data, so we can see what it looks like.

At this point, your output should look something like this:

{u'Afghanistan': {}, u'Albania': {}, u'Angola': {}, u'Algeria': {},

u'Andorra': {}, u'Austria': {}, u'Australia': {}, u'Antigua and Barbuda': {}, u'Armenia': {}, u'Argentina': {}}
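To experiment with this pattern without opening the Excel file, here is a minimal sketch. The sample_rows list is made-up stand-in data, not taken from the spreadsheet; it only mimics the shape of what sheet.row_values(i) returns, with the country name at index 1.

```python
# Hypothetical rows standing in for sheet.row_values(i); the layout is invented.
sample_rows = [
    ['', 'Afghanistan', '', ''],
    ['', 'Albania', '', ''],
    ['', 'Algeria', '', ''],
]

data = {}

for row in sample_rows:
    country = row[1]      # the country name sits in the second column
    data[country] = {}    # empty dictionary, to be filled in later

print(sorted(data))
```

Each country becomes a key whose value is an empty dictionary, ready to receive the nested data added in the next steps.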

Now, we need to match up each of the values in the rest of the row with the proper values in the spreadsheet, then store them in our dictionary.

As you try to pull out all the values and check them against your Excel sheet, you will make lots of errors. That is fine and expected.

This process should be embraced—it means you’re working your way through the problem.

First, let’s create an empty version of our data structure where we can store our data.

Let’s also remove our counter, as we know that the rows of data start at line 14.

Because we know xrange can accept a start and end point, we can begin our counting at 14 and end at the end of the file. Let’s take a look at our updated code:

data = {}
for i in xrange(14, sheet.nrows):
    row = sheet.row_values(i)
    country = row[1]
    data[country] = {
        'child_labor': {
            'total': [],
            'male': [],
            'female': [],
        },
        'child_marriage': {
            'married_by_15': [],
            'married_by_18': [],
        }
    }
print data['Afghanistan']


We can remove all of our references to the counter and just begin our for loop at row index 14 of our sheet. The xrange(14, sheet.nrows) call begins the loop with a value of 14, so we automatically skip the lines we don’t need for our dataset.

The data[country] assignment expands the dictionary over multiple lines to fill out the other data points.

The first entry creates the key child_labor and sets it equal to another dictionary.

That dictionary has string keys that describe each part of the data it holds. For each of these keys, the values are lists.

The final print statement outputs the values associated with the key Afghanistan. Our output data for Afghanistan looks like this:

{

'child_labor': {'total': [], 'male': [], 'female': []}, 'child_marriage': {'married_by_18': [], 'married_by_15': []}

}
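The effect of giving the loop a starting point can be checked on a small scale. This sketch uses range, which accepts the same start and stop arguments as xrange and also runs on Python 3. The values of nrows and first_data_row are illustrative, not the real sheet's.

```python
nrows = 8            # pretend the sheet has 8 rows (hypothetical)
first_data_row = 3   # pretend the data starts at row index 3 (hypothetical)

# Counter-free "if" version: visit every row, skip the header rows.
with_if = []
for i in range(nrows):
    if i >= first_data_row:
        with_if.append(i)

# Start/stop version: begin at the first data row directly.
with_start = list(range(first_data_row, nrows))

print(with_start)
```

Both versions visit the same row indexes; the second simply never looks at the rows before the data begins.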

Let’s now populate the data. Because we have access to each column of each row using the index, we can populate these lists with the values from the sheet. By looking at our sheet and lining up which columns relate to which parts of the data, we can update the data dictionary to reflect the following:

data[country] = {
    'child_labor': {
        'total': [row[4], row[5]],
        'male': [row[6], row[7]],
        'female': [row[8], row[9]],
    },
    'child_marriage': {
        'married_by_15': [row[10], row[11]],
        'married_by_18': [row[12], row[13]],
    }
}

Because there are two cells for each of the columns, our code stores both values.

Our child labor totals are in the fifth and sixth columns and Python is zero-indexed, so for the 'total' key we want indexes 4 and 5 (row[4] and row[5]).

When we run our code again, we get output like this:

{
    'child_labor': {'female': [9.6, ''], 'male': [11.0, ''], 'total': [10.3, '']},
    'child_marriage': {'married_by_15': [15.0, ''], 'married_by_18': [40.4, '']}
}

Before you continue, output a couple of records and check the numbers in the dictionary against the spreadsheet. It is easy to end up one index off and ruin the rest of your data.
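One way to guard against off-by-one mistakes is to write the column positions down as you read them from the spreadsheet and let the code subtract one. A sketch of that idea follows. The helper cell() is hypothetical, and the sample row is invented: it is padded so that the Afghanistan child labor values land in the columns the chapter describes.

```python
def cell(row, column_number):
    """Return the value in the given 1-based spreadsheet column."""
    return row[column_number - 1]

# Invented row: columns 5-10 hold the child labor cells; other cells are blank.
row = ['', 'Afghanistan', '', '', 10.3, '', 11.0, '', 9.6, '']

child_labor = {
    'total': [cell(row, 5), cell(row, 6)],     # same as row[4], row[5]
    'male': [cell(row, 7), cell(row, 8)],      # same as row[6], row[7]
    'female': [cell(row, 9), cell(row, 10)],   # same as row[8], row[9]
}

print(child_labor['total'])
```

Whether a helper like this is worth it is a matter of taste; indexing with row[4] directly works just as well once the mapping has been checked.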

Finally, to preview our data, we can use pprint instead of a print statement. In complicated data structures (like dictionaries), this makes it a lot easier to review the output. Add the following to the end of your file to preview the data in a formatted fashion:

import pprint
pprint.pprint(data)

The import pprint statement imports the pprint library. Normally, import statements come at the beginning of the file, but we are putting it here for simplicity. After you are done, you will want to delete these lines, because they are not critical to your script.

The pprint.pprint(data) call passes data to the pprint.pprint() function.
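As a small illustration of what the formatting does, this sketch formats a nested dictionary shaped like the Afghanistan record. It uses pprint.pformat(), which returns the formatted text as a string instead of printing it; the exact line breaks depend on the width argument, which is set here only to force wrapping.

```python
import ast
import pprint

data = {
    'Afghanistan': {
        'child_labor': {'total': [10.3, ''], 'male': [11.0, ''], 'female': [9.6, '']},
        'child_marriage': {'married_by_15': [15.0, ''], 'married_by_18': [40.4, '']},
    }
}

formatted = pprint.pformat(data, width=60)
print(formatted)  # the record is spread over several indented lines

# The formatted text is still a valid Python literal describing the same data.
round_trip = ast.literal_eval(formatted)
```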

If you scroll through your output, you will notice the majority of it looks good. But there are a couple of records that seem out of place.

If you look at the spreadsheet, you should note the last row for countries is Zimbabwe. So, we want to look for when the country is equal to 'Zimbabwe', and exit there. To exit, we add a break to our code, which is how we prematurely break the for loop to continue with the rest of the script. Let’s add that as our stopping point. At the end of the for loop, add the following and rerun your code:

    if country == 'Zimbabwe':
        break

The if statement checks whether the country is equal to Zimbabwe.

If it is, break exits out of the for loop.
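A minimal sketch of how break behaves, using a made-up list in place of the sheet. The trailing entries stand in for non-country rows that might follow the last country; they are invented for the example.

```python
# Hypothetical values read from the country column, in sheet order.
countries = ['Zambia', 'Zimbabwe', 'not a country', 'also not a country']

data = {}

for country in countries:
    data[country] = {}
    if country == 'Zimbabwe':
        break  # leave the loop; the remaining entries are never processed

print(sorted(data))
```

Because the check comes after the assignment, Zimbabwe itself is stored before the loop stops.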

After adding the break, did you end up with a NameError: name 'country' is not defined error? If so, check your indentation.

The if statement should be indented four spaces to be in the for loop.

Stepping through code can be helpful in identifying an issue. If you need to troubleshoot to figure out what a variable, such as country, is equal to in a for loop, try adding print statements inside the for loop and watching the values before your script exits with an error. They will likely give you a hint as to what is happening.
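A sketch of that kind of temporary debugging output follows. The countries list is a hypothetical stand-in for the values read from the sheet, and formatting with %r is one possible choice: it shows quotes around each value, which makes stray whitespace easy to spot.

```python
# Hypothetical stand-in for the country column; note the trailing space.
countries = ['Zambia', 'Zimbabwe ']

matched = []
for i, country in enumerate(countries):
    # Temporary debugging output: position in the loop and the exact value.
    print('i=%d country=%r' % (i, country))
    if country == 'Zimbabwe':
        matched.append(country)
```

Here the printed value 'Zimbabwe ' reveals why the comparison never matches. Remove the print lines once the problem is found.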


At this point, our script’s output matches our end goal. The last thing we want to do to our script is to make sure we document it with some comments.

From the document Data Wrangling with Python (pages 101-106)