
Identifying Values for Data Cleanup

From the book Data Wrangling with Python (pages 169-185)

We begin our data cleanup with a simple review of the fields we find and any visual inconsistencies we can see. If you start your data cleanup by making your data look cleaner, you will have a good idea of the initial problems you must conquer as you normalize your data.

Let’s take a look at our mn.csv file. The file contains raw data and uses codes (acronyms) as headers, which likely contain some easily translatable meaning. Let’s take a look at the column headers in our mn.csv file:

"","HH1","HH2","LN","MWM1","MWM2", ...

Each of these represents a question or data in the survey, and we’d like the more human-readable versions. Searching via Google, we locate the human-readable values for those headings on the World Bank site for sharing MICS data.

Take time to first investigate whether data like the abbreviation listing on the World Bank site exists to help with your cleanup needs.

You can also pick up the phone and give the organization a call to ask if they have an easy-to-use abbreviation list.

Using some web scraping skills you’ll become acquainted with in Chapter 11, we were able to get a CSV of these headers with their English variants and the questions used to calculate their values from the World Bank site for MICS data. We’ve included the new headers from our web scraper in the book’s repository (mn_headers.csv). We want to match up this data with our survey data so we have readable questions and answers. Let’s look at a few ways we can do that.

Data Cleanup Basics | 151

Replacing headers

The most straightforward and obvious way to make the headers more readable is to merely replace the short headers with longer English ones we can understand. How might we go about header substitution using Python? First, we’ll need to import both the mn.csv and mn_headers.csv files using the csv module we learned about in Chapter 3 (see the following code for importing). Throughout this chapter and the following chapters, feel free to write code in either scripts or in your terminal (such as IPython). This will allow you to interact with the data before saving it to a file:

from csv import DictReader

data_rdr = DictReader(open('data/unicef/mn.csv', 'rb'))
header_rdr = DictReader(open('data/unicef/mn_headers.csv', 'rb'))

data_rows = [d for d in data_rdr]
header_rows = [h for h in header_rdr]

print data_rows[:5]
print header_rows[:5]

This code writes the iterable DictReader object into a new list so we can preserve the data and reuse it. We’re using the list comprehension format so we can do it in one simple line of code that’s readable and clear.

This prints just a slice of the data, using Python’s list slicing to show the first five elements of our new lists and get an idea of the content.
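Slicing works the same on any list; for instance, with a toy list standing in for our parsed rows (illustrative names, not real survey data):

```python
# Toy list standing in for parsed survey rows (not real data)
rows = ['row1', 'row2', 'row3', 'row4', 'row5', 'row6', 'row7']

first_five = rows[:5]   # slice: elements at indexes 0 through 4
print(first_five)       # ['row1', 'row2', 'row3', 'row4', 'row5']
```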

In the fourth line of code, we used a Python list comprehension. Python list comprehensions have the following format:

[func(x) for x in iter_x]

A list comprehension starts and ends with square brackets. It takes an iterable object (iter_x) and passes each row or value from iter_x into func(x) to create new values for the new list. Here, we are not doing anything with the function part of the comprehension; we only want the row as it currently stands. In future chapters, we’ll use the ability to pass each row or value from the iterable into a function to clean or change the data before we put it into the list. List comprehensions are a great example of the easy-to-read and easy-to-use syntax Python is well known for. You could achieve the same functionality using a for loop, but that would require more code:

new_list = []
for x in iter_x:
    new_list.append(func(x))

As you can see, using a list comprehension saves us a few lines of code, and it typically runs a little faster because it avoids repeatedly looking up the list’s append method.
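Side by side, with a hypothetical func and iterable (names chosen for illustration), both forms produce the same list:

```python
def func(x):
    return x * 2              # hypothetical transformation

iter_x = [1, 2, 3]

# list comprehension form
comprehension_list = [func(x) for x in iter_x]

# equivalent for loop form
loop_list = []
for x in iter_x:
    loop_list.append(func(x))

print(comprehension_list)                # [2, 4, 6]
print(comprehension_list == loop_list)   # True
```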

We want to replace the data_row dictionary headers with the readable headers from our file. As we can see from our output, the header_rows dictionaries hold both the short and longer values. The current short headers are contained under the Name field, and the longer, more readable headers are saved under the Label field. Let’s see how easily we can match them up using some Python string methods:

for data_dict in data_rows:
    for dkey, dval in data_dict.items():
        for header_dict in header_rows:
            for hkey, hval in header_dict.items():
                if dkey == hval:
                    print 'match!'

Iterates over each of our data records. We will try to use the keys in each of these dictionaries to match our headers.

Iterates over each key and value in each data row so we can replace all of the keys with the more readable header labels (to view each key-value pair in the data dictionary, we use the Python dictionary’s items method).

Iterates over all header rows of data, so we can get the readable labels. It’s not the fastest way to do this, but it makes sure we don’t miss any.

Prints found matches between the data list keys (MWB3, MWB7, MWB4, MWB5…) and the header dictionary data.
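To see where that innermost comparison fires, here is the same four-level loop run against two tiny hand-made dictionaries (toy stand-ins for one data row and two header rows, not the real survey data):

```python
# Toy stand-ins for the real parsed CSV contents
data_rows = [{'HH1': '1', 'HH2': '17'}]
header_rows = [{'Name': 'HH1', 'Label': 'Cluster number'},
               {'Name': 'HH2', 'Label': 'Household number'}]

matches = []
for data_dict in data_rows:
    for dkey, dval in data_dict.items():
        for header_dict in header_rows:
            for hkey, hval in header_dict.items():
                if dkey == hval:          # short header equals a Name value
                    matches.append(dkey)

print(sorted(matches))   # ['HH1', 'HH2'] -- both short headers found a match
```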

By running the code, we see we have many matches. Let’s see if we can use similar logic to replace the titles with better ones. We know we can match them really easily.

However, we only found the row we wanted to match. Let’s see if we can figure out how to match the keys from the data list with the values from the row in the header row we found:


new_rows = []

for data_dict in data_rows:
    new_row = {}
    for dkey, dval in data_dict.items():
        for header_dict in header_rows:
            if dkey in header_dict.values():
                new_row[header_dict.get('Label')] = dval
    new_rows.append(new_row)

Creates a new list to populate with cleaned rows.

Creates a new dictionary for each row.

Here, we use the dictionary’s values method instead of iterating over every key and value of the header rows. This method returns a list of only the values in that dictionary. We are also using Python’s in operator, which tests whether an object is a member of a list. For this line of code, the object is our key, or the abbreviated string, and the list is the values of the header dictionary (which contains the abbreviated headers). When this line is true, we know we have found the matching row.

Adds to our new_row dictionary every time we find a match. This sets the dictionary key equal to the Label value in the header row, replacing those short Name values with the longer, more readable Label values, and keeps the values set to the data row values.

Appends the new cleaned dictionary we created to our new list. This is indented to ensure we have all the matches before going to the next row.
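Run against the same kind of toy rows (hand-made stand-ins, not the real survey data), the values()-based version builds one readable row per record:

```python
# Toy stand-ins for the real parsed CSV contents
data_rows = [{'HH1': '1', 'HH2': '17'}]
header_rows = [{'Name': 'HH1', 'Label': 'Cluster number'},
               {'Name': 'HH2', 'Label': 'Household number'}]

new_rows = []
for data_dict in data_rows:
    new_row = {}
    for dkey, dval in data_dict.items():
        # membership test against the header dict's values
        for header_dict in header_rows:
            if dkey in header_dict.values():
                new_row[header_dict.get('Label')] = dval
    new_rows.append(new_row)

print(new_rows[0])   # {'Cluster number': '1', 'Household number': '17'}
```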

You can see from a simple print of the first record of our new values that we have successfully made the data readable:

In [8]: new_rows[0]
Out[8]: {
    'AIDS virus from mother to child during delivery': 'Yes',
    'AIDS virus from mother to child during pregnancy': 'DK',
    'AIDS virus from mother to child through breastfeeding': 'DK',
    'Age': '25-29',
    'Age at first marriage/union': '29',
    ...

One easy way to determine whether you have the proper indentation for your function is to look at other lines with the same indentation. Always ask yourself: What other code logically goes with this step? When should I move on to the next step in the process?

There isn’t always only one good solution to data cleanup problems, so let’s see if we can solve our unreadable header problem another way using a different technique.

Zipping questions and answers

Another way to fix the label problem is to use Python’s built-in zip function:

from csv import reader

data_rdr = reader(open('data/unicef/mn.csv', 'rb'))
header_rdr = reader(open('data/unicef/mn_headers.csv', 'rb'))

data_rows = [d for d in data_rdr]
header_rows = [h for h in header_rdr]

print len(data_rows[0])
print len(header_rows)

This time, instead of using DictReader, we use the simple reader class. The simple reader creates a list for each row, rather than a dictionary. Because we want to use zip, we need lists instead of dictionaries, so we can zip the list of header values with the list of data values.

These lines create lists for our header and data readers and print them to see if they are the same length.
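To make the difference between the two readers concrete, here is a small sketch (in Python 3 text-mode syntax rather than the book’s Python 2 'rb' style, using io.StringIO as a stand-in for an open file):

```python
import io
from csv import reader, DictReader

csv_text = 'Name,Label\nHH1,Cluster number\n'

# reader yields a plain list per row, including the header row itself
list_rows = [row for row in reader(io.StringIO(csv_text))]

# DictReader consumes the header row and yields a mapping per data row
dict_rows = [row for row in DictReader(io.StringIO(csv_text))]

print(list_rows)                 # [['Name', 'Label'], ['HH1', 'Cluster number']]
print(dict_rows[0]['Label'])     # Cluster number
```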

Oops—our printed length output shows we have a mismatch between our data and headers! Our data has only 159 columns, while our header list contains 210 possible headers. This likely means MICS uses more questions for other countries or provides more questions to choose from than appear in our Zimbabwe dataset.

We need to further investigate which headers are used in the dataset and which ones we can leave behind. Let’s take a closer look to find which ones don’t align properly:

In [22]: data_rows[0]
Out[22]: ['', 'HH1', 'HH2', 'LN', 'MWM1', 'MWM2', 'MWM4', 'MWM5',
          'MWM6D', 'MWM6M', 'MWM6Y', ...]

In [23]: header_rows[:2]
Out[23]: [['Name', 'Label', 'Question'],
          ['HH1', 'Cluster number', '']]

OK, so we can clearly see here that we need to match each entry in the first row of data_rows against the first element (index 0) of each row in header_rows. Once we identify which ones don’t match, we want to toss them out of header_rows so we can zip the data properly:

bad_rows = []

for h in header_rows:
    if h[0] not in data_rows[0]:
        bad_rows.append(h)

for h in bad_rows:
    header_rows.remove(h)

print len(header_rows)

Tests if the first element of the header row (the shorthand version of the header) is in the first row of the data (all shortened headers).

Appends the rows identified as having mismatched headers to our new list, bad_rows. We use this in the next step to identify rows to remove.

Uses the list’s remove method to remove a particular row of data from a list. This method is often useful in situations where you can identify one specific row (or set of rows) you want to remove from a list.
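Collecting bad_rows first, rather than calling remove inside the loop over header_rows, is deliberate: mutating a list while iterating over it causes elements to be skipped. A toy demonstration (made-up rows, not the survey data):

```python
# Wrong: removing during iteration skips the element after each removal
broken = [['AAA'], ['BBB'], ['CCC']]
for h in broken:
    broken.remove(h)
print(broken)   # [['BBB']] -- 'BBB' survived even though we tried to remove all

# Right: gather the rows to drop first, then remove them afterward
rows = [['AAA'], ['BBB'], ['CCC']]
bad_rows = [h for h in rows]
for h in bad_rows:
    rows.remove(h)
print(rows)     # [] -- all rows removed, as intended
```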

Aha! So now we can see we are nearly matching. We have 159 values in our data rows and 150 values in our header list. Now let’s see if we can figure out why we don’t have those nine matching headers in our header list:

all_short_headers = [h[0] for h in header_rows]

for header in data_rows[0]:
    if header not in all_short_headers:
        print 'mismatch!', header

Uses a Python list comprehension to make a list of all the short headers by collecting only the first element of each header row.

Iterates over the headers in our dataset to see which ones don’t align with our cleaned header list.

Singles out the headers that don’t match from our abbreviated list.

Uses print to display the mismatches. If you need a quick way to print two strings on the same line, you can put a , between them; print will join them with a space.

When you run this code, your output should look something like:

mismatch!
mismatch! MDV1F
mismatch! MTA8E
mismatch! mwelevel
mismatch! mnweight
mismatch! wscoreu
mismatch! windex5u
mismatch! wscorer
mismatch! windex5r

From the output and our current knowledge of the data, we can see that only a few of the mismatched headers (those with capitals) are ones we might want to fix. The lowercase titles are used for UNICEF internal methodology and don’t line up with questions we have for our own investigation.

Because the MDV1F and MTA8E variables were not found with the web scraper we built to collect headers from the World Bank site, we need to investigate what they mean using our SPSS viewer. (The other option is to drop these rows and move on.)

When you are dealing with raw data, sometimes you’ll find that getting it into a usable form means dropping data you don’t need or data that’s difficult to clean. In the end, the determining factor should not be sloth, but instead whether the data is essential to your questions.

After opening the SPSS viewer, we can see MDV1F matches the label “If she commits infidelity: wife beating justified” and matches up with another longer set of questions regarding domestic abuse. We have other questions related to relationship abuse, so it’s probably a good idea to include this. Investigating the MTA8E header shows it matches up with a different series of questions, about which type of tobacco is smoked by the person. We have added both to a new file, mn_headers_updated.csv.

Now we can retry the original code, this time using our updated headers file:

Let’s look at it all together and make a few changes so we can try zipping our headers and data together. The following script requires a lot of memory, so if you have less than 4GB RAM, we recommend running it in an IPython terminal or notebook to help mitigate segmentation faults:

from csv import reader

data_rdr = reader(open('data/unicef/mn.csv', 'rb'))
header_rdr = reader(open('data/unicef/mn_headers_updated.csv', 'rb'))

data_rows = [d for d in data_rdr]
header_rows = [h for h in header_rdr if h[0] in data_rows[0]]

print len(header_rows)

all_short_headers = [h[0] for h in header_rows]
skip_index = []

for header in data_rows[0]:
    if header not in all_short_headers:
        index = data_rows[0].index(header)
        skip_index.append(index)

new_data = []

for row in data_rows[1:]:
    new_row = []
    for i, d in enumerate(row):
        if i not in skip_index:
            new_row.append(d)
    new_data.append(new_row)

zipped_data = []

for drow in new_data:
    zipped_data.append(zip(header_rows, drow))

Uses a list comprehension to quickly remove mismatched headers. As you can see, we can also use an if statement inside a list comprehension. Here, the code makes a list of rows from the header rows list, as long as the first header row element (abbreviated header) is in the headers from the data rows.

Creates a list to hold the indexes of data rows whose data we aren’t interested in keeping.

Utilizes the Python list’s index method to return the indexes we should skip because the headers aren’t in the abbreviated list. The next line will save the indexes of the data rows that don’t match our headers, so we can skip collecting that data.

Slices the list holding the survey data to include only data rows (all rows except the first row) and then iterates through them.

Uses the enumerate function to isolate the indexes of the data rows to skip. This function takes an iterable object (here, the data row list) and returns the numeric index and value for each item. It assigns the first value (index) to i and assigns the data value to d.

Tests to make sure the index is not in the list we want to skip.

After going through each item (or “column”) in the data row, adds the new entry to the new_data list.

Zips each row (now exactly matched with header and data) and adds it to a new array, zipped_data.
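The index-skipping step can be seen in miniature on a single toy row (made-up values, not the survey data):

```python
skip_index = [1]                  # pretend column 1 had no matching header
row = ['val_a', 'val_b', 'val_c']

new_row = []
for i, d in enumerate(row):       # i is the numeric index, d is the value
    if i not in skip_index:
        new_row.append(d)

print(new_row)   # ['val_a', 'val_c']
```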

We can now print out a row of our new dataset and see if we have what we expected:

In [40]: zipped_data[0]
Out[40]: [(['HH1', 'Cluster number', ''], '1'),
 (['HH2', 'Household number', ''], '17'),
 (['LN', 'Line number', ''], '1'),
 (['MWM1', 'Cluster number', ''], '1'),
 (['MWM2', 'Household number', ''], '17'),
 (['MWM4', "Man's line number", ''], '1'),
 (['MWM5', 'Interviewer number', ''], '14'),
 (['MWM6D', 'Day of interview', ''], '7'),
 (['MWM6M', 'Month of interview', ''], '4'),
 (['MWM6Y', 'Year of interview', ''], '2014'),
 (['MWM7', "Result of man's interview", ''], 'Completed'),
 (['MWM8', 'Field editor', ''], '2'),
 (['MWM9', 'Data entry clerk', ''], '20'),
 (['MWM10H', 'Start of interview - Hour', ''], '17'),
 ...

We have all of the questions and answers together in tuples, and every row has all of the matching data with headers. To be certain we have everything correct, let’s take a look at the end of that row:

(['TN11', 'Persons slept under mosquito net last night',
  'Did anyone sleep under this mosquito net last night?'], 'NA'),
(['TN12_1', 'Person 1 who slept under net',
  'Who slept under this mosquito net last night?'], 'Currently married/in union'),
(['TN12_2', 'Person 2 who slept under net',
  'Who slept under this mosquito net last night?'], '0'),

This looks strange. It seems like we might have some mismatches. Let’s do a reality check and use our newly learned zip method to see if our headers match up properly:

data_headers = []

for i, header in enumerate(data_rows[0]):
    if i not in skip_index:
        data_headers.append(header)

header_match = zip(data_headers, all_short_headers)

print header_match

Iterates over the headers in the data list.

By using if...not in..., returns True only for indexes not included in skip_index.

Zips the new lists of headers together so we can visually check for the mismatch.
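Because zip pairs items purely by position, it will happily pair the wrong things when one list is out of order, which is exactly what this check is designed to expose. A minimal illustration with made-up header names:

```python
data_headers  = ['MMC1', 'MMC2', 'MTA1']
short_headers = ['MTA1', 'MMC1', 'MMC2']   # same names, different order

# zip pairs by position, not by value
pairs = list(zip(data_headers, short_headers))
print(pairs)   # [('MMC1', 'MTA1'), ('MMC2', 'MMC1'), ('MTA1', 'MMC2')]

mismatched = [p for p in pairs if p[0] != p[1]]
print(len(mismatched))   # 3 -- every pair is wrong, though the sets are equal
```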

Aha! Did you see the error?

....
('MHA26', 'MHA26'),
('MHA27', 'MHA27'),
('MMC1', 'MTA1'),
('MMC2', 'MTA2'),
....

Everything matches until this point, where our header file and data file diverge with regard to question order. Because zip pairs items purely by position, we must reorder our headers to match the dataset before we can use it. Here’s our new try at matching our data:

from csv import reader

data_rdr = reader(open('data/unicef/mn.csv', 'rb'))
header_rdr = reader(open('data/unicef/mn_headers_updated.csv', 'rb'))

data_rows = [d for d in data_rdr]
header_rows = [h for h in header_rdr if h[0] in data_rows[0]]

all_short_headers = [h[0] for h in header_rows]
skip_index = []
final_header_rows = []

for header in data_rows[0]:
    if header not in all_short_headers:
        index = data_rows[0].index(header)
        skip_index.append(index)
    else:
        for head in header_rows:
            if head[0] == header:
                final_header_rows.append(head)
                break

new_data = []

for row in data_rows[1:]:
    new_row = []
    for i, d in enumerate(row):
        if i not in skip_index:
            new_row.append(d)
    new_data.append(new_row)

zipped_data = []

for drow in new_data:
    zipped_data.append(zip(final_header_rows, drow))

Makes a new list to contain the final properly ordered header rows.

Uses an else statement to include only columns where we have a match.

Iterates over our header_rows until there’s a match.

Tests the short header to see if the question lines up. We use == to test for a match.

Uses break to exit the for head in header_rows loop once a match is found.

This makes it faster and doesn’t hurt the outcome.

Zips each data row with our new final_header_rows list, whose rows are now in the proper order.
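As a variation on this approach (a sketch with toy stand-in data, not code from the book), you could build a dictionary mapping each short name to its full header row once, then reorder with a single lookup per column, avoiding the inner for head in header_rows loop:

```python
# Toy stand-ins for the parsed CSV contents
data_rows = [['HH1', 'HH2', 'XTRA'],      # first row: short headers
             ['1', '17', 'ignored']]
header_rows = [['HH2', 'Household number', ''],
               ['HH1', 'Cluster number', '']]

# One-time lookup table: short name -> full header row
header_by_name = {h[0]: h for h in header_rows}

# Reorder headers to follow the dataset's column order, dropping unknowns
final_header_rows = [header_by_name[h] for h in data_rows[0]
                     if h in header_by_name]

print(final_header_rows)
# [['HH1', 'Cluster number', ''], ['HH2', 'Household number', '']]
```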

After running our new code, we want to take a look at the end of our first entry:

(['TN12_3', 'Person 3 who slept under net',
  'Who slept under this mosquito net last night?'], 'NA'),
(['TN12_4', 'Person 4 who slept under net',
  'Who slept under this mosquito net last night?'], 'NA'),
(['HH6', 'Area', ''], 'Urban'),
(['HH7', 'Region', ''], 'Bulawayo'),
(['MWDOI', 'Date of interview women (CMC)', ''], '1372'),
(['MWDOB', 'Date of birth of woman (CMC)', ''], '1013'),
(['MWAGE', 'Age', ''], '25-29'),

This looks like a good match. We can likely improve the clarity of our code; however, we have found a good way to preserve most of the data and zip our data together, and it works relatively fast.

You will always need to evaluate how complete you need the data to be and what level of effort fits your project’s cleanup needs. If you are only using one part of the data, it’s likely you don’t need to retain it all. If the dataset is your primary research source, it’s worth more time and effort to keep it complete.

In this section, we’ve learned some new tools and methods to identify what’s wrong or needs cleaning and implemented fixes by combining the Python we know with our new data cleanup techniques.

