Joshua Hatch, Senior Editor, Data and Interactives, The Chronicle of Higher Education and The Chronicle of Philanthropy. Both the business of working with data and the sophistication of tools have evolved over the past 20 years, making it more important to think about a common set of techniques.
Preface
Who Should Read This Book
Who Should Not Read This Book
How This Book Is Organized
What Is Data Wrangling?
What to Do If You Get Stuck
Conventions Used in This Book
Using Code Examples
For example, writing a program that uses several pieces of code from this book does not require permission. If you feel that your use of code samples falls outside of fair use or the permission granted above, please contact us at [email protected].
Safari® Books Online
How to Contact Us
We have a website for this book that lists errata, examples, and any additional information. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia.
Acknowledgments
Introduction to Python
The article explains why graduation costs are so high compared to school district budgets. The question remained: why is the Manatee County School District so financially strapped that it can't cover its regular expenses?
Finding Your Datasets
Through this process you will learn how to clean data and identify and solve hidden problems in your data sets. As we cover the data issues you may encounter, you'll learn how to determine whether you should write a cleanup script or use an out-of-the-box approach.
Why Python
Once you have obtained and transformed the data, you will begin your initial data exploration. You will learn to tell the story in multiple formats and compare different publishing options.
Getting Started with Python
Besides the benefits of Python as a language, it also has one of the most open and helpful communities. Many members of the Python community no longer have the fresh eyes that you currently possess.
Which Python Version
If you want to try some of the examples with Python 3, feel free; however, we would prefer that you focus on learning Python 2.7 and move on to Python 3 after completing this book. Most of these external code pieces will work with Python 2.7, but may not yet work with 3.4.
Setting Up Python on Your Machine
If you are using 64-bit Windows, you will need to download the x86-64 MSI installer from the download page. After you have successfully installed Python, you will want to add Python to your environment.
Test Driving Python
If you get an error, note this and make sure you're only typing Python commands in the Python interpreter. If you type a command into your Python interpreter that you should type into the system terminal, you'll probably get a NameError or SyntaxError.
Install pip
YOUR_NAME\Downloads to change to your Downloads folder (replacing YOUR_NAME with your computer's username). When you use pip, your computer searches PyPI for the specified package or code library, downloads it to your computer, and installs it.
Install a Code Editor
Optional: Install IPython
Summary
Python Basics
Launching the Python Interpreter
Basic Data Types
Strings
Some use single quotes by convention in their code, and others prefer double quotes. Personally, we prefer single quotes because double quotes require us to hold down the Shift key.
Integers and Floats
In the following example, the first line imports getcontext and Decimal from the decimal module so that we have them in our environment. There are many different approaches to math that you may need to do, but the decimal type allows us to be more precise when using a non-integer.
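The idea above can be sketched as follows; the precision value and sample numbers here are ours, chosen only to illustrate why decimals are more precise than floats for non-integer math:

```python
# Import Decimal and getcontext from the decimal module.
from decimal import Decimal, getcontext

# Floats accumulate binary rounding error:
print(0.1 + 0.2)  # slightly more than 0.3

# Decimals built from strings keep exact decimal values:
print(Decimal('0.1') + Decimal('0.2'))  # 0.3

# getcontext() lets us control the precision of decimal math.
getcontext().prec = 4
print(Decimal(1) / Decimal(3))  # 0.3333
```

Constructing decimals from strings (rather than from floats) is what preserves the exact value.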
Numbers in Python
As mentioned earlier, there are many mathematical details you will encounter when wrangling your data.
Data Containers
Variables
Object-Oriented Programming
In this example, the error says dogs is not defined, which means we haven't defined a variable called dogs. You will get the same error if you forget to include the quotes around 'budget.csv' in our first example.
Lists
We decided to use short strings in our examples because typing long strings is not fun for you (or us). Now let's save our list in a variable so we can use it later in our code.
Dictionaries
This line defines the variable cat_names as a list of cat names (a list of strings). This line uses the variable cat_names to pass that list of names as the value for the key 'cats' in the dictionary.
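A minimal sketch of that pattern; the cat and dog names here are our own stand-ins, not necessarily the book's:

```python
# A list of cat names (a list of strings).
cat_names = ['Walter', 'Ra', 'Fluffy']

# Pass that list as the value for the key 'cats' in a dictionary.
animal_names = {
    'cats': cat_names,
    'dogs': ['Joker', 'Simon'],
}

# Look up a value by its key.
print(animal_names['cats'])
```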
What Can the Various Data Types Do?
Think of the data types as nouns and the things they can do as verbs. The Python standard library (also known as stdlib) includes the basic data types you have been using, along with a wide range of built-in methods for them.
String Methods: Things Strings Can Do
When we call a string's methods, these actions are part of the standard Python libraries that every Python installation shares (similar to the standard applications that come pre-installed on your phone). These methods will be present on every computer running Python, so any Python string can share the same methods (just as any phone can make a call and any Apple phone can send an iMessage).
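A few of those built-in string methods in action, with a sample string of our choosing:

```python
name = 'daniela'

# Methods are called on a string with dot notation.
print(name.upper())            # 'DANIELA'
print(name.capitalize())       # 'Daniela'
print(name.replace('a', 'o'))  # 'donielo'
print('cat,dog'.split(','))    # ['cat', 'dog']
```

Because these methods live in the standard library, they behave the same on every machine running Python.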
Numerical Methods: Things Numbers Can Do
The number of lines you use for your code is sometimes a matter of style or personal preference. Make choices that make sense to you, but keep your code clear, easy to read, and concise.
Addition and Subtraction
List Methods: Things Lists Can Do
Dictionary Methods: Things Dictionaries Can Do
Helpful Tools: type, dir, and help
For now, ignore everything at the beginning of the returned list (the strings that start with the double underscore). If you look at the string 'cat,dog,horse' it looks like it's a list stored in a string.
The next line reads: "Return the list of words in the string S using sep as the delimiter string." Here, sep is the argument passed to the split method, which is used as the delimiter. The comma is the delimiter in the string we've created, as it separates the words we want in our list.
Putting It All Together
The help description also tells us that spaces, or whitespace, are the default delimiter if no other delimiter is specified. This tells us that if we had a string 'cat dog horse', the split method would split it on the spaces without requiring us to pass a separator.
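Putting the help text into practice:

```python
animals = 'cat,dog,horse'

# Passing a comma as sep splits the string on commas.
print(animals.split(','))        # ['cat', 'dog', 'horse']

# With no argument, split uses whitespace as the delimiter.
print('cat dog horse'.split())   # ['cat', 'dog', 'horse']
```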
What Does It All Mean?
According to the documentation, the pprint.pprint() method produces an easy-to-read display of whatever is passed to it. The code passes sys.path, a list, to pprint.pprint so that the list is displayed in a way that is clear and easy to read.
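A short sketch of that usage:

```python
import pprint
import sys

# sys.path is a list of the directories Python searches for modules.
# pprint.pprint prints it one entry per line instead of one long row.
pprint.pprint(sys.path)
```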
Data Meant to Be Read by Machines
Creating a Home for Your Code
CSV Data
If you don't already have one installed, follow the instructions in “Install a Code Editor” on page 15. If you open the file in your text editor, you'll see the lines that hold the data containing the values in Table 3-1.
How to Import CSV Data
In the next line, we pass the csv file to the reader function in the csv module. With this for loop, we store each dog name in the for loop variable dog.
Closing Indented Code Blocks in IPython
For each of the iterations of our for loop, we print out the dog's name (held in the variable dog). When the for loop has gone through each dog's name (or item in the list), the code has finished running.
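The flow looks roughly like this; to keep the sketch self-contained we build the CSV text inline with io.StringIO (the book reads it from a file instead), and the dog names are our invention:

```python
import csv
import io

# Stand-in for the contents of a CSV file of dog names.
csv_text = 'name,breed\nJoker,Corgi\nSimon,Terrier\n'

csv_file = io.StringIO(csv_text)   # behaves like an open file
reader = csv.reader(csv_file)      # pass the file to csv.reader

for dog in reader:
    # Each iteration holds one row (a list of strings) in `dog`.
    print(dog)
```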
Saving the Code to a File; Running from Command Line
Now that we can import and review our data, we can really start exploring it.
Opening Files from Different Locations
To run a Python file from the command line, you simply type python, a space, and then the name of the file. At this point, we successfully imported the CSV data into Python, which means we were able to get the data from the file into a usable format that Python can understand (dictionaries).
JSON Data
We were able to use two different readers from the csv library to view the data in both a list and a dictionary form.
How to Import JSON Data
Be sure to rename the file to match the name you use in your code. In the case of CSV, we opened the file in read-only mode, and in the case of JSON, we read the contents of the file.
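A self-contained sketch of the JSON path; we inline the JSON string here (the book opens a file and passes its contents to the json module), and the names are invented:

```python
import json

# Stand-in for the contents of a JSON file.
json_text = '[{"name": "Joker"}, {"name": "Simon"}]'

# json.loads converts the JSON string into Python objects --
# here, a list of dictionaries.
data = json.loads(json_text)

for item in data:
    print(item['name'])
```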
XML Data
As you can see, opening and converting a JSON file to a list of dictionaries in Python is pretty easy. While JSON stores data in key/value pairs, XML stores data in nested elements, so values can sit in pairs or in groups of three or four.
How to Import XML Data
To understand how to traverse a tree (and the data in it), we start at the root of the tree. We want to check the keys so we can form the keys and values of the new item dictionary.
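A minimal traversal sketch using the standard library's ElementTree; the element names and values here are invented for illustration:

```python
import xml.etree.ElementTree as ET

xml_text = """
<root>
  <record>
    <country>Chad</country>
    <value>39</value>
  </record>
</root>
"""

# Parse the string and start at the root of the tree.
root = ET.fromstring(xml_text)

for record in root:
    item = {}
    for child in record:
        # Use each child's tag as the key and its text as the value.
        item[child.tag] = child.text
    print(item)
```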
Working with List Indexes
These all look like great keys for our new dictionary, except for the last one. That's because the last element holds the numeric values rather than the categories we have been working with.
Python’s if Statement
If you use the updated code, you will see that rec_matches now correctly matches Numeric. Run the code again and you should see that we have values for both rec_key and rec_value.
Working with Excel Files
Installing Python Packages
There are many ways to install pip, and you should have already done this in Chapter 1 of this book. In Chapter 13, we'll learn more about the selection process when looking at Twitter libraries.
Parsing Excel Files
Getting Started with Parsing
Create a new Python file called parse_excel.py and place it in the directory you created. Place the Excel file from the book's repository named SOWC 2014 Stat Tables_Table 9.xlsx in the same folder.
What Is range()?
The first method we will use is a nested for loop, which is a for loop inside another for loop. We are going to use a nested for loop to output each cell from each row.
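The book iterates over an xlrd worksheet; here is the same nested-loop shape over a plain list of rows, so the sketch runs without the Excel file (the row contents are invented):

```python
# Stand-in for rows pulled from a spreadsheet.
rows = [
    ['Country', 'Total'],
    ['Chad', 39],
    ['Ghana', 34],
]

# A nested for loop: the outer loop walks the rows,
# the inner loop walks the cells in each row.
for row in rows:
    for cell in row:
        print(cell)
```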
What Is a Counter?
The answer is in the following code example, but before you look, try updating the counter to start at the row where the country names start. This line will iterate through the first 20 rows to identify which row the country names start on.
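The counter pattern itself, sketched over a plain list of rows (the book does this with xlrd rows; the cutoff of 20 mirrors the text, and the row labels are invented):

```python
rows = ['title', 'blank', 'headers', 'Chad', 'Ghana']

# Start a counter at zero and bump it once per row, printing the
# index so we can spot where the country names start.
counter = 0
for row in rows:
    if counter < 20:
        print(counter, row)
    counter += 1
```

Python's built-in enumerate() gives the same index-plus-item pairing without a manual counter.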
What Is Indexing?
Now we need to match each of the values in the remaining row to the corresponding values in the spreadsheet, then store them in our dictionary. Knowing that xrange can accept a start and end point, we can start counting at 14 and end at the end of the file.
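In Python 2 this is xrange; Python 3's range takes the same start and end points. The 14 mirrors the row number mentioned above:

```python
# range (xrange in Python 2) can take a start and an end point.
# Here we count from row 14 up to, but not including, row 20.
for row_number in range(14, 20):
    print(row_number)
```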
Comments
This is a multiline comment used to generally describe what is going on in this script. It will stop executing the loop and continue to the next part of the script.
PDFs and Problem Solving in Python
Avoid Using PDFs!
Programmatic Approaches to PDF Parsing
As mentioned in “Installing Python Packages” on page 73, PyPI is a convenient place for us to look for Python packages. When you are learning to program, the best tool is often the most intuitive one.
Opening and Reading Using slate
If you look at your PDF, it's easy to see the pattern of rows on the page. If you need to skip lines, you can follow the pattern from the Excel example in the previous chapter: we made a counter that incremented by one for each row, used it to find the region, and then used the technique described in “What Is Indexing?” on page 83 to select only the data we needed.
Converting PDF to Text
In some cases, tables are buried in text, so you may have to skip lines to get to the data you need. Our ultimate goal is to get the data from the PDF in the same format as the Excel file output.
Autocompletion Using Tab
Parsing PDFs Using pdfminer
If state_line is true, print the line so we can see the state name. If line_country or total_line is set to True, output the line so we can see what data we have.
Arguments with Default Values Always Come Last
Fill the previous_line variable with the current line at the end of the for loop. If the previous_line is in the doubleline_countries list, this line joins the previous_line with the current line and assigns the combined lines to the line variable.
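The previous_line technique in isolation, with invented sample data standing in for the PDF's text lines:

```python
# Countries whose names span two lines in the PDF text (invented sample).
doubleline_countries = ['Bolivia (Plurinational']

lines = ['Belgium', 'Bolivia (Plurinational', 'State of)', 'Chad']

previous_line = ''
merged = []
for line in lines:
    if previous_line in doubleline_countries:
        # Join the two halves and replace the partial entry.
        line = previous_line + ' ' + line
        merged[-1] = line
    else:
        merged.append(line)
    # Fill previous_line with the current line at the end of the loop.
    previous_line = line

print(merged)
```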
Automating Your Data Checking
If the line is not in the list of double_lined_countries, the next line adds it to the list of countries. Let's get back to the original task we were working on: creating our list of countries with double lines so we can make sure we collect both lines.
Learning How to Solve Problems
Exercise: Use Table Extraction, Try a Different Library
We can see that the addresses are included in the first three lists and are unordered. Let's run the numbers and see if we can weed out the bad data.
Exercise: Clean the Data Manually
We encourage you to keep an eye on them for your PDF table parsing needs. It's dangerous to rely on unsupported code, and we can't count on pdftables being around and working forever. However, part of belonging to the open source community is giving back, so we encourage you to find good projects and help by contributing to and publicizing them, in the hope that projects like pdftables remain open source and continue to grow and thrive.
Exercise: Try Another Tool
You will need to manipulate the data further to match the exact structure we had in our Excel import, but we were able to keep the data in our rows from the PDF. To see what else is out there, instead of searching for "parsing pdfs python", we tried searching for "extracting tables from pdf", which gave us more distinct solutions to the table problem (including a blog post reviewing several tools).
Uncommon File Types
If it is not possible to get the data in another format, you should try the following. In Chapters 7 and 8, we cover data cleansing, which will also help with the complexities of processing PDFs.
Acquiring and Storing Data
Not All Data Is Created Equal
Fact Checking
If you want to study a phenomenon in your local city, state, or country, you can usually find someone who has a useful data set over the phone or email. You can even use some of the tips you learn in this book (especially Chapter 14) to create scripts and update your data automatically.
Readability, Cleanliness, and Longevity
Once you've checked and verified your initial data set, you'll find it easier to script and determine the validity of the data in the future. Knowing how often an organization updates its data will help you determine your ability to use the data in the years to come.
Where to Find Data
Using a Telephone
What methods were used in data collection, and how were they applied? Don't be shy: your interest in their subject of study and work can be refreshing and flattering.
Dealing with a Communications Official
US Government Data
FOIA How-To
Government and Civic Open Data Worldwide
If you are interested in data from Asian countries, most of them run their own open data websites. Many non-EU Central Asian, Central European and Middle Eastern countries have their own government open data websites.
Organization and Non-Government Organization (NGO) Data
Education and University Data
Medical and Scientific Data
Crowdsourced Data and APIs
Outside of social web services, there are a number of sites where you can post your own questions and ideas and ask for a crowdsourced response. The amount of data available is huge, and it's no small task to sort through all the noise to get a good idea of what questions you can answer and how you should go about answering them.
Case Studies: Example Data Investigation
Whether you want to go to an expert forum related to the topic or post a survey and distribute it through your own channels is up to you. However, keep in mind that when using your own survey questions and methods, you must account for sample size and sampling error. For a more detailed introduction to writing your own survey, along with references for further reading, the University of Wisconsin survey guide can be a good starting point.
Ebola Crisis
Train Safety
Football Salaries
Child Labor
We've included them all in the repository for you to use and follow. Now that we've explored how to identify queries and find resources, let's look at storing our data.
Storing Your Data: When, Why, and How?
When searching for government data, we identified the US Department of Labor's annual reports on child labor around the globe. The ILO report appears to be relevant to many different data sets and should be a good reference for historical data on child labor.
Where Should I Store My Data?
So let's say your data set is different: a file from here, a report from there; some of it is easy to download and access, but other parts you may need to copy or scrape from the web. We'll cover how to clean and combine data sets in Chapters 7 and 9, but now let's talk about how we can store data in a common place.
Databases: A Brief Introduction
If you're going to use datasets from multiple computers, it's always a good idea to store them on a network or the Internet (hello, cloud computing!), or have them on an external hard drive or USB stick. Keep this in mind when working with a team that may need access to data from different locations or computers.
Relational Databases: MySQL and PostgreSQL
If you are familiar with (or learning) MySQL and want to use a MySQL database, there are Python bindings to connect easily. If you are familiar with (or learning) PostgreSQL and want to use a PostgreSQL database, there are Python bindings for PostgreSQL as well.
Non-Relational Databases: NoSQL
While there are plenty to choose from, one of the most popular NoSQL databases is MongoDB. To use MongoDB, you must first install the drivers and then use Python to connect.
Setting Up Your Local Database with Python
If you already have data in a non-relational database structure or are hoping to get the hang of it, it's very easy to connect NoSQL databases using Python. Having all your data stored in one place keeps your data organized and your research focused.
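As a dependency-free illustration of keeping data in one local place, here is the standard library's sqlite3 module; the table name and values are invented, and ':memory:' is used so the sketch leaves no file behind (a real project would pass a filename):

```python
import sqlite3

# Connect to a local database; ':memory:' keeps this self-contained.
conn = sqlite3.connect(':memory:')

conn.execute('CREATE TABLE reports (country TEXT, total INTEGER)')
conn.execute('INSERT INTO reports VALUES (?, ?)', ('Chad', 39))
conn.commit()

# Query the rows back out.
results = list(conn.execute('SELECT country, total FROM reports'))
print(results)
conn.close()
```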
When to Use a Simple File
To manage these needs, you can store your data on a shared network drive or in a cloud-based service (Dropbox, Box, Amazon, Google Drive). Using one of these services usually means you'll also get backup options and easier manageability.
Cloud-Storage and Python
Local Storage and Python
Alternative Data Storage
You've worked through some of the biggest questions facing your project, such as: how can I find useful data? You will use all these skills and more as you progress through future chapters.
Data Cleanup: Investigation, Matching, and Formatting
Why Clean Data?
Imagine that you have columns or fields in your dataset that need to be stored as a certain data type (like dates, numbers, or email addresses). If you can standardize what you expect to see and clean up or remove data that doesn't fit, then you ensure the consistency of your data and save yourself hard work later when you need to search for values in your dataset.
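One way to sketch that kind of standardization check, using a deliberately loose (and not RFC-complete) email pattern of our own:

```python
import re

# A loose pattern: something@something.something (illustrative only).
EMAIL_RE = re.compile(r'^[^@\s]+@[^@\s]+\.[^@\s]+$')

def is_valid_email(value):
    """Return True if value looks like an email address."""
    return bool(EMAIL_RE.match(value))

# Keep only the rows that fit the expected format.
rows = ['anna@example.com', 'not-an-email', 'joe@example.org']
clean = [row for row in rows if is_valid_email(row)]
print(clean)
```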
Data Cleanup Basics
As we explored in Chapter 6, it's much easier to store your data in the right models if it's cleaned up first. It's a great tool for some social science statistics, but it's not very useful for our Python needs.
Identifying Values for Data Cleanup
Okay, so we can clearly see here that we need to match the second row of data_rows to the first index of header_rows. The next line will store the indexes of the data rows that don't match our headers, so we can skip collecting that data.
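Header-to-row matching like this is commonly done with zip, which pairs items at the same index; the header names and values below are stand-ins:

```python
header_rows = ['Country', 'Year', 'Total']
data_row = ['Chad', '2012', '39']

# zip pairs each header with the value at the same index,
# and dict turns those pairs into a record.
record = dict(zip(header_rows, data_row))
print(record)
```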
Finding Outliers and Bad Data
As we talked about in Chapter 6, you want to make sure that your source is properly researched and that you can trust the data. For example, you can look through a data set and see if an important value is missing.
As with the code for counting NA responses, here we test whether the question is in the keys of the data type dictionary. If the query is not in the data type dictionary, this code adds it and stores a copy of start_dict as the value.
Finding Duplicates
Merging two arrays combines the values of the first array and the second array. Passing return_index=True results in a tuple of arrays: the first is an array of the unique values, and the second is an array of their first indexes.
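The chapter does this with NumPy arrays; as an equivalent standard-library sketch, collections.Counter finds the duplicated values directly (the sample values are invented):

```python
from collections import Counter

values = ['Chad', 'Ghana', 'Chad', 'Mali', 'Ghana']

# Counter tallies occurrences; any value with a count above one
# is a duplicate.
counts = Counter(values)
duplicates = [value for value, count in counts.items() if count > 1]
print(sorted(duplicates))
```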
Fuzzy Matching
Here we use the fuzz module's token_set_ratio function, which uses the same token approach but compares the sets of tokens to see intersections and differences. Unfortunately, we couldn't match our cheeseburger answer properly, even though the meaning is the same.
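FuzzyWuzzy is a third-party library; to give a dependency-free flavor of ratio-based similarity, the standard library's difflib scores two strings between 0.0 and 1.0, though it compares character sequences rather than token sets:

```python
import difflib

a = 'cheeseburger'
b = 'cheese burger'

# ratio() returns a similarity score between 0.0 and 1.0.
score = difflib.SequenceMatcher(None, a, b).ratio()
print(score)
```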
RegEx Matching
This code uses FuzzyWuzzy's extractOne method to return only the best match between our string and the list of available choices. The extractOne function finds the best match and returns it along with its ratio as a tuple.
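Since this section is about regular expressions, here is a bare-bones re example of the exact pattern matching it contrasts with fuzzy matching; the phone-number pattern and sample string are our invention:

```python
import re

# Match a simple US-style phone number like 555-123-4567,
# capturing the three groups of digits.
phone_pattern = re.compile(r'(\d{3})-(\d{3})-(\d{4})')

match = phone_pattern.match('555-123-4567')
if match:
    # group() returns the whole match; groups() the captured parts.
    print(match.group())
    print(match.groups())
```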