Book: Data Wrangling with Python

Eben Ezer

Academic year: 2024

Joshua Hatch, Senior Editor, Data and Interactives, The Chronicle of Higher Education and The Chronicle of Philanthropy. Both the business of working with data and the sophistication of tools have evolved over the past 20 years, making it more important to think about a common set of techniques.

Preface

Who Should Read This Book

Who Should Not Read This Book

How This Book Is Organized

What Is Data Wrangling?

What to Do If You Get Stuck

Conventions Used in This Book

Using Code Examples

For example, writing a program that uses several pieces of code from this book does not require permission. If you feel that your use of code samples falls outside of fair use or the permission granted above, please contact us at [email protected].

Safari® Books Online

How to Contact Us

We have a website for this book that lists errata, examples, and any additional information. Find us on Facebook: http://facebook.com/oreilly Follow us on Twitter: http://twitter.com/oreillymedia Watch us on YouTube: http://www.youtube.com/oreillymedia.

Acknowledgments

Introduction to Python

The article explains why graduation costs are so high compared to school district budgets. The question remained: Why is the Manatee County School District so financially strapped that it can't make its regular dues?

Finding Your Datasets

Through this process you will learn how to clean data and identify and solve hidden problems in your data sets. As we cover the data issues you may encounter, you'll learn how to determine whether you should write a cleanup script or use an out-of-the-box approach.

Why Python

Once you have obtained and transformed the data, you will begin your initial data exploration. You will learn to tell the story in multiple formats and compare different publishing options.

Getting Started with Python

Besides the benefits of Python as a language, it also has one of the most open and helpful communities. Many members of the Python community no longer have the fresh eyes that you currently possess.

Which Python Version

If you want to try some of the examples with Python 3, feel free; however, we would prefer that you focus on learning Python 2.7 and move on to Python 3 after completing this book. Most of these external code pieces will work for Python 2.7, but may not yet work for 3.4.

Setting Up Python on Your Machine

If you are using 64-bit Windows, you will need to download the x86-64 MSI installer from the download page. After you have successfully installed Python, you will want to add Python to your environment.

Figure 1-4. Adding features using the installer

Test Driving Python

If you get an error, note this and make sure you're only typing Python commands in the Python interpreter. If you type a command into your Python interpreter that you should type into the system terminal, you'll probably get a NameError or SyntaxError.

Install pip

YOUR_NAME\Downloads to change to your Downloads folder (replacing YOUR_NAME with the name of your computer's home directory). When you use pip, your computer searches PyPI for the specified package or code library, downloads it to your computer, and installs it.

Install a Code Editor

Optional: Install IPython

Summary

Python Basics

Launching the Python Interpreter

Basic Data Types

Strings

Some use single quotes by convention in their code, and others prefer double quotes. Personally, we prefer single quotes because double quotes require us to hold down the Shift key.

Integers and Floats

In the following example, the first line imports getcontext and Decimal from the decimal module so that we have them in our environment. There are many different approaches to math that you may need to do, but the decimal type allows us to be more precise when using a non-integer.
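The import described above might look like the following minimal sketch. It is written in Python 3 syntax (an assumption, since the book targets Python 2.7), and the precision of six digits is an arbitrary value chosen only for illustration.

```python
from decimal import getcontext, Decimal

# Floats are stored in binary, so some decimal fractions are only approximated.
float_sum = 0.1 + 0.2

# Building Decimals from strings keeps the exact decimal values.
decimal_sum = Decimal('0.1') + Decimal('0.2')

# getcontext() exposes the precision used for Decimal arithmetic.
# Six significant digits is an arbitrary choice for this illustration.
getcontext().prec = 6
one_third = Decimal(1) / Decimal(3)

print(float_sum == 0.3)   # the float result is not exactly 0.3
print(decimal_sum)
print(one_third)
```

The float comparison fails because 0.1 and 0.2 have no exact binary representation, while the Decimal sum is exact.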

Numbers in Python

As mentioned earlier, there are many mathematical details you will encounter when wrangling your data.

Data Containers

Variables

Object-Oriented Programming

In this example, the error says, dogs is not defined, which means we haven't defined a variable called dogs. You will get the same error if you forgot to include the quotes in 'budget.csv' in our first example.
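A small sketch of the error described above, in Python 3 syntax (an assumption, as the book uses Python 2.7). The try/except wrapper is only there so the message can be inspected; in the interpreter you would simply see the traceback.

```python
# Referring to a name that was never assigned raises a NameError.
try:
    dogs
except NameError as error:
    message = str(error)

# The message names the variable Python could not find.
print(message)
```

Leaving the quotes off a filename such as 'budget.csv' triggers the same kind of error, because Python then treats budget as a variable name rather than a string.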

Lists

We decided to use short strings in our examples because typing long strings is not fun for you (or us). Now let's save our list in a variable so we can use it later in our code.

Dictionaries

This line defines the variable cat_names as a list of cat names (a list of strings). This line uses the variable cat_names to pass that list of names as the value for the key 'cats' in the dictionary.
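The two lines described above could be sketched as follows. The cat names and the dictionary name animal_names are placeholders invented for this illustration.

```python
# A list of strings stored in a variable (the names are placeholders).
cat_names = ['Walter', 'Ra']

# The variable is passed as the value for the key 'cats'.
# animal_names is a hypothetical name for the dictionary.
animal_names = {'cats': cat_names}

# Looking up the key returns the same list.
print(animal_names['cats'])
```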

What Can the Various Data Types Do?

Think of the data types as nouns and the things they can do as verbs. A wide range of built-in methods and basic data types are included in the Python standard library (also known as stdlib), including the Python data types you use today.

String Methods: Things Strings Can Do

When we call a string's methods, these actions are part of the standard Python libraries that every Python installation shares (similar to the standard applications that come pre-installed on your phone). These methods will be present on every computer running Python, so any Python string can share the same methods (just as any phone can make a call and any Apple phone can send an iMessage).

Numerical Methods: Things Numbers Can Do

The number of lines you use for your code is sometimes a matter of style or personal preference. Make choices that make sense to you, but keep your code clear and easy to read.

Addition and Subtraction

List Methods: Things Lists Can Do

Dictionary Methods: Things Dictionaries Can Do

Helpful Tools: type, dir, and help

For now, ignore everything at the beginning of the returned list (the strings that start with the double underscore). If you look at the string 'cat,dog,horse' it looks like it's a list stored in a string.
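As a rough sketch of how type and dir can be used on the string mentioned above (Python 3 syntax is an assumption). The list comprehension that hides the double-underscore names is an addition for readability, not something the text prescribes.

```python
animals = 'cat,dog,horse'

# type tells us what kind of object we are holding.
print(type(animals))

# dir lists everything the object can do; here the names that
# start with a double underscore are filtered out.
public_methods = [name for name in dir(animals) if not name.startswith('__')]
print(public_methods)

# The text that help(animals.split) would display is stored in __doc__.
split_doc = animals.split.__doc__
print(split_doc)
```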

The next line reads: "Return the list of words in the string S using sep as the delimiter string." Here sep is the argument passed to the split method, which is used as the delimiter. The comma is also the delimiter in the string we've created, as it separates the words we want in our list.

Putting It All Together

The help description also tells us that spaces, or white space, are the default delimiter if no other delimiter is specified. This tells us that if we had a string 'cat dog horse', the split method would not require us to pass a separator.
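The two cases described in the help text can be sketched as follows (Python 3 syntax is an assumption; the variable names are illustrative).

```python
# A comma-delimited string needs the comma passed as the separator.
animals = 'cat,dog,horse'
animal_list = animals.split(',')

# A whitespace-delimited string can rely on the default delimiter.
spaced_animals = 'cat dog horse'
spaced_list = spaced_animals.split()

print(animal_list)
print(spaced_list)
```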

What Does It All Mean?

According to the documentation, the pprint.pprint() function prints an easy-to-read display of whatever is passed to it. Here sys.path, a list, is passed to pprint.pprint so that the list is displayed in a way that is clear and easy to read.
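A minimal sketch of the call described above (Python 3 syntax is an assumption). pprint.pformat is included only to show that the same formatting is available as a string.

```python
import pprint
import sys

# sys.path is a list of the locations Python searches for modules.
# pprint.pprint prints long lists one item per line.
pprint.pprint(sys.path)

# pformat returns the same formatted text as a string instead of printing it.
formatted_path = pprint.pformat(sys.path)
```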

Table 2-1. Data types (columns: Name, Example)

Data Meant to Be Read by Machines

Creating a Home for Your Code

CSV Data

If you don't already have one installed, follow the instructions in “Install a Code Editor” on page 15. If you open the file in your text editor, you'll see the lines that hold the data containing the values in Table 3-1.

Table 3-1. Two sample data records

How to Import CSV Data

In the next line, we pass the csv file to the reader function in the csv module. With this for loop, we store each dog name in the for loop variable dog.

Closing Indented Code Blocks in IPython

For each of the iterations of our for loop, we print out the dog's name (held in the variable dog). When the for loop has gone through each dog's name (or item in the list), the code has finished running.
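The loop described above might look like this sketch (Python 3 syntax is an assumption, and the dog names are placeholders). The printed_names list exists only so the loop's effect can be checked afterwards.

```python
dogs = ['Joker', 'Simon', 'Ellie']   # placeholder names
printed_names = []

# On each iteration, the loop variable dog holds the next item in the list.
for dog in dogs:
    print(dog)
    printed_names.append(dog)

# Once every item has been visited, the loop ends and the code continues here.
print('Finished after', len(printed_names), 'dogs')
```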

Saving the Code to a File; Running from Command Line

Now that we can import and review our data, we can really start exploring it.

Opening Files from Different Locations

To run a Python file from the command line, you simply type python, a space, and then the name of the file. At this point, we successfully imported the CSV data into Python, which means we were able to get the data from the file into a usable format that Python can understand (dictionaries).

JSON Data

We were able to use two different readers from the csv library to view the data in both a list and a dictionary form.
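The two readers mentioned above can be compared with a short sketch. It uses Python 3 syntax and an in-memory string in place of a CSV file so that it runs on its own; both are assumptions, and the sample rows are made up.

```python
import csv
import io

# Made-up CSV content; in practice this would come from open('some_file.csv').
sample_csv = "name,species\nJoker,dog\nWalter,cat\n"

# csv.reader yields each row as a list of strings (the header row included).
list_rows = list(csv.reader(io.StringIO(sample_csv)))

# csv.DictReader uses the header row as keys and yields one mapping per record.
dict_rows = [dict(row) for row in csv.DictReader(io.StringIO(sample_csv))]

print(list_rows)
print(dict_rows)
```

The list form keeps the header as an ordinary first row, while the dictionary form attaches each header to its value, which is often easier to work with later.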

How to Import JSON Data

Be sure to rename the file to match the name you use in your code. In the case of CSV, we opened the file in read-only mode, and in the case of JSON, we read con‐.

XML Data

As you can see, opening and converting a JSON file to a list of dictionaries in Python is pretty easy. While in JSON you can store data in key/value pairs, in XML you can store data in pairs or groups of three or four.
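A minimal sketch of the JSON conversion described above, in Python 3 syntax (an assumption). The JSON text is a made-up stand-in for the contents of a file; with a real file you would pass its contents to json.loads in the same way.

```python
import json

# Made-up JSON content standing in for the text read from a file.
sample_json = '[{"name": "Joker", "species": "dog"}, {"name": "Walter", "species": "cat"}]'

# json.loads turns the JSON array of objects into a list of dictionaries.
records = json.loads(sample_json)

for record in records:
    print(record['name'], record['species'])
```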

How to Import XML Data

To understand how to traverse a tree (and the data in it), we start at the root of the tree. We want to check the keys so we can form the keys and values of the new item dictionary.
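The traversal can be sketched with the standard library's ElementTree, starting at the root and building one dictionary per record. The XML structure, attribute names, and values below are invented for illustration and are not the book's data file; rec_key and rec_value follow the names used in the text, and Python 3 syntax is an assumption.

```python
import xml.etree.ElementTree as ET

# Invented XML standing in for a real data file.
sample_xml = """<Data>
  <Observation>
    <Dim Category="COUNTRY" Code="AND" />
    <Dim Category="YEAR" Code="1990" />
    <Value Numeric="77.0" />
  </Observation>
</Data>"""

# Start at the root of the tree, then walk down through its children.
root = ET.fromstring(sample_xml)

all_data = []
for observation in root:
    record = {}
    for child in observation:
        # An if statement handles the element that stores a number
        # differently from the category elements.
        if 'Numeric' in child.attrib:
            rec_key = 'NUMERIC'
            rec_value = child.attrib['Numeric']
        else:
            rec_key = child.attrib['Category']
            rec_value = child.attrib['Code']
        record[rec_key] = rec_value
    all_data.append(record)

print(all_data)
```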

Working with List Indexes

These all look like great keys for our new dictionary, except for the last one. This is because the last element is a numeric dictionary instead of the categories we have been working with.

Python’s if Statement

If you use the updated code, you will see that rec_value matches correctly for Numeric. Run the code again and you should now see that we have values for our rec_key and our rec_value.

Table 3-2. File types and file extensions
File type | File extensions | Python library
CSV, TSV | .csv, .tsv | csv

Working with Excel Files

Installing Python Packages

There are many ways to install pip, and you should have already done this in Chapter 1 of this book. In Chapter 13, we'll learn more about the selection process when looking at Twitter libraries.

Parsing Excel Files

Getting Started with Parsing

Create a new Python file called parse_excel.py and place it in the directory you created. Place the Excel file from the book's repository named SOWC 2014 Stat Tables_Table 9.xlsx in the same folder.

What Is range()?

The first method we will use is a nested for loop, which is a for loop inside another for loop. We are going to use a nested for loop to output each cell from each row.
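A nested for loop over rows and cells might look like the sketch below. Plain lists stand in for the spreadsheet so the example runs without an Excel library (an assumption), and the cell values are the first few that appear in the Table 9 extract quoted later in this document.

```python
# Each inner list stands in for one spreadsheet row.
rows = [
    ['Afghanistan', '10', '11'],
    ['Albania', '12', '14'],
]

cells_seen = []
for row in rows:          # outer loop: one pass per row
    for cell in row:      # inner loop: one pass per cell in that row
        print(cell)
        cells_seen.append(cell)
```

The inner loop finishes completely for one row before the outer loop moves on to the next row.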

What Is a Counter?

The answer is in the following code example, but before you look, try updating the counter to start at the row where the country names start. This line will iterate through the first 20 rows to identify which row the country names start on.

What Is Indexing?

Now we need to match each of the values in the remaining row to the corresponding values in the spreadsheet, then store them in our dictionary. Knowing that xrange can accept a start and end point, we can start counting at 14 and end at the end of the file.
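Passing a start and an end point can be sketched as follows. This uses Python 3, where range behaves like Python 2's xrange; that substitution is an assumption, and total_rows is a hypothetical stand-in for the number of rows in the sheet.

```python
# Hypothetical row count standing in for the real size of the sheet.
total_rows = 20

# Start counting at row 14 and stop before total_rows.
row_numbers = list(range(14, total_rows))

print(row_numbers)
```

The end point is exclusive, so the last number produced is total_rows - 1, which is the index of the final row when counting from zero.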

Comments

This is a multiline comment used to generally describe what is going on in this script. It will stop executing the loop and continue to the next part of the script.
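The sketch below shows a multiline comment at the top of a script, a single-line comment, and a break that stops a loop early. The row values are placeholders and Python 3 syntax is an assumption.

```python
"""
This is a multiline comment. It describes, in general terms,
what the script does: collect names until a blank row appears.
"""

rows = ['Title', 'Notes', 'Afghanistan', 'Albania', '', 'Footnotes']
countries = []

for row in rows[2:]:
    # This is a single-line comment: a blank row marks the end of the data.
    if row == '':
        break  # stop the loop and continue with the code after it
    countries.append(row)

print(countries)
```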

Table 4-1. New Python programming concepts

PDFs and Problem Solving in Python

Avoid Using PDFs!

Programmatic Approaches to PDF Parsing

As mentioned in “Installing Python Packages” on page 73, PyPI is a convenient place for us to look for Python packages. When learning to program, the best tool is the most intuitive.

Figure 5-1. PDF packages on PyPI

Opening and Reading Using slate

If you look at your PDF, it's easy to see the pattern of rows on the page. If you need to skip lines, you can follow the pattern in the Excel example in the previous chapter, where we made a counter incremented by one for each row, used it to find the region, and then used the technique described in "What Is Indexing?" on page 83 to select only the data we needed.

TABLE 9Afghanistan 10  11  10  15  40  37  –  –  –  –  90  74  75  74Albania 12  14  9  0  10  99  –  –  –  36  30  75  78  71Algeria 5 y 6 y 4 y 0  2  99 –  –  –  –  68  88  89  87Andorra –   –  –  –  –  100 v –  –  –  –  –  –  – –Angola 24 x 22 x 25 x –

Converting PDF to Text

In some cases, tables are buried in text, so you may have to skip lines to get to the data you need. Our ultimate goal is to get the data from the PDF in the same format as the Excel file output.

Autocompletion Using Tab

Parsing PDFs Using pdfminer

If country_line is True, print the line so we can see the country name. If country_line or total_line is set to True, output the line so we can see what data we have.

Arguments with Default Values Always Come Last

Fill the previous_line variable with the current line at the end of the for loop. If the previous_line is in the double_lined_countries list, this line joins the previous_line with the current line and assigns the combined lines to the line variable.
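The previous_line technique can be sketched as follows. The input lines are invented to mimic a country name that wraps onto two lines in the converted text, the list name double_lined_countries is taken from the surrounding text, and Python 3 syntax is an assumption.

```python
# Invented lines mimicking a name split across two lines in the converted PDF.
lines = ['Bolivia (Plurinational', 'State of)', 'Brazil']

# First halves of the names known to wrap onto a second line.
double_lined_countries = ['Bolivia (Plurinational']

countries = []
previous_line = ''
for line in lines:
    # If the previous line was the first half of a name, join the two halves.
    if previous_line in double_lined_countries:
        line = ' '.join([previous_line, line])
    # Only complete names are added to the list of countries.
    if line not in double_lined_countries:
        countries.append(line)
    # Remember the current line for the next iteration.
    previous_line = line

print(countries)
```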

Figure 5-2. Bolivia in the PDF

Automating Your Data Checking

If the line is not in the list of double_lined_countries, the next line adds it to the list of countries. Let's get back to the original task we were working on: creating our list of countries with double lines so we can make sure we collect both lines.

Learning How to Solve Problems

Exercise: Use Table Extraction, Try a Different Library

We can see that the addresses are included in the first three lists and are unordered. Let's run the numbers and see if we can lose bad data.

Exercise: Clean the Data Manually

We encourage you to keep an eye on them for your PDF table parsing needs. It's dangerous to rely on unsupported code, and we can't rely on pdftables to be around and working forever. However, part of belonging to the open source community is giving back, so we encourage you to find good projects and help by contributing to them and publicizing them in the hope that projects like pdftables remain open source and continue to grow and thrive.

Exercise: Try Another Tool

You will need to manipulate the data further to match the exact structure we had in our Excel import, but we were able to keep the data in our rows from the PDF. To see what else is out there, instead of searching for "parsing pdfs python", we tried searching for "extracting tables from pdf", which gave us more distinct solutions to the table problem (including a blog post reviewing several tools).

Figure 5-3. Select tables in Tabula

Uncommon File Types

If it is not possible to get the data in another format, you should try the following. In Chapters 7 and 8, we cover data cleansing, which will also help with the complexities of processing PDFs.

Table 5-1. New Python libraries and tools (columns: Library or tool, Purpose)

Acquiring and Storing Data

Not All Data Is Created Equal

Fact Checking

If you want to study a phenomenon in your local city, state, or country, you can usually find someone who has a useful data set over the phone or email. You can even use some of the tips you learn in this book (especially Chapter 14) to create scripts and update your data automatically.

Readability, Cleanliness, and Longevity

Once you've checked and verified your initial data set, you'll find it easier to script and determine the validity of the data in the future. Knowing how often an organization updates its data will help you determine your ability to use the data in the years to come.

Where to Find Data

Using a Telephone

What methods were used in data collection and how were they met‐. Don't be shy - your interest in their subject of study and work can be refreshing and flattering.

Dealing with a Communications Official

US Government Data

FOIA How-To

Government and Civic Open Data Worldwide

If you are interested in data from Asian countries, most of them run their own open data websites. Many non-EU Central Asian, Central European and Middle Eastern countries have their own government open data websites.

Organization and Non-Government Organization (NGO) Data

Education and University Data

Medical and Scientific Data

Crowdsourced Data and APIs

Outside of social web services, there are a number of sites where you can post your own questions and ideas and ask for a crowdsourced response. The amount of data available is huge, and it's no small task to sort through all the noise to get a good idea of what questions you can answer and how you should go about answering them.

Case Studies: Example Data Investigation

Whether you want to go to an expert forum related to the topic or post a survey and distribute it through your own channels is up to you. However, keep in mind that when using your own research questions and methods, you must account for any size and sampling errors that arise. For a more detailed introduction to writing your own survey, along with citations for further information, the University of Wisconsin Survey Guide can be a good starting point.

Ebola Crisis

Train Safety

Football Salaries

Child Labor

We've included them all in the repository for you to use and follow. Now that we've explored how to identify queries and find resources, let's look at storing our data.

Storing Your Data: When, Why, and How?

When searching for government data, we identified the US Department of Labor's annual reports on child labor around the globe. The ILO report appears to be relevant to many different data sets and should be a good reference for historical data on child labor.

Where Should I Store My Data?

So let's say your data set is different: a file from here, a report from there; some of it is easy to download and access, but other parts you may need to copy or scrape from the web. We'll cover how to clean and combine data sets in Chapters 7 and 9, but now let's talk about how we can store data in a common place.

Databases: A Brief Introduction

If you're going to use datasets from multiple computers, it's always a good idea to store them on a network or the Internet (hello cloud computing!), or have them on an external hard drive or USB stick. Keep this in mind when working with a team that may need access to data from different locations or computers.

Relational Databases: MySQL and PostgreSQL

If you are familiar with (or learning) MySQL and want to use a MySQL database, there are Python bindings to connect easily. If you are familiar with (or learning) PostgreSQL and want to use a PostgreSQL database, there are Python bindings for PostgreSQL as well.

Non-Relational Databases: NoSQL

While there are plenty to choose from, one of the most popular NoSQL database frameworks is MongoDB. To use MongoDB, you must first install the drivers and then use Python to connect.

Setting Up Your Local Database with Python

If you already have data in a non-relational database structure or are hoping to get the hang of it, it's very easy to connect NoSQL databases using Python. Having all your data stored in one place keeps your data organized and your research focused.
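As one possible illustration of keeping data in a single local place, the sketch below uses sqlite3 from the Python standard library. Choosing SQLite here is an assumption made so the example runs without installing anything; the table name, column names, and rows are placeholders.

```python
import sqlite3

# An in-memory database keeps this sketch self-contained;
# pass a filename instead to store the data on disk.
connection = sqlite3.connect(':memory:')

# Placeholder table describing where each data set came from.
connection.execute('CREATE TABLE sources (name TEXT, format TEXT)')
connection.executemany(
    'INSERT INTO sources VALUES (?, ?)',
    [('report_a', 'pdf'), ('table_b', 'xlsx')],
)
connection.commit()

# Query everything back out, sorted by name.
stored_sources = connection.execute(
    'SELECT name, format FROM sources ORDER BY name'
).fetchall()
connection.close()

print(stored_sources)
```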

When to Use a Simple File

To manage these needs, you can store your data on a shared network drive or in a cloud-based service (Dropbox, Box, Amazon, Google Drive). Using one of these services usually means you'll also have backup and management options.

Cloud-Storage and Python

Local Storage and Python

Alternative Data Storage

You've come through some of the biggest questions facing your project: How can I find useful data? You will use all these skills and more as you progress in future chapters.

Data Cleanup: Investigation, Matching, and Formatting

Why Clean Data?

Imagine if you have columns or fields in your dataset that need to be stored as a certain data type (like dates or numbers or email addresses). If you can standardize what you expect to see and clean up or remove data that doesn't fit, then you ensure the consistency of your data and eliminate the hard work later when you need to search for assets in your dataset.
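As a small sketch of standardizing a field to one data type, the code below keeps values that parse as dates and sets aside the ones that don't fit. The sample values and the expected YYYY-MM-DD format are assumptions made for the illustration.

```python
from datetime import datetime

# Made-up raw values for a column that should contain dates.
raw_dates = ['2014-03-01', '2014-13-45', 'unknown']

clean_dates = []
rejected_values = []
for value in raw_dates:
    try:
        # Values matching the expected format become real date objects.
        clean_dates.append(datetime.strptime(value, '%Y-%m-%d').date())
    except ValueError:
        # Anything that doesn't fit is set aside for review or removal.
        rejected_values.append(value)

print(clean_dates)
print(rejected_values)
```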

Data Cleanup Basics

As we explored in Chapter 6, it's much easier to store your data in the right models if it's cleaned up first. It's a great tool for some social science statistics, but it's not very useful for our Python needs.

Identifying Values for Data Cleanup

Okay, so we can clearly see here that we need to match the second row of data_rows to the first index of header_rows. The next line will store the indexes of the data rows that don't match our headers, so we can skip collecting that data.
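The matching step can be sketched roughly as below. The contents of header_rows and data_rows are invented; the only things taken from the text are the two variable names, the idea that the second row of data_rows lines up with the first item of each header row, and the idea of storing the indexes to skip. skip_index and the other names are hypothetical.

```python
# Invented data: each header row is [short code, readable title].
header_rows = [['HH1', 'Cluster number'], ['HH2', 'Household number']]

# Invented data: the second row holds the short codes for each column.
data_rows = [
    ['', 'cluster', 'household', 'unlabeled'],
    ['', 'HH1', 'HH2', 'XX9'],
    ['1', '1', '17', 'n/a'],
]

known_codes = [header_row[0] for header_row in header_rows]

# Store the indexes of columns whose code has no matching header.
skip_index = [
    index for index, code in enumerate(data_rows[1])
    if code not in known_codes
]

# Keep only the columns that can be matched to a header.
kept_codes = [
    code for index, code in enumerate(data_rows[1])
    if index not in skip_index
]
kept_values = [
    [value for index, value in enumerate(row) if index not in skip_index]
    for row in data_rows[2:]
]

print(skip_index, kept_codes, kept_values)
```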

Finding Outliers and Bad Data

As we talked about in Chapter 6, you want to make sure that your source is properly researched and that you can trust the data. For example, you can look through a data set and see if an important value is missing.

As with the code for counting NA responses, here we test whether the question is in the keys of the data type dictionary. If the query is not in the data type dictionary, this code adds it and stores a copy of start_dict as the value.
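The dictionary pattern described above might be sketched as follows. start_dict comes from the text; the category names, the responses, and the name datatypes are assumptions made for the illustration.

```python
# Template of counters; each question gets its own copy.
start_dict = {'digit': 0, 'empty': 0, 'other': 0}

# Invented (question, answer) pairs.
responses = [('HH1', '1'), ('HH1', ''), ('HH2', 'yes')]

datatypes = {}
for question, answer in responses:
    # Add the question with a fresh copy of the counters the first time we see it.
    if question not in datatypes.keys():
        datatypes[question] = start_dict.copy()

    if answer.isdigit():
        key = 'digit'
    elif answer == '':
        key = 'empty'
    else:
        key = 'other'
    datatypes[question][key] += 1

print(datatypes)
```

Storing a copy matters: if every question shared the same start_dict object, incrementing one question's counts would change them all.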

Finding Duplicates

An array's merge method combines the values of the first array and the second array. Passing return_index=True results in a tuple of arrays: the first is an array of unique values, and sec‐.

Fuzzy Matching

Here we use the fuzz module's token_set_ratio function, which uses the same token approach, but compares sets of the tokens to see intersections and differences. And unfortunately, we couldn't match our cheeseburger answer properly, even though it's the same.

RegEx Matching

This uses FuzzyWuzzy's extractOne method to return only the best match between our string and the list of available choices. The extractOne function just finds the best match and returns it along with its ratio as a tuple.
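The idea of returning the single best match together with its score can be imitated with the standard library. The sketch below is not FuzzyWuzzy: it swaps in difflib.SequenceMatcher, whose ratio runs from 0 to 1 rather than 0 to 100, and extract_best is a hypothetical helper written for this illustration. The choices are made up.

```python
from difflib import SequenceMatcher


def extract_best(query, choices):
    """Return (choice, ratio) for the choice most similar to query.

    A stand-in for the best-match idea, built on difflib rather than FuzzyWuzzy.
    """
    scored = [
        (choice, SequenceMatcher(None, query.lower(), choice.lower()).ratio())
        for choice in choices
    ]
    # Pick the pair with the highest similarity ratio.
    return max(scored, key=lambda pair: pair[1])


choices = ['Yes', 'No', 'Maybe', 'N/A']   # made-up answer options
best_match = extract_best('yes!', choices)

print(best_match)
```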

Figures

Figure 1-1. Data handling process
Figure 1-2. Terminal search using Spotlight
Figure 1-3. A newly opened Terminal window
