
Finding Duplicates


If you are using more than one dataset with the same survey data, or if you have used raw data that may have duplicate entries, removing duplicate data will be an important step in ensuring your data can be accurately used. If you have a dataset with unique identifiers, you can use those IDs to ensure you haven’t accidentally inserted or acquired duplicate data. If you do not have an indexed dataset, you might need to figure out a good way to identify each unique entry (such as creating an indexable key).
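For example, if each record carries a unique ID, a quick length comparison will flag duplicates. Here is a minimal sketch (using the set type introduced just below); records and its id field are hypothetical stand-ins for your own data:

records = [{'id': 1, 'name': 'a'},
           {'id': 2, 'name': 'b'},
           {'id': 2, 'name': 'c'}]   # made-up rows; note the repeated ID

ids = [row['id'] for row in records]
if len(ids) != len(set(ids)):
    print('duplicate IDs detected!')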

Python’s built-in library has some great ways to identify unique values. Let’s introduce some of the concepts here:

list_with_dupes = [1, 5, 6, 2, 5, 6, 8, 3, 8, 3, 3, 7, 9]

set_without_dupes = set(list_with_dupes)
print(set_without_dupes)

Your output should look something like this:

{1, 2, 3, 5, 6, 7, 8, 9}


What’s happening here? set and frozenset are built-in Python types that let us take an iterable object (like a list, a string, or a tuple) and create a set containing only its unique values.
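As a quick sketch of the difference between the two: a frozenset is simply an immutable set, and unlike a regular set it can even be used as a dictionary key.

frozen = frozenset([1, 5, 6, 2, 5])
print(frozen)                           # frozenset({1, 2, 5, 6})

lookup = {frozen: 'seen before'}        # immutable, so valid as a dict key
print(lookup[frozenset([2, 6, 5, 1])])  # equal contents find the same entry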

In order to use set and frozenset, the values need to be hashable.

With hashable types, applying a hash method to a value will always produce the same result. This means, for example, that we can trust a 3 is the same as every other 3 we see in code.
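A minimal sketch of both sides of that rule:

print(hash(3) == hash(3))     # True: equal values always hash the same

try:
    set([[1, 2], [3, 4]])     # lists are mutable, hence unhashable
except TypeError as error:
    print(error)              # unhashable type: 'list'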

Most Python objects are hashable; among the built-in types, it is the mutable containers (lists, dictionaries, and sets) that are not. We can create sets using set with any collection of hashable types (integers, floats, decimals, strings, tuples, etc.). The other neat thing about sets and frozensets is that they have some fast comparison properties. Let’s take a look at some examples:

first_set = set([1, 5, 6, 2, 6, 3, 6, 7, 3, 7, 9, 10, 321, 54, 654, 432])
second_set = set([4, 6, 7, 432, 6, 7, 4, 9, 0])

print(first_set.intersection(second_set))
print(first_set.union(second_set))
print(first_set.difference(second_set))
print(second_set - first_set)

print(6 in second_set)
print(0 in first_set)

The intersection method of a set returns the intersection between two sets (i.e., the elements held in common). A built-in Venn diagram!

The union method of a set combines the values of the first set and the second set.

The difference method shows the difference between the first set and the second set. Order of operations matters, as you’ll see in the next line.

Subtracting one set from another shows the difference between them. Changing the order of the operands changes the result (just like subtraction in math).

in tests set membership (with very fast performance).

Your output should look like this:

{432, 9, 6, 7}
{0, 1, 2, 3, 4, 5, 6, 7, 9, 10, 321, 432, 654, 54}
{1, 2, 3, 5, 321, 10, 654, 54}
{0, 4}
True
False

Sets have quite a lot of useful features for defining unique datasets and comparing sets. There are many times in data wrangling where we need to know the minimum and maximum of a series of values, or we need a union of unique keys. Sets can help us with those tasks.
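For instance, here is a quick sketch reusing first_set from above; the two key sets are hypothetical:

print(min(first_set), max(first_set))   # 1 654

# Union of unique keys across two (made-up) datasets
march_keys = {'1-17-1', '1-17-2', '1-20-1'}
april_keys = {'1-17-2', '1-21-1'}
print(march_keys | april_keys)          # all four distinct keys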

Aside from sets, Python has several other libraries with easy ways to test uniqueness.

One library you can use for uniqueness is numpy, a powerful mathematics library for Python with scientific and statistical methods and classes. numpy has superior array, numerical, and mathematical capabilities compared to the core Python libraries. It also has a great method called unique for use with a numpy array. You can install numpy like so:

pip install numpy

Let’s take a look at how numpy’s unique works:

import numpy as np

list_with_dupes = [1, 5, 6, 2, 5, 6, 8, 3, 8, 3, 3, 7, 9]
print(np.unique(list_with_dupes, return_index=True))

array_with_dupes = np.array([[1, 5, 7, 3, 9, 11, 23],
                             [2, 4, 6, 8, 2, 8, 4]])
print(np.unique(array_with_dupes))

numpy’s unique method keeps track of the indexes. Passing return_index=True results in a tuple of arrays: the first is an array of the unique values, and the second is a flattened array of the indexes; only the first occurrence of every number will be present.

To show more numpy, this line creates a two-dimensional numpy array (an array of equally sized arrays).

unique returns the sorted unique values of the whole two-dimensional array.

Your output will look like this:

(array([1, 2, 3, 5, 6, 7, 8, 9]), array([ 0,  3,  7,  1,  2, 11,  6, 12]))
[ 1  2  3  4  5  6  7  8  9 11 23]

If you don’t have unique keys, you can write a function to create a unique set. It can be as simple as using a list comprehension. Let’s try one using Python’s sets for our dataset. First, we take a look at which fields in our dataset might serve as unique identifiers:

for x in enumerate(zipped_data[0]):
    print(x)

The output begins like this:

(0, (['HH1', 'Cluster number', ''], '1'))
(1, (['HH2', 'Household number', ''], '17'))
(2, (['LN', 'Line number', ''], '1'))
(3, (['MWM1', 'Cluster number', ''], '1'))
(4, (['MWM2', 'Household number', ''], '17'))
(5, (['MWM4', "Man's line number", ''], '1'))
...

We see the first few elements of each row hold some presumably unique identifiers.

Assuming we understand the data properly, the cluster, household, and man’s line numbers should create a unique combination. It’s possible the line numbers are unique as well. Let’s see if we are correct:

set_of_lines = set([x[2][1] for x in zipped_data])
uniques = [x for x in zipped_data if not set_of_lines.remove(x[2][1])]
print(set_of_lines)

First, we make a set containing the line numbers of the survey. The line-number entry is the third element of each response (x[2]), and its value is the second element of that pair (x[2][1]). We use a list comprehension to speed up our code.

set_of_lines now holds the unique keys. We can use the set object’s remove method to see if we have more than one of each of those keys in the dataset. If the line number is unique, it will remove each key only once. If we have duplicates, remove will throw a KeyError to let us know that key is no longer in our set.
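As a standalone illustration of that remove behavior:

s = {1, 2, 3}
s.remove(2)    # succeeds the first time
s.remove(2)    # raises KeyError: 2, since 2 is already gone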

Hmm. We did see an error when we ran the code, so we were wrong in our assumption that the line numbers were unique. If we take a closer look at the set we made, it looks like the line numbers go from 1–16 and then repeat.

You’ll often have to work with messy datasets, or datasets similar to this one with no obvious unique key. Our suggestion in times like these is to determine a good way to find a unique key and then use that as a comparison.

We have numerous options for creating a unique key. We could use the start time of the interview. However, we aren’t sure whether UNICEF deploys many survey teams at once; if so, we could end up removing rows marked as duplicates that are actually distinct interviews. We could use the birth date of the man combined with the time of the interview, as that combination is unlikely to repeat, but any missing fields could make that key unreliable.

One nice solution is to see if a combination of the cluster, household, and line numbers creates a unique key. If so, then we could use this method across datasets, even ones without a start and end time. Let’s give it a try!

set_of_keys = set([
    '%s-%s-%s' % (x[0][1], x[1][1], x[2][1]) for x in zipped_data])

uniques = [x for x in zipped_data if not set_of_keys.remove(
    '%s-%s-%s' % (x[0][1], x[1][1], x[2][1]))]

print(len(uniques))

Makes a string out of the three parts we think are unique: the cluster number, the household number, and the line number. We are separating each with a - so we can differentiate between the three values.

Re-creates the unique key we used, and uses the set’s remove method. This will remove those entries one by one, and the uniques list will hold every unique row. If there is a duplicate entry, our code will again raise a KeyError.

Evaluates the length of our list of unique rows. This shows us how many unique entries we have in our dataset.

Super! This time we have no errors. We can see by the length of our list that each row is a unique entry. This is what we would expect from a processed dataset, as UNICEF does some data cleanup before publishing and ensures there are no duplicates. If we were combining this data with other UNICEF data, we might add an M to our key because it’s the men’s group survey. We could then cross-reference households that carry the same numbers.

Depending on your data, the unique key might not be obvious. Birth dates and addresses might be a good combination. The chances there are two 24-year-old women at the same address with the exact same date of birth are slim, although not out of the question if they are twins who live together!
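If you would rather count duplicates than trigger a KeyError, Python’s collections.Counter offers a gentler alternative to the remove trick above. Here is a minimal sketch; make_key is a hypothetical helper, and it assumes zipped_data is loaded as in the earlier examples:

from collections import Counter

def make_key(row):
    # Hypothetical helper: joins the cluster, household, and line
    # numbers with dashes, just like the key built above
    return '%s-%s-%s' % (row[0][1], row[1][1], row[2][1])

key_counts = Counter(make_key(row) for row in zipped_data)
duplicates = {key: count for key, count in key_counts.items() if count > 1}
print(duplicates)    # an empty dict means every key is unique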

Moving on from duplicates, we’ll now take a look at fuzzy matching, a great way to find duplicates in particularly noisy datasets.
