
Fuzzy Matching

From the book Data Wrangling with Python (pages 195-199)

duplicates. We could use the birth date of the man combined with the time of the interview, as that's unlikely to have matches, but if we had any missing fields, that could be problematic.

One nice solution is to see if a combination of the cluster, household, and line numbers creates a unique key. If so, then we could use this method across datasets, even ones without a start and end time. Let's give it a try!

set_of_keys = set([
    '%s-%s-%s' % (x[0][1], x[1][1], x[2][1]) for x in zipped_data])

uniques = [x for x in zipped_data if not set_of_keys.remove(
    '%s-%s-%s' % (x[0][1], x[1][1], x[2][1]))]

print(len(uniques))

Makes a string out of the three parts we think are unique: the cluster number, the household number, and the line number. We are separating each with a - so we can differentiate between the three values.

Re-creates the unique key we used, and uses the remove feature. This will remove those entries one by one, and the uniques list will hold every unique row. If there is a duplicate entry, our code will again throw an error.

Evaluates the length of our list. This can show us how many unique entries we have in our dataset.

Super! This time we have no errors. We can see by the length of our list that each row is a unique entry. This is what we would expect from a processed dataset, as UNICEF does some data cleanup before publishing and ensures there are no duplicates. If we were combining this data with other UNICEF data, we might add an M to our key because it's the men's group survey. We could then cross-reference households that carry the same numbers.

Depending on your data, the unique key might not be obvious. Birth dates and addresses might be a good combination. The chances there are two 24-year-old women at the same address with the exact same date of birth are slim, although not out of the question if they are twins who live together!
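If you'd rather count duplicates than rely on remove raising an error, the same composite-key idea works with the standard library's Counter. A minimal sketch, assuming some made-up (cluster, household, line) triples in place of the real survey rows:

```python
from collections import Counter

# Hypothetical (cluster, household, line) triples standing in for real rows
rows = [('12', '3', '1'), ('12', '3', '2'), ('15', '1', '1'), ('12', '3', '2')]

# Build the same cluster-household-line composite key for every row
keys = ['%s-%s-%s' % row for row in rows]

# Any key appearing more than once is a duplicate
duplicates = [key for key, count in Counter(keys).items() if count > 1]
print(duplicates)  # → ['12-3-2']
```

Unlike the remove trick, this reports every duplicated key instead of stopping at the first error.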

Moving on from duplicates, we’ll now take a look at fuzzy matching, a great way to find duplicates in particularly noisy datasets.

Although you might need natural language processing or machine learning to determine a match with big datasets on language, fuzzy matching can help us relate "My dog & I" and "me and my dog" as having similar meaning.

There are many ways to go about fuzzy matching. One Python library, FuzzyWuzzy, developed by SeatGeek, uses some pretty cool methods internally to match tickets being sold online for different events. You can install it by using:

pip install fuzzywuzzy

So let’s say you’re dealing with some unclean data. Maybe it was input sloppily or is user-entered and contains misspellings and small syntactic errors or deviations. How might you account for that?

from fuzzywuzzy import fuzz

my_records = [
    {'favorite_book': 'Grapes of Wrath',
     'favorite_movie': 'Free Willie',
     'favorite_show': 'Two Broke Girls'},
    {'favorite_book': 'The Grapes of Wrath',
     'favorite_movie': 'Free Willy',
     'favorite_show': '2 Broke Girls'},
]

print(fuzz.ratio(my_records[0].get('favorite_book'),
                 my_records[1].get('favorite_book')))
print(fuzz.ratio(my_records[0].get('favorite_movie'),
                 my_records[1].get('favorite_movie')))
print(fuzz.ratio(my_records[0].get('favorite_show'),
                 my_records[1].get('favorite_show')))

Here we use the fuzz module's ratio function, which expects two strings to compare. It returns the similarity of the sequencing of the strings (a value between 0 and 100).

We can see from our own understanding of popular culture and English that these two entries have the same favorites; however, they have spelled them differently.

FuzzyWuzzy helps us counter these unintentional mistakes. We can see our matches using ratio scored pretty high. This gives us some level of confidence the strings are similar.

Let’s try another FuzzyWuzzy method and see our results. We’ll use the same data for the sake of simplicity and comparison:

print(fuzz.partial_ratio(my_records[0].get('favorite_book'),
                         my_records[1].get('favorite_book')))
print(fuzz.partial_ratio(my_records[0].get('favorite_movie'),
                         my_records[1].get('favorite_movie')))
print(fuzz.partial_ratio(my_records[0].get('favorite_show'),
                         my_records[1].get('favorite_show')))

Here we call the fuzz module's partial_ratio function, which expects two strings to compare. It returns the similarity of the sequencing of the closest matching substrings (a value between 0 and 100).

Wow, we can see we are getting much higher numbers! The partial_ratio function allows us to compare substrings, which means we don’t need to worry if someone has forgotten a word (like in our book example) or used different punctuation. This means a closer match for all of our strings.

If your data has some simple inconsistencies, these are some great functions to help find the mismatches. But if your data has large differences in meaning with only a few characters' difference, you might want to test both similarity and difference. For example, "does" and "doesn't" are quite different in meaning but not very different in spelling. In the first ratio example, these two strings wouldn't score highly, but with the substring approach, we would have a match. Knowledge of your data and the complexities therein is a must!
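To see how the two scores diverge on exactly this kind of pair, here is a rough sketch of both ideas built only on the standard library's difflib (FuzzyWuzzy's default backend); simple_ratio and simple_partial_ratio are hypothetical helpers, not the library's actual implementations:

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    # Whole-string similarity scaled to 0-100, like fuzz.ratio
    return int(round(SequenceMatcher(None, a, b).ratio() * 100))

def simple_partial_ratio(a, b):
    # Slide the shorter string across the longer one and keep the best
    # window score, mimicking the idea behind fuzz.partial_ratio
    shorter, longer = sorted((a, b), key=len)
    best = max(
        SequenceMatcher(None, shorter, longer[i:i + len(shorter)]).ratio()
        for i in range(len(longer) - len(shorter) + 1))
    return int(round(best * 100))

print(simple_ratio('does', "doesn't"))          # → 73
print(simple_partial_ratio('does', "doesn't"))  # → 100: 'does' is a substring
```

The substring score declares these opposites a perfect match, which is exactly the trap the warning above describes.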

FuzzyWuzzy also has some other cool options. Let’s explore some of them, as they might pertain to your data cleanup needs:

from fuzzywuzzy import fuzz

my_records = [
    {'favorite_food': 'cheeseburgers with bacon',
     'favorite_drink': 'wine, beer, and tequila',
     'favorite_dessert': 'cheese or cake'},
    {'favorite_food': 'burgers with cheese and bacon',
     'favorite_drink': 'beer, wine, and tequila',
     'favorite_dessert': 'cheese cake'},
]

print(fuzz.token_sort_ratio(my_records[0].get('favorite_food'),
                            my_records[1].get('favorite_food')))
print(fuzz.token_sort_ratio(my_records[0].get('favorite_drink'),
                            my_records[1].get('favorite_drink')))
print(fuzz.token_sort_ratio(my_records[0].get('favorite_dessert'),
                            my_records[1].get('favorite_dessert')))


Here we call the fuzz module's token_sort_ratio function, which allows us to match strings despite word order. This is great for free-form survey data, where "I like dogs and cats" and "I like cats and dogs" mean the same thing. Each string is first sorted and then compared, so if they contain the same words in a different order, they will match.

From our output, we can see using tokens (here, words) gives us a pretty good chance at matching word order differences. Here we see the favorite drink options are the same, just in different orders. We can use this when the order of tokens doesn’t change the meaning. For SeatGeek, “Pittsburgh Steelers vs. New England Patriots” is the same as “New England Patriots vs. Pittsburgh Steelers” (with the exception of home field advantage).
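The sorting trick itself takes only a few lines of plain Python to sketch; normalize_tokens here is a hypothetical helper, not part of FuzzyWuzzy:

```python
def normalize_tokens(s):
    # Lowercase, split into words, sort, and rejoin, so word order
    # no longer matters when two strings are compared
    return ' '.join(sorted(s.lower().split()))

a = 'Pittsburgh Steelers vs. New England Patriots'
b = 'New England Patriots vs. Pittsburgh Steelers'
print(normalize_tokens(a) == normalize_tokens(b))  # → True
```

token_sort_ratio applies the ordinary ratio after this kind of normalization, so near-matches (not just exact reorderings) still score well.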

Let’s take a look at another token-oriented function from FuzzyWuzzy, using our same data:

print(fuzz.token_set_ratio(my_records[0].get('favorite_food'),
                           my_records[1].get('favorite_food')))
print(fuzz.token_set_ratio(my_records[0].get('favorite_drink'),
                           my_records[1].get('favorite_drink')))
print(fuzz.token_set_ratio(my_records[0].get('favorite_dessert'),
                           my_records[1].get('favorite_dessert')))

Here we use the fuzz module's token_set_ratio function, which uses the same token approach but compares sets of the tokens to see intersection and difference. The function attempts to find the best possible match of sorted tokens and returns the ratio of similarity for those tokens.
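A rough sketch of that set-based comparison, again using only the standard library's difflib; simple_token_set_ratio is a hypothetical, simplified stand-in for the real function:

```python
from difflib import SequenceMatcher

def simple_token_set_ratio(a, b):
    # Split both strings into word sets and pull out the shared words
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    common = ' '.join(sorted(tokens_a & tokens_b))
    # Rebuild each string as "shared words + its own leftover words"
    combo_a = (common + ' ' + ' '.join(sorted(tokens_a - tokens_b))).strip()
    combo_b = (common + ' ' + ' '.join(sorted(tokens_b - tokens_a))).strip()
    # Score every pairing and keep the best, as token_set_ratio does
    pairs = [(common, combo_a), (common, combo_b), (combo_a, combo_b)]
    best = max(SequenceMatcher(None, x, y).ratio() for x, y in pairs)
    return int(round(best * 100))

print(simple_token_set_ratio('cheese or cake', 'cheese cake'))  # → 100
```

Because the shared tokens alone can exactly equal one of the rebuilt strings, two answers that merely share words can score a perfect 100.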

Here we can see an unintended side effect if we are not aware of similarities and differences in our dataset. One answer was improperly spelled. We know cheesecake and cheese are not the same thing, but using the token set approach, these resulted in a false positive. And sadly, we could not properly match our cheeseburger answer, even though it's the same. Can you do so using another method we've already learned?

One final matching method FuzzyWuzzy provides is the process module. This is great if you have a limited number of choices or options and messy data. Let's say you know the answer has to be yes, no, maybe, or decline to comment. Let's take a look at how we can match these up:

from fuzzywuzzy import process

choices = ['Yes', 'No', 'Maybe', 'N/A']

print(process.extract('ya', choices, limit=2))
print(process.extractOne('ya', choices))

print(process.extract('nope', choices, limit=2))
print(process.extractOne('nope', choices))

Uses FuzzyWuzzy’s extract method to compare strings to the list of possible matches. The function returns two possible matches from the list of choices we have declared in our choices variable.

Uses FuzzyWuzzy’s extractOne method to return only the best match between our string and the list of available choices.

Aha! Given a variety of words we know "mean" the same thing, process can extract the best guess, and in these cases, the correct guess. With extract we get tuples with the ratios returned, and our code can parse through the string responses and compare how similar or different the matches are. The extractOne function just finds the best match and returns it along with its ratio as a tuple. Depending on your needs, you might opt for extractOne to help you simply find the best match and move along.
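To make the shape of those return values concrete, here is a hypothetical miniature version of extract built on the standard library's difflib; the real process.extract uses a weighted combination of the fuzz ratios, so its actual numbers will differ:

```python
from difflib import SequenceMatcher

def simple_extract(query, choices, limit=2):
    # Score every choice against the query (case-insensitively) and
    # return the top `limit` (choice, score) tuples, best first
    scored = [
        (choice, int(round(
            SequenceMatcher(None, query.lower(), choice.lower()).ratio() * 100)))
        for choice in choices]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:limit]

choices = ['Yes', 'No', 'Maybe', 'N/A']
print(simple_extract('nope', choices))         # 'No' scores highest
print(simple_extract('ya', choices, limit=1))  # best single guess
```

Even this crude scorer picks 'No' for "nope", which shows why restricting messy answers to a known list of choices works so well.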

You’ve learned all about matching strings now, so let’s talk a little about how to write some similar string matching functions on your own.
