Date Functions in Aggregate Summaries and Window Functions

In this section, we’ll explore a few ways that you can use date functions when summarizing data.

Figure 8.8

Let’s say we wanted to get a profile of each farmer’s market customer’s habits over time. So, we’ll want to group the results at the customer level and include some date-related summary information in the output. Our database isn’t very heavily populated with example purchases over a long time period yet, but we can use the sample data to demonstrate these concepts.

First, let’s get each customer’s purchase detail records, particularly the dates on which each customer made purchases. We’ll start by querying the database for the records for customer_id 1:

SELECT customer_id, market_date
FROM farmers_market.customer_purchases
WHERE customer_id = 1

Figure 8.9 shows all of the purchases made by customer 1 over time. Let’s summarize this data and get their earliest purchase date, latest purchase date, and number of different days on which they made a purchase.

We’ll GROUP BY customer_id, use MIN and MAX to get the lowest (earliest) and highest (latest) purchase dates, and COUNT DISTINCT to determine on how many different dates they made purchases:

SELECT customer_id,
    MIN(market_date) AS first_purchase,
    MAX(market_date) AS last_purchase,
    COUNT(DISTINCT market_date) AS count_of_purchase_dates
FROM farmers_market.customer_purchases
WHERE customer_id = 1
GROUP BY customer_id

Figure 8.10 shows the output of this query.
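If you don’t have the Farmer’s Market database at hand, the same summary can be tried out with a minimal sketch in Python’s built-in sqlite3 module. The rows below are hypothetical sample data (not the book’s), the table is created without the farmers_market schema prefix, and SQLite stands in for MySQL, so treat this as an illustration rather than the exact environment used in this chapter.

```python
import sqlite3

# Hypothetical sample rows (customer_id, market_date); not the book's data.
rows = [
    (1, "2019-03-02"),
    (1, "2019-03-02"),  # two items bought on the same market date
    (1, "2019-03-09"),
    (1, "2019-03-20"),
    (2, "2019-03-09"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases (customer_id INTEGER, market_date TEXT)"
)
conn.executemany("INSERT INTO customer_purchases VALUES (?, ?)", rows)

# ISO-formatted date strings sort chronologically, so MIN and MAX
# return the earliest and latest dates even though they are stored as text.
summary = conn.execute(
    """
    SELECT customer_id,
           MIN(market_date) AS first_purchase,
           MAX(market_date) AS last_purchase,
           COUNT(DISTINCT market_date) AS count_of_purchase_dates
    FROM customer_purchases
    WHERE customer_id = 1
    GROUP BY customer_id
    """
).fetchall()

print(summary)
```

Customer 1 has four purchase rows in this sample but only three distinct market dates, which is why COUNT(DISTINCT market_date) is used instead of COUNT(*).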

Figure 8.9

Figure 8.10

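We can also get the number of days between each customer’s first and last purchase by passing the MAX and MIN market_date values to the DATEDIFF function. I’ll also remove the customer filter here, so we can see the results for all customers in Figure 8.11: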

SELECT customer_id,
    MIN(market_date) AS first_purchase,
    MAX(market_date) AS last_purchase,
    COUNT(DISTINCT market_date) AS count_of_purchase_dates,
    DATEDIFF(MAX(market_date), MIN(market_date)) AS days_between_first_last_purchase
FROM farmers_market.customer_purchases
GROUP BY customer_id

If we also wanted to know how long it’s been since the customer last made a purchase, we could use the CURDATE() function (which may be called CURRENT_DATE, TODAY(), SYSDATE, or GETDATE() in your particular database system’s SQL syntax; check the documentation). The following query demonstrates its usage. CURDATE() can be used to represent the current system date in any calculation that requires a date or datetime parameter. Keep in mind that the server’s current time might differ from your local time, depending on what time zone it is set to:

SELECT customer_id,
    MIN(market_date) AS first_purchase,
    MAX(market_date) AS last_purchase,
    COUNT(DISTINCT market_date) AS count_of_purchase_dates,
    DATEDIFF(MAX(market_date), MIN(market_date)) AS days_between_first_last_purchase,
    DATEDIFF(CURDATE(), MAX(market_date)) AS days_since_last_purchase
FROM farmers_market.customer_purchases
GROUP BY customer_id
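Here is a small sketch of the two date differences, again using Python’s sqlite3 module with hypothetical sample rows. SQLite has no DATEDIFF or CURDATE, so this version subtracts julianday() values instead, and it passes a fixed date as a parameter named today (an assumption made here so the output is reproducible) rather than reading the system clock.

```python
import sqlite3

# Hypothetical sample rows (customer_id, market_date); not the book's data.
rows = [
    (1, "2019-03-02"),
    (1, "2019-03-09"),
    (1, "2019-03-20"),
    (2, "2019-03-09"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases (customer_id INTEGER, market_date TEXT)"
)
conn.executemany("INSERT INTO customer_purchases VALUES (?, ?)", rows)

# julianday(a) - julianday(b) plays the role of MySQL's DATEDIFF(a, b).
# The :today parameter stands in for CURDATE() so the result does not
# depend on when the script is run.
date_gaps = conn.execute(
    """
    SELECT customer_id,
           CAST(julianday(MAX(market_date)) - julianday(MIN(market_date))
                AS INTEGER) AS days_between_first_last_purchase,
           CAST(julianday(:today) - julianday(MAX(market_date))
                AS INTEGER) AS days_since_last_purchase
    FROM customer_purchases
    GROUP BY customer_id
    ORDER BY customer_id
    """,
    {"today": "2019-03-31"},
).fetchall()

print(date_gaps)
```

A customer with a single purchase date, like customer 2 in this sample, gets 0 days between first and last purchase, because MIN and MAX return the same date.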

Figure 8.11

Going back to the window functions covered in Chapter 7, “Window Functions Frequently Used by Data Scientists,” we can also write a query that gives us the days between each purchase a customer makes. Let’s go back to customer 1’s detailed purchases (previously shown in Figure 8.9) and use both the RANK and LEAD window functions to retrieve each purchase date, along with the next purchase date, so that we have both values on each row and can calculate the time between them:

SELECT customer_id, market_date,
    RANK() OVER (PARTITION BY customer_id ORDER BY market_date) AS purchase_number,
    LEAD(market_date,1) OVER (PARTITION BY customer_id ORDER BY market_date) AS next_purchase
FROM farmers_market.customer_purchases
WHERE customer_id = 1

The results of this query are shown in Figure 8.12.

You can see that we didn’t quite accomplish the goal of retrieving each purchase date and the next purchase date in order to show the time between them, because there are multiple rows with the same date in cases where the customer purchased multiple items on the same date. We can resolve this in a few ways.

One approach is to remove the duplicates by using the DISTINCT keyword, and then use a WHERE clause filter to remove rows where the two dates (current and next purchase) are the same (because multiple purchases were made on the same date).

Another is to remove duplicates in the initial dataset and use a subquery (a query inside a query) to get the date differences. Doing this and moving the window functions to the outer query will also fix the issue of the RANK counting each purchase, when we really want to count each purchase date.

This is what that second approach looks like:

SELECT
    x.customer_id, x.market_date,
    RANK() OVER (PARTITION BY x.customer_id ORDER BY x.market_date) AS purchase_number,
    LEAD(x.market_date,1) OVER (PARTITION BY x.customer_id ORDER BY x.market_date) AS next_purchase
FROM
(
    SELECT DISTINCT customer_id, market_date
    FROM farmers_market.customer_purchases
    WHERE customer_id = 1
) x

Figure 8.12

We can now add a line to the query to use that next_purchase date in a DATEDIFF calculation:

SELECT
    x.customer_id, x.market_date,
    RANK() OVER (PARTITION BY x.customer_id ORDER BY x.market_date) AS purchase_number,
    LEAD(x.market_date,1) OVER (PARTITION BY x.customer_id ORDER BY x.market_date) AS next_purchase,
    DATEDIFF(
        LEAD(x.market_date,1) OVER
            (PARTITION BY x.customer_id ORDER BY x.market_date),
        x.market_date
    ) AS days_between_purchases
FROM
(
    SELECT DISTINCT customer_id, market_date
    FROM farmers_market.customer_purchases
    WHERE customer_id = 1
) x

This may look confusing, but we used the same exact LEAD function inside the DATEDIFF as we used in the next_purchase field above it, and the second DATEDIFF parameter is just market_date, so we are calculating the days between the current row’s market_date and next_purchase columns. We can’t just insert the next_purchase column name into the query there; we have to calculate it for the days_between_purchases field as well, because the calculations don’t happen sequentially and are at the same level (the outer query).

The results of the preceding query are shown in Figure 8.13. You might notice that the final days_between_purchases value is NULL. That’s because that row’s next_purchase date is NULL, since there are no more purchases for customer 1 after March 20, 2019.
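To see the NULL in the last row for yourself, here is a minimal sketch using Python’s sqlite3 module and hypothetical sample rows. It assumes the SQLite library bundled with your Python is version 3.25 or later (the first release with window functions), and it replaces MySQL’s DATEDIFF with a julianday() subtraction, since SQLite has no DATEDIFF.

```python
import sqlite3

# Hypothetical sample rows (customer_id, market_date); not the book's data.
rows = [
    (1, "2019-03-02"),
    (1, "2019-03-02"),  # duplicate date: two items bought on one day
    (1, "2019-03-09"),
    (1, "2019-03-20"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases (customer_id INTEGER, market_date TEXT)"
)
conn.executemany("INSERT INTO customer_purchases VALUES (?, ?)", rows)

# The inner query removes duplicate dates; the outer query numbers the
# purchase dates and looks ahead one row with LEAD. The LEAD expression
# is repeated inside the date arithmetic because next_purchase cannot be
# referenced by name at the same query level.
purchase_gaps = conn.execute(
    """
    SELECT x.customer_id, x.market_date,
           RANK() OVER (PARTITION BY x.customer_id
                        ORDER BY x.market_date) AS purchase_number,
           LEAD(x.market_date, 1) OVER (PARTITION BY x.customer_id
                                        ORDER BY x.market_date) AS next_purchase,
           CAST(julianday(LEAD(x.market_date, 1) OVER
                              (PARTITION BY x.customer_id
                               ORDER BY x.market_date))
                - julianday(x.market_date) AS INTEGER) AS days_between_purchases
    FROM (
        SELECT DISTINCT customer_id, market_date
        FROM customer_purchases
        WHERE customer_id = 1
    ) x
    ORDER BY x.market_date
    """
).fetchall()

for row in purchase_gaps:
    print(row)
```

The last row comes back with None (Python’s representation of NULL) in both next_purchase and days_between_purchases, because there is no later row for LEAD to read.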

Figure 8.13

If we wanted to use the next_purchase field name inside the DATEDIFF() function to avoid inserting that LEAD() calculation twice, we could use another query layer and have a query of a query of a query, as shown in the following code. Here, we’ll remove the customer_id filter to return all customers, then filter to each customer’s first purchase by adding a filter on the calculated purchase_number. This query answers the question “How many days pass between each customer’s first and second purchase?” The results of this query are shown in Figure 8.14.

SELECT
    a.customer_id,
    a.market_date AS first_purchase,
    a.next_purchase AS second_purchase,
    DATEDIFF(a.next_purchase, a.market_date) AS time_between_1st_2nd_purchase
FROM
(
    SELECT
        x.customer_id, x.market_date,
        RANK() OVER (PARTITION BY x.customer_id ORDER BY x.market_date) AS purchase_number,
        LEAD(x.market_date,1) OVER (PARTITION BY x.customer_id ORDER BY x.market_date) AS next_purchase
    FROM
    (
        SELECT DISTINCT customer_id, market_date
        FROM farmers_market.customer_purchases
    ) x
) a
WHERE a.purchase_number = 1

In Chapter 10, “Building Analytical Reports with SQL,” we will cover a concept called Common Table Expression, also known as a CTE or “WITH clause,” which offers another way to select from precalculated values instead of nesting subqueries.

Figure 8.14
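As a preview only, here is a rough sketch of how the first-to-second-purchase question might look with a WITH clause, run through Python’s sqlite3 module on hypothetical sample rows. The CTE names purchase_dates and numbered are invented for this illustration, SQLite 3.25 or later is assumed for the window functions, and julianday() subtraction again stands in for DATEDIFF.

```python
import sqlite3

# Hypothetical sample rows (customer_id, market_date); not the book's data.
rows = [
    (1, "2019-03-02"),
    (1, "2019-03-02"),
    (1, "2019-03-09"),
    (1, "2019-03-20"),
    (2, "2019-03-09"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases (customer_id INTEGER, market_date TEXT)"
)
conn.executemany("INSERT INTO customer_purchases VALUES (?, ?)", rows)

# Each CTE plays the role of one subquery layer:
#   purchase_dates -> one row per customer per market date
#   numbered       -> adds purchase_number and next_purchase
# The final SELECT can then refer to next_purchase by name.
first_to_second = conn.execute(
    """
    WITH purchase_dates AS (
        SELECT DISTINCT customer_id, market_date
        FROM customer_purchases
    ),
    numbered AS (
        SELECT customer_id, market_date,
               RANK() OVER (PARTITION BY customer_id
                            ORDER BY market_date) AS purchase_number,
               LEAD(market_date, 1) OVER (PARTITION BY customer_id
                                          ORDER BY market_date) AS next_purchase
        FROM purchase_dates
    )
    SELECT customer_id,
           market_date AS first_purchase,
           next_purchase AS second_purchase,
           CAST(julianday(next_purchase) - julianday(market_date)
                AS INTEGER) AS time_between_1st_2nd_purchase
    FROM numbered
    WHERE purchase_number = 1
    ORDER BY customer_id
    """
).fetchall()

for row in first_to_second:
    print(row)
```

Customer 2 has only one purchase date in this sample, so the second purchase and the day count are both None (NULL) for that row.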

Let’s return to customer 1’s purchase history (originally shown in Figure 8.9). Let’s say that today’s date is March 31, 2019, and the marketing director of the farmer’s market wants to give infrequent customers an incentive to return to the market in April.

The director asks you for a list of everyone who only made a purchase at one market event during the previous month, because they want to send an email to all of those customers with a coupon to receive a discount on a purchase made in April. How would you pull up that list?

Well, first we have to find everyone who made a purchase in the 31 days prior to March 31, 2019. Then, we need to filter that list to those who only made a purchase on a single market date during that time.

This query would retrieve a list of one row per market date per customer within that date range:

SELECT DISTINCT customer_id, market_date
FROM farmers_market.customer_purchases
WHERE DATEDIFF('2019-03-31', market_date) <= 31

Then, we could query the results of that query, count the distinct market_date values per customer during that time, and filter to those with exactly one market date, using the HAVING clause (which, remember, is like the WHERE clause, but evaluated after the GROUP BY aggregation):

SELECT x.customer_id,
    COUNT(DISTINCT x.market_date) AS market_count
FROM
(
    SELECT DISTINCT customer_id, market_date
    FROM farmers_market.customer_purchases
    WHERE DATEDIFF('2019-03-31', market_date) <= 31
) x
GROUP BY x.customer_id
HAVING COUNT(DISTINCT x.market_date) = 1

The results of this query are shown in Figure 8.15.

If we were actually fulfilling a report request, we would want to next join these results to the customer table to get the customer name and contact information, but here we have shown how to use date calculations to filter a list of customers by the actions they took.
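The filter-then-count pattern can be tried out with a small sketch in Python’s sqlite3 module on hypothetical sample rows. Two things here are assumptions that go beyond the query above: julianday() subtraction replaces DATEDIFF (which SQLite lacks), and the date filter uses BETWEEN 0 AND 31 so that any rows dated after the reference date, whose difference would be negative and would otherwise pass a plain <= 31 test, are left out.

```python
import sqlite3

# Hypothetical sample rows (customer_id, market_date); not the book's data.
rows = [
    (1, "2019-03-02"),
    (1, "2019-03-09"),
    (1, "2019-03-20"),  # customer 1: three market dates in range
    (2, "2019-03-09"),
    (2, "2019-03-09"),  # customer 2: two items, one market date in range
    (3, "2019-02-10"),  # more than 31 days before the reference date
    (3, "2019-04-06"),  # after the reference date
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases (customer_id INTEGER, market_date TEXT)"
)
conn.executemany("INSERT INTO customer_purchases VALUES (?, ?)", rows)

# Inner query: one row per customer per market date within the window.
# Outer query: keep customers with exactly one market date in that window.
one_visit_customers = conn.execute(
    """
    SELECT x.customer_id,
           COUNT(DISTINCT x.market_date) AS market_count
    FROM (
        SELECT DISTINCT customer_id, market_date
        FROM customer_purchases
        WHERE julianday(:as_of) - julianday(market_date) BETWEEN 0 AND 31
    ) x
    GROUP BY x.customer_id
    HAVING COUNT(DISTINCT x.market_date) = 1
    ORDER BY x.customer_id
    """,
    {"as_of": "2019-03-31"},
).fetchall()

print(one_visit_customers)
```

Only customer 2 qualifies in this sample: customer 1 visited on three dates, and neither of customer 3’s dates falls inside the 31-day window.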

Figure 8.15

Exercises

1. Get the customer_id, month, and year (in separate columns) of every purchase in the farmers_market.customer_purchases table.

2. Write a query that filters to purchases made in the past two weeks, returns the earliest market_date in that range as a field called sales_since_date, and a sum of the sales (quantity * cost_to_customer_per_qty) during that date range.

Your final answer should use the CURDATE() function, but if you want to test it out on the Farmer’s Market database, you can replace your CURDATE() with the value '2019-03-31' to get the report for the two weeks prior to March 31, 2019 (otherwise your query will not return any data, because none of the dates in the database will have occurred within two weeks of you writing the query).

3. In MySQL, there is a DAYNAME() function that returns the full name of the day of the week on which a date occurs. Query the Farmer’s Market database market_date_info table, return the market_date, the market_day, and your calculated day of the week name that each market_date occurred on. Create a calculated column using a CASE statement that indicates whether the recorded day in the database differs from your calculated day of the week. This is an example of a quality control query that could be used to check manually entered data for correctness.

Exploratory Data Analysis

Exploratory Data Analysis (EDA) is often discussed in a data science context as a first step in the predictive modeling process, when a data scientist explores what the data in a provided dataset looks like prior to using it to build a predictive model. The SQL we’ll be using in this chapter could be used at that point in the process, to explore an already-prepared dataset. But what if you don’t have a dataset to work with yet?

Here we’ll show examples that could occur even earlier in the data pipeline, as we explore raw data straight from the database tables (as opposed to an already-aggregated dataset, in which the raw data has been combined and transformed using SQL and is ready to be ingested into a model). If you are given access to a database for the first time, these are the types of queries you can run to familiarize yourself with the tables and data in it.

There are of course many ways to conduct EDA, including in a Jupyter notebook with Python code, in a Tableau workbook, or using SQL. (I regularly do all three in my job as a data scientist.) In the later EDA, once a dataset has been prepared, the focus is often on distributions of values, relationships between columns, and identifying correlations between input features and the target variable (column with values to be predicted by the model). Here, we will use the types of queries we’ve covered so far in this book to explore some tables in the Farmer’s Market database, as a demonstration of a real EDA focusing on familiarizing ourselves with the data in the database for the first time.