LAG and LEAD - A Beginner's Guide to Building Datasets for Analysis

every row. So, without the ORDER BY, the SUM is calculated across the entire partition instead of as a per-row running total, as shown in Figure 7.10. We also added the ROUND() function so that this final output displays the prices with two digits after the decimal point.
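The difference between the two forms of the window can be seen side by side in a small, self-contained sketch. It uses Python's built-in sqlite3 module with an in-memory table in place of the book's MySQL database; the sample rows are made up for illustration, and partitioning by customer_id is an assumption about the query being discussed.

```python
import sqlite3

# In-memory SQLite stand-in for the Farmer's Market database (assumption:
# SQLite 3.25+ so that window functions are available).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases ("
    "customer_id INTEGER, market_date TEXT, "
    "quantity REAL, cost_to_customer_per_qty REAL)"
)
# Hypothetical sample rows, not data from the book.
conn.executemany(
    "INSERT INTO customer_purchases VALUES (?, ?, ?, ?)",
    [
        (1, "2019-04-03", 2.0, 1.50),
        (1, "2019-04-06", 1.0, 4.25),
        (1, "2019-04-10", 3.0, 2.00),
        (2, "2019-04-03", 1.0, 5.00),
    ],
)

query = """
SELECT
    customer_id,
    market_date,
    -- With ORDER BY: a running total within each customer's partition
    ROUND(SUM(quantity * cost_to_customer_per_qty)
          OVER (PARTITION BY customer_id ORDER BY market_date), 2) AS running_total,
    -- Without ORDER BY: the same partition-wide total repeated on every row
    ROUND(SUM(quantity * cost_to_customer_per_qty)
          OVER (PARTITION BY customer_id), 2) AS partition_total
FROM customer_purchases
ORDER BY customer_id, market_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

In this sample, running_total grows row by row for each customer, while partition_total shows the customer's full total on every row.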

The recipient of a report like this, such as the manager of the farmer’s market, may want to filter these query results to a specific market date to determine which vendors are new or changing booths that day, so that those vendors can be contacted and setup goes smoothly. We will create this report by wrapping the query that contains the LAG function inside another query, which lets us filter the results to a single market_date and to vendors whose current booth_number is different from their previous_booth_number:

SELECT * FROM
(
    SELECT
        market_date,
        vendor_id,
        booth_number,
        LAG(booth_number, 1) OVER (PARTITION BY vendor_id ORDER BY market_date, vendor_id) AS previous_booth_number
    FROM farmers_market.vendor_booth_assignments
    ORDER BY market_date, vendor_id, booth_number
) x
WHERE x.market_date = '2019-04-10'
    AND (x.booth_number <> x.previous_booth_number OR x.previous_booth_number IS NULL)

If you look closely at Figure 7.11, you can see that for the April 10, 2019 market, vendor 1 and vendor 4 have swapped booths compared to the previous market date. This would be hard to spot from a printout of this output, but using the preceding query, we can return just the rows with booth changes on the specified date, as shown in Figure 7.12.
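To try the booth-change filter without the full database, the following sketch runs the same pattern against an in-memory SQLite table using Python's built-in sqlite3 module. SQLite stands in for MySQL here, the farmers_market schema prefix is dropped, and the booth assignments are hypothetical rows chosen so that vendors 1 and 4 swap booths and vendor 8 appears for the first time.

```python
import sqlite3

# In-memory SQLite stand-in (assumption: SQLite 3.25+ for window functions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_booth_assignments ("
    "vendor_id INTEGER, booth_number INTEGER, market_date TEXT)"
)
# Hypothetical assignments, not data from the book.
conn.executemany(
    "INSERT INTO vendor_booth_assignments VALUES (?, ?, ?)",
    [
        (1, 2, "2019-04-06"), (3, 11, "2019-04-06"), (4, 7, "2019-04-06"),
        (1, 7, "2019-04-10"), (3, 11, "2019-04-10"), (4, 2, "2019-04-10"),
        (8, 6, "2019-04-10"),  # vendor 8 has no earlier assignment
    ],
)

query = """
SELECT * FROM
(
    SELECT
        market_date,
        vendor_id,
        booth_number,
        LAG(booth_number, 1) OVER (PARTITION BY vendor_id ORDER BY market_date)
            AS previous_booth_number
    FROM vendor_booth_assignments
) x
WHERE x.market_date = '2019-04-10'
    AND (x.booth_number <> x.previous_booth_number
         OR x.previous_booth_number IS NULL)
ORDER BY x.vendor_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The date filter has to sit in the outer query: if it were applied inside the subquery, the earlier market dates would be removed before LAG ran, and every previous_booth_number would be NULL.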

Figure 7.11

To show another example use case, let’s say we want to find out if the total sales on each market date are higher or lower than they were on the previous market date. In this example, we are going to use the customer_purchases table from the Farmer’s Market database, and also add in a GROUP BY clause, which the previous examples did not include. The window functions are calculated after the grouping and aggregation occur.

First, we need to get the total sales per market date, using GROUP BY and a regular aggregate SUM. The results of the following query are shown in Figure 7.13:

SELECT
    market_date,
    SUM(quantity * cost_to_customer_per_qty) AS market_date_total_sales
FROM farmers_market.customer_purchases
GROUP BY market_date
ORDER BY market_date

Then, we can add the LAG() window function to output the previous market_date’s calculated sum on each row. We ORDER BY market_date in the window function to ensure it’s the previous market date we’re comparing to and not another date. You can see in Figure 7.14 that each row has a new total value (for that market date), as well as the previous market date’s total:

SELECT
    market_date,
    SUM(quantity * cost_to_customer_per_qty) AS market_date_total_sales,
    LAG(SUM(quantity * cost_to_customer_per_qty), 1) OVER (ORDER BY market_date) AS previous_market_date_total_sales
FROM farmers_market.customer_purchases
GROUP BY market_date
ORDER BY market_date

Figure 7.12

Figure 7.13
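The grouped query with LAG can be tried on a few rows with the sketch below. It assumes an in-memory SQLite table (via Python's built-in sqlite3 module) in place of the MySQL database, and the purchases are hypothetical values chosen so the daily totals are easy to check by hand.

```python
import sqlite3

# In-memory SQLite stand-in (assumption: SQLite 3.25+ for window functions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases ("
    "market_date TEXT, quantity REAL, cost_to_customer_per_qty REAL)"
)
# Hypothetical purchases: daily totals are 8.00, 10.25, and 6.00.
conn.executemany(
    "INSERT INTO customer_purchases VALUES (?, ?, ?)",
    [
        ("2019-04-03", 2.0, 1.50), ("2019-04-03", 1.0, 5.00),
        ("2019-04-06", 1.0, 4.25), ("2019-04-06", 2.0, 3.00),
        ("2019-04-10", 3.0, 2.00),
    ],
)

query = """
SELECT
    market_date,
    SUM(quantity * cost_to_customer_per_qty) AS market_date_total_sales,
    -- LAG sees one row per market_date because it runs after GROUP BY
    LAG(SUM(quantity * cost_to_customer_per_qty), 1)
        OVER (ORDER BY market_date) AS previous_market_date_total_sales
FROM customer_purchases
GROUP BY market_date
ORDER BY market_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The first market date has no earlier row to look back to, so its previous total is NULL (None in Python).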

LEAD works the same way as LAG, but it gets the value from the next row instead of the previous row (assuming the offset integer is 1). You can set the offset integer to any value x to count x rows forward, following the sort order specified in the ORDER BY section of the window function. If the rows are sorted by a time value, LAG would be retrieving data from the past, and LEAD would be retrieving data from the future (relative to the current row). These values can also now be used in calculations; for example, to determine the change in sales week to week.
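As a rough illustration of LAG and LEAD side by side, and of using a lagged value in a calculation, the sketch below computes each market date's total in a subquery and then looks one row back and one row forward. It assumes an in-memory SQLite table (Python's built-in sqlite3 module) and hypothetical purchase rows; the column aliases are illustrative names, not ones from the book.

```python
import sqlite3

# In-memory SQLite stand-in (assumption: SQLite 3.25+ for window functions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases ("
    "market_date TEXT, quantity REAL, cost_to_customer_per_qty REAL)"
)
# Hypothetical purchases: daily totals are 8.00, 10.25, and 6.00.
conn.executemany(
    "INSERT INTO customer_purchases VALUES (?, ?, ?)",
    [
        ("2019-04-03", 2.0, 1.50), ("2019-04-03", 1.0, 5.00),
        ("2019-04-06", 1.0, 4.25), ("2019-04-06", 2.0, 3.00),
        ("2019-04-10", 3.0, 2.00),
    ],
)

query = """
SELECT
    market_date,
    total_sales,
    LAG(total_sales, 1) OVER (ORDER BY market_date) AS previous_total,   -- past
    LEAD(total_sales, 1) OVER (ORDER BY market_date) AS next_total,      -- future
    total_sales - LAG(total_sales, 1) OVER (ORDER BY market_date)
        AS change_from_previous
FROM
(
    SELECT
        market_date,
        SUM(quantity * cost_to_customer_per_qty) AS total_sales
    FROM customer_purchases
    GROUP BY market_date
) x
ORDER BY market_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Rows at either end of the sort order have no neighbor in one direction, so LAG returns NULL on the first row and LEAD returns NULL on the last, and any arithmetic involving that NULL is also NULL.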

This chapter just covers the tip of the iceberg when it comes to window functions! Look in the documentation for the type of database you’re working with to see what other functions are available, and what caveats to be aware of for each. Some database systems offer additional capabilities. For example, PostgreSQL supports something called “window naming,” Oracle has additional useful aggregate functions like LISTAGG (which operates on string values), and some database systems allow for additional clauses like RANGE.
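As one example of such a capability, a named window lets several window functions share a single definition. The sketch below shows the general shape of the WINDOW clause; it is run in SQLite (through Python's built-in sqlite3 module), which also accepts this syntax, rather than in PostgreSQL, so check your own database's documentation for exact support. The table contents are hypothetical.

```python
import sqlite3

# In-memory SQLite stand-in (assumption: a SQLite build with window functions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_booth_assignments ("
    "vendor_id INTEGER, booth_number INTEGER, market_date TEXT)"
)
# Hypothetical assignments, not data from the book.
conn.executemany(
    "INSERT INTO vendor_booth_assignments VALUES (?, ?, ?)",
    [
        (1, 2, "2019-04-06"), (1, 7, "2019-04-10"),
        (4, 7, "2019-04-06"), (4, 2, "2019-04-10"),
    ],
)

query = """
SELECT
    vendor_id,
    market_date,
    booth_number,
    LAG(booth_number, 1) OVER w AS previous_booth_number,
    LEAD(booth_number, 1) OVER w AS next_booth_number
FROM vendor_booth_assignments
-- The window is defined once as w and reused by both functions
WINDOW w AS (PARTITION BY vendor_id ORDER BY market_date)
ORDER BY vendor_id, market_date
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```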

Once you understand the concept of a window function and how to use it in your query, you have the knowledge you need to research and apply the many variations.

Exercises Using the Included Database

1. Do the following two steps:

a. Write a query that selects from the customer_purchases table and numbers each customer’s visits to the farmer’s market (labeling each market date with a different number). Each customer’s first visit is labeled 1, second visit is labeled 2, etc. (We are of course not counting visits where no purchases are made, because we have no record of those.) You can either display all rows in the customer_purchases table, with the counter changing on each new market date for each customer, or select only the unique market dates per customer (without purchase details) and number those visits. HINT: One of these approaches uses ROW_NUMBER() and one uses DENSE_RANK().

Figure 7.14

b. Reverse the numbering of the query from part a so each customer’s most recent visit is labeled 1, then write another query that uses this one as a subquery and filters the results to only the customer’s most recent visit.

2. Using a COUNT() window function, include a value along with each row of the customer_purchases table that indicates how many different times that customer has purchased that product_id.

3. In the last query associated with Figure 7.14 from the chapter, we used LAG and sorted by market_date. Can you think of a way to use LEAD in place of LAG, but get the exact same output?

Data scientists use date and time functions in many different ways in our queries.

We may use two dates to calculate a duration, for example. Many machine learning algorithms are “trained” to identify patterns in data from the past and use those patterns to predict future outcomes. In order to build a dataset for that purpose, we have to be able to filter queries by time range.

Often, datasets that are built for predictive models include summaries of activities within dynamic date ranges: for example, a count of some activity occurrence during each of the past three months. Or, in the case of time-series analysis, an input dataset might include one row per time period (hour, day, week, month) with a count of something associated with each time period; for example, the number of patients a doctor sees per week.

Many predictive models are time-bound. For example, the question “Will this first-time customer become a repeat customer?” will be further refined as “What is the likelihood that each first-time customer at today’s farmer’s market will return and make a second purchase within the next month?” To answer this question, we could create a dataset with a row for every customer, columns containing data values as of the time of their first purchase, and a binary “target variable” that indicates whether that customer made another purchase within a month of their first purchase date.
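A minimal sketch of such a dataset is shown below, under several assumptions: it runs in an in-memory SQLite database through Python's built-in sqlite3 module rather than in MySQL, it treats “within a month” as within 30 days, it uses SQLite's DATE() function for the date arithmetic (MySQL would use a different function), and the purchase rows and the purchased_again_within_30_days column name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases (customer_id INTEGER, market_date TEXT)"
)
# Hypothetical purchases, stored as ISO-8601 text so dates compare correctly.
conn.executemany(
    "INSERT INTO customer_purchases VALUES (?, ?)",
    [
        (1, "2019-04-03"), (1, "2019-04-20"),  # returns 17 days later
        (2, "2019-04-06"), (2, "2019-06-01"),  # returns, but after 30 days
        (3, "2019-04-10"),                     # never returns
    ],
)

query = """
WITH first_purchase AS
(
    SELECT customer_id, MIN(market_date) AS first_purchase_date
    FROM customer_purchases
    GROUP BY customer_id
)
SELECT
    fp.customer_id,
    fp.first_purchase_date,
    -- Binary target: 1 if any later purchase falls within 30 days (assumed
    -- reading of "within a month") of the first purchase, else 0
    MAX(CASE
            WHEN cp.market_date > fp.first_purchase_date
             AND cp.market_date <= DATE(fp.first_purchase_date, '+30 days')
            THEN 1 ELSE 0
        END) AS purchased_again_within_30_days
FROM first_purchase fp
JOIN customer_purchases cp
    ON cp.customer_id = fp.customer_id
GROUP BY fp.customer_id, fp.first_purchase_date
ORDER BY fp.customer_id
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The result has one row per customer, and additional columns describing the customer as of the first purchase date could be joined on in the same way.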

Let’s look at some different ways to work with date and time values in our Farmer’s Market database.
