Aggregate Window Functions - A Beginner's Guide to Building Datasets for Analysis

You learned about aggregate SQL functions like SUM() in Chapter 6, “Aggregating Results for Analysis,” and in this chapter you have learned about window functions that partition the results set. Can you imagine how they might be used together? It turns out that you can use most aggregate functions across partitions like the window functions, returning an aggregate calculation for a partition on every row in that partition (or, for the whole results set, if you don’t use the PARTITION BY clause). One way this approach can be used is to compare each row’s value to the aggregate value for that grouped category.

For example, what if you are a farmer selling products at the market, and you want to know which of your products were above the average price per product on each market date? (Remember that because of the way our database is designed, this isn’t a true average for the full inventory, because we’re not multiplying by a quantity, but you can think of it as the average display price in a product catalog.) We can use the AVG() function as a window function, partitioned by market_date, and compare each product’s price to that value.

First, let’s try using AVG() as a window function. The output of the following query is shown in Figure 7.5:

SELECT
    vendor_id,
    market_date,
    product_id,
    original_price,
    AVG(original_price) OVER (PARTITION BY market_date ORDER BY market_date)
        AS average_cost_product_by_market_date
FROM farmers_market.vendor_inventory

The AVG() function in this query is structured as a window function, meaning it has “OVER (PARTITION BY __ ORDER BY __)” syntax, so instead of returning a single row per group with the average for that group, like you would get with GROUP BY, this function displays the average for the partition on every row within the partition. You can see in Figure 7.5 that when you get to a new market_date value in the results dataset, the average_cost_product_by_market_date value changes.
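To see the difference between GROUP BY and a window function side by side, here is a minimal sketch that runs both forms against a tiny in-memory table. It assumes Python's built-in sqlite3 module with SQLite 3.25 or later (the first version with window functions) instead of the book's MySQL database, and the five inventory rows are invented for illustration.

```python
import sqlite3

# Toy stand-in for farmers_market.vendor_inventory (all values are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_inventory ("
    "vendor_id INTEGER, market_date TEXT, product_id INTEGER, original_price REAL)"
)
conn.executemany(
    "INSERT INTO vendor_inventory VALUES (?, ?, ?, ?)",
    [
        (1, "2019-03-02", 11, 12.0),
        (1, "2019-03-02", 12, 3.0),
        (2, "2019-03-02", 5, 6.0),
        (1, "2019-03-09", 11, 12.0),
        (2, "2019-03-09", 5, 4.0),
    ],
)

# GROUP BY collapses each market_date to a single summary row.
grouped = conn.execute(
    """
    SELECT market_date, AVG(original_price)
    FROM vendor_inventory
    GROUP BY market_date
    ORDER BY market_date
    """
).fetchall()

# The window version keeps every row and repeats the partition average on it.
windowed = conn.execute(
    """
    SELECT vendor_id, market_date, product_id, original_price,
           AVG(original_price) OVER (PARTITION BY market_date ORDER BY market_date)
               AS average_cost_product_by_market_date
    FROM vendor_inventory
    ORDER BY market_date, vendor_id, product_id
    """
).fetchall()

print(grouped)   # one row per market_date
for row in windowed:
    print(row)   # one row per inventory record, with the date's average attached
```

With this toy data, the grouped query returns two rows, while the windowed query returns all five, and the average in the last column changes only when market_date changes.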

Now, let’s wrap that query inside another query (use it as a subquery) so we can compare the original price per item to the average cost of products on each market date that has been calculated by the window function. In this example, we are comparing the values in the last two columns of Figure 7.5. Remember that we can’t compare the two values in the original query, because the window function is calculated over multiple rows and won’t have a value for the partition yet when the WHERE clause filters are being applied row by row.

Using a subquery, we can filter the results to a single vendor, with vendor_id 1, and only display products that have prices above the market date’s average product cost. Here we will also format the average_cost_product_by_market_date to two digits after the decimal point using the ROUND() function:

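SELECT * FROM
(
    SELECT
        vendor_id,
        market_date,
        product_id,
        original_price,
        ROUND(AVG(original_price) OVER (PARTITION BY market_date ORDER BY market_date), 2)
            AS average_cost_product_by_market_date
    FROM farmers_market.vendor_inventory
) x
WHERE x.vendor_id = 1
    AND x.original_price > x.average_cost_product_by_market_date
ORDER BY x.market_date, x.original_price DESC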

Note that we will get different (and incorrect) results if we put the WHERE clause filtering by vendor_id inside the parentheses with the original query in this case. That’s because the results set of the inner SELECT statement would be filtered to vendor_id 1 before the window function was calculated, so we would only be calculating the average price of vendor 1’s products! Since we want to compare vendor 1’s prices on each market date to the average price of all vendors’ products on each market date, we don’t want to filter to vendor_id 1 until after the averages have been calculated, so we put the WHERE clause on the “outer” query outside the parentheses.

The results of the preceding query are shown in Figure 7.6. So vendor_id 1 had a single product, with product_id 11, that was above the average product cost on each of the market dates listed.
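The effect of where the vendor_id filter is placed can be checked on a small scale. The sketch below is an illustration only: it assumes Python's sqlite3 module (SQLite 3.25 or later) rather than the book's database, and the inventory rows are invented. It runs the same comparison twice, once filtering in the outer query and once inside the subquery.

```python
import sqlite3

# Toy stand-in for farmers_market.vendor_inventory (all values are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_inventory ("
    "vendor_id INTEGER, market_date TEXT, product_id INTEGER, original_price REAL)"
)
conn.executemany(
    "INSERT INTO vendor_inventory VALUES (?, ?, ?, ?)",
    [
        (1, "2019-03-02", 11, 12.0),
        (1, "2019-03-02", 12, 3.0),
        (2, "2019-03-02", 5, 6.0),
        (1, "2019-03-09", 11, 12.0),
        (2, "2019-03-09", 5, 4.0),
    ],
)

# Filter in the OUTER query: the average covers every vendor's products.
filter_outside = conn.execute(
    """
    SELECT x.market_date, x.product_id, x.original_price,
           x.average_cost_product_by_market_date
    FROM (
        SELECT vendor_id, market_date, product_id, original_price,
               ROUND(AVG(original_price) OVER (PARTITION BY market_date ORDER BY market_date), 2)
                   AS average_cost_product_by_market_date
        FROM vendor_inventory
    ) x
    WHERE x.vendor_id = 1
      AND x.original_price > x.average_cost_product_by_market_date
    ORDER BY x.market_date, x.original_price DESC
    """
).fetchall()

# Filter INSIDE the subquery: rows from other vendors are removed before the
# window function runs, so the average only covers vendor 1's own products.
filter_inside = conn.execute(
    """
    SELECT x.market_date, x.product_id, x.original_price,
           x.average_cost_product_by_market_date
    FROM (
        SELECT vendor_id, market_date, product_id, original_price,
               ROUND(AVG(original_price) OVER (PARTITION BY market_date ORDER BY market_date), 2)
                   AS average_cost_product_by_market_date
        FROM vendor_inventory
        WHERE vendor_id = 1
    ) x
    WHERE x.original_price > x.average_cost_product_by_market_date
    ORDER BY x.market_date, x.original_price DESC
    """
).fetchall()

print(filter_outside)
print(filter_inside)
```

In this toy data, the outer filter compares vendor 1's prices to market-wide averages of 7.0 and 8.0, while the inner filter compares them to vendor 1's own averages of 7.5 and 12.0, so the two queries return different rows.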

Another use of an aggregate window function is to count how many items are in each partition. The following is a query that counts how many different products each vendor brought to market on each date, and displays that count on each row. This way, even if the results weren’t sorted in a way that let you quickly determine how many inventory rows there are for each vendor, you would know that the row you’re looking at represents just one of the products in a counted set:

SELECT
    vendor_id,
    market_date,
    product_id,
    original_price,
    COUNT(product_id) OVER (PARTITION BY market_date, vendor_id) vendor_product_count_per_market_date
FROM farmers_market.vendor_inventory
ORDER BY vendor_id, market_date, original_price DESC

The output for this query is shown in Figure 7.7. You can see that even if I’m only looking at one row for vendor 9 on March 9, 2019, I would know that it is one of three products that vendor had in their inventory on that market date.
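As a quick check of that idea, the following sketch runs the same COUNT() window function on a handful of made-up rows. It assumes Python's sqlite3 module (SQLite 3.25 or later) in place of the book's database; the vendor IDs, dates, and prices are invented and are not the values behind Figure 7.7.

```python
import sqlite3

# Toy stand-in for farmers_market.vendor_inventory (all values are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE vendor_inventory ("
    "vendor_id INTEGER, market_date TEXT, product_id INTEGER, original_price REAL)"
)
conn.executemany(
    "INSERT INTO vendor_inventory VALUES (?, ?, ?, ?)",
    [
        (9, "2019-03-09", 1, 6.0),
        (9, "2019-03-09", 2, 4.5),
        (9, "2019-03-09", 3, 3.0),
        (7, "2019-03-09", 4, 4.0),
        (7, "2019-03-16", 4, 4.0),
    ],
)

# Count the inventory rows in each (market_date, vendor_id) partition and
# attach that count to every row of the partition.
rows = conn.execute(
    """
    SELECT vendor_id, market_date, product_id, original_price,
           COUNT(product_id) OVER (PARTITION BY market_date, vendor_id)
               AS vendor_product_count_per_market_date
    FROM vendor_inventory
    ORDER BY vendor_id, market_date, original_price DESC
    """
).fetchall()

for row in rows:
    print(row)

# Any single row carries the size of the set it belongs to.
single_row = rows[-1]
print(single_row)
```

Looking at single_row alone, the last column still says that the row is one of three products the toy vendor 9 brought on that date.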

You can also use aggregate window functions to calculate running totals.

In the first query shown next, we’re not using a PARTITION BY clause, so the running total of the price is calculated across the entire results set, in the sort order specified in the ORDER BY clause of the SUM() window function. The results are displayed in Figure 7.8.

SELECT
    customer_id,
    market_date,
    vendor_id,
    product_id,
    quantity * cost_to_customer_per_qty AS price,
    SUM(quantity * cost_to_customer_per_qty) OVER (ORDER BY market_date, transaction_time, customer_id, product_id)
        AS running_total_purchases
FROM farmers_market.customer_purchases


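In the next query, the running total is partitioned by customer_id, so it restarts for each customer. Within each partition, the rows are sorted by market_date and transaction_time, and then by product_id (in case any two items have identical purchase times). The result is shown in Figure 7.9.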

SELECT
    customer_id,
    market_date,
    vendor_id,
    product_id,
    quantity * cost_to_customer_per_qty AS price,
    SUM(quantity * cost_to_customer_per_qty) OVER (PARTITION BY customer_id ORDER BY market_date, transaction_time, product_id)
        AS customer_spend_running_total
FROM farmers_market.customer_purchases

This SUM functions as a running total because of the combination of the PARTITION BY and ORDER BY clauses in the window function. We showed what happens when there is only an ORDER BY clause, and when both clauses are present. What do you expect to happen when there is only a PARTITION BY clause (and no ORDER BY clause)?

SELECT
    customer_id,
    market_date,
    vendor_id,
    product_id,
    ROUND(quantity * cost_to_customer_per_qty, 2) AS price,
    ROUND(SUM(quantity * cost_to_customer_per_qty) OVER (PARTITION BY customer_id), 2)
        AS customer_spend_total
FROM farmers_market.customer_purchases

As hinted at by the field name alias, this version with no in-partition sorting calculates the total spent by the customer and displays that summary total on every row. So, without the ORDER BY, the SUM is calculated across the entire partition, instead of as a per-row running total, as shown in Figure 7.10. We also added the ROUND() function so this final output displays the prices with two numbers after the decimal point.
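To compare the three SUM() variants directly, the sketch below computes all of them in one query over four invented purchases. It assumes Python's sqlite3 module (SQLite 3.25 or later) instead of the book's database, and the table holds only the columns these queries touch.

```python
import sqlite3

# Toy stand-in for farmers_market.customer_purchases (all values are invented).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer_purchases ("
    "customer_id INTEGER, market_date TEXT, transaction_time TEXT, "
    "vendor_id INTEGER, product_id INTEGER, "
    "quantity REAL, cost_to_customer_per_qty REAL)"
)
conn.executemany(
    "INSERT INTO customer_purchases VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1, "2019-03-02", "09:00:00", 7, 4, 2, 4.0),
        (2, "2019-03-02", "09:15:00", 7, 4, 1, 4.0),
        (1, "2019-03-09", "10:30:00", 8, 5, 3, 6.5),
        (2, "2019-03-09", "11:00:00", 8, 5, 1, 6.5),
    ],
)

rows = conn.execute(
    """
    SELECT customer_id,
           quantity * cost_to_customer_per_qty AS price,
           -- ORDER BY only: running total across the whole results set
           SUM(quantity * cost_to_customer_per_qty)
               OVER (ORDER BY market_date, transaction_time, customer_id, product_id)
               AS running_total_purchases,
           -- PARTITION BY + ORDER BY: running total that restarts per customer
           SUM(quantity * cost_to_customer_per_qty)
               OVER (PARTITION BY customer_id
                     ORDER BY market_date, transaction_time, product_id)
               AS customer_spend_running_total,
           -- PARTITION BY only: the customer's grand total on every row
           SUM(quantity * cost_to_customer_per_qty)
               OVER (PARTITION BY customer_id)
               AS customer_spend_total
    FROM customer_purchases
    ORDER BY market_date, transaction_time
    """
).fetchall()

for row in rows:
    print(row)
```

Reading across a row of the output shows the pattern: the first total keeps growing over all purchases, the second grows only within one customer's purchases, and the third is the same on every row for a given customer.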
