Earlier in this chapter, in the query that generated the output in Figure 6.4, we added up the quantity value in the customer_purchases table, which included discrete items sold individually as well as bulk items sold by ounce or pound, and it was awkward to add those quantities together. In Chapter 4, “Conditionals / CASE Statements,” you learned about conditional CASE statements. Here, we’ll use a CASE statement to specify which type of item quantities to add together using each SUM aggregate function.
First, we’ll need to JOIN the customer_purchases table to the product table to pull in the product_qty_type column, which currently contains only the values “unit” and “lbs,” as shown in Figure 6.20.
SELECT
    cp.market_date,
    cp.vendor_id,
    cp.customer_id,
    cp.product_id,
    cp.quantity,
    p.product_name,
    p.product_size,
    p.product_qty_type
FROM farmers_market.customer_purchases AS cp
    INNER JOIN farmers_market.product AS p
        ON cp.product_id = p.product_id
To create one column that only adds up quantities of products that are sold by unit, another column that adds up quantities of products sold by the pound, and a third for any products that may be entered in the future that are sold by other units (like bulk ounces), we’ll put CASE statements inside the SUM functions to indicate which values to add up in each summary column.
First, we’ll review the results with the CASE statements included, before grouping or using aggregate functions. Notice in Figure 6.21 that each CASE statement returns the quantity only when the row’s product_qty_type matches its condition and returns 0 otherwise, so each quantity appears in the column for its quantity type.
SELECT
    cp.market_date,
    cp.vendor_id,
    cp.customer_id,
    cp.product_id,
    CASE WHEN p.product_qty_type = "unit" THEN cp.quantity ELSE 0 END AS quantity_units,
    CASE WHEN p.product_qty_type = "lbs" THEN cp.quantity ELSE 0 END AS quantity_lbs,
    CASE WHEN p.product_qty_type NOT IN ("unit","lbs") THEN cp.quantity ELSE 0 END AS quantity_other,
    p.product_qty_type
FROM farmers_market.customer_purchases cp
    INNER JOIN farmers_market.product p
        ON cp.product_id = p.product_id
Now we can add the SUM functions around each CASE statement to add up these values per market date per customer, as defined in the GROUP BY clause.
The results are shown in Figure 6.22. (The prior screenshot was just a subset of the full results, so there may be values added into the rows in Figure 6.22 that are not visible in Figure 6.21.)
SELECT
    cp.market_date,
    cp.customer_id,
    SUM(CASE WHEN p.product_qty_type = "unit" THEN cp.quantity ELSE 0 END) AS qty_units_purchased,
    SUM(CASE WHEN p.product_qty_type = "lbs" THEN cp.quantity ELSE 0 END) AS qty_lbs_purchased,
    SUM(CASE WHEN p.product_qty_type NOT IN ("unit","lbs") THEN cp.quantity ELSE 0 END) AS qty_other_purchased
FROM farmers_market.customer_purchases cp
    INNER JOIN farmers_market.product p
        ON cp.product_id = p.product_id
GROUP BY cp.market_date, cp.customer_id
ORDER BY cp.market_date, cp.customer_id
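To check the arithmetic by hand, it can help to run the same SUM-around-CASE pattern on a few made-up rows. The following is a minimal sketch, assuming MySQL 8.0 or later (for the WITH clause); the sample_purchases rows are hypothetical and do not come from the farmers_market database.

```sql
-- Hypothetical sample rows, invented for illustration only
WITH sample_purchases AS (
    SELECT 1 AS customer_id, 'unit' AS product_qty_type, 2.00 AS quantity
    UNION ALL SELECT 1, 'lbs', 1.50
    UNION ALL SELECT 1, 'unit', 3.00
    UNION ALL SELECT 2, 'lbs', 0.75
    UNION ALL SELECT 2, 'oz', 8.00
)
SELECT
    customer_id,
    -- Each SUM only accumulates rows whose type matches its CASE condition
    SUM(CASE WHEN product_qty_type = 'unit' THEN quantity ELSE 0 END) AS qty_units_purchased,
    SUM(CASE WHEN product_qty_type = 'lbs' THEN quantity ELSE 0 END) AS qty_lbs_purchased,
    SUM(CASE WHEN product_qty_type NOT IN ('unit', 'lbs') THEN quantity ELSE 0 END) AS qty_other_purchased
FROM sample_purchases
GROUP BY customer_id
ORDER BY customer_id;
```

With these rows, customer 1 ends up with 5 units and 1.5 lbs, and customer 2 ends up with 0.75 lbs and 8 in the "other" column. One thing to check in real data: if product_qty_type is NULL, none of the three conditions evaluates to true (the NOT IN comparison also yields NULL), so that row's quantity is not added to any of the three columns.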
So now you have seen examples of how to use the COUNT, COUNT DISTINCT, SUM, AVG, MIN, and MAX aggregate SQL functions, as well as CASE statements and calculations inside the functions, and calculations performed with the summarized values. I hope that by now you are starting to dream up how to apply these skills to your own work!
Exercises Using the Included Database
1. Write a query that determines how many times each vendor has rented a booth at the farmer’s market. In other words, count the vendor booth assignments per vendor_id.
2. In Chapter 5, “SQL Joins,” Exercise 3, we asked “When is each type of fresh fruit or vegetable in season, locally?” Write a query that displays the product category name, product name, earliest date available, and latest date available for every product in the “Fresh Fruits & Vegetables” product category.
3. The Farmer’s Market Customer Appreciation Committee wants to give a bumper sticker to everyone who has ever spent more than $50 at the market. Write a query that generates a list of customers for them to give stickers to, sorted by last name, then first name. (HINT: This query requires you to join two tables, use an aggregate function, and use the HAVING keyword.)
All of the functions that have been covered in this book so far, like ROUND(), return one value in each row of the results dataset. When GROUP BY is used, aggregate functions like AVG() operate on multiple values in a group of records, summarizing across multiple rows in the underlying dataset, but each value returned is still associated with a single row in the results.
Window functions also operate across multiple records, but those records don’t have to be grouped in the output. This makes it possible to put the values from one row of data into context compared to a group of rows, or partition, enabling an analyst to write queries that answer questions like: If the dataset were sorted, where would this row land in the results? How does a value in this row compare to a value in the prior row? How does a value in the current row compare to the average value for its group?
So, window functions return group aggregate calculations alongside individual row-level information for items in that group, or partition. They can also be used to rank or sort values within each partition.
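As a rough illustration of the three kinds of questions listed above, the sketch below computes a row's position, its difference from the prior row, and its partition average in one query. It assumes MySQL 8.0 or later, where window functions and the WITH clause are available; the sample_prices rows and their column names are hypothetical, made up only for this example.

```sql
-- Hypothetical sample rows, invented for illustration only
WITH sample_prices AS (
    SELECT 1 AS vendor_id, 'A' AS product_name, 3.00 AS price
    UNION ALL SELECT 1, 'B', 5.00
    UNION ALL SELECT 1, 'C', 4.00
    UNION ALL SELECT 2, 'D', 2.00
    UNION ALL SELECT 2, 'E', 6.00
)
SELECT
    vendor_id,
    product_name,
    price,
    -- Where would this row land if each vendor's rows were sorted by price?
    ROW_NUMBER() OVER (PARTITION BY vendor_id ORDER BY price DESC) AS price_rank,
    -- How does this row compare to the prior row in that ordering?
    price - LAG(price) OVER (PARTITION BY vendor_id ORDER BY price DESC) AS diff_from_prior,
    -- How does this row compare to the average for its partition?
    AVG(price) OVER (PARTITION BY vendor_id) AS vendor_avg_price
FROM sample_prices
ORDER BY vendor_id, price_rank;
```

All five input rows appear in the output, each with its own detail values plus the partition-level calculations. The first row in each partition has NULL for diff_from_prior, because LAG has no prior row to look at there.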
One use for window functions in data science is to include some information from a past record alongside the most recent detail record related to an entity.
For example, we could use window functions to get the date of the first purchase a person made at the farmer’s market, to be returned alongside their detailed purchase records, which could then be used to determine how long they had been a customer at the time each purchase was made.
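One possible way to sketch that first-purchase example is shown below. It assumes MySQL 8.0 or later and that market_date is stored as a date; DATEDIFF is MySQL's day-difference function, and the output column names first_purchase_date and days_as_customer are made up for this illustration.

```sql
SELECT
    cp.customer_id,
    cp.market_date,
    cp.product_id,
    cp.quantity,
    -- Earliest purchase date for this customer, repeated on every one of their rows
    MIN(cp.market_date) OVER (PARTITION BY cp.customer_id) AS first_purchase_date,
    -- Days between the customer's first purchase and this purchase
    DATEDIFF(
        cp.market_date,
        MIN(cp.market_date) OVER (PARTITION BY cp.customer_id)
    ) AS days_as_customer
FROM farmers_market.customer_purchases AS cp
ORDER BY cp.customer_id, cp.market_date;
```

Because there is no GROUP BY, every purchase record is still returned; the window function only adds the per-customer minimum alongside each row.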