ROW NUMBER - A Beginner's Guide to Building Datasets for Analysis

Based on what you’ve learned in previous chapters, if you wanted to determine how much the most expensive product sold by each vendor costs, you could group the records in the vendor_inventory table by vendor_id, and return the maximum original_price value using the following query:

SELECT
    vendor_id,
    MAX(original_price) AS highest_price
FROM farmers_market.vendor_inventory
GROUP BY vendor_id
ORDER BY vendor_id

But this just gives you the price of the most expensive item per vendor. If you wanted to know which item was the most expensive, how would you determine which product_id was associated with that MAX(original_price) per vendor?

There is a window function called ROW_NUMBER() that enables you to number rows in order of a value, in this case ranking products per vendor by price. This approach allows you to keep the detail-level information that you would otherwise lose by aggregating, as we did in the preceding query:

SELECT
    vendor_id,
    market_date,
    product_id,
    original_price,
    ROW_NUMBER() OVER (PARTITION BY vendor_id ORDER BY original_price DESC) AS price_rank
FROM farmers_market.vendor_inventory
ORDER BY vendor_id, original_price DESC

Let’s break that syntax down a bit. I would interpret the ROW_NUMBER() line as “number the rows of inventory per vendor, sorted by original price, in descending order.” The part inside the parentheses after OVER says how to apply the ROW_NUMBER() function. We’re going to PARTITION BY vendor_id (you can think of this like a GROUP BY without actually combining the rows, so we’re telling it how to split the rows into groups, without aggregating). Then within the partition, the ORDER BY indicates how to sort the rows. So, we’ll sort the rows by price, high to low, within each vendor_id partition, and number each row. That means the highest-priced item per vendor will be first, and assigned row number 1.

You can see in Figure 7.1 that for each vendor, the products are sorted by original_price, high to low, and the row numbering column is called price_rank. The row numbering starts over when you get to the next vendor_id, so the most expensive item per vendor has a price_rank of 1.
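The partition-and-number behavior described above can be tried on a handful of rows. This is only a minimal sketch, not the book's setup: it assumes Python's built-in sqlite3 module (with SQLite 3.25 or later, which supports window functions) as a stand-in for the book's database, and the inventory rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach a second in-memory database under the name farmers_market so the
# schema-qualified table name used in the text resolves (an assumption of this sketch).
conn.execute("ATTACH DATABASE ':memory:' AS farmers_market")
conn.execute(
    "CREATE TABLE farmers_market.vendor_inventory ("
    "market_date TEXT, vendor_id INTEGER, product_id INTEGER, original_price REAL)"
)
# Hypothetical sample rows: (market_date, vendor_id, product_id, original_price)
conn.executemany(
    "INSERT INTO farmers_market.vendor_inventory VALUES (?, ?, ?, ?)",
    [
        ("2019-07-03", 7, 1, 6.99),
        ("2019-07-03", 7, 2, 3.49),
        ("2019-07-03", 7, 3, 0.50),
        ("2019-07-03", 8, 4, 18.00),
        ("2019-07-03", 8, 5, 6.50),
    ],
)

# Number the rows within each vendor_id partition, highest price first.
ranked = conn.execute("""
    SELECT vendor_id, product_id, original_price,
           ROW_NUMBER() OVER (PARTITION BY vendor_id ORDER BY original_price DESC) AS price_rank
    FROM farmers_market.vendor_inventory
    ORDER BY vendor_id, original_price DESC
""").fetchall()

for row in ranked:
    print(row)
```

In this sample, the numbering restarts at 1 when vendor_id changes from 7 to 8, and every detail row is still present in the output.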

To return only the record of the highest-priced item per vendor, you can query the results of the previous query (which, when embedded inside another query this way, is called a subquery), and limit the output to the #1 ranked item per vendor_id. With this approach, you’re not using a GROUP BY to aggregate the records. You’re sorting the records within each partition (a set of records that share a value or combination of values, vendor_id in this case), then filtering to a value (the row number called price_rank here) that was evaluated over that partition. Figure 7.2 shows the highest-priced product per vendor using the following query:

SELECT * FROM (
    SELECT
        vendor_id,
        market_date,
        product_id,
        original_price,
        ROW_NUMBER() OVER (PARTITION BY vendor_id ORDER BY original_price DESC) AS price_rank
    FROM farmers_market.vendor_inventory
    ORDER BY vendor_id
) x
WHERE x.price_rank = 1

Figure 7.1

Figure 7.2
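To see what the subquery approach adds over the earlier GROUP BY query, the two can be run side by side. The sketch below again assumes Python's sqlite3 module as a stand-in database and uses invented sample rows; the variable names max_prices and top_products are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach an in-memory database named farmers_market (an assumption of this
# sketch) so the schema-qualified table name from the text works.
conn.execute("ATTACH DATABASE ':memory:' AS farmers_market")
conn.execute(
    "CREATE TABLE farmers_market.vendor_inventory ("
    "market_date TEXT, vendor_id INTEGER, product_id INTEGER, original_price REAL)"
)
# Hypothetical sample rows: (market_date, vendor_id, product_id, original_price)
conn.executemany(
    "INSERT INTO farmers_market.vendor_inventory VALUES (?, ?, ?, ?)",
    [
        ("2019-07-03", 7, 1, 6.99),
        ("2019-07-03", 7, 2, 3.49),
        ("2019-07-03", 8, 4, 18.00),
        ("2019-07-03", 8, 5, 6.50),
    ],
)

# Aggregate version: the highest price per vendor, but no product_id.
max_prices = conn.execute("""
    SELECT vendor_id, MAX(original_price) AS highest_price
    FROM farmers_market.vendor_inventory
    GROUP BY vendor_id
    ORDER BY vendor_id
""").fetchall()

# Window-function version: number the rows in a subquery, then keep row 1.
top_products = conn.execute("""
    SELECT x.vendor_id, x.product_id, x.original_price
    FROM (
        SELECT vendor_id, market_date, product_id, original_price,
               ROW_NUMBER() OVER (PARTITION BY vendor_id ORDER BY original_price DESC) AS price_rank
        FROM farmers_market.vendor_inventory
    ) x
    WHERE x.price_rank = 1
    ORDER BY x.vendor_id
""").fetchall()

print(max_prices)
print(top_products)
```

Both results contain one row per vendor with the same top price, but only the second also identifies which product_id carries that price.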

This will only return one row per vendor, even if there are multiple products with the same price. To return all products with the highest price per vendor when there is more than one with the same price, use the RANK function found in the next section. If you want to control which one of the tied items gets returned by the ROW_NUMBER function, you can add additional sorting columns in the ORDER BY section of the ROW_NUMBER function. For example, you can sort by both original_price (descending) and market_date (ascending) to get, for each vendor, the top-priced product that was brought to market earliest.
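The tie-breaking idea can be sketched as follows, under the same assumptions as before: SQLite through Python's sqlite3 module stands in for the book's database, the rows are invented, and market_date is stored as ISO-formatted text so that sorting it as text matches chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach an in-memory database named farmers_market (an assumption of this
# sketch) so the schema-qualified table name from the text works.
conn.execute("ATTACH DATABASE ':memory:' AS farmers_market")
conn.execute(
    "CREATE TABLE farmers_market.vendor_inventory ("
    "market_date TEXT, vendor_id INTEGER, product_id INTEGER, original_price REAL)"
)
# Hypothetical sample rows; vendor 8 has two products tied at the top price.
conn.executemany(
    "INSERT INTO farmers_market.vendor_inventory VALUES (?, ?, ?, ?)",
    [
        ("2019-07-03", 7, 1, 6.99),
        ("2019-07-03", 7, 2, 3.49),
        ("2019-07-06", 8, 4, 18.00),
        ("2019-07-03", 8, 5, 18.00),
    ],
)

# The second sort column breaks price ties in favor of the earliest market_date.
earliest_top = conn.execute("""
    SELECT x.vendor_id, x.market_date, x.product_id, x.original_price
    FROM (
        SELECT vendor_id, market_date, product_id, original_price,
               ROW_NUMBER() OVER (
                   PARTITION BY vendor_id
                   ORDER BY original_price DESC, market_date ASC
               ) AS price_rank
        FROM farmers_market.vendor_inventory
    ) x
    WHERE x.price_rank = 1
    ORDER BY x.vendor_id
""").fetchall()

print(earliest_top)
```

Without the market_date tie-breaker, either of vendor 8's two tied products could be numbered 1, so the result would not be predictable.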

You’ll notice that the preceding query has a different structure than the queries we have written so far. There is one query embedded inside the other! Sometimes this is called “querying from a derived table,” but is more commonly called a “subquery.” What we’re doing is treating the results of the “inner” SELECT statement like a table, here given the table alias x, selecting all columns from it, and filtering to only the rows with a particular ROW_NUMBER. Our ROW_NUMBER column is aliased price_rank, and we’re filtering to price_rank = 1, because we numbered the rows by original_price in descending order, so the most expensive item will have the lowest row number.

The reason we have to structure this as a subquery is that the entire dataset has to be processed in order for the window function to find the highest price per vendor. So we can’t filter the results using a WHERE clause (which you’ll remember evaluates the conditional statements row by row) because when that filtering is applied, the ROW_NUMBER has not yet been calculated for every row.

Figure 7.3 illustrates which parts of the SQL statement are considered the “inner” and “outer” queries. The “outer” part of a subquery is processed after the “inner” query is complete, so the row numbers have been determined, and we can then filter by the values in the price_rank column.

TIP Many SQL editors allow you to run the inner query by itself, by highlighting it and executing the selected SQL only. This allows you to preview the results of the inner query that will then be used by the outer query.

Figure 7.3

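If we had instead tried to filter on price_rank in a WHERE clause of the same query, without the subquery, we would get an error, because at the time the WHERE conditions are evaluated, the window functions have not yet had a chance to check the entire dataset to determine the ranking. If we tried to put the ROW_NUMBER function in the WHERE clause, instead of referencing the price_rank alias, we would get a different error, but for the same reason.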
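Both failure modes can be reproduced in a small experiment. The sketch below assumes SQLite via Python's sqlite3 module, with invented sample rows and a hypothetical helper named try_query; the exact error text differs from one database engine to another, so the code only captures whatever message the engine reports rather than expecting a particular one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach an in-memory database named farmers_market (an assumption of this
# sketch) so the schema-qualified table name from the text works.
conn.execute("ATTACH DATABASE ':memory:' AS farmers_market")
conn.execute(
    "CREATE TABLE farmers_market.vendor_inventory ("
    "market_date TEXT, vendor_id INTEGER, product_id INTEGER, original_price REAL)"
)
# Hypothetical sample rows: (market_date, vendor_id, product_id, original_price)
conn.executemany(
    "INSERT INTO farmers_market.vendor_inventory VALUES (?, ?, ?, ?)",
    [("2019-07-03", 7, 1, 6.99), ("2019-07-03", 8, 4, 18.00)],
)


def try_query(sql):
    """Run sql and return the engine's error message, or None if it succeeded."""
    try:
        conn.execute(sql).fetchall()
        return None
    except sqlite3.Error as exc:
        return str(exc)


# Attempt 1: filter on the price_rank alias in the same query, with no subquery.
alias_error = try_query("""
    SELECT vendor_id, product_id, original_price,
           ROW_NUMBER() OVER (PARTITION BY vendor_id ORDER BY original_price DESC) AS price_rank
    FROM farmers_market.vendor_inventory
    WHERE price_rank = 1
""")

# Attempt 2: put the window function itself in the WHERE clause.
function_error = try_query("""
    SELECT vendor_id, product_id, original_price
    FROM farmers_market.vendor_inventory
    WHERE ROW_NUMBER() OVER (PARTITION BY vendor_id ORDER BY original_price DESC) = 1
""")

print(alias_error)
print(function_error)
```

In both attempts the engine is asked to filter rows on a value that is only known after the window function has processed the whole partition, which is why wrapping the ranked query in a subquery is the workable form.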

You will see the subquery format throughout this chapter, because if you want to do anything with the results of most window functions, you have to allow them to calculate across the entire dataset first. Then, by treating the results like a table, you can query from and filter by the results returned by the window functions.

Note that you can also use ROW_NUMBER without a PARTITION BY clause, to number every record across the whole result (instead of numbering per partition). If you were to use the same ORDER BY clause we did earlier, and eliminate the PARTITION BY clause, then only one item with the highest price in the entire results set would get the price_rank of 1, instead of one item per vendor.
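As a quick illustration of numbering without partitions, the sketch below drops the PARTITION BY clause and keeps the same ORDER BY. As before, it assumes SQLite through Python's sqlite3 module and uses made-up sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Attach an in-memory database named farmers_market (an assumption of this
# sketch) so the schema-qualified table name from the text works.
conn.execute("ATTACH DATABASE ':memory:' AS farmers_market")
conn.execute(
    "CREATE TABLE farmers_market.vendor_inventory ("
    "market_date TEXT, vendor_id INTEGER, product_id INTEGER, original_price REAL)"
)
# Hypothetical sample rows: (market_date, vendor_id, product_id, original_price)
conn.executemany(
    "INSERT INTO farmers_market.vendor_inventory VALUES (?, ?, ?, ?)",
    [
        ("2019-07-03", 7, 1, 6.99),
        ("2019-07-03", 7, 2, 3.49),
        ("2019-07-03", 7, 3, 0.50),
        ("2019-07-03", 8, 4, 18.00),
        ("2019-07-03", 8, 5, 6.50),
    ],
)

# No PARTITION BY: the whole result is one window, numbered from 1 upward.
overall = conn.execute("""
    SELECT vendor_id, product_id, original_price,
           ROW_NUMBER() OVER (ORDER BY original_price DESC) AS price_rank
    FROM farmers_market.vendor_inventory
    ORDER BY price_rank
""").fetchall()

for row in overall:
    print(row)
```

Here only vendor 8's most expensive product receives a price_rank of 1; vendor 7's top item is numbered 2 because the numbering no longer restarts per vendor.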
