Exploring Inventory vs. Sales - A Beginner's Guide to Building Datasets for Analysis

market_date, vendor_id, and product_id, and add up the quantities sold and each row’s quantity (?): multiplied by the cost to get the total sales for each group:

SELECT market_date, vendor_id, product_id,

SUM(quantity) quantity_sold,

SUM(quantity * cost_to_customer_per_qty) total_sales FROM farmers_market.customer_purchases

WHERE vendor_id = 7 and product_id = 4 GROUP BY market_date, vendor_id, product_id ORDER BY market_date, vendor_id, product_id

You can see the results of this in Figure 9.15.

Again, if we were building a report, we would probably want to round those dollars in the last column to two numbers after the decimal, but since this is an exploration for our own information, I won’t spend the extra time on formatting.

product yet, so this query will return a lot of rows:

SELECT * FROM farmers_market.vendor_inventory AS vi LEFT JOIN

(

SELECT market_date, vendor_id, product_id,

SUM(quantity) AS quantity_sold,

SUM(quantity * cost_to_customer_per_qty) AS total_sales FROM farmers_market.customer_purchases

GROUP BY market_date, vendor_id, product_id ) AS sales

ON vi.market_date = sales.market_date AND vi.vendor_id = sales.vendor_id AND vi.product_id = sales.product_id ORDER BY vi.market_date, vi.vendor_id, vi.product_id LIMIT 10

We can see in Figure 9.16 that the vendor_id, product_id, and market_date do match on every row, and the summary values for vendor_id 8 and product_id 4 match what we saw when looking at the customer_purchases table alone. This confirms that our join looks right, and we can remove these redundant columns by specifying which columns we want to display from each table. We’ll make sure to pull these columns in from the vendor_inventory table, which we made the “left” side of the JOIN relationship because a customer can’t buy inventory that doesn’t exist (Theoretically! We should check the data to make sure that never happens, which might indicate a vendor made a mistake in entering their available inventory.) But there could be inventory that doesn’t get purchased, which we would want to see in the output, but would be missing if we chose a RIGHT JOIN instead.

Figure 9.16

We can also join in additional “lookup” tables to convert the various IDs to human- readable values, pulling in the vendor name and product names.

Then, we can filter to vendor 7 and product 4 to get the information we were looking for earlier, comparing this vendor’s inventory of this product to the sales made at each market:

SELECT vi.market_date, vi.vendor_id, v.vendor_name, vi.product_id, p.product_name,

vi.quantity AS quantity_available, sales.quantity_sold,

vi.original_price, sales.total_sales

FROM farmers_market.vendor_inventory AS vi LEFT JOIN

(

SELECT market_date, vendor_id, product_id,

SUM(quantity) AS quantity_sold,

SUM(quantity * cost_to_customer_per_qty) AS total_sales FROM farmers_market.customer_purchases

GROUP BY market_date, vendor_id, product_id ) AS sales

ON vi.market_date = sales.market_date AND vi.vendor_id = sales.vendor_id AND vi.product_id = sales.product_id LEFT JOIN farmers_market.vendor v ON vi.vendor_id = v.vendor_id LEFT JOIN farmers_market.product p ON vi.product_id = p.product_id WHERE vi.vendor_id = 7

AND vi.product_id = 4

ORDER BY vi.market_date, vi.vendor_id, vi.product_id

Now we can see in a sample of this query’s output in Figure 9.17 that this vendor is called Marco’s Peppers, and the product we were looking at is jars of Banana Peppers. He brings 30–40 jars each time and sells between 1 and 40 jars per market (which we quickly determined by sorting the output in the SQL editor ascending and descending by the quantity_sold, but we could’ve also done by adding quantity_sold to the ORDER BY clause of the query and scrolling to the top and bottom of the output or surrounding this query with another query that calculated the MIN and MAX quantity_sold).

To take a closer look at the distribution of sales and continue our EDA visu- ally, or to easily filter to different combinations of vendors and products, we can remove the WHERE clause filter and pull the dataset generated by this query into reporting software such as Tableau, building dashboards and reports. Figure 9.18 shows a dashboard with the inventory and sales of the selected products during the selected date range. Figure 9.19 shows a histogram with how many different market dates each count of jars of Banana Peppers were sold.

We could also do a lot more calculations in SQL using this dataset, such as calculating the percent of each vendor’s inventory sold at each market, but the purpose of this demonstration was to show how much you can learn about the data in a database using the types of SQL queries you have learned in this book so far.

In the next chapter, we’ll explore more analytical queries for building reports using SQL that go beyond simple data previews and basic summaries.

Figure 9.17

Figure 9.18

Exercises

1. In the chapter, it was suggested that we should see if the customer_ purchases data was collected for the same time frame as the vendor_inventory table.

Write a query that gets the earliest and latest dates in the customer_ purchases table.

2. There is a MySQL function DAYNAME() that returns the name of the day of the week for a date. Using the DAYNAME and EXTRACT functions on the customer_purchases table, select and group by the weekday and hour of the day, and count the distinct number of customers during each hour of the Wednesday and Saturday markets. See Chapters 6, “Aggregating Results for Analysis,” and 8, “Date and Time Functions,” for information on the COUNT DISTINCT and EXTRACT functions.

3. What other questions haven’t we yet asked about the data in these tables that you would be curious about? Write two more queries further exploring or summarizing the data in the product, vendor_inventory, or customer_purchases tables.

Figure 9.19

143 In previous chapters, we covered basic SQL SELECT syntax and started to use SQL to construct datasets to answer specific questions. In the data analysis world, being asked questions, exploring a database, writing SQL statements to find and pull the data needed to determine the answers, and conducting the analysis of that data to calculate the answers to the questions, is called ad- hoc reporting.

I often say in my data science conference presentations that the process depicted in Figure 10.1 is what is expected of any data analyst or data scientist: to be able to listen to a question from a business stakeholder, determine how it might be answered using data from the database, retrieve the data needed to answer it, calculate the answers, and present that result in a form that the business stakeholder can understand and use to make decisions.

You now know enough SQL to answer some basic ad- hoc questions about what is occurring at the fictional farmer’s market using the demonstration database by filtering, joining, and summarizing the data.

In the remaining chapters, we’ll take those skills to the next level and dem- onstrate how to think through multiple analysis questions, simulating what it might be like to write queries to answer a question posed by a business stakeholder. We’ll design and develop analytical datasets that can be used repeat- edly to facilitate ad- hoc reporting, build dashboards, and serve as inputs into predictive models.

Building SQL Datasets for

Dalam dokumen A Beginner's Guide to Building Datasets for Analysis (Halaman 164-169)