SQL JOINs - A Beginner's Guide to Building Datasets for Analysis

Figure 5.1 shows the one-to-many relationship between these two tables: each product can only belong to one category, but each category can contain many products. The primary key in the product_category table is the product_cate- gory_id. There is also a product_category_id in each row of the product table that serves as a foreign key, identifying which category each product belongs to.

NOTE Remember that in an ERD, an infinity symbol, “N,” or “crow’s feet” on the end of a line connecting two tables indicates that it is the “many” side of a one-to-many relationship. You can see the infinity symbol next to the product table in Figure 5.1.

In order to combine these tables, we need to figure out which type of JOIN to use. To illustrate the different types of SQL JOINs, we’ll use the two tables from the Farmer’s Market database found in Figure 5.1, but remove some columns to simplify the illustration, as shown in Figure 5.2.

Figure 5.1

Figure 5.2

the product table is associated with only one row in the product_category table. The fields that connect the two tables are product_category.product_

category_id and product.product_category_id.

The first type of JOIN we’ll cover is the one I’ve seen used most frequently when building analytical datasets: the LEFT JOIN. This tells the query to pull all records from the table on the “left side” of the JOIN, and only the matching records (based on the criteria specified in the JOIN clause) from the table on the

“right side” of the JOIN. See Figure 5.3. Note that the product table will be on the left side of the join we’re setting up, even though it was on the right side of the relationship diagram!

In our demonstration tables shown in Figure 5.2, you can see that there are two records in the product table that have a product_category_id of 1, and only one record in the product_category table with the product_category_id of 1 (because it is the primary key in this table).

When we LEFT JOIN the product_category table to the product table, the records with a value of 1 in the product_category_id fields of both tables will be joined in the results, as shown in the first two rows of the output in Figure 5.4.

Note that the data for product_category_id 1 from the product_category table is repeated in two rows in the resulting output, even though it was only in one row in the product_category table, because it is matched up with two records in the joined product table.

Figure 5.3

What makes this JOIN a LEFT JOIN is that all of the records from the product table on the left side of the join are included in the output, even if they don’t have a match in the product_category table. But records from the product_

category table on the right side of the join are only included if they match up with product records on the left side.

In Figure 5.4, you can see that the row with a product_id of 99 is included, but the values in the resulting output columns from the product_category are NULL (empty) This is because there was no product_category_id value on which to join this product to the product_category table. However, the row from the product_category table with a product_category_id of 6 is not included in the output at all, because it was in the “right” table and does not have a matching record on the left side, and this is a LEFT JOIN.

The syntax for creating this output is:

SELECT [columns to return]

FROM [left table]

[JOIN TYPE] [right table]

ON [left table].[field in left table to match] = [right table].[field in right table to match]

Let’s look at the actual output of a query that joins our product and product_

category tables from the Farmer’s Market database using a LEFT JOIN.

In order to pull a list of all products with each product’s category name listed, we make the product table the “left” table in this query by listing it first after FROM, and the product_category table the “right” side by listing it after the LEFT JOIN, and we’ll match up the records using the product_category_id fields, as seen after the ON keyword. This is how a query pulling all fields from both tables looks:

SELECT * FROM product

LEFT JOIN product_category Figure 5.4

joined with the product_category table, matched on the product_category_id that’s common to both tables,” or more specifically: “Select all columns and rows from the product table, and all columns from the product_category table for rows where the product_category’s product_category_id matches a product’s product_category_id.” The first 10 rows of the output from this query are shown in Figure 5.5.

NOTE You may have noticed that there are two columns called product_category_id in Figure 5.5. That is because we selected all fields, using the asterisk, and there are fields in both tables with the same name. To remedy this, we could either specify the list of fields to be returned, and only include the product_category_id from one of the tables, or we could alias the column names to indicate which table each came from, as shown in the next query and pictured in Figure 5.6.

The LEFT JOIN indicates that we want all rows from the product table (which is listed on the left side of the JOIN keyword) and only the associated rows from the product_category table. So, if for some reason there is a category that is not associated with any products, it will not be included in the results. If there were a product without a category (which can’t happen in the actual database, because that is a required field), it would be included in the results, with the fields on the product_category side being NULL, as was illustrated in the last row of Figure 5.4.

The ON part of the JOIN clause tells the query to match up the rows in the two tables using the values in the product_category_id field in each table. Note in Figure 5.5 that the product_category_id fields from both tables match on every row. (It would be helpful for us to alias those field names so they don’t appear with identical column headers in the output, which we will do in the next query.)

Now, if we want to retrieve specific columns from the merged dataset, we have to specify which table each column is from, since it’s possible to have identically named columns in different tables. And we can alias identically named Figure 5.5

columns to differentiate them. For example, the following code accomplishes the changes shown in Figure 5.6.

SELECT

product.product_id, product.product_name,

product.product_category_id AS product_prod_cat_id,

product_category.product_category_id AS category_prod_cat_id, product_category.product_category_name

FROM product

LEFT JOIN product_category

ON product.product_category_id = product_category.

product_category_id

There is another type of aliasing in a SQL query that is for developer convenience, because it isn’t visible in the output: table aliasing. If you don’t want to write out the entire table name every time you reference it, you can assign it a short alias in the FROM clause that can then be used to reference it throughout the query, as demonstrated in the following query with the table aliases “p” and

“pc.” Also, we will also only show a single product_category_id now that we are convinced from the output that they are matching up between tables.

SELECT

p.product_id, p.product_name, pc.product_category_id, pc.product_category_name FROM product AS p

LEFT JOIN product_category AS pc

ON p.product_category_id = pc.product_category_id ORDER BY pc.product_category_name, p.product_name

Again, the AS keyword between the table name and the alias is optional, but I will stick with that standard in this book for consistency. The output from this query is shown in Figure 5.7.

Figure 5.6

The next type of SQL JOIN we’ll discuss is called a RIGHT JOIN, which is illustrated in Figure 5.8. In a RIGHT JOIN, all of the rows from the “right table”

are returned, along with only the matching rows from the “left table,” using the fields specified in the ON part of the query.

If we use a RIGHT JOIN to merge the data in the tables shown in Figure 5.2, the result will look like the table in Figure 5.9. All of the records from the right table, product_category, are returned, but only matching records from product. There are no products in this example with a product_category_id value of 6, so the first three columns of the last row are NULL. Note that the record in the product table with a product_id of 99 that was in the last row of Figure 5.4 is now missing, because we performed a RIGHT JOIN this time, and that record does not have a match in the product_category table, since its product_category_id was NULL.

Figure 5.7

Figure 5.8

You would use a RIGHT JOIN if you wanted to list all product categories and the products in each. (And you didn’t care about products that were not put into a category, but you did care about categories that didn’t contain any products.)

An INNER JOIN only returns records that have matches in both tables. Can you tell which rows from each table in Figure 5.2 will not be returned if we INNER JOIN the product and product_category tables on product_ category_id? Figure 5.10 illustrates an INNER JOIN.

In Figure 5.4, the fields on the right side of the join (from product_category) were NULL in the row where the left side had a NULL product_category_id. In Figure 5.9, the fields on the left side of the table (from product) were NULL in the row where the product_category_id on the right side hadn’t been assigned to any products. If we INNER JOIN the tables, neither of those rows will be included in the output, as shown in Figure 5.11.

Figure 5.9

Figure 5.10

In the INNER JOINed output in Figure 5.11, there is a row for every matched pair of product_category_ids, and no rows without a matching product_category_id on the other side.

To practice all of these types of JOINs, let’s now look at the customer and customer_purchase tables in the Farmer’s Market database. Again, this is a one-to-many type relationship. A customer can have multiple purchases, but each purchase is made by only one customer. The two tables are related via the customer_id field, which is the primary key in the customer table, and a foreign key in the customer_purchases table.

If we do a LEFT JOIN, using the following query, we can see in the output in Figure 5.12 that for some rows, the fields that come from the customer_ purchases table are all NULL. What do you think this means?

SELECT *

FROM customer AS c

LEFT JOIN customer_purchases AS cp ON c.customer_id = cp.customer_id

Unlike for the product-product_category relationship, there can be customers without any purchases. Such customers were added to the customer table when they signed up for the farmer’s market loyalty card, so we have their customer data, but they have not yet purchased any products. Since we did a LEFT JOIN, we’re getting a list of all customers, and their associated purchases, if there are Figure 5.11

Figure 5.12

any. Customers with multiple purchases will show up in the output multiple times—once for each item purchased. Customers without any purchases will have NULL values in all fields displayed that are from the customer_purchases table.

We can use the WHERE clause you learned about in Chapter 3, “The WHERE Clause,” to filter the list to only customers with no purchases, if we’d like:

SELECT c.*

FROM customer AS c

LEFT JOIN customer_purchases AS cp ON c.customer_id = cp.customer_id WHERE cp.customer_id IS NULL

In this case, we only selected columns from the customer table, using c.*, because all of the columns on the customer_purchases side of the relationship will be NULL (since we’re filtering to NULL customer_id, and there are no purchases in the customer_purchases table without a customer_id, since it is a required field; remember that in this imaginary farmer’s market, every purchase is logged at checkout, and every customer uses their loyalty card).

Figure 5.13 shows what the output of this query looks like, listing all customers who don’t have any purchases.

What if we wanted to list all purchases and the customers associated with them? In that case, we could do a RIGHT JOIN, pulling all records from the customer_purchases table, and only customers from the customer table with a purchase (a record in the customer_purchases table with their ID on it):

SELECT *

FROM customer AS c

RIGHT JOIN customer_purchases AS cp ON c.customer_id = cp.customer_id

The output in Figure 5.14 is truncated to save space and doesn’t show all results, but there are no rows returned with NULL values in the customer table columns, because there is no such thing as a purchase without a customer_id. And because we did a RIGHT JOIN, we will no longer get customers without purchases in the results.

Figure 5.13

If you only want records from each table that have matches in both tables, use an INNER JOIN. Using these customer and customer_purchases tables, an INNER JOIN happens to return the same results as the RIGHT JOIN, because there aren’t any records on the “right side” of the join without matches on the

“left side”—every purchase is associated with a customer.

Dalam dokumen A Beginner's Guide to Building Datasets for Analysis (Halaman 87-97)