Handling Duplicates - Learning MySQL and MariaDB

In Chapter 6, we covered the INSERT statement in detail. We saw several variants on its syntax and interesting ways to use it. This included INSERT…SELECT, a combination of the INSERT and SELECT statements. There is another combination related to updating rows, INSERT…ON DUPLICATE KEY UPDATE.

When inserting multiple rows of data, you may inadvertently attempt to insert rows that would be duplicates: that is, rows that have the same value as an existing row in a column that is supposed to be unique.

With the INSERT statement, you can add the IGNORE flag to indicate that duplicate rows should be ignored and not inserted. With the REPLACE statement, MySQL will replace the existing rows with the new data, or rather it will delete the existing rows and insert the new rows. As an alternative, you might want to keep the existing rows, but make a note in each of them. That is the situation in which INSERT…ON DUPLICATE KEY UPDATE is useful. This will make more sense with an example.
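Before turning to that example, the difference between the first two options can be sketched on a small scratch table. This is only an illustration: the demo_members table, its columns, and the sample values are hypothetical and are not part of the book's database.

```sql
-- Hypothetical scratch table for illustration only (not part of the book's schema).
CREATE TEMPORARY TABLE demo_members (
  member_id     INT AUTO_INCREMENT PRIMARY KEY,
  email_address VARCHAR(100) UNIQUE,
  nickname      VARCHAR(25)
);

INSERT INTO demo_members (email_address, nickname)
VALUES ('pat@example.com', 'Pat');

-- INSERT IGNORE: the duplicate email is skipped with a warning,
-- and the existing row is left unchanged.
INSERT IGNORE INTO demo_members (email_address, nickname)
VALUES ('pat@example.com', 'Patty');

-- REPLACE: the existing row is deleted and a new row is inserted,
-- so nickname changes and member_id is regenerated.
REPLACE INTO demo_members (email_address, nickname)
VALUES ('pat@example.com', 'Patricia');

SELECT * FROM demo_members;
```

Neither statement keeps the old row while also recording that a duplicate was seen, which is the gap that INSERT…ON DUPLICATE KEY UPDATE fills.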

Suppose there is another bird-watchers website similar to ours, called Better Birders. Because that site has become inactive and the owner wants to close it, he contacts us and offers to redirect the site’s traffic to our domain if we’ll add its members to our membership. We accept this offer, so he gives us a plain-text file with a list of each member’s name and email address. There are a few ways we might import those names; some are covered in Chapter 15. But because some of the members of the other site may already be members of our site, we don’t want to import them and have duplicate entries. However, we do want to make note of those people as being members of the other site in case we want that information later. Let’s try using INSERT…ON DUPLICATE KEY UPDATE to do that. First we’ll add a column to indicate that a member came from the Better Birders site by using the ALTER TABLE statement like so:

ALTER TABLE humans
ADD COLUMN better_birders_site TINYINT DEFAULT 0;

This statement added a column named better_birders_site with a default value of 0. If someone is a member of the Better Birders site, we’ll set the column to 1. We’ll set the column to a value of 2 to indicate they are a member of both sites. Because two people can have the same name, we use the email address to determine whether a row is a duplicate.

In the humans table, the email_address column is already set to UNIQUE. It will be the basis by which rows will be updated with the combined SQL statement we’ll use. With these factors in mind, let’s try to insert a few members:

INSERT INTO humans
(formal_title, name_first, name_last, email_address, better_birders_site)
VALUES('Mr','Barry','Pilson', '[email protected]', 1),
('Ms','Lexi','Hollar', '[email protected]', 1),
('Mr','Ricky','Adams', '[email protected]', 1)
ON DUPLICATE KEY
UPDATE better_birders_site = 2;

Because of the ON DUPLICATE KEY component, when a row we try to insert has the same email address as an existing row, no new row is added; instead, the better_birders_site column of the existing row is set to 2. The rest will be inserted with their better_birders_site column set to 1. That’s what we wanted.
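One way to confirm the outcome is to list the rows the statement touched. The query below is a minimal sketch that uses only columns already mentioned in this section; the ORDER BY is just a convenience for reading the output.

```sql
-- Rows with 1 were newly inserted from Better Birders;
-- rows with 2 already existed and were only updated.
SELECT name_first, name_last, email_address, better_birders_site
FROM humans
WHERE better_birders_site > 0
ORDER BY better_birders_site, name_last;
```

The client's "rows affected" message gives another hint: with ON DUPLICATE KEY UPDATE, MySQL counts each newly inserted row as 1 and each updated existing row as 2, so the total can be larger than the number of rows supplied.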

We now need to insert rows for these new members in the prize_winners table. We’ll use the INSERT…SELECT statement as we did earlier, but this time we’ll just insert rows where the value of the better_birders_site column is 1:

INSERT INTO prize_winners (human_id)
SELECT human_id FROM humans
WHERE better_birders_site = 1;

Although these two SQL statements worked well, it’s possible that there might be two entries for someone in the humans table if they used a different email address on the other site. That possibility may already exist with our existing members if they registered on the site more than once. Let’s check for this possibility and add a column to note it. We’ll enter the following SQL statements to prepare:

ALTER TABLE humans
ADD COLUMN possible_duplicate TINYINT DEFAULT 0;

CREATE TEMPORARY TABLE possible_duplicates
(name_1 VARCHAR(25), name_2 VARCHAR(25));

The first statement added a column to the humans table to note a row as a possible duplicate entry. The second creates a temporary table. A temporary table is accessible only to your MySQL client connection. When you exit from the client, the temporary table will be dropped automatically. Because MySQL won’t let us update a table while selecting from that same table in a subquery, we’ll first record the duplicates in this temporary table. We’ll use INSERT…SELECT to do this:

INSERT INTO possible_duplicates
SELECT name_first, name_last
FROM
  (SELECT name_first, name_last, COUNT(*) AS nbr_entries
   FROM humans
   GROUP BY name_first, name_last) AS derived_table
WHERE nbr_entries > 1;

This statement uses a subquery that selects the names and counts the number of entries based on the GROUP BY clause. We saw how to use GROUP BY and COUNT() together in Counting and Grouping Results, but their use here calls for a reiteration of how they work.

The subquery selects name_first and name_last, and groups them so that any rows containing the same first and last names will be grouped together. They can then be counted. We give the result of COUNT(*) an alias of nbr_entries so that we can reference it elsewhere.

Back in the main SQL statement, the WHERE clause selects only rows from the subquery in which there is more than one entry (i.e., nbr_entries is greater than 1). These are the possible duplicate entries. This SQL statement will insert a row into the temporary table for each first and last name combination that appears more than once in the humans table. It should enter only one row in the temporary table for each such name.
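The same filtering can also be written without a derived table by using a HAVING clause, which filters groups after they have been counted. This is offered only as an equivalent sketch for previewing the names, not as a replacement for the statement above:

```sql
-- Preview the names that occur more than once in humans.
-- HAVING filters on the aggregated count, which WHERE cannot do directly.
SELECT name_first, name_last, COUNT(*) AS nbr_entries
FROM humans
GROUP BY name_first, name_last
HAVING COUNT(*) > 1;
```

Running a preview like this before inserting into the temporary table makes it easy to see how many names will be flagged.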

Now that we have a list of possible duplicates in the temporary table, let’s update the humans table to note them:

UPDATE humans, possible_duplicates
SET possible_duplicate = 1
WHERE name_first = name_1 AND name_last = name_2;

That will set the value of the possible_duplicate column to 1 where the names in the humans table match the names in possible_duplicates. When we’re ready, we can send an email to these members telling them that we have two entries for their names and asking if the entries are duplicates. If they are, we might be able to merge the information together (such as by creating another column for a second email address) and delete the duplicate rows. As for the temporary table, it will be deleted when we close the MySQL client.
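As a follow-up, the flagged rows can be listed so that entries sharing a name appear next to each other, which may help when preparing those emails. The sketch below assumes only the columns used earlier in this section. The explicit DROP is optional, since the temporary table disappears when the connection closes.

```sql
-- List the rows marked as possible duplicates, grouped visually by name.
SELECT human_id, name_first, name_last, email_address
FROM humans
WHERE possible_duplicate = 1
ORDER BY name_last, name_first, human_id;

-- Optional: remove the temporary table now instead of waiting for disconnect.
DROP TEMPORARY TABLE IF EXISTS possible_duplicates;
```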

From the book Learning MySQL and MariaDB (pages 171-174)