Removing Duplicate Entries on MySQL Database

Maybe you work with some old database that has many duplicate entries on some tables? This line of SQL can help you detect whether your table has duplicates.

SELECT user_name, COUNT(user_name) c 
FROM user
GROUP BY user_name HAVING c > 1;

After googled here and there, I found that most of the solution is to create a temporary table filled with distinct values from the original table, empty the original table, then replace the entries with the temporary table. WHAT A MESS if the table is 800,000 entries. I’ll be dead!

I found this small discussion over here. It’s much simpler and much better.

The idea is :

  1. Here you tell mysql that there is a table1.
  2. Then you tell it that you will use table1 and a virtual table with the values of table1.
  3. This will let mysql not compare a record with itself! We will use “greater than” instead of “not equal”, so that this SQL’s left the last entered.
  4. Here you tell it that there shouldn’t be records with the same field_name.

So that your SQL will be like this :

delete from table1
USING table1, table1 as vtable
WHERE (table1.ID > vtable.ID)
AND (table1.field_name=vtable.field_name)

If you’re nervous about deleteing everything, you can make sure it’s working first:

SELECT *
FROM table1, table1 AS vtable
WHERE vtable.id > table1.id
AND vtable.field_name = table1.field_name

You probably find these two problems.

  1. Handling the table alias. It will show error such as “Unknown table ‘X’ in MULTI DELETE“. Solution : do not give the original table the alias.
  2. Your Table do not have primary key (relationship table). So it is confusing which column that will be put into the third line in that code. Solution : Create a new primary key+auto increment, even though your table is a relation table only. YOU CAN DELETE IT LATER!

What should you do after remove the duplicate entries?

Create index UNIQUE for any column so that there will be no duplicates in the future. You can delete the temporary primary key, although it doesn’t really hurt to put primary key in relational tabel. I better keep the primary key for my table😀

***

You can try the other method here.

tmp = table name;
a = cols 1
b = cols 2

ALTER IGNORE TABLE tmp ADD UNIQUE INDEX(a,b);

before :

id | a | b
1 | 1 | 2
2 | 1 | 2
3 | 2 | 3
5 | 3 | 4

after:

id | a | b
2 | 1 | 2
3 | 2 | 3
5 | 3 | 4

You can delete the index created later. Note that the index should contain at least two columns. I’ve tried with only one and failed.

Enjoy!

Tinggalkan Balasan

Isikan data di bawah atau klik salah satu ikon untuk log in:

Logo WordPress.com

You are commenting using your WordPress.com account. Logout / Ubah )

Gambar Twitter

You are commenting using your Twitter account. Logout / Ubah )

Foto Facebook

You are commenting using your Facebook account. Logout / Ubah )

Foto Google+

You are commenting using your Google+ account. Logout / Ubah )

Connecting to %s