Last Updated: February 25, 2016

Deleting Duplicate Tuples in MySQL with Entities Using Pseudo Primary Keys

Summary

<a title="Source: Flickr" target="blank" href="http://www.flickr.com/photos/popilop/331357312/in/photostream/"><img height="180" width="240" src="http://farm1.static.flickr.com/128/331357312893ef9e791m.jpg" align="right" /></a> If you have an entity where there is a pseudo primary key based on a sequence (auto increment) that has duplicates, finding and deleting duplicates can be a little tricky, but here's how:

<a href="https://gist.github.com/775417">https://gist.github.com/775417
</a>

--Make a temporary table for unneeded PKs create temporary table dirty
identifier (pseudokey integer);
-- Get the PKs we don't need
insert into dirtyidentifier (
select dt1.pseudopk from dirtytable as dt1, dirtytable as dt2 where
dt1.truepk1 = dt2.truepk1
-- and dt1.truepk2 = dt2.truepk3
-- and ... keep repeating for however many attributes make your real key
and dt1.pseudopk > dt2.pseudopk
);
-- Finally delete the unneeded tuples from the dirty table
delete from dirtytable where pseudopk in (select pseudopk from dirty_identifier);
</code></pre>

Background

Without getting into it to much I work with a MySQL database that sits under a closed source CMS. The database has no concept of referential integrity or natural keys and uses a sequence on every entity as the primary key. A recent update from the CMS vendor has seen duplicate rows appearing.

There are three entities in question: the user, the section and the sectionaccess. The sectionaccess contains the section access rights for a user and uses a userid and sectionid to reference the relevant entities. However, it also has an id that is used as the primary key, rather than using (sectionid, userid). So recently this entity has been seeing duplicate tuples.

Duplicate Detection

In a sane database the distinct rows can be found simply by using select distinct * from sectionaccess</code>, however in this section the distinctiveness is polluted by the pseudo primary key (in this case, id). To find all the tuples that are duplicates and how many of each there are we can use something like select userid, sectionid, count(*) as cnt from sectionaccess group by userid, sectionid having cnt > 1</code>. But this doesn't help us much because of the pseudo primary key. What we need to do is cross reference each tuple with every other tuple in the entity on the real primary key (sectionid, userid) with a cross join. If a tuple based on the true primary key exists only once then the cross reference will occur only once and the pseudo primary key (id) will match. Where there are multiple tuples on the real primary key, the cross reference will have produce multiple tuples - some of which will have miss-matched pseudo primary keys. In my case I have assumed that the tuple with the lowest pseudo-primary key is to be kept and all the other discarded: select sa1.id from sectionaccess as sa1, sectionaccess as sa2 where sa1.userid=sa2.userid and sa1.sectionid=sa2.sectionid and sa1.id > sa2.id</code> Deleting the Duplicates <a href="https://gist.github.com/775417">https://gist.github.com/775417 </a>

#mysql

#database

#delete duplicates