Last Updated: February 25, 2016
·
2.26K
· BurningNeuron

Deleting Duplicate Tuples in MySQL with Entities Using Pseudo Primary Keys

<strong>Summary</strong></p>

<a title="Source: Flickr" target="blank" href="http://www.flickr.com/photos/popilop/331357312/in/photostream/"><img height="180" width="240" src="http://farm1.static.flickr.com/128/331357312893ef9e791m.jpg" align="right" /></a> If you have an entity where there is a pseudo primary key based on a sequence (auto increment) that has duplicates, finding and deleting duplicates can be a little tricky, but here's how:</p>


<a href="https://gist.github.com/775417">https://gist.github.com/775417
</a>
</p>


--Make a temporary table for unneeded PKs create temporary table dirty
identifier (pseudokey integer);
-- Get the PKs we don't need
insert into dirty
identifier (
select dt1.pseudopk from dirtytable as dt1, dirtytable as dt2 where
dt1.true
pk1 = dt2.truepk1
-- and dt1.true
pk2 = dt2.truepk3
-- and ... keep repeating for however many attributes make your real key
and dt1.pseudo
pk > dt2.pseudopk
);
-- Finally delete the unneeded tuples from the dirty table
delete from dirty
table where pseudopk in (select pseudopk from dirty_identifier);
</code></pre>
<br/>

<strong>Background</strong></p>

Without getting into it to much I work with a MySQL database that sits under a closed source CMS. The database has no concept of referential integrity or natural keys and uses a sequence on every entity as the primary key. A recent update from the CMS vendor has seen duplicate rows appearing.</p>

There are three entities in question: the user, the section and the sectionaccess. The sectionaccess contains the section access rights for a user and uses a userid and sectionid to reference the relevant entities. However, it also has an id that is used as the primary key, rather than using (sectionid, userid). So recently this entity has been seeing duplicate tuples.</p>

<strong>Duplicate Detection</strong></p>

In a sane database the distinct rows can be found simply by using select distinct * from sectionaccess</code>, however in this section the distinctiveness is polluted by the pseudo primary key (in this case, id).</p> To find all the tuples that are duplicates and how many of each there are we can use something like select userid, sectionid, count(*) as cnt from sectionaccess group by userid, sectionid having cnt > 1</code>. But this doesn't help us much because of the pseudo primary key.</p> What we need to do is cross reference each tuple with every other tuple in the entity on the real primary key (sectionid, userid) with a cross join. If a tuple based on the true primary key exists only once then the cross reference will occur only once and the pseudo primary key (id) will match. Where there are multiple tuples on the real primary key, the cross reference will have produce multiple tuples - some of which will have miss-matched pseudo primary keys. In my case I have assumed that the tuple with the lowest pseudo-primary key is to be kept and all the other discarded:</p> select sa1.id from sectionaccess as sa1, sectionaccess as sa2 where sa1.userid=sa2.userid and sa1.sectionid=sa2.sectionid and sa1.id > sa2.id</code></p> <strong>Deleting the Duplicates</strong></p> <a href="https://gist.github.com/775417">https://gist.github.com/775417 </a> </p>