How to Find and Delete Orphaned Records with Ruby on Rails
Let's say you have an application where a User can subscribe to a
Channel. With ActiveRecord associations, it would look something like
this:
# app/models/subscription.rb
class Subscription < ActiveRecord::Base
belongs_to :channel
belongs_to :user
end
# app/models/user.rb
class User < ActiveRecord::Base
has_many :subscriptions
has_many :users, through: :subscriptions
end
# app/models/channel.rb
class Channel < ActiveRecord::Base
has_many :subscriptions
has_many :users, through: :subscriptions
end
Unfortunately, someone forgot to add dependent: :destroy
to the
has_many :subscriptions
. When a user or channel was deleted, an
orphaned subscription was left behind.
This issue was fixed by dependent: :destroy
, but there was still a
large number of orphaned records lingering around.
There are three ways you can use to remove the orphaned records.
Approach #1 — Bad
Subscription.find_each do |subscription|
if subscription.channel.nil? || subscription.user.nil?
subscription.destroy
end
end
This executes a separate SQL query for each record, checks whether it is
orphaned, and destroys it if it is.
Approach #2 — Better, but still pretty bad
Subscription.all.each do |subscription|
if subscription.channel.nil? || subscription.user.nil?
subscription.destroy
end
end
This loads all records into memory, and then iterates over them
performing the same check as above.
Approach #3 — Good
Subscription.where([
"user_id NOT IN (?) OR channel_id NOT IN (?)",
User.pluck("id"),
Channel.pluck("id")
]).destroy_all
This approach first gets the IDs of all Users and Channels, and then
executes one query to find all Subscriptions that don't belong to either
a User or a Query.
Benchmarks
Let's take a look at how long it takes in each case.
When you run it on 2596 subscriptions, out of which 1058 are orphaned,
you get:
user system total real
bad 3.020000 0.160000 3.180000 ( 4.058246)
better 2.950000 0.170000 3.120000 ( 3.982329)
good 0.010000 0.000000 0.010000 ( 0.030346)
Even as the number of subscriptions is reduced,
Subscription.all.each
remains faster than
Subscriptions.find_each
, because it performs less SQL queries.
With 10 subscriptions, the results were
user system total real
bad 0.010000 0.000000 0.010000 ( 0.022374)
beter 0.010000 0.010000 0.020000 ( 0.017584)
good 0.010000 0.000000 0.010000 ( 0.014330)
Conclusion
Approach #3 is a whopping 134 times faster than approach #1.
While the time difference in our scenario was only 4 seconds, it could
be hours for millions of records.
This demonstrates how important refactoring is to the development
process. Refactoring your code furthers your understanding of the
technologies you work with. It also illustrates how important it is to
minimize the SQL queries performed within your web application. If this
was an action initiated by the user, the difference in the request time
would be 4.02 seeconds vs. 30 ms.