Rails Quick Tips - Random Records
Ordering records returned from ActiveRecord in a random manner is on the chopping block today.
Scenario: I want to return 5 random users
How would we go about solving this? Well maybe a first pass would look something like this?
User.all.shuffle[0..4]
So on this first pass we are grabbing all User
records, shuffling them around, and then grabbing the first 5 records. This solves the problem but why would we do something different?
First we are returning all User
records just to return 5! Doing something like this is almost always a bad idea especially when dealing with User
type records. They are likely to have a high record count in your application. When you return them all, each one is taking up memory just for you to return 5.
Let's alleviate this problem a bit with another solution.
ids = User.pluck(:id).shuffle[0..4]
User.where(id: ids)
Now we are making an extra database call in favor of not returning a bunch of unnecessary ActiveRecord
objects. First we [pluck
][pluck_article] the id
column which returns an id
for all User
records. Then we [shuffle
][shuffle_doc] that array and choose the first 5 records.
Well this is pretty good, we are saving our memory for the records we actually want. This again solves our problem but it's a bit more to look at (with the extra DB call).
We can alleviate this problem a bit with yet another solution.
# postgresql
User.limit(5).order("RANDOM()")
Let's review what is happening here. We are querying for User
records ordered randomly and limiting the return set to 5 results. Our original problem to solve was: I want to return 5 random users. Not only have we written code to solve this issue but the code written is expressed very closely to what we wanted if it were read out loud.
What changed? The main difference is the RANDOM()
passed to order
. For those who think this looks a bit odd, this is a database specific function. You can pass all sorts of SQL
fragments to ActiveRecord
that just get joined in to make a comple SQL
statement. This is just another fragment. The call being made actually looks like this:
SELECT * FROM users ORDER BY RANDOM() LIMIT 5
Now I know what you may be thinking... "ActiveRecord is database agnostic! Why would you bind yourself to a specific implementation!?" The simple answer is yes, ActiveRecord
is database agnostic and yes it is great. It gives the developer the option to choose which database is best to solve your specific problems. Now just think that once your database choice has been made, how often if ever are you going to make a change to another? Databases are very good at what they were made to do... let them be good at it. Don't be afraid to use database specific features and functions to improve your development. This particular example has been granted to us by [PostgreSQL][postgresql], which is my go-to DB of choice.
Not using PostgreSQL? This similar functionality can be had by most databases. Consult with your documentation. Here is a [MySQL][mysql] example to show how the same functionality can be had:
# mysql
User.limit(5).order("RAND()")
[mysql]: http://www.mysql.com/
[pluckarticle]: http://blog.hashrocket.com/posts/rails-quick-tips-activerecord-calculations-pluck
[postgresql]: http://www.postgresql.org/
[shuffledoc]: http://ruby-doc.org/core-2.0/Array.html#method-i-shuffle
[Original Article][link]
[link]: http://blog.hashrocket.com/posts/rails-quick-tips-random-records
Written by Matt Polito
Related protips
12 Responses
ORDER BY RAND() should be avoided, it's slow. If you have really small table it's fine. But it does not scale well.
The 'randumb" gem is a great ActiveRecord extension that provides the db agnostic calls and allows you to continue chaining other arel methods. It also scales better on large tables than just using RANDOM. Check out the source code for details.
ids = User.ids.sample(5)
User.find ids
@narkoz I just threw together the examples that would be worse than the end implementation. You are correct though, sample
would have been a better choice to implement that way.
Thanks!
@natebird Thanks for pointing out randumb. I've seen this in the past but never used it. It is DB agnostic as much as my described implementation is. The gem is only using either of two database functions ('RANDOM() and RAND()'). You always need to evaluate your options when it comes to your specific scenario... in regards to high record count and performance.
Thanks again for checking out the article!
@natebird Why did you say that 'randumb' scales better?
In the repository the author says that he already uses the rand function, and he says that his gem could have the same problem with high databases.
I will add another approach:
Take a random number within the Object length, and get some like 20 records (In Active Record), then suffle it and take 5 (in Ruby not Active Record). It could be better in high database, but maybe not random enough.
If you have large tables in Postgres, you can utiilise a fast stats based count and use that as the offset like so :
SELECT * FROM mytable OFFSET (random() * (SELECT reltuples::bigint AS estimate FROM pgclass WHERE relname='my_table')) LIMIT 1;
So, total rows for a table : SELECT reltuples::bigint AS estimate FROM pgclass WHERE relname='mytable'
You would need to -minus the total rows with the same number as the LIMIT.
The biggest poisions in Rails are seeding a few records and Active Record. It means that development is done with data that doesn't show the horrific performance problems until pushed to Production, at which point all panic breaks loose and developers who are then unused to SQL have to try to understand the problem they created.
Hey folks, I'm the author of randumb and can speak a little to this topic. @natebird may have read that 'scales better' on one of the earlier versions of my README. I initially wrote randumb using the 2 query method (select ids, shuffle in ruby, then select records), I tested a number of scenarios with my data set and declared it to be faster and wrote that on my gem's description.
However, eventually someone pointed out to me that you have to load every ID into memory in order to do this, which could potentially crash your app if your dataset is large enough, which is way worse than a slow query in my opinion.
I decided to revisit my approach. I had barely looked at 'order by RAND()' initially since every blog post ever seemed to suggest avoiding it, so I did some benchmarking comparing the two methods. I wish I still had the hard numbers on hand to share, but in a nutshell: it depends.
Sometimes order by RAND() was faster and sometimes the id select method was faster, it totally depended on the size of data set, if you were scoping down the query with where clauses or not, the DB you were using, etc. I opted to default to 'order by RAND()' to error on the side of not crashing your app server, and I left the shuffle method as another option you can still use if you want to, and wrote this for people using the gem to read: https://github.com/spilliton/randumb#a-note-on-performance
One thing I know for sure about this problem is that selecting random records will never be as fast as selecting records via an indexed column. If the random selection method you are using is too slow for a page you are rendering, then you should likely explore something like selecting those at another time and caching them. On my site (http://www.comparevinyl.com/) I cache about 20 different versions of each HTML partial where I render random data, this way I only incurr the hit 20 times and it still gives the illusion or randomness to the casual user.
Hope this helps...
Thanks @spilliton for going through your thought process on this. I'd have loved to see those benchmarks.
No problem! Maybe I'll do some over the weekend or something, if I do I'll report back here :)
@mattpolito I threw together some benchmarking code yesterday to compare the two methods using my Gem on sqlite3 and postgres. I was in a hurry so I didn't do any examples with where() clauses, they are all getting 1, 10, 50, 100, 250 random records from tables sized starting with 10,000,000 down to 10,000 (I'm starting large so I can just delete down to the size I need from the initial population).
Feel free to let me know if you see any issues with my benchmarking method, or to modify it at all with new tests or try a new DB :)
Test Code (on benchmarking branch on randumb): https://github.com/spilliton/randumb/blob/benchmarking/test/benchmark_test.rb
Output:
https://gist.github.com/spilliton/5743965
In these tests, 'order by random()' always outperformed the 2 query shuffle method.