No problem! Maybe I'll do some over the weekend or something, if I do I'll report back here :)
Hey folks, I'm the author of randumb and can speak a little to this topic. @natebird may have read that 'scales better' on one of the earlier versions of my README. I initially wrote randumb using the 2 query method (select ids, shuffle in ruby, then select records), I tested a number of scenarios with my data set and declared it to be faster and wrote that on my gem's description.
However, eventually someone pointed out to me that you have to load every ID into memory in order to do this, which could potentially crash your app if your dataset is large enough, which is way worse than a slow query in my opinion.
I decided to revisit my approach. I had barely looked at 'order by RAND()' initially since every blog post ever seemed to suggest avoiding it, so I did some benchmarking comparing the two methods. I wish I still had the hard numbers on hand to share, but in a nutshell: it depends.
Sometimes order by RAND() was faster and sometimes the id select method was faster, it totally depended on the size of data set, if you were scoping down the query with where clauses or not, the DB you were using, etc. I opted to default to 'order by RAND()' to error on the side of not crashing your app server, and I left the shuffle method as another option you can still use if you want to, and wrote this for people using the gem to read: https://github.com/spilliton/randumb#a-note-on-performance
One thing I know for sure about this problem is that selecting random records will never be as fast as selecting records via an indexed column. If the random selection method you are using is too slow for a page you are rendering, then you should likely explore something like selecting those at another time and caching them. On my site (http://www.comparevinyl.com/) I cache about 20 different versions of each HTML partial where I render random data, this way I only incurr the hit 20 times and it still gives the illusion or randomness to the casual user.
Hope this helps...
@mattpolito I threw together some benchmarking code yesterday to compare the two methods using my Gem on sqlite3 and postgres. I was in a hurry so I didn't do any examples with where() clauses, they are all getting 1, 10, 50, 100, 250 random records from tables sized starting with 10,000,000 down to 10,000 (I'm starting large so I can just delete down to the size I need from the initial population).
Feel free to let me know if you see any issues with my benchmarking method, or to modify it at all with new tests or try a new DB :)
Test Code (on benchmarking branch on randumb): https://github.com/spilliton/randumb/blob/benchmarking/test/benchmark_test.rb
Output:
https://gist.github.com/spilliton/5743965
In these tests, 'order by random()' always outperformed the 2 query shuffle method.