Have you ever wondered how your favorite social news aggregators such as Reddit or Hacker News rank their links? They make use of a couple of simple ranking algorithms to show you the most interesting stories and comments on top. So let's this how these work so that you can start using them in your own projects.
hot algorithm for reddit has changed on January 12, 2014. I wrote an extensive article on the impact of this change over at my blog.
This is the kind of algorithm used for the frontpage. It's much simpler than you might think. It calculates a score for each item based on the number of votes and then substracts points based on the age of the item.
A simple version of this algorithm could look like this:
score = upvotes - downvotes score = score + age_in_seconds # newer means a higher age which should net a higher score
In practice, this implementation is not really useful so let's take a look at how Reddit does this:
cpdef double _hot(long ups, long downs, double date): """The hot formula. Should match the equivalent function in postgres.""" s = score(ups, downs) order = log10(max(abs(s), 1)) if s > 0: sign = 1 elif s < 0: sign = -1 else: sign = 0 seconds = date - 1134028003 return round(order + sign * seconds / 45000, 7) cpdef long score(long ups, long downs): return ups - downs
As you can see, Reddit starts by calculating a simple
ups - downs score which it puts on a logarithmic scale. Why? Is a link with a score of 5000 really much better than one with 4000? How about 10 vs 30? This logarithmic scale reduces the impact of additional votes unless there's significantly more.
It then calculates a
seconds value which is an integer representation of the age of the link. Where does the
1134028003 coming from? It's the UNIX timestamp for Thursday December 8th 2005 7:46:43 AM UTC, the date they introduced the hot ranking to the public.
Finally, it divides this
seconds value by 45,000 which is 12,5 hours. This means that every 12.5 hours older than Thursday December 8th 2005, new stories have 1 more point than their older peers.
That's it for the hot ranking algorithm, if you want to see exactly how the log scale and age affect the rankings, I encourage you to go read this in depth article about Reddit ranking algorithms.
The Top ranking algorithm simply determines which items have the top scores, it does not discriminate against age and will always return the item with the highest score. This might seem good enough for a news aggregator, but after a few days only, you would notice that the list never changes. This algorithm gives a tremendous advantage to the older items and to those that got an early score advantage, creating a snowball effect and burying any item that wasn't lucky enough to get a few early upvotes.
This one is a little trickier. It is used to show the items that are the most likely to be of interest on top. How does it do that? By using calculating a confidence score interval. More specifically, Reddit uses the Wilson Score Interval. Basically, a confidence score interval takes into account the quantity of votes to determine whether the score represents appropriately the value of the item. The algorithm is less confident about the score of an item with only a few votes than one with hundreds of votes. That way an item with 3 positive votes can rank worse than one with 300 hundreds positive and a dozen or so negative.
I will not drill down into Reddit's algorithm here since there are no [magic numbers](https://en.wikipedia.org/wiki/Magic_number_(programming#Unnamed_numerical_constants).
Once again, for more information and to see how the upvotes, downvotes and the number of votes affects ranking, read Amir Salihefendic's article about Reddit algorithms.
The latest items show up first, I don't think I need to bother explaining this one.