Sunday, July 17, 2011

Nerd talk: The tale of the life of a link on reddit, told in graph porn

reddit has a lot of fresh content being submitted all of the time, so how do we pick what makes the front page? After the tank of manatees has their pick, I mean.

Since I love some good ol' fashioned data mining, I "asked" the guys to let me break into the database machines and put together some graph porn for you. When alienth wasn't looking, I pulled a three-day slice of link votes (a total of 6,064,281 votes) from reddit's votes database and built some Python scripts to generate graphs with matplotlib.

The theoretical trail of karma from submission to the front page is:
  1. Find picture of cat
  2. Add caption
  3. Submit link to reddit
  4. The link goes to the new page, accumulating as many votes as it can from these Knights of /new before it is pushed off.
  5. At this point it can still be shown to readers of the rising page and to users of the "new and upcoming" box on the front page. This box tries to maximise the number of opportunities a link has to receive votes without filling up with spammy links, so it makes some trade-offs to optimise for both.
  6. If it is able to grow here it will be promoted to the Hot page of its community and potentially to the front page (which is essentially the aggregate Hot page for the whole site).
  7. Because the default Hot page cuts off after 25 links, all of the real karmic action starts once a link is promoted from #26 to #25. The number of votes per minute shoots way up, as does the number of comments. At this point, your link has "made it".
  8. To maximise turn-over, reddit doesn't include links on the front page after 24 hours (although this doesn't apply to hot pages for individual communities). If a link survives on the front page this long, its score will flatten out, but it can still receive votes from people who still have the tab open, from that community's front page, etc.
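
For the code-minded, steps 6 through 8 boil down to "sort by hotness, drop anything older than a day, keep the top 25". Here is a minimal sketch of that idea only. The dictionary keys (`id`, `hotness`, `created`) and the `front_page` function are made up for illustration, and this is not how reddit's real listing code is structured.

```python
from datetime import datetime, timedelta

FRONT_PAGE_SIZE = 25                      # default Hot page length (step 7)
MAX_FRONT_PAGE_AGE = timedelta(hours=24)  # front-page age cut-off (step 8)

def front_page(links, now):
    """Return the hottest links that are still young enough to be shown.

    `links` is an iterable of dicts with hypothetical keys
    'id', 'hotness' and 'created'.
    """
    fresh = [l for l in links if now - l["created"] <= MAX_FRONT_PAGE_AGE]
    fresh.sort(key=lambda l: l["hotness"], reverse=True)
    return fresh[:FRONT_PAGE_SIZE]

# Made-up data: 30 recent links plus one very hot link that is 25 hours old.
now = datetime(2011, 7, 17, 12, 0)
links = [
    {"id": i, "hotness": float(i), "created": now - timedelta(minutes=i)}
    for i in range(30)
]
links.append({"id": "old", "hotness": 1000.0, "created": now - timedelta(hours=25)})

page = front_page(links, now)
print([l["id"] for l in page])
```

The old link is dropped despite its hotness, and link #26 in the ordering is the one that misses out on the karmic action.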


Let's see this in action. First, over 80% of votes on reddit are upvotes:



(the small number of "nones" are where someone has voted but rescinded their vote; reddit continues to store that vote but no longer uses it for score or hotness calculations)
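
As a rough illustration of how a breakdown like that can be tallied, here is a small sketch. The `(user, link, direction)` record layout and the sample votes are invented for the example and are not reddit's actual schema.

```python
from collections import Counter

# Hypothetical vote records: (user, link, direction).
# "none" marks a vote that was cast and later rescinded.
votes = [
    ("alice", "cat_pic", "up"),
    ("bob", "cat_pic", "up"),
    ("carol", "cat_pic", "down"),
    ("dave", "cat_pic", "none"),
    ("erin", "cat_pic", "up"),
]

def tally(votes):
    """Count votes by direction."""
    return Counter(direction for _user, _link, direction in votes)

def score(votes):
    """Ups minus downs; rescinded ("none") votes stay stored but are ignored."""
    counts = tally(votes)
    return counts["up"] - counts["down"]

counts = tally(votes)
print(counts)
print(score(votes))
```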

Most of reddit's traffic comes in while people are arriving at work in the US, and you can see it in the voting patterns:



(if you look closely you'll see that on this random weekday, reddit received over 300,000 votes per hour at peak, or about 83 votes per second)
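
The per-second figure is just the hourly count divided by 3,600. A sketch of the bucketing and the arithmetic, assuming Python 3 and using made-up timestamps (the function names are mine, not from the original scripts):

```python
from collections import Counter
from datetime import datetime, timedelta

def votes_per_hour(timestamps):
    """Bucket vote timestamps by the hour in which they were cast."""
    return Counter(
        ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps
    )

def per_second(votes_in_hour):
    """Convert an hourly vote count into an average per-second rate."""
    return votes_in_hour / 3600

# Made-up sample: one vote every 20 minutes for two hours.
start = datetime(2011, 7, 13, 9, 0)
sample = [start + timedelta(minutes=20 * i) for i in range(6)]
buckets = votes_per_hour(sample)
print(buckets)

# The peak quoted above: 300,000 votes in an hour.
print(round(per_second(300_000), 1))
```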

Not every link will be a winner though. Most have one or two votes and are never promoted from the new page:



(note that the X axis there is log2)
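
If you want to build a similar histogram yourself, log2 buckets are easy to compute without any plotting library. This sketch assumes the X axis is the number of votes per link, and the vote counts are invented:

```python
from collections import Counter

def log2_bucket(n_votes):
    """Bucket index for a positive integer: 1 -> 0, 2-3 -> 1, 4-7 -> 2, ..."""
    if n_votes < 1:
        raise ValueError("a link needs at least one vote to be bucketed")
    # bit_length() - 1 is floor(log2(n)) for positive integers, with no
    # floating-point rounding to worry about.
    return n_votes.bit_length() - 1

def histogram(votes_per_link):
    """Count how many links fall into each log2 bucket."""
    return Counter(log2_bucket(n) for n in votes_per_link)

# Made-up vote counts: mostly duds, one hit.
hist = histogram([1, 1, 2, 3, 4, 1000])
for bucket in sorted(hist):
    print(f"{2 ** bucket:>5}+ votes: {hist[bucket]} link(s)")
```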

Now for the good stuff to really see that lifecycle. Here are some popular links that made the front page and their scores over time:



You can see that to become as popular as they were, they had a lot of growth very early on. Compare those lifecycles to these entirely random links (which, by the above histograms, we know are mostly failures). Here again we can see that the ones that are going to differentiate themselves do so extremely early on in their life:




Here are some timeline snapshots of the front page and a picture of the movement of ranks along it:




If you can make it out among all of the crazy wavy lines you'll see that individual links follow an arc of growth and then decay even after they hit the front page.

Warning: giant nerd talk follows

Successful links grow so fast because they have to, or they die. To understand this, let's take a look at how scoring and "hotness" work. Most people who write over-a-weekend reddit clones (do it! it's a fun exercise) sort their front pages using the somewhat obvious method: an hourly batch job that scores all links in the database and decays that score based on the timestamp of the link (maybe boosting for some other values like number of comments or clicks). reddit used to work this way millions of years ago, but that quickly became untenable. You see, reddit's working set of active links is pretty small (maybe ten thousand these days), but this method requires calculating against all of the old links from previous days too, which adds up pretty quickly. reddit now uses a system that is updated in real time on every vote by calculating a "hotness" value for that link (the code). Amir Salihefendic does a good job of explaining it in English:



Here is a visualization of the score for a story that has the same number of upvotes and downvotes, but a different submission time:



In effect, as a link ages its score needs to grow by orders of magnitude just to have the same hotness as a link that was just submitted and has 1 point. So links that don't continually grow by orders of magnitude very early on die off, and of course those that do end up with so many votes that they skyrocket in score.
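
To make the "orders of magnitude" point concrete, here is a sketch along the lines of the formula as it was publicly described at the time: the log10 of the score plus the submission time divided by a constant. The two constants (`REFERENCE_SECONDS` and `DECAY_SECONDS`) are not stated in this post, so treat them as assumptions taken from that public description rather than as gospel. Note that the value depends only on the vote counts and the submission time, which is why it can be recomputed on each vote instead of in a batch job.

```python
from datetime import datetime, timedelta
from math import log10

EPOCH = datetime(1970, 1, 1)
REFERENCE_SECONDS = 1134028003  # assumed: arbitrary fixed reference timestamp
DECAY_SECONDS = 45000           # assumed: 12.5 hours per order of magnitude

def epoch_seconds(date):
    """Seconds between the Unix epoch and a naive UTC datetime."""
    return (date - EPOCH).total_seconds()

def hot(ups, downs, date):
    """Hotness from vote counts and submission time; no batch decay needed."""
    s = ups - downs
    order = log10(max(abs(s), 1))   # 1 point -> 0, 10 -> 1, 100 -> 2, ...
    sign = (s > 0) - (s < 0)        # +1, 0 or -1
    seconds = epoch_seconds(date) - REFERENCE_SECONDS
    return round(order + sign * seconds / DECAY_SECONDS, 7)

# Made-up example: an older link with 10 points versus a link submitted
# 12.5 hours later that has only its submitter's single point.
submitted = datetime(2011, 7, 17, 0, 0)
older = hot(10, 0, submitted)
newer = hot(1, 0, submitted + timedelta(seconds=DECAY_SECONDS))
print(older, newer)
```

Under these assumptions the two values come out (near enough) equal: every 12.5 hours of age costs a link a factor of ten in score just to stay level with a brand-new submission.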

Giant nerd talk over, returning to normal nerd levels

This has been a nerdy guest post by ketralnis, former reddit admin (you can take the admin away from reddit, but you can never take the reddit out of the admin). If you got this far please let me know in the comments if you enjoyed this and/or have other ideas for similar ones!
discuss this post on reddit