Thursday, March 18, 2010

A *real* picture of reddit's conference room

There's been some recent speculation about what it looks like when the reddit admins get together, and we wanted to set the record straight:


On an unrelated note, there have been some dirty rumors that we've been spending all our time playing ping pong instead of working on the site. They are completely unfounded.

Friday, March 12, 2010

She who entangles men


You may remember the problems that we've been having with our persistent cache, memcacheDB. Our initial response was to add a bunch more RAM to the system, a stopgap that would probably only last a few weeks but would buy us time to put a better solution in place. Fortunately, with EC2 we were able to spin up five new machines with gobs of RAM to fill that temporary role. After that, we dedicated 33% of our development team (i.e., me) to swapping it out for a more scalable, long-term backend.

As of this morning we're now running with our persistent cache backed by Cassandra. The migration was seamless (did you notice?), but the impact on our servers' load is palpable.

In case you are wondering why we chose Cassandra: it is way faster, more scalable, and has a rich and active development community full of extremely smart and helpful people. It's in use, or going to be in use, by several large companies (Twitter, Facebook, digg, Rackspace). It gives us the ability to add nodes as load and storage require, and to move non-cache data into it as appropriate.
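
For the curious, here is roughly what talking to it looks like from Python. This is a minimal sketch, not our actual code: it assumes a pycassa-style client, hypothetical "PermaCache"/"permacache" keyspace and column family names, and a single column per row acting as the cached value.

    import pickle
    import pycassa

    # Minimal sketch (not our actual code): one Cassandra column family
    # used as a persistent key/value cache. "PermaCache" and "permacache"
    # are hypothetical keyspace / column family names.
    pool = pycassa.ConnectionPool('PermaCache', ['cass01:9160', 'cass02:9160'])
    permacache = pycassa.ColumnFamily(pool, 'permacache')

    def cache_set(key, value):
        # Each cache entry is a single row with one 'value' column.
        permacache.insert(key, {'value': pickle.dumps(value)})

    def cache_get(key):
        try:
            return pickle.loads(permacache.get(key)['value'])
        except pycassa.NotFoundException:
            return None

    cache_set('listing:pics:hot', [('link_1', 5120), ('link_2', 4876)])
    print(cache_get('listing:pics:hot'))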

We could not have done this in just 10 days without the help of the amazing Cassandra developers and community, and without EC2, which allowed us to bring up new instances on which to test and ultimately deploy Cassandra.

I cannot thank the extremely intelligent Cassandra developers and community enough for their work and their help!

Photo credit: http://en.wikipedia.org/wiki/File:Solomon_Ajax_and_Cassandra.jpg

Monday, March 01, 2010

And a fun weekend was had by all...


A - Normal daily reddit traffic
B - Yesterday's actual traffic
C - What it felt like to us
As some of you may have noticed, reddit has been a tad slow lately. Last night, we put in a change that should fix the problem in the short term until we can get a longer-term fix in place.

We wanted to take this opportunity to fill you in on what caused the problem and how we are fixing it, and to dispel some myths.

In short, we made a technical decision a few years ago that was a good idea at the time, but which is becoming increasingly hard to scale. We need to make some deep changes to fix it, but we put in a band-aid solution of adding more memory for reddit to use in the meantime.

But before we get too far down the technical rabbit hole: while the four of us were busy trying to stop the site from melting down any further, other things were happening that we didn't get a chance to address.

At the end of the day, reddit is both a community and social news site, bound to attract people in the social news business. We have always been about serving up interesting stories and content, all the while trying to ensure that we curb any abuse of the community's good graces. If you like what you see on reddit, good, upvote it. If not, complain, or even make your own community. Above all, if you think someone is abusing the site, tell us.

A witch hunt and a glut of personal details degrade us all. Posting personal information crosses the line, and it has been our policy since the beginning to remove it when we see it or when it is pointed out to us. That said, we are not all-seeing, and we don't have a program that detects personal information and notifies us. While we removed personal info (per our terms of service) when it was shown to us, we obviously didn't get it all.

What happened this weekend saddened us. Saydrah's postings have been a positive addition to the community, and we have no indication that she's been anything but a great moderator to the communities she moderates. Moderators are not exempt from our anti-cheating measures, and, though I hate to have to put it in these terms, we've "investigated" Saydrah and found no indication of her cheating or otherwise abusing her power.

Nerd talk starts here

TL;DR: oh hi i upgraded your RAM

Myths:
The recent site problems are because reddit moved to EC2 -- False
The only reason we were able to fix this at all is that we are on EC2 and could quickly spin up new instances. In fact, we just had to do it again this morning because we needed even more RAM.

But I swear it's been slow since you moved to EC2! -- False
We moved to EC2 in May 2009. We only started getting reports of slowness about three weeks ago, and we could see the same pattern in our logs and monitoring. The number of reports grew as time went on, but since we couldn't reproduce the slowness ourselves, it was hard to track down a specific cause.

reddit is just text, so you guys are clearly morons -- False
While reddit is just text by the time it leaves our servers, there is a whole lot going on under the hood. It is a highly customized user experience, on par with something like Facebook (just with far fewer users). We handle hundreds of transactions every second. While we may be doing some things wrong, it's a lot more than just 'select * from comments where article = "foo"'.

Let's start at the beginning. Here is a simplified version of reddit's architecture:



The area where we are having trouble right now is the purple section in the middle that says "memcaches". Specifically, we are having problems with memcachedb, which is where we store a bunch of precomputed listings: the listing pages, profile pages, inboxes, and pretty much any other list on reddit that is too expensive to calculate on the fly.
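
To make "precomputed listing" concrete, here is a rough sketch of the pattern (hypothetical names, not our actual code, and assuming a cache object with get/set): the expensive sort is done once, ahead of time, and stored under a cache key, so rendering a page is just a lookup and a slice.

    # Rough sketch of a precomputed listing (hypothetical names, not our
    # actual code): the expensive sort happens once, off the request path,
    # and page views just fetch and slice the stored result.
    def listing_key(subreddit, sort):
        return 'listing:%s:%s' % (subreddit, sort)

    def precompute_hot(cache, subreddit, links):
        ranked = sorted(links, key=lambda l: l['hot_score'], reverse=True)
        cache.set(listing_key(subreddit, 'hot'),
                  [(l['id'], l['hot_score']) for l in ranked])

    def render_hot_page(cache, subreddit, page_size=25):
        ids_and_scores = cache.get(listing_key(subreddit, 'hot')) or []
        return [link_id for link_id, _score in ids_and_scores[:page_size]]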

A few years ago, we decided to md5 all of our cache keys. We did this because, at the time, memcached (which is what memcachedb is based on) could only take keys up to a certain length. In fact, the version it is based on still has this limitation. MD5ing the keys was a good solution to this problem, or so we thought.

It turns out that this one little decision makes it so that we can't horizontally scale that layer of our architecture without losing all the data already there (because all of the keys would point to the wrong server if we added a new one). While we could recalculate all the data, it would take weeks to do so.
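
To see why, here is a toy illustration (not our actual code): keys are md5'd to satisfy the old length limit, and each key is assigned to a server by taking the hash modulo the number of servers. Add one more server and most keys silently map somewhere else.

    import hashlib

    # Toy illustration (not our actual code) of the scaling problem:
    # keys are md5'd to fit the old length limit, then assigned to a
    # server by taking the hash modulo the number of servers.
    def server_for(raw_key, servers):
        digest = hashlib.md5(raw_key.encode('utf-8')).hexdigest()
        return servers[int(digest, 16) % len(servers)]

    old_servers = ['cache01', 'cache02', 'cache03']
    new_servers = old_servers + ['cache04']   # add one more node

    keys = ['listing:user:%d' % i for i in range(10000)]
    moved = sum(1 for k in keys
                if server_for(k, old_servers) != server_for(k, new_servers))
    print('%.0f%% of keys now map to a different server'
          % (100.0 * moved / len(keys)))
    # With plain modulo hashing, adding a node moves the large majority of
    # keys, so the data they pointed at is effectively lost from the cache.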

And that is in fact what we will be doing. Memcachedb has served us well, but it is getting old. It can no longer return data fast enough for our needs, due to the way it interacts with BDB (its underlying data store). We will soon be picking a new data store to replace memcachedb and recalculating all of the data for it, most likely by adding the new store into the cache chain and then recalculating whatever is left after a few weeks.
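
For reference, a "cache chain" here just means layering caches and consulting them in order, with a hit in a slower layer written back into the faster ones. A minimal sketch of the idea (not our actual code):

    class CacheChain(object):
        # Minimal sketch of the idea (not our actual code): consult each
        # cache in order, fastest first, and backfill the faster layers
        # whenever a slower one has the value.
        def __init__(self, caches):
            self.caches = caches   # e.g. [memcached, memcachedb, new_store]

        def get(self, key):
            missed = []
            for cache in self.caches:
                value = cache.get(key)
                if value is not None:
                    for faster in missed:
                        faster.set(key, value)
                    return value
                missed.append(cache)
            return None

        def set(self, key, value):
            # Writes go to every layer, so a new store added to the chain
            # fills up from normal traffic over time.
            for cache in self.caches:
                cache.set(key, value)

Dropping the new data store into the chain this way means normal traffic populates it, and whatever never gets touched can be recalculated afterward.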

But in the meantime, what we have done is spin up 5 new instances to run memcached. This allowed us to expand the size of the memcache in front of memcachedb to 6GB (up from 2GB). Right now, this means that about 94% of each database is in RAM. As the site grows, this fix will fail us, but hopefully it buys us enough time to replace the data store.

As usual, if you want to tell us why we suck at our jobs, feel free to leave some comments on this post.