Wednesday, January 25, 2012

January 2012 - State of the Servers

My fellow redditors: the state of our servers is strong.

2011 was a year of explosive growth and daunting technical hurdles. Our infrastructure has changed dramatically over the past 12 months. I'm here to show you some of the more technical details of the changes that have been made, and dazzle you with fanciful talk of the future.

To look at just the numbers, in December of 2010 we had 829 million pageviews and 119 servers. Today, we have 2.07 billion pageviews with 240 servers. That's an increase of 149% for pageviews and 101% for servers.
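
(Curious about the arithmetic? Here's a trivial sketch using the rounded totals above.)

    # Growth from December 2010 to December 2011, expressed as multipliers.
    # These use the rounded totals quoted above, so treat the output as approximate.
    pageviews_2010, pageviews_2011 = 829e6, 2.07e9
    servers_2010, servers_2011 = 119, 240

    print("pageviews grew %.1fx" % (pageviews_2011 / pageviews_2010))  # ~2.5x
    print("servers grew %.1fx" % (servers_2011 / servers_2010))        # ~2.0x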

Postgres

Some of the lengthier downtimes in 2011 were due to complications surrounding our Postgres infrastructure. The main issue was that whenever EBS volumes would slow down on our masters (which happened often), our database replication system, Londiste, would break. This required us to rebuild the broken slaves and try to keep the site running while the long rebuild process completed. These replication breaks also caused data corruption on the slaves, resulting in bad data persisting in cache even after the slaves were fixed. In short, it was a huge mess of work whenever this happened.
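
Here's a rough sketch of the kind of master-versus-slave spot check that surfaces a broken slave before bad data lands in the cache. It assumes psycopg2 and a hypothetical "things" table; it's an illustration, not the tooling we actually run.

    # Compare recent rows on the master against a slave. With healthy (if lagged)
    # replication, mismatches should be few and disappear on a re-check;
    # persistent or growing mismatches mean the slave is broken or corrupt.
    import psycopg2

    MASTER_DSN = "host=db-master dbname=reddit"   # hypothetical connection strings
    SLAVE_DSN = "host=db-slave01 dbname=reddit"
    QUERY = "SELECT id, ups, downs FROM things ORDER BY id DESC LIMIT 100"

    def fetch(dsn):
        conn = psycopg2.connect(dsn)
        try:
            cur = conn.cursor()
            cur.execute(QUERY)
            return dict((row[0], row[1:]) for row in cur.fetchall())
        finally:
            conn.close()

    master, slave = fetch(MASTER_DSN), fetch(SLAVE_DSN)
    for id_, row in master.items():
        if slave.get(id_) != row:
            print("mismatch on id %s: master=%r slave=%r" % (id_, row, slave.get(id_)))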

The bug causing the replication breaks was unfortunately very transient and could not be reliably reproduced in testing. However, a complete upgrade from Postgres 8 to Postgres 9 appeared to resolve the issue. That upgrade was completed in July, and we have seen nary a replication problem since. We're currently just shy of 2TB of data in Postgres, which takes an awfully long time to replicate.
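
(The 2TB figure comes straight from Postgres' built-in size functions. Something like the following, with made-up connection details, is all it takes to check.)

    # Print each database and its on-disk size, largest first.
    import psycopg2

    conn = psycopg2.connect("host=db-master dbname=reddit")  # hypothetical DSN
    cur = conn.cursor()
    cur.execute("""
        SELECT datname, pg_size_pretty(pg_database_size(datname))
        FROM pg_database
        ORDER BY pg_database_size(datname) DESC
    """)
    for name, size in cur.fetchall():
        print("%s: %s" % (name, size))
    conn.close()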

Farewell, EBS

One of the more painful lessons of the past year has been that EBS' performance degrades often. We were using EBS to store all of our Cassandra and Postgres data. After spending a considerable amount of time trying to work around these issues, we came to the conclusion that EBS in its current state was not reliable enough. The feature set of the product can be quite handy, but the constant performance degradations were not worth it.

As a result, we have moved all of our high-traffic data off of EBS and onto local ephemeral disks. This migration required us to bulk up our redundancy considerably; a hardware failure on an ephemeral disk means your data is gone. Since the move, we have had significantly fewer issues with disks on our Postgres and Cassandra servers.
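
As a rough illustration of how a degraded volume shows up, here's a tiny write-latency probe in the spirit of what our monitoring watches for: time small synced writes and look for spikes. The path is made up, and this is a sketch rather than our actual tooling.

    import os
    import time

    PATH = "/data/latency_probe"   # hypothetical file on the volume under test
    SAMPLES = 60

    worst = 0.0
    for _ in range(SAMPLES):
        start = time.time()
        fd = os.open(PATH, os.O_WRONLY | os.O_CREAT, 0o644)
        os.write(fd, b"x" * 4096)   # small 4KB write
        os.fsync(fd)                # force it to disk, past the page cache
        os.close(fd)
        worst = max(worst, time.time() - start)
        time.sleep(1)

    # On a healthy disk this stays in the low milliseconds; a degraded volume
    # shows up as large spikes.
    print("worst synced 4KB write over %d samples: %.1f ms" % (SAMPLES, worst * 1000))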

Cassandra 0.8

Throughout this whole year, we've been migrating data off of our mostly broken Cassandra 0.7 ring onto our mostly unbroken Cassandra 0.8 ring. This has resulted in much improved stability and faster response times. Additionally, our newer features - like flair and the moderation log - are canonically stored in Cassandra as opposed to Postgres. That said, there's still a lot of work to be done on Cassandra.
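
To give a flavor of what "canonically stored in Cassandra" looks like from the application side, here's a small sketch using pycassa, a Python client for Cassandra. The keyspace and column family names are made up for illustration.

    import pycassa

    # Hypothetical keyspace and column family; one row per (subreddit, user).
    pool = pycassa.ConnectionPool('reddit', server_list=['cassandra01:9160'])
    flair = pycassa.ColumnFamily(pool, 'Flair')

    row_key = 'pics/some_user'
    flair.insert(row_key, {'text': 'camera nerd', 'css_class': 'photographer'})

    print(flair.get(row_key))   # fetch the row back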

Random small improvements

There are a ton of small changes made each week that individually have a negligible impact. We've rekicked most of our servers to Ubuntu Natty and use Puppet to keep their configurations in sync. We're slowly building a kick system to automate most of the process of adding new servers to our setup. Our monitoring now exists, and we fixed the office TV so we can keep an eye on Google's real-time analytics. :-)

The Future

Maintaining the infrastructure for a site like reddit is an exercise in never-ending changes. Here are some of the bigger projects we are going to be working on this year.

No downtime downtimes

One thing we hate the most is having to take the site down for maintenance. We try to avoid it wherever possible, but some changes will always require that systems be taken offline. While these maintenances will still be necessary for the foreseeable future, a project is currently in the works to lessen the sting of downtime.

rram is in the process of working with Akamai, our content delivery network, to change the site's behaviour during downtimes. Instead of taking the site completely offline for maintenance, we will be able to have Akamai serve up a cached, read-only version of the site. Once this project is complete, the majority of our maintenances can be done while still serving the site in some capacity. This same method can also be used during unexpected downtimes, should they ever pop up. Moreover, we're researching Edge Side Includes as a method to further reduce load on our servers. I expect that these will greatly reduce the number of bananas consumed by redditors during site downtime.
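
To illustrate the Edge Side Includes idea: the app can hand the CDN one mostly-static page with an ESI tag marking the per-user fragment, and the edge assembles the final page so that only that fragment has to hit our servers. The markup and paths below are made up; this is a sketch, not our actual templates.

    # A cached listing page in which only the user bar is fetched per request.
    LISTING_PAGE = """
    <html>
      <body>
        <esi:include src="/esi/userbar" />  <!-- assembled at the edge, per user -->
        <div id="siteTable">
          ... listing markup that's identical for every user, cacheable at the edge ...
        </div>
      </body>
    </html>
    """

    def render_listing():
        # The CDN can cache this whole response; the ESI include is the only
        # part that still requires a trip to the app servers.
        return LISTING_PAGE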

Cassandra 1.0

We're very happy to see Cassandra reach its shiny new "1.0" state. I'm planning on upgrading our Cassandra 0.8 ring to 1.0 very soon. This upgrade will resolve some of the more pesky issues on our current Cassandra ring, such as difficulty with bootstrapping and repairing.

Why buy one, when you can buy two for twice the price

One of the largest projects on the horizon is running reddit in multiple datacenters concurrently. This will allow us to gain some redundancy, and it is the first step in being able to host the site in multiple regions. Doing so will require significant changes in both the infrastructure and the code of the site. It is a huge undertaking, but it is well worth it.

It has its ups and downs

Finally, we're looking forward to having our infrastructure self-heal and auto-scale. Right now, when bad things happen, we get alerts, but for the most part any fixes are completely manual. This often leads to either app server bloat (we've made 15 new ones in 2012 already) or us temporarily sacrificing a post in order to keep the rest of the site healthy. :-( With our infrastructure self-healing and auto-scaling, we'll be more hands-off, working on getting rid of bottlenecks rather than fighting server fires.
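
The shape of "auto-scale instead of paging a human" is roughly the following. This sketch uses boto to launch EC2 instances and a made-up load metric; it is not our actual tooling.

    import boto.ec2

    AMI_ID = 'ami-xxxxxxxx'      # hypothetical app-server image (built by the kick system)
    LOAD_THRESHOLD = 0.8         # hypothetical: fraction of app capacity in use

    def current_app_load():
        # Placeholder: in practice this number would come from our monitoring.
        return 0.9

    def add_app_server():
        conn = boto.ec2.connect_to_region('us-east-1')
        reservation = conn.run_instances(AMI_ID,
                                         instance_type='c1.xlarge',
                                         security_groups=['app'])
        print("launched %s" % reservation.instances[0].id)

    if current_app_load() > LOAD_THRESHOLD:
        add_app_server()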

tl;dr: 2011 was an awesome year, and things are only looking better. We could not have done all of this without your support through reddit gold, advertising, postcards, and just awesomeness. Remember, here at reddit, we're working hard so you won't have to!

discuss this post on reddit