Thursday, January 07, 2010

Why did we take reddit down for 71 minutes?


As most of you know, we moved reddit to EC2 back in May of 2009. Our experience there has been excellent so far. Since we moved to EC2, the number of unique users has gone up 50%, and pageviews are up more than 100%. To support this growth, we have added 30% more RAM and 50% more CPU, yet because of Amazon's constant price reductions, we are actually paying less per month now than when we started.

So why am I singing the praises of Amazon and EC2? Mainly to dispel the notion that the site's slowness since the move is in any way related to Amazon. Our experience with EC2 so far has been excellent, and when we do hit a bump in the road, their support staff is extremely helpful, competent, and technically knowledgeable. Any slowness reddit has been experiencing is our fault, not theirs.

So, why did we take the site down for 71 minutes yesterday? As you probably noticed, our site has been unacceptably slow during peak times since before the holidays, and unusably slow for the last week. To fix this, we upgraded the disks for one piece of our system.

Nerd alert -- this section gets technical
Let me start from the beginning… Amazon offers a service called Elastic Block Store (EBS), which is a device you can attach to a running instance that looks like a SCSI disk. Under the hood these disks are accessed over the network and are built on RAID systems, but from our point of view, each one is roughly as fast as a single fast SCSI disk.
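
If you've never played with EBS, getting a volume is about as simple as it sounds. Here's a rough sketch of the idea using the boto Python library; the size, zone, instance id, and device name are made up for illustration and aren't our actual setup:

    import boto

    # Connect to EC2 (credentials come from the environment).
    conn = boto.connect_ec2()

    # Create a 100 GB EBS volume in a particular availability zone.
    vol = conn.create_volume(100, 'us-east-1a')

    # Attach it to a running instance. It shows up as a block device
    # (here /dev/sdf) that you can format and mount like a local disk.
    conn.attach_volume(vol.id, 'i-12345678', '/dev/sdf')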

We use a lot of EBS disks. Each of our databases was using a single EBS volume. This worked really well for us up until a week or so ago. Then all of you came back from holiday, decided that work was just too boring or something, and our traffic spiked -- the straw that broke the camel's back, if you will.

In response, we started upgrading some of our databases to use a software RAID of EBS disks, which gives drastically increased performance (at a higher cost, of course). This worked really well, but there was still one missing piece of the puzzle.
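
There's nothing exotic about the RAID itself; it's just Linux software RAID (md) striped across several EBS volumes. Roughly, the setup looks like the sketch below, though the device names, RAID level, filesystem, and mount point are only for illustration, not our exact configuration:

    import subprocess

    # Several EBS volumes that are already attached to the instance.
    devices = ['/dev/sdf', '/dev/sdg', '/dev/sdh', '/dev/sdi']

    # Stripe them into a single md device with mdadm.
    subprocess.check_call(
        ['mdadm', '--create', '/dev/md0', '--level=0',
         '--raid-devices=%d' % len(devices)] + devices)

    # Put a filesystem on the array and mount it where the database
    # keeps its data files.
    subprocess.check_call(['mkfs.xfs', '/dev/md0'])
    subprocess.check_call(['mount', '/dev/md0', '/mnt/db'])

The win is that reads and writes get spread across all of the underlying volumes instead of hammering just one.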

Part of our setup uses what we call a "permacache", which uses Memcachedb. Memcachedb is Memcached with a built-in permanent storage system using BDB. One of the "features" of this system is that it saves up its disk writes and then bursts them to the disk. Unfortunately, the single EBS volumes these caches were on could not handle those bursting writes. Memcachedb has another feature that blocks all reads while it is writing to disk. Lately, these two things together were causing the site to go down for about 30 seconds every hour or so.
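
From the application's point of view, the permacache looks just like memcached, since Memcachedb speaks the same protocol. A minimal sketch with the standard Python memcache client (the host, port, and key are made up for illustration):

    import memcache

    # Memcachedb speaks the plain memcached protocol, so an ordinary
    # memcached client works. The host and port here are illustrative.
    permacache = memcache.Client(['127.0.0.1:21201'])

    # Writes land in BDB under the hood; Memcachedb saves them up and
    # flushes them to disk in bursts.
    permacache.set('comment_tree:12345', {'children': [1, 2, 3]})

    # Reads block if Memcachedb is in the middle of one of those
    # bursting writes -- which is what was stalling the site.
    tree = permacache.get('comment_tree:12345')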

Last night, we upgraded the disks that these caches use to the same RAID setup we are moving the databases to. We had to take the site down because, while we call them "caches", they really are just another database. We call them caches because all the info they hold can be recreated from the main database, but not quickly or easily.
End Nerd Alert

The maintenance itself went smoothly, and we had only one small issue: one of the five machines had a slight performance problem while building its RAID.

As always, if you have any questions, you can ask them in the comments on this blog post.
discuss this post on reddit