Thursday, March 17, 2011

Why reddit was down for 6 of the last 24 hours.

As most of you are probably aware, we had some serious downtime with the site today. Now that the dust is beginning to settle and we have finally gotten some sleep, we will attempt to explain what happened.

As you will see, the blame was partly ours and partly Amazon's (our hosting provider). But you probably don't care who is to blame, and we aren't here to assign blame. We just want to tell you what happened.

Begin nerd talk

At approximately 1AM PDT today, we noticed that load was simultaneously shooting up on a large number of our Postgres and Cassandra servers. Within the next 10 minutes, we determined that I/O had ground to a complete halt on nearly every server using Amazon's Elastic Block Store (EBS) service in one particular Availability Zone (their version of a data center). When we say "complete halt", we really do mean it. It was taking minutes to read or write a single 512-byte sector. Since replication everywhere was severely degraded by this issue, we decided to take the site down to prevent further problems.
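For the curious, a stall like this shows up clearly in the kernel's per-device I/O counters. Below is a minimal sketch of that kind of check, not our actual monitoring: it samples /proc/diskstats (the same counters iostat reads) twice and reports the average wait per completed I/O. The device names implied and the 10-second interval are illustrative.

```python
# A rough sketch, not our actual monitoring: sample /proc/diskstats twice
# and report the average wait per completed I/O on each device.
import time

def read_diskstats():
    """Return {device: (completed_ios, ms_waited, ios_in_flight)}."""
    stats = {}
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            name = fields[2]
            completed = int(fields[3]) + int(fields[7])    # reads + writes completed
            ms_waited = int(fields[6]) + int(fields[10])   # ms reading + ms writing
            in_flight = int(fields[11])                    # I/Os currently in progress
            stats[name] = (completed, ms_waited, in_flight)
    return stats

before = read_diskstats()
time.sleep(10)                                             # sampling interval (example)
after = read_diskstats()

for dev in sorted(before):
    completed = after.get(dev, before[dev])[0] - before[dev][0]
    ms_waited = after.get(dev, before[dev])[1] - before[dev][1]
    in_flight = after.get(dev, before[dev])[2]
    if completed:
        print("%s: %.1f ms per I/O" % (dev, float(ms_waited) / completed))
    elif in_flight:
        print("%s: %d I/Os in flight, none completing" % (dev, in_flight))
```

On a healthy volume the per-I/O figure is in the low milliseconds; last night it was in the minutes.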

We immediately got in contact with Amazon and supplied them with a multitude of data. Jedberg even resorted to pasting data to AWS every 30 seconds as it came in. After a couple hours of digging, they put the following status message up:

2:45 AM PDT We are currently investigating increased latencies for a subset of EBS volumes in a single Availability Zone.

Note that when you are the size of Amazon, a "small subset" can mean a lot of things. In our particular case, it meant the majority of EBS disks on most of our Postgres and Cassandra servers.

EBS outages affect us particularly badly because we use a whole lot of EBS volumes, which gives us a large exposure to their problems.

Nearly an hour later, Amazon acknowledged that the issue started right around the same time we started seeing issues:

3:36 AM PDT A small subset of EBS volumes began experiencing increased latencies in the US-EAST-1 region beginning at 01:00 AM PDT. We are working to restore normal operation to the effected EBS volumes.

At this point, the AWS engineers informed us that they were manually repairing the issue on the EBS disks. They asked us "which disks are most critical" to repair first, to which we replied "all of them".

Nearly an hour after that, Amazon was still working to resolve the issue:

4:33 AM PDT We are continuing to work to restore normal operation to the small number of EBS volumes still experiencing elevated latencies.

At around 5am, most of our disks were repaired, and we brought the site completely up.

We monitored the site for the next hour or so to ensure things were stable, and even had to wake Spladug at 5am for some quick help with some corrupted listings. Once we had cleaned up those issues, we decided to get some much needed sleep.


What Happened... again

In an extremely cruel twist of fate, we started to notice issues with our EBS volumes again at around 10am this morning. We immediately contacted AWS about the issue, and they began to work on repairing the bad volumes. The issue was not as widespread as the earlier one, so no AWS status post was made at that time, as it was only affecting a few volumes -- ours.

Then, something really bad happened. Something which made the earlier outage a comparative walk in the park.

Part of reddit's database backend is a handful of Postgres replication clusters. We run master-slave replication across several different masters, using a tool called londiste.

Shortly after noticing the EBS issues this morning, our database replication took a severe turn for the worse. Data which had been committed to the slaves was not committed to the masters. In a normal replication scenario, this should never, ever happen. The master commits the data, then tells the slave it is safe to commit the same data.

We are still investigating why replication failed. All we know is that it broke right when the EBS disks on the masters started having issues. We could speculate that the disks lost writes as Postgres flushed commits to disk, but we have no proof of what actually happened.

The replication issue resulted in key conflicts on some of our slave databases. If you work with relational databases at all, you know this is an extremely bad thing. Since there was inconsistent data in the cluster, we were forced to bring the site down to prevent further inconsistencies. You can see a graphical representation of this here.
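To make the problem concrete, here is a minimal sketch of one way to find that kind of damage: pull a table's primary keys from a master and a slave and report keys that exist only on the slave. This is an illustration rather than our actual tooling; the DSNs and table name are placeholders, and comparing full key sets this naively only works for tables that fit in memory.

```python
# A minimal sketch, not our actual tooling: find primary keys that exist
# on a slave but not on its master. DSNs and table name are placeholders.
import psycopg2

MASTER_DSN = "host=master.example dbname=reddit"  # placeholder
SLAVE_DSN = "host=slave.example dbname=reddit"    # placeholder

def orphaned_keys(table, key_column="id"):
    """Return keys committed on the slave that the master has never seen."""
    master = psycopg2.connect(MASTER_DSN)
    slave = psycopg2.connect(SLAVE_DSN)
    try:
        mcur, scur = master.cursor(), slave.cursor()
        mcur.execute("SELECT %s FROM %s" % (key_column, table))
        scur.execute("SELECT %s FROM %s" % (key_column, table))
        master_keys = set(row[0] for row in mcur)
        slave_keys = set(row[0] for row in scur)
        return slave_keys - master_keys
    finally:
        master.close()
        slave.close()

print(orphaned_keys("comments"))  # placeholder table name
```

Any non-empty result means the slave holds rows that, from the master's point of view, never happened -- exactly the situation that forces a rebuild.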

There was no easy way to untangle the mess that the broken replication had left behind, and our only option was to partially rebuild our slaves by dropping and then re-replicating the affected tables. We opted to do this rebuild while we waited for Amazon to migrate the data on the master to better hardware, since we already had the site down. This data moving process took several grueling hours, during which the site was completely down. At approximately 1:30PM PDT, the data migration and slave rebuild both completed (coincidentally just 10 minutes apart) and we were able to bring the site back online.
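For those wondering what "dropping and re-replicating" a table actually involves, the general shape is sketched below: throw away the slave's inconsistent copy and stream a fresh snapshot from the master before replication on that table is resumed. This is a simplified illustration, not our londiste procedure; the DSNs and table names are placeholders.

```python
# A simplified illustration of a per-table slave rebuild, not our actual
# londiste procedure. DSNs and table names are placeholders; in practice
# replication on the table must be paused first and re-enabled afterward.
import tempfile
import psycopg2

MASTER_DSN = "host=master.example dbname=reddit"  # placeholder
SLAVE_DSN = "host=slave.example dbname=reddit"    # placeholder
AFFECTED_TABLES = ["comments", "votes"]           # placeholder table names

def rebuild_table(table):
    master = psycopg2.connect(MASTER_DSN)
    slave = psycopg2.connect(SLAVE_DSN)
    try:
        mcur, scur = master.cursor(), slave.cursor()
        # Throw away the slave's inconsistent copy.
        scur.execute("TRUNCATE TABLE %s" % table)
        # Stream a consistent snapshot straight from the master.
        with tempfile.TemporaryFile() as buf:
            mcur.copy_expert("COPY %s TO STDOUT" % table, buf)
            buf.seek(0)
            scur.copy_expert("COPY %s FROM STDIN" % table, buf)
        slave.commit()
    finally:
        master.close()
        slave.close()

for table in AFFECTED_TABLES:
    rebuild_table(table)
```

Even simplified like this, you can see why it took hours: every affected table has to be copied across in full while the site sits dark.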

What we are doing about it and what we could have done better.

Nothing wears on our fragile sanity more than when reddit goes down. Like you, we never, ever want to have this happen again.

Amazon's Elastic Block Store is an extremely handy technology. It allows us to spin up volumes and attach them to any of our systems very quickly, and to migrate data from one cluster to another just as quickly. It is also considerably cheaper than getting a similar level of technology out of a SAN.

Unfortunately, EBS also has reliability issues. Even before the serious outage last night, we were suffering random disk degradations multiple times a week. While we do have protections in place to mitigate latency on a small set of disks by using RAID-0 stripes, the frequency of degradation has become highly unpalatable. To Amazon's credit, they are working very closely with us to try to determine the root cause of the problem and implement a fix.
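For context, a RAID-0 stripe simply presents several EBS volumes as one logical device so that reads and writes are spread across all of them, which blunts the impact of a single slow volume (the trade-off being that losing any member loses the whole set, which is part of why the data also lives on replicas). Setting one up looks roughly like the sketch below; the device names, filesystem, and mount point are examples, not our provisioning scripts.

```python
# A rough illustration of striping EBS volumes with mdadm; device names,
# filesystem, and mount point are examples, not our provisioning scripts.
import subprocess

EBS_DEVICES = ["/dev/sdf", "/dev/sdg", "/dev/sdh", "/dev/sdi"]  # example devices

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

# Combine the volumes into a single RAID-0 device so I/O is spread
# across every member.
run(["mdadm", "--create", "/dev/md0", "--level=0",
     "--raid-devices=%d" % len(EBS_DEVICES)] + EBS_DEVICES)

# Put a filesystem on the stripe and mount it where the database expects it.
run(["mkfs.xfs", "/dev/md0"])
run(["mount", "/dev/md0", "/var/lib/postgresql"])  # example mount point
```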

Over the course of the past few weeks, we have been working to completely move Cassandra off of EBS and onto the local storage which is directly attached to the EC2 instances. This move will be executed within the month. While local storage offers much less functionality than EBS, its reliability outweighs EBS's benefits. After the outage today, we are going to investigate doing the same for our Postgres clusters.

One last change that we will make is fixing a mistake we made a long time ago. When we first started using Amazon's EC2, there were no "best practices" yet. One mistake we made was using a single EBS disk to back some of our older master databases (the ones that hold links, accounts and comments). Fixing this issue has been on our todo list for quite a while, but will take time and require a scheduled site outage. This task has just moved to the top of our list.


Some answers to common questions


Q: No other AWS-hosted sites appeared to be having issues. Why was reddit affected so severely, while other sites stayed up?

A: This outage affected a specific product (EBS) in a specific AWS Availability Zone. Unfortunately, a disproportionate number of our services rely on this exact product, in this exact availability zone. We also use EBS more heavily than sites similar to us.


Q: Why is reddit tied so tightly to the affected availability zone?

A: When we started with Amazon, our code was written with the assumption that there would be one data center. We have been working towards fixing this since we moved two years ago. Unfortunately, progress has been slow in this area. Luckily, we are currently in a hiring round which will increase the technical staff by 200% :) These new programmers will help us address this issue.


Q: Why was reddit taken down by something as simple as a disk issue? Don't you RAID?!?!

A: Disks always fail eventually. We have standard protections in place to prevent problems from disk failures. However, there is very little that can be done when such a large number of disks fail at once, as happened today. And in this particular case, it hit one of the few servers where we don't use a RAID.

Q: Why are you using Amazon as a scapegoat?

A: We'll certainly admit that more could be done from our side to prevent hosting issues from affecting us so gravely. However, this was a very serious outage which affected a large proportion of our disks. We would be lying if we said Amazon didn't have some fault here.

TL;DR

Yes, the site was down for two separate, long periods of time today. The downtime was catalyzed by an outage with Amazon's EBS service. Jedberg, spladug, and I are working hard to prevent future issues. We take great pride in the site, and whenever it goes down, it hurts us as much as it hurts the community.