Tuesday, May 11, 2010

reddit's May 2010 "State of the Servers" report

We promised we'd make a blog post about last week's issues, but every time we sat down to write it, a new issue popped up. It's been a crappy week. Anyway, although we're still knee-deep in looking for ways to make the site more stable, it's probably time to write up that, uh, writeup.

Warning: this post is dense with needlessly technical information. But there's a cartoon if you make it to the end.

she who entangled programmers

The primary cause of the initial slowness, and later the all-out outage, was the amount of time we spent blocking on Cassandra. Before you blame Cassandra outright: we've been running it as our persistent cache for a while now, and when we first deployed it, we gave the cluster three machines, which at the time was plenty. For performance we stuck a memcached cluster in front of it (replacing memcacheDB with it was an emergency, and we didn't want to wait for Cassandra's 0.6 release, which ships with in-core row-caching).

(Still with me?)

Since then, the amount of data we store in it has significantly increased, and so has the effective load on the memcached/Cassandra pair. memcached is blindingly fast, so we didn't notice that the effective load had become far too high for Cassandra alone to keep up with on only three nodes, as it had when we first deployed it. On Wednesday, we changed our memcached hashing implementation (from modulo to ketama, actually to avoid exactly this kind of problem in the future), which effectively emptied the memcaches. We've done this before, and while the resultant load is usually a bit high for an hour or so, we've always pulled through just fine. This time, the load on the Cassandra cluster rose instantly and dramatically (which is normal) and stayed high (which is not). Our Cassandra cluster was now so underprovisioned for the load that it couldn't catch up. Requests hitting Cassandra started failing with a TimeoutException, and didn't stop.
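
For the curious, here's a toy comparison (in Python, and definitely not our actual memcached client) of why that hashing change matters: with modulo placement, changing the number of cache servers remaps almost every key, while a ketama-style consistent-hash ring only moves the keys nearest the changed server. The server names, key names, and replica count below are made up for illustration.

```python
# A toy sketch: modulo placement vs. a simplified consistent-hash ring.
import bisect
import hashlib

def _hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

def modulo_server(key, servers):
    # modulo placement: changing len(servers) reshuffles nearly every key
    return servers[_hash(key) % len(servers)]

class HashRing(object):
    # simplified consistent hashing: each server gets many points on a ring,
    # and a key belongs to the first server point clockwise from its hash
    def __init__(self, servers, replicas=100):
        self.ring = sorted((_hash("%s-%d" % (s, i)), s)
                           for s in servers for i in range(replicas))
        self.points = [point for point, _ in self.ring]

    def server_for(self, key):
        idx = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

if __name__ == "__main__":
    old_servers = ["mc01", "mc02", "mc03"]
    new_servers = old_servers + ["mc04"]
    keys = ["key-%d" % i for i in range(10000)]

    moved_modulo = sum(modulo_server(k, old_servers) != modulo_server(k, new_servers)
                       for k in keys)
    old_ring, new_ring = HashRing(old_servers), HashRing(new_servers)
    moved_ring = sum(old_ring.server_for(k) != new_ring.server_for(k) for k in keys)
    print("modulo moved %d/10000 keys; ring moved %d/10000" % (moved_modulo, moved_ring))
```

Going from three cache servers to four, roughly three quarters of keys move under modulo versus roughly a quarter under the ring, which is why the switch empties the cache either way but shouldn't have to in the future.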

Normally the load would go down as more keys are requested and stored in memcached, so repeated requests for the same key would succeed as soon as the first one completes. Since our data in Cassandra has very high locality, this would usually alleviate the load very quickly, but a pair of quirks kept it from happening. First, Cassandra has an internal queue of work to do. When it times out a client (10s by default), it still leaves the operation in the queue of work to complete (even though the client that asked for the read is no longer even holding the socket), so given a constant stream of requests, the amount of pending work snowballs effectively without bound (specifically, ROW-READ-STAGE's PENDING operations grow unbounded). Second, since the request timed out, reddit can't take the answer and put it in memcached. Combined, these kept us from ever catching up once we fell behind.
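
To make the failure mode concrete, here's a minimal sketch of the read path we're describing; the client objects and names are hypothetical, not lifted from reddit's code.

```python
# A minimal sketch of the memcached-in-front-of-Cassandra read path.
class TimeoutException(Exception):
    """Stand-in for the timeout the Cassandra client raises after ~10s."""

def cached_read(key, memcache, cassandra):
    value = memcache.get(key)
    if value is not None:
        return value                 # fast path: memcached hit

    # Cache miss: ask Cassandra. When the cluster is behind, this raises
    # a timeout after ~10s -- but the read it enqueued server-side is
    # still sitting in ROW-READ-STAGE, waiting to run for nobody.
    value = cassandra.get(key)       # may raise TimeoutException

    # We only get here when the read came back in time. On a timeout the
    # backfill below never happens, so the next request for the same hot
    # key goes straight back to Cassandra and queues another doomed read.
    memcache.set(key, value)
    return value
```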

The way you resolve this with Cassandra is by adding more nodes. But the node-bootstrapping process is particularly expensive (it incurs a process called anti-compaction, which duplicates most of the data on the node being split to make room for the new one), so when a node gets this far behind (that is, exactly when you most need to be able to bootstrap a new one), bootstrapping actually makes things worse. Adding new nodes too quickly can make things worse still: a bootstrap works by finding the node in the ring with the highest "load" (that is, the most disk space used). The new node takes responsibility for half of that node's ring-space, has the data shipped to itself, and starts serving it up. But the old node doesn't delete the now-obsolete data until it's told to execute a cleanup operation, and it continues to count that obsolete, unreachable data towards its own load until then, so new bootstrapping nodes overestimate its load. With a small enough ring this can severely imbalance the load across nodes, which exacerbates the issue when you're already imbalanced and overloaded.
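
A toy illustration of that bootstrap heuristic, with made-up numbers and names (this is not Cassandra's code, just the selection rule as we understand it):

```python
# A toy model of bootstrap target selection: the joining node picks
# whichever node reports the most on-disk load and takes half its range.
def pick_bootstrap_target(nodes):
    """nodes maps a node name to (live_gb, obsolete_gb), where obsolete is
    data the node no longer owns but hasn't cleaned up yet."""
    # What the joining node sees is total disk used, obsolete data included.
    reported_load = dict((name, live + obsolete)
                         for name, (live, obsolete) in nodes.items())
    return max(reported_load, key=reported_load.get)

if __name__ == "__main__":
    nodes = {
        "cass01": (40, 35),   # already split once, cleanup never run
        "cass02": (60, 0),
        "cass03": (55, 0),
    }
    # Prints cass01 (reported 75GB) even though it holds the least live
    # data -- so new nodes keep piling onto the same spot in the ring.
    print(pick_bootstrap_target(nodes))
```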

We eventually worked around this with a temporary hack: if the data wasn't found in memcached, the app assumed it didn't exist instead of going on to check Cassandra. This removed the read load from Cassandra, but it had a side effect: if you voted or submitted a link during that window, when we went to re-order the listings the voted item appears in, we fetched an empty listing, added the voted item, and stored a new one-item list, overwriting the non-empty list that used to be stored. This is what caused some listings to become empty or inconsistent. Because we receive so many votes, the active listings (like /r/reddit.com) stayed very close to consistent, but less active ones (like smaller reddits and user pages) ended up with large gaps. As we write this, a mapreduce job is recalculating those listings from the canonical store at about 20 per second, and should be done by the end of the week. No links, comments, or votes were lost during this time, just the persisted cache of what goes into which listing.
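
Roughly, the hack and its side effect looked like this (a simplified sketch with hypothetical names, not the actual change we shipped):

```python
# A simplified sketch of the stopgap and why it overwrote quiet listings.
CASSANDRA_BYPASS = True          # the temporary hack: never fall back to Cassandra

def get_listing(key, memcache, cassandra):
    listing = memcache.get(key)
    if listing is not None:
        return listing
    if CASSANDRA_BYPASS:
        # sheds the read load, but a cache miss now looks identical to
        # "this listing is genuinely empty"
        return []
    listing = cassandra.get(key) or []
    memcache.set(key, listing)
    return listing

def reorder_listing_after_vote(key, voted_item, memcache, cassandra):
    listing = get_listing(key, memcache, cassandra)   # [] for any cold listing
    if voted_item not in listing:
        listing = listing + [voted_item]
    memcache.set(key, listing)
    # On a quiet listing this persists a one-item list over the real,
    # non-empty one -- the inconsistency the mapreduce job is repairing.
    cassandra.insert(key, listing)
```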

With the read load gone from Cassandra, we were finally able to bring up some new nodes. This is already a long and expensive process, but in our case, because of some application quirks, it also required hand-holding that lasted all Wednesday night and into Thursday until around noon. Several members of the #cassandra room on Freenode IRC spent a lot of time walking us through the required processes and Cassandra internals to help us keep the downtime as short and the data loss as small as possible. In particular, the users driftx, jbellis, ericflo, and benblack were especially helpful, and we owe you guys a case of beer some time.

Here are some of the lessons we have learnt about Cassandra so far:
  • It scales up nicely, but doesn't scale down nearly as well. With only three nodes, we weren't able to take advantage of most of Cassandra's safeguards against ending up in this situation.
  • After bootstrapping a new node, the old ones don't delete old data until you run the cleanup, which can unbalance your cluster.
  • The pending queue is a good thing to monitor (see the rough monitoring sketch after this list).
  • It's significantly better to be over-provisioned than under-provisioned, and you need good statistics on your growth to project the number of nodes required to handle your load as it increases, so you don't fall behind the curve. We've learnt that this is true of all things in our system, but here our monitoring was skewed by memcached masking the load issue.
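
On the monitoring point, here's the kind of rough check we mean. It isn't something from our tree; the exact nodetool flags and column layout vary by Cassandra version, so treat the parsing below as an assumption to adapt.

```python
# Poll `nodetool tpstats` and complain when ROW-READ-STAGE's pending count
# is high enough to suggest the cluster is falling behind.
import subprocess

PENDING_THRESHOLD = 1000   # arbitrary; tune to your cluster

def row_read_pending(host="localhost"):
    output = subprocess.check_output(["nodetool", "-h", host, "tpstats"])
    for line in output.decode().splitlines():
        cols = line.split()
        # expected columns: pool name, active, pending, completed
        if cols and cols[0] == "ROW-READ-STAGE":
            return int(cols[2])
    return 0

if __name__ == "__main__":
    pending = row_read_pending()
    if pending > PENDING_THRESHOLD:
        print("WARNING: %d pending reads -- the cluster is falling behind" % pending)
```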

Cassandra is saved by the Amazonians

To make things worse, during the outage we had some initial problems getting the capacity from EC2 that we needed to quickly scale up Cassandra. Sometimes a hedge fund or biotech company will come through and fire up 1000 instances in a single EC2 zone to do some number crunching. A properly architected application will be able to work around this problem by simply coming up in another zone. We hope to one day have our application at that point, but right now we're sensitive to internal latency and so operating cross-zone is a serious performance hit. Since the bulk of our application is in a zone that had this capacity problem on Thursday, we had trouble getting the instance sizes we needed alongside them. To Amazon's credit, they did help us get what we needed quickly once we told them about the problem.

she must have been pregnant*

Just as we finally recovered from the Cassandra failure and were preparing to head off for some much-needed sleep, our internal message bus (rabbitmq) died, which added about an hour to the downtime. It dies like this pretty often, at 2am or at other especially bad times. Usually that costs us no data, just sleep (its queues are persisted, and the apps build up their own queues until it comes back up), but in this case it crashed in a way that corrupted its database of persisted queues beyond repair. rabbitmq accounts for the only unrecoverable data loss incurred, which was about 400 votes. As far as we can tell, these were entirely unlinked events. Coincidentally, rabbitmq crashed twice more that day and a few more times into the weekend. For now we've upgraded to the latest version of Erlang (rabbitmq is written in Erlang), since R13B-4 is rumoured to have significantly better memory management; that should be a temporary stopgap for what appear to be the causes of some of the crashes, but not all of them. Things have improved thus far, but replacing rabbitmq is at the top end of our extremely long list of things to do.
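
The "apps build up their own queues" pattern looks roughly like this. The pika client in the sketch is an assumption for illustration (it isn't necessarily the library we use); any AMQP client with persistent delivery works the same way.

```python
# A minimal sketch: publish persistent messages, and if the broker is down,
# buffer in-process and drain the backlog on the next successful publish.
import collections
import pika
from pika.exceptions import AMQPConnectionError

local_backlog = collections.deque()

def publish_or_buffer(queue, body, host="localhost"):
    """Queue the message locally, then try to drain the whole backlog."""
    local_backlog.append((queue, body))
    try:
        connection = pika.BlockingConnection(pika.ConnectionParameters(host))
        channel = connection.channel()
        while local_backlog:
            q, b = local_backlog[0]
            channel.queue_declare(queue=q, durable=True)
            channel.basic_publish(
                exchange="",
                routing_key=q,
                body=b,
                properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
            )
            local_backlog.popleft()
        connection.close()
    except AMQPConnectionError:
        # Broker is down: keep the backlog in-process and retry on the next
        # publish. Only broker-side corruption, not a blip, loses data.
        pass
```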

broken listings

Last week's woes only got in the way of fixing our aforementioned data-integrity problems. Right before the proverbial shit/fan interaction last week, an unrelated caching bug appeared: a cache node went into an indeterminate state where it wasn't down so much as returning empty results (in this case, empty precomputed listings). That normally isn't bad, except when those corrupted listings start having updates and inserts applied to them and are subsequently cached mostly empty (the same symptom as during the Cassandra downtime, but for unrelated reasons). Since all listings are stored in the same caches, this affected a random sampling of listings all over the site. The mapreduce job will fix instances of this inconsistency regardless of the cause.

email verification irony

We've also discovered that a lot of the verification emails we've been sending out haven't been going through. It seems that the mail server admins at some popular domains (e.g., comcast.net, rr.com, adelphia.net, and me.com) have their servers configured to consider all mail from reddit to be spam. This is because Trend Micro has marked Amazon's entire EC2 network as a "dial-up pool", and the aforementioned domains subscribe to Trend Micro's list and block all mail from anyone on said list. We've written to Trend Micro explaining that we're actually neither a spammer nor an individual end user, but rather an honest website that's kind of a big deal, and they sent us a form letter explaining how to configure Outlook Express and encouraging us to ask our ISP for further information. We'll try to figure something out as soon as time allows.

moving forward

We think the twine and gum holding our servers together should last for the time being, but just in case things fall apart again, we've gone ahead and updated our failure aliens (or, "failiens") to a new set provided by SubaruKayak.

We hope you don't get to see them for a long, long time, but here's a sneak peek at the new 404 page:


We invite all reddit artists to contribute your own images of the alien in strange, exotic locales. You can get the template here. (Kinda looks like Homsar, doesn't he?)

in conclusion

Despite these problems, our traffic keeps growing, so thank you again for your continued patience -- even if it means we'll have the usual problems of scaling up every few weeks.

We know that this site would be nothing without its community, and we thank you for your support.
discuss this post on reddit