Thursday, April 29, 2010

Now this is how you pitch your product to an open source company

After our call to open source developers on Monday and the addition of our code repository to github, we've been ecstatic to see a flurry of activity in #reddit-dev on freenode as well as on our mailing list. We'd like to announce that our most recent accepted patch is from the gentlemen at a YCombinator start-up called embed.ly who, as you might have guessed, specialize in providing embeddable media using the open oEmbed standard.

When a link gets submitted to reddit, within a few minutes of it appearing on the new page, we run it through a "media scraper" which is responsible for finding images to generate a thumbnail for the link as well as for finding any embeddable content (such as videos). Unfortunately, each provider generates their embed codes a little differently, and it has been cumbersome to keep our scrapers up to date.

Enter embed.ly. They've got quite the list of supported content providers, and gave us the advantage of not needing to keep a long and tangled list of how to deal with each API individually. Rather than simply pitch us that it was a good idea to work with them on this, they grabbed our source and wrote us a new scraper which we could drop right in. Not only does this free us up from having to keep track of changing embed APIs, it allows new APIs to be taken advantage of automatically. For this, in addition to getting our thanks for simplifying our lives, screeley and agibby get their awards.
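
For the curious, here is a rough sketch of what talking to an oEmbed endpoint looks like from a scraper's point of view. It is not the code embed.ly contributed. The endpoint URL and both function names are made up for illustration; only the query parameters (`url`, `maxwidth`, `format`) and the response fields (`type`, `html`, `thumbnail_url`) come from the oEmbed standard.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only. A real provider
# publishes its own oEmbed URL.
OEMBED_ENDPOINT = "https://oembed.example.com/oembed"


def build_oembed_request(link_url, maxwidth=600):
    """Build the consumer request URL described by the oEmbed standard."""
    query = urlencode({"url": link_url, "maxwidth": maxwidth, "format": "json"})
    return f"{OEMBED_ENDPOINT}?{query}"


def extract_media(oembed_json):
    """Pull the thumbnail and the embed snippet out of an oEmbed response."""
    data = json.loads(oembed_json)
    # Only "video" and "rich" responses carry an embeddable HTML snippet.
    has_embed = data.get("type") in ("video", "rich")
    return {
        "thumbnail": data.get("thumbnail_url"),
        "embed_html": data.get("html") if has_embed else None,
    }
```

The appeal is that one request shape and one response shape cover every provider, instead of one hand-written scraper per site.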

Their contributions aren't yet reflected in the repository because, as you may have noticed, the repository is still a little bit behind our production branch. We're sorry about that, and we're going to merge up next week as we have a bunch of long-term projects coming to a conclusion. In addition to the new media scraper, the updated version will include our move to Cassandra, the new spam control measures, and all of our new sponsored link code.


TLDR: open source is great, and embed.ly has helped us double the number of sites our media scraper knows about (so you can watch more videos from more sources in reddit by clicking the play button).

A reddit experiment: Help us catch spammers by verifying your email address (please?)

One of the most powerful tools for fighting spam is the humble verification email: If you force all new users to specify an email address, and then verify it by sending a test message, it makes the spammers' job dramatically harder.

However, reddit's always been about openness and privacy. I remember the day I created my account, back in 2005, and how impressed I was that all I had to do was type in a username and a password -- three tenths of a second later, my account had been created and I was logged into it. No annoying, "You must click the link in your email before you can do anything fun" message.

Further, there are plenty of occasions when reddit users wish to remain anonymous -- they're publishing controversial words, or sharing deeply personal stories on IAmA, or posting a photo of their... well, you get the point. It's a fine line to walk, crushing spammers without hurting our community.

After much careful consideration, we think we've found the right balance, so we're going to start an experiment today. Here's how it works:

First and foremost, nobody has to verify their email address. If you're paranoid about this sort of thing and would rather jump off a cliff than tell reddit your email address, you'll still be able to log in, vote, post crazy comments, submit links to bunker supplies and tinfoil hat designs, and everything else that you're used to.

In fact, we think (and hope) that normal, non-spammy users won't even notice any change. The only ones who should have a problem are people who submit one crummy link after another, as often as the site will let them. We're going to start limiting them to a certain number of crummy links per hour (and per day, per week, etc).

So what defines a crummy link? Well:
  • Links that are flagged as spam are crummy.
  • Links that fail deputy moderation are crummy.
  • Links that get more downvotes than upvotes are crummy.
  • Links with even a tiny positive score are successful.
  • Links that survive 24 hours without getting marked as spam are successful.
  • Links that get explicitly approved by a moderator are successful.
And what happens if you use up your "crummy-links" quota? If you haven't verified your email address, you'll be prompted to. Once you do, you'll be granted a lot more leeway.

But what if you're really down on your luck, and despite making a good-faith effort, you use up even this larger crummy-links quota? Or if you don't want to verify your email address for some reason? You can try again in a little while, or you can message the moderators of the reddits you're submitting to and ask them to certify that you're not spamming. There are two ways they can do this: They can manually approve links you've already submitted, which as mentioned above will free up space in your quota, or they can add you to their reddit's whitelist, which will let you submit as often as you want within their community.
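
If you think better in code, the rules above could be sketched roughly like this. It is an illustration, not what runs on the site: the quota numbers, the `Link` fields, the "pending" state for links that are neither crummy nor successful yet, and the choice to let a moderator's approval override everything else are all assumptions. `recent_links` stands for your submissions inside a single time window (an hour, say), and `whitelisted` flattens the per-reddit whitelist into one flag.

```python
from dataclasses import dataclass

# Hypothetical limits. The real numbers are not published here.
QUOTA_UNVERIFIED = 3
QUOTA_VERIFIED = 10


@dataclass
class Link:
    flagged_spam: bool = False
    failed_deputy_moderation: bool = False
    ups: int = 0
    downs: int = 0
    approved_by_moderator: bool = False
    age_hours: float = 0.0


def classify(link):
    """Label a link 'crummy', 'successful', or 'pending'."""
    # Assumption: explicit moderator approval beats every other signal.
    if link.approved_by_moderator:
        return "successful"
    if link.flagged_spam or link.failed_deputy_moderation or link.downs > link.ups:
        return "crummy"
    # A tiny positive score, or 24 hours without a spam flag, is a success.
    if link.ups > link.downs or link.age_hours >= 24:
        return "successful"
    return "pending"


def may_submit(recent_links, email_verified, whitelisted=False):
    """Return True if the user is still under their crummy-links quota."""
    if whitelisted:
        return True
    quota = QUOTA_VERIFIED if email_verified else QUOTA_UNVERIFIED
    crummy = sum(1 for link in recent_links if classify(link) == "crummy")
    return crummy < quota
```

Note how an approval works in this sketch: the approved link stops counting as crummy, which is the "frees up space in your quota" effect.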

And just in case there was any doubt: reddit will never sell your address, bother you with unsolicited email, or anything remotely evil or annoying. But we really think it will make the spammers stand out if as many people as possible verify their email addresses. For best results, use the most prestigious address you have. In other words, throwaway addresses like g634c3Gssd2d@mailinator.com stink, free accounts like joe@hotmail.com and sue12345@gmail.com are so-so, and anything ending in, say, .edu.au, .gov.uk, or .mil is freaking outstanding.

As an added incentive, you may notice something new in your trophy case afterwards.

TLDR: It would help us fight spam if honest users verified their email addresses. But we're not going to make anyone do it.

Wednesday, April 28, 2010

You can now target sponsored links to particular communities, and rerun them without losing the comments

We've been very happy with the way that our sponsored link system has been used. Advertisers seem to have been pretty happy as well. Like all things that ain't broke, we decided, 'Let's fix it!'

For the visual among you, TheOatmeal was nice enough to make us a walkthrough of the new system.

For those of you who want to get down to business, here's the link.

In this major revision, we've added the ability to target specific reddits (and the subscribers thereof) as well as the ability to extend links. We've been running in a private beta for a few weeks to get the bugs worked out, so you may have already started seeing examples of the new ads popping up. If you didn't, great: it means we've succeeded in making this unobtrusive. Targeted sponsored links will appear at the top of the hot listing for the targeted reddit as well as at the top of the hot page only to users who subscribe to that reddit (gory details below). One of the key features of sponsored links is the ability for advertisers to actually interact with the community they are helping to support. In the previous system, once an ad had run its course, it was gone, never to be seen again. Sometimes, this was for the best (you know who you are). Other times it meant that a really productive and interesting comment thread on an ad was cut short. To fix this, we've created the notion of "campaigns" around links.

Once a sponsored link is approved, it can be rerun at any time (as soon as tomorrow rather than our current 48 hour waiting period). This means a popular ad can be extended and, since these campaigns can be targeted or untargeted, moved around to other potential target reddits to extend the conversation.

Here are the gory details on how the new algorithm works. Like our original version, the new version is built around the notion of everyone paying the same CPM. For any given day, we take the pool of all bids for that day, treat them as the cost for "selling out" reddit's advertisements for that day, and allocate each advertiser a slice of the pie commensurate with their contribution. To include targeting we added two refinements to this:

  • Targeted links will compete with untargeted sponsored links on the front page, but only when the current user is a subscriber to that reddit. In this case, the bid will be weighted by the traffic of the targeted reddit relative to an average front page reddit. (In this way, targeting to /r/reddit.com or /r/politics is just like running untargeted, but targeting /r/music would get a 2-4X boost).

  • On the hot listing for each reddit, we compute a separate pool of just links that are targeted to that reddit. If there is no pool, there are no ads. If there is one, we divvy up pageviews by contribution, and (unlike on the front page) render one at the top on every page load (rather than intermingling it with new links).
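
For those who prefer code to prose, the pie-slicing above can be sketched as follows. This is a simplified illustration rather than our production code. The `Bid` class and the function names are invented for the example, and `weight` stands in for the traffic-relative multiplier on targeted bids, which the caller has to supply because the exact formula isn't spelled out here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Bid:
    advertiser: str
    amount: float                 # dollars bid for the day
    target: Optional[str] = None  # reddit name, or None for untargeted
    weight: float = 1.0           # assumed traffic-relative multiplier


def shares(pool):
    """Give every advertiser a slice proportional to their contribution."""
    total = sum(pool.values())
    if not total:
        return {}
    return {name: amount / total for name, amount in pool.items()}


def front_page_shares(bids, subscriptions):
    """Untargeted bids always compete; targeted ones only for subscribers."""
    pool = {}
    for bid in bids:
        if bid.target is None:
            pool[bid.advertiser] = pool.get(bid.advertiser, 0.0) + bid.amount
        elif bid.target in subscriptions:
            weighted = bid.amount * bid.weight
            pool[bid.advertiser] = pool.get(bid.advertiser, 0.0) + weighted
    return shares(pool)


def reddit_listing_shares(bids, reddit):
    """On a reddit's hot listing, only links targeted there are pooled."""
    pool = {}
    for bid in bids:
        if bid.target == reddit:
            pool[bid.advertiser] = pool.get(bid.advertiser, 0.0) + bid.amount
    return shares(pool)
```

An empty result from `reddit_listing_shares` corresponds to the "no pool, no ads" case.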

We're trying to keep these links as unobtrusive as possible. If you are logged in and you don't like an ad, vote it down and it won't show up again to you. As before, if you like an ad, vote it up (and it still won't show up again).

We'd also like to thank our beta testers who were brave enough to risk their time and money on a potentially unstable ad platform for the last few weeks while we got the bugs out.

tldr: oh look! here's a web comic!

Edit: Cool! We're in TechCrunch!

Monday, April 26, 2010

pls send me teh codez

Since we announced that we went open source, something that we haven't done a good job of explaining is that part of being open is that we want to let ideas and code flow in both directions. We know that you're a great and intelligent community with an open source spirit and lots of great ideas and expertise and we want to help those willing and able to contribute to do so.

To recognise those that meaningfully contribute, we're introducing the Open Source Contributor award:



How can you get started?

  1. Get an idea or an itch to scratch. We have an infinitely long to-do list if you need ideas; there's never a shortage of them. The biggest resource that our tiny team lacks is time, and a lot of oft-requested features are easy low-hanging fruit that just aren't in our time budget.

  2. Get the code (executive nerd summary: git clone http://code.reddit.com/repo/reddit.git or fork our github repository; executive summary: tell someone else to do it).

  3. Join /r/redditdev and/or hop on the mailing list and tell us your idea. We can give you an idea of feasibility and guide you through the architecture and tell you where it would go. Sorry: you can't implement a feature that punches people when you downvote them unless you're willing to market the peripherals yourself.

  4. Code like the wind!
The hardest part of contributing to most open source projects is setting up and learning the environment, and reddit is no exception. The first patch is always the hardest, and if you can join us we want to hold your hand through that process to make it as painless as we can.

To jump-start the award, the first recipients are:

Wednesday, April 07, 2010

You've just been drafted.

Pretty soon you're going to start seeing a pink box like this at the top of the front page every once in a while:


This "reddit's spam filter needs your help!" box is the first step in a major overhaul that's been in the works for the better part of the last year and has been in our dreams since forever (reddit admins have weird dreams).

We've never been comfortable with the collateral damage caused when our anti-spam and anti-cheating mechanisms catch an innocent victim, but until now, the alternative would be to let spam and cheating take over the site.

Like most components of reddit, this is a story of outgrowing:
  • In the earliest days, there was no spam.
  • Then, there was some spam, but users would downvote it right away.
  • Then, the New queue was so flooded with spam that it became unreadable, which ultimately starved the front page of good submissions. So we (the admins) started removing it manually.
  • Then we asked you guys to report spam so we at least didn't have to go looking for it.
  • Then, even sorting through the reports got overwhelming, and we had to turn the job over to moderators.
  • Then, the moderators were overwhelmed and an automated spam filter had to be set up for each reddit community.
  • Then, traffic grew so much that the spam filter's tiny false positive rate started accumulating into a constant stream of stories about poor souls who were unfairly blocked. Most redditors are understandably sympathetic to these stories, and so there have been numerous prominent submissions that inevitably end with us being accused of censorship -- or at best, being a police state. And that makes us feel terrible.
Adding to the problem is the fact that the spam filter only really works when it's fed a constant stream of training data -- "This is spam." "This is not spam." ... it has a really voracious appetite for this training data, and moderators simply can't keep up anymore. So the malnourished spam filter starts acting crazy, and in a vicious cycle, the moderators get more work to do.

So now, in the hopes of solving this problem once and for all, we're drafting people like you to help out. We call this new system "deputy moderation" and will be putting up instructions here just as soon as we get around to it. (It's a wiki, so you can jump in and help too.)

As you flood us with useful training data, our spam filter will get better and better. Soon your feedback will -- when there's a quorum and a clear consensus -- be able to stand in for moderators when they're asleep or otherwise not available to tend their queues. We expect Australian and New Zealic redditors will especially love this new feature, for reasons enumerated quite bracingly over here.
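
To make "a quorum and a clear consensus" a little more concrete, here is a toy sketch. The thresholds and the function name are invented for illustration; we haven't settled on real numbers, and the real system may weigh feedback differently.

```python
# Hypothetical thresholds, for illustration only.
QUORUM = 5        # minimum number of deputies who must weigh in
CONSENSUS = 0.8   # fraction that must agree before a verdict counts


def deputy_verdict(spam_votes, ham_votes):
    """Return 'spam', 'not spam', or None if deputies can't stand in yet."""
    total = spam_votes + ham_votes
    if total < QUORUM:
        return None  # not enough feedback: leave it for a moderator
    if spam_votes / total >= CONSENSUS:
        return "spam"
    if ham_votes / total >= CONSENSUS:
        return "not spam"
    return None      # quorum reached, but no clear consensus
```

Anything that comes back as `None` simply stays in the moderators' queue.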

If this stuff works, we'll be able to decommission a lot of our sneakiest anti-spam measures, which while extremely powerful at stopping spam seem to also be the ones with the worst collateral damage.

P.S. The first two words of this blog post are a lie -- it won't be "pretty soon" unless everything goes according to plan, and it never does. We're going to deploy this change very slowly and carefully, since it could kill the site in about seventy-three different complex ways. We'll probably enable users for this in order of seniority. Or maybe descending karma. But we will get it out as quickly as we safely can.

Thanks for volunteering, er, being volunteered!