I wrote previously that Twitter's architecture is not scalable for real-time messaging. Indeed, the Twitter development team talked about this recently.
Twitter is, fundamentally, a messaging system. Twitter was not architected as a messaging system, however. For expediency's sake, Twitter was built with technologies and practices that are more appropriate to a content management system. Over the last year and a half we've tried to make our system behave like a messaging system as much as possible, but that's introduced a great deal of complexity and unpredictability.
Dare Obasanjo recently looked at the issues and implications and suggested that it's not the technical architecture to blame, but perhaps rather the logical model.
If Twitter were simply a micro-content publishing tool with push notifications for SMS and IM, then the team wouldn't be faulted for designing it as a Content Management System (CMS). In that case you'd just need three data structures:
- a persistent store for each user's tweets
- a cache of their tweets in memory to improve read performance
- a persistent list of [IM and SMS] end points subscribed to each user's tweets, and an asynchronous job (i.e. a daemon) which publishes to each user's subscribers after each post
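The three structures Dare describes can be sketched in a few lines of Python. This is purely illustrative - the class name `TweetStore`, the 20-item cache size, and the shape of the subscriber records are my own choices, not anything from Twitter's actual codebase:

```python
from collections import defaultdict, deque

class TweetStore:
    """Illustrative CMS-style store: persistent tweets, a read cache,
    and a subscriber list drained by an async publisher daemon."""

    def __init__(self):
        # Persistent store: user -> list of tweets (stand-in for a DB table).
        self.tweets = defaultdict(list)
        # In-memory cache of each user's most recent tweets for fast reads.
        self.cache = defaultdict(lambda: deque(maxlen=20))
        # user -> set of (protocol, endpoint) pairs subscribed to that user.
        self.subscribers = defaultdict(set)

    def post(self, user, text):
        self.tweets[user].append(text)      # durable write
        self.cache[user].appendleft(text)   # keep reads cheap
        # Return the work items a background daemon would publish.
        return [(proto, ep, text) for proto, ep in self.subscribers[user]]
```

Note that the writer never fans anything out itself: publishing to endpoints is deferred to the daemon, which is what makes this model simple.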
Unfortunately, Twitter isn't just a blogging tool that allows people to subscribe to my posts via SMS & IM instead of just RSS. It also has the notion of followers. That's when things get hairy.
Nothing is as easy as it looks. When Robert Scoble writes a simple “I’m hanging out with…” message, Twitter has essentially two choices for dispatching it:
- PUSH the message to the queues of each of his ~~6,864~~ 24,875 followers, or
- Wait for the ~~6,864~~ 24,875 followers to log in, then PULL the message.
The trouble with #2 is that people like Robert also follow ~~6,800~~ 21,146 people. And it’s unacceptable for him to log in and then have to wait for the system to open records on 21,146 people (across multiple db shards), sort the records by date, and finally render the data. Users would be hating on the HUGE latency.
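The two options amount to fan-out-on-write versus fan-in-on-read. A toy sketch of the tradeoff, with hypothetical in-memory structures standing in for Twitter's actual queues and shards:

```python
from collections import defaultdict

followers = defaultdict(set)   # author -> set of followers
following = defaultdict(set)   # reader -> set of authors they follow
inbox = defaultdict(list)      # reader -> pre-materialized timeline (push model)
posts = defaultdict(list)      # author -> [(timestamp, text)] (pull model source)

def post_push(author, ts, text):
    # Option 1: do the work at write time -- one enqueue per follower.
    # Reads are cheap, but one Scoble-sized post means ~25,000 writes.
    for f in followers[author]:
        inbox[f].append((ts, author, text))

def timeline_pull(reader, n=20):
    # Option 2: do the work at read time -- open every followee's record,
    # then sort. Writes are cheap, but login latency grows with how many
    # people the reader follows (~21,000 in Robert's case).
    items = [(ts, a, text) for a in following[reader] for ts, text in posts[a]]
    items.sort(reverse=True)
    return items[:n]
```

Neither function is hard to write; the point is that one of these costs must be paid somewhere, and the "super user" follower counts make either payment expensive.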
Scoble challenges the assumption that there are copies made, but Dare discusses that further in a follow up post. That said, it's not disk storage here that's a problem - as Israel said, it's about figuring out how to efficiently dispatch that message with as little latency as possible.
The impact of this "follow" model goes even deeper: it saddles Twitter with a much greater routing burden. For example, with Exchange Server, the sender defines where the message will be received. When I send a message, I put a clearly defined recipient list on my e-mail - Exchange Server doesn't need to figure out who to send it to (a mailing list is clearly defined up-front, so even there it's just a one-hop resolution), only where those mailboxes are. Not only that but, like the Internet generally, e-mail works and scales because it is a distributed system - the burden of routing is shared across a wide, redundant infrastructure.
Twitter doesn't have this luxury. Every time any Twitter user sends a message, the system must decide which of the millions of other Twitter users should receive that message. Dare may have called Twitter amateurs compared to the pros, but even Facebook doesn't need to deal with this on a real-time basis. (Facebook does in fact employ model #1 above for the news feeds - the reason you can't really delete a story from your mini-feed - but it doesn't need to do that on a real-time basis.) Before we even get to the I/O math that Israel discusses, Twitter must decide whether each message should be delivered to each of your followers - @replies (which these "super users" naturally have a lot of) are delivered based on per-recipient logic, and that decision needs to be made for each of those 25,000 users.
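To make that per-recipient decision concrete, here is one plausible rule sketched as a predicate - the rule and names are my assumption for illustration, not Twitter's documented behavior: a plain update goes to every follower, but an @reply is delivered only to followers who also follow the addressee.

```python
def should_deliver(text, follower, follows):
    """Decide whether `follower` should receive `text`.

    follows: dict mapping each user to the set of users they follow.
    Hypothetical rule: plain updates go to every follower; an @reply
    is delivered only if the follower also follows the addressee.
    """
    if text.startswith("@"):
        addressee = text.split()[0][1:]   # "@dave lunch?" -> "dave"
        return addressee in follows.get(follower, set())
    return True
```

Even this trivial predicate implies a set-membership check per follower per message - multiply by 25,000 recipients per "super user" post and the routing cost is no longer trivial.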
Let's also not forget about Twitter's track functionality. Every time a message comes in, Twitter needs to decide which users (in the entire Twitter universe) need to get it. Blaine Cook discussed this feature on a recent Gillmor Gang and said it only scales for sending "on demand" to IM/SMS endpoints - as he put it, "if you have an event-driven system, you don’t have to do the queries against large data sets. You are just looking at an individual piece of data in memory." (In my mind, this further reinforces the likelihood that Dare and Israel "guessed right" on the architectural model). Regardless of the specific implementation, it's a routing challenge that Facebook, instant messaging, and e-mail systems don't really have to face.
Let’s assume that the average word length for English is 5.1 characters. On Twitter it’s likely less, since we tend to use abbreviations and shorter words given the 140 char limit. Taking out 30 chars for punctuation and spaces leaves about 110 chars, which gives us (conservatively) about 20 distinct words per message. Twitter needs to look at every one of those words in every message and decide who should receive it. Not only that, but Twitter supports multiple "contains" clauses. This may seem simple to implement initially, and there are ways to attack it efficiently, but at high volume it is clear that Twitter has to do a fair amount of processing in near-real-time that comparable services don't need to do.
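Blaine's "individual piece of data in memory" comment suggests an inverted index keyed on tracked terms: each incoming message's ~20 words are looked up once against the index, rather than running queries over the message corpus for every subscriber. A rough sketch of that approach - the structure and names are my guess at the technique, not Twitter's implementation:

```python
import re
from collections import defaultdict

track = defaultdict(set)   # tracked keyword -> users tracking it

def track_keyword(user, keyword):
    track[keyword.lower()].add(user)

def route_track(text):
    # Event-driven match: look at each word of the incoming message once.
    # Cost is O(words in message), independent of corpus size.
    recipients = set()
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        recipients |= track.get(word, set())
    return recipients
```

This is why, as Blaine says, it scales only for "on demand" delivery: the match happens once per message as it arrives, and nothing is ever queried retroactively.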
So, it's clear that Twitter faces some unique challenges because of its logical model.
The real problem I have is that Twitter doesn't fail gracefully. Twitter should not go down under load - it should degrade gracefully into something that's not-quite-real-time by queuing up "delivery". The challenge in delivering should not prevent me from pulling what has already been delivered, nor should it prevent me from sending something that will eventually be delivered. If a message can't really be delivered immediately, that's not the end of the world - the fact is most clients today aren't really real-time since they use the HTTP API instead of XMPP.
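Concretely, graceful degradation here just means separating acceptance from delivery: accept the write immediately, queue the fan-out, and keep serving whatever has already been delivered. A minimal sketch, assuming an in-memory queue stands in for durable infrastructure:

```python
from collections import defaultdict, deque

delivery_queue = deque()       # pending fan-out work, drained by background workers
timelines = defaultdict(list)  # already-delivered messages, always readable

def send(author, text, followers):
    # Accept the write right away; fan-out is deferred, so a load spike
    # slows delivery instead of taking the whole site down.
    delivery_queue.append((author, text, list(followers)))
    return "accepted"

def drain(batch=100):
    # Background worker: fan out at whatever rate the system can sustain.
    for _ in range(min(batch, len(delivery_queue))):
        author, text, fs = delivery_queue.popleft()
        for f in fs:
            timelines[f].append((author, text))
```

Under load, `send` and reads of `timelines` keep working; only `drain` falls behind, which is exactly the not-quite-real-time degradation argued for above.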