I’ve just finished reading an interesting take over on TechCrunch on the frequent Twitter outages. Yes, this is another post about Twitter, but this time it’s not really knocking them so much as talking about the engineering challenges they are having. It occurred to me that their problems are actually similar to problems we have seen some of our customers struggle with in the text analytics space.

Now I don’t pretend to know the first thing about Twitter’s architecture, but given it’s built on Ruby on Rails I would guess that every time a tweet is sent in, an SQL query is executed to get the distribution list and then the tweet is sent out. Actually I hope it’s not that, because that would be a horribly inefficient model, but let’s work on the assumption that’s how they are doing it for now (they must at least be caching the distribution lists for the frequent tweeters).

This approach is of course the norm in most Web-based shops, as the LAMP (LAMR?) stack is the dictating factor in many designs, and it can pretty much be seen in the above 2007 slide from former chief Twitter architect Blaine Cook. But if you can think around that, then the problem becomes IMO a very solvable one, at a very low cost in both engineering time and hardware.

Twitter is effectively a big alerting service: when a piece of text is submitted, all the people subscribed to that feed (for lack of a better word) need to be told that a new message is available. This is no different from, say, creating a Google News alert about your favourite subject and having it send you frequent email updates as new text that matches your query flows in. (Actually that’s a much harder problem, as in the Twitter case the set of people the alert needs to be sent to is fixed rather than varying by query.)

Recently I built an alerting service POC (I had a few hours to kill!) that takes a piece of text and executes queries against it, telling you which queries hit.
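To give a flavour of the idea, here is a minimal sketch of matching one piece of text against a large set of stored queries. It is not the POC code: it assumes every query is a single lowercase keyword, and the names `KeywordAlerts` and `match` are made up for illustration. The queries are indexed by keyword up front, so the work per document depends on the number of words in the text rather than on the number of stored queries.

```cpp
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration, not the POC's code: each stored query is a
// single lowercase keyword, identified by its position in the input list.
class KeywordAlerts {
public:
    explicit KeywordAlerts(const std::vector<std::string>& queries) {
        // Index the queries by keyword so a lookup costs one hash probe.
        for (std::size_t id = 0; id < queries.size(); ++id) {
            index_[queries[id]].push_back(id);
        }
    }

    // Returns the ids of all queries whose keyword occurs in `text`,
    // in ascending order and without duplicates.
    std::set<std::size_t> match(const std::string& text) const {
        std::set<std::size_t> hits;
        std::string word;
        // Run one position past the end so the final word is flushed too.
        for (std::size_t i = 0; i <= text.size(); ++i) {
            const unsigned char c =
                i < text.size() ? static_cast<unsigned char>(text[i]) : ' ';
            if (std::isalnum(c)) {
                word.push_back(static_cast<char>(std::tolower(c)));
            } else if (!word.empty()) {
                const auto it = index_.find(word);
                if (it != index_.end()) {
                    hits.insert(it->second.begin(), it->second.end());
                }
                word.clear();
            }
        }
        return hits;
    }

private:
    // keyword -> ids of the queries that ask for it
    std::unordered_map<std::string, std::vector<std::size_t>> index_;
};
```

Real queries (phrases, boolean operators and so on) would need a verification pass on each candidate hit, which this sketch leaves out entirely.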
Sounds perfectly reasonable and like something that would be easy to do. However, the telling bit is that I’m able to execute approximately half a million queries against a piece of text (1000-odd words) in around 40ms, all on a bog-standard desktop PC running on a single core. And yes, increasing cores increases performance linearly, but increasing queries doesn’t slow it down linearly: going to a million queries only slows it down to approximately 50ms.

The trick of course is not to write it all using SQL and a scripting language but to craft it in good old-fashioned C / C++, something that seems to be less and less fashionable these days but still can’t be beaten for high-performance plumbing problems. In the prototype I used good old SQLite as my persistent data store, but in a production system I would use a custom storage format to increase the speed even more. The bottleneck in my alerting system is ensuring that the query you think hits actually does. This of course wouldn’t be a problem in a Twitter-type solution, as it’s a binary query.

This sort of approach is the perfect answer to Twitter’s constant downtime. Breaking away from the LAMR stack and writing some custom code designed from the ground up to solve the problem at hand should solve all their dataflow issues using a relatively small number of machines and in a relatively short space of time, certainly less than trying to make a scripting-language-based solution scale.

Of course, for those of us who just don’t get / understand Twitter, all that downtime could be seen as a bonus!
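For the curious, here is a minimal sketch of what the Twitter-style "binary query" case could look like when the distribution lists are simply held in memory. I have no idea what Twitter's real data structures look like, so everything here (`FanOut`, `follow`, `publish`, `inbox`, numeric user ids) is an assumption made up for illustration, and persistence, concurrency and delivery over the network are all ignored.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using UserId = std::uint64_t;

// Hypothetical in-memory fan-out table: author -> list of followers.
// Assumes each follow relationship is recorded only once.
class FanOut {
public:
    void follow(UserId follower, UserId author) {
        followers_[author].push_back(follower);
    }

    // Appends `message` to the inbox of every follower of `author` and
    // returns how many inboxes were updated. No query evaluation is
    // needed: a user either follows the author or does not.
    std::size_t publish(UserId author, const std::string& message) {
        const auto it = followers_.find(author);
        if (it == followers_.end()) {
            return 0;
        }
        for (const UserId follower : it->second) {
            inboxes_[follower].push_back(message);
        }
        return it->second.size();
    }

    // Returns a copy of the messages delivered to `user` so far.
    std::vector<std::string> inbox(UserId user) const {
        const auto it = inboxes_.find(user);
        return it == inboxes_.end() ? std::vector<std::string>{} : it->second;
    }

private:
    std::unordered_map<UserId, std::vector<UserId>> followers_;
    std::unordered_map<UserId, std::vector<std::string>> inboxes_;
};
```

Publishing is one hash lookup plus one append per follower, with no SQL round trip per message.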