Almost ten months ago we launched Facebook Chat to 70 million users. We ventured into a lot of new territories with this product: not only were there tricky web design and product issues, we needed to develop and launch a trio of new backend services to support all of Chat's functionality. Eugene wrote a great post detailing the inner workings of Chat, and recently we gave a talk (videos: 1,2,3,4) about the product from front to back.
Since the product's launch last April Facebook has grown to over 175 million active users, and more than two-thirds of them have used Chat. It's been both an engineering and operational challenge to keep pace with Facebook's rate of growth. As we push our infrastructure closer and closer to its limit, some obvious (and not so obvious) bugs have expressed themselves. The individual servers work closer to their limits and it becomes more difficult to keep all the component pieces running in harmony. The channel servers are the most intricate piece of the backend. They're responsible for queuing a given user's messages and pushing them to their web browser via HTTP. We made an early decision to write the channel servers in Erlang. The language itself has many pros and cons, but we chose Erlang to power Chat because its model lends itself well to concurrent, distributed, and robust programming.
It's easy to model our millions of concurrent users with a few lightweight processes each, where the same tactic in, say, C++ would have been more daunting. Programming languages are always a tradeoff; Erlang makes some hard things easy but, unfortunately, some easy things hard. One of the earliest challenges we faced was reducing the channel servers' memory footprint. High-level languages often provide rich data types and powerful abstractions for manipulating them. Erlang strings, for instance, are linked lists of characters, allowing programmers to use all the list-manipulation goodies that Erlang provides. In this case, however, it pays to control the representation a little closer and use arrays of characters like one might in C++. In this case we traded back some of Erlang's power in favor of CPU and memory usage. We also exploited the nature of our application to make another trade-off: just before a user's HTTP response process goes to sleep to wait for a new message to arrive, we force a pass of the garbage collector. We spend more cycles in that process than we usually would, but we ensure that it's using as little as memory as possible before it sleeps.
Working in high-level languages can be tricky, but that's not to say our C++ services have been without problem. The chatloggers store the state of Chat conversations between page loads. After launch these servers ran for many months without incident — we considered them the most stable of the trio of Chat services. However as more and more users began using Chat we started running the chatlogger machines closer to their capacity. We began to see that after several days of smooth operation the tier's CPU usage would increase dramatically and they would become unresponsive. We recompiled one of the daemons with OProfile support and observed that, after several days, most of the CPUs' time was being spent in malloc. The backtrace revealed that the use of lexical_cast within a loop was causing a lot of allocations, and after many days of operation the heap was so fragmented that the daemon spent most of its time trying to find free chunks of memory rather than servicing requests. We reorganized the code so that the allocations were done outside of the loop and CPU usage dropped back to its usual low hum. Our difficulties haven't always been bugs in our programming.
The presence servers are the simplest of the trio — in fact they're far simpler than even the example service bundled with Thrift. Their job is to receive periodic batched updates from the channel servers. Each cluster of channel servers keeps an array to record which users are available to chat. Building a list of a user's online friends might require knowledge from each of the many channel clusters, so we instead publish the channel servers' collective knowledge to one place — the presence servers — so that building a list of online friends requires only one query. These servers support two operations: reading a list of users' statuses from the array, and writing a chunk of information shipped from the channel servers. Each presence machine receives a tremendous amount of information: they regularly receive several bits per Facebook user id, whether the corresponding user is available or not.
That volume of write traffic completely overshadows the read traffic, and pushes the limits of the both the servers' network adapters and the network infrastructure within our data centers. We therefore made a crucial CPU vs. bandwidth trade-off: the channel servers compress each chunk of data using zlib before shipping to the presence servers. Our more serious challenges have been operational. A tremendous amount of data flows between our servers, devices, and networking infrastructure every minute. The millions of open browser windows and tabs on Facebook (both active and idle) hold an open connection to the channel servers. Internally we use load balancers to manage the sheer number of connections. Using many of their proprietary tricks we're able to connect many users with servers using only a few devices. At launch we had assumed the load balancers were a limitless resource, but in fact there is an upper bound on the number of simultaneous connections each can handle before they begin resetting connections. If that happens the web browsers on the other end try to reconnect by making a request to our web servers, who in turn send a flurry of requests to the channel servers. The connection is reestablished briefly and then the cycle continues, and all the while our users have sporadic or no access to Facebook Chat.
At first the problem only affected users at peak Chat usage, but as weeks passed and our user base increased the outages became more frequent and began to affect the web and channel servers, as well as the infrastructure behind them. We explored (and fixed) several unrelated issues before we finally realized the root cause. We added more load balancers and the connection resets promptly stopped. We've spent a great deal of effort building infrastructure to monitor all of Chat's parts and pieces. We're fortunate that Erlang allows us a look at the channel server's internals at runtime; we can attach a shell to a running service to find which processes are consuming what resources, to dump the state of particular data structures, and to load new code on the fly. We use Scribe to aggregate error logs across all three backend tiers, and we use fb303 to export statistics via Thrift to get a clear picture of the services' health both per-machine and in aggregate.
The news isn't all bad. We've spent a lot of time making all three services more robust and efficient, and Chat is more stable than ever. Having solved most of the scalability and stability problems in Chat (knock on wood), we're close to welcoming a new member to the family of backend services: Jabber. We've got a bunch of other cool stuff in the pipeline for the next few months, so stay tuned.