Moving fast is core to Facebook's culture and is key to keeping us innovative. Shipping code every day means our software infrastructure must be extremely dynamic. Also, to accommodate our rapidly growing user base, the size and complexity of that infrastructure has increased dramatically over the years.
In a system as large, dynamic and complex as ours, with many inter-dependencies and constantly moving parts, something can go wrong and slip into production. While automated testing (via unit tests) helps mitigate this problem, there is no easy way to prevent it completely.
In Facebook's early days, the reliability of the site could be monitored by tracking a small number of graphs that captured the key aspects of the site. As the site has grown richer (more features, a more sophisticated platform, more back-end services), it can break in subtle ways that do not show up in the high-level graphs. As a result, the number of issues that went undetected until users complained about them began to grow.
Since 2010, there have been ongoing efforts at Facebook to measure, monitor and improve our site reliability. The main goal is to uncover and capture critical user-visible errors early, before they grow out of control and become readily apparent to our users. Another goal is that once an error is under control, we can monitor it closely to prevent regressions.
These errors are logged at a sampling rate (currently set at 1%) using Facebook Scribe. We also log information useful for isolating and debugging issues, such as browser type, user country, error type, PHP endpoint, server cluster and stack trace. Given our large user base, this results in a massive amount of data to store and process — some 600GB per day. We use Apache Hive to process this data and generate trends, and we aggregate it so that we can drill down in a multidimensional space.
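The sampling-and-logging step can be sketched as follows. This is a minimal illustration in Python, not the actual PHP/Scribe pipeline; the field names and the `sink` callable (standing in for a Scribe client) are assumptions:

```python
import json
import random

SAMPLE_RATE = 0.01  # log roughly 1% of errors


def build_error_record(error_type, endpoint, browser, country, cluster, stack_trace):
    """Assemble the debugging metadata described above into one log record."""
    return {
        "error_type": error_type,
        "endpoint": endpoint,
        "browser": browser,
        "country": country,
        "cluster": cluster,
        "stack_trace": stack_trace,
    }


def maybe_log_error(record, sink, rng=random.random):
    """Serialize and forward the record to the logging sink (e.g. a
    Scribe client) only if it falls inside the 1% sample."""
    if rng() < SAMPLE_RATE:
        sink(json.dumps(record))
        return True
    return False
```

Sampling at write time keeps log volume proportional to traffic while preserving enough records for trend analysis downstream.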
The collected data can also be used to compute various reliability metrics, such as the percentage of page requests served error-free, the percentage of service requests handled successfully by each of our back-end services, and the percentage of users who see at least one error in a 15-minute session.
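Two of these metrics can be sketched over a simple request log. This is an illustrative Python approximation, not the actual Hive jobs; in particular, a "session" is approximated here as a fixed 15-minute window per user, which is an assumption:

```python
from collections import defaultdict


def error_free_rate(requests):
    """Fraction of page requests served without an error.
    Each request is a (user_id, timestamp_seconds, had_error) tuple."""
    if not requests:
        return 1.0
    ok = sum(1 for _, _, had_error in requests if not had_error)
    return ok / len(requests)


def sessions_with_errors(requests, window=15 * 60):
    """Fraction of 15-minute user sessions containing at least one
    error, approximating a session as a fixed time bucket per user."""
    sessions = defaultdict(bool)
    for user, ts, had_error in requests:
        key = (user, ts // window)  # one bucket per user per window
        sessions[key] = sessions[key] or had_error
    if not sessions:
        return 0.0
    return sum(sessions.values()) / len(sessions)
```

The real computation runs over sampled logs, so these ratios are estimates whose precision depends on the 1% sample size.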
Fixing the Uncovered Issues
When we first looked at the collected data, we discovered a number of issues that had been happening for a while but had gone unnoticed. We used this data to start prioritizing and fixing them across the company. Here are a few examples:
- Engineers on our Chat team found and fixed several issues based on the large set of Chat error data collected. One was a connection issue between certain Web clusters and the Chat presence servers; another was a message-sending problem for some users (especially when a user had just come online). They also found that some Chat errors were caused by the user’s firewall or proxy blocking Facebook Chat.
- A large set of database access errors in our PHP code was found to be caused by a lack of local database replicas. We fixed this by enabling remote database connections in the PHP code when a local replica is unavailable because of replication lag.
- Many issues were exposed in some back-end services. Several architectural changes ensued to make these services more robust, such as throttling or rate limiting to handle traffic spikes, and fast fail-over to neighboring machines to tolerate individual hardware failures.
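The throttling idea in the last item can be illustrated with a token-bucket rate limiter. This is a generic sketch of the technique, not a description of how the actual back-end services implement it; the injectable `now` clock is for testability:

```python
import time


class TokenBucket:
    """Token-bucket rate limiter of the kind a back-end service can
    use to shed load during a traffic spike: requests beyond the
    sustained rate (plus a burst allowance) are rejected quickly
    instead of queueing up and overloading the service."""

    def __init__(self, rate, capacity, now=time.monotonic):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.now = now
        self.last = now()

    def allow(self):
        # Refill tokens proportionally to the elapsed time, capped
        # at the bucket capacity, then spend one token if available.
        t = self.now()
        self.tokens = min(self.capacity, self.tokens + (t - self.last) * self.rate)
        self.last = t
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller can fail fast or try a neighboring machine
```

Rejecting early like this pairs naturally with the fast fail-over mentioned above: a throttled or failing machine answers immediately, so the client can retry a neighbor instead of hanging.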
Monitoring and Preventing Regressions
Once the uncovered issues are investigated and brought under control, we want to monitor them to make sure error rates stay low and do not regress.
To do this, we needed to make our error collection and measurement system real-time. Given the volume of data, processing it quickly required some tricky engineering and optimization. We eventually built a fast and robust data pipeline with a latency of no more than 10 minutes.
At this point, we have a real-time monitoring system with more than 1500 trend lines. A big blip on one of these lines usually points to a failure of some type. To automatically monitor these, we employ an in-house machine-learning system that analyzes the patterns of the trend lines over days and weeks to detect anomalies and generate alarms with a severity level. Our engineering and operations teams can then respond to these alarms and start a new round of investigation and fixing if needed.
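A much simpler stand-in for that in-house machine-learning system is a trailing z-score detector: flag a point on a trend line when it deviates sharply from the recent window, and use the deviation as a rough severity. The window size and threshold below are illustrative assumptions:

```python
import statistics


def detect_anomalies(series, window=7, threshold=3.0):
    """Flag points that deviate sharply from the recent trend.

    For each point, compare it against the mean and standard
    deviation of the trailing `window` points; return (index,
    severity) pairs where severity is the z-score. A real system
    would also model weekly seasonality, which this sketch ignores.
    """
    alarms = []
    for i in range(window, len(series)):
        recent = series[i - window:i]
        mean = statistics.fmean(recent)
        stdev = statistics.pstdev(recent)
        if stdev == 0:
            continue  # perfectly flat window: no scale to judge against
        z = abs(series[i] - mean) / stdev
        if z >= threshold:
            alarms.append((i, z))
    return alarms
```

With more than 1,500 trend lines, ranking alarms by a severity score like this lets engineering and operations teams triage the biggest blips first.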
Results and Next Steps
Since last year, this company-wide effort has helped cut the number of user-visible errors by a factor of five. However, the fight for reliability is a long uphill battle, especially as we move fast and our site continues to grow rapidly.
While we strive to provide the best quality of service to our users, we understand our site reliability today isn’t perfect. For our next steps, we plan to focus on two things.
First, we want to keep reducing those known errors, especially those high user-impact errors, by building a better back-end infrastructure and through bug hackathons. Second, we want to keep identifying new classes of errors we aren’t capturing yet.
You can help us. If you see an error on the site that particularly bothers you, please go to the Facebook Help Center and submit a detailed report (preferably with steps and/or screenshots to reproduce the issue). This greatly helps us capture and eventually fix these errors. Thanks in advance!
Do you like building Web infrastructure? Facebook is hiring infrastructure engineers. Apply here.
Qiang is a software engineer on the infrastructure team.