Last week we added the ability to search posts using Graph Search, a feature that has been two years in the making. With one billion new posts added every day, the posts index contains more than one trillion total posts, comprising hundreds of terabytes of data. Indexing these posts and building a system to return real-time results has been a significant engineering challenge—and this is just beginning.
Collecting the Data
One of the biggest challenges we faced in building this product was that Facebook's underlying data schema reflects the needs of a rapidly iterated web service. New features often demand changes to data schemas, and our culture aims to make these changes easy for engineers. However, these variations makes it difficult to sort posts by time, location, and tags as wall posts, photos, and check-ins all store this information differently. We currently have 70 different kinds of data we sort and index on, many of them specific to certain types of posts. In addition, the data is contained in a production MySQL database. This means that harvesting this data puts a significant load on databases that are simultaneously serving production traffic, and the process must be carefully monitored.
Building the Index
We store our harvested data in an HBase cluster, from which we execute Hadoop map-reduce jobs to build our index in a highly parallel process. This index-building process converts the raw data into a search index that works with Unicorn, our search infrastructure. We separate the data into two parts – the document data and the inverted index. The document data for each post contains information that will later be used for ranking results. The inverted index contains what is traditionally considered to be a search index, and building the inverted index requires going through each post and determining which hypothetical search filters match.
Updating the Index
To update the index, we subscribe to changes in the MySQL databases using a technology called Wormhole. Whenever a new post is created, an existing post is modified, a post is deleted, or some connection to a post is edited, we schedule an update on the corresponding post. To avoid code duplication, we run these updates through the same logic mentioned in the "Collecting the Data" section above. However, there is one major distinction: when we originally collect the data, we intentionally bypass our caches, as we want to avoid sending them requests for old data that is likely missing from the cache. When processing an update to the index we hit those caches, as we expect the data to have been recently accessed and be present in the cache.
Serving the Index
The posts index is much larger than other search indexes that Facebook maintains. Before we started working on the ability to search through posts, all Facebook search indexes were served entirely from RAM. This was ideal for quick lookup and was a tenable setup given reasonably small search indexes. However, storing more than 700 terabytes in RAM imposes a large amount of overhead, as it involves maintaining an index that is spread across many racks of machines. The performance cost of having these machines coordinate with each other drove the Unicorn team to look into new solutions for serving the posts index. The solution we decided on involves storing the majority of the index on solid-state flash memory. We managed to preserve performance by carefully separating out the most frequently accessed data structures and placing those in RAM.
With a trillion posts in the index, most queries return many more results than anyone could ever read. This leads us to the results ranking step. To surface content that is valuable and relevant to the user, we use two primary techniques: query rewriting and dynamic result scoring. Query rewriting happens before the execution of the query, and involves tacking on optional clauses to search queries that bias the posts we retrieve towards results that we think will be more valuable to the user. Result scoring involves sorting and selecting documents based on a number of ranking "features," each of which is based on the information available in the document data. In total, we currently calculate well over a hundred distinct ranking features that are combined with a ranking model to find the best results. We will continue to work on refining these models as we roll out to more users and listen to feedback.
Like many other products at Facebook, the ability to search over posts was originally conceived as a Hackathon project. My second day as a Facebook intern coincided with a company-wide Hackathon, and I spent the night aiming to implement a way for my friends and me to find old posts we had written. I quickly discovered that the project was much more challenging than I had first anticipated. However, the engineering culture at Facebook meant that I was supported and encouraged to continue working on it, despite the scope of the project. The majority of the work–infrastructure, ranking, and product–has been accomplished in the past year by a few dozen engineers on the Graph Search team. We are very excited to share the results of our hard work with people using Facebook, and look forward to seeing how people use the new feature and improving it based on their feedback. We hope that being able to search for posts will enable a richer Facebook experience for everyone.
Ashoat Tevosyan is an engineer on the search quality and ranking team.