Connecting and sharing with others is Facebook’s primary value. That value depends on being able to find the people and information we care about easily and efficiently. The search team at Facebook is focused on building a search product to enable our more than 400 million users to quickly find what they're looking for. In July 2007 we explained the complexities of serving one of the largest user bases in the world and the reasons for building our own in-house search service. Serving more than 150 million queries a day and supporting a user base that has grown more than 10x since then has reinforced that decision.
The Role of Search on Facebook
We know that engagement on Facebook has a lot to do with how many connections someone has, especially for new users. Since people rely heavily on search to create and navigate their social graphs, their success or failure in doing so is the success or failure of search. Facebook search success means that you can find a specific "Bob" without knowing his last name, or find that awesome-but-not-yet-popular band your friend just told you about. Enabling this means tailoring the results specifically to you, since the worst result for one person might be the best result for another.
- Personal Context: Unlike most search engines, every Facebook search involves two key elements – a query and a querier. Just as we need to understand the query, it’s as essential to understand the person behind the query. People are more likely to be looking for things located in their own city/country or for people who share the same college/workplace. We consider this information and much more when ranking results. The more we know about you, the better your search results will be.
- Social Context: An important subset of personal context, social context refers to the people one knows and cares about. The “Jose Gonzales” with whom you have 5 mutual friends is a better result than those with no friends in common. Note that the better a job search does at helping you find and connect with people, the better your search results will be going forward. While personal context makes use of things you care about, social context deals with the things your friends care about. Since calculating social context for every query is technically complex, we built a separate service for it. We will cover the details of this service in a future blog post.
- The Query: We tokenize the query based on the suspected language (Chinese is tokenized on characters, English on spaces), correct potential spelling mistakes, and match name variants so that you find "Elizabeth Jones" even though you typed in "Liz Jones." We also prioritize results based on how they matched the query; e.g., we rank entities with "chicago" in their title differently from those located in Chicago. We've made good progress in understanding queries, but have a lot more left to do.
- Global Popularity: An entity popular amongst a large audience deserves high ranking. Someone searching “Michael Jackson” is more likely to want the pop star than a friend of a friend by the same name. To determine global popularity we look at how many people are connected to an entity as well as how engaged they are — a Poker application with a few frequent users might be more relevant than one with several infrequent users.
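To make the query-understanding step above more concrete, here is a minimal Python sketch of language-aware tokenization plus nickname matching. It is only an illustration under stated assumptions: the `NICKNAMES` table, the `is_cjk` range check, and the function names are all hypothetical and not taken from Facebook's implementation, and spelling correction is left out entirely.

```python
# Hypothetical nickname table; a real system would need a far larger mapping.
NICKNAMES = {"liz": {"elizabeth"}, "bob": {"robert"}}


def is_cjk(ch):
    """Rough check: is this character in the CJK Unified Ideographs block?"""
    return "\u4e00" <= ch <= "\u9fff"


def tokenize(text):
    """Split CJK text into single characters and everything else on whitespace."""
    tokens = []
    for chunk in text.lower().split():
        buf = ""
        for ch in chunk:
            if is_cjk(ch):
                if buf:
                    tokens.append(buf)
                    buf = ""
                tokens.append(ch)
            else:
                buf += ch
        if buf:
            tokens.append(buf)
    return tokens


def expand(tokens):
    """For each query token, return the set of forms allowed to match it."""
    return [{t} | NICKNAMES.get(t, set()) for t in tokens]


def matches(query, name):
    """True if every query token (or one of its variants) appears in the name."""
    name_tokens = set(tokenize(name))
    return all(forms & name_tokens for forms in expand(tokenize(query)))
```

With this sketch, `matches("Liz Jones", "Elizabeth Jones")` succeeds because "liz" expands to include "elizabeth", while a name with a different surname does not match.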
Complexities of User-Centric Search
Our emphasis on personal and social context leads to some interesting technical challenges that set Facebook search apart from the traditional search problem.
- Ranking on the critical path: Since our most important ranking features depend on who the searcher is, all our feature generation and ranking happens as a part of the query execution workflow; i.e., our indices can't store pre-ranked results to optimize lookups. Instead, we have to generate ranking features like is_same_high_school and num_mutual_connections on the fly for every potential result, and run them through our ranking model to find the best results. Making this model better and faster is a major focus for the team this year.
- No query cache: Caching allows a service to compute results once and reuse them across multiple requests. Usually a small number of unique queries make up a large portion of all requests (see Zipf's Law), so most search engines can cache the best results for their most popular queries. Good caching strategies can give you a 50-60% cache hit-rate – at a large scale, this means millions of dollars of savings and much improved performance. Facebook search can’t use this huge optimization because the request is [user, query] and not [query]. We rarely see the same [user, query] more than once a day, rendering traditional caching models useless. Unlike most fast food chains, we wait till you order before we start cooking. Identifying novel caching opportunities is another key focus of our search team.
- Large hot index: Another way search engines usually reduce work is to create a much smaller ‘hot’ index composed of high-quality documents. Getting enough results from the hot index means never having to hit the slower cold index. This works when the hot index contains the set of documents that have a high likelihood of being the best or ‘good enough’ for most queries. Unfortunately, there is no such thing as good enough when you’re looking for a specific person on Facebook, rendering most of our index ‘hot.’
- Live updates: People on Facebook are constantly changing their profile info and connecting to new friends, pages and applications. Since this information determines search relevance, we update our index within seconds of any change. Our index data structures need to manage thousands of concurrent reads and writes for months on end without disastrous fragmentation. We’ll share more about our indexing, live updates, and data structures in future posts.
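The "ranking on the critical path" point can be sketched in a few lines of Python. The feature names is_same_high_school and num_mutual_connections come from the text above; everything else (the `Profile` type, the linear scoring, and the weights) is an assumption made for illustration, since the post does not describe the actual ranking model.

```python
from dataclasses import dataclass, field


@dataclass
class Profile:
    user_id: int
    name: str
    high_school: str = ""
    friends: set = field(default_factory=set)


# Hypothetical weights for a toy linear model; the real model is not described.
WEIGHTS = {"is_same_high_school": 2.0, "num_mutual_connections": 0.5}


def features(searcher, candidate):
    """Compute per-(searcher, candidate) features at query time."""
    same_school = bool(searcher.high_school) and (
        searcher.high_school == candidate.high_school
    )
    return {
        "is_same_high_school": float(same_school),
        "num_mutual_connections": float(len(searcher.friends & candidate.friends)),
    }


def score(searcher, candidate):
    """Weighted sum of the features; depends on who is searching."""
    return sum(WEIGHTS[k] * v for k, v in features(searcher, candidate).items())


def rank(searcher, candidates):
    """Order candidates best-first for this particular searcher."""
    return sorted(candidates, key=lambda c: score(searcher, c), reverse=True)
```

Because the score depends on the searcher, the same candidate list can come back in a different order for two different people, which is why neither pre-ranked index entries nor a cache keyed on the query alone would give correct results here.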
While searching for people is still the predominant use for Facebook search, an increasing number of users are starting to use search to connect with bands, restaurants, and celebrities, and to discover applications. Additionally, a few months ago we enabled users to search through recent public content and content produced by their friends. Indexing the massive amount of content our users produce, with the ability to filter to just friends' content, required building infrastructure with its own unique and challenging problems. Having shared the ‘what’ of Facebook search, we look forward to sharing more of the ‘how’ over the next few weeks. If you’re interested in helping, check out the jobs page.
Akhil Wable, an engineer at Facebook, is still trying to figure out a good way to use ‘cache’ as a pun.