The post Toward a deeper understanding of the way AI agents see things appeared first on Facebook Code.

**HOW IT WORKS:** Our researchers trained agents on the same games used in previous research, in which a pair of agents communicate about images using a fixed-size vocabulary. Unlike in those previous studies, which suggested that the agents developed a shared understanding of what the images represented, our researchers found that they extracted no concept-level information. The paired agents could arrive at an image-based consensus based solely on low-level feature similarities, without determining, for example, that pictures of a Boston terrier and a Chihuahua both represent dogs. In fact, the agents were able to reach a consensus even when presented with similar patterns of visual noise, which included no recognizable objects.

**WHY IT MATTERS:**

Fine-tuning experimental methodologies is important for the long-term goal of creating systems that develop more natural language-based communication. This work improves our understanding of the visual semantics agents use, which allows us to design future setups in which agents have stronger reasons to develop more natural communication strategies.

**READ THE FULL PAPER:**

How agents see things: On visual representations in an emergent language game


The post Facebook London and U.K. Year of Engineering campaign focus on future of engineering appeared first on Facebook Code.

Across the world, the demand for engineering and technology professionals is rising, but the supply of skilled professionals to fill those roles has not kept pace. By 2020, the U.K. alone will create 157,000 new jobs requiring deep analytical skills. As part of our commitment to the Year of Engineering campaign and to tackle this global skills shortage head-on, we are doing the following:

- We are launching a mentoring program for students ages 15-18 that’s designed to help bridge the gap between school and the workplace. Starting in November and continuing into 2019, mentors from Facebook London will meet with students five times during the academic year. Mentors will educate and train them in the skills they need for a career in engineering and technology so they can be better prepared to enter the workforce after graduation.
- We will host an open house day on November 16 at the Facebook London office, where students will be able to meet engineers and technologists, ask questions, and see demonstrations of the products we’re building in the U.K.
- We are exhibiting at the __WorldSkills UK LIVE__ conference being held November 15-17 at the NEC in Birmingham, England. Facebook engineers will be on the Year of Engineering stand to offer attendees career advice and demonstrations of our products, focusing on future technologies such as augmented reality.
- We will produce a number of videos showcasing the variety of engineering work that Facebook is involved in.
- We will create a series of video interviews with engineers to be used as editorial content for the Year of Engineering portal and social media channels.

We are building deep relationships with organizations such as Anita Borg/Grace Hopper, SHPE, and NSBE, which support women and people of color in computer science and engineering. Internally, we are working toward building a culture of inclusion and authenticity while also supporting the next generation of engineers through our internship program. In the U.S., Facebook has launched CodeFWD, a free online program for educators and organizations to inspire interest in computer programming, and to expand underrepresented students’ access to and participation in the field.

In the past year, we have built and opened a brand-new London office and announced our plans to hire 500 additional employees, with an expected 2,300 employees in the U.K. by the end of 2018. Facebook London is our biggest engineering hub outside the United States; teams there focus on engineering and technology development for global products, including ads engineering, Workplace, community integrity, infrastructure and developer tooling, and AR/VR. The year ahead will mark a large engineering- and technology-focused recruitment drive for EMEA: We are always looking for talented engineers as we continue to grow and scale our teams covering product design, UX research, data science, machine learning, and software and hardware engineering.

Addressing the skills and diversity gaps is a long-term process, but we continue to work with government, industry leaders, nonprofit organizations, and academic institutions to find effective and impactful solutions.

For more information about the Year of Engineering campaign, visit yearofengineering.gov.uk.

To learn more about CodeFWD, visit techprep.org/codefwd.


The post Data @Scale – Boston recap appeared first on Facebook Code.

The event brought together engineering leaders from Akamai, Datadog, DataXu, Facebook, Google, HubSpot, InfluxData, OM1, Starburst, TCB Analytics, and Wayfair to discuss the challenges associated with building and operating data systems that serve millions or even billions of people. More than 300 attendees gathered to hear talks on topics such as data deletion, protecting patient privacy, machine learning on Kubernetes, and building data pipelines at scale.

Summaries and videos of some of the presentations from the event are below. If you’re interested in joining the next event, visit the @Scale website or join the @Scale community.

Philip Wickline, Chief Technology Officer at OM1

The growing availability of health care data is changing the face of clinical research and the practice of medicine. Philip explores the challenges that OM1 faces in protecting the privacy of patients while working with massive amounts of patient data. He provides an overview of the ways patient data is represented and the solutions OM1 employs for working with large-scale health care data sets while still preserving patient privacy.

Tanya Cashorali, Chief Executive Officer at TCB Analytics

Tanya walks through several big data environments that exhibit various attributes. She examines the positive and negative effects of organizational constraints around technology choices that drive decisions regarding on-premises vs. cloud-based infrastructures, different database technologies, and deployment strategies. She concludes with recommendations for achieving a balance of flexibility and control.

Ryan Betts, Director of Platform Engineering at InfluxData

Ryan presents several lessons learned while scaling InfluxDB across a large number of deployments — from single-server open source instances to highly available, high-throughput clusters. He walks through a number of failure conditions that informed subsequent design choices, discusses trade-offs between monolithic and service-oriented database implementations, and then closes with personal experiences implementing multiple-query processing systems.

Gabriela Jacques Da Silva, Software Engineer at Facebook

Donghui Zhang, Software Engineer at Facebook

The volume of data processed by Facebook’s analytics workload has been rapidly increasing, resulting in greater compute and storage demands. Gabriela and Donghui share their work using sampling as a technique to offset such demand while still providing good approximate query results. They also discuss the challenges that this poses to approximate computation, such as the need to consider uncertainty propagation when calculating aggregated metrics. Finally, they show the benefits in terms of resource consumption in both compute and storage.

Ariel Weisberg, PMC Member at Apache Cassandra

*Note: This session has no recording.*

Ariel explores the trade-offs and benefits of introducing Transient Replication, which is an adaptation of Witness Replicas, into Apache Cassandra. He starts with an overview of existing replication techniques and explains how Transient Replication and an optimization called Cheap Quorums facilitate up to 50% disk space and compute savings under non-failure conditions. He concludes with a view into mitigation techniques that enable Transient Replication to perform well even under failure conditions.

Ben Strahs, Software Engineer, Privacy & Data Use at Facebook

Deletion is critical to helping people control their data. It has unique technical challenges at scale, including managing deletion across distributed systems and building in mechanisms to confirm completeness and accuracy. In this talk, Ben shares Facebook’s Deletion Framework, a system Facebook built to automatically detect gaps, ensure completeness, and make sure the correct data is deleted.

Ben Clark, Chief Architect at Wayfair

Ben describes how Wayfair transformed its data plumbing components from a hodgepodge of underengineered components into a scalable infrastructure capable of handling the superlinear growth of its Kafka clusters and destination data stores. He outlines the factors that led to the decision to rewrite components and finishes with an overview of Tremor, a traffic shaper and router that is replacing logstash.

Michelle Casbon, Senior Engineer at Google

Michelle presents Kubeflow, a framework on Kubernetes that provides a single, unified tool for running common processes such as model training, evaluation, and serving, as well as monitoring, logging, and other operational tools. She discusses three challenges in developing ML applications: scalability, portability, and composability. She wraps up with a demo featuring a simple use case running locally, on premises, on a cloud platform, and using specialized GPU hardware.

Suchi Raman, Director, Product Development at DataXu

Suchi shares the lessons learned during DataXu’s multiyear journey to the cloud. She starts with an overview of the company’s previous on-premises analytics infrastructure and how it scaled this to its current “cloud native” warehouse architecture using Glue Data Catalog, Athena (Presto-as-a-service), and Lambdas on AWS. She wraps up with a discussion on how DataXu leveraged AWS spot instances to significantly reduce the company’s daily operational costs.

Patrick Dignan, Technical Lead at HubSpot

Patrick walks through the evolution of the Elasticsearch infrastructure at HubSpot. He starts by describing the challenge HubSpot faced when increasing the security of its data systems due to GDPR. From there, he explores the techniques used by the data infrastructure team to seamlessly migrate teams’ indices from the old cluster to the new secure cluster. By moving to a centrally managed ingestion pipeline, HubSpot significantly reduced cluster migration time and, as a result, improved testability, maintainability, and reliability.

Andrii Rosa, Software Engineer at Facebook

Matt Fuller, VP of Engineering at Starburst

Andrii and Matt present the Presto Cost-Based Optimizer (CBO), recently introduced by Starburst, along with a case study on integrating the Presto CBO at Facebook scale. Matt provides a brief overview of the Presto architecture, followed by a discussion of how joins work in Presto and how CBO can improve the efficiency of joins through automatic join type selection, automatic left/right side selection, and join reordering based on selectivity estimates and cost. In the second half of the talk, Andrii explores a new mechanism in Presto that computes statistics seamlessly and efficiently, ensuring that all Presto-generated data is ready for CBO without any extra manual steps. They then discuss future work enhancing the CBO and statistics collection in Presto.

Aniruddha Bohra, Principal Architect at Akamai

Aniruddha presents two systems — Palisade and Akamill — that were built at Akamai to address the challenges of safely collecting large volumes of data at low latency for data processing and analytics. He starts with an overview of Palisade, an overload protection system for reporting and analytics data processing infrastructures. He then discusses Akamill, a flexible stream processing system used by Palisade that provides buffering, connection pooling, and transformations to support near real-time data collection for applications. He concludes with a view into how the two systems combine to smooth out spikes, using thinning and throttling.

Jeremy Karn, Staff Data Engineer at Datadog

Jeremy walks through best practices used at Datadog, ensuring that trillions of data points are processed every day. He starts by describing the technology stack and the tools used for authoring and running Datadog’s pipelines. He then talks about the use of ephemeral clusters, the benefits of job isolation, and the ability to scale the clusters based on job needs. Finally, he discusses the mechanisms used to recover quickly in the face of unavoidable conditions, such as hardware failures, increased load, upstream delays, and bad code changes.


The post Making floating point math highly efficient for AI hardware appeared first on Facebook Code.

We have developed an alternate approach to making AI models run efficiently. Building on a lineage of ideas reaching back to the early days of computer science more than 70 years ago, our method optimizes floating point itself.

We have made radical changes to floating point to make it as much as 16 percent more efficient than int8/32 math. Our approach is still highly accurate for convolutional neural networks, and it offers several additional benefits:

- Our technique can improve the speed of AI research and development. When applied to higher-precision floating point used in AI model training, it is as much as 69 percent more efficient.
- Today, models are typically trained using floating point, but then they must be converted to a more efficient quantized format that can be deployed to production. With our approach, nothing needs to be retrained or relearned to deploy a model. AI developers can thus deploy efficient new models more easily.
- Integer quantization schemes today are growing ever more complicated and in some cases might be “overfitting” on a particular task (and thereby not retaining their general-purpose application). An efficient, general-purpose floating point arithmetic that preserves accuracy can avoid this issue.

Our techniques are discussed in detail in the research paper “Rethinking floating point for deep learning.” It will take time to develop new chips designed to perform floating point math with these techniques, but the potential benefits include faster AI computation in data centers, lower-power designs for better AI on mobile devices, and simpler, faster ways to achieve performance goals with fewer software changes. With the slowdown of Moore’s Law, the age of “dark silicon” is at hand. Continued performance gains will require rethinking low-level hardware design decisions made decades ago, such as the IEEE 754 floating point standard, and embracing mathematical approximation where applicable. Neural networks in particular provide an excellent opportunity for this reevaluation, as they are quite tolerant of variation and experimentation.

Our hardware designs for ASIC/FPGA and C++/PyTorch code for its evaluation are now publicly available to the AI community. We hope that the AI community will join us in exploring this new approach.

Engineers who work in other fields may not be familiar with how traditional floating point compares with our alternatives, so a brief summary may be helpful. Floating point can represent both large and small real numbers in a reasonable amount of computer storage, using a system broadly similar to scientific notation. This format can represent values such as 1,000,000 and 0.0625 in a fixed-width encoding and radix (typically binary). It is important to note that floating point can precisely represent only a limited set of real numbers, as we have a limited number of bits; all other values must be rounded, in one of several ways, to a nearest representable floating point value.

A traditional binary floating point format has a sign, a significand, and an exponent. A sign bit indicates whether the number is positive or negative. The significand (whose fractional part is commonly known as the mantissa) is a binary fixed point number of the form 0.bbb… or 1.bbb…, where the fractional part bbb… is represented by some fixed number of binary bits after the radix point. (In decimal arithmetic, the radix point is also known as the decimal point, separating integral from fractional values.) The exponent is a signed integer that represents multiplication of the significand by a power of 2. A significand with a leading binary 1 (1.bbb…) is known as normal, whereas one with a leading binary 0 (0.bbb…) is denormal. The IEEE 754 floating point standard, common in most modern-day computers, has both normal and denormal significands. The leading digit of the significand need not be explicitly stored; in IEEE 754, the exponent field determines whether it is 1 or 0.

This graphic shows an encoding of -1.625 in 16-bit IEEE 754 binary16 half-precision floating point, with a fixed-size 5-bit exponent and a 10-bit significand fraction. The IEEE 754 exponent is stored with a bias of 15 added to it, so the encoded exponent 15 below actually represents (15 – 15), or 0.
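That encoding can be checked in a few lines of code. The sketch below (our own helper, not code from the paper) decodes a binary16 bit pattern field by field, then cross-checks the -1.625 example against Python’s built-in half-precision support:

```python
import struct

def decode_binary16(bits: int) -> float:
    """Decode a 16-bit IEEE 754 binary16 bit pattern field by field."""
    sign = (bits >> 15) & 0x1
    exponent = (bits >> 10) & 0x1F      # 5-bit biased exponent
    fraction = bits & 0x3FF             # 10-bit significand fraction
    if exponent == 0:                   # denormal: leading 0, exponent -14
        value = (fraction / 1024) * 2.0 ** -14
    elif exponent == 0x1F:              # all-ones exponent: infinity or NaN
        value = float("inf") if fraction == 0 else float("nan")
    else:                               # normal: implicit leading 1
        value = (1 + fraction / 1024) * 2.0 ** (exponent - 15)
    return -value if sign else value

# -1.625 = -1.1010000000_2 x 2^0: sign 1, biased exponent 15, fraction 0b1010000000
assert decode_binary16(0b1_01111_1010000000) == -1.625
# Cross-check against the struct module's native binary16 decoder
assert struct.unpack("<e", (0b1_01111_1010000000).to_bytes(2, "little"))[0] == -1.625
```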

The neural networks that power many AI systems are usually trained using 32-bit IEEE 754 binary32 single precision floating point. Reduction to 16 bits (half precision or formats such as bfloat16) yields some performance gains, but it still pales in comparison to the efficiency of equivalent bit width integer arithmetic. These floating point variants can use the original 32-bit floating point neural network data quite readily, but integer quantization to 8 (or fewer) bits often needs learned quantization parameters and model retraining. Many int8/32 quantization schemes can work as accurately as the original floating point model, but they might also be overfitting on the task at hand, unable to retain their accuracy when tested on tasks other than the ImageNet validation set.

But there are a variety of alternatives to integer, fixed point, or floating point for computer arithmetic as practiced today. Some of these methods reach back to the 1950s:

- Nonlinear significand maps (logarithmic number systems, Kingsbury and Rayner 1971; reciprocal closure, Gustafson 2015)
- Binary stochastic numbers (von Neumann 1952, Gaines 1969)
- Entropy coding (tapered floating point, Morris 1971; posits, Gustafson and Yonemoto 2017)

We’ve used this line of ideas to produce a floating point arithmetic that can outperform int8/32. Our implementation is quite different from floating point as seen today in hardware, even with variations such as denormal flush-to-zero or word size/field bit width changes such as bfloat16 or minifloat. Unlike int8/32 quantization, our implementation is still a general-purpose floating point arithmetic, with results interpretable out of the box.

To develop a new method for highly efficient floating point, we considered various sources of hardware floating point inefficiency:

**Large word size**: Much compute energy is spent moving data: external DRAM to internal SRAM, SRAM to register, or register to register (flip-flops). The larger the floating point word size, the more energy is spent.

**General fixed point machinery**: Significands are fixed point, and fixed point adders, multipliers, and dividers on these are needed for arithmetic operations. The greater the precision (significand length) of the floating point type, the larger these components will be. Hardware multipliers and dividers are usually much more resource-intensive (chip area, power, and latency) than hardware adders.

**General floating point machinery**: This handles the “floating” of the radix point and is thus integral to a floating point representation. Examples are leading zero (LZ) counters for renormalization, shifters for significand alignment, and rounding logic. Floating point precision also dominates the hardware resources used for this machinery.

**IEEE 754-specific machinery**: This provides denormal support for gradual underflow as implemented in the IEEE 754 standard, with additional shifter, LZ counter, and other modifications needed for significand renormalization. Denormal handling adds complexity and overhead to most floating point operations.

Shrinking word size provides an obvious energy advantage. We can try compressing 32-bit data into 8 or 16 bits. A typical floating point fixed-size field encoding forces difficult choices to be made for reducing dynamic range (exponent) and precision (significand), when what we need is some preservation of both.

We can handle this trade-off differently. Floating point is itself a quantization of (infinite precision) real numbers. A quantizer adapted to the seen data distribution has less reproduction error. We typically don’t have much prior knowledge about the data distributions encountered on a general-purpose computer. Neural network distributions, however, are near Gaussian in practice, sometimes further controlled by procedures such as batch normalization. Standard floating point keeps as much significand precision at 10^5 as at 10^-5, but most neural networks perform their calculations in a relatively small range, such as -10.0 to 10.0. Tiny numbers in this range (for example, 0.0001) are frequently used, but not large ones. Ideally, we could change the quantizer to give higher precision where we need it and keep some dynamic range for small numbers.

Tapered floating point can let us achieve these goals and reduce word size. Gustafson’s posit is an excellent form of tapering. Posits encode the exponent in a variable number of bits using a prefix-free code, with the significand fraction occupying the rest. It maximizes precision around +/-1.0, with less precision toward 0 or +/-infinity. It is both lossy compression and expansion, losing precision in some places to preserve dynamic range elsewhere. It can thus give both higher precision (in certain places) and greater dynamic range than could be the case with IEEE-style floating point. The posit idea can be extended to other prefix-free codes, such as Huffman coding, when we don’t know the data distribution up front.
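To make the tapering concrete, here is a minimal decoder for an 8-bit posit with one exponent bit, written for illustration from Gustafson and Yonemoto’s definition; the helper name and test values are ours, not code from the paper:

```python
def decode_posit8(bits: int, es: int = 1) -> float:
    """Decode an 8-bit posit: sign, run-length regime, es exponent bits, fraction."""
    n = 8
    if bits == 0:
        return 0.0
    sign = 1
    if bits >> (n - 1):                  # negative: take the two's complement
        sign = -1
        bits = (-bits) & ((1 << n) - 1)
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]   # bits after sign
    r = body[0]                          # regime: run of identical bits...
    m = 1
    while m < len(body) and body[m] == r:
        m += 1
    k = (m - 1) if r == 1 else -m        # ...whose length encodes a power of useed
    rest = body[m + 1:]                  # skip the regime terminator bit
    e = 0
    for _ in range(es):                  # es exponent bits (may be truncated away)
        e = (e << 1) | (rest.pop(0) if rest else 0)
    f, nf = 0, len(rest)                 # whatever remains is the fraction
    for b in rest:
        f = (f << 1) | b
    useed = 2 ** (2 ** es)               # 4 when es = 1
    return sign * useed ** k * 2 ** e * (1 + (f / (1 << nf) if nf else 0))

assert decode_posit8(0b01000000) == 1.0    # shortest regime: max fraction bits
assert decode_posit8(0b01010000) == 2.0
assert decode_posit8(0b01100000) == 4.0    # longer regime: fewer fraction bits
assert decode_posit8(0b00100000) == 0.25
```

Note the taper: values near 1.0 keep the most fraction bits, while a longer regime run buys dynamic range at the cost of precision.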

It is possible to avoid multipliers and dividers for operating on significands. A significand can be considered generally as a fraction map f(x), mapping a fixed point value x in [0, 1) to [1, 2). (This approach was detailed in Lindstrom et al. 2018.) In typical normalized floating point, f(x) is the affine function 1+x (which we’ll call a linear domain number).

When f(x) = 2^x, we have the logarithmic number system (LNS), in which multiplication and division turn into addition and subtraction. LNS addition, though, requires huge hardware lookup tables to compute the sum or difference of two log domain numbers. This has been one of the main problems with LNS adoption, as these tables can be more cumbersome than hardware multipliers. Note that typical floating point is already a combination of logarithmic (exponent) and linear (significand) representations, but the LNS representation is fully logarithmic.
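The core LNS property is easy to demonstrate in a toy sketch (this models the arithmetic idea, not the hardware representation):

```python
import math

# In a logarithmic number system, a positive value v is stored as log2(v),
# so multiplication and division reduce to addition and subtraction.
def lns_encode(v: float) -> float:
    return math.log2(v)

def lns_decode(l: float) -> float:
    return 2.0 ** l

a, b = lns_encode(6.0), lns_encode(0.25)
product = lns_decode(a + b)    # multiply = add in the log domain
quotient = lns_decode(a - b)   # divide   = subtract in the log domain
assert abs(product - 1.5) < 1e-12    # 6.0 * 0.25
assert abs(quotient - 24.0) < 1e-12  # 6.0 / 0.25
```

Addition of two log domain numbers has no such shortcut, which is exactly why LNS adders need the large lookup tables described above.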

A useful operation in computer linear algebra is multiply-add: calculating the sum of a value *c* with a product of other values *a x b* to produce *c + a x b*. Typically, thousands of such products may be summed in a single accumulator for a model such as ResNet-50, with many millions of independent accumulations when running a model in deployment, and quadrillions of these for training models.

Floating point fused multiply-add (FMA) is a common means of multiply-add with reduced error, but it is much more complicated than a standard floating point adder or multiplier. A technique known as Kulisch accumulation can avoid FMA complexity. A similar operation was used in the first programmable digital computer, Konrad Zuse’s Z3 from 1941. Gustafson has also proposed standard usage of Kulisch accumulation in his recent floating point studies. The idea is not to accumulate in floating point but instead to maintain a running sum in fixed point, large enough to avoid underflow or overflow. Unlike floating point addition, Kulisch accumulation exactly represents the sum of any number of floating point values. The summation is associative and reproducible regardless of order. When done with all sums, we convert back to floating point by significand alignment and rounding.

The diagram below shows an example accumulation step. A Kulisch accumulator currently contains the value 35.5, and we are adding 0.84375 into it, represented as a linear domain floating point value. This floating point value being summed may have come previously from a product of scalar values or just a single value that we wish to accumulate. The floating point value is converted to fixed point by aligning the significand’s radix point based on the floating point exponent. This conversion uses an adjustment factor that is the effective exponent of the accumulator’s most significant bit (6 in our example). The aligned significand and accumulator are then summed together with carry. (For simplicity, we have omitted additional bits of precision that a Kulisch accumulator may have to support underflow and overflow.) Kulisch accumulation is costly in 32+ bit floating point, as the accumulator, shifter, and adder may be 500+ bits in size, but it is quite practical for smaller types.
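The same accumulation step can be sketched in software. This is a simplified model with widths of our own choosing, not the hardware design:

```python
# Simplified model of Kulisch accumulation: the running sum lives in one wide
# fixed-point integer, so every addition is exact, associative, and
# reproducible regardless of order.
FRACTION_BITS = 20  # low-order bits reserved below the binary point (toy width)

def to_fixed(x: float) -> int:
    """Align a binary floating point value onto the accumulator's fixed-point grid."""
    return int(x * (1 << FRACTION_BITS))

# The step from the diagram: the accumulator holds 35.5 and we add 0.84375
acc = to_fixed(35.5)
acc += to_fixed(0.84375)
assert acc / (1 << FRACTION_BITS) == 36.34375   # exact, no rounding occurred

# Integer addition is associative, so summation order cannot change the result
vals = [0.84375, 35.5, -0.25, 1.0 / 1024]
assert sum(to_fixed(v) for v in vals) == sum(to_fixed(v) for v in reversed(vals))
```

In real hardware the accumulator must be wide enough to cover the full exponent range of the input format, which is why the technique is practical only for small floating point types.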

Kulisch accumulation cannot be used directly for log domain summation. But just as Kulisch accumulation performs the sum in a different form (fixed point) than that of the arguments (floating point), we can take a similar approach here, so we don’t need a huge LNS sum/difference lookup table. We can approximate log values in the linear domain, Kulisch accumulate in the linear domain, and then convert back to log domain when all sums are complete. This strategy works very well for general linear algebra, as vector inner product requires many repeated sums in an accumulator.

The posit encoding that was useful for word size reduction also avoids the cost of denormal handling, as the posit significand is always normalized. Gradual underflow lets precision fall off gradually rather than immediately; IEEE 754 implements it in the denormal representation via the location of the leading one in the significand fraction. Posit tapering toward smaller numbers instead spends significand fraction bits on the exponent, extending the dynamic range while reducing the precision. Posit tapering is thus functionally similar to denormal gradual underflow, but with no overhead for renormalizing the significand. Gradual overflow is likewise supported in a similar manner with tapering.

To achieve our performance gains, we combine these four techniques. A log domain representation avoids hardware multipliers. We repurpose posit encoding for log numbers. To compete against int8/32, we consider an 8-bit format called (8, 1, alpha, beta, gamma) log. (8, 1) are the posit parameters. This encoding gives a more than 16 million to 1 ratio between our largest and smallest positive values while preserving 4 bits of (log domain) precision around 1.0, all in 8 bits (only 256 possible values). The alpha, beta, and gamma values control log-to-linear and linear-to-log conversion accuracy.

As noted above, we perform log domain sums in the linear domain. The log-to-linear conversion is approximate, as is the return linear-to-log conversion, but the log domain multiplication is exact, as are all linear domain sums; unlike FMA, Kulisch accumulation introduces no linear domain error across sequential sums. We call this technique ELMA, or exact log-linear multiply-add. The trade-off is quite acceptable in practice.
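A toy software model may make the flow concrete. The table size and bit widths below are illustrative stand-ins for the alpha, beta, and gamma parameters, not the values used in the paper, and the final linear-to-log step is omitted:

```python
import math

# Toy model of ELMA-style multiply-add: multiply exactly in the log domain,
# convert each product to linear fixed point through a small lookup table,
# and sum exactly in a Kulisch-style accumulator.
FRAC_BITS = 4                  # fractional bits kept for base-2 logs (toy)
ACC_BITS = 16                  # fixed-point fraction bits in the accumulator

# log->linear lookup: 2**(frac/16) in fixed point, for each fractional log value
POW2_TABLE = [round(2 ** (i / (1 << FRAC_BITS)) * (1 << ACC_BITS))
              for i in range(1 << FRAC_BITS)]

def to_log(v: float) -> int:
    """Quantize log2(v) onto the toy log grid."""
    return round(math.log2(v) * (1 << FRAC_BITS))

def log_to_linear_fixed(l: int) -> int:
    """Approximate 2**(l/16) as a fixed-point integer via the lookup table."""
    i, frac = divmod(l, 1 << FRAC_BITS)   # integer and fractional log parts
    mant = POW2_TABLE[frac]
    return mant << i if i >= 0 else mant >> -i

# Inner product: each multiply is an exact log domain addition; each sum is
# an exact fixed-point addition in the accumulator.
acc = 0
for a, b in [(2.0, 4.0), (0.5, 8.0), (1.0, 1.0)]:
    acc += log_to_linear_fixed(to_log(a) + to_log(b))
assert acc / (1 << ACC_BITS) == 13.0   # 2*4 + 0.5*8 + 1*1
```

Powers of two hit the lookup table exactly, so this example is exact end to end; general values incur only the log/linear conversion error discussed above.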

Hardware lookup tables are used for the conversions, but they are much smaller than those required for LNS addition. Larger alpha, beta, and gamma parameters will yield more exact results, but also consume more chip area and power.

Compared with floating point FMA, the ELMA multiply-add circuit at its core is simple. Three adders, a lookup table, and a shifter do most of the work:

Unlike int8/32, our 8-bit log format for neural networks does not require learning quantization parameters, activation sampling, or retraining of the original network. We simply take the 32-bit floating point parameters of a network such as ResNet-50 and convert them using round-to-nearest-even. Usage of posit encoding preserves both the needed dynamic range and precision in such a small type.
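Round-to-nearest-even can be illustrated on a toy quantization grid (a hypothetical helper for illustration only; the real conversion rounds within the target format’s own significand):

```python
import math

# Round-to-nearest-even onto a coarse grid: ordinary nearest rounding, except
# that exact ties go to the neighbor whose grid index is even, avoiding the
# systematic upward bias of round-half-up.
def round_half_to_even(x: float, step: float) -> float:
    q = x / step
    lo = math.floor(q)
    frac = q - lo
    if frac > 0.5:
        lo += 1
    elif frac == 0.5 and lo % 2:   # exact tie: choose the even neighbor
        lo += 1
    return lo * step

assert round_half_to_even(0.375, 0.25) == 0.5    # tie 1.5 rounds up to even 2
assert round_half_to_even(0.625, 0.25) == 0.5    # tie 2.5 rounds down to even 2
assert round_half_to_even(0.30, 0.25) == 0.25    # ordinary nearest rounding
```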

Using (8, 1, 5, 5, 7) log with ELMA in the same manner as original ResNet-50 math, we achieved 75.23 percent top-1 and 92.66 percent top-5 accuracy on the ImageNet validation set, a loss of 0.9 percent and 0.2 percent, respectively, from the original. These results are similar to those of many existing int8/32 quantization methods. It is possible that the fine-tuning training and model tweaks used in int8/32 quantization can further improve our method’s performance, but our baseline result is achieved with minimal software effort. All math is still performed in a general-purpose floating point arithmetic, using compressed encoding as our quantizer. Our design with ELMA can also be used for nonlinear algebra tasks such as polynomial evaluation.

Using a commercially available 28-nanometer ASIC process technology, we have profiled (8, 1, 5, 5, 7) log ELMA as 0.96x the power of int8/32 multiply-add for a standalone processing element (PE). In a full 32×32 systolic array for matrix multiplication, the log ELMA PE formulation is 0.865x the power of the int8/32 PE version. The power savings largely comes from eliminating hardware multipliers.

Extended to 16 bits (and even without denormal support, a significant source of inefficiency in IEEE 754), this method uses 0.59x the power and 0.68x the area of IEEE 754 half-precision FMA, with reduced latency. These gains at 16 bits can be leveraged to train more complex AI models in the same amount of time. Against 32-bit IEEE 754 single-precision FMA, however, ELMA will not be effective, as the Kulisch accumulator becomes massive (increasing adder/shifter sizes and flip-flop power) and the log-to-linear lookup table becomes prohibitive.

Realizing the promise of AI requires significant efficiency gains that we can achieve only with new approaches, not just building on old ones. For example, software emulation is often too slow to effectively test new arithmetic designs on cutting-edge AI models. It is unfortunately more difficult to perform experiments in FPGA/ASIC hardware than software, leaving the universe of these potential gains largely underexplored. If, however, new hardware is developed to harness these techniques, it could benefit a wide range of AI research and applications.

We plan to investigate 16-bit ELMA designs in hardware and compare their behavior with IEEE 754 half-precision floating point and bfloat16 for AI model training and other tasks. These alternative ideas and numerical approximations are not always applicable, but AI provides a unique opportunity to explore their boundaries and help overturn old notions of what is possible in hardware.


The post Open-sourcing FBGEMM for state-of-the-art server-side inference appeared first on Facebook Code.

To enable large-scale production servers to run the newest, most powerful deep learning models efficiently, we have created FBGEMM, a low-precision, high-performance matrix-matrix multiplication and convolution library. FBGEMM is optimized for server-side inference, and unlike previously available alternatives, it delivers both accuracy and efficiency when performing quantized inference using contemporary deep learning frameworks. With this library, we have achieved greater than 2x performance gains on the current generation of CPUs with respect to our current production baseline.

We are now open-sourcing FBGEMM to provide other engineers with all the fundamental building blocks for performing efficient low-precision inference, packaged in a convenient single library. You can deploy it now using the Caffe2 front end, and it will soon be callable directly from the PyTorch 1.0 Python front end.

Together with QNNPACK, a new library for mobile devices that we open-sourced last week, engineers now have comprehensive support for quantized inference as part of the PyTorch 1.0 platform.

FBGEMM offers several key features:

- It is specifically optimized for low-precision data, unlike the conventional linear algebra libraries used in scientific computing (which work with FP32 or FP64 precision).
- It provides efficient low-precision general matrix-matrix multiplication (GEMM) for small batch sizes and support for accuracy-loss-minimizing techniques such as row-wise quantization and outlier-aware quantization.
- It also exploits fusion opportunities to overcome the unique challenges of matrix multiplication at lower precision with bandwidth-bound pre- and post-GEMM operations.

FBGEMM has been deployed at scale here at Facebook, where it has benefited many AI services, end to end, including speeding up English-to-Spanish translations by 1.3x, reducing DRAM bandwidth usage in our recommendation system used in feeds by 40%, and speeding up character detection by 2.4x in Rosetta, our machine learning system for understanding text in images and videos. Rosetta is used by many teams across Facebook and Instagram for a wide variety of use cases, including automatically identifying content that violates our policies, more accurately classifying photos, and surfacing more-personalized content for people using our products.

Fully connected (FC) operators are the biggest consumers of floating point operations (FLOPs) in the deep learning models deployed in Facebook’s data centers. We performed data-center-wide profiling for FLOPs usage in representative models running in production here at Facebook. The pie chart below shows the distribution of the deep learning inference FLOPs in our data centers measured over a 24-hour period.

FC operators are just plain GEMM, so overall efficiency directly depends on GEMM efficiency. Many deep learning frameworks implement convolution as im2col followed by GEMM, because performant GEMM implementations are readily available in linear algebra libraries from the high-performance computing (HPC) domain. But straightforward im2col adds overhead from the copy and replication of input data, so some deep learning libraries also implement direct (im2col-free) convolution for improved efficiency. As explained in more detail below, we provide a way to fuse im2col with the main GEMM kernel to minimize im2col overhead. The high-performance GEMM kernel is a critical part, but it’s not the only one. In general, there is a mismatch between what HPC libraries provide and the requirements of deep learning inference. HPC libraries usually do not support quantized GEMM-related operations efficiently. They are not optimized for shapes and sizes of matrices common in deep learning inference. And they do not take advantage of the constant nature of the weight matrix.

Deep learning models have typically used FP32 data types for representing activations and weights, but computations with mixed-precision data types (8-bit or 16-bit integers, FP16, etc.) are generally much more efficient. Recent industry and research work has shown that inference using mixed precision works well without adversely affecting accuracy. FBGEMM uses this alternative strategy and improves inference performance with quantized models. Furthermore, newer generations of GPUs, CPUs, and specialized tensor processors natively support lower-precision compute primitives, such as FP16/INT8 in Nvidia tensor cores or INT8 in Google processors, so the deep learning community is moving toward low-precision models. This movement indicates that quantized inference is a step in the right direction, and FBGEMM provides a way to perform efficient quantized inference on current and upcoming generations of CPUs.

Implementing high-accuracy, low-precision inference is essential for optimizing deep learning models. In developing FBGEMM, we used a quantization strategy similar to the one described in detail in this paper. Each value in a matrix is quantized with the help of a scale factor and a zero point in an affine way, so computations in the quantized domain map directly to computations in the real domain. These scale and zero-point values are shared among multiple entries in the matrix (e.g., all rows may have the same scale and zero point). In the equation below, A is the real-valued matrix and Aq is the quantized matrix; a_scale is a real-valued constant, and a_zero_point is a constant in the quantized domain.
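The equation itself appears as an image in the original post; a standard affine-quantization relation consistent with the description above (a reconstruction, not the post's exact rendering) is:

```latex
A \approx a_{\mathrm{scale}} \left( A_q - a_{\mathrm{zero\_point}} \right),
\qquad
A_q = \mathrm{round}\!\left( \frac{A}{a_{\mathrm{scale}}} \right) + a_{\mathrm{zero\_point}}
```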

With this quantization framework, we can represent matrix-matrix multiplications in the quantized domain as follows:
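The multiplication formula is also an image in the original post; expanding C = AB with both matrices in the affine form above gives the following reconstruction (same notation, our derivation rather than the post's exact rendering):

```latex
C_{ij} = a_{\mathrm{scale}} \, b_{\mathrm{scale}} \left[
    \sum_{k=1}^{K} A_{q,ik} B_{q,kj}
  \;-\; b_{\mathrm{zero\_point}} \sum_{k=1}^{K} A_{q,ik}
  \;-\; a_{\mathrm{zero\_point}} \sum_{k=1}^{K} B_{q,kj}
  \;+\; K \, a_{\mathrm{zero\_point}} \, b_{\mathrm{zero\_point}}
\right]
```

Here the second sum is the row offset of A, the third is the column offset of B, and the final term is the constant factor.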

It is important to note several details:

1) Each output value (i, j) in the C matrix requires the sum of the ith row of the A matrix (row offset), the sum of the jth column of the B matrix (column offset), and a constant factor, in addition to the dot product.

2) If one of the matrices is constant, the constant factor computations can be combined with row (or column) offsets calculations for that matrix. These offsets are used later during the requantization step.

3) Dot product results are accumulated into higher precision and are scaled back to lower precision for the next layer. We call this process requantization.

These background details highlight that when we perform low-precision GEMM, there are other operations around it that are equally important for overall efficiency. If these extra operations (such as row offset calculation or post-accumulation quantization) are not performed carefully along with low-precision GEMM, they can offset the gains of working at lower precision.
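To make these details concrete, here is a hypothetical scalar reference implementation (not FBGEMM's actual kernel; the function name, signature, and layout are illustrative only) of a quantized GEMM with row/column offsets and requantization:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical scalar reference: uint8 activations (A) times int8 weights (B),
// accumulated into INT32, then requantized back to uint8 for the next layer.
std::vector<uint8_t> quantizedGemm(
    const std::vector<uint8_t>& Aq, const std::vector<int8_t>& Bq,
    int M, int N, int K,
    int32_t a_zp, int32_t b_zp, float a_scale, float b_scale,
    float c_scale, int32_t c_zp) {
  std::vector<uint8_t> Cq(M * N);
  for (int i = 0; i < M; ++i) {
    // Row offset: sum of the ith row of Aq, computed once per row.
    int32_t row_offset = 0;
    for (int k = 0; k < K; ++k) row_offset += Aq[i * K + k];
    for (int j = 0; j < N; ++j) {
      int32_t col_offset = 0;  // sum of the jth column of Bq
      int32_t acc = 0;         // dot product accumulated in higher precision
      for (int k = 0; k < K; ++k) {
        acc += static_cast<int32_t>(Aq[i * K + k]) * Bq[k * N + j];
        col_offset += Bq[k * N + j];
      }
      // Subtract offset terms and add the constant factor.
      acc -= b_zp * row_offset + a_zp * col_offset - K * a_zp * b_zp;
      // Requantization: scale back to low precision for the next layer.
      float c_real = a_scale * b_scale * acc;
      int32_t q = c_zp + static_cast<int32_t>(std::lround(c_real / c_scale));
      Cq[i * N + j] = static_cast<uint8_t>(std::clamp(q, 0, 255));
    }
  }
  return Cq;
}
```

In production, the row/column offsets would be computed during packing (and once per constant weight matrix) rather than in the inner loop, which is exactly the fusion opportunity discussed below.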

FBGEMM is distinct from other libraries in several ways: It combines small compute with bandwidth-bound operations. It exploits cache locality by fusing post-GEMM operations with macro kernel and provides support for accuracy-loss-reducing operations. And it supplies modular building blocks to construct an overall GEMM pipeline as needed by plugging and playing different front-end and back-end components.

A key ingredient of FBGEMM is performant low-precision GEMM, which we have implemented using an approach similar to the one taken by other research works (Goto et al. and BLIS framework) targeting FP32 and FP64 data types but not low-precision. The following sample code shows a typical way of implementing high-performance GEMM on modern CPU architectures. Here M, N, and K are standard matrix dimensions: A is an MxK matrix, B is a KxN matrix, and C is an MxN matrix. MCB, NCB, KCB, MR, and NR are target-specific constants, and their values depend on available caches and registers on a given CPU. (CB refers to cache block and R refers to register.) The naive three-loop matrix-matrix multiplication is converted into the following five loops around a microkernel for an implementation that works well with a CPU memory hierarchy with multilevel caches and vector registers.

```
Loop1 for ic = 0 to M-1 in steps of MCB
Loop2   for kc = 0 to K-1 in steps of KCB
          // Pack MCBxKCB block of A
Loop3     for jc = 0 to N-1 in steps of NCB
            // Pack KCBxNCB block of B
            // -------------------- Macro Kernel --------------------
Loop4       for ir = 0 to MCB-1 in steps of MR
Loop5         for jr = 0 to NCB-1 in steps of NR
                // -------------------- Micro Kernel --------------------
Loop6           for k = 0 to KCB-1 in steps of 1
                  // update MRxNR block of C matrix
```

As shown in this example, high-performance GEMM implementations work by packing currently used blocks of the A and B matrices into smaller chunks that are accessed sequentially in the innermost microkernel. “Packing” here refers to reorganization of matrix data into another array such that the access pattern in the inner kernel of the optimized implementation is sequential. Sequential access of data in the inner kernel is important for achieving high effective bandwidth on modern hardware architectures. Packing is a bandwidth-bound operation because it only reads and writes data. So if we can combine a small compute operation with the bandwidth-bound packing operation, the compute cost is overlapped, and the overall packing time remains the same.

We take advantage of the bandwidth-bound nature of packing routines and combine simple compute operations with packing. The figure below shows various packing routines that we have implemented so far. For example, PackingWithRowOffset performs row offset calculations while reorganizing the data in the necessary format for the inner kernel. The row offsets are calculated only for the block that is currently getting packed, i.e., the MCBxKCB block. These row offsets are used later in the post-GEMM requantization pipeline. The advantage of calculating row offsets while packing is that we do not need to make two passes over the A matrix data, thereby avoiding moving data multiple times to the CPU and also avoiding cache pollution. Newer packing routines can also be added while reusing the rest of the flow.
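As a sketch of this idea (assuming a simple row-major packed layout rather than FBGEMM's actual kernel-specific layout, and with hypothetical names), fusing row-offset computation into the packing pass might look like:

```cpp
#include <cstdint>
#include <vector>

// While copying an MCB x KCB block of A into a contiguous buffer (so the
// inner kernel reads it sequentially), accumulate each row's sum in the
// same pass, so the A data is read from memory only once.
struct PackedBlock {
  std::vector<uint8_t> buf;        // repacked, sequentially accessed data
  std::vector<int32_t> rowOffsets; // per-row sums, used during requantization
};

PackedBlock packAWithRowOffsets(const uint8_t* A, int lda, int mcb, int kcb) {
  PackedBlock p;
  p.buf.resize(mcb * kcb);
  p.rowOffsets.assign(mcb, 0);
  for (int i = 0; i < mcb; ++i) {
    for (int k = 0; k < kcb; ++k) {
      uint8_t v = A[i * lda + k];
      p.buf[i * kcb + k] = v;  // real FBGEMM uses a kernel-specific layout
      p.rowOffsets[i] += v;    // fused compute: row offset almost for free
    }
  }
  return p;
}
```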

It’s important to note that one of the matrices in GEMM for inference is the weight matrix and is constant during inference. We can therefore prepack it once and use it multiple times for different activations, avoiding the cost of repacking (shown inside Loop3 in the code above). The relative cost of packing the weight matrix can be significant if the activation matrix is small. But this cost must be paid by general GEMM implementations not specifically designed for the case when one of the matrices is constant.

FBGEMM is designed from the ground up with these requirements in mind. It allows us to use prepacked matrices, which avoids large internal memory allocations, and it allows fusion of post-GEMM operations such as nonlinearities, bias addition, and requantization. The FBGEMM library targets quantization to 8-bit integers because our target hardware efficiently supports 8-bit integer operations.

The diagram above shows how we can combine different packing methods for A and B matrices while keeping the core computations the same, and then construct a pipeline of output operations. FBGEMM implementation allows you to construct a pipeline by picking any packing routine for A, any packing routine for B, any of the core kernels (accumulation into INT16, INT32, or FP32), and any combination of post-GEMM operations. The design is extensible, and newer packing or post operations can be added into the FBGEMM library as needed. The gemmlowp library also allows composing the core kernel with a post-GEMM operation called output pipeline, but FBGEMM extends it to input packing.

Typically, GEMM libraries from the HPC domain are optimized for large matrices that are square or almost square. For matrix multiplications in networks such as Faster-RCNN (which is used in Rosetta), ResNet-50, speech recognition, and neural machine translation models, the most commonly occurring shapes are shown in the figure below.

Each bubble represents typical M, N, and K dimensions for matrix-matrix multiplications. The size of the bubble is proportional to the K value. As is clear from the figure, matrices come in all shapes and sizes. M is sometimes very small, and at other times N is very small. We need efficient implementations for all of these cases.

For its inner kernels, FBGEMM takes a “one size doesn’t fit all” approach: the implementation dynamically generates efficient, matrix-shape-specific vectorized code. For example, if we see at runtime that M is 1, we query whether an efficient kernel exists for M = 1 and use it if so. If not, we generate that kernel, store it in the kernel cache, and use it. We need to carefully craft only a few kernels and then map other matrix dimensions to them.
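A minimal sketch of this shape-specialized dispatch, with a plain lambda standing in for the JIT-generated vectorized code (all names are hypothetical, not FBGEMM's API):

```cpp
#include <functional>
#include <map>
#include <tuple>

// Look up a kernel for the runtime shape; "generate" and cache it on a miss.
// FBGEMM emits shape-specific assembly at this point; the generator below is
// just an interpreter-style stand-in.
using Kernel = std::function<void(const float*, const float*, float*)>;
using ShapeKey = std::tuple<int, int, int>;  // (M, N, K)

class KernelCache {
 public:
  Kernel& get(int m, int n, int k) {
    ShapeKey key{m, n, k};
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      it = cache_.emplace(key, generate(m, n, k)).first;  // codegen on miss
    }
    return it->second;
  }

 private:
  static Kernel generate(int m, int n, int k) {
    // Stand-in for runtime code generation of an (m x k) * (k x n) kernel.
    return [m, n, k](const float* A, const float* B, float* C) {
      for (int i = 0; i < m; ++i)
        for (int j = 0; j < n; ++j) {
          float acc = 0;
          for (int p = 0; p < k; ++p) acc += A[i * k + p] * B[p * n + j];
          C[i * n + j] = acc;
        }
    };
  }
  std::map<ShapeKey, Kernel> cache_;
};
```

Subsequent calls with the same shape hit the cache, so code generation cost is paid once per shape rather than once per call.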

Overall, the optimized loop structure for our implementation looks as follows:

```
Loop1 for ic = 0 to M-1 in steps of MCB
Loop2   for kc = 0 to K-1 in steps of KCB
          // Pack MCBxKCB block of A
Loop3     for jc = 0 to N-1 in steps of NCB
            // -------------------- Inner Kernel --------------------
            // Dynamically generated inner kernel
            // (Loop4 and Loop5 are in assembly)
```

FBGEMM is a C++ library, and the following code listing shows the GEMM interface that it exposes. The flexible interface is implemented with the help of C++ templates.

```cpp
template<
    typename packingAMatrix,
    typename packingBMatrix,
    typename cT,
    typename processOutputType>
void fbgemmPacked(
    PackMatrix<packingAMatrix,
               typename packingAMatrix::inpType,
               typename packingAMatrix::accType>& packA,
    PackMatrix<packingBMatrix,
               typename packingBMatrix::inpType,
               typename packingBMatrix::accType>& packB,
    cT* C,
    void* C_buffer,
    std::int32_t ldc,
    const processOutputType& outProcess,
    int thread_id,
    int num_threads);
```

The interface is specifically designed to support optimized quantized inference and fusion of post-GEMM operations. The parameters `packA` and `packB` provide packing routines for the current block. Because FBGEMM is targeted toward inference, we assume that the B matrix is already packed (i.e., the `packB.pack` function is never called). The next three arguments are related to the C matrix: `C` is the pointer to the C matrix itself; `C_buffer` is the pointer to a preallocated buffer used to store intermediate 32-bit integer or FP32 values; and `ldc` is the standard leading dimension of the C matrix. `outProcess` is a template argument that can be a C++ functor implementing a pipeline of output processing elements. It is called after a block of the C matrix is computed, to take advantage of cache locality. The final two parameters are related to parallelization. Internally, FBGEMM is intentionally designed not to create any threads: such a library is usually used as a backend by deep learning frameworks, such as PyTorch and Caffe2, that create and manage their own threads. Overall, this interface allows the use of different packing methods and the construction of a pipeline of post-GEMM operations on the currently computed block of the output matrix.

Because depthwise convolution is sufficiently different from GEMM, we also include a specialized kernel for it in FBGEMM. We believe most of the important use cases found in our data centers (including convolutions) can be efficiently implemented by composing various input-packing and output-processing operations. As with QNNPACK, our depthwise convolution kernel vectorizes over channels. But because code-size constraints on servers are not as strict as on mobile platforms, it can do more aggressive unrolling and inlining with template specializations. And whereas QNNPACK needs to prepare various requantization options, including ones purely using fixed-point operations in case the target platform lacks good floating-point support, FBGEMM uses FP32 operations when scaling the INT32 intermediate GEMM output to INT8 during requantization.

The FBGEMM interface allows for flexible composition with various output processing schemes, as illustrated by how we perform 16-bit accumulation (depicted in the figure below). FBGEMM supports INT8 matrix multiplication with INT16 accumulation to get better performance for compute-bound cases. INT8 FMA with accumulation into INT16 is performed with a combination of the `vpmaddubsw` and `vpaddsw` vector instructions. With INT8, we work on 4x more elements per vector instruction than with FP32, but we use two vector instructions for each vector FMA. Therefore, the theoretical peak for accumulating into 16 bits is 2x that of FP32. INT16 accumulation, however, usually leads to frequent overflow/saturation, which we avoid by using outlier-aware quantization. That is, we split matrix B into B = B’ + B_sparse, where B’ has only numbers with small magnitude, and big numbers are separated into B_sparse. We denote the matrix with outlier numbers as B_sparse because B typically has only a few big numbers, so B_sparse is usually very sparse. After the splitting, A * B can be computed as A * B’ + A * B_sparse, where we can safely use INT16 accumulation for A * B’ because B’ contains only small numbers. The majority of the computation happens in A * B’, given the sparsity of B_sparse.
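A sketch of the outlier split (hypothetical names; FBGEMM stores the sparse part in compressed sparse column format, while this sketch uses a simple coordinate list for brevity):

```cpp
#include <cstdint>
#include <cstdlib>
#include <tuple>
#include <vector>

// Split B = B' + B_sparse: entries with |value| <= threshold stay in the
// dense B' (safe for INT16 accumulation); larger entries move to a sparse
// list and are handled separately with INT32 accumulation.
struct SplitB {
  std::vector<int8_t> dense;                         // B' (small magnitudes)
  std::vector<std::tuple<int, int, int8_t>> sparse;  // (row, col, value)
};

SplitB splitOutliers(const std::vector<int8_t>& B, int rows, int cols,
                     int8_t threshold) {
  SplitB s;
  s.dense.assign(B.size(), 0);
  for (int r = 0; r < rows; ++r)
    for (int c = 0; c < cols; ++c) {
      int8_t v = B[r * cols + c];
      if (std::abs(v) <= threshold)
        s.dense[r * cols + c] = v;
      else
        s.sparse.emplace_back(r, c, v);  // big values are rare, so this stays small
    }
  return s;
}
```

Because B is a constant weight matrix, this split is done once as a preprocessing step, as described below.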

The splitting of B, the packing of B’, and the conversion of B_sparse into an efficient sparse matrix format (we use compressed sparse column) need to be done only once as a preprocessing step, because B is constant during inference. FBGEMM computes the dense-matrix-times-sparse-matrix multiplication (i.e., A * B_sparse) as a part of the postprocessing pipeline. Sparse matrix computation is usually a memory-bandwidth-bound operation, so it is important to fuse it with the main computation. Instead of computing A * B’ and A * B_sparse separately, they are fused so that a part of A * B_sparse is computed while the packed A block and the partial result of C are cache-resident.

We ran performance benchmarks for the FBGEMM library on an Intel(R) Xeon(R) CPU E5-2680 v4, using a single thread on a single core. This is a Broadwell machine with a base frequency of 2.4 GHz; we disabled turbo mode to get reliable run-to-run results. The following graph shows the FP32 theoretical peak against the actual performance we get for INT8 GEMMs with accumulation into 16 bits. As mentioned earlier, the theoretical single-core peak for accumulation into 16 bits on this Broadwell machine is 2x the FP32 peak, i.e., 153.6 giga operations per second (GOPS). Accumulation into 16 bits is used for compute-bound cases, as these are the cases in which we get the most performance benefit. For bandwidth-bound cases, accumulation into 16 bits does not buy us any better performance, yet it may still overflow unless we use outlier-aware quantization; hence, we avoid accumulation into 16 bits for bandwidth-bound cases.
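As a sanity check on that number (our arithmetic, assuming Broadwell's two 256-bit AVX2 FMA ports per core), the FP32 single-core peak works out to:

```latex
2.4\ \mathrm{GHz}
\times \underbrace{2}_{\mathrm{FMA\ ports}}
\times \underbrace{8}_{\mathrm{FP32\ lanes}}
\times \underbrace{2}_{\mathrm{ops/FMA}}
= 76.8\ \mathrm{GFLOPS}
```

Doubling the per-instruction element count for INT8 with INT16 accumulation then gives 2 x 76.8 = 153.6 GOPS.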

The following graph shows performance for the bandwidth-bound cases, where we perform accumulation into 32 bits. INT8 FMA with INT32 accumulation is performed with a combination of the vpbroadcastd, vpmaddubsw, vpmaddwd, and vpaddd vector instructions. Because four instructions are used for each INT8 FMA, the theoretical compute peak for INT8 is no better than FP32, even though each element is 4x smaller. The figure also shows the roofline peak for the same machine. Overall, the Broadwell machine has a theoretical peak bandwidth of 76.8 GB/s across all cores. We measured a stream triad bandwidth of 15.6 GB/s per core and use this number to calculate the roofline peak. FP32 roofline numbers are the best theoretically possible, and in practice the achieved performance is lower than these roofline numbers; we compare INT8 performance against these theoretically best FP32 numbers. As shown in the graph below, accumulation into 32 bits is most beneficial for small batches (matrix dimension M is the batch dimension). We are able to achieve better-than-FP32 theoretical roofline performance because working with lower-precision data uses less bandwidth.

Quantized inference is already proving useful on the current generation of server hardware. Careful implementation of quantization has shown us encouraging results on language translation models, recommendation systems, and models for text understanding in images and videos. But we can continue to build on our work with FBGEMM. There are a variety of other models in use across Facebook in which quantized inference is not yet implemented, and FBGEMM combined with a deep learning framework has the potential to improve efficiency there as well. Certain newer models, such as ResNeXt-101 and ResNext3D, are more accurate but are so compute-heavy that deploying them at scale is very resource-intensive without improved efficiency. We hope that FBGEMM will help fill the necessary efficiency gap for deployment.

As the deep learning models in computer vision grow wider and deeper in search of better accuracy, the use of groupwise convolution and depthwise convolution (a special case of groupwise convolution) is increasing. When the number of groups is large, however, groupwise convolution is inefficient when performed with im2col followed by the GEMM method. We already have a specialized implementation for depthwise convolutions, but we intend to add direct groupwise convolution to FBGEMM as well. The most frequently used Caffe2 operators are already implemented using FBGEMM, and a direct integration with PyTorch is planned. We are also planning to add more features to further improve efficiency, such as merging depthwise convolution with 1×1 convolution and improving performance-debugging support.

We hope open-sourcing FBGEMM will allow other engineers to take advantage of this high-performance kernel library, and we welcome contributions and suggestions from the wider community. The HPC community has long provided the standard interface for GEMM, and with FBGEMM, we show that combining certain operations with input and output packing is more efficient. We hope future GEMM interfaces from the HPC community will find inspiration in these ideas.

*We’d like to acknowledge the contributions to this project from the FBGEMM team, along with the AI developer platform team and our AI system co-design group.*

The post Open-sourcing foundational tools for AI performance appeared first on Facebook Code.

QNNPACK and FBGEMM are high-performance kernel libraries that enable mobile devices and servers to run the latest AI models more efficiently. Both libraries have been deployed to production at Facebook, where they are improving the performance of computer vision models on mobile devices and speeding up computer vision models, machine translations, and other services running on our servers. We are open-sourcing both libraries so others can boost performance of their deep learning models as well as contribute any performance improvements they make in return.

Deep learning frameworks (such as PyTorch) commonly use higher-precision floating-point numbers (e.g., 32-bit floating point) to represent the weights and activations of a neural network during training. But after model training is finished, higher-precision floating-point representations and calculations become overkill. Many types of models can be adapted to use low-precision integer arithmetic for inference without noticeable accuracy loss.

QNNPACK and FBGEMM enable these high-performance, low-precision calculations for operations such as matrix multiplication and convolution, which are important in state-of-the-art deep learning architectures.

QNNPACK and FBGEMM are now publicly available, so the AI research and engineering community can use them to improve performance for reduced-precision on-CPU inference. At Facebook, QNNPACK helps mobile devices deliver a real-time implementation of Mask R-CNN, a model for semantic segmentation and keypoint estimation. FBGEMM has delivered encouraging server-side results on language-translation models, recommendation systems, and models for text understanding in images and videos.

In general, as deep learning models grow wider and deeper in search of better accuracy, they are increasingly reliant on the operations that QNNPACK and FBGEMM optimize. Furthermore, the deep learning community is continuing to move toward low-precision models, as evidenced by the newer generations of GPUs, CPUs, and specialized tensor processors that all natively support lower-precision compute primitives. This indicates that optimization through quantized inference will continue to be important for deploying new AI products.

The post Getafix: How Facebook tools learn to fix bugs automatically appeared first on Facebook Code.

Modern production codebases are extremely complex and are updated constantly. To create a system that can automatically find fixes for bugs — without help from engineers — we built a tool that learns from engineers’ previous changes to the codebase. It finds hidden patterns and uses them to identify the most likely remediations for new bugs.

This tool, called Getafix, has been deployed to production at Facebook, where it now contributes to the stability of apps that billions of people use. Getafix works in conjunction with two other Facebook tools, though the technology can be used to address code issues from any source. It currently suggests fixes for bugs found by Infer, our static analysis tool that identifies issues such as null pointer exceptions in Android and Java code. It also suggests fixes — via SapFix — for bugs detected by Sapienz, our intelligent automated testing system for our apps. Having previously given an overview of SapFix and Sapienz, we are now offering a deep dive into how Getafix learns how to fix bugs (using the term broadly to refer to any code issues, not just those that will cause an app to crash).

The goal of Getafix is to let computers take care of the routine work, albeit under the watchful eye of a human, who must decide when a bug requires a complex, nonroutine remediation. The tool works by applying a new method of hierarchical clustering to many thousands of past code changes that human engineers made, looking at both the change itself and also the context around the code change. This method allows it to detect the underlying patterns in bugs and the corresponding fixes that previous auto-fix tools couldn’t.

Getafix can also narrow the space of possible program changes that need to be searched to find a fix for a bug, enabling it to select an appropriate patch more quickly and without the high computation time that brute force and logic-based techniques require. This more efficient approach makes it possible to deploy Getafix to production environments. At the same time, because Getafix learns from past code changes, it also produces fixes that are easy for human engineers to understand.

Getafix is deployed in Facebook to automatically suggest fixes for the null dereference bugs that Infer reports, as well as to suggest fixes for the null dereference-related crash errors that Sapienz flags. It is also being used to resolve code quality concerns that are found when revisiting existing code with newer versions of Infer.

In current industrial practice, auto-fixes have been used primarily for basic issues, where the code remediation is straightforward. For example, an analyzer might warn about a “dead exception,” in which the developer probably forgot to add a `throw` before `new Exception(...)`. An auto-fix to make that change is straightforward and can be defined by the author of the lint rule, without knowing the specific context in which it is applied.

Getafix offers significantly more general capability, remediating issues in cases where the fix is context-dependent. In this sample code example, Getafix offers the following fix in response to an Infer bug at line 22:

*A sample bug reported in our code review portal, along with Getafix-generated fix.*

Note that this fix depends not only on the variable `ctx` but also on the return type of the method. Unlike simple lint remediations, fixes of these kinds cannot be baked into Infer itself.

The figure below has additional examples of fixes that Getafix offers for Infer bugs; even though the bug from Infer is the same (null method call, which indicates the risk of a NullPointerException being thrown), each fix is unique. Notice that the fixes are indistinguishable from the kinds developers typically make.

Getafix is organized as the toolchain shown in the diagram below. In this section, we’ll describe the functionality and challenges in each of the three main components.

An abstract-syntax-tree-based differencer is first used to identify concrete edits made between a pair of source files, such as successive revisions of the same file. For example, it will detect granular edits: wrapping a statement with an `if`, adding an `@Nullable` annotation or an `import`, and prepending a conditional early `return` to an existing method body, among others. In the example below, the insertion of a conditional early return if `dog` is `null`, the rename of `public` to `private`, and the move of a method were detected as concrete edits. Whereas a line-based diffing tool would mark either method as fully removed and inserted, the tree differencer detects the move and can hence also detect the insertion *within* the moved method as a concrete edit.

A challenge in the tree differencer is to efficiently and precisely align parts of the “before” and “after” trees, so the right concrete edits and their mappings from before to after trees get discovered.

Getafix performs pattern mining by using a new *hierarchical* clustering technique, along with anti-unification, an existing method of generalizing among different symbolic expressions. It then creates a collection of possibly related tree differences and uses the fix patterns representing the most common program transformations in that collection. These patterns can be abstract, containing “holes” where program transformations differ.

The following example image shows a hierarchical structure, known as a dendrogram, that results from a set of edits. (In this case, it shows the edit from the previous example above.) Each row shows an edit pattern (the “before” in purple and the “after” in blue) along with some metadata. Each vertical black bar corresponds to a level in the hierarchy, where the edit pattern at the top of the black bar is the pattern obtained by anti-unification of all the other edits belonging to that level in the hierarchy. The other edits are connected by the smaller, thin black lines. Anti-unification combines the “early return if dog is null” edit from the previous example with another edit in which the only difference is in what the dog is drinking. The result is an abstract fix pattern that represents the commonality. The symbol `h0`, introduced by anti-unification, indicates a “hole” that can be instantiated based on the context.

This edit pattern can then combine with other edit patterns that have more variation in variable names but still have the same overall structure. This process produces increasingly more abstract edit patterns as we go up the tree. For example, it could combine this edit with a cat-related edit to obtain the abstract edit shown near the top of the diagram.
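To illustrate the core generalization step, here is a hypothetical token-level anti-unification sketch. (Getafix actually anti-unifies abstract syntax trees and reuses holes for repeated subterms; this flat version only shows how mismatches become holes.)

```cpp
#include <string>
#include <vector>

// Walk two equal-length token sequences and replace each mismatch with a
// fresh hole h0, h1, ... Matching tokens are kept as shared structure.
std::vector<std::string> antiUnify(const std::vector<std::string>& a,
                                   const std::vector<std::string>& b) {
  std::vector<std::string> out;
  int nextHole = 0;
  for (size_t i = 0; i < a.size() && i < b.size(); ++i) {
    if (a[i] == b[i])
      out.push_back(a[i]);                             // common structure
    else
      out.push_back("h" + std::to_string(nextHole++)); // abstract hole
  }
  return out;
}
```

Applied to the dog and cat edits above, the differing variable name would become `h0` while the shared `if (... == null) return;` skeleton survives as the pattern.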

More seriously, though, this hierarchical matching process produces a powerful framework for Getafix to discover reusable patterns in code changes. The picture below shows the dendrogram (laid out sideways and miniaturized) obtained by combining all the 2,288 edits that fixed null pointer errors reported by Infer in our codebase over a period of time. The fix patterns we seek to mine are hidden in this dendrogram.

The idea of anti-unification-based pattern mining is not new, but several enhancements were necessary to mine patterns that can be used to generate (and rank) a reasonably small number of fixes for a new bug.

One such change is the inclusion of a portion of surrounding code that *doesn't* change as a result of the edit. This change allows us to find not only patterns in the changes people make but also patterns in the context in which the changes are applied. For example, in the first dendrogram above, we notice that there are two distinct edits adding `if (dog == null) return;` before `dog.drink(...);`. Even though `dog.drink(...);` is unchanged, it is included in the “before” and “after” parts of the pattern as context telling us where to apply this fix. At a higher level in the hierarchy of edits, the `dog.drink()` context merges with other context to become the abstract context `h0.h1()`, which restricts the places where the pattern can apply. A more realistic example follows in the next section.

A greedy clustering algorithm, as suggested in past literature on auto-fix tools, is unlikely to learn this context, because it maintains a single representation of each cluster, which will not include the extra context if that context is not present in *all* of the edits in the training data. For example, if an edit inserting `if (list != null) return;` before `do(list.get());` were merged with our `dog.drink()` examples above, greedy clustering would lose all the context about where to insert the early return. Getafix’s hierarchical clustering approach keeps as much context as possible at each level, becoming more general higher in the structure. At some level, even the general context we hope to learn will be lost, but it will still be present at lower levels in the structure.
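The difference can be seen in a small sketch (illustrative only; here edits are modeled as sets of facts, and anti-unification is approximated by set intersection — keeping what is common). Greedy clustering folds everything into one running representative, while the hierarchy retains each intermediate merge as a node.

```python
# Contrast greedy clustering (one running representative per cluster)
# with hierarchical clustering (every intermediate merge is kept as a
# node). Edits are modeled as sets of facts; "generalize" stands in for
# anti-unification by keeping only what is common. Purely illustrative.

def generalize(a, b):
    return a & b   # stand-in for anti-unification

dog1 = {"insert early return", "context: dog.drink()"}
dog2 = {"insert early return", "context: dog.drink()"}
lst  = {"insert early return", "context: do(list.get())"}

# Greedy: fold all edits into a single representative.
greedy = generalize(generalize(dog1, dog2), lst)
# greedy == {"insert early return"} -- the dog.drink() context is gone for good.

# Hierarchical: keep each merge as a node. The specific context survives
# at the lower node even though the root is fully general.
lower = generalize(dog1, dog2)   # still contains "context: dog.drink()"
root  = generalize(lower, lst)   # == {"insert early return"}
tree  = (root, [(lower, [dog1, dog2]), lst])
```

In the greedy result, there is no node left that records where to insert the early return; in the tree, the lower node still carries the `dog.drink()` context.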

In addition to surrounding code, we also associate edits with the Infer bug reports that prompted them in the first place, which allows us to learn how edit patterns relate to the corresponding bug reports. The variable Infer blames in a bug report is shown as “errorVar” in the first dendrogram figure above and participates in anti-unification, ending up as hole `h0`. This allows us to later substitute a blamed variable into `h0` when presented with a new Infer bug report, making the fix pattern more specific.

The final step takes buggy source code and fix patterns from the pattern mining step and produces patched versions of the source code. There are typically many fix patterns to choose from (as seen in the dendrogram above), so a challenge we must address in this step is selecting the correct pattern to fix a particular bug. If the pattern applies in several locations, Getafix must also select the *right match*. The following examples illustrate our general approach and how we address this challenge in Getafix.

**Example 1.** Consider the pattern we mined above: `h0.h1();` → `if (h0 == null) return; h0.h1();`. We briefly explain the steps to produce the following patch on previously unseen code.

Getafix creates a patch using the following steps:

- Find a sub-AST matching the before part: `mListView.clearListeners();`
- Instantiate holes `h0` and `h1`
- Replace the sub-AST with the instantiated after part

Note that `h0` in the after part is bound because of the inclusion of the unmodified context `h0.h1();`, which helpfully restricts the number of places the pattern applies. Without the unmodified context, the pattern would have been `<nothing>` → `if (h0 == null) return;`, which applies in unintended places, such as *after* `mListView.clearListeners();` or even after `mListView = null;`.
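The three steps above can be sketched over a token-level model of the pattern (a simplification of Getafix’s AST-based matching; the function names and token encoding are ours, not Getafix’s):

```python
# Sketch of the three patch steps: match the "before" pattern against the
# code, bind the holes, and emit the instantiated "after" part. Patterns
# and code are modeled as token lists; names are illustrative only.

def match(pattern, tokens):
    """Return hole bindings if `pattern` matches `tokens`, else None."""
    if len(pattern) != len(tokens):
        return None
    env = {}
    for p, t in zip(pattern, tokens):
        if p.startswith("h"):                # a hole matches any single token
            if env.setdefault(p, t) != t:    # same hole must bind consistently
                return None
        elif p != t:
            return None
    return env

def instantiate(pattern, env):
    # substitute bound holes; literal tokens pass through unchanged
    return [env.get(p, p) for p in pattern]

before = ["h0", ".", "h1", "(", ")", ";"]
after  = ["if", "(", "h0", "==", "null", ")", "return", ";",
          "h0", ".", "h1", "(", ")", ";"]

code = ["mListView", ".", "clearListeners", "(", ")", ";"]
env = match(before, code)       # {'h0': 'mListView', 'h1': 'clearListeners'}
patched = instantiate(after, env)
# " ".join(patched) ==
#   "if ( mListView == null ) return ; mListView . clearListeners ( ) ;"
```

Because `h0` is bound during matching, the same binding carries over into the after part, yielding a null check on exactly the right variable.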

The insertion-only pattern will in fact also appear higher up in the dendrogram, where the pattern with context `h0.h1();` is anti-unified with a pattern inserting the return in front of a different statement. The next example illustrates how Getafix deals with patterns that seem to apply in too many places.

**Example 2.** Consider the following pattern: `h0.h1()` → `h0!=null && h0.h1()`. Typically, this patch would be mined from fixes within `if` conditions or `return` expressions, so we’d expect it to be applied there as well. But it also matches in other situations, such as the call statement shown in the previous example: `mListView.clearListeners();`. Getafix’s ranking strategy tries to estimate the likelihood that a pattern is indeed a fix and that it is the *most likely* fix for a given context. This strategy makes the system less reliant on the validation step later, which saves time.

The above pattern will compete with other patterns, such as the more specific `if (h0.h1()) { ... }` → `if (h0!=null && h0.h1()) { ... }`, or the pattern from Example 1, which applies only to call *statements* rather than *expressions*. More specific patterns match in fewer places and are thus considered more specialized for the situation, so Getafix ranks them higher.
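The specificity heuristic reduces to a simple ordering, sketched below. The patterns and their match counts are made-up numbers for illustration; in practice the counts would come from matching each candidate against the buggy file.

```python
# Sketch of specificity-based ranking: patterns that match in fewer
# places are more specialized for the situation and are tried first.
# The candidate patterns and match counts are invented for illustration.

def rank(candidates):
    # fewer match sites -> more specific -> ranked higher
    return sorted(candidates, key=lambda pattern_count: pattern_count[1])

candidates = [
    ("h0.h1() -> h0!=null && h0.h1()", 40),                        # matches almost anywhere
    ("if (h0.h1()) {...} -> if (h0!=null && h0.h1()) {...}", 5),   # only if conditions
    ("h0.h1(); -> if (h0==null) return; h0.h1();", 12),            # only call statements
]
best = rank(candidates)[0][0]
# best == "if (h0.h1()) {...} -> if (h0!=null && h0.h1()) {...}"
```

The broadly applicable expression rewrite sinks to the bottom, while the pattern that fires in only a handful of places is attempted first.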

Getafix is deployed in Facebook to automatically suggest fixes for null dereference bugs reported by Infer, our static analysis tool, as well as to suggest fixes for the null dereference-related crash bugs that Sapienz finds. It is also being used to resolve outstanding Infer bugs from the past.

In one experiment, we compared Getafix-computed fixes with actual human-written past fixes for the same Infer null method call bugs, over a data set of about 200 small edits in which fewer than five lines had changed. In about 25 percent of those cases, Getafix’s highest-ranked patch exactly matched the human-created patch.

Another experiment looked at a subset of the Instagram codebase and tried to bulk-fix about 2,000 null method call issues. Getafix was able to attempt a patch for about 60 percent of the bugs. About 90 percent of those attempts passed automatic validation, meaning they were compilable and Infer no longer emitted the warning. Overall, Getafix automatically patched 1,077 (about 53 percent) of the null method call bugs.

In addition to suggesting fixes for new Infer bugs as they are introduced, we’ve also been using the same infrastructure to clean up the backlog of Infer bugs that made it past code review and into the codebase. We’ve cleaned up hundreds of “return not nullable” and “field not nullable” Infer bugs as part of this effort. Interestingly, suggesting auto-fixes next to “return not nullable” and “field not nullable” bugs increased the fix rates from 56 percent to 62 percent and from 51 percent to 59 percent, respectively. Overall, a couple of hundred additional bugs were fixed in the past three months because Getafix displayed these suggestions.

Getafix also provides fixes to SapFix to address the crashes that Sapienz detects. Over the past months, Getafix has supplied about half of the fix candidates that SapFix uses and considers valid (all tests passed). Of all the fix candidates Getafix provides to SapFix, about 80 percent pass all tests.

Getafix has helped us advance toward our goal of letting computers take care of routine bug-fixing work. As we continue to refine our testing and verification tools, we expect Getafix will be able to prevent a larger portion of postdeployment failures.

We note that the fix patterns Getafix mines need not come only in response to Infer-related fixes. Indeed, they can also come from fixes made in response to manual code inspection. This additional source of fix patterns opens up the exciting possibility of automating repetitive code reviews. In other words, a bug that has been flagged and remediated across the codebase multiple times in the past can be flagged automatically in a future code commit — without a human needing to do it.

Getafix is part of our overall effort to build intelligent tools that rely on statistical analysis of large code corpora and the associated metadata. Such tools have the potential to improve all aspects of the software development life cycle, including code discovery, code quality, and operational efficiency. The insights we gain from Getafix will help us in building out and deploying additional tools in this space.

*We’d like to thank Jeremy Dubreil, as well as Alexandru Marginean and the SapFix team for their help with integrating Getafix with Infer and SapFix, respectively.*

The post Getafix: How Facebook tools learn to fix bugs automatically appeared first on Facebook Code.

The post Zero-shot learning: Using text to more accurately identify images appeared first on Facebook Code.

Zero-shot learning (ZSL) is a process by which a machine learns to recognize objects it has never seen before. Researchers at Facebook have developed a new, more accurate ZSL model that uses neural net architectures called generative adversarial networks (GANs) to read and analyze text articles, and then visually identify the objects they describe. This novel approach to ZSL allows machines to classify objects based on category, and then use that information to identify other similar objects, as opposed to learning each object individually, as other models do.

Researchers trained this model, called generative adversarial zero-shot learning (GAZSL), to identify more than 600 classes of birds across two databases containing more than 60,000 images. It was then given web articles and asked to use the information there to identify birds it had not seen before. The model extracted seven key visual features from the text, created synthetic visualizations of these features, and used those features to identify the correct class of bird.

Researchers then tested the GAZSL model against seven other ZSL algorithms and found it was consistently more accurate across four different benchmarks. Overall, the GAZSL model outperformed other models by between 4 percent and 7 percent, and in some cases by much more.

To become more useful, computer vision systems will need to recognize objects they have not specifically been trained on. For example, it is estimated that there are more than 10,000 living bird species, yet most computer vision data sets of birds have only a couple hundred categories. This new ZSL model, which has been open-sourced, has been shown to produce better results and offers a promising path for future research into machine learning. Much of the research into AI remains foundational, but work that improves how systems are able to understand text and correctly identify objects continues to lay the groundwork for better, more reliable AI systems.

**A Generative Adversarial Approach for Zero-Shot Learning from Noisy Texts**


The post React Conf recap: Hooks, Suspense, and Concurrent Rendering appeared first on Facebook Code.

Sophie Alpert and Dan Abramov kicked off Day 1 with their keynote, React Today and Tomorrow. In the talk, they introduced Hooks, a new proposal that adds the ability to access features such as state without writing a JavaScript class. Hooks promise to dramatically simplify the code required for React components and are currently available in a React 16.7 alpha release.

On the morning of Day 2, Andrew Clark and Brian Vaughn presented Concurrent Rendering in React. Andrew covered the recently announced React.lazy API for code splitting and previewed two upcoming features in 16.7: concurrent mode and Suspense. Brian demonstrated how to use React’s new profiler tooling to make apps built in React run faster.

In the afternoon, Parashuram N spoke in detail about React Native’s New Architecture, a long-term project that the React Native team has been working on over the past year and announced in June. We’re really excited about the potential of this project to improve performance, simplify interoperability with other libraries, and set a strong foundation for the future of React Native.

Now that the conference is over, all 28 conference talks are available to stream online. There are tons of great ones from both days. We can’t wait until next year!

Follow us on Twitter for more updates and info.
