Jeff Dean's research while affiliated with Google Inc. and other places

Publications (32)

Preprint
Full-text available
We study the problem of efficient generative inference for Transformer models, in one of its most challenging settings: large deep models, with tight latency targets and long sequence lengths. Better understanding of the engineering tradeoffs for inference for large Transformer-based models is important as use cases of these models are growing rapi...
Preprint
Full-text available
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find th...
Preprint
Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. By doing so, the degree of spars...
Preprint
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present...
Preprint
Full-text available
Multitask learning assumes that models capable of learning from multiple tasks can achieve better quality and efficiency via knowledge transfer, a key feature of human learning. However, state-of-the-art ML models rely on heavy customization for each task and leverage size and data scale rather than scaling the number of tasks. Moreover, continual learni...
Preprint
Full-text available
Most uses of machine learning today involve training a model from scratch for a particular task, or sometimes starting with a model pretrained on a related task and then fine-tuning on a downstream task. Both approaches offer limited knowledge transfer between different tasks, time-consuming human-driven customization to individual tasks and high c...
Preprint
Full-text available
Large language models have been shown to achieve remarkable performance across a variety of natural language tasks using few-shot learning, which drastically reduces the number of task-specific training examples needed to adapt the model to a particular application. To further our understanding of the impact of scale on few-shot learning, we traine...
Preprint
Full-text available
We present the design of a new large scale orchestration layer for accelerators. Our system, Pathways, is explicitly designed to enable exploration of new systems and ML research ideas, while retaining state of the art performance for current models. Pathways uses a sharded dataflow graph of asynchronous operators that consume and produce futures,...
Preprint
Scale has opened new frontiers in natural language processing -- but at a high cost. In response, Mixture-of-Experts (MoE) and Switch Transformers have been proposed as an energy efficient path to even larger and more capable language models. But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by traini...
Article
Full-text available
A decade of unprecedented progress in artificial intelligence (AI) has demonstrated the potential for many fields—including medicine—to benefit from the insights that AI techniques can extract from data. Here we survey recent progress in the development of modern computer vision techniques—powered by deep learning—for medical applications, focusing...
Article
Full-text available
Chip floorplanning is the engineering task of designing the physical layout of a computer chip. Despite five decades of research1, chip floorplanning has defied automation, requiring months of intense effort by physical design engineers to produce manufacturable layouts. Here we present a deep reinforcement learning approach to chip floorplanning....
Preprint
The number of parameters in state of the art neural networks has drastically increased in recent years. This surge of interest in large scale neural networks has motivated the development of new distributed training strategies enabling such models. One such strategy is model-parallel distributed training. Unfortunately, model-parallelism suffers fr...
Preprint
Full-text available
In this work, we present a learning-based approach to chip placement, one of the most complex and time-consuming stages of the chip design process. Unlike prior methods, our approach has the ability to learn from past experience and improve over time. In particular, as we train over a greater number of chip blocks, our method becomes better at rapi...
Article
Full-text available
Background: Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance when applied to new datasets. Objective: We pr...
Conference Paper
In this keynote we describe progress in work that our research teams have been doing over the past years, including advances in difficult problems in artificial intelligence, on building large-scale computer systems for machine learning research, and, in collaboration with many teams at Google, on applying our research and systems to dozens of Goog...
Preprint
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that com...
Article
Full-text available
Here we present deep-learning techniques for healthcare, centering our discussion on deep learning in computer vision, natural language processing, reinforcement learning, and generalized methods. We describe how these computational techniques can impact a few key areas of medicine and explore how to build end-to-end systems. Our discussion of comp...
Article
Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that com...
Conference Paper
Many recent machine learning models rely on fine-grained dynamic control flow for training and inference. In particular, models based on recurrent neural networks and on reinforcement learning depend on recurrence relations, data-dependent conditional execution, and other features that call for dynamic control flow. These applications benefit from...
Article
We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expe...
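The core loop this abstract describes (a controller sampling a subgraph and being trained with policy gradient on the resulting reward) can be sketched in a toy form. This is a minimal REINFORCE illustration, not the paper's method: the tabular controller, the 3-slot/4-op search space, and the synthetic reward function are all assumptions chosen for brevity (ENAS uses an LSTM controller and shared-weight validation accuracy as the reward).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy search space (illustrative, not the paper's): N layer slots, K candidate
# ops per slot. The "controller" is a table of logits trained with REINFORCE.
N, K = 3, 4
logits = np.zeros((N, K))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def reward(arch):
    # Hypothetical stand-in for shared-weight validation accuracy:
    # architectures choosing op 0 in every slot score highest.
    return float(np.mean(arch == 0))

baseline = 0.0
for _ in range(500):
    probs = np.array([softmax(row) for row in logits])
    arch = np.array([rng.choice(K, p=p) for p in probs])  # sample a subgraph
    r = reward(arch)
    baseline = 0.9 * baseline + 0.1 * r                   # moving baseline
    # REINFORCE: move logits toward the sampled ops, scaled by the advantage.
    for i in range(N):
        grad = -probs[i]
        grad[arch[i]] += 1.0
        logits[i] += 0.1 * (r - baseline) * grad

best = logits.argmax(axis=1)  # controller's preferred op per slot
```

With the synthetic reward above, the controller's per-slot logits drift toward op 0; the real system replaces `reward` with the validation accuracy of the sampled subgraph evaluated under the shared weights.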
Article
Full-text available
Predictive modeling with electronic health record (EHR) data is anticipated to drive personalized medicine and improve healthcare quality. Constructing predictive statistical models typically requires extraction of curated predictor variables from normalized EHR data, a labor-intensive process that discards the vast majority of information in each...
Article
Full-text available
The past few years have witnessed a growth in size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the ne...
Article
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed in theory as a way of dramatically increasing model capacity without a proportional increase in computation. In practice, however, there are significa...
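The conditional-computation idea in this abstract (a sparsely-gated mixture-of-experts layer, where each input activates only a few experts) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's layer: the dimensions, the purely linear experts, and the top-k-then-softmax gating rule are assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: d-dimensional inputs, E experts, top-k routing.
d, E, k = 8, 4, 2
W_gate = rng.normal(size=(d, E))         # gating network weights
W_experts = rng.normal(size=(E, d, d))   # one linear "expert" per slot

def moe_layer(x):
    """Sparsely-gated MoE: each input is processed by only its top-k experts."""
    scores = x @ W_gate                  # (E,) routing scores for this input
    top = np.argsort(scores)[-k:]        # indices of the k highest-scoring experts
    # Softmax over the selected experts only; unselected experts get weight 0
    # and are never evaluated, which is where the compute savings come from.
    w = np.exp(scores[top] - scores[top].max())
    w /= w.sum()
    out = sum(wi * (x @ W_experts[i]) for wi, i in zip(w, top))
    return out, top

x = rng.normal(size=d)
y, used = moe_layer(x)
```

Parameter count grows with the number of experts `E`, but per-example compute grows only with `k`, which is the "capacity without proportional computation" trade-off the abstract refers to.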
Conference Paper
Over the past five years, deep learning and large-scale neural networks have made significant advances in speech recognition, computer vision, language understanding and translation, robotics, and many other fields. Deep learning allows the use of very raw forms of data in order to build higher-level understanding of data automatically, and can als...
Article
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, espe...
Conference Paper
A very simple way to improve the performance of almost any machine learning algorithm is to train many different models on the same data and then to average their predictions. Unfortunately, making predictions using a whole ensemble of models is cumbersome and may be too computationally expensive to allow deployment to a large number of users, espe...
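The compression step behind the two entries above (distilling an ensemble or large model into a single deployable model) centers on a temperature-softened cross-entropy between teacher and student outputs. The sketch below is a minimal illustration under assumed toy logits; the `T**2` scaling keeps soft-target gradients on a scale comparable to the hard-label loss.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between the temperature-softened teacher distribution
    (the 'soft targets') and the student's softened distribution."""
    p_teacher = softmax(teacher_logits / T)
    p_student = softmax(student_logits / T)
    return -T**2 * np.sum(p_teacher * np.log(p_student + 1e-12))

# Hypothetical logits for illustration only.
teacher = np.array([4.0, 1.0, 0.5])
close   = np.array([3.5, 1.2, 0.4])   # student roughly agreeing with the teacher
far     = np.array([0.2, 3.0, 2.5])   # student disagreeing with the teacher
```

A student whose logits track the teacher's incurs a lower loss, so minimizing this objective transfers the teacher's full output distribution (including the relative probabilities of wrong classes) rather than just its argmax.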
Article
Full-text available
Object recognition and localization are important tasks in computer vision. The focus of this work is the incorporation of contextual information in order to improve object recognition and localization. For instance, it is natural to expect not to see an elephant to appear in the middle of an ocean. We consider a simple approach to encapsulate such...
Article
Full-text available
We consider the problem of building high- level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model...

Citations

... Mixture-of-Experts. Mixture-of-Experts (MoE) in neural networks means dividing specific parts of the parameters into subsets, each of which is called an expert (Fedus et al., 2022). During the forward pass, a router assigns experts to different inputs, and each input only interacts with the experts assigned to it. ...
... In figure 7 in the Appendix, proof accuracy increases considerably when increasing the model size from 350M to 1.3B and 6.7B. However, only version 002 of the 175B INSTRUCTGPT model is able to perform better than chance, suggesting that the reasoning ability tested by our experiments could be an emergent ability (Wei et al., 2022a). We were not able to conclusively discern the cause of the significant difference in performance between version 001 and 002. ...
... Aside from the hierarchy, the feature vectors also describe timing parameters. Mirhoseini et al. [60] tackle floorplanning in an industrial context. The authors approach this problem via RL and develop an edge-based GCN that describes the high-level netlist of modules. ...
... In the past, modern technology was not available, so detecting diseases was a cumbersome and difficult task. With the advancement of science and the widespread adoption of modern technologies [1], these techniques can now serve the health sector, where diagnosing diseases of the human body is not easy and remains challenging for pathology, since it requires extensive information and an accurate diagnosis. ...
... As an example, FPGAs offer unique performance and energy-saving advantages, but the software engineering part is challenging (Omondi and Rajapakse, 2006; Teubner et al., 2013). The challenges faced when deploying machine learning on diverse hardware in different application contexts (Hazelwood et al., 2018) even gave rise to a new conference, bridging the gap between machine learning and systems researchers (Ratner et al., 2019). The current increase in awareness regarding CO2 emissions foregrounds these properties even more for users who want to design ML systems responsibly. ...
... This work focuses on the ability to share textual medical documents, which are often written by doctors and can take the form of operating reports, clinical notes or biological examination results. To facilitate privacy protection, de-identification methods [7,8,9,10,11] have been proposed as a process to remove or mask any type of Protected Health Information (PHI) about a patient, so that it becomes difficult to link an individual to their data. The type of information that constitutes PHI is defined in part by the privacy laws of the relevant jurisdiction. ...
... In recent years, deep learning methods have become increasingly popular as a tool for all healthcare analytics applications, particularly in medical image classification [42]. Their recent extensive application can be attributed to the increased availability of electronic health records as well as improvements in hardware and software [43][44][45][46]. ...
... Automatic differentiation frameworks that have statistical inference packages built on top of them include Theano [5], Pytorch [44], JAX [10] and TensorFlow [1]. Some of these extend to non-trivial computational constructs such as control flow and recursion [63]. ...
... These high GPU hours make these algorithms impractical, thus raising the need for research to accelerate search speed. Seminal approaches to search acceleration include sequential model-based optimization (SMBO) [24], methods to restrict the scope of the desired structure [25], weight-sharing across different structures [33], and gradient descent as used in the DARTS model [26]. DARTS [26] uses gradient descent to boost the search speed, resulting in a search time of 4 GPU days (compared to the 2000 GPU days for NASNet [59] and the 3150 GPU days for AmoebaNet [35], as mentioned previously). ...
... Early testing has shown favourable results so far. A 2018 machine learning analysis conducted by Google of 216,221 EMRs for adult patients hospitalised for at least a day was effective at predicting in-hospital mortality, 30-day unplanned readmissions and all final diagnoses (Banks, 2020; Rajkomar et al., 2018). However, in many other respects, the available analytics are marred with various shortcomings. ...