January 2024 · 4 Reads · 40 Citations
May 2023 · 1,035 Reads · 6 Citations
The explosive growth of language models and their applications has led to an increased demand for efficient and scalable methods. In this paper, we introduce Flan-MoE, a set of Instruction-Finetuned Sparse Mixture-of-Experts (MoE) models. We show that naively finetuning MoE models on a task-specific dataset (in other words, with no instruction-finetuning) often yields worse performance than dense models of the same computational complexity. However, our Flan-MoE outperforms dense models under multiple experimental settings: instruction-finetuning only, and instruction-finetuning followed by task-specific finetuning. This shows that instruction-finetuning is an essential stage for MoE models. Specifically, our largest model, Flan-MoE-32B, surpasses the performance of Flan-PaLM-62B on four benchmarks while utilizing only one-third of the FLOPs. The success of Flan-MoE encourages rethinking the design of large-scale, high-performance language models under the setting of task-agnostic learning.
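The abstract contrasts two training recipes for the same MoE checkpoint. A minimal sketch of that comparison is below; `load_moe_checkpoint`, `finetune`, and the dataset names are hypothetical placeholders, not the paper's actual training code.

```python
# Hedged sketch of the two recipes compared in the abstract for an MoE checkpoint:
# (a) task-specific finetuning alone, vs. (b) instruction-finetuning on a broad
# task mixture followed by optional task-specific finetuning.

def compare_recipes(load_moe_checkpoint, finetune, instruction_mixture, task_dataset):
    # (a) Naive recipe: task-specific finetuning only (reported to underperform dense models).
    naive = finetune(load_moe_checkpoint(), task_dataset)

    # (b) Flan-MoE-style recipe: instruction-finetune first, then optionally specialize.
    flan_moe = finetune(load_moe_checkpoint(), instruction_mixture)
    flan_moe_specialized = finetune(flan_moe, task_dataset)

    return naive, flan_moe, flan_moe_specialized
```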
May 2023 · 57 Reads · 3 Citations
Pretraining is the preliminary and fundamental step in developing capable language models (LMs). Despite this, pretraining data design is critically under-documented and often guided by empirically unsupported intuitions. To address this, we pretrain 28 1.5B-parameter decoder-only models, training on data curated (1) at different times, (2) with varying toxicity and quality filters, and (3) with different domain compositions. First, we quantify the effect of pretraining data age. A temporal shift between evaluation data and pretraining data leads to performance degradation, which is not overcome by finetuning. Second, we explore the effect of quality and toxicity filters, showing a trade-off between performance on standard benchmarks and risk of toxic generations. Our findings indicate there does not exist a one-size-fits-all solution to filtering training data. We also find that the effects of different types of filtering are not predictable from text domain characteristics. Lastly, we empirically validate that the inclusion of heterogeneous data sources, like books and web, is broadly beneficial and warrants greater prioritization. These findings constitute the largest set of experiments to validate, quantify, and expose many undocumented intuitions about text pretraining, which we hope will help support more informed data-centric decisions in LM development.
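To make the kind of curation studied here concrete, the sketch below shows quality/toxicity filtering plus domain re-weighting. The scoring heuristics, thresholds, and domain weights are hypothetical placeholders, not the paper's actual filters or mixture.

```python
# Hedged sketch of a pretraining-data curation pipeline: filter documents by
# quality and toxicity, then resample to a target domain composition.
import random

def quality_score(text: str) -> float:
    # Placeholder heuristic; real pipelines use classifier- or perplexity-based scores.
    return min(len(text.split()) / 100.0, 1.0)

def toxicity_score(text: str) -> float:
    # Placeholder; real pipelines use a trained toxicity classifier.
    blocklist = {"badword1", "badword2"}
    words = text.lower().split()
    return sum(w in blocklist for w in words) / max(len(words), 1)

def curate(corpus, min_quality=0.3, max_toxicity=0.01):
    """Drop documents that fail the quality or toxicity filters."""
    return [doc for doc in corpus
            if quality_score(doc["text"]) >= min_quality
            and toxicity_score(doc["text"]) <= max_toxicity]

def sample_by_domain(corpus, domain_weights, n, seed=0):
    """Resample the filtered corpus to a target domain mix (e.g. web vs. books)."""
    rng = random.Random(seed)
    by_domain = {}
    for doc in corpus:
        by_domain.setdefault(doc["domain"], []).append(doc)
    sample = []
    for domain, weight in domain_weights.items():
        pool = by_domain.get(domain, [])
        if pool:
            sample += rng.choices(pool, k=int(n * weight))
    return sample
```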
March 2023 · 606 Reads · 569 Citations
We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs. While less capable than humans in many real-world scenarios, GPT-4 exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers. GPT-4 is a Transformer-based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4's performance based on models trained with no more than 1/1,000th the compute of GPT-4.
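The claim of predicting performance from runs with no more than 1/1,000th the compute suggests fitting a smooth scaling curve to small-scale runs and extrapolating. The sketch below illustrates that idea only; the power-law form, initial guesses, and data points are hypothetical and are not OpenAI's actual methodology or measurements.

```python
# Illustrative sketch: fit loss(C) = a * C**(-b) + c to small-run (compute, loss)
# pairs, then extrapolate to a much larger compute budget.
import numpy as np
from scipy.optimize import curve_fit

def power_law(compute, a, b, c):
    return a * compute ** (-b) + c

# Hypothetical measurements from small training runs (compute in arbitrary units).
compute = np.array([1e17, 1e18, 1e19, 1e20])
loss = np.array([3.2, 2.8, 2.5, 2.3])

params, _ = curve_fit(power_law, compute, loss, p0=(10.0, 0.05, 1.5), maxfev=10000)
predicted = power_law(1e23, *params)   # extrapolate ~1000x beyond the largest run
print(f"predicted loss at 1e23 compute: {predicted:.2f}")
```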
January 2023 · 615 Reads · 15 Citations
We study the design decisions of publicly available instruction tuning methods, and break down the development of Flan 2022 (Chung et al., 2022). Through careful ablation studies on the Flan Collection of tasks and methods, we tease apart the effect of design decisions which enable Flan-T5 to outperform prior work by 3-17%+ across evaluation settings. We find that task balancing and enrichment techniques are overlooked but critical to effective instruction tuning, and in particular, training with mixed prompt settings (zero-shot, few-shot, and chain-of-thought) actually yields stronger (2%+) performance in all settings. In further experiments, we show Flan-T5 requires less finetuning to converge higher and faster than T5 on single downstream tasks, motivating instruction-tuned models as more computationally efficient starting checkpoints for new tasks. Finally, to accelerate research on instruction tuning, we make the Flan 2022 collection of datasets, templates, and methods publicly available at https://github.com/google-research/FLAN/tree/main/flan/v2.
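One common task-balancing scheme for instruction-tuning mixtures is examples-proportional sampling with a per-task cap, so that very large datasets do not dominate the mixture. The sketch below shows that idea; the cap value and task sizes are illustrative, not the Flan 2022 settings.

```python
# Hedged sketch of examples-proportional task mixing with a per-task cap.
import random

def mixture_weights(task_sizes: dict, cap: int = 30_000) -> dict:
    """Sampling weights proportional to each task's size, capped at `cap` examples."""
    effective = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(effective.values())
    return {name: size / total for name, size in effective.items()}

def sample_task(weights: dict, rng: random.Random) -> str:
    names, probs = zip(*weights.items())
    return rng.choices(names, weights=probs, k=1)[0]

# Hypothetical task sizes (number of training examples per task).
tasks = {"anli_r1": 17_000, "wmt16_en_de": 4_500_000, "bool_q": 9_400}
weights = mixture_weights(tasks)
rng = random.Random(0)
print(weights)
print([sample_task(weights, rng) for _ in range(5)])
```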
October 2022 · 531 Reads · 126 Citations
Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper, we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.
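"Datasets phrased as instructions" in practice means wrapping each example in a prompt template, in the zero-shot, few-shot, or chain-of-thought format mentioned above. The sketch below shows one way to do that; the template wording is illustrative and is not taken from the actual Flan templates.

```python
# Hedged sketch of phrasing a QA example under three prompt formats.

def zero_shot(question: str) -> str:
    return f"Answer the following question.\n\nQ: {question}\nA:"

def few_shot(question: str, exemplars) -> str:
    # `exemplars` is a list of (question, answer) pairs used as in-context demonstrations.
    demos = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in exemplars)
    return f"Answer the following questions.\n\n{demos}\n\nQ: {question}\nA:"

def chain_of_thought(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."

question = "If a train travels 60 km in 1.5 hours, what is its average speed?"
print(zero_shot(question))
print(few_shot(question, [("What is 2 + 2?", "4")]))
print(chain_of_thought(question))
```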
September 2022 · 131 Reads · 6 Citations
Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. The degree of sparsity thereby decouples the parameter count from the compute per example, allowing for extremely large but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.
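A minimal sketch of the unifying idea, roughly in the spirit of the Mixture-of-Experts and Switch layers the review covers: a router sends each token to a small top-k subset of expert networks, so parameter count scales with the number of experts while per-example compute stays fixed. The dimensions, the ReLU expert MLPs, and the dense dispatch loop below are simplifications for illustration; real implementations add capacity limits and load-balancing losses.

```python
# Hedged sketch of a top-k routed Mixture-of-Experts feed-forward layer.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # per-expert routing logits
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (tokens, d_model)
        logits = self.router(x)                # (tokens, num_experts)
        weights, indices = logits.topk(self.k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):             # dispatch each token to its k experts
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(16, 512)
print(TopKMoELayer()(tokens).shape)  # torch.Size([16, 512])
```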
June 2022 · 365 Reads · 159 Citations
Scaling up language models has been shown to predictably improve performance and sample efficiency on a wide range of downstream tasks. This paper instead discusses an unpredictable phenomenon that we refer to as emergent abilities of large language models. We consider an ability to be emergent if it is not present in smaller models but is present in larger models. Thus, emergent abilities cannot be predicted simply by extrapolating the performance of smaller models. The existence of such emergence implies that additional scaling could further expand the range of capabilities of language models.
June 2022 · 984 Reads · 62 Citations
Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 442 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.
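To make the evaluation setup concrete, the sketch below scores a model on a BIG-bench-style multiple-choice task file. The JSON schema assumed here (an "examples" list with "input" and "target_scores" fields) follows the public BIG-bench repository but should be treated as an assumption, and `score_choice` is a stand-in for a real model's log-likelihood call.

```python
# Hedged sketch of multiple-choice accuracy on a BIG-bench-style JSON task.
import json

def score_choice(model, prompt: str, choice: str) -> float:
    # Hypothetical: return the model's log-likelihood of `choice` given `prompt`.
    return model.loglikelihood(prompt, choice)

def evaluate_multiple_choice(model, task_path: str) -> float:
    with open(task_path) as f:
        task = json.load(f)
    correct = 0
    for example in task["examples"]:
        choices = example["target_scores"]           # {choice_text: 0 or 1}
        best = max(choices, key=lambda c: score_choice(model, example["input"], c))
        correct += choices[best] == 1
    return correct / len(task["examples"])
```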
May 2022 · 1 Read · 27 Citations
... In recent years, the advent of generative artificial intelligence (AI) has marked a transformative era in our society [1][2][3][4][5][6][7][8]. These advanced computational systems have demonstrated exceptional proficiency in interpreting and generating human language, thereby setting new benchmarks in AI's capabilities. ...
March 2023
... The research community has made progress in understanding how training decisions impact downstream performance with respect to data composition. For instance, controlled studies have demonstrated that training on code data improves performance on certain reasoning benchmarks (Aryabumi et al., 2024; Petty et al., 2024), meta-features of data such as age and the use of toxicity filters affect performance on many QA tasks (Longpre et al., 2024), and the balance of multilingual data affects performance on English and other languages (Chang et al., 2023; Yue et al., 2025). These works uncover valuable insights, but they tend to focus on changing only a single aspect of the training recipe while keeping the rest fixed. ...
January 2024
... (2) Efficient Attention and Model Architecture: Modifying the conventional attention architecture (e.g., through the addition or combination of diverse layers), as exemplified by approaches such as SOLAR [45], MOE [46], and Mistral [47], is anticipated to yield more efficient models. These modifications aim to improve computational efficiency, ultimately facilitating an efficient and precise inference process. ...
May 2023
... Data are also often appraised and selected for inclusion with respect to a few key features, such as privacy vulnerabilities and toxic language. 1,13,59,60 Because of the scale involved, much of this appraisal is done via algorithmic filtering. Though archival studies emphasize the importance of documenting the principles and choices involved in appraisal and selection, 57,61 most pretraining datasets provide relatively little information about how or why these choices were made. 1 While there are exceptions (the creators of The Pile, for example, explain their position on copyright and fair use, as well as providing reasons for some exclusions 12), this is not the norm. ...
May 2023
... The model has approximately 7.23 billion parameters. The original model was finetuned on a rich collection of augmented FLAN data (Longpre et al., 2023), following the Orca model (Mukherjee et al., 2023). Essentially, the model was fine-tuned on rich signals from the GPT-4 model (OpenAI, 2023), including step-by-step thought processes and complex instructions guided by teacher assistance. ...
January 2023
... While Fig. 5 depicts the foundational Transformer architecture, various models have incorporated specific modifications. For example, compared to the vanilla Transformer architecture, the Llama family integrates rotary position embeddings and RMSNorm before the feed-forward SwiGLU layer [59][60][61], and the T5 family uses relative positional embeddings and includes LayerNorm at the beginning of each block and the end of the last block [62,63]. Formally, a request instance consists of the verbalized request input and its corresponding SIF representation, which serves as the ground-truth label. ...
October 2022
... This often results in the need for large GPU devices, limiting the model's applications. Therefore, in recent years, multilingual neural machine translation based on Mixture-of-Experts (MOE) (Fedus et al., 2022a) has been proposed. Compared to dense models, MOE-based multilingual machine translation activates only a portion of the network parameters during model training and inference (Lepikhin et al., 2021), giving it excellent computational efficiency. ...
September 2022
... Recent advancements in sparse Mixture-of-Experts (MoE) architectures [54,25,6,41,74] have introduced a paradigm shift in token generation by dynamically activating only a subset of experts per input, achieving superior efficiency in comparison to dense models, particularly under memory-bound constraints of autoregressive decoding [25,74]. This sparse activation approach enables MoE-based language models to generate tokens more swiftly, leveraging the efficiency of selective expert usage and avoiding the overhead of full dense layer invocation. ...
May 2022
... This also improved their interpretation of human-written natural language instructions (i.e., prompting), allowing non-technical users to make requests and adapt a model to new tasks by modifying their prompts, rather than requiring further training or fine-tuning (Stiennon et al., 2022). Current LLMs can perform various linguistic tasks that previously required the use of task-specific, fine-tuned LLMs (Kojima et al., 2023;Wei et al., 2022). Therefore, it is unsurprising that evidence is growing that LLMs can be used for certain types of grading tasks (Kortemeyer, 2023). ...
June 2022
... - BIG-Bench (Beyond the Imitation Game Benchmark): a large-scale benchmark featuring over 200 diverse tasks, such as logic, mathematics, common-sense reasoning, and language generation, aimed at pushing models to exhibit deeper cognitive understanding and reasoning [30]. ...
June 2022