Conference Paper

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

Authors:
  • Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman

... In such cases, we evaluate random samples to expand the dataset. We test these regressors to model the inference latency on the Nvidia A100 GPU for the SST-2 task in the GLUE benchmarking suite [46]. For a pool of high-uncertainty samples evaluated at each iteration, we take smaller subsets of train/validation (80-20%) splits to check the prediction MSE on the validation set after training the regressor on the training set. ...
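A rough sketch of the regressor validation this excerpt describes, with synthetic placeholder data and a scikit-learn regressor standing in for the paper's latency model: fit on an 80-20 train/validation split and report validation MSE.

```python
# Sketch of latency-regressor validation (all data here is synthetic):
# train a regressor on evaluated samples, check MSE on a 20% hold-out.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X = np.random.rand(200, 16)   # placeholder architecture encodings
y = np.random.rand(200)       # placeholder measured SST-2 latencies (ms)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("validation MSE:", mean_squared_error(y_val, reg.predict(X_val)))
```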
... We fine-tune our models on the nine GLUE tasks [46]. We also run automatic hyperparameter tuning in the finetuning process (i.e., search the training recipe) using the tree-structured Parzen estimator algorithm [56]. ...
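The tree-structured Parzen estimator mentioned here is available, for instance, as Optuna's TPESampler. A minimal sketch of a training-recipe search, with a dummy objective standing in for an actual GLUE fine-tuning run:

```python
# Hyperparameter search with the tree-structured Parzen estimator (Optuna).
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-5, 1e-3, log=True)
    batch_size = trial.suggest_categorical("batch_size", [16, 32, 64])
    warmup = trial.suggest_float("warmup_ratio", 0.0, 0.2)
    # Placeholder: fine-tune with (lr, batch_size, warmup) and return dev
    # accuracy; the dummy score below just lets the sketch run end to end.
    return -(lr - 3e-5) ** 2

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=20)
print(study.best_params)
```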
... This demonstrates the capabilities of each hardware platform with like-for-like comparisons. Fig. 11 shows power consumption while running model inference with BERT-Tiny [41] on the SST-2 task [46] for different hardware platforms. Figs. ...
Preprint
Full-text available
Automated design of efficient transformer models has recently attracted significant attention from industry and academia. However, most works only focus on certain metrics while searching for the best-performing transformer architecture. Furthermore, running traditional, complex, and large transformer models on low-compute edge platforms is a challenging problem. In this work, we propose a framework, called ProTran, to profile the hardware performance measures for a design space of transformer architectures and a diverse set of edge devices. We use this profiler in conjunction with the proposed co-design technique to obtain the best-performing models that have high accuracy on the given task and minimize latency, energy consumption, and peak power draw to enable edge deployment. We refer to our framework for co-optimizing accuracy and hardware performance measures as EdgeTran. It searches for the best transformer model and edge device pair. Finally, we propose GPTran, a multi-stage block-level grow-and-prune post-processing step that further improves accuracy in a hardware-aware manner. The obtained transformer model is 2.8$\times$ smaller and has a 0.8% higher GLUE score than the baseline (BERT-Base). Inference with it on the selected edge device enables 15.0% lower latency, 10.0$\times$ lower energy, and 10.8$\times$ lower peak power draw compared to an off-the-shelf GPU.
... We conduct extensive experiments on a wide range of tasks and models to demonstrate the effectiveness of AdaLoRA. Specifically, we evaluate the performance using DeBERTaV3-base (He et al., 2021a) on natural language understanding (GLUE, Wang et al. (2019)) and question answering (SQuADv1, Rajpurkar et al. (2016) and SQuADv2, Rajpurkar et al. (2018)) datasets. We also apply our methods to BART-large and evaluate the performance on natural language generation (XSum, Narayan et al. (2018) and CNN/DailyMail, Hermann et al. (2015)) tasks. ...
... We implement AdaLoRA for fine-tuning DeBERTaV3-base (He et al., 2021a) and BART-large. We evaluate the effectiveness of the proposed algorithm on natural language understanding (GLUE, Wang et al. (2019)), question answering (SQuADv1, Rajpurkar et al. (2016) and SQuADv2, Rajpurkar et al. (2018)), and natural language generation (XSum, Narayan et al. (2018) and CNN/DailyMail, Hermann et al. (2015)). All the gains passed significance tests with p < 0.05. ...
... We present the dataset statistics of GLUE (Wang et al., 2019) in the following table. For each budget level, we tune the final budget b (T ) for AdaLoRA, the rank r for LoRA, the hidden dimension d for two adapters to match the budget requirements. ...
Preprint
Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, common practice fine-tunes all of the parameters in a pre-trained model, which becomes prohibitive when a large number of downstream tasks are present. Therefore, many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter efficient way, e.g., low-rank increments. These methods often evenly distribute the budget of incremental updates across all pre-trained weight matrices, and overlook the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal. To bridge this gap, we propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition. Such a novel approach allows us to effectively prune the singular values of unimportant updates, which is essentially to reduce their parameter budget but circumvent intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable improvement over baselines, especially in the low budget settings. Our code is publicly available at https://github.com/QingruZhang/AdaLoRA .
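A simplified reading of the SVD-style parameterization the abstract describes (illustrative dimensions, not the released AdaLoRA implementation): the update is delta_W = P diag(lam) Q, an orthogonality penalty keeps P and Q close to orthonormal, and the smallest entries of lam are zeroed to meet a parameter budget.

```python
# Sketch of an SVD-parameterized incremental update with budget pruning.
import torch

d_out, d_in, r = 768, 768, 8
P = torch.nn.Parameter(torch.randn(d_out, r) * 0.01)
lam = torch.nn.Parameter(torch.zeros(r))        # acts like singular values
Q = torch.nn.Parameter(torch.randn(r, d_in) * 0.01)

def delta_w():
    return P @ torch.diag(lam) @ Q              # the incremental update

def orth_penalty():
    I = torch.eye(r)
    return ((P.T @ P - I) ** 2).sum() + ((Q @ Q.T - I) ** 2).sum()

with torch.no_grad():                           # prune to a budget of k values
    k = 4
    drop = lam.abs().argsort(descending=True)[k:]
    lam[drop] = 0.0
```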
... Text question-answering [1] systems can answer natural language questions automatically, providing a convenient way for people to obtain required knowledge. A successful system must be able to carry out several Natural Language Processing (NLP) tasks, such as natural language understanding [2], information retrieval [3], or natural language inference [4], making Question Answering (QA) one of the most challenging tasks that has attracted the interest of many NLP researchers. ...
... This setting enables us to confidently evaluate the predicted answers by comparing them to ground truth using overlap-based metrics, thus providing a relatively ideal and convenient test bed for QA models. Many works have used the extractive QA task to test models' ability to answer questions [2,[8][9][10] or focus on achieving better results [11]. ...
Article
Full-text available
Extractive Question Answering, also known as machine reading comprehension, can be used to evaluate how well a computer comprehends human language. It is a valuable topic with many applications, such as in chatbots and personal assistants. End-to-end neural-network-based models have achieved remarkable performance on these tasks. The most frequently used approach to extract answers with neural networks is to predict the answer's start and end positions in the document, independently or jointly. In this paper, we propose another approach that considers all words in an answer jointly. We introduce an encoder-decoder model to learn from all words in the answer. This differs from previous works, which usually focused on the start and end and ignored the words in the middle. To help the encoder-decoder model perform this task better, we employ evaluation-based reinforcement learning with different reward functions. The results of an experiment on the SQuAD dataset show that the proposed method can outperform the baseline in terms of F1 scores, offering another potential approach to solving the extractive QA task.
... To obtain the transformer model accuracy, we employ the FlexiBERT 2.0 surrogate model, which outputs the GLUE score [54]. ...
... We test the models on representative natural language understanding tasks under the GLUE benchmark [54]. The included tasks are: SST-2 [55], MNLI [56], QQP, QNLI, MRPC [57], CoLA [58], STS-B [59], RTE [60], and WNLI [61]. ...
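The GLUE tasks listed in this excerpt are all available through the Hugging Face `datasets` hub; a quick way to inspect one (shown for SST-2; other configs include "mnli", "qqp", "qnli", "mrpc", "cola", "stsb", "rte", and "wnli"):

```python
# Load one GLUE task and peek at a training example.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
print(sst2["train"][0])  # {'sentence': ..., 'label': ..., 'idx': ...}
```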
Preprint
Full-text available
Automated co-design of machine learning models and evaluation hardware is critical for efficiently deploying such models at scale. Despite the state-of-the-art performance of transformer models, they are not yet ready for execution on resource-constrained hardware platforms. High memory requirements and low parallelizability of the transformer architecture exacerbate this problem. Recently-proposed accelerators attempt to optimize the throughput and energy consumption of transformer models. However, such works are either limited to a one-sided search of the model architecture or a restricted set of off-the-shelf devices. Furthermore, previous works only accelerate model inference and not training, which incurs substantially higher memory and compute resources, making the problem even more challenging. To address these limitations, this work proposes a dynamic training framework, called DynaProp, that speeds up the training process and reduces memory consumption. DynaProp is a low-overhead pruning method that prunes activations and gradients at runtime. To effectively execute this method on hardware for a diverse set of transformer architectures, we propose ELECTOR, a framework that simulates transformer inference and training on a design space of accelerators. We use this simulator in conjunction with the proposed co-design technique, called TransCODE, to obtain the best-performing models with high accuracy on the given task and minimize latency, energy consumption, and chip area. The obtained transformer-accelerator pair achieves 0.3% higher accuracy than the state-of-the-art pair while incurring 5.2$\times$ lower latency and 3.0$\times$ lower energy consumption.
... Pretrained Language Models (PLMs) based on Transformers (Vaswani et al., 2017), such as BERT (Devlin et al., 2019), XLNet (Yang et al., 2019a), RoBERTa, and ALBERT (Lan et al., 2019), have achieved significant improvements across various NLP tasks, such as GLUE (Wang et al., 2018), MultiNLI, and SQuAD (Rajpurkar et al., 2016). Although PLMs were originally developed for English, they are easily extended to other languages such as Chinese (Conneau et al., 2020; Cui et al., 2020, 2021; Sun et al., 2021b). ...
... Different from static word representations (Mikolov et al., 2013; Pennington et al., 2014), unsupervised pre-trained language models provide contextualized representations where each token embedding varies dynamically according to the context. The milestone work BERT employs Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) tasks to pre-train the model on large-scale plain text, and it significantly improved the state of the art in Natural Language Understanding (NLU) tasks (Wang et al., 2018; Rajpurkar et al., 2016). ...
Preprint
Full-text available
Pretrained language models (PLMs) have shown marvelous improvements across various NLP tasks. Most Chinese PLMs simply treat an input text as a sequence of characters, and completely ignore word information. Although Whole Word Masking can alleviate this, the semantics in words is still not well represented. In this paper, we revisit the segmentation granularity of Chinese PLMs. We propose a mixed-granularity Chinese BERT (MigBERT) by considering both characters and words. To achieve this, we design objective functions for learning both character- and word-level representations. We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT. Experimental results show that MigBERT achieves new SOTA performance on all these tasks. Further analysis demonstrates that words are semantically richer than characters. More interestingly, we show that MigBERT also works with Japanese. Our code and model have been released at https://github.com/xnliang98/MigBERT.
... NLI (formerly known as Recognizing Textual Entailment (RTE)) is one of the core tasks in the popular benchmarks for Natural Language Understanding, GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). Hundreds of machine learning systems compete on these benchmarks, improving the state of NLU. ...
... RTE was later reformulated as a three-way decision and ultimately renamed Natural Language Inference in the SNLI (Bowman et al., 2015) and MNLI corpora. Both the RTE and the NLI tasks form part of the Natural Language Understanding benchmarks GLUE (Wang et al., 2018) and SuperGLUE (Wang et al., 2019). The NLU benchmarks attracted a lot of attention from the community, and by 2020 the state-of-the-art systems reported human-level performance. ...
Article
Full-text available
In this paper, we present INFERES, an original corpus for Natural Language Inference (NLI) in European Spanish. We propose, implement, and analyze a variety of corpus-creating strategies utilizing expert linguists and crowd workers. The objectives behind INFERES are to provide high-quality data and, at the same time, to facilitate the systematic evaluation of automated systems. Specifically, we focus on measuring and improving the performance of machine learning systems on negation-based adversarial examples and their ability to generalize across out-of-distribution topics. We train two transformer models on INFERES (8,055 gold examples) in a variety of scenarios. Our best model obtains 72.8% accuracy, leaving a lot of room for improvement. The "hypothesis-only" baseline performs only 2%-5% higher than majority, indicating much fewer annotation artifacts than prior work. We find that models trained on INFERES generalize very well across topics (both in- and out-of-distribution) and perform moderately well on negation-based adversarial examples.
... Models such as IndoNLG [76], MuRIL [77], IndicNLPSuite [78], mT5 [79], mT6 [80], XLM-R [81], XLM-E [82], and INFOXLM [83] are multilingual pretrained models trained on larger datasets producing optimal performance, evaluated on benchmarks such as GLUE [84], RACE, and SQuAD. A summary of pretraining models with their datasets is in Table 3; for example, the T2T Transformer [11], pretrained on the Colossal Clean Crawled Corpus (C4) [11], developed a common framework to convert a variety of text-based language problems into a text-to-text format and is evaluated on GLUE and SQuAD. ...
... Evaluating a transformer-based pretrained model is vital, as model efficacy is paramount to its adoption. Some benchmarking frameworks have been proposed in this regard for general [84] and domain-specific models [35,73]. In [76,78], there are benchmarks to evaluate monolingual and multilingual language models. ...
Article
Full-text available
Transfer learning is a technique utilized in deep learning applications to transfer learned inference to a different target domain. The approach mainly addresses the problem of small training datasets resulting in model overfitting, which affects model performance. The study was carried out on publications retrieved from various digital libraries such as SCOPUS, ScienceDirect, IEEE Xplore, ACM Digital Library, and Google Scholar, which formed the primary studies. Secondary studies were retrieved from primary articles using the backward and forward snowballing approach. Based on set inclusion and exclusion parameters, relevant publications were selected for review. The study focused on transfer learning with pretrained NLP models based on the deep transformer network. BERT and GPT were the two elite pretrained models trained to classify global and local representations based on larger unlabeled text datasets through self-supervised learning. Pretrained transformer models offer numerous advantages to natural language processing models, such as knowledge transfer to downstream tasks, which addresses the drawbacks associated with training a model from scratch. This review gives a comprehensive view of transformer architecture, self-supervised learning and pretraining concepts in language models, and their adaptation to downstream tasks. Finally, we present future directions for further improvement in pretrained transformer-based language models.
... In recent years, benchmarks have been gaining popularity in Machine Learning and Natural Language Processing (NLP) communities because of their ability to holistically evaluate model performance over a variety of representative tasks, thus allowing practitioners to compare and contrast different models on multiple tasks relevant for the specific application domain. General Language Understanding Evaluation (GLUE) [5] and SuperGLUE [6] are examples of popular NLP benchmarks which measure the natural language understanding capabilities of state-of-the-art (SOTA) models. NLP benchmarks are also developing rapidly in language domains, with LexGLUE [7] being an example of a recent benchmark hosting several difficult tasks in the legal language domain. ...
... NLP benchmarks have been gaining popularity in recent years because of their ability to holistically evaluate model performance over a variety of representative tasks. GLUE [5] and SuperGLUE [6] are examples of benchmarks that evaluate SOTA models on a range of natural language understanding tasks. The Generation, Evaluation and Metrics (GEM) benchmark [12] looks beyond text classification and measures performance in Natural Language Generation tasks, such as summarization and data-to-text conversion. ...
Article
Full-text available
Benchmarks for general language understanding have been rapidly developing in recent years of NLP research, particularly because of their utility in choosing strong-performing models for practical downstream applications. While benchmarks have been proposed in the legal language domain, virtually no such benchmarks exist for privacy policies despite their increasing importance in modern digital life. This could be explained by privacy policies falling under the legal language domain, but we find evidence to the contrary that motivates a separate benchmark for privacy policies. Consequently, we propose PrivacyGLUE as the first comprehensive benchmark of relevant and high-quality privacy tasks for measuring general language understanding in the privacy language domain. Furthermore, we release performances from multiple transformer language models and perform model–pair agreement analysis to detect tasks where models benefited from domain specialization. Our findings show the importance of in-domain pretraining for privacy policies. We believe PrivacyGLUE can accelerate NLP research and improve general language understanding for humans and AI algorithms in the privacy language domain, thus supporting the adoption and acceptance rates of solutions based on it.
... prediction of lung cancer with context-independent word embedding methods (WEMs) and a simple logistic regression objective, with promising preliminary results. Ideally, one would like to use state-of-the-art contextualised pretrained language models (PLMs) instead of simpler context-independent WEMs, since PLMs typically show improved performance across a range of different Natural Language Processing tasks [23,22], including tasks in the (bio)medical domain [4,19]. However, there are at least two important reasons that make using PLMs difficult for predicting lung cancer from patient free-text notes: i) Even though PLMs are language models pretrained on large amounts of raw text using self-supervised learning [3,21], such models are data-hungry and typically need large amounts of supervised training data to achieve good predictive performance on any specific task. ...
Preprint
Full-text available
We investigate different natural language processing (NLP) approaches based on contextualised word representations for the problem of early prediction of lung cancer using free-text patient medical notes of Dutch primary care physicians. Because lung cancer has a low prevalence in primary care, we also address the problem of classification under highly imbalanced classes. Specifically, we use large Transformer-based pretrained language models (PLMs) and investigate: 1) how soft prompt-tuning, an NLP technique used to adapt PLMs using small amounts of training data, compares to standard model fine-tuning; 2) whether simpler static word embedding models (WEMs) can be more robust compared to PLMs in highly imbalanced settings; and 3) how models fare when trained on notes from a small number of patients. We find that 1) soft prompt-tuning is an efficient alternative to standard model fine-tuning; 2) PLMs show better discrimination but worse calibration compared to simpler static word embedding models as the classification problem becomes more imbalanced; and 3) results when training models on a small number of patients are mixed and show no clear differences between PLMs and WEMs. All our code is available open source at https://bitbucket.org/aumc-kik/prompt_tuning_cancer_prediction/.
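A minimal sketch of the soft prompt-tuning idea this abstract compares against fine-tuning, assuming a generic BERT checkpoint: the PLM stays frozen and only a small matrix of prompt embeddings, prepended to the token embeddings, is trained. The checkpoint name and prompt length are illustrative.

```python
# Soft prompt-tuning sketch: freeze the PLM, train only prompt embeddings.
import torch
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("bert-base-uncased")
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
for p in model.parameters():
    p.requires_grad = False                      # keep the PLM frozen

n_prompt, d = 20, model.config.hidden_size
prompt = torch.nn.Parameter(torch.randn(n_prompt, d) * 0.02)  # trainable

enc = tok("free-text patient note", return_tensors="pt")
tok_emb = model.embeddings.word_embeddings(enc["input_ids"])   # (1, L, d)
inputs = torch.cat([prompt.unsqueeze(0), tok_emb], dim=1)      # prepend prompts
mask = torch.cat([torch.ones(1, n_prompt, dtype=enc["attention_mask"].dtype),
                  enc["attention_mask"]], dim=1)
out = model(inputs_embeds=inputs, attention_mask=mask)
```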
... However, Transformer specific adaptation for visual tasks has received relatively less attention. At the same time, in the NLP domain, the dominance of large-scale pre-trained Transformer-based Large Language Models (LLM) [4,11,61], has paved way for many approaches [26,29,31] that efficiently fine-tune LLMs for different downstream NLP tasks [82,83]. In this work we compare with the most representative methods for fair benchmarking. ...
Preprint
Prompt learning is an efficient approach to adapt transformers by inserting a learnable set of parameters into the input and intermediate representations of a pre-trained model. In this work, we present Expressive Prompts with Residuals (EXPRES), which modifies the prompt learning paradigm specifically for effective adaptation of vision transformers (ViT). Our method constructs downstream representations via learnable "output" tokens that are akin to the learned class tokens of the ViT. Further, for better steering of the downstream representation processed by the frozen transformer, we introduce residual learnable tokens that are added to the output of various computations. We apply EXPRES to image classification, few-shot learning, and semantic segmentation, and show our method is capable of achieving state-of-the-art prompt tuning on 3/3 categories of the VTAB benchmark. In addition to strong performance, we observe that our approach is an order of magnitude more prompt-efficient than existing visual prompting baselines. We analytically show the computational benefits of our approach over weight-space adaptation techniques like fine-tuning. Lastly, we systematically corroborate the architectural design of our method via a series of ablation experiments.
... In this way, words are related to each other even in the case of long-term dependency. BERT has been widely adopted and has achieved state-of-the-art performance on a variety of benchmarks such as GLUE [8] for natural language understanding (NLU). ...
Preprint
Attention mechanisms have played a crucial role in the development of complex architectures such as Transformers in natural language processing. However, Transformers remain hard to interpret and are considered as black-boxes. This paper aims to assess how attention coefficients from Transformers can help in providing interpretability. A new attention-based interpretability method called CLaSsification-Attention (CLS-A) is proposed. CLS-A computes an interpretability score for each word based on the attention coefficient distribution related to the part specific to the classification task within the Transformer architecture. A human-grounded experiment is conducted to evaluate and compare CLS-A to other interpretability methods. The experimental protocol relies on the capacity of an interpretability method to provide explanation in line with human reasoning. Experiment design includes measuring reaction times and correct response rates by human subjects. CLS-A performs comparably to usual interpretability methods regarding average participant reaction time and accuracy. The lower computational cost of CLS-A compared to other interpretability methods and its availability by design within the classifier make it particularly interesting. Data analysis also highlights the link between the probability score of a classifier prediction and adequate explanations. Finally, our work confirms the relevancy of the use of CLS-A and shows to which extent self-attention contains rich information to explain Transformer classifiers.
... Datasets, models, and baselines. We use ResNet-9 classifiers trained on the CIFAR dataset (CIFAR-10, and a two-class subset called CIFAR-2); ResNet-18 [HZR+15] classifiers trained on the 1000-class ImageNet [RDS+15] dataset, and pre-trained BERT [DCL+19] models finetuned on the QNLI (Question-answering Natural Language Inference) classification task from the GLUE benchmark [WSM+18]. We provide further details on these choices of dataset and task in Appendix A. ...
Preprint
The goal of data attribution is to trace model predictions back to training data. Despite a long line of work towards this goal, existing approaches to data attribution tend to force users to choose between computational tractability and efficacy. That is, computationally tractable methods can struggle with accurately attributing model predictions in non-convex settings (e.g., in the context of deep neural networks), while methods that are effective in such regimes require training thousands of models, which makes them impractical for large models or datasets. In this work, we introduce TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models. In particular, by leveraging only a handful of trained models, TRAK can match the performance of attribution methods that require training thousands of models. We demonstrate the utility of TRAK across various modalities and scales: image classifiers trained on ImageNet, vision-language models (CLIP), and language models (BERT and mT5). We provide code for using TRAK (and reproducing our work) at https://github.com/MadryLab/trak .
... The novel reward function uses pre-trained MPNet [52] as a basic building block to acquire semantically accurate embeddings. The MPNet model has produced cutting-edge results on several natural language processing tasks including GLUE [55], SQuAD [40,41], RACE [21], and sentiment prediction [29] benchmarks. In the following, we provide a brief overview of the GPT-2 and MPNet models. ...
Preprint
Full-text available
Task-oriented dialog systems enable users to accomplish tasks using natural language. State-of-the-art systems respond to users in the same way regardless of their personalities, although personalizing dialogues can lead to higher levels of adoption and better user experiences. Building personalized dialog systems is an important, yet challenging endeavor, and only a handful of works have taken on the challenge. Most existing works rely on supervised learning approaches and require laborious and expensive labeled training data for each user profile. Additionally, collecting and labeling data for each user profile is virtually impossible. In this work, we propose a novel framework, P-ToD, to personalize task-oriented dialog systems capable of adapting to a wide range of user profiles in an unsupervised fashion using a zero-shot generalizable reward function. P-ToD uses a pre-trained GPT-2 as a backbone model and works in three phases. Phase one performs task-specific training. Phase two kicks off unsupervised personalization by leveraging the proximal policy optimization algorithm that performs policy gradients guided by the zero-shot generalizable reward function. Our novel reward function can quantify the quality of the generated responses even for unseen profiles. The optional final phase fine-tunes the personalized model using a few labeled training examples. We conduct extensive experimental analysis using the personalized bAbI dialogue benchmark for five tasks and up to 180 diverse user profiles. The experimental results demonstrate that P-ToD, even when it had access to zero labeled examples, outperforms state-of-the-art supervised personalization models and achieves competitive performance on BLEU and ROUGE metrics when compared to a strong fully-supervised GPT-2 baseline.
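A hedged sketch of a semantic-similarity reward in the spirit of the MPNet-based reward function described above, using the sentence-transformers package; the checkpoint name and the use of raw cosine similarity are illustrative assumptions, not the paper's exact formulation.

```python
# Cosine-similarity reward between a generated response and a reference,
# built on MPNet sentence embeddings.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-mpnet-base-v2")

def reward(generated: str, reference: str) -> float:
    emb = encoder.encode([generated, reference], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

print(reward("Booked a table for two at 7pm.",
             "Your reservation for two people at 7pm is confirmed."))
```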
... For example, the BERT-Tiny model has N = 3 blocks, d_emb = 768 embedding dimensions, H = 12 attention heads, and n = 30 input tokens. Datasets for the five BERT tasks are SQuAD1 [18], SQuAD2 [11], and MNLI-m, MRPC, and SST-2 from the GLUE benchmark [22]. ...
Preprint
Full-text available
It is increasingly important to enable privacy-preserving inference for cloud services based on Transformers. Post-quantum cryptographic techniques, e.g., fully homomorphic encryption (FHE) and multi-party computation (MPC), are popular methods to support private Transformer inference. However, existing works still suffer from prohibitive computational and communication overhead. In this work, we present Primer, which enables a fast and accurate Transformer over encrypted data for natural language processing tasks. In particular, Primer is constructed from a hybrid cryptographic protocol optimized for attention-based Transformer models, as well as techniques including computation merge and tokens-first ciphertext packing. Comprehensive experiments on encrypted language modeling show that Primer achieves state-of-the-art accuracy and reduces the inference latency by 90.6% ~ 97.5% over previous methods.
... Those who would grant understanding to current or near-future LLMs base their views on the performance of these models on several measures, including subjective judgment of the quality of the text generated by the model in response to prompts (although such judgments can be vulnerable to the Eliza effect), and more objective performance on benchmark datasets designed to assess language understanding and reasoning. For example, two standard benchmarks for assessing LLMs are the General Language Understanding Evaluation (GLUE) (27) and its successor (SuperGLUE) (28), which include large-scale datasets with tasks such as "textual entailment" (given two sentences, can the meaning of the second be inferred from the first?), "words in context" (does a given word have the same meaning in two different sentences?), and yes/no question answering, among others. OpenAI's GPT-3, with 175 billion parameters, performed surprisingly well on these tasks (5), and Google's PaLM, with 540 billion parameters, performed even better (7), often equaling or surpassing humans on the same tasks. ...
Article
We survey a current, heated debate in the artificial intelligence (AI) research community on whether large pretrained language models can be said to understand language, and the physical and social situations language encodes, in any humanlike sense. We describe arguments that have been made for and against such understanding and key questions for the broader sciences of intelligence that have arisen in light of these arguments. We contend that an extended science of intelligence can be developed that will provide insight into distinct modes of understanding, their strengths and limitations, and the challenge of integrating diverse forms of cognition.
... While performing such comparisons requires significant resources in terms of data and computing power, they would be highly valuable for establishing a clearer understanding of the current state of the art in the field. Efforts such as GLUE in NLP [227] are needed in spoken dialogue systems. Nonetheless, the battle so far seems to be between BERT and GPT, and novel architectures performing even better remain to be discovered. ...
Preprint
Full-text available
The remarkable success of transformers in the field of natural language processing has sparked the interest of the speech-processing community, leading to an exploration of their potential for modeling long-range dependencies within speech sequences. Recently, transformers have gained prominence across various speech-related domains, including automatic speech recognition, speech synthesis, speech translation, speech para-linguistics, speech enhancement, spoken dialogue systems, and numerous multimodal applications. In this paper, we present a comprehensive survey that aims to bridge research studies from diverse subfields within speech technology. By consolidating findings from across the speech technology landscape, we provide a valuable resource for researchers interested in harnessing the power of transformers to advance the field. We identify the challenges encountered by transformers in speech processing while also offering insights into potential solutions to address these issues.
... The author of [20] lists several NLP packages that are widely used: ...
... BERT has shown excellent performance on many NLP tasks and is now a de-facto standard in NLP. In the initial evaluation [51], BERT showed improved performance on all eight tasks from the GLUE (general language understanding evaluation) benchmark suite [219], consisting of question answering, named entity recognition, and common-sense inference. A variant of BERT, called RoBERTa [137], which only uses masked language model training but on a larger dataset and for a longer time, has become a popular practical choice due to its improved robustness and better parallel training capability. ...
Article
Natural language processing (NLP) is an area of artificial intelligence that applies information technologies to process the human language, understand it to a certain degree, and use it in various applications. This area has rapidly developed in the last few years and now employs modern variants of deep neural networks to extract relevant patterns from large text corpora. The main objective of this work is to survey the recent use of NLP in the field of pharmacology. As our work shows, NLP is a highly relevant information extraction and processing approach for pharmacology. It has been used extensively, from intelligent searches through thousands of medical documents to finding traces of adversarial drug interactions in social media. We split our coverage into five categories to survey modern NLP methodology, commonly addressed tasks, relevant textual data, knowledge bases, and useful programming libraries. We split each of the five categories into appropriate subcategories, describe their main properties and ideas, and summarize them in a tabular form. The resulting survey presents a comprehensive overview of the area, useful to practitioners and interested observers.
Significance Statement: The main objective of this work is to survey the recent use of NLP in the field of pharmacology, in order to provide a comprehensive overview of the current state of the area after the rapid developments that occurred in the last few years. We believe the resulting survey will be useful to practitioners and interested observers in the domain.
... The next question that needs to be addressed is which representations are chosen to compute the distribution probabilities of the relation labels. Inspired by BERT, which uses the representation of [CLS] to handle sentence-pair classification tasks such as MNLI (multi-genre natural language inference), QQP (Quora question pairs), and QNLI (question-answering natural language inference) [38], one can speculate that the representation of [CLS] contains the semantic features of the sentence pair; in other words, in the model proposed in this paper, the representation of [CLS] can be regarded as a kind of hybrid feature. The subsequent computational process is consistent with Section 3.2.1, and the representation of [CLS] containing the semantic information is concatenated with the representations of [ES] according to the following equation: ...
Article
Full-text available
Relation extraction, a fundamental task in natural language processing, aims to extract entity triples from unstructured data. These triples can then be used to build a knowledge graph. Recently, pre-training models that have learned prior semantic and syntactic knowledge, such as BERT and ERNIE, have enhanced the performance of relation extraction tasks. However, previous research has mainly focused on sequential or structural data alone, such as the shortest dependency path, ignoring the fact that fusing sequential and structural features may improve the classification performance. This study proposes a concise approach using the fused features for the relation extraction task. Firstly, for the sequential data, we verify in detail which of the generated representations can effectively improve the performance. Secondly, inspired by the pre-training task of next-sentence prediction, we propose a concise relation extraction approach based on the fusion of sequential and structural features using the pre-training model ERNIE. The experiments were conducted on the SemEval 2010 Task 8 dataset and the results show that the proposed method can improve the F1 value to 0.902.
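As a hedged illustration of the fusion step the paper describes (the [CLS] vector, treated as a hybrid sentence-level feature, concatenated with entity-marker representations before classification), here is a PyTorch sketch with a generic BERT encoder; the marker tokens, checkpoint, and classifier head are placeholders rather than the paper's exact ERNIE-based setup.

```python
# Concatenate [CLS] with entity-start marker representations, then classify.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
tok.add_tokens(["[ES1]", "[EE1]", "[ES2]", "[EE2]"])   # hypothetical markers
model.resize_token_embeddings(len(tok))

text = "[ES1] Alice [EE1] works for [ES2] Acme Corp [EE2] ."
enc = tok(text, return_tensors="pt")
h = model(**enc).last_hidden_state                     # (1, L, d)

def marker_vec(marker):
    ids = enc["input_ids"][0]
    pos = (ids == tok.convert_tokens_to_ids(marker)).nonzero()[0, 0]
    return h[:, pos]

feats = torch.cat([h[:, 0], marker_vec("[ES1]"), marker_vec("[ES2]")], dim=-1)
logits = torch.nn.Linear(feats.size(-1), 19)(feats)    # 19 SemEval-2010 labels
```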
... In this context, Instance-wise Contrastive Learning (Instance-CL) has recently achieved remarkable success in self-supervised learning. Adding a Supervised Contrastive Learning (SCL) [24] term to the fine-tuning objective can significantly improve performance on natural language understanding tasks from the GLUE benchmark [25]. Compared to [12,13,15], the effectiveness of SCL is studied in this paper. ...
Article
Full-text available
Instructors face significant time and effort constraints when grading students’ assessments on a large scale. Clustering similar assessments is a unique and effective technique that has the potential to significantly reduce the workload of instructors in online and large-scale learning environments. By grouping together similar assessments, marking one assessment in a cluster can be scaled to other similar assessments, allowing for a more efficient and streamlined grading process. To address this issue, this paper focuses on text assessments and proposes a method for reducing the workload of instructors by clustering similar assessments. The proposed method involves the use of distributed representation to transform texts into vectors, and contrastive learning to improve the representation that distinguishes the differences among similar texts. The paper presents a general framework for clustering similar texts that includes label representation, K-means, and self-organizing map algorithms, with the objective of improving clustering performance using Accuracy (ACC) and Normalized Mutual Information (NMI) metrics. The proposed framework is evaluated experimentally using two real datasets. The results show that self-organizing maps and K-means algorithms with pre-trained language models outperform label representation algorithms on different datasets.
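A compact sketch of the clustering pipeline the paper evaluates, under assumed components: embed each text with a pre-trained sentence encoder, cluster with K-means, and score against gold labels with NMI (the encoder name and toy data are placeholders).

```python
# Embed texts, cluster with K-means, evaluate with NMI.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score

texts = ["answer about recursion", "answer about loops",
         "recursion explained again", "a for-loop example"]
gold = [0, 1, 0, 1]

emb = SentenceTransformer("all-MiniLM-L6-v2").encode(texts)
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
print("NMI:", normalized_mutual_info_score(gold, pred))
```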
... Each pair was labeled by human annotators as a paraphrase or not [14,35]. Wang et al. [49] proposed GLUE (General Language Understanding Evaluation), a benchmark for evaluating and analyzing the performance of models across a diverse range of existing NLU tasks based on NLI. Moreover, Wang et al. [48] proposed SuperGLUE [46] as an improvement on GLUE, with more challenging tasks, more diverse task formats, and so on. ...
Article
Full-text available
Natural Language Inference (NLI) is a hot research topic in natural language processing; contradiction detection between sentences is a special case of NLI. It is considered a difficult NLP task that has a big influence when added as a component to many NLP applications, such as question answering systems and text summarization. Arabic is one of the most challenging low-resource languages for detecting contradictions due to its rich lexical and semantic ambiguity. We have created a dataset of more than 12k sentences, named ArNLI, that will be publicly available. Moreover, we have applied a new model, inspired by Stanford's proposed solutions for contradiction detection in English. We propose an approach to detect contradictions between pairs of sentences in Arabic using a contradiction vector combined with a language model vector as input to a machine learning model. We analyzed the results of different traditional machine learning classifiers and compared their results on our created dataset (ArNLI) and on automatic translations of both the PHEME and SICK English datasets. The best results were achieved using a Random Forest classifier with accuracies of 99%, 60%, and 75% on PHEME, SICK, and ArNLI, respectively.
... We experiment on seven NLP tasks that have been widely used in the literature (Kim, 2014; Wang et al., 2018; 2019). These evaluation tasks and an example prompt/target pair are shown in Figure 9 in the Appendix; additional dataset details are described in Appendix A. The seven tasks are: Sentiment Analysis (Socher et al., 2013, SST-2); Subjective/Objective Sentence Classification (Conneau & Kiela, 2018, SUBJ); Question Classification (Li & Roth, 2002, TREC); Duplicated-Question Recognition (Chen et al., 2017; Wang et al., 2018, QQP); Textual Entailment Recognition (Dagan et al., 2006); ...
Preprint
Full-text available
We study how in-context learning (ICL) in language models is affected by semantic priors versus input-label mappings. We investigate two setups, ICL with flipped labels and ICL with semantically-unrelated labels, across various model families (GPT-3, InstructGPT, Codex, PaLM, and Flan-PaLM). First, experiments on ICL with flipped labels show that overriding semantic priors is an emergent ability of model scale. While small language models ignore flipped labels presented in-context and thus rely primarily on semantic priors from pretraining, large models can override semantic priors when presented with in-context exemplars that contradict priors, despite the stronger semantic priors that larger models may hold. We next study semantically-unrelated label ICL (SUL-ICL), in which labels are semantically unrelated to their inputs (e.g., foo/bar instead of negative/positive), thereby forcing language models to learn the input-label mappings shown in in-context exemplars in order to perform the task. The ability to do SUL-ICL also emerges primarily with scale, and large-enough language models can even perform linear classification in a SUL-ICL setting. Finally, we evaluate instruction-tuned models and find that instruction tuning strengthens both the use of semantic priors and the capacity to learn input-label mappings, but more of the former.
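The flipped-label setup from this abstract can be illustrated with a simple prompt builder: exemplars are shown with inverted sentiment labels, so a model that truly learns the in-context input-label mapping should flip its predictions as well (the exemplars below are illustrative).

```python
# Build an ICL prompt whose exemplar labels are flipped.
exemplars = [
    ("a gorgeous, witty, seductive movie", "positive"),
    ("the plot is paper-thin and boring", "negative"),
]

def flip(label):
    return "negative" if label == "positive" else "positive"

def build_prompt(query, flipped=True):
    lines = [f"Review: {x}\nSentiment: {flip(y) if flipped else y}"
             for x, y in exemplars]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

print(build_prompt("an utterly charming film"))
```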
... It achieved significantly worse results in symbolic reasoning and named-entity recognition. Similarly, Zhong et al. (2023) compared ChatGPT with fine-tuned language models BERT (Kenton and Toutanova, 2019) and RoBERTa (Liu et al., 2019) on the GLUE benchmark (Wang et al., 2018), consisting of sentiment analysis, linguistic acceptability, paraphrase, textual similarity, natural language inference, and question answering. The overall results showed that ChatGPT performed comparably to the BERT model, while it was outperformed by the RoBERTa model. ...
Preprint
Full-text available
ChatGPT has shown strong capabilities in natural language generation tasks, which naturally leads researchers to explore where its abilities end. In this paper, we examine whether ChatGPT can be used for zero-shot text classification, more specifically, automatic genre identification. We compare ChatGPT with a multilingual XLM-RoBERTa language model that was fine-tuned on datasets manually annotated with genres. The models are compared on test sets in two languages: English and Slovenian. Results show that ChatGPT outperforms the fine-tuned model when applied to a dataset which was not seen before by either of the models. Even when applied to Slovenian as an under-resourced language, ChatGPT's performance is no worse than when applied to English. However, if the model is fully prompted in Slovenian, the performance drops significantly, showing the current limitations of ChatGPT usage for smaller languages. The presented results lead us to question whether this is the beginning of the end of laborious manual annotation campaigns, even for smaller languages such as Slovenian.
... However, in different situations, the best-performance model is also different. For example, ELECTRA achieves better performance in some tasks in GLUE [28], ALBERT requires less training cost, and RoFormer is more effective in Chinese NLP tasks. As a result, it would be satisfactory if a proposed method could be compatible with multiple pre-trained models. ...
Article
Full-text available
In practical applications, the raw input to a Knowledge Based Question Answering (KBQA) system may vary in form, expression, source, etc. As a result, the actual input to the system may contain various errors caused by noise in the raw data and by processes of transmission, transformation, translation, etc. It is therefore important to evaluate and enhance the robustness of a KBQA model to various noisy questions. In this paper, we generate 29 datasets of various noisy questions based on the original SimpleQuestions dataset to evaluate and enhance the robustness of a KBQA model, and propose a model which is more robust to various noisy questions. Compared with traditional methods, our main contributions are a method of generating datasets of different noisy questions to evaluate the robustness of a KBQA model, and a KBQA model which incorporates incremental learning and a Masked Language Model (MLM) in the question answering process, so that our model is less affected by different kinds of noise in questions and achieves higher accuracies on datasets of different noisy questions, which shows its robustness. Experimental results show that our model achieves an average accuracy of 78.1% on these datasets and outperforms the baseline BERT-based model by an average margin of 5.0% with similar training cost. In addition, further experiments show that our model is compatible with other pre-trained models such as ALBERT and ELECTRA.
... GLUE: The General Language Understanding Evaluation (GLUE) benchmark [22] is a tool for evaluating and analyzing the performance of natural language understanding models across nine NLU tasks, grouped into single-sentence tasks, similarity and paraphrase tasks, and inference tasks. ...
Preprint
Natural language understanding (NLU) is challenging for finance due to the lack of annotated data and the specialized language in that domain. As a result, researchers have proposed using pre-trained language models and multi-task learning to learn robust representations. However, aggressive fine-tuning often causes over-fitting, and multi-task learning may favor tasks with significantly larger amounts of data, etc. To address these problems, in this paper we investigate the model-agnostic meta-learning algorithm (MAML) on low-resource financial NLU tasks. Our contributions include: 1. we explore the performance of the MAML method with multiple types of tasks: GLUE datasets, SNLI, Sci-Tail, and Financial PhraseBank; 2. we study the performance of the MAML method with multiple single-type tasks: a real-scenario stock price prediction problem with Twitter text data. Our models achieve state-of-the-art performance according to the experimental results, which demonstrate that our method can adapt fast and well to low-resource situations.
... In our setting, D_T covers a small subset of topics in D_S, which is the 20160901 version dump of Wikipedia. Our tasks are different from GLUE-like multi-task learning [Wang et al., 2019b], because our focus is on the problems created by the divergence between prominent sense-dominated generic word embeddings and their sense in narrow target topics. We do not experiment on the cross-domain sentiment classification task popular in domain adaptation papers, since they benefit more from sharing sentiment-bearing words than from learning the correct sense of polysemous words, which is our focus here. ...
Preprint
Full-text available
Our goal is to improve reliability of Machine Learning (ML) systems deployed in the wild. ML models perform exceedingly well when test examples are similar to train examples. However, real-world applications are required to perform on any distribution of test examples. Current ML systems can fail silently on test examples with distribution shifts. In order to improve reliability of ML models due to covariate or domain shift, we propose algorithms that enable models to: (a) generalize to a larger family of test distributions, (b) evaluate accuracy under distribution shifts, (c) adapt to a target distribution. We study causes of impaired robustness to domain shifts and present algorithms for training domain robust models. A key source of model brittleness is due to domain overfitting, which our new training algorithms suppress and instead encourage domain-general hypotheses. While we improve robustness over standard training methods for certain problem settings, performance of ML systems can still vary drastically with domain shifts. It is crucial for developers and stakeholders to understand model vulnerabilities and operational ranges of input, which could be assessed on the fly during the deployment, albeit at a great cost. Instead, we advocate for proactively estimating accuracy surfaces over any combination of prespecified and interpretable domain shifts for performance forecasting. We present a label-efficient estimation to address estimation over a combinatorial space of domain shifts. Further, when a model's performance on a target domain is found to be poor, traditional approaches adapt the model using the target domain's resources. Standard adaptation methods assume access to sufficient labeled resources, which may be impractical for deployed models. We initiate a study of lightweight adaptation techniques with only unlabeled data resources with a focus on language applications.
Chapter
TPCx-AI is the latest TPC benchmark addressing some of the numerous challenges in AI benchmarking. It has the ability to scale datasets, to emulate machine learning and deep learning end-to-end pipelines, and to provide solutions that are commercially available with pricing and support. With TPC benchmarks it can be difficult to get started, as the nature of system benchmarks requires larger server configurations and software solutions. TPCx-AI is a TPC Express Benchmark and thus provides an executable kit that simplifies the setup and execution of the benchmark. Moreover, the TPCx-AI kit includes two reference implementations: one for single-node environments (using scikit-learn as its primary ML processing library) and the other targeted at multi-node environments (centered around the use of Spark). This paper provides insights into the TPCx-AI kit installation and setup steps. It also shares preliminary scaling data for the key phases of the benchmark (e.g., pre-processing, training, serving, etc.), the distribution of the phases for each of the use cases, and finally an example of how one can look at resource utilization for a specific use case. We show that for the single-node implementation each of the use cases shows a unique runtime distribution for the time spent in the preprocessing, training, and serving phases. In addition, we also show how the processing runtimes scale with dataset sizes as they would in the real world, with some use cases scaling poorly. We anticipate that optimizations in the platform as well as the software stack can lead to overall reductions in runtime, and expect this work to inspire others to investigate the runtime characteristics of the ML/DL workloads included in TPCx-AI and their possible optimizations.
Keywords: AI Benchmarks, Artificial Intelligence, Machine Learning, Deep Learning, AI, TPC, Express Benchmarks, TPCx-AI, Runtime Performance, Scalability
Preprint
Large Language Models (LLMs), such as OpenAI’s GPT-4 or Google’s Bard, have created unprecedented opportunities for analyzing and generating language data on a massive scale. Because language is core to all areas of psychology, this new technology holds the potential to transform the field. In this Review, we first present emerging applications of LLMs for psychological measurement, experimentation, and practice across areas of psychology. We show how LLMs can make certain tasks profoundly more efficient (content analysis, questionnaire item generation, systematic reviews), while also unlocking entirely new research questions and methods across areas. Second, we review the foundations of LLMs. We explain how the way that LLMs were constructed (i.e. to predict the next word or utterance, not to reason like a human) is both the source of their strengths and their limitations. Third, we examine three major concerns with the application of LLMs to psychology, and how each might be overcome. Finally, we recommend several necessary investments that can help address these concerns. These include: (a) field-initiated “keystone” datasets; (b) increased standardization of performance benchmarks; and (c) investments in shared computing and analysis infrastructure, to ensure that the future of LLM-powered R&D is equitable.
Chapter
Large pre-trained language models are successfully being used in a variety of tasks, across many languages. With this ever-increasing usage, the risk of harmful side effects also rises, for example by reproducing and reinforcing stereotypes. However, detecting and mitigating these harms is difficult to do in general and becomes computationally expensive when tackling multiple languages or when considering different biases. To address this, we present FairDistillation: a cross-lingual method based on knowledge distillation to construct smaller language models while controlling for specific biases. We found that our distillation method does not negatively affect the downstream performance on most tasks and successfully mitigates stereotyping and representational harms. We demonstrate that FairDistillation can create fairer language models at a considerably lower cost than alternative approaches.
Keywords: Knowledge distillation, Fairness, BERT, Language models
Chapter
It is important to learn whether text information remains valid or not for various applications, including story comprehension, information retrieval, and user state tracking on microblogs and via chatbot conversations. It is also beneficial for deeply understanding the story. However, this kind of inference is still difficult for computers as it requires temporal commonsense. We propose a novel task, Temporal Natural Language Inference, inspired by traditional natural language inference, to determine the temporal validity of text content. The task requires inferring and judging whether an action expressed in a sentence is still ongoing or rather completed, hence, whether the sentence still remains valid, given its supplementary content. We first construct our own dataset for this task and train several machine learning models. Then we propose an effective method for learning information from an external knowledge base that gives hints on temporal commonsense knowledge. Using the prepared dataset, we introduce a new machine learning model that incorporates the information from the knowledge base and demonstrate that our model outperforms state-of-the-art approaches on the proposed task.
Article
Full-text available
NLP has achieved great progress in the past decade through the use of neural models and large labeled datasets. The dependence on abundant data prevents NLP models from being applied to low-resource settings or novel tasks where significant time, money, or expertise is required to label massive amounts of textual data. Recently, data augmentation methods have been explored as a means of improving data efficiency in NLP. To date, there has been no systematic empirical overview of data augmentation for NLP in the limited labeled data setting, making it difficult to understand which methods work in which settings. In this paper, we provide an empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting, summarizing the landscape of methods (including token-level augmentations, sentence-level augmentations, adversarial augmentations, and hidden-space augmentations) and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks. Based on the results, we draw several conclusions to help practitioners choose appropriate augmentations in different settings and discuss the current challenges and future directions for limited data learning in NLP.
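To make the token-level family concrete, here is a toy sketch of two classic operations (random swap and random deletion); production augmenters typically rely on synonym resources or pretrained models rather than this simple recipe.

```python
# Toy token-level augmentation: random swap and random deletion.
import random

def random_swap(tokens, n_swaps=1, rng=random):
    tokens = tokens[:]
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens

def random_delete(tokens, p=0.1, rng=random):
    kept = [t for t in tokens if rng.random() > p]
    return kept or [rng.choice(tokens)]  # never return an empty sentence

sentence = "the movie was surprisingly good".split()
print(random_swap(sentence))
print(random_delete(sentence))
```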
Chapter
We introduce a linguistically enhanced combination of pre-training methods for transformers. The pre-training objectives include POS-tagging, synset prediction based on semantic knowledge graphs, and parent prediction based on dependency parse trees. Our approach achieves competitive results on the Natural Language Inference task, compared to the state of the art. Specifically for smaller models, the method yields a significant performance boost, emphasizing that intelligent pre-training can make up for fewer parameters and help build more efficient models. Combining POS-tagging and synset prediction yields the overall best results.
Chapter
Pruning aims to reduce the number of parameters while maintaining performance close to the original network. This work proposes a novel self-distillation-based pruning strategy, whereby the representational similarity between the pruned and unpruned versions of the same network is maximized. Unlike previous approaches that treat distillation and pruning separately, we use distillation to inform the pruning criteria, without requiring a separate student network as in knowledge distillation. We show that the proposed cross-correlation objective for self-distilled pruning implicitly encourages sparse solutions, naturally complementing magnitude-based pruning criteria. Experiments on the GLUE and XGLUE benchmarks show that self-distilled pruning increases mono- and cross-lingual language model performance. Self-distilled pruned models also outperform smaller Transformers with an equal number of parameters and are competitive against distilled networks six times larger. We also observe that self-distillation (1) maximizes class separability, (2) increases the signal-to-noise ratio, and (3) converges faster after pruning steps, providing further insights into why self-distilled pruning improves generalization.
Keywords: Iterative pruning · Self-distillation · Language models
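A hedged PyTorch sketch of the general idea follows: prune a layer by weight magnitude, then add a representational-similarity term between the pruned and unpruned outputs to the training loss. The paper's actual cross-correlation objective and pruning schedule differ; all names here are illustrative.

```python
# Sketch: magnitude pruning plus a similarity term between pruned and
# unpruned representations (a simplification of self-distilled pruning).
import torch
import torch.nn as nn

def magnitude_mask(weight, sparsity=0.5):
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()

layer = nn.Linear(128, 128)
dense_weight = layer.weight.detach().clone()   # frozen unpruned reference
mask = magnitude_mask(layer.weight)

x = torch.randn(32, 128)
with torch.no_grad():
    ref = x @ dense_weight.t() + layer.bias    # unpruned representation
layer.weight.data.mul_(mask)                   # apply the pruning mask
out = layer(x)                                 # pruned representation

similarity_loss = 1 - nn.functional.cosine_similarity(out, ref, dim=-1).mean()
# total_loss = task_loss + lam * similarity_loss  # lam: illustrative weight
```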
Preprint
Full-text available
Transformers have achieved great success in a wide variety of natural language processing (NLP) tasks due to the attention mechanism, which assigns an importance score for every word relative to other words in a sequence. However, these models are very large, often reaching hundreds of billions of parameters, and therefore require a large number of DRAM accesses. Hence, traditional deep neural network (DNN) accelerators such as GPUs and TPUs face limitations in processing Transformers efficiently. In-memory accelerators based on non-volatile memory promise to be an effective solution to this challenge, since they provide high storage density while performing massively parallel matrix vector multiplications within memory arrays. However, attention score computations, which are frequently used in Transformers (unlike CNNs and RNNs), require matrix vector multiplications (MVM) where both operands change dynamically for each input. As a result, conventional NVM-based accelerators incur high write latency and write energy when used for Transformers, and further suffer from the low endurance of most NVM technologies. To address these challenges, we present X-Former, a hybrid in-memory hardware accelerator that consists of both NVM and CMOS processing elements to execute transformer workloads efficiently. To improve the hardware utilization of X-Former, we also propose a sequence blocking dataflow, which overlaps the computations of the two processing elements and reduces execution time. Across several benchmarks, we show that X-Former achieves up to 85x and 7.5x improvements in latency and energy over an NVIDIA GeForce GTX 1060 GPU and up to 10.7x and 4.6x improvements in latency and energy over a state-of-the-art in-memory NVM accelerator.
Article
Full-text available
Recent advances in graph-based learning approaches have demonstrated their effectiveness in modelling users’ preferences and items’ characteristics for Recommender Systems (RSs). Most of the data in RSs can be organized into graphs where various objects (e.g. users, items, and attributes) are explicitly or implicitly connected and influence each other via various relations. Such a graph-based organization brings benefits to exploiting potential properties in graph learning (e.g. random walk and network embedding) techniques to enrich the representations of the user and item nodes, which is an essential factor for successful recommendations. In this paper, we provide a comprehensive survey of Graph Learning-based Recommender Systems (GLRSs). Specifically, we start from a data-driven perspective to systematically categorize various graphs in GLRSs and analyse their characteristics. Then, we discuss the state-of-the-art frameworks with a focus on the graph learning module and how they address practical recommendation challenges such as scalability, fairness, diversity, explainability, and so on. Finally, we share some potential research directions in this rapidly growing area.
Article
Full-text available
This paper explores the application of deep learning in automating the scoring of open-ended candidate responses to pre-hire employment selection assessments. Using job applicant text data from pre-employment virtual assessment center exercises, three algorithmic approaches were compared: a traditional bag of words (BoW), long short-term memory (LSTM) models, and robustly optimized bidirectional encoder representations from transformers approach (RoBERTa). Measurement and assessment best practices were leveraged in the development of the candidate assessment items and human labels (subject matter experts’ (SME) ratings on job-relevant competencies), producing a rich set of data to train the algorithms on. The trained models were used to score the candidate textual responses on the given competencies, and the level of agreement with expert human raters was assessed. Using data from three companies hiring for three different occupations and across seven competencies, three algorithmic approaches to automatically score text were evaluated, showcasing correlations between SME and algorithmically scored competencies on holdout samples that were very strong (avg r = 0.84 for the best performing method, RoBERTa) and nearly identical to the inter-rater reliability achieved by multiple expert raters following consensus (avg r = 0.85). Criterion-related validity, subgroup differences, and decision accuracy are investigated for each algorithmic approach. Lastly, the impact of smaller sample sizes to train the algorithms is explored.
Article
We expose the statistical foundations of deep learning with the goal of facilitating conversation between the deep learning and statistics communities. We highlight core themes at the intersection; summarize key neural models, such as feedforward neural networks, sequential neural networks, and neural latent variable models; and link these ideas to their roots in probability and statistics. We also highlight research directions in deep learning where there are opportunities for statistical contributions.
Article
Deep neural networks suffer from over-parameterization, which leads to high storage and computation costs. Pruning can effectively reduce these costs by eliminating redundant parameters. In existing pruning methods, filter pruning achieves more efficient inference, while element-wise pruning maintains better accuracy. To trade off between the two endpoints, a variety of pruning patterns have been proposed. This study analyzes the performance characteristics of sparse DNNs pruned by different patterns, including element-wise, vector-wise, block-wise, and group-wise. Based on the analysis, we propose an efficient implementation of group-wise sparse DNN inference that makes better use of GPUs. Experimental results on VGG, ResNet, BERT and ViT show that our optimized group-wise pruning pattern achieves much lower inference latency on GPU than other sparse patterns and the existing group-wise pattern implementation.
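As a rough illustration of the group-wise pattern, the NumPy sketch below scores contiguous groups of weights within each row and zeroes the lowest-magnitude groups; the actual group shapes and GPU kernels in the paper are more involved.

```python
# Illustrative group-wise magnitude pruning (group shapes are assumptions).
import numpy as np

def groupwise_prune(weight, group_size=4, sparsity=0.5):
    rows, cols = weight.shape
    assert cols % group_size == 0
    groups = weight.reshape(rows, cols // group_size, group_size)
    scores = np.abs(groups).sum(axis=-1)             # one score per group
    k = int(scores.size * sparsity)
    threshold = np.partition(scores.flatten(), k)[k]
    mask = (scores >= threshold)[..., None]          # keep high-score groups
    return (groups * mask).reshape(rows, cols)

w = np.random.randn(8, 16)
pruned = groupwise_prune(w)
print((pruned == 0).mean())  # roughly the requested sparsity
```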
Article
Recurrent neural networks are effective models to process sequences. However, they are unable to learn long-term dependencies because of their inherent sequential nature. As a solution, Vaswani et al. introduced the Transformer, a model solely based on the attention mechanism that is able to relate any two positions of the input sequence, hence modelling arbitrary long dependencies. The Transformer has improved the state-of-the-art across numerous sequence modelling tasks. However, its effectiveness comes at the expense of a quadratic computational and memory complexity with respect to the sequence length, hindering its adoption. Fortunately, the deep learning community has always been interested in improving the models’ efficiency, leading to a plethora of solutions such as parameter sharing, pruning, mixed-precision, and knowledge distillation. Recently, researchers have directly addressed the Transformer’s limitation by designing lower-complexity alternatives such as the Longformer, Reformer, Linformer, and Performer. However, due to the wide range of solutions, it has become challenging for researchers and practitioners to determine which methods to apply in practice in order to meet the desired trade-off between capacity, computation, and memory. This survey addresses this issue by investigating popular approaches to make Transformers faster and lighter and by providing a comprehensive explanation of the methods’ strengths, limitations, and underlying assumptions.
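The quadratic bottleneck is easy to see in a bare-bones NumPy rendering of scaled dot-product attention: the score matrix is n × n, so memory and compute grow with the square of the sequence length.

```python
# Scaled dot-product attention; the (n, n) score matrix is the bottleneck.
import numpy as np

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # shape (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V

n, d = 512, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = attention(Q, K, V)   # doubling n quadruples the score matrix
print(out.shape)           # (512, 64)
```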
Conference Paper
Full-text available
Semantic Textual Similarity (STS) measures the meaning similarity of sentences. Applications include machine translation (MT), summarization, generation, question answering (QA), short answer grading, semantic search, dialog and conversational systems. The STS shared task is a venue for assessing the current state-of-the-art. The 2017 task focuses on multilingual and cross-lingual pairs with one sub-track exploring MT quality estimation (MTQE) data. The task obtained strong participation from 31 teams, with 17 participating in all language tracks. We summarize performance and review a selection of well performing methods. Analysis highlights common errors, providing insight into the limitations of existing models. To support ongoing work on semantic representations, the STS Benchmark is introduced as a new shared training and evaluation set carefully selected from the corpus of English STS shared task data (2012-2017).
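For reference, STS systems are conventionally scored by the Pearson correlation between system similarity scores and gold human ratings on a 0-5 scale; a minimal sketch with made-up numbers:

```python
# Pearson correlation against gold ratings (all values below are invented).
from scipy.stats import pearsonr

gold      = [4.8, 0.5, 3.2, 2.0, 5.0]   # human similarity judgements
predicted = [4.5, 1.0, 3.0, 2.4, 4.7]   # a system's similarity scores

r, _ = pearsonr(gold, predicted)
print(f"Pearson r = {r:.3f}")
```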
Article
Full-text available
We describe an approach for unsupervised learning of a generic, distributed sentence encoder. Using the continuity of text from books, we train an encoder-decoder model that tries to reconstruct the surrounding sentences of an encoded passage. Sentences that share semantic and syntactic properties are thus mapped to similar vector representations. We next introduce a simple vocabulary expansion method to encode words that were not seen as part of training, allowing us to expand our vocabulary to a million words. After training our model, we extract and evaluate our vectors with linear models on 8 tasks: semantic relatedness, paraphrase detection, image-sentence ranking, question-type classification and 4 benchmark sentiment and subjectivity datasets. The end result is an off-the-shelf encoder that can produce highly generic sentence representations that are robust and perform well in practice. We will make our encoder publicly available.
Conference Paper
Full-text available
This paper describes the PASCAL Network of Excellence Recognising Textual Entailment (RTE) Challenge benchmark. The RTE task is defined as recognizing, given two text fragments, whether the meaning of one text can be inferred (entailed) from the other. This application-independent task is suggested as capturing major inferences about the variability of semantic expression which are commonly needed across multiple applications. The Challenge has raised noticeable attention in the research community, attracting 17 submissions from diverse groups, suggesting the generic relevance of the task.
Article
Sentence vectors represent an appealing approach to meaning: learn an embedding that encompasses the meaning of a sentence in a single vector, that can be used for a variety of semantic tasks. Existing models for learning sentence embeddings either require extensive computational resources to train on large corpora, or are trained on costly, manually curated datasets of sentence relations. We observe that humans naturally annotate the relations between their sentences with discourse markers like "but" and "because". These words are deeply linked to the meanings of the sentences they connect. Using this natural signal, we automatically collect a classification dataset from unannotated text. Training a model to predict these discourse markers yields high quality sentence embeddings. Our model captures complementary information to existing models and achieves comparable generalization performance to state of the art models.
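A toy sketch of the collection step: scan consecutive sentence pairs and, when the second sentence opens with a discourse marker, emit a (previous, current, marker) training triple. The marker list and sentence splitting here are simplified assumptions.

```python
# Harvesting discourse-marker triples from consecutive sentences.
MARKERS = ("but", "because", "although", "so", "however")

def collect_pairs(sentences):
    triples = []
    for prev, curr in zip(sentences, sentences[1:]):
        first = curr.split()[0].lower().rstrip(",")
        if first in MARKERS:
            triples.append((prev, curr, first))
    return triples

text = [
    "The reviews were harsh.",
    "But the film found its audience.",
    "Because word of mouth kept spreading.",
]
for prev, curr, marker in collect_pairs(text):
    print(marker, "|", prev, "->", curr)
```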
Article
Many modern NLP systems rely on word embeddings, previously trained in an unsupervised manner on large corpora, as base features. Efforts to obtain embeddings for larger chunks of text, such as sentences, have however not been so successful. Several attempts at learning unsupervised representations of sentences have not reached satisfactory enough performance to be widely adopted. In this paper, we show how universal sentence representations trained using the supervised data of the Stanford Natural Language Inference dataset can consistently outperform unsupervised methods like SkipThought vectors on a wide range of transfer tasks. Much like how computer vision uses ImageNet to obtain features, which can then be transferred to other tasks, our work tends to indicate the suitability of natural language inference for transfer learning to other NLP tasks.
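The standard feature combination used when training such encoders on NLI concatenates the two sentence vectors with their element-wise difference and product; a minimal PyTorch sketch (encoder omitted, dimensions illustrative):

```python
# Classify entailment from [u; v; |u - v|; u * v], as in NLI-trained encoders.
import torch
import torch.nn as nn

dim, n_classes = 300, 3   # entailment / neutral / contradiction
classifier = nn.Sequential(nn.Linear(4 * dim, 512), nn.ReLU(),
                           nn.Linear(512, n_classes))

u = torch.randn(16, dim)  # premise embeddings (from some sentence encoder)
v = torch.randn(16, dim)  # hypothesis embeddings
features = torch.cat([u, v, (u - v).abs(), u * v], dim=-1)
logits = classifier(features)
print(logits.shape)       # (16, 3)
```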
Article
This paper describes the Second PASCAL Recognising Textual Entailment Challenge (RTE-2). We describe the RTE-2 dataset and overview the submissions for the challenge. One of the main goals for this year's dataset was to provide more "realistic" text-hypothesis examples, based mostly on outputs of actual systems. The 23 submissions for the challenge present diverse approaches and research directions, and the best results achieved this year are considerably higher than last year's state of the art.
Article
Predictions of the secondary structure of T4 phage lysozyme, made by a number of investigators on the basis of the amino acid sequence, are compared with the structure of the protein determined experimentally by X-ray crystallography. Within the amino terminal half of the molecule the locations of helices predicted by a number of methods agree moderately well with the observed structure, however within the carboxyl half of the molecule the overall agreement is poor. For eleven different helix predictions, the coefficients giving the correlation between prediction and observation range from 0.14 to 0.42. The accuracy of the predictions for both beta-sheet regions and for turns are generally lower than for the helices, and in a number of instances the agreement between prediction and observation is no better than would be expected for a random selection of residues. The structural predictions for T4 phage lysozyme are much less successful than was the case for adenylate kinase (Schulz et al. (1974) Nature 250, 140-142). No one method of prediction is clearly superior to all others, and although empirical predictions based on larger numbers of known protein structure tend to be more accurate than those based on a limited sample, the improvement in accuracy is not dramatic, suggesting that the accuracy of current empirical predictive methods will not be substantially increased simply by the inclusion of more data from additional protein structure determinations.
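The correlation measure used in this comparison is what is now known as the Matthews correlation coefficient (MCC), which GLUE adopts to score CoLA; a direct computation from binary predictions:

```python
# Matthews correlation coefficient from a binary confusion matrix.
import math

def mcc(y_true, y_pred):
    tp = sum(t == p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(mcc([1, 1, 0, 0, 1], [1, 0, 0, 0, 1]))  # ≈ 0.667
```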
Conference Paper
In this paper, we present an alternative to the Turing Test that has some conceptual and practical advantages. Like the original, it involves responding to typed English sentences, and English-speaking adults will have no difficulty with it. Unlike the original, the subject is not required to engage in a conversation and fool an interrogator into believing she is dealing with a person. Moreover, the test is arranged in such a way that having full access to a large corpus of English text might not help much. Finally, the interrogator or a third party will be able to decide unambiguously after a few minutes whether or not a subject has passed the test.
The fifth PASCAL recognizing textual entailment challenge
Luisa Bentivogli, Ido Dagan, Hoa Trang Dang, Danilo Giampiccolo, and Bernardo Magnini. 2009. The fifth PASCAL recognizing textual entailment challenge.
SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation
Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. In 11th International Workshop on Semantic Evaluation (SemEval-2017).
Automatically constructing a corpus of sentential paraphrases
William B Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of IWP.
The third PASCAL recognizing textual entailment challenge
Danilo Giampiccolo, Bernardo Magnini, Ido Dagan, and Bill Dolan. 2007. The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL workshop on textual entailment and paraphrasing, pages 1-9. Association for Computational Linguistics.
Deep contextualized word representations
Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of NAACL 2018.
Recursive deep models for semantic compositionality over a sentiment treebank
Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pages 1631-1642.
Learning general purpose distributed sentence representations via large scale multi-task learning
Sandeep Subramanian, Adam Trischler, Yoshua Bengio, and Christopher J. Pal. 2018. Learning general purpose distributed sentence representations via large scale multi-task learning. In Proceedings of ICLR.