Table 2 - uploaded by Christopher D. Manning
Source publication
I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3% token accuracy
(56% sentence accuracy) to close to 100% accuracy. I suggest that it must still be possible to greatly increase tagging performance
and examine some useful improvements that have recently been made to the Stanford Part-of-Speech Tagger...
Context in source publication
Context 1
... numbers are on the now fairly standard splits of the Wall Street Journal portion of the Penn Treebank for POS tagging, following [6]. The details of the corpus appear in Table 2 and comparative results appear in Table 3. 3gramMemm shows the performance of a straightforward, fast, discriminative sequence model tagger. It uses the templates ⟨t0, w−1⟩, ⟨t0, w0⟩, ⟨t0, w+1⟩, ⟨t0, t−1⟩, ⟨t0, t−2, t−1⟩ and the unknown word features from [1]. ...
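For concreteness, the feature templates quoted above can be sketched as a small extraction function. The function name, boundary symbols, and data layout below are illustrative assumptions, not the Stanford tagger's actual implementation; the tag t0 being scored is kept symbolic, since in a MEMM it is the label the classifier predicts for the current position.

```python
# Illustrative sketch of the 3gramMemm-style feature templates quoted above:
# <t0, w-1>, <t0, w0>, <t0, w+1>, <t0, t-1>, <t0, t-2, t-1>.
# Names and data layout are assumptions for illustration, not the Stanford tagger's API.

def local_features(words, tags, i):
    """Return the active context features for position i.

    words : list of tokens in the sentence
    tags  : tags predicted so far (tags[0..i-1] are available at decode time)
    i     : current position; t0 is the tag being scored, so it is left
            symbolic and paired with each observed context value.
    """
    BOS, EOS = "<S>", "</S>"  # boundary symbols for positions outside the sentence
    w_prev = words[i - 1] if i > 0 else BOS
    w_curr = words[i]
    w_next = words[i + 1] if i + 1 < len(words) else EOS
    t_prev = tags[i - 1] if i > 0 else BOS
    t_prev2 = tags[i - 2] if i > 1 else BOS

    return [
        f"t0|w-1={w_prev}",                 # <t0, w-1>
        f"t0|w0={w_curr}",                  # <t0, w0>
        f"t0|w+1={w_next}",                 # <t0, w+1>
        f"t0|t-1={t_prev}",                 # <t0, t-1>
        f"t0|t-2,t-1={t_prev2},{t_prev}",   # <t0, t-2, t-1>
    ]


if __name__ == "__main__":
    words = ["The", "deal", "collapsed", "."]
    tags = ["DT", "NN", "VBD"]  # left-context tags already assigned
    print(local_features(words, tags, 3))
```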
Similar publications
The authors present the concept and implementation of the keyword occurrence analysis that is based on the separate counting of (key)words in the four main parts of scientific and other types of articles: the title—followed by a list of authors, abstract, list of keywords, and the main text. Also, the analysis is meant to be applied to more than on...
In this paper we focus on the task of detecting emotion in texts. Among the most employed approaches, categorical ones are mainly used for their simplicity and intuitiveness, while dimensional ones, although less common, may provide more objective and accurate results. In current works, both methods often result in tagging texts with emotion labels (...
Citations
... Specifically, we study the token level representations by measuring performance on part-of-speech tagging. The part-of-speech tagging task has a history in NLP (Manning, 2011), and we use the CoNLL-2003 dataset (Sang & Meulder, 2003). We train a linear classifier on representations from ModernBERT, MARIA 1B, and MARIA 7B on 10000 sentence examples with POS labels that can belong to 48 different classes. ...
Historically, LLMs have been trained using either autoregressive (AR) or masked language modeling (MLM) objectives, with AR models gaining dominance in recent years. However, AR models are inherently incapable of masked infilling, which is the ability to predict masked tokens between past and future context. In contrast, MLM models suffer from intrinsic computational inefficiencies during both training and inference that hinder their scalability. This work introduces MARIA (Masked and Autoregressive Infilling Architecture), a novel approach that leverages the strengths of both paradigms to achieve state-of-the-art masked infilling performance. MARIA combines a pre-trained MLM and AR model by training a linear decoder that takes their concatenated hidden states as input. This minimal modification enables the AR model to perform infilling while retaining its inherent advantages in terms of faster inference with KV caching. Our results demonstrate that MARIA significantly outperforms existing methods, namely discrete diffusion models, on masked infilling tasks.
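The token-level probing setup described in the citation context above (a linear classifier trained on frozen representations, with 48 POS classes) can be sketched roughly as follows. The array shapes, placeholder data, and choice of scikit-learn classifier are assumptions for illustration, not the exact configuration used with ModernBERT or MARIA.

```python
# Rough sketch of a linear POS probe over frozen token representations.
# In practice X_* would be hidden states extracted from the language model
# for each token; here they are random placeholders with the same shape.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
hidden_dim, n_classes = 768, 48  # 48 POS classes, as in the setup above

X_train = rng.normal(size=(2_000, hidden_dim))    # placeholder token vectors
y_train = rng.integers(0, n_classes, size=2_000)  # placeholder POS labels
X_test = rng.normal(size=(500, hidden_dim))
y_test = rng.integers(0, n_classes, size=500)

# The probe itself is just multinomial logistic regression (a single linear layer).
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("token-level accuracy:", accuracy_score(y_test, probe.predict(X_test)))
```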
... challenging, even for experienced annotators (Marcus, Santorini, and Marcinkiewicz 1993). Inter-annotator agreement on labeling tasks depends on factors like task description and guidance, annotator skills and knowledge, and level of attention (Manning 2011). ...
Correct labels are indispensable for training effective machine learning models. However, creating high-quality labels is expensive, and even professionally labeled data contains errors and ambiguities. Filtering and denoising can be applied to curate labeled data prior to training, at the cost of additional processing and loss of information. An alternative is on-the-fly sample reweighting during the training process to decrease the negative impact of incorrect or ambiguous labels, but this typically requires clean seed data. In this work we propose unsupervised on-the-fly meta loss rescaling to reweight training samples. Crucially, we rely only on features provided by the model being trained, to learn a rescaling function in real time without knowledge of the true clean data distribution. We achieve this via a novel meta learning setup that samples validation data for the meta update directly from the noisy training corpus by employing the rescaling function being trained. Our proposed method consistently improves performance across various NLP tasks with minimal computational overhead. Further, we are among the first to attempt on-the-fly training data reweighting on the challenging task of dialogue modeling, where noisy and ambiguous labels are common. Our strategy is robust in the face of noisy and clean data, handles class imbalance, and prevents overfitting to noisy labels. Our self-taught loss rescaling improves as the model trains, showing the ability to keep learning from the model's own signals. As training progresses, the impact of correctly labeled data is scaled up, while the impact of wrongly labeled data is suppressed.
... Indeed, some of the most frequent tokens, in particular punctuation markers and determiners, are both highly frequent and extremely easy to 'get right' (e.g., the and I). By contrast, per-sentence accuracy rates of POS taggers tend to be considerably more modest (hovering around 50-57%), with considerably lower rates for non-standard varieties and registers for which there is little training data (Manning 2011). ...
The Multi-Feature Tagger of English (MFTE) provides a transparent and easily adaptable open-source tool for multivariable analyses of English corpora. Designed to contribute to the greater reproducibility, transparency, and accessibility of multivariable corpus studies, it comes with a simple GUI and is available both as a richly annotated Python script and as an executable file. In this article, we detail its features and how they are operationalised. The default tagset comprises 74 lexico-grammatical features, ranging from attributive adjectives and progressives to tag questions and emoticons. An optional extended tagset covers more than 70 additional features, including many semantic features, such as human nouns and verbs of causation. We evaluate the accuracy of the MFTE on a sample of 60 texts from the BNC2014 and COCA, and report precision and recall metrics for all the features of the simple tagset. We outline how the use of a well-documented, open-source tool can contribute to improving the reproducibility and replicability of multivariable studies of English.
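The gap between per-token and per-sentence accuracy noted in the citation context above, and in the 97.3% versus 56% figures of the source abstract, is largely a compounding effect. A back-of-envelope check, under the simplifying and not strictly correct assumption that tagging errors are independent across tokens:

```python
# Back-of-envelope: if each token is tagged correctly with probability p and
# errors were independent, a sentence of n tokens is fully correct with
# probability p**n. Errors are not truly independent, so this is indicative only.

token_accuracy = 0.973
for sentence_length in (10, 15, 20, 25):
    print(f"{sentence_length:2d} tokens -> "
          f"{token_accuracy ** sentence_length:.1%} sentences fully correct")

# For newswire sentences averaging roughly 20-25 tokens, this lands in the
# low-to-high 50s percent, of the same order as the ~56% sentence accuracy
# quoted for a ~97.3% per-token tagger.
```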
... We also used the sample of Penn Treebank Wall Street Journal available in NLTK (PennWSJ-NLTK) that contains 3,914 sentences (100,676 tokens) and was mostly used for training the POS taggers. PennWSJ-NLTK was chosen as the default training corpus because different PennWSJ corpus samples, or the corpus as a whole, have often been used for this purpose in the literature (Manning, 2011; Toutanova, Klein, Manning, and Singer, 2003). ...
Web-based platforms can offer suitable experimental environments enabling the construction and reuse of natural language processing (NLP) pipelines. However, systematic evaluation of NLP tools in an open science web-based setting is still a challenge, as suitable experimental environments for the construction and reuse of NLP pipelines are still rare. This paper presents TextFlows, an open-source web-based platform, which enables user-friendly construction, sharing, execution, and reuse of NLP pipelines. It demonstrates that TextFlows can be easily used for systematic evaluation of new NLP components by integrating seven publicly available open-source part of speech (POS) taggers from popular NLP libraries, and evaluating them on six annotated corpora. The integration of new tools into TextFlows supports tool reuse, while the use of precomposed algorithm comparison and evaluation workflows supports experiment reproducibility and testing of future algorithms in the same experimental environment. Finally, to showcase the variety of evaluation possibilities offered in the TextFlows platform, the influence of various factors, such as the training corpus length and the use of pre-trained models, has been tested.
... Our findings highlighted the performance disparities across states and suggested concerns about inconsistencies in the NVDRS data annotations. Several studies have explored various approaches to addressing data annotation errors in NLP [6][7][8][9][10][11][12][13], for example, utilizing conventional probabilistic approaches [14], training machine learning models (e.g., Support Vector Machines) [15][16][17][18][19][20][21][22], and developing generative models via active learning [23]. However, the conventional probabilistic approaches cannot handle infrequent events or compare events with similar probabilities. ...
Background
Data accuracy is essential for scientific research and policy development. The National Violent Death Reporting System (NVDRS) data is widely used for discovering the patterns and causal factors of death. Recent studies suggested annotation inconsistencies within the NVDRS and their potential to lead to erroneous suicide-circumstance attributions.
Methods
We present an empirical Natural Language Processing (NLP) approach to detect annotation inconsistencies and adopt a cross-validation-like paradigm to identify possible label errors. We analyzed 267,804 suicide death incidents between 2003 and 2020 from the NVDRS. We measured annotation inconsistency by the degree of changes in the F-1 score.
Results
Our results show that incorporating the target state’s data into training the suicide-circumstance classifier brings an increase of 5.4% to the F-1 score on the target state’s test set and a decrease of 1.1% on other states’ test set.
Conclusions
To conclude, we present an NLP framework to detect annotation inconsistencies, show the effectiveness of identifying and rectifying possible label errors, and propose a solution to improve the coding consistency of human annotators.
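A schematic of the cross-validation-like comparison summarized in the Methods and Results above might look as follows. The classifier, the circumstance labels, and the data layout are placeholder assumptions for illustration, not the paper's actual pipeline.

```python
# Schematic of the cross-validation-like check described above: compare a
# circumstance classifier's F1 on a target state's test set when that state's
# data is excluded from vs. included in training. A large gap hints at
# annotation inconsistencies. Classifier and data handling are placeholders.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline


def f1_on_target(train_texts, train_labels, test_texts, test_labels):
    """Train a simple text classifier and report macro-F1 on the target test set."""
    clf = make_pipeline(TfidfVectorizer(min_df=2), LogisticRegression(max_iter=1000))
    clf.fit(train_texts, train_labels)
    return f1_score(test_labels, clf.predict(test_texts), average="macro")


def inconsistency_signal(other_states, target_state):
    """Each argument is a dict with parallel 'texts' and 'labels' lists (placeholders)."""
    # Hold out half of the target state as its test set.
    split = len(target_state["texts"]) // 2
    tgt_train_texts = target_state["texts"][:split]
    tgt_train_labels = target_state["labels"][:split]
    tgt_test_texts = target_state["texts"][split:]
    tgt_test_labels = target_state["labels"][split:]

    f1_without = f1_on_target(
        other_states["texts"], other_states["labels"], tgt_test_texts, tgt_test_labels
    )
    f1_with = f1_on_target(
        other_states["texts"] + tgt_train_texts,
        other_states["labels"] + tgt_train_labels,
        tgt_test_texts,
        tgt_test_labels,
    )
    # The larger this gap, the more the target state's annotations deviate
    # from the labeling conventions of the remaining states.
    return f1_with - f1_without
```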
... To compute linguistic representations, we rely on Stanza (Qi et al. 2020) to perform segmentation, tokenization, part-of-speech (PoS) tagging, and dependency and constituent parsing. For these tasks, and in particular for the case of English and news text, the performance is high enough to be used for applications (Manning 2011; Berzak et al. 2016), and it can even be superior to that obtained by human annotations. This also served as an additional reason to focus our analysis on news text, ensuring that the tools we rely on are accurate enough to obtain meaningful results. ...
We conduct a quantitative analysis contrasting human-written English news text with comparable large language model (LLM) output from six different LLMs that cover three different families and four sizes in total. Our analysis spans several measurable linguistic dimensions, including morphological, syntactic, psychometric, and sociolinguistic aspects. The results reveal various measurable differences between human and AI-generated texts. Human texts exhibit more scattered sentence length distributions, more variety of vocabulary, a distinct use of dependency and constituent types, shorter constituents, and more optimized dependency distances. Humans tend to exhibit stronger negative emotions (such as fear and disgust) and less joy compared to text generated by LLMs, with the toxicity of these models increasing as their size grows. LLM outputs use more numbers, symbols and auxiliaries (suggesting objective language) than human texts, as well as more pronouns. The sexist bias prevalent in human text is also expressed by LLMs, and even magnified in all of them but one. Differences between LLMs and humans are larger than between LLMs.
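The Stanza preprocessing mentioned in the citation context above can be invoked roughly as follows; the example sentence and the selection of processors are illustrative choices, and the constituency parser is omitted for brevity.

```python
# Minimal sketch of Stanza-based preprocessing: sentence segmentation,
# tokenization, POS tagging, and dependency parsing.

import stanza

stanza.download("en")  # fetch the English models on first run
nlp = stanza.Pipeline("en", processors="tokenize,pos,lemma,depparse")

doc = nlp("The Stanford tagger reached about 97.3% token accuracy on the WSJ.")

for sentence in doc.sentences:
    for word in sentence.words:
        # word.head is 1-based; 0 means the word attaches to the root.
        head = sentence.words[word.head - 1].text if word.head > 0 else "ROOT"
        print(f"{word.text}\t{word.upos}\t{word.deprel} -> {head}")
```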
... Larger text data sets require associated computational methods for analysis, and the field of computational linguistics has advanced considerably, making the accuracy of some algorithms, such as part-of-speech (PoS) tagging, where words in a sentence are tagged according to their part of speech (e.g. noun, verb, adjective), on par with human interpretations for individual words, or so-called tokens (Manning 2011). However, the complexity of language makes it hard for computers to understand underlying meanings beyond part-of-speech tagging, particularly in rich natural language contributions. ...
What cultural ecosystem services (CES) do people perceive in their immediate surroundings, and what sensory experiences are linked to these ecosystem services? And how are these CES and experiences expressed in natural language? In this study, we used data generated through a gamified application called Window Expeditions, where people uploaded short descriptions of landscapes they were able to experience through their windows during the COVID-19 pandemic. We used a combination of annotation, close reading and distant reading using natural language processing and graph analysis to extract CES and sensory experiences and link these to biophysical landscape elements. In total, 272 users contributed 373 descriptions in English across more than 40 countries. Of the cultural ecosystem services, recreation was the most prominently described, followed by heritage, identity and tranquility. Descriptions of sensory experiences focused on the visual but also included auditory experiences and touch and feel. Sensory experiences and cultural ecosystem services varied according to biophysical landscape elements, with, for example, animals being more associated with sound and touch/feel and heritage being more associated with moving objects and the built environment. Sentiments also varied across the senses, with the visual being more strongly associated with positive experiences than other senses. This study showed how a hybrid approach combining manual analysis and natural language processing can be productively applied to landscape descriptions generated by members of the public, and how CES on everyday lived landscapes can be extracted from such data sources.
... Furthermore, previous work has shown that morphological taggers substantially degrade when evaluated out-of-domain, that is, on any type of text different from the data used for training in terms of topic, text genre, temporality, and so forth (Manning 2011). This point led us to research whether lemmatizers based on fine-grained morphological information will degrade more when used out-of-domain than those requiring only coarse-grained UPOS tags. ...
... Unlike the vast majority of previous work on contextual lemmatization, which has been mostly evaluated in-domain (McCarthy et al. 2019), we also report results in out-of-domain settings. It should be noted that by out-of-domain we mean to evaluate the model on a different data distribution from the data used for training (Manning 2011). ...
... However, if we look at the evaluation method a bit more closely, things are not as clear as they seem. As has been argued for POS tagging (Manning 2011), word accuracy as an evaluation measure is easy because you get many free points for punctuation marks and for the many tokens that are not ambiguous with respect to their lemma, namely, those cases in which the lemma and the word form are the same. Following this, a more realistic metric might consist of looking at the rate of getting the whole sentence correctly lemmatized, just as was proposed for POS tagging (Manning 2011). ...
Lemmatization is a natural language processing (NLP) task that consists of producing, from a given inflected word, its canonical form or lemma. Lemmatization is one of the basic tasks that facilitate downstream NLP applications, and is of particular importance for high-inflected languages. Given that the process to obtain a lemma from an inflected word can be explained by looking at its morphosyntactic category, including fine-grained morphosyntactic information to train contextual lemmatizers has become common practice, without considering whether that is the optimum in terms of downstream performance. In order to address this issue, in this article we empirically investigate the role of morphological information to develop contextual lemmatizers in six languages within a varied spectrum of morphological complexity: Basque, Turkish, Russian, Czech, Spanish, and English. Furthermore, and unlike the vast majority of previous work, we also evaluate lemmatizers in out-of-domain settings, which constitutes, after all, their most common application use. The results of our study are rather surprising. It turns out that providing lemmatizers with fine-grained morphological features during training is not that beneficial, not even for agglutinative languages. In fact, modern contextual word representations seem to implicitly encode enough morphological information to obtain competitive contextual lemmatizers without seeing any explicit morphological signal. Moreover, our experiments suggest that the best lemmatizers out-of-domain are those using simple UPOS tags or those trained without morphology and, lastly, that current evaluation practices for lemmatization are not adequate to clearly discriminate between models.
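The two evaluation views contrasted in the citation context above, per-token word accuracy versus whole-sentence accuracy, can be written down generically. The sketch below operates on gold and predicted lemma sequences and is not taken from the cited work.

```python
# Sketch of the two evaluation views for lemmatization: per-token accuracy
# (inflated by punctuation and by words whose lemma equals the word form)
# versus whole-sentence accuracy (a sentence counts only if every token in it
# is lemmatized correctly). Generic code, not from the cited paper.

def token_accuracy(gold_sents, pred_sents):
    correct = total = 0
    for gold, pred in zip(gold_sents, pred_sents):
        correct += sum(g == p for g, p in zip(gold, pred))
        total += len(gold)
    return correct / total


def sentence_accuracy(gold_sents, pred_sents):
    fully_correct = sum(gold == pred for gold, pred in zip(gold_sents, pred_sents))
    return fully_correct / len(gold_sents)


if __name__ == "__main__":
    gold = [["the", "deal", "collapse", "."], ["she", "sing", "well", "."]]
    pred = [["the", "deal", "collapse", "."], ["she", "sings", "well", "."]]
    print(token_accuracy(gold, pred))     # 7/8 = 0.875
    print(sentence_accuracy(gold, pred))  # 1/2 = 0.5
```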
... As for cross-lingual summarization, it requires the model to generate a summary of an article in a different language (Bhattacharjee et al., 2023). Although POS tagging (Manning, 2011; Nivre et al., 2017; Chiche and Yitagesu, 2022) primarily assesses the model's ability to understand monolingual text, we include it in our multilingual experiments to show the universality of our methods. ...
... Annotation Errors and AED Several recent works found errors in widely used benchmarks, such as CoNLL 2003 for Named Entity Recognition (Wang et al., 2019; Reiss et al., 2020; Rücker and Akbik, 2023), TACRED for relation extraction (Alt et al., 2020), WSJ for syntax (Manning, 2011; Dickinson and Meurers, 2003), and ImageNet for object classification (Beyer et al., 2020; Northcutt et al., 2021; Vasudevan et al., 2022). AED has a long-standing tradition in NLP. ...