Conference Paper

Syntactic annotations for the Google Books Ngram Corpus

Authors: Yuri Lin, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockman, and Slav Petrov

Abstract

We present a new edition of the Google Books Ngram Corpus, which describes how often words and phrases were used over a period of five centuries, in eight languages; it reflects 6% of all books ever published. This new edition introduces syntactic annotations: words are tagged with their part-of-speech, and head-modifier relationships are recorded. The annotations are produced automatically with statistical models that are specifically adapted to historical text. The corpus will facilitate the study of linguistic trends, especially those related to the evolution of syntax.


... Figure 1A shows the general pattern of hand usage in the QWERTY layout and an example of how the RSR is calculated. Figure 1B shows the RSR for English books of the Google Books corpus [28] published each available year since 1900, illustrating a general upwards trend of this ratio that especially speeds up in the early 1990s. The vertical red bar shows the RSR computed for a corpus of English text on the web [24], displaying a higher tendency of right-side letters on the web as compared to books. ...
... We test if the QWERTY effect is present when (i) decoding text, i.e. when evaluating items with names or titles. [Figure caption residue: RSR of English books in the Google Books corpus [28] published since 1900; the red bar shows the empirical value of RSR in an English-speaking web text corpus [24].] ...
... Third, we compute a set of controls to include in our analysis, in order to test possible factors that can increase or diminish the QWERTY effect. We distinguish two types of controls: i) linguistic controls that can be computed for all datasets, including the number of letters and words in the name and the average letter and word frequency as computed in the Google Books dataset [28], and ii) contextual controls that include community-dependent observable variables such as the number of views of a video, the year of a movie, or the price of a product. ...
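As a rough illustration of the quantities discussed in these excerpts, the sketch below computes a right-side ratio and simple linguistic controls for a string; it assumes RSR is the fraction of a string's letters typed with the right hand on QWERTY, which is a simplification of the measure used in the cited work.

```python
# Minimal sketch of a right-side ratio (RSR) under the QWERTY layout.
# Assumption (not from the cited paper): RSR = right-hand letters / all hand-typed letters.
LEFT = set("qwertasdfgzxcvb")
RIGHT = set("yuiophjklnm")

def rsr(text: str) -> float:
    """Fraction of alphabetic characters typed with the right hand on QWERTY."""
    letters = [c for c in text.lower() if c in LEFT or c in RIGHT]
    if not letters:
        return 0.0
    return sum(c in RIGHT for c in letters) / len(letters)

def simple_controls(name: str) -> dict:
    """Linguistic controls in the spirit of the excerpt: letter and word counts plus RSR."""
    return {"n_letters": sum(c.isalpha() for c in name),
            "n_words": len(name.split()),
            "rsr": rsr(name)}

print(simple_controls("Bonnie and Clyde"))
```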
Preprint
The QWERTY effect postulates that the keyboard layout influences word meanings by linking positivity to the use of the right hand and negativity to the use of the left hand. For example, previous research has established that words with more right-hand letters are rated more positively than words with more left-hand letters by human subjects in small-scale experiments. In this paper, we perform large-scale investigations of the QWERTY effect on the web. Using data from eleven web platforms related to products, movies, books, and videos, we conduct observational tests of whether a hand-meaning relationship can be found in decoding text on the web. Furthermore, we investigate whether encoding text on the web exhibits the QWERTY effect as well, by analyzing the relationship between the text of online reviews and their star ratings in four additional datasets. Overall, we find robust evidence for the QWERTY effect both at the point of text interpretation (decoding) and at the point of text creation (encoding). We also identify conditions under which the effect might not hold. Our findings have implications for any algorithmic method aiming to evaluate the meaning of words on the web, including for example semantic or sentiment analysis, and show the existence of "dactilar onomatopoeias" that shape the dynamics of word-meaning associations. To the best of our knowledge, this is the first work to reveal the extent to which the QWERTY effect exists in large-scale human-computer interaction on the web.
... How we speak and write about others has undoubtedly changed over the last century. Large-scale trends in word usage are evident in the Google Books Corpus (comprising ~6% of all books published in English; Lin et al., 2012;Michel et al., 2011). Words such as abuse, defend, kill, care, suffering and peace have increased in frequency over the last few decades (Wheeler et al., 2019), suggesting a greater focus on suffering and well-being. ...
... The Google Books Corpus is estimated to contain about eight million books totalling half a trillion English words (~6% of all books ever published), whilst the Corpus of Historical American English is a structured database containing about half a billion English words balanced by genre decade by decade. These corpora have been shown to contain traces of historical cultural trends in language (Lin et al., 2012; Michel et al., 2011; Pechenick et al., 2015). For example, the Google Books Corpus has shown that the frequency of ... (Table 2: Words denoting concern and indifference) ...
... It is worth considering limits on the generalizability of our findings. We focused primarily on the Google Book Corpus which is estimated to contain about half a trillion words from 6% of all books ever published (Lin et al., 2012). We also found similar effects in a subset of the Google Book Corpus focused on fictional texts and in the Corpus of Historical American English. ...
Article
Full-text available
The Enlightenment idea of historical moral progress asserts that civil societies become more moral over time. This is often understood as an expanding moral circle and is argued to be tightly linked with language use, with some suggesting that shifts in how we express concern for others can be considered an important indicator of moral progress. Our research explores these notions by examining historical trends in natural language use during the 19th and 20th centuries. We found that the associations between words denoting moral concern and words referring to people, animals, and the environment grew stronger over time. The findings support widely‐held views about the nature of moral progress by showing that language has changed in a way that reflects greater concern for others.
... We validate our TABDet with different Trigger Candidate Set ∆. Employing 2gram and 5gram sets from Google Books Ngram Corpus (Michel et al., 2011;Lin et al., 2012), with 24,267 and 62,599 candidates respectively, we observed improved detection performance with the increase in ∆ size. In Table 3, the overall AUC achieves 0.94 with 5gram, with AUC in individual task 0.98, 0.93 and 0.86 for SC, QA and NER respectively. ...
... Google Books Ngram Corpus (Michel et al., 2011; Lin et al., 2012). It is built from a sequence of n-grams occurring at least 40 times in the corpus, and the corpus contains 4% of all books ever published in the world. ...
... The Google books n-gram corpus is a corpus of books from the 1800s to the 2000s that were digitized by Google (Michel et al., 2011). We used the second version of the corpus, which contains over 6% of all books that were ever published (Lin et al., 2012). From the Google books n-gram corpus, we constructed data sets for five languages: English, French, German, Italian, and Spanish. ...
... The values for the control variable POS were assigned on the basis of the parts-of-speech tagging in the Google books corpora provided by Lin et al. (2012). We assigned the most frequent POS tag to word types that occur with multiple POS tags in the Google books corpora. ...
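The tag-assignment step described in this excerpt (taking the most frequent POS tag per word type) can be sketched as follows; the record format and tag names are illustrative assumptions rather than the actual layout of the Lin et al. (2012) files.

```python
from collections import defaultdict

# Sketch: assign each word type its most frequent POS tag, given (word, tag, count)
# records aggregated from a POS-tagged ngram export. The input format is assumed.
def most_frequent_pos(records):
    counts = defaultdict(lambda: defaultdict(int))
    for word, tag, count in records:
        counts[word.lower()][tag] += count
    return {word: max(tags, key=tags.get) for word, tags in counts.items()}

records = [("run", "VERB", 120_000), ("run", "NOUN", 45_000), ("walk", "VERB", 80_000)]
print(most_frequent_pos(records))   # {'run': 'VERB', 'walk': 'VERB'}
```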
Article
Previous studies provided evidence for a connection between language processing and language change. We add to these studies with an exploration of the influence of lexical‐distributional properties of words in orthographic space, semantic space, and the mapping between orthographic and semantic space on the probability of lexical extinction. Through a binomial linear regression analysis, we investigated the probability of lexical extinction by the first decade of the twenty‐first century (2000s) for words that existed in the first decade of the nineteenth‐century (1800s) in eight data sets for five languages: English, French, German, Italian, and Spanish. The binomial linear regression analysis revealed that words that are more similar in form to other words are less likely to disappear from a language. By contrast, words that are more similar in meaning to other words are more likely to become extinct. In addition, a more consistent mapping between form and meaning protects a word from lexical extinction. A nonlinear time‐to‐event analysis furthermore revealed that the position of a word in orthographic and semantic space continues to influence the probability of it disappearing from a language for at least 200 years. Effects of the lexical‐distributional properties of words under investigation here have been reported in the language processing literature as well. The results reported here, therefore, fit well with a usage‐based approach to language change, which holds that language change is at least to some extent connected to cognitive mechanisms in the human brain.
... freqband is the frequency grouping from the Oxford English Dictionary (OED) based on the raw frequencies from Google Ngrams version 2 (Lin et al., 2012). ...
... wordage is the number of years since a word was first used (as of 2000), as reported by Google Ngram, based on 450 million words scanned from Google Books (Lin et al., 2012). ...
Article
Full-text available
Purpose: There are many aspects of words that can influence our lexical processing, and the words we are exposed to influence our opportunities for language and reading development. The purpose of this study is to establish a more comprehensive understanding of the lexical challenges and opportunities students face. Method: We explore the latent relationships of word features across three established word lists: the General Service List, Academic Word List, and discipline-specific word lists from the Academic Vocabulary List. We fit exploratory factor models using 22 non-behavioral, empirical measures to three sets of vocabulary words: 2,060 high-frequency words, 1,051 general academic words, and 3,413 domain-specific words. Results: We found Frequency, Complexity, Proximity, Polysemy, and Diversity were largely stable factors across the sets of high-frequency and general academic words, but that the challenge facing learners is structurally different for domain-specific words. Conclusion: Despite substantial stability, there are important differences in the latent lexical features that learners encounter. We discuss these results and provide our latent factor estimates for words in our sample.
... According to the CFW method, the vectors were composed of the values of regularized pointwise mutual information (in the form proposed in Bochkarev et al., 2021) for bigrams of the form Wx and xW, where W is the target word and x is one of the most frequent words. The frequency data on words and phrases required for constructing the vectors were extracted from the large diachronic corpus Google Books Ngram (Lin et al., 2012). To train the neural network, we use the frequency data averaged over the period 1900-2019. ...
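A minimal sketch of such co-occurrence vectors is given below, using plain positive PMI over bigram counts of the form Wx and xW; the regularized PMI variant of Bochkarev et al. (2021) and the actual Google Books Ngram extraction are not reproduced, and the toy counts are invented.

```python
import math

# Sketch: a co-occurrence vector for a target word W built from bigram counts (W, x)
# and (x, W), where x ranges over a fixed list of frequent context words.
# Plain positive PMI is used here instead of the regularized variant.
def pmi_vector(target, context_words, bigram_counts, unigram_counts, total_bigrams):
    total_unigrams = sum(unigram_counts.values())
    vec = []
    for x in context_words:
        for pair in ((target, x), (x, target)):
            joint = bigram_counts.get(pair, 0)
            if joint == 0:
                vec.append(0.0)
                continue
            p_joint = joint / total_bigrams
            p_first = unigram_counts[pair[0]] / total_unigrams
            p_second = unigram_counts[pair[1]] / total_unigrams
            vec.append(max(0.0, math.log(p_joint / (p_first * p_second))))
    return vec

unigrams = {"coffee": 1200, "the": 90_000, "drink": 800}
bigrams = {("drink", "coffee"): 150, ("the", "coffee"): 400}
print(pmi_vector("coffee", ["the", "drink"], bigrams, unigrams, total_bigrams=500_000))
```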
Article
Full-text available
In recent works, a new psycholinguistic concept has been introduced and studied: the socialness of a word. In particular, Diveica et al. (2022) present a dictionary with socialness ratings obtained using a survey method. The socialness rating reflects a word's social significance. Unfortunately, the size of the existing dictionary of word socialness ratings is relatively small. In this paper, we propose linear and neural network predictors of socialness ratings using pre-trained fastText vectors as input. The obtained Spearman's correlation coefficient between human socialness ratings and machine ones is 0.869. The trained models allowed us to obtain socialness ratings for 2 million English words, as well as a wide range of words in 43 other languages. An unexpected result is that the linear model provides a highly accurate estimate of the socialness ratings, which can hardly be improved further. Apparently, this is due to the fact that in the space of vectors representing words there is a selected direction responsible for meanings associated with socialness, driven by the social factors influencing word representation and use. The article also presents a diachronic neural network predictor of concreteness ratings using word co-occurrence vectors as input data. It is shown that using one year of data from the large diachronic corpus Google Books Ngram, one can obtain accuracy comparable to the accuracy of synchronic estimates. We study some examples of words that are characterised by significant changes in socialness ratings over the past 150 years. It is concluded that changes in socialness ratings can serve as a marker of word meaning change.
... After the test run of synonym matches with paper and patent texts (described below), we identified some homonyms irrelevant to genes. To reduce these homonyms, we removed non-biological phrases in common use based on their frequent occurrences in published books, available from the Google Books Ngram Corpus (version 3, February 2020) [87]: phrases with higher counts than an empirically chosen threshold (1.8×10^7 counts in books published by 2020) were automatically removed from our gene synonyms. The remaining synonyms underwent extensive manual inspection and curation, and we then finalized our gene synonym sets. ...
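The frequency-threshold filtering step described above can be sketched as follows; the counts are invented and the loading of Google Books Ngram totals is assumed to happen elsewhere.

```python
# Sketch: drop gene synonyms that are also very common English phrases, using total
# occurrence counts assumed to have been aggregated from the Google Books Ngram Corpus.
THRESHOLD = 1.8e7  # empirically chosen count threshold quoted in the excerpt above

def filter_synonyms(synonyms, ngram_total_counts):
    return {s for s in synonyms if ngram_total_counts.get(s.lower(), 0) <= THRESHOLD}

counts = {"p53": 2.1e6, "mice": 9.4e7, "camp": 5.2e7}   # invented totals
print(filter_synonyms({"P53", "MICE", "CAMP"}, counts))  # {'P53'}
```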
Preprint
Full-text available
Current technological revolutions, involving artificial intelligence, mRNA vaccines, and quantum computing, are largely driven by industry. Despite the existing perception that commercial motives promote cutting-edge innovation, concerns may arise about their risk of limiting scientific exploration from diverse perspectives, which nurtures long-term innovative potential. Here, we investigate the interplay between scientific exploration and industrial influence by analyzing about 20 million papers and US, Chinese, and European patents in genetic research, a domain of far-reaching societal importance. We observe that research on new genes has declined since the early 2000s, but the exploration of novel gene combinations still underpins biotechnology innovation. Fields of highly practical or commercial focus are less likely to adopt the innovative approaches, exhibiting lower research vitality. Additionally, continuous scientific research creates exploratory opportunities for innovation, while industry's R&D efforts are typically short-lived. Alarmingly, up to 42.2-74.4% of these exploratory opportunities could be lost if scientific research is restrained by industry interests, highlighting the cost of over-reliance on commercially-driven research. Given the industry's dominance in recent technologies, our work calls for a balanced approach with long-term scientific exploration to preserve innovation vitality, unlock the full potential of genetic research and biotechnology, and address complex global challenges.
... The most straightforward approach is to replace lexical features with word representations, such as Brown clusters (Brown et al., 1992;Lin et al., 2012) or word embeddings (Turian et al., 2010), such as word2vec (Mikolov et al., 2013). Lexical features can then be replaced or augmented with the resulting word representations. ...
Preprint
As more historical texts are digitized, there is interest in applying natural language processing tools to these archives. However, the performance of these tools is often unsatisfactory, due to language change and genre differences. Spelling normalization heuristics are the dominant solution for dealing with historical texts, but this approach fails to account for changes in usage and vocabulary. In this empirical paper, we assess the capability of domain adaptation techniques to cope with historical texts, focusing on the classic benchmark task of part-of-speech tagging. We evaluate several domain adaptation methods on the task of tagging Early Modern English and Modern British English texts in the Penn Corpora of Historical English. We demonstrate that the Feature Embedding method for unsupervised domain adaptation outperforms word embeddings and Brown clusters, showing the importance of embedding the entire feature space, rather than just individual words. Feature Embeddings also give better performance than spelling normalization, but the combination of the two methods is better still, yielding a 5% raw improvement in tagging accuracy on Early Modern English texts.
... Google Books 2-grams. This feature determines if term 1 forms a significant two-word phrase with att, more than term 2 does, based on the Google Books English Fiction data (Lin et al., 2012). The "significance" (s) of a two-word phrase is determined by comparing the smoothed log-likelihood of the individual unigrams to the smoothed log-likelihood of the phrase: $s(\mathrm{term}, \mathrm{att}) = 10 + \log_{10}\big(\#(\mathrm{term}, \mathrm{att}) + 1\big) - \log_{10}\big((\#(\mathrm{term}) + 10^5)(\#(\mathrm{att}) + 10^5)\big)$ ...
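A direct implementation of the quoted significance score might look like this; the count lookups are assumed to be backed by Google Books English Fiction unigram and bigram tables, and the example counts are invented.

```python
import math

# Direct implementation of the smoothed significance score quoted above.
# count_1gram(term) and count_2gram(term, att) are assumed lookups into
# unigram/bigram count tables.
def significance(term, att, count_1gram, count_2gram):
    return (10
            + math.log10(count_2gram(term, att) + 1)
            - math.log10((count_1gram(term) + 1e5) * (count_1gram(att) + 1e5)))

unigrams = {"hot": 2_500_000, "dog": 1_800_000}          # invented counts
bigrams = {("hot", "dog"): 90_000}
s = significance("hot", "dog",
                 lambda w: unigrams.get(w, 0),
                 lambda a, b: bigrams.get((a, b), 0))
print(round(s, 3))
```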
Preprint
Full-text available
Luminoso participated in the SemEval 2018 task on "Capturing Discriminative Attributes" with a system based on ConceptNet, an open knowledge graph focused on general knowledge. In this paper, we describe how we trained a linear classifier on a small number of semantically-informed features to achieve an F1 score of 0.7368 on the task, close to the task's high score of 0.75.
... In order to alleviate the homogenization problem, recent works [31], [10] proposed the use of Google Books Ngram Corpus (GBNC) [20] instead of WordNet and ConceptNet to obtain query expansions for candidate images collection. The Google Books Ngrams Corpus covers almost all related queries at the text level. ...
Preprint
Labelled image datasets have played a critical role in high-level image understanding. However, the process of manual labelling is both time-consuming and labor intensive. To reduce the cost of manual labelling, there has been increased research interest in automatically constructing image datasets by exploiting web images. Datasets constructed by existing methods tend to have a weak domain adaptation ability, which is known as the "dataset bias problem". To address this issue, we present a novel image dataset construction framework that can be generalized well to unseen target domains. Specifically, the given queries are first expanded by searching the Google Books Ngrams Corpus to obtain a rich semantic description, from which the visually non-salient and less relevant expansions are filtered out. By treating each selected expansion as a "bag" and the retrieved images as "instances", image selection can be formulated as a multi-instance learning problem with constrained positive bags. We propose to solve the employed problems by the cutting-plane and concave-convex procedure (CCCP) algorithm. By using this approach, images from different distributions can be kept while noisy images are filtered out. To verify the effectiveness of our proposed approach, we build an image dataset with 20 categories. Extensive experiments on image classification, cross-dataset generalization, diversity comparison and object detection demonstrate the domain robustness of our dataset.
... For semantic change computation, only those corpora that are designed to be informative of semantic change should be used, and only those semantic aspects that are contrastive in the corpora should be studied. ... (Lin, Michel, Aiden, Orwant, Brockman, and Petrov, 2012), due to its size, range of time, types of language included in the corpus, and public availability. The corpus has been available at http://books.google.com/ngrams ...
Preprint
Full-text available
This paper reviews the state-of-the-art of semantic change computation, one emerging research field in computational linguistics, proposing a framework that summarizes the literature by identifying and expounding five essential components in the field: diachronic corpus, diachronic word sense characterization, change modelling, evaluation data and data visualization. Despite the potential of the field, the review shows that current studies are mainly focused on testing hypotheses proposed in theoretical linguistics and that several core issues remain to be solved: the need for diachronic corpora of languages other than English, the need for comprehensive evaluation data, the comparison and construction of approaches to diachronic word sense characterization and change modelling, and further exploration of data visualization techniques for hypothesis justification.
... We trained models on the 6 datasets described in Table 1, taken from Google N-Grams (Lin et al., 2012) and the COHA corpus (Davies, 2010). The Google N-Gram datasets are extremely large (comprising ≈6% of all books ever published), but they also contain many corpus artifacts due, e.g., to shifting sampling biases over time (Pechenick et al., 2015). ...
Preprint
Understanding how words change their meanings over time is key to models of language and cultural evolution, but historical data on meaning is scarce, making theories hard to develop and test. Word embeddings show promise as a diachronic tool, but have not been carefully evaluated. We develop a robust methodology for quantifying semantic change by evaluating word embeddings (PPMI, SVD, word2vec) against known historical changes. We then use this methodology to reveal statistical laws of semantic evolution. Using six historical corpora spanning four languages and two centuries, we propose two quantitative laws of semantic change: (i) the law of conformity---the rate of semantic change scales with an inverse power-law of word frequency; (ii) the law of innovation---independent of frequency, words that are more polysemous have higher rates of semantic change.
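A hedged sketch of the core measurement behind such studies is shown below: align embeddings from two periods with orthogonal Procrustes and take the cosine distance between a word's two vectors. This mirrors the general diachronic-embedding recipe rather than the exact HistWords pipeline, and the random matrices are placeholders for real decade-level embeddings.

```python
import numpy as np

# Sketch: quantify semantic change of a word between two periods by (1) aligning the
# later embedding space to the earlier one with orthogonal Procrustes, then (2) taking
# the cosine distance between the word's two vectors.
def procrustes_align(X_old, X_new):
    # X_old, X_new: (vocab, dim) matrices over a shared, identically ordered vocabulary
    U, _, Vt = np.linalg.svd(X_new.T @ X_old)
    R = U @ Vt
    return X_new @ R

def semantic_change(v_old, v_new_aligned):
    cos = np.dot(v_old, v_new_aligned) / (np.linalg.norm(v_old) * np.linalg.norm(v_new_aligned))
    return 1.0 - cos

rng = np.random.default_rng(0)
X_1900, X_1990 = rng.normal(size=(1000, 50)), rng.normal(size=(1000, 50))
X_1990_aligned = procrustes_align(X_1900, X_1990)
print(semantic_change(X_1900[42], X_1990_aligned[42]))
```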
... While the mechanisms underlying organismal evolution have been explored extensively, the forces responsible for language evolution remain unclear. Quantitative methods to infer evolutionary forces developed in population genetics have not been widely applied in linguistics, despite the recent availability of massive digital corpora [4][5][6][7] . ...
Preprint
Languages and genes are both transmitted from generation to generation, with opportunity for differential reproduction and survivorship of forms. Here we apply a rigorous inference framework, drawn from population genetics, to distinguish between two broad mechanisms of language change: drift and selection. Drift is change that results from stochasticity in transmission and it may occur in the absence of any intrinsic difference between linguistic forms; whereas selection is truly an evolutionary force arising from intrinsic differences -- for example, when one form is preferred by members of the population. Using large corpora of parsed texts spanning the 12th century to the 21st century, we analyze three examples of grammatical changes in English: the regularization of past-tense verbs, the rise of the periphrastic `do', and syntactic variation in verbal negation. We show that we can reject stochastic drift in favor of a selective force driving some of these language changes, but not others. The strength of drift depends on a word's frequency, and so drift provides an alternative explanation for why some words are more prone to change than others. Our results suggest an important role for stochasticity in language change, and they provide a null model against which selective theories of language evolution must be compared.
... The fourth embedding we validate uses word2vec on Google Ngram text from 2000-2012. The Google Ngram corpus is the product of a massive project in text digitization, in collaboration with thousands of the world's libraries, which distills text from 6% of all books ever published (Lin et al. 2012; Michel et al. 2011). Any sequence of five words that occurs more than 40 times over the entirety of the scanned texts appears in the collection of 5-grams, along with the number of times it occurred each year. ...
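For readers who want to work with the raw data, a sketch of aggregating per-year counts from a downloaded ngram file follows; the tab-separated column layout (ngram, year, match_count, volume_count) reflects the published version-2 format but should be checked against the actual files.

```python
from collections import defaultdict

# Sketch: sum per-year match counts for 5-grams over a time window, assuming
# tab-separated lines of the form: ngram \t year \t match_count \t volume_count.
def yearly_counts(lines, start_year=2000, end_year=2012):
    totals = defaultdict(int)  # ngram -> summed match_count over the window
    for line in lines:
        ngram, year, match_count, _volume_count = line.rstrip("\n").split("\t")
        if start_year <= int(year) <= end_year:
            totals[ngram] += int(match_count)
    return totals

sample = ["analysis of the human genome\t2005\t73\t60",
          "analysis of the human genome\t2010\t121\t95"]
print(yearly_counts(sample))  # {'analysis of the human genome': 194}
```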
Preprint
We demonstrate the utility of a new methodological tool, neural-network word embedding models, for large-scale text analysis, revealing how these models produce richer insights into cultural associations and categories than possible with prior methods. Word embeddings represent semantic relations between words as geometric relationships between vectors in a high-dimensional space, operationalizing a relational model of meaning consistent with contemporary theories of identity and culture. We show that dimensions induced by word differences (e.g. man - woman, rich - poor, black - white, liberal - conservative) in these vector spaces closely correspond to dimensions of cultural meaning, and the projection of words onto these dimensions reflects widely shared cultural connotations when compared to surveyed responses and labeled historical data. We pilot a method for testing the stability of these associations, then demonstrate applications of word embeddings for macro-cultural investigation with a longitudinal analysis of the coevolution of gender and class associations in the United States over the 20th century and a comparative analysis of historic distinctions between markers of gender and class in the U.S. and Britain. We argue that the success of these high-dimensional models motivates a move towards "high-dimensional theorizing" of meanings, identities and cultural processes.
... All three used WordNet, Wikipedia and other large corpora. In particular, Banea et al. (2012) obtained models from 6 million Wikipedia articles and more than 9.5 million hyperlinks; Bär et al. (2012) used Wiktionary, which contains over 3 million entries; and Šarić et al. (2012) used The New York Times Annotated Corpus (Sandhaus, 2008), which contains over 1.8 million news articles, and Google n-grams (Lin et al., 2012), which consists of approximately 24GB of compressed text files. Our approach only uses WordNet, by far the smallest external resource with less than 120,000 synsets. ...
... Hamilton, Leskovec, and Jurafsky (2016) introduced HistWords to prove that more frequently used words exhibit less semantic change over time, and that polysemous words exhibit faster semantic change. We apply the ML-EAT to the English language HistWords embeddings trained using Word2Vec (SGNS) (Mikolov et al. 2013) on Google books (all genres) (Lin et al. 2012). • GPT-2 Language Models: GPT-2 ("Generative Pretrained Transformer") is a causally masked transformer (Vaswani et al. 2017) language model trained to predict the next word in a sequence (Radford et al. 2019). ...
Article
This research introduces the Multilevel Embedding Association Test (ML-EAT), a method designed for interpretable and transparent measurement of intrinsic bias in language technologies. The ML-EAT addresses issues of ambiguity and difficulty in interpreting the traditional EAT measurement by quantifying bias at three levels of increasing granularity: the differential association between two target concepts with two attribute concepts; the individual effect size of each target concept with two attribute concepts; and the association between each individual target concept and each individual attribute concept. Using the ML-EAT, this research defines a taxonomy of EAT patterns describing the nine possible outcomes of an embedding association test, each of which is associated with a unique EAT-Map, a novel four-quadrant visualization for interpreting the ML-EAT. Empirical analysis of static and diachronic word embeddings, GPT-2 language models, and a CLIP language-and-image model shows that EAT patterns add otherwise unobservable information about the component biases that make up an EAT; reveal the effects of prompting in zero-shot models; and can also identify situations when cosine similarity is an ineffective metric, rendering an EAT unreliable. Our work contributes a method for rendering bias more observable and interpretable, improving the transparency of computational investigations into human minds and societies.
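A sketch of the classic embedding association effect size that the ML-EAT decomposes is given below (the standardized differential association of two target sets with two attribute sets); the multilevel decomposition and EAT-Maps themselves are not reproduced, and the random vectors are placeholders for real word embeddings.

```python
import numpy as np

def _cos(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def _assoc(w, A, B):
    # mean cosine similarity of w with attribute set A minus with attribute set B
    return np.mean([_cos(w, a) for a in A]) - np.mean([_cos(w, b) for b in B])

def eat_effect_size(X, Y, A, B):
    # Classic EAT-style effect size: standardized differential association of target
    # sets X and Y with attribute sets A and B. Inputs are lists of word vectors.
    assoc_x = [_assoc(x, A, B) for x in X]
    assoc_y = [_assoc(y, A, B) for y in Y]
    pooled = np.std(assoc_x + assoc_y, ddof=1)
    return (np.mean(assoc_x) - np.mean(assoc_y)) / pooled

rng = np.random.default_rng(1)
vecs = lambda n: [rng.normal(size=50) for _ in range(n)]  # placeholder embeddings
print(eat_effect_size(vecs(8), vecs(8), vecs(8), vecs(8)))
```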
... The Google Books Ngram corpus [24] contains data on frequencies of words and phrases from 8 languages over the past five centuries. The corpus is widely used in cultural and language evolution studies [14,44]. ...
Article
Full-text available
Valence of words in books reflects the situation in a society and allows one to assess the perception of life even in those countries and periods of time when direct research on well-being has not been conducted. We use the Google Books Ngram diachronic text corpus to analyze changes in the average valence of words in Russian books. We show that changes in the average valence correlate with the results of surveys on well-being. The average valence also responds to major historical events and social changes. Like other similar studies, quantitative data on the level of valence of words are based on calculations using dictionaries with word valence ratings. For the first time, we have carried out a comparative study based on a number of the most relevant Russian dictionaries. We have found that the obtained results depend on the dictionaries applied and their lexical composition. This shows the need for careful selection of dictionaries for future research.
... It postulates that 'the most robust historical trends are associated with frequent n-grams.' The second version of GBN is described in detail in (Lin et al. 2012). The main difference between the first and second versions is the addition of syntactic tags. ...
Article
This article briefly summarizes primary publications that use Google Books Ngram (GBN) to study societal change. GBN is the most extensive tagged diachronic corpus available. Trends in societal evolution can be studied using year-by-year word frequency statistics. The development of individualism, changes in emotions and happiness, social psychology, and some other topics are among those examined in this article as research areas that have attracted the most interest. This paper discusses the specific findings and the research methodology, particularly its limitations. There are some examples of how GBN can be used to test existing scientific theories. New, unexpected, and scientifically significant findings are possible with GBN that would not be possible with other approaches.
... To ensure its validity, the corpus was evaluated using Word2Vec [56], which encodes words into fixed-length vectors based on their context. Subsequently, Word2Vec models were trained on MirasText and the Google N-gram dataset [92] to generate word clusters. Comparing these clusters revealed a high correlation, confirming MirasText's coherence and validity. ...
Article
Full-text available
The Persian language, also known as Farsi, is distinguished by its intricate morphological richness, yet it contends with a paucity of linguistic resources. With an estimated 110 million speakers, it finds prevalence across Iran, Tajikistan, Uzbekistan, Iraq, Russia, Azerbaijan, and Afghanistan. However, despite its widespread usage, scholarly investigations into Persian document retrieval remain notably scarce. This circumstance is primarily attributed to the absence of standardized test collections, which impedes the advancement of comprehensive research endeavors within this realm. As data corpora are the foundation of natural language processing applications, this work focuses on Persian-language datasets, addressing their availability and structure. Subsequently, we motivate a learning-based framework for the processing and recognition of Persian texts, for which current state-of-the-art approaches from deep learning, such as deep neural networks, are further discussed. Our investigations highlight the challenges of realizing such a system while emphasizing its possible benefits for an otherwise rarely covered language.
... Additionally, we lemmatize words for uniformity and exclude words under three characters to improve data quality. We then calculated the log-odds ratio (z-score) for each word between the pre- and post-takeover corpora, using prior frequencies from the Google Books Ngram corpus (Lin et al., 2012). This method identifies representative words unique to each corpus based on significance within each. ...
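The log-odds computation mentioned in this excerpt can be sketched as below, in the style of an informative-Dirichlet-prior log-odds z-score with background frequencies (e.g., Google Books counts) as the prior; the prior strength and exact smoothing are assumptions, not the cited authors' settings.

```python
import math

# Sketch of a log-odds ratio z-score with an informative Dirichlet prior,
# using background word frequencies (e.g. Google Books Ngram counts) as the prior.
# The prior_strength constant is illustrative.
def log_odds_z(word, counts_pre, counts_post, prior_counts, prior_strength=1000.0):
    n_pre, n_post = sum(counts_pre.values()), sum(counts_post.values())
    a0 = prior_strength
    a_w = a0 * prior_counts.get(word, 1) / sum(prior_counts.values())
    y_pre, y_post = counts_pre.get(word, 0), counts_post.get(word, 0)
    delta = (math.log((y_pre + a_w) / (n_pre + a0 - y_pre - a_w))
             - math.log((y_post + a_w) / (n_post + a0 - y_post - a_w)))
    var = 1.0 / (y_pre + a_w) + 1.0 / (y_post + a_w)
    return delta / math.sqrt(var)
```

A positive score marks a word as over-represented in the pre-takeover corpus, a negative score in the post-takeover corpus, with the prior damping scores for words that are rare overall.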
... N-grams with the term "offshore" in combination with another term ("offshore" + "X"), such as offshore wind, offshore aquaculture, offshore bank, offshore gas, as well as other combinations are shown. An N-gram is the result of breaking a text down into individual words or fragments, such as word combinations, whose frequencies are then counted (Bohannon, 2010; Michel et al., 2010; Russell, 2011; Lin et al., 2012). There may be some inaccuracy in the Google N-gram Viewer data, especially since words from books are only counted if there are more than 40 entries. ...
Article
Full-text available
The terms “offshore” and “open ocean” have been used to describe aquaculture sites that are further from the coast or in higher energy environments. Neither term has been clearly defined in the scientific literature nor in a legal context, and the terms are often used interchangeably. These and other related terms (for example “exposed”, “high-energy”) variously refer to aspects of a site such as the geographic distance from shore or infrastructure, the level of exposure to large waves and strong currents, the geographic fetch, the water depth, or some combination of these parameters. The ICES Working Group on Open Ocean Aquaculture (WGOOA; ICES, 2024) therefore identified a need to define the terminology to reduce ambiguity for these types of aquaculture sites, or more precisely, to: (1) promote a common understanding and avoid misuse for different classifications; (2) enable regulators to identify the characteristics of a marine site; (3) allow farmers to assess or quantitatively compare sites for development; (4) equip developers and producers to identify operational parameters in which the equipment and vessels will need to operate; (5) provide insurers and investors with the terminology to consistently assess risk and premiums; and (6) circumvent the emergence of narratives rooted in different cognitive interpretations of the terminology in public discourse. This paper describes the evolution of the use of the term “offshore aquaculture” and defines the most relevant parameters to shift to a more definitive and robust term, “exposed aquaculture”, that can inherently relay clearer information. Adoption of this more definitive definition of “exposed” will allow the user to define a site with more than just distance from shore. Key differences among these terms, and their importance for various interest groups, are discussed. Follow-up articles in this compilation, from scientific members of the WGOOA as well as other scientists outside ICES, develop a set of definitions and a rigorous exposure index.
... In addition, these databases are general and could be used for any future modeling need that requires the specified element types (either rods or beams, in the case of this study). Therefore, much like the many databases available for diverse applications, such as the ImageNet dataset used for computer vision [57] or the Google Books Ngrams dataset used for natural language processing [58], these databases of structural elements can also become standardized and made available to the scientific community for use. ...
Preprint
Full-text available
This paper extends the finite element network analysis (FENA) to include a dynamic time-transient formulation. FENA was initially formulated in the context of the linear static analysis of 1D and 2D elastic structures. By introducing the concept of super finite network element, this paper provides the necessary foundation to extend FENA to linear time-transient simulations for both homogeneous and inhomogeneous domains. The concept of neural network concatenation, originally formulated to combine networks representative of different structural components in space, is extended to the time domain. Network concatenation in time enables training neural network models based on data available in a limited time frame and then using the trained networks to simulate the system evolution beyond the initial time window characteristic of the training data set. The proposed methodology is validated by applying FENA to the transient simulation of one-dimensional structural elements (such as rods and beams) and by comparing the results with either analytical or finite element solutions. Results confirm that FENA accurately predicts the dynamic response of the physical system and, while introducing an error on the order of 1% (compared to analytical or computational solutions of the governing differential equations), it is capable of delivering extreme computational efficiency.
... in which A is the corrected word, and B comprises the typed letters. The probability of the corrected word was estimated by the square-root log-frequency of the word's occurrence based on Google N Gram's 2018 database [130]. The probability of the typed letters given the corrected word was defined as 1 minus the Euclidean error between the two. ...
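A sketch of the noisy-channel scoring described in this excerpt follows: candidates are ranked by the product of a square-root log-frequency word prior and one minus a keyboard-distance error. The key-coordinate grid, the normalization constant, and the toy counts are illustrative assumptions rather than the cited study's exact implementation.

```python
import math

# Sketch: rank correction candidates by P(word) * P(typed | word), with
# P(word) ~ square-root log-frequency (counts assumed to come from an ngram lookup)
# and P(typed | word) ~ 1 minus a normalized Euclidean keyboard error.
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
KEY_XY = {c: (x, y) for y, row in enumerate(ROWS) for x, c in enumerate(row)}

def keyboard_error(typed, word):
    if len(typed) != len(word):
        return 1.0
    dists = [math.dist(KEY_XY[a], KEY_XY[b]) for a, b in zip(typed, word)]
    return min(1.0, sum(dists) / (len(word) * 9.0))  # 9 keys ~ max horizontal span

def score(typed, candidate, ngram_count):
    p_word = math.sqrt(math.log(ngram_count + 2))    # square-root log-frequency prior
    p_typed_given_word = 1.0 - keyboard_error(typed, candidate)
    return p_word * p_typed_given_word

counts = {"hello": 5_000_000, "hells": 40_000}       # invented counts
typed = "hwllo"
print(max(counts, key=lambda w: score(typed, w, counts[w])))  # likely 'hello'
```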
Article
Full-text available
A brain-computer interface (BCI) enables users to control devices with their minds. Despite advancements, non-invasive BCIs still exhibit high error rates, prompting investigation into the potential reduction through concurrent targeted neuromodulation. Transcranial focused ultrasound (tFUS) is an emerging non-invasive neuromodulation technology with high spatiotemporal precision. This study examines whether tFUS neuromodulation can improve BCI outcomes, and explores the underlying mechanism of action using high-density electroencephalography (EEG) source imaging (ESI). As a result, V5-targeted tFUS significantly reduced the error in a BCI speller task. Source analyses revealed a significant increase in theta and alpha activities in the tFUS condition at both V5 and downstream in the dorsal visual processing pathway. Correlation analysis indicated that the connection within the dorsal processing pathway was preserved during tFUS stimulation, while the ventral connection was weakened. These findings suggest that V5-targeted tFUS enhances feature-based attention to visual motion.
... To construct semantic trajectories of words, we use the downloadable version of the Google Books Ngram Viewer database version 2, downloaded from https://storage.googleapis.com/books/ngrams/books/datasetsv2.html, as time-labelled corpora available in multiple languages [8,36]. In this work, we included five languages from the Indo-European family with the most available data, English, French, German, Spanish, and Italian. ...
Article
Full-text available
How do words change their meaning? Although semantic evolution is driven by a variety of distinct factors, including linguistic, societal, and technological ones, we find that there is one law that holds universally across five major Indo-European languages: that semantic evolution is subdiffusive. Using an automated pipeline of diachronic distributional semantic embedding that controls for underlying symmetries, we show that words follow stochastic trajectories in meaning space with an anomalous diffusion exponent α = 0.45 ± 0.05 across languages, in contrast with diffusing particles that follow α = 1. Randomization methods indicate that preserving temporal correlations in semantic change directions is necessary to recover strongly subdiffusive behavior; however, correlations in change sizes play an important role too. We furthermore show that strong subdiffusion is a robust phenomenon under a wide variety of choices in data analysis and interpretation, such as the choice of fitting an ensemble average of displacements or averaging best-fit exponents of individual word trajectories.
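A minimal sketch of estimating such an anomalous diffusion exponent is shown below: fit the slope of the ensemble-averaged mean squared displacement against time on log-log axes. The construction of word trajectories from diachronic embeddings is assumed to have been done already, and ordinary Brownian motion is used only as a sanity check.

```python
import numpy as np

# Sketch: estimate the anomalous diffusion exponent alpha from word trajectories in a
# semantic space, via a log-log fit of the ensemble-averaged mean squared displacement,
# MSD(t) ~ t**alpha. Trajectories are assumed to be an (n_words, n_timesteps, dim) array.
def msd_exponent(trajectories):
    disp = trajectories - trajectories[:, :1, :]          # displacement from t = 0
    msd = np.mean(np.sum(disp ** 2, axis=2), axis=0)      # ensemble average per timestep
    t = np.arange(1, trajectories.shape[1])               # skip t = 0 (MSD is zero there)
    alpha, _intercept = np.polyfit(np.log(t), np.log(msd[1:]), 1)
    return alpha

rng = np.random.default_rng(0)
brownian = np.cumsum(rng.normal(size=(500, 100, 20)), axis=1)
print(round(msd_exponent(brownian), 2))   # close to 1.0 for ordinary diffusion
```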
... The Google Books Ngram viewer is a search engine that outputs the frequencies of search strings using n-grams in printed sources published between 1500 and 2019 in Google's text corpora in several languages. The total collection is far from complete and is claimed to contain more than 6% of all the books ever published [33]. The authors employed the terms comprising the Moral Foundations Dictionary (see Section 2.2 for a description of that lexicon) as search strings and found the resulting term frequencies for all the terms associated with the six moral foundations and derived the frequency of each moral foundation, restricting their analysis to books in English. ...
Article
Full-text available
Moral features are essential components of TV series, helping the audience to engage with the story, exploring themes beyond sheer entertainment, reflecting current social issues, and leaving a long-lasting impact on the viewers. Their presence shows through the language employed in the plot description. Detecting them helps in understanding the series writers’ underlying message. In this paper, we propose an approach to detect moral features in TV series. We rely on the Moral Foundations Theory (MFT) framework to classify moral features and use the associated MFT dictionary to identify the words expressing those features. Our approach combines that dictionary with word embedding and similarity analysis through a deep learning SBERT (Sentence-Bidirectional Encoder Representations from Transformers) architecture to quantify the comparative prominence of moral features. We validate the approach by applying it to the definitions of the MFT moral feature labels as they appear in general authoritative dictionaries. We apply our technique to the summaries of a selection of TV series representative of several genres and relate the results to the actual content of each series, showing the consistency of results.
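A sketch of the similarity scoring such an approach relies on is given below, using the sentence-transformers library; the model name and the tiny foundation word lists are illustrative placeholders, and the paper's full MFT dictionary and aggregation steps are not reproduced.

```python
from sentence_transformers import SentenceTransformer, util

# Sketch: score the prominence of moral foundations in a plot summary by cosine
# similarity between SBERT embeddings of the summary and of foundation-related words.
mft_words = {  # toy subsets; the real MFT dictionary is much larger
    "care": ["compassion", "protect", "suffering"],
    "fairness": ["justice", "equality", "cheating"],
}
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def foundation_scores(summary: str) -> dict:
    summary_emb = model.encode(summary, convert_to_tensor=True)
    scores = {}
    for foundation, words in mft_words.items():
        word_embs = model.encode(words, convert_to_tensor=True)
        scores[foundation] = float(util.cos_sim(summary_emb, word_embs).mean())
    return scores

print(foundation_scores("A detective risks everything to shield a child witness."))
```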
... The history of the term burble begs the question: why did aerodynamicists stop using the term burble? The authors used the Google Ngram tool [226,227]. The term burble has been used throughout its history as a provisional descriptor. A phenomenon relating to eddying flow is studied, given a transient name relating to burble, and generally is given a condition-specific name if extended study is dedicated to the topic. ...
Conference Paper
Full-text available
The term burble has been in use in aerodynamic theory for over a century. While burble may be unfamiliar to most contemporary aerodynamicists, the word has a rich history based in aerodynamic theory and experimentation. The present paper outlines the fluidity of burble's meaning over time. From analyzing subsonic flow over an airfoil, to the implementation of stochastic turbulence in aircraft carrier landing simulations, the term burble has had a significant impact on the study of aerodynamics. The term burble has fallen out of use in aerodynamic engineering circles. Why did this happen? And what can be learned from the decline in use of the term burble?
... Metadata can be considered at various levels, from the publication date of a whole work to the part-of-speech category of a word in a sentence. Possible metadata that can be useful for retrieval are evident in various tools and literature (e.g., Fenlon et al. 2014; Finlayson 2015; Lin et al. 2012; Underwood 2015). Five categories of metadata that can be used to capture additional data about texts have been identified and discussed in previous work (Ball 2020; Ball and Bothma 2022), namely, morphological, syntactic, semantic, functional and bibliographic. ...
Article
Full-text available
Despite the growth in digital text collections, the ability to retrieve words or phrases with specific attributes is limited, for example, to retrieve words with a specific meaning within a specific section of a text. Many systems work with coarse bibliographic metadata. To enable fine-grained retrieval, it is necessary to encode texts with granular metadata. Sample texts were encoded with granular metadata. Five categories of metadata that can be used to capture additional data about texts were used, namely, morphological, syntactic, semantic, functional and bibliographic. A prototype was developed to parse the encoded texts and store the information in a database. The prototype was used to test the extent to which words or phrases with specific attributes could be retrieved. Retrieval on a detailed level was possible through the prototype. Retrieval using all five categories of metadata was demonstrated, as well as advanced searches using metadata from different categories in a single search. This article demonstrates that when granular metadata is used to encode texts, retrieval is improved. Relevant information can be selected, and irrelevant information can be excluded, even within a text.
... It is the largest corpus of the Russian language, which makes it a valuable tool for studies of language evolution (Richey & Taylor 2020; Solovyev et al. 2020), although it is criticised by some as being unbalanced (Pechenick et al. 2015; Koplenig 2017). The second version of the GBN Russian subcorpus was released in 2012 (Lin et al. 2012). It included texts of 510,310 Russian books published between 1607 and 2009, with a total number of 67 billion words. ...
Article
Full-text available
We investigated diachrony of distributional semantics of two competing Russian colour terms (CTs) for ‘brown’, buryj (11th century) and koričnevyj (17th century), using the Russian subcorpus of Google Books Ngram (2020). Time-series analysis (1800–2019) of bigrams gauged each term’s frequencies of occurrence and changes in combinability with nouns for natural objects, artefacts, abstract concepts and figurative expressions. In frequency, koričnevyj overtook buryj in the 1920s, confirming its basic status in modern Russian. The perplexity index indicates that koričnevyj steadily increased the range of denoted objects, with artefacts being front runners in the buryj-to-koričnevyj transition. The results corroborate Rakhilina’s (2007a, 2007b, 2008) hypothesis that an incipient CT initially collocates with nouns denoting artefacts but gradually expands to the realm of natural objects, supplanting an old CT. Moreover, koričnevyj and buryj are discerned by denotations and connotations. The present findings provide insights into general mechanisms of the linguistic evolution of an emergent basic CT.
... googleapis.com/books/ngrams/books/datasetsv3.html). The third version of the GBN database encompasses over 16,000,000 books in English obtained from diverse sources, including university libraries and publishers (Lin et al., 2012; Michel et al., 2011). The study focused on the period between 1920 and 2019, encompassing the initial usage of the term robot in a play published by the Czech Karel Čapek in 1920 (Čapek, 1920) until the latest available data in 2019. ...
Article
Human–robot interactions (HRIs) are significantly influenced by the personality of robots. However, research on robot personality from the perspective of big data text mining remains scarce. To address this gap, our study delves into the portrayal of Big Five personality traits in robots across millions of Google Books, spanning from 1920 to 2019. In this study, we identify intriguing trends in how robot personalities have been described over the years. Notably, we observe that the trait of openness has consistently been the most frequently cited Big Five personality factor throughout the twentieth century. Following closely are conscientiousness, agreeableness, extraversion, and neuroticism. However, a noteworthy shift occurs in the late twentieth century, where extraversion garners increasing attention, ultimately becoming the most prominent Big Five personality factor after 2010. Furthermore, our analysis uncovers a fascinating positivity bias in the portrayal of robot personality. Robots are more commonly depicted as extroverted rather than introverted and open rather than reserved. These trends also correlate with the evolution of core personality words. For instance, the term intellectual robot gradually yields to intelligent robot over the course of the twentieth century. Additionally, in the twenty-first century, social robot emerges as the most prevailing topic. Understanding the interplay between human records and their perception of robot personalities provides valuable insights into both real descriptions and ideal expectations of robots. This research serves as a critical reference for further advancements in robot personality studies, shedding light on the dynamic nature of HRIs.
... Word frequency is obtained from version 3 of the Google Books Ngram dataset [33] (obtained from https://storage.googleapis.com/books/ngrams/books/datasetsv3.html), which is available for English and Spanish, among many other languages. Google Books includes over 200 billion words in 40 million documents, providing a good approximation of the likelihood of encountering a word in print, even a very rare one. ...
Conference Paper
Full-text available
Modern large language models (LLMs), such as GPT-4 or ChatGPT, are capable of producing fluent text in natural languages, making their output hard to manually differentiate from human-written text. However, there are many real-world scenarios where this distinction needs to be made, raising the need for automatic solutions. Here we present our approach to the problem, developed for the 'AuTexTification: Automated Text Identification' shared task. The core of our model is aimed at measuring 'predictability', i.e. how likely a given text is according to several LLMs. This information, supplemented with features describing grammatical correctness, word frequency, and linguistic patterns, and combined with a fine-tuned LLM representation, is used to train a neural network on the provided datasets. The resulting model achieves the best performance among the submissions in subtask 1 (differentiating between human- and machine-generated text), both for English and Spanish. We also provide the results of our internal topic-based evaluation to show strengths and weaknesses of different variants of our contribution.
... in which A is the corrected word, and B comprises the typed letters. The probability of the corrected word was estimated by the square-root log-frequency of the word's occurrence based on Google N Gram's 2018 database [78]. The probability of the typed letters given the corrected word was defined as the Euclidean error between the two. ...
Preprint
Paralysis affects roughly 1 in 50 Americans. While there is no cure for the condition, brain-computer interfaces (BCI) can allow users to control a device with their mind, bypassing the paralyzed region. Non-invasive BCIs still have high error rates, which is hypothesized to be reduced with concurrent targeted neuromodulation. This study examines whether transcranial focused ultrasound (tFUS) modulation can improve BCI outcomes, and what the underlying mechanism of action might be through high-density electroencephalography (EEG)-based source imaging (ESI) analyses. V5-targeted tFUS significantly reduced the error for the BCI speller task. ESI analyses showed significantly increased theta activity in the tFUS condition at both V5 and downstream in the dorsal visual processing pathway. Correlation analysis indicates that the dorsal processing pathway connection was preserved during tFUS stimulation, whereas extraneous connections were severed. These results suggest that V5-targeted tFUS’s mechanism of action is to raise the brain’s feature-based attention to visual motion.
... Google Books English-All Embeddings. The Google Books English-all data set (hereafter referred to simply as Books) is taken from the Google Books n-grams data set (second version; Lin et al., 2012), with approximately 850 billion words of all English books archived over 200 years from 1800 to 1999. Although not all books are included in the Books data set, the coverage is estimated to be approximately 4%-6% of all books ever published from 1800 to 1999 (Michel et al., 2011), providing a wide and diverse coverage of book-based text. ...
Article
Full-text available
The social world is carved into a complex variety of groups each associated with unique stereotypes that persist and shift over time. Innovations in natural language processing (word embeddings) enabled this comprehensive study on variability and correlates of change/stability in both manifest and latent stereotypes for 72 diverse groups tracked across 115 years of four English-language text corpora. Results showed, first, that group stereotypes changed by a moderate-to-large degree in manifest content (i.e., top traits associated with groups) but remained relatively more stable in latent structure (i.e., average cosine similarity of top traits’ embeddings and vectors of valence, warmth, or competence). This dissociation suggests new insights into how stereotypes and their consequences may endure despite documented changes in other aspects of group representations. Second, results showed substantial variability of change/stability across the 72 groups, with some groups revealing large shifts in manifest and latent content, but others showing near-stability. Third, groups also varied in how consistently they were stereotyped across texts, with some groups showing divergent content, but others showing near-identical representations. Fourth, this variability in change/stability across groups was predicted from a combination of linguistic (e.g., frequency of mentioning the group; consistency of group stereotypes across texts) and social (e.g., the type of group) correlates. Groups that were more frequently mentioned in text changed more than those rarely mentioned; sociodemographic groups changed more than other group types (e.g., body-related stigmas, mental illnesses, occupations), providing the first quantitative evidence of specific group features that may support historical stereotype change.
... In addition, many scientific terms connect to the environment and economy. Fig. 1 shows the frequency of occurrence of the ecological footprint (red), circular economy (blue) and sustainable development (green) expressions, shown as the percentage of documents in the Google Books database (Lin et al., 2012). These expressions were chosen because of how frequently they are cited in the academic literature. ...
Article
Full-text available
In recent years, researchers have sought to specify precisely what is meant by the ecological footprint, and there are several methods for calculating it. This paper features a new calculation method for determining the ecological footprint (EFP). The basis of our model is the dynamic Leontief model. With our method, the dynamic ecological footprint is obtained as a sequence of footprints, one per period. We also calculate the ecological footprint for both closed and open economies. Our model contains elements taken from the Leontief model: capital accumulation, and integration of exports and imports into the model through input-output panels. All periods are treated as interdependent, rather than as a series of stand-alone data. Most notably, the model separates capital accumulation from the final use of capital, i.e., investment and final consumption. We illustrate the results with numerical examples.
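For context, a minimal sketch of the standard dynamic Leontief accounting identity on which such a model can be built (not necessarily the authors' exact formulation):

\[
x_t = A x_t + B\,(x_{t+1} - x_t) + c_t + e_t - m_t ,
\]

where \(x_t\) is the gross output vector in period \(t\), \(A\) the matrix of intermediate-input coefficients, \(B\) the capital coefficient matrix (so \(B(x_{t+1} - x_t)\) captures capital accumulation), \(c_t\) final consumption, and \(e_t\), \(m_t\) exports and imports; a per-period ecological footprint can then be obtained by applying land- or resource-intensity coefficients to the output vector \(x_t\) solved from this system.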
... Name3 contained either /u/ or /a/ in word-final position (Manu, Nina, and Lola). Regarding possible collocations of the selected names for each coordinate, there was no particular co-occurrence of two adjacent names (as in, e.g., "Bonnie and Clyde") in the dlexDB corpora (Heister et al. 2011) or in printed sources between 1500 and 2021, as ascertained by the Google Ngram Viewer (Lin et al. 2012). ...
Book
Full-text available
In spoken language comprehension, the hearer is faced with a more or less continuous stream of auditory information. Prosodic cues, such as pitch movement, pre-boundary lengthening, and pauses, incrementally help to organize the incoming stream of information into prosodic phrases, which often coincide with syntactic units. Prosody is hence central to spoken language comprehension and some models assume that the speaker produces prosody in a consistent and hierarchical fashion. While there is manifold empirical evidence that prosodic boundary cues are reliably and robustly produced and effectively guide spoken sentence comprehension across different populations and languages, the underlying mechanisms and the nature of the prosody-syntax interface still have not been identified sufficiently. This is also reflected in the fact that most models on sentence processing completely lack prosodic information. This edited book volume is grounded in a workshop that was held in 2021 at the annual conference of the Deutsche Gesellschaft für Sprachwissenschaft (DGfS). The five chapters cover selected topics on the production and comprehension of prosodic cues in various populations and languages, all focusing in particular on processing of prosody at structurally relevant prosodic boundaries. Specifically, the book comprises cross-linguistic evidence as well as evidence from non-native listeners, infants, adults, and elderly speakers, highlighting the important role of prosody in both language production and comprehension.
... Nowadays, the Google Books Ngram Corpus (GBNC) offers a unique linguistic landscape that benefits from centuries of development of rich grammatical and lexical resources, as well as their cultural context [36,98]. Arguably, lexicographical and historiographical study of these rich metadata corpora, which span four centuries, promises to articulate how scientific narratives emerge and evolve (Fig. 3). ...
Chapter
Full-text available
Understanding naming conventions for strengthening the integrity of naming human diseases remains nominal rather than substantial in the medical community. Since the current nosology-based criteria for human diseases cannot provide a one-size-fits-all corrective mechanism, numerous idiomatic but erroneous names frequently appear in scientific literature and news outlets, at the cost of sociocultural repercussions. To mitigate such impacts, we examine the ethical oversight of current naming practices and introduce some ethical principles for formulating an improved naming scheme. Relatedly, we organize rich metadata to unveil the nosological evolution of anachronistic names and demonstrate the heuristic approaches to curate exclusive substitutes for inopportune nosology based on deep learning models and post-hoc explanations. Our findings indicate that the nosological evolution of anachronistic names may have societal consequences in the absence of a corrective mechanism. Arguably, as an exemplar, Rubella could serve as a destigmatized replacement for German measles. The illustrated rationales and approaches could provide hallmark references to the ethical introspection of naming practices and pertinent credit allocations. Keywords: Disease Taxonomy; ICD-11; Health Communication; Credit Allocation; German measles; Long COVID; Monkeypox; Deep Learning
Article
My paper with Patrick Hanks on PMI (pointwise mutual information) was the most successful paper I ever wrote, or ever will write. I believe the paper was successful because it appealed to a number of different audiences for a number of different purposes. Patrick Hanks was more interested in applications in lexicography and I was more interested in applications in engineering. The first section on background will discuss the role our PMI paper played in moving computational linguistics from Rationalism to Empiricism. The second section will connect the dots between PMI and much of the recent excitement in Artificial Intelligence over bots like DeepSeek and large language models (LLMs).
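For reference, the measure in question is pointwise mutual information, PMI(x, y) = log2[ P(x, y) / (P(x) P(y)) ], which compares how often two words occur together with how often they would co-occur by chance. A toy estimate from bigram counts (the counts below are invented for illustration):

    import math
    from collections import Counter

    # Invented bigram counts, for illustration only.
    pair_counts = Counter({("strong", "tea"): 30, ("powerful", "tea"): 1,
                           ("strong", "computer"): 2, ("powerful", "computer"): 40})
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()      # marginal counts of first and second words
    for (w1, w2), c in pair_counts.items():
        left[w1] += c
        right[w2] += c

    def pmi(x, y):
        # PMI(x, y) = log2( P(x, y) / (P(x) P(y)) ), with probabilities
        # estimated as relative frequencies over the observed word pairs.
        return math.log2((pair_counts[(x, y)] / total) /
                         ((left[x] / total) * (right[y] / total)))

    print(pmi("strong", "tea"))       # positive: the pair occurs more often than chance
    print(pmi("powerful", "tea"))     # negative: the pair occurs less often than chance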
Article
Full-text available
Many research projects face legal restrictions on the publication of texts. In recent decades, several projects have circumvented these restrictions by both deleting some parts of the data and publishing derived data from the original files. We discuss the limitations of the commonly used ad-hoc solutions and the deprecation of the FAIR status that they cause. In contrast, we propose to model derived data in TEI, and present several variants with five corpora from different languages, genres, and periods. We also present the implementation of several features for publishing such data in the TextGrid Repository and the publication of derived data from a corpus of Spanish novels and a corpus of American plays.
Article
This work considers the implementation of a diachronic predictor of valence, arousal and dominance ratings of English words. The estimation of affective ratings is based on word co-occurrence statistics in the large diachronic Google Books Ngram corpus. Affective ratings from the NRC VAD dictionary are used as target values for training. When tested on synchronic data, the obtained Pearson's correlation coefficients between human affective ratings and the machine ratings are 0.843, 0.779 and 0.792 for valence, arousal and dominance, respectively. We also provide a detailed analysis of the accuracy of the predictor on diachronic data. The main result of the work is the creation of a diachronic affective dictionary of English words. Several examples are considered that illustrate jumps in the time series of affective ratings when a word gains a new meaning. This indicates that changes in affective ratings can serve as markers of lexical-semantic changes.
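The abstract does not spell out the regression model, so the sketch below is only a rough stand-in for the general recipe: represent each word by a co-occurrence-based vector for a given time slice and fit a regressor against its NRC VAD rating. The random placeholder vectors, the ratings, and the choice of ridge regression are all illustrative assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # Placeholder data: one co-occurrence-based vector per word for one decade,
    # paired with a (here random) valence rating per word.
    rng = np.random.default_rng(0)
    vectors = rng.normal(size=(5000, 300))
    valence = rng.uniform(0.0, 1.0, size=5000)

    X_train, X_test, y_train, y_test = train_test_split(
        vectors, valence, test_size=0.2, random_state=0)
    model = Ridge(alpha=1.0).fit(X_train, y_train)

    # Pearson correlation between predicted and held-out ratings,
    # the evaluation statistic reported above.
    pred = model.predict(X_test)
    print(np.corrcoef(pred, y_test)[0, 1])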
Article
Preprint at: https://osf.io/preprints/psyarxiv/3y9df
Analyzing data from the verbal fluency task (e.g., “name all the animals you can in a minute”) is of interest to both memory researchers and clinicians due to its broader implications for memory search and retrieval. Recent work has proposed several computational models to examine nuanced differences in search behavior, which can provide insights into the mechanisms underlying memory search. A prominent account of memory search within the fluency task was proposed by Hills et al. (2012), where mental search is modeled after how animals forage for food in physical space. Despite the broad potential utility of these models to scientists and clinicians, there is currently no open-source program to apply and compare existing foraging models or clustering algorithms without extensive, often redundant programming. To remove this barrier to studying search patterns in the fluency task, we created forager, a Python package (https://github.com/thelexiconlab/forager) and web interface (https://forager.research.bowdoin.edu/). forager provides multiple automated methods to designate clusters and switches within a fluency list, implements a novel set of computational models that can examine the influence of multiple lexical sources (semantic, phonological, and frequency) on memory search using semantic embeddings, and also enables researchers to evaluate relative model performance at the individual and group level. The package and web interface cater to users with various levels of programming experience. In this work, we introduce forager’s basic functionality and use cases that demonstrate its utility with pre-existing behavioral and clinical data sets of the semantic fluency task.
Chapter
Full-text available
Words are the building blocks of phrases, sentences, and documents. Word representation is thus critical for natural language processing (NLP). In this chapter, we introduce the approaches for word representation learning to show the paradigm shift from symbolic representation to distributed representation. We also describe the valuable efforts in making word representations more informative and interpretable. Finally, we present applications of word representation learning to NLP and interdisciplinary fields, including psychology, social sciences, history, and linguistics.
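As one concrete instance of the distributed representations discussed in the chapter, a word2vec model can be trained in a few lines with gensim; the three-sentence corpus below is merely a stand-in for real data.

    from gensim.models import Word2Vec

    # Tiny stand-in corpus: one tokenized sentence per list.
    sentences = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["the", "farmer", "plows", "the", "field"],
    ]

    # Train a small skip-gram model; real applications use millions of sentences.
    model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=50)

    print(model.wv["king"][:5])                   # first few dimensions of a learned vector
    print(model.wv.similarity("king", "queen"))   # cosine similarity between two words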
Preprint
Ideology is frequently discussed as an important factor in maintaining the current capitalist economic order. To date, however, descriptions of long-term changes in capitalist ideology remain mostly qualitative. By using computational methods (word embeddings) on two large corpora of historical English texts (Google Books & Corpus of Historical American English), this article is the first to provide quantitative data on such long-term changes between 1850 and 1999. It thus tests three prominent claims found in the existing literature and presents the following main findings: 1) The rise of neoliberalism is represented by a weakening discourse on economic regulation rather than by an increase in liberal discourse. 2) Meritocratic values are promoted most intensely in times of high inequality. 3) Intrinsic work motivation becomes more important over time, but extrinsic motivation remains dominant. These findings have various theoretical and practical implications.
Conference Paper
Full-text available
I examine what would be necessary to move part-of-speech tagging performance from its current level of about 97.3% token accuracy (56% sentence accuracy) to close to 100% accuracy. I suggest that it must still be possible to greatly increase tagging performance and examine some useful improvements that have recently been made to the Stanford Part-of-Speech Tagger. However, an error analysis of some of the remaining errors suggests that there is limited further mileage to be had either from better machine learning or better features in a discriminative sequence classifier. The prospects for further gains from semi-supervised learning also seem quite limited. Rather, I suggest and begin to demonstrate that the largest opportunity for further progress comes from improving the taxonomic basis of the linguistic resources from which taggers are trained. That is, from improved descriptive linguistics. However, I conclude by suggesting that there are also limits to this process. The status of some words may not be able to be adequately captured by assigning them to one of a small number of categories. While conventions can be used in such cases to improve tagging consistency, they lack a strong linguistic basis.
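The relation between the two figures quoted above is largely arithmetic: with independent errors, 97.3% per-token accuracy over a sentence of roughly 20-25 tokens leaves only about 0.973^20 ≈ 0.58 down to 0.973^25 ≈ 0.50 probability of getting every tag right, bracketing the reported 56%. A back-of-the-envelope check (token errors are of course not truly independent):

    # Rough whole-sentence accuracy if per-token errors were independent.
    token_accuracy = 0.973
    for sentence_length in (15, 20, 25):
        print(sentence_length, round(token_accuracy ** sentence_length, 3))
    # 20-25 tokens per sentence gives roughly 0.50-0.58, in line with the ~56% reported.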
Conference Paper
Full-text available
We use search engine results to address a particularly difficult cross-domain language processing task, the adaptation of named entity recognition (NER) from news text to web queries. The key novelty of the method is that we submit a token with context to a search engine and use similar contexts in the search results as additional information for correctly classifying the token. We achieve strong gains in NER performance on news, in-domain and out-of-domain, and on web queries.
Article
Full-text available
Parsing algorithms that process the input from left to right and construct a single derivation have often been considered inadequate for natural language parsing because of the massive ambiguity typically found in natural language grammars. Nevertheless, it has been shown that such algorithms, combined with treebank-induced classifiers, can be used to build highly accurate disambiguating parsers, in particular for dependency-based syntactic representations. In this article, we first present a general framework for describing and analyzing algorithms for deterministic incremental dependency parsing, formalized as transition systems. We then describe and analyze two families of such algorithms: stack-based and list-based algorithms. In the former family, which is restricted to projective dependency structures, we describe an arc-eager and an arc-standard variant; in the latter family, we present a projective and a non-projective variant. For each of the four algorithms, we give proofs of correctness and complexity. In addition, we perform an experimental evaluation of all algorithms in combination with SVM classifiers for predicting the next parsing action, using data from thirteen languages. We show that all four algorithms give competitive accuracy, although the non-projective list-based algorithm generally outperforms the projective algorithms for languages with a non-negligible proportion of non-projective constructions. However, the projective algorithms often produce comparable results when combined with the technique known as pseudo-projective parsing. The linear time complexity of the stack-based algorithms gives them an advantage with respect to efficiency both in learning and in parsing, but the projective list-based algorithm turns out to be equally efficient in practice. Moreover, when the projective algorithms are used to implement pseudo-projective parsing, they sometimes become less efficient in parsing (but not in learning) than the non-projective list-based algorithm. Although most of the algorithms have been partially described in the literature before, this is the first comprehensive analysis and evaluation of the algorithms within a unified framework.
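To make the transition-system formalization concrete, here is a minimal arc-standard, stack-based parser. In the article the next action is predicted by a treebank-trained SVM classifier; this sketch replaces that with a hand-written action sequence, so it only demonstrates the mechanics of the transitions.

    # Minimal arc-standard transition system (stack-based, projective).
    def parse(n_words, actions):
        stack, buffer, arcs = [0], list(range(1, n_words + 1)), []   # 0 is the artificial root
        for action in actions:
            if action == "SHIFT":                # move the next input word onto the stack
                stack.append(buffer.pop(0))
            elif action == "LEFT-ARC":           # top of stack becomes head of the word below it
                dependent = stack.pop(-2)
                arcs.append((stack[-1], dependent))
            elif action == "RIGHT-ARC":          # word below the top becomes head of the top
                dependent = stack.pop()
                arcs.append((stack[-1], dependent))
        return arcs

    words = ["ROOT", "Economists", "write", "books"]
    actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "RIGHT-ARC", "RIGHT-ARC"]
    for head, dep in parse(3, actions):
        print(f"{words[head]} -> {words[dep]}")
    # Prints: write -> Economists, write -> books, ROOT -> write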
Conference Paper
Full-text available
This paper describes a system for extracting typed dependency parses of English sentences from phrase structure parses. In order to capture inherent relations occurring in corpus texts that can be critical in real-world applications, many NP relations are included in the set of grammatical relations used. We provide a comparison of our system with Minipar and the Link parser. The typed dependency extraction facility described here is integrated in the Stanford Parser, available for download.
Article
Full-text available
We constructed a corpus of digitized texts containing about 4% of all books ever printed. Analysis of this corpus enables us to investigate cultural trends quantitatively. We survey the vast terrain of ‘culturomics,’ focusing on linguistic and cultural phenomena that were reflected in the English language between 1800 and 2000. We show how this approach can provide insights about fields as diverse as lexicography, the evolution of grammar, collective memory, the adoption of technology, the pursuit of fame, censorship, and historical epidemiology. Culturomics extends the boundaries of rigorous quantitative inquiry to a wide array of new phenomena spanning the social sciences and the humanities.
Article
Full-text available
This paper describes the development of QuestionBank, a corpus of 4000 parse-annotated questions for (i) use in training parsers employed in QA, and (ii) evaluation of question parsing. We present a series of experiments to investigate the effectiveness of QuestionBank as both an exclusive and supplementary training resource for a state-of-the-art parser in parsing both question and non-question test sets. We introduce a new method for recovering empty nodes and their antecedents (capturing long distance dependencies) from parser output in CFG trees using LFG f-structure reentrancies. Our main findings are (i) using QuestionBank training data improves parser performance to 89.75% labelled bracketing f-score, an increase of almost 11% over the baseline; (ii) back-testing experiments on non-question data (Penn-II WSJ Section 23) show that the retrained parser does not suffer a performance drop on non-question material; (iii) ablation experiments show that the size of training material provided by QuestionBank is sufficient to achieve optimal results; (iv) our method for recovering empty nodes captures long distance dependencies in questions from the ATIS corpus with high precision (96.82%) and low recall (39.38%). In summary, QuestionBank provides a useful new resource in parser-based QA research.
Article
Full-text available
Human language is based on grammatical rules. Cultural evolution allows these rules to change over time. Rules compete with each other: as new rules rise to prominence, old ones die away. To quantify the dynamics of language evolution, we studied the regularization of English verbs over the past 1,200 years. Although an elaborate system of productive conjugations existed in English's proto-Germanic ancestor, Modern English uses the dental suffix, '-ed', to signify past tense. Here we describe the emergence of this linguistic rule amidst the evolutionary decay of its exceptions, known to us as irregular verbs. We have generated a data set of verbs whose conjugations have been evolving for more than a millennium, tracking inflectional changes to 177 Old-English irregular verbs. Of these irregular verbs, 145 remained irregular in Middle English and 98 are still irregular today. We study how the rate of regularization depends on the frequency of word usage. The half-life of an irregular verb scales as the square root of its usage frequency: a verb that is 100 times less frequent regularizes 10 times as fast. Our study provides a quantitative analysis of the regularization process by which ancestral forms gradually yield to an emerging linguistic rule.
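The scaling law quoted above can be written compactly as t_half ∝ sqrt(f), where f is usage frequency, and the worked example in the abstract follows directly:

    import math

    # Half-life of an irregular verb scales as the square root of its usage frequency.
    def relative_half_life(frequency_ratio):
        # If a verb is used frequency_ratio times as often as a reference verb,
        # its half-life is sqrt(frequency_ratio) times as long.
        return math.sqrt(frequency_ratio)

    print(relative_half_life(1 / 100))   # 0.1: a verb 100x less frequent regularizes 10x as fast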
Conference Paper
Web-search queries are known to be short, but little else is known about their structure. In this paper we investigate the applicability of part-of-speech tagging to typical English-language web search-engine queries and the potential value of these tags for improving search results. We begin by identifying a set of part-of-speech tags suitable for search queries and quantifying their occurrence. We find that proper nouns constitute 40% of query terms, and proper nouns and nouns together constitute over 70% of query terms. We also show that the majority of queries are noun-phrases, not unstructured collections of terms. We then use a set of queries manually labeled with these tags to train a Brill tagger and evaluate its performance. In addition, we investigate classification of search queries into grammatical classes based on the syntax of part-of-speech tag sequences. We also conduct preliminary investigative experiments into the practical applicability of leveraging query-trained part-of-speech taggers for information-retrieval tasks. In particular, we show that part-of-speech information can be a significant feature in machine-learned search-result relevance. These experiments also include the potential use of the tagger in selecting words for omission or substitution in query reformulation, actions which can improve recall. We conclude that training a part-of-speech tagger on labeled corpora of queries significantly outperforms taggers based on traditional corpora, and leveraging the unique linguistic structure of web-search queries can improve the search experience.
Conference Paper
Marking up queries with annotations such as part-of-speech tags, capitalization, and segmentation, is an important part of many approaches to query processing and understanding. Due to their brevity and idiosyncratic structure, search queries pose a challenge to existing annotation tools that are commonly trained on full-length documents. To address this challenge, we view the query as an explicit representation of a latent information need, which allows us to use pseudo-relevance feedback, and to leverage additional information from the document corpus, in order to improve the quality of query annotation.
Conference Paper
When search is against structured documents, it is beneficial to extract information from user queries in a format that is consistent with the backend data structure. As one step toward this goal, we study the problem of query tagging which is to assign each query term to a pre-defined category. Our problem could be approached by learning a conditional random field (CRF) model (or other statistical models) in a supervised fashion, but this would require substantial human-annotation effort. In this work, we focus on a semi-supervised learning method for CRFs that utilizes two data sources: (1) a small amount of manually-labeled queries, and (2) a large amount of queries in which some word tokens have derived labels, i.e., label information automatically obtained from additional resources. We present two principled ways of encoding derived label information in a CRF model. Such information is viewed as hard evidence in one setting and as soft evidence in the other. In addition to the general methodology of how to use derived labels in semi-supervised CRFs, we also present a practical method on how to obtain them by leveraging user click data and an in-domain database that contains structured documents. Evaluation on product search queries shows the effectiveness of our approach in improving tagging accuracies.
Conference Paper
We present a novel approach to parse web search queries for the purpose of automatic tagging of the queries. We will define a set of probabilistic context-free rules, which generates bags (i.e. multi-sets) of words. Using this new type of rule in combination with the traditional probabilistic phrase structure rules, we define a hybrid grammar, which treats each search query as a bag of chunks (i.e. phrases). A hybrid probabilistic parser is used to parse the queries. In order to take contextual information into account, a discriminative model is used on top of the parser to re-rank the n-best parse trees generated by the parser. Experiments show that our approach outperforms a basic model, which is based on Conditional Random Fields.
Conference Paper
Determining the semantic intent of web queries not only involves identifying their semantic class, which is a primary focus of previous works, but also understanding their semantic structure. In this work, we formally define the semantic structure of noun phrase queries as comprised of intent heads and intent modifiers. We present methods that automatically identify these constituents as well as their semantic roles based on Markov and semi-Markov conditional random fields. We show that the use of semantic features and syntactic features contributes significantly to improving the understanding performance.
Article
To facilitate future research in unsupervised induction of syntactic structure and to standardize best-practices, we propose a tagset that consists of twelve universal part-of-speech categories. In addition to the tagset, we develop a mapping from 25 different treebank tagsets to this universal set. As a result, when combined with the original treebank data, this universal tagset and mapping produce a dataset consisting of common parts-of-speech for 22 different languages. We highlight the use of this resource via two experiments, including one that reports competitive accuracies for unsupervised grammar induction without gold standard part-of-speech tags.
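The twelve coarse categories of that universal tagset, and the flavour of the treebank mapping, can be shown with a small excerpt; the Penn Treebank correspondences below are representative examples rather than the full published table.

    # The twelve universal part-of-speech categories.
    UNIVERSAL_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
                      "ADP", "NUM", "CONJ", "PRT", ".", "X"]

    # A representative slice of the Penn Treebank mapping (examples only).
    PTB_TO_UNIVERSAL = {
        "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
        "VB": "VERB", "VBD": "VERB", "VBZ": "VERB", "VBP": "VERB", "MD": "VERB",
        "JJ": "ADJ", "RB": "ADV", "PRP": "PRON", "DT": "DET",
        "IN": "ADP", "CD": "NUM", "CC": "CONJ", "RP": "PRT",
        ".": ".", "FW": "X",
    }

    def to_universal(tagged_sentence):
        # Map fine-grained treebank tags to the coarse universal tagset.
        return [(word, PTB_TO_UNIVERSAL.get(tag, "X")) for word, tag in tagged_sentence]

    print(to_universal([("Books", "NNS"), ("are", "VBP"), ("fun", "JJ"), (".", ".")]))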
Article
We describe new algorithms for training tagging models, as an alternative to maximum-entropy models or conditional random fields (CRFs). The algorithms rely on Viterbi decoding of training examples, combined with simple additive updates. We describe theory justifying the algorithms through a modification of the proof of convergence of the perceptron algorithm for classification problems. We give experimental results on part-of-speech tagging and base noun phrase chunking, in both cases showing improvements over results for a maximum-entropy tagger.
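The training loop behind those results is the structured perceptron: decode each training sentence with the current weights and, when the prediction is wrong, add the feature counts of the gold tag sequence and subtract those of the predicted one. The stripped-down sketch below uses a greedy left-to-right decoder in place of full Viterbi and a deliberately tiny feature set, so it illustrates the update rule rather than the reported system.

    from collections import defaultdict

    def features(word, prev_tag, tag):
        return [("word+tag", word, tag), ("prevtag+tag", prev_tag, tag)]

    def decode(words, weights, tagset):
        # Greedy left-to-right decoding (a stand-in for Viterbi).
        tags, prev = [], "<s>"
        for w in words:
            best = max(tagset, key=lambda t: sum(weights[f] for f in features(w, prev, t)))
            tags.append(best)
            prev = best
        return tags

    def train(data, tagset, epochs=5):
        weights = defaultdict(float)
        for _ in range(epochs):
            for words, gold in data:
                pred = decode(words, weights, tagset)
                if pred != gold:
                    prev_g = prev_p = "<s>"
                    for w, g, p in zip(words, gold, pred):
                        for f in features(w, prev_g, g):
                            weights[f] += 1.0      # additive update toward gold features
                        for f in features(w, prev_p, p):
                            weights[f] -= 1.0      # and away from predicted features
                        prev_g, prev_p = g, p
        return weights

    data = [(["dogs", "bark"], ["NOUN", "VERB"]), (["cats", "sleep"], ["NOUN", "VERB"])]
    w = train(data, tagset=["NOUN", "VERB"])
    print(decode(["dogs", "sleep"], w, ["NOUN", "VERB"]))   # expected: ['NOUN', 'VERB']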
Penn Parsed Corpus of Modern British English
  • A Kroch
  • B Santorini
  • A Diertani
Google Books (American English) Corpus (155 billion words)
  • M Davies