Book · PDF available

Natural Language Processing with Python

Authors: Steven Bird, Ewan Klein, Edward Loper
... Frequently, these are high-frequency words such as articles ('a', 'an', 'the'), prepositions ('by', 'of', 'on'), and pronouns ('he', 'she', 'him'), which carry little predictive power for statistical analysis of semantic content [52]. We use the English-language stop list in the Natural Language Toolkit, which contains 153 words [53]. Additionally, we filtered out words occurring five or fewer times, which excludes both uncommon words and infrequent non-words generated by OCR errors. ...
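A minimal sketch of the filtering step described in the snippet above, using NLTK's English stop list plus a frequency threshold; the function name and threshold handling are illustrative assumptions, not the cited authors' code.

```python
# Sketch: drop NLTK English stop words and words occurring five or fewer times.
import nltk
from collections import Counter
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

def filter_tokens(tokens, min_count=6):
    stops = set(stopwords.words("english"))   # the 153-word English stop list
    counts = Counter(tokens)
    return [t for t in tokens
            if t.lower() not in stops and counts[t] >= min_count]
```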
Preprint
Full-text available
We show how faceted search using a combination of traditional classification systems and mixed-membership topic models can go beyond keyword search to inform resource discovery, hypothesis formulation, and argument extraction for interdisciplinary research. Our test domain is the history and philosophy of scientific work on animal mind and cognition. The methods can be generalized to other research areas and ultimately support a system for semi-automatic identification of argument structures. We provide a case study applying the methods to the problem of identifying and extracting arguments about anthropomorphism during a critical period in the development of comparative psychology. We show how a combination of classification systems and mixed-membership models trained over large digital libraries can inform resource discovery in this domain. Through a novel approach of "drill-down" topic modeling, which simultaneously reduces both the size of the corpus and the unit of analysis, we are able to reduce a large collection of full-text volumes to a much smaller set of pages within six focal volumes containing arguments of interest to historians and philosophers of comparative psychology. The volumes identified in this way did not appear among the first ten results of the keyword search in the HathiTrust digital library, and the pages bear the kind of "close reading" needed to generate original interpretations that is at the heart of scholarly work in the humanities. Zooming back out, we provide a way to place the books onto a map of science originally constructed from very different data and for different purposes. The multilevel approach advances understanding of the intellectual and societal contexts in which writings are interpreted.
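A hypothetical sketch of the mixed-membership (LDA-style) topic modeling step over a page-level corpus; gensim, the toy pages, and all parameters are assumptions for illustration, not the authors' pipeline.

```python
# Sketch: fit an LDA topic model over tokenized pages and inspect the topics.
from gensim import corpora, models

pages = [["anthropomorphism", "animal", "mind"],
         ["comparative", "psychology", "cognition"]]     # toy tokenized pages
dictionary = corpora.Dictionary(pages)
bow_corpus = [dictionary.doc2bow(page) for page in pages]

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```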
... The seven classification algorithms we compare are the following: Naive Bayes (GaussianNB) is a popular algorithm for binary classification (Bird et al., 2009), which is based on Bayes' theorem with strong independence assumptions between features. Compared to other classifiers, it does not require a large training set. ...
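The snippet names GaussianNB, the Gaussian Naive Bayes class provided by scikit-learn; a minimal sketch of fitting it for binary classification (the toy feature vectors and labels are illustrative assumptions).

```python
# Sketch: Gaussian Naive Bayes on a tiny two-class toy dataset.
from sklearn.naive_bayes import GaussianNB

X_train = [[0.2, 1.0], [0.1, 0.9], [0.9, 0.1], [0.8, 0.2]]  # feature vectors
y_train = [0, 0, 1, 1]                                       # binary labels

clf = GaussianNB()
clf.fit(X_train, y_train)
print(clf.predict([[0.85, 0.15]]))   # -> [1]
```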
Preprint
App stores include an increasing amount of user feedback in the form of app ratings and reviews. Researchers, and recently also tool vendors, have proposed analytics and data mining solutions that make this feedback available to developers and analysts, e.g., for supporting release decisions. Research has also shown that positive feedback improves apps' downloads and sales figures and thus their success. As a side effect, a market for fake, incentivized app reviews has emerged, with as yet unclear consequences for developers, app users, and app store operators. This paper studies fake reviews, their providers, their characteristics, and how well they can be automatically detected. We conducted disguised questionnaires with 43 fake review providers and studied their review policies to understand their strategies and offers. By comparing 60,000 fake reviews with 62 million reviews from the Apple App Store, we found significant differences, e.g., between the corresponding apps, reviewers, rating distributions, and frequencies. This inspired the development of a simple classifier to automatically detect fake reviews in app stores. On a labelled and imbalanced dataset containing one-tenth fake reviews, a proportion reported in other domains, our classifier achieved a recall of 91% and an AUC/ROC value of 98%. We discuss our findings and their impact on software engineering, app users, and app store operators.
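A hedged sketch of the kind of evaluation reported above: a simple classifier scored by recall and ROC AUC on a labelled set with roughly one-tenth positive ("fake") examples. The synthetic features, the logistic-regression choice, and the split are assumptions for illustration, not the paper's pipeline.

```python
# Sketch: evaluate a simple classifier on an imbalanced (~10% positive) dataset.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score, roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) < 0.1).astype(int)          # ~10% "fake" labels
X[y == 1] += 1.5                                   # make the classes separable

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)

print("recall :", recall_score(y_te, clf.predict(X_te)))
print("AUC/ROC:", roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
```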
... For both methods, standard text preprocessing is conducted: punctuation is removed and words are lemmatized with NLTK [55]. Words that are meaningless in the context of image retrieval, such as 'image' and 'picture', as well as standard English stopwords, are also removed. ...
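A minimal sketch of that preprocessing with NLTK (punctuation removal, lemmatization, stop-word filtering); the extra image-retrieval stop words and function name are illustrative assumptions.

```python
# Sketch: strip punctuation, tokenize, lemmatize, and drop stop words with NLTK.
import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

for resource in ("punkt", "wordnet", "stopwords"):
    nltk.download(resource, quiet=True)

lemmatizer = WordNetLemmatizer()
extra_stops = {"image", "picture"}                 # domain-specific stop words
stops = set(stopwords.words("english")) | extra_stops

def preprocess(text):
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = nltk.word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stops]
```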
Preprint
In order to retrieve unlabeled images by textual queries, cross-media similarity computation is a key ingredient. Although novel methods are continuously introduced, little has been done to evaluate these methods together with large-scale query log analysis. Consequently, it is unclear how far these methods have brought us in answering real-user queries. Given baseline methods that compute cross-media similarity using relatively simple text/image matching, it is also unclear how much progress advanced models have made. This paper takes a pragmatic approach to answering the two questions. Queries are automatically categorized according to the proposed query visualness measure, and later connected to the evaluation of multiple cross-media similarity models on three test sets. This connection reveals that the success of the state of the art is mainly attributable to its good performance on visually oriented queries, while these queries account for only a small part of real-user queries. To quantify the current progress, we propose a simple text2image method that represents a novel test query by a set of images selected from a large-scale query log. Consequently, computing the cross-media similarity between the test query and a given image boils down to comparing the visual similarity between the given image and the selected images. Image retrieval experiments on the challenging Clickture dataset show that the proposed text2image compares favorably to recent deep-learning-based alternatives.
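A hypothetical sketch of the text2image idea: the query is represented by a set of images selected from a query log, and the cross-media similarity of a candidate image reduces to its visual similarity to that set (here averaged). The precomputed feature vectors and the aggregation choice are assumptions for illustration.

```python
# Sketch: score a candidate image by mean cosine similarity to the query's images.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def text2image_score(candidate_vec, query_image_vecs):
    """Cross-media similarity = mean visual similarity to the query's image set."""
    return float(np.mean([cosine(candidate_vec, v) for v in query_image_vecs]))
```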
... These raw texts are ready for natural language processing and the following steps use the Natural Language ToolKit (NLTK; Bird et al. 2009) tools extensively. ...
Preprint
The increase in the number of researchers, coupled with the ease of publishing and distributing scientific papers due to technological advancements, has resulted in a dramatic increase in astronomy literature. This has likely led to the predicament that the body of the literature is too large for traditional human consumption and that related, crucial knowledge goes undiscovered by researchers. In addition to the increased production of astronomical literature, recent decades have also brought several advancements in computational linguistics. Notably, machine-aided processing of the literature might make it possible to convert this stream of papers into a coherent knowledge set. In this paper, we present the application of computational linguistics techniques to astronomy literature. In particular, we developed a tool that finds similar articles purely based on the text content of an input paper. We find that our technique performs robustly in comparison with other tools that recommend articles given a reference paper (known as recommender systems). Our tool shows the power of combining computational linguistics with astronomy literature and suggests that additional research in this endeavor will likely produce even better tools that help researchers cope with the vast amounts of knowledge being produced.
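A hedged sketch of a purely text-based related-article finder of the kind described: TF-IDF vectors over paper texts and cosine similarity to rank candidates. scikit-learn and the toy corpus are illustrative assumptions, not the authors' implementation.

```python
# Sketch: rank papers by TF-IDF cosine similarity to a query paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

papers = ["dark matter halo simulations",
          "galaxy rotation curves and dark matter",
          "exoplanet transit photometry"]          # toy corpus of paper texts

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(papers)

query_idx = 0
scores = cosine_similarity(X[query_idx], X).ravel()
ranking = scores.argsort()[::-1][1:]               # most similar first, skip self
print(ranking)
```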
... Each configuration has between five and twenty steps (and blocks). We present type and token statistics in Table 1, where we use NLTK's (Bird, Klein, and Loper 2009) treebank tokenizer. This yields higher token counts than in previous works due to different assumptions about punctuation. ...
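A minimal sketch of token counting with NLTK's Treebank tokenizer, as used for the statistics mentioned above; the example sentence is illustrative.

```python
# Sketch: tokenize with the Treebank tokenizer; punctuation is split off,
# which raises token counts relative to whitespace tokenization.
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
tokens = tokenizer.tokenize("Place the red block on top of the blue one.")
print(tokens)
print(len(tokens))
```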
Preprint
In this paper, we study the problem of mapping natural language instructions to complex spatial actions in a 3D blocks world. We first introduce a new dataset that pairs complex 3D spatial operations with rich natural language descriptions that require complex spatial and pragmatic interpretations such as "mirroring", "twisting", and "balancing". This dataset, built on the simulation environment of Bisk, Yuret, and Marcu (2016), attains language that is significantly richer and more complex, while also doubling the size of the original dataset in the 2D environment with 100 new world configurations and 250,000 tokens. In addition, we propose a new neural architecture that achieves competitive results while automatically discovering an inventory of interpretable spatial operations (Figure 5).
... For this work, we used unigram models in Python, utilizing some components from NLTK [3]. Probability distributions were calculated using Witten-Bell smoothing [8]. ...
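A minimal sketch of a unigram model with Witten-Bell smoothing built from NLTK components, in the spirit of the snippet above; the toy corpus and the bins value are illustrative assumptions.

```python
# Sketch: unigram probabilities with Witten-Bell smoothing via NLTK.
from nltk.probability import FreqDist, WittenBellProbDist

tokens = "the cat sat on the mat".split()
fd = FreqDist(tokens)

# bins = assumed number of possible word types, including unseen ones
unigram = WittenBellProbDist(fd, bins=1000)
print(unigram.prob("the"))      # probability of a seen word
print(unigram.prob("dog"))      # non-zero probability for an unseen word
```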
Preprint
There is an ever growing number of users with accounts on multiple social media and networking sites. Consequently, there is increasing interest in matching user accounts and profiles across different social networks in order to create aggregate profiles of users. In this paper, we present models for Digital Stylometry, which is a method for matching users through stylometry inspired techniques. We experimented with linguistic, temporal, and combined temporal-linguistic models for matching user accounts, using standard and novel techniques. Using publicly available data, our best model, a combined temporal-linguistic one, was able to correctly match the accounts of 31% of 5,612 distinct users across Twitter and Facebook.
... • Multi-Layer Information: To obtain media with dual-layer information (emotional and semantic), sentiment analysis frameworks [56,57] were used to select media that contained at least one sentimental tag in its user-defined tag list. This was particularly crucial for the image data, as they relied solely on user-defined tags. ...
Article
Full-text available
Cross-modality understanding is essential for AI to tackle complex tasks that require both deterministic and generative capabilities, such as correlating music and visual art. The existing state-of-the-art methods of audio-visual correlation often rely on single-dimension information, focusing either on semantic or emotional attributes, thus failing to capture the full depth of these inherently complex modalities. Addressing this limitation, we introduce a novel approach that perceives music–image correlation as multilayered rather than as a direct one-to-one correspondence. To this end, we present a pioneering dataset with two segments: an artistic segment that pairs music with art based on both emotional and semantic attributes, and a realistic segment that links music with images through affective–semantic layers. In modeling emotional layers for the artistic segment, we found traditional 2D affective models inadequate, prompting us to propose a more interpretable hybrid-emotional rating system that serves both experts and non-experts. For the realistic segment, we utilize a web-based dataset with tags, dividing tag information into semantic and affective components to ensure a balanced and nuanced representation of music–image correlation. We conducted an in-depth statistical analysis and user study to evaluate our dataset’s effectiveness and applicability for AI-driven understanding. This work provides a foundation for advanced explorations into the complex relationships between auditory and visual art modalities, advancing the development of more sophisticated cross-modal AI systems.
... This was necessary because there were substantial changes in the median contextual diversity scores over the years due to changes in corpus sample sizes, etc. We removed stop words using the lists available in Python's NLTK package (Bird et al., 2009). We follow Kim et al. (2014) and allow a buffer period for the historical word vectors to initialize; we use a buffer period of four decades from the first usable decade and only measure changes after this period. ...
Preprint
Understanding how words change their meanings over time is key to models of language and cultural evolution, but historical data on meaning is scarce, making theories hard to develop and test. Word embeddings show promise as a diachronic tool, but have not been carefully evaluated. We develop a robust methodology for quantifying semantic change by evaluating word embeddings (PPMI, SVD, word2vec) against known historical changes. We then use this methodology to reveal statistical laws of semantic evolution. Using six historical corpora spanning four languages and two centuries, we propose two quantitative laws of semantic change: (i) the law of conformity---the rate of semantic change scales with an inverse power-law of word frequency; (ii) the law of innovation---independent of frequency, words that are more polysemous have higher rates of semantic change.
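A hypothetical sketch of the core measurement behind such studies: train word2vec on two time slices, align the vector spaces with orthogonal Procrustes, and take the cosine distance between a word's vectors as its degree of semantic change. gensim, the toy slices, and the alignment details are assumptions; the paper's corpora and hyperparameters differ.

```python
# Sketch: semantic change as cosine distance between Procrustes-aligned embeddings.
import numpy as np
from gensim.models import Word2Vec

slice_1900 = [["gay", "cheerful", "bright"], ["broadcast", "seed", "sow"]]
slice_2000 = [["gay", "rights", "marriage"], ["broadcast", "radio", "television"]]

m1 = Word2Vec(slice_1900, vector_size=50, min_count=1, seed=1)
m2 = Word2Vec(slice_2000, vector_size=50, min_count=1, seed=1)

shared = [w for w in m1.wv.index_to_key if w in m2.wv]   # common vocabulary
A = np.stack([m1.wv[w] for w in shared])
B = np.stack([m2.wv[w] for w in shared])

U, _, Vt = np.linalg.svd(A.T @ B)          # orthogonal Procrustes alignment
A_aligned = A @ (U @ Vt)

def change(word):
    i = shared.index(word)
    a, b = A_aligned[i], B[i]
    return 1 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(change("gay"), change("broadcast"))
```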
Research
Full-text available
Over the past few decades, there has been significant progress in sentiment analysis, primarily focused on text. However, sentiment analysis of audio remains relatively undeveloped in the scientific community. This study aims to address this gap by applying sentiment analysis to voice transcripts, specifically focusing on distinguishing the emotions of individual speakers in conversations. The proposed research seeks to develop a sentiment analysis system capable of rapidly interacting with multiple users and analyzing the sentiment of each user's audio input. Key components of this approach include speech recognition, Mel-frequency cepstral coefficients (MFCC), dynamic time warping (DTW), sentiment analysis, and speaker recognition.

I. INTRODUCTION Sentiment analysis is used to analyze people's feelings or attitudes based on a conversation or topic. It is used for various purposes in many applications and websites. We use this knowledge to create an assistant that understands and learns the human way of thinking from how people converse with each other: a machine that understands people's emotions and moods through these conversations and the keywords used in them. The combined sentiment analysis of the speaker and the speech is carried out on data extracted from previous conversations through various processes. Understanding people's thoughts and feelings has many applications. For example, technology that can understand and respond to a person's emotions will be important. Imagine a device that senses a person's mood and adjusts its settings based on the user's preferences and needs; such innovations can improve user experience and satisfaction. In addition, research institutions are actively working to improve the quality of transcription of audio content into text. This includes a variety of materials such as news reports, political speeches, social media, and music. By enabling this technology, they aim to make audio content more accessible and useful in many situations. Our research group has also worked on voice evaluation research [1,2,3] to study the conversation between the user and the assistant model and to distinguish each speaker and their emotions. Because there are several speakers in a conversation, it is difficult to analyze the text data of the recorded voice, so this paper proposes a model that can recognize the presence of different speakers, identify them separately, and perform voice analysis so that each individual's speech is analyzed according to their emotions. We present an approach to investigating the hurdles and techniques involved in audio sentiment analysis of sound recordings through speech recognition. Our methodology uses a speech recognition model to interpret audio recordings, coupled with a proposed method for user discrimination based on a predetermined hypothesis to authenticate distinct speakers. Each segment of speech data is then scrutinized, enabling the system to accurately discern genuine emotions and topics of conversation. In Section II, we delve into the underlying hypothesis regarding the speaker, speech recognition, and sentiment analysis. Section III elaborates on the proposed system, while Section IV outlines the experimental configuration. Section V presents the results obtained and provides an extensive analysis. Finally, Section VI concludes the paper.
Article
Full-text available
New ways of documenting and describing language via electronic media, coupled with new ways of distributing the results via the World Wide Web, offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resource discovery and resource creation. This article describes the infrastructure that the Open Language Archives Community (OLAC) has built in order to address these problems. Its technical and usage infrastructures address problems of resource discovery by constructing a single virtual library of distributed resources. Its governance infrastructure addresses problems of resource creation by providing a mechanism through which the language-resource community can express its consensus on recommended best practices.
Article
Full-text available
In addition to ordinary words and names, real text contains non-standard "words" (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary "letter-to-sound" rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with respect to their interpretation or pronunciation. In many applications, it is desirable to "normalize" text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. Typical technology for text normalization involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text), with the expected result that the techniques do not usually generalize well to new domains. The purpose of the work reported here is to take some initial steps towards addressing deficiencies in previous approaches to text normalization. We developed a taxonomy of NSWs on the basis of four rather distinct text types: news text, a recipes newsgroup, a hardware-product-specific newsgroup, and real-estate classified ads. We then investigated the application of several general techniques, including n-gram language models, decision trees and weighted finite-state transducers, to the range of NSW types, and demonstrated that a systematic treatment can lead to better results than have been obtained by the ad hoc treatments that have typically been used in the past. For abbreviation expansion in particular, we investigated both supervised and unsupervised approaches. We report results in terms of word-error rate, which is standard in speech recognition evaluations, but which has only occasionally been used as an overall measure in evaluating text normalization systems.
Conference Paper
Full-text available
It is a truism of literature that certain authors have a highly recognizable style. The concept of style underlies the authorship attribution techniques that have been applied to tasks such as identifying which of several authors wrote a particular news article. In this paper, we explore whether the works of authors of classic literature can be correctly identified with either of two approaches to attribution, using a collection of 634 texts by 55 authors. Our results show that these methods can be highly accurate, with errors primarily for authors where it might be argued that style is lacking. And did Marlowe write the works of Shakespeare? Our preliminary evidence suggests not.
Article
Full-text available
The process of documenting and describing the world's languages is undergoing radical transformation with the rapid uptake of new digital technologies for capture, storage, annotation and dissemination. However, uncritical adoption of new tools and technologies is leading to resources that are difficult to reuse and which are less portable than the conventional printed resources they replace. We begin by reviewing current uses of software tools and digital technologies for language documentation and description. This sheds light on how digital language documentation and description are created and managed, leading to an analysis of seven portability problems under the following headings: content, format, discovery, access, citation, preservation and rights. After characterizing each problem we provide a series of value statements, and this provides the framework for a broad range of best practice recommendations.
Article
Full-text available
In this study, we developed an algorithmic method to analyze late contrast-enhanced (CE) magnetic resonance (MR) images, revealing the so-called hibernating myocardium. The algorithm is based on an efficient and robust image registration algorithm. Using ...
Article
Full-text available
In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using three criteria that only require information about the candidate type itself and are independent of context: Abbreviations can be defined as a very tight collocation consisting of a truncated word and a final period, abbreviations are usually short, and abbreviations sometimes contain internal periods. We also show the potential of collocational evidence for two other important subtasks of sentence boundary disambiguation, namely the detection of initials and ordinal numbers. The proposed system has been tested extensively on eleven different languages and on different text genres. It achieves good results without any further amendments or language-specific resources. We evaluate its performance against three different baselines and compare it to other systems for sentence boundary detection proposed in the literature.
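This system is distributed with NLTK as the Punkt sentence tokenizer; a minimal usage sketch follows (the sample text is illustrative, and the downloaded resource name may vary by NLTK version).

```python
# Sketch: unsupervised sentence boundary detection with NLTK's Punkt tokenizer.
import nltk

nltk.download("punkt", quiet=True)
text = "Dr. Smith arrived at 10 a.m. He gave a talk on sentence boundaries."
print(nltk.sent_tokenize(text))
# typically: ['Dr. Smith arrived at 10 a.m.', 'He gave a talk on sentence boundaries.']
```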
Article
Full-text available
The quantification of lexical semantic relatedness has many applications in NLP, and many different measures have been proposed. We evaluate five of these measures, all of which use WordNet as their central resource, by comparing their performance in detecting and correcting real-word spelling errors. An information-content-based measure proposed by Jiang and Conrath is found superior to those proposed by Hirst and St-Onge, Leacock and Chodorow, Lin, and Resnik. In addition, we explain why distributional similarity is not an adequate proxy for lexical semantic relatedness.
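Several of the WordNet-based measures compared here, including Jiang–Conrath, are available through NLTK's WordNet interface; a minimal sketch using Brown-corpus information content (the word pair is illustrative).

```python
# Sketch: WordNet-based relatedness measures via NLTK.
import nltk
from nltk.corpus import wordnet as wn, wordnet_ic

nltk.download("wordnet", quiet=True)
nltk.download("wordnet_ic", quiet=True)

brown_ic = wordnet_ic.ic("ic-brown.dat")
car, bike = wn.synset("car.n.01"), wn.synset("bicycle.n.01")

print(car.jcn_similarity(bike, brown_ic))   # Jiang-Conrath (information content)
print(car.lch_similarity(bike))             # Leacock-Chodorow, for comparison
```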
Conference Paper
Full-text available
We introduce MaltParser, a data-driven parser generator for dependency parsing. Given a treebank in dependency format, MaltParser can be used to induce a parser for the language of the treebank. MaltParser supports several parsing algorithms and learning algorithms, and allows user-defined feature models consisting of arbitrary combinations of lexical features, part-of-speech features and dependency features. MaltParser is freely available for research and educational purposes and has been evaluated empirically on Swedish, English, Czech, Danish and Bulgarian.
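MaltParser itself is a Java tool; NLTK provides a wrapper for running a pre-trained model. A hypothetical sketch, assuming a local MaltParser installation and an English model file; the directory name, model name, and wrapper signature may differ across versions.

```python
# Sketch: dependency parsing with a pre-trained MaltParser model via NLTK's wrapper.
from nltk.parse.malt import MaltParser

# Assumed paths: MaltParser distribution directory and a pre-trained English model.
mp = MaltParser("maltparser-1.9.2", "engmalt.linear-1.7.mco")
graph = mp.parse_one("MaltParser induces parsers from treebanks .".split())
print(graph.tree())
```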
Article
Full-text available
This paper shows how fieldwork data can be managed using the program Toolbox together with the Natural Language Toolkit (NLTK) for the Python programming language. It provides background information about Toolbox and describes how it can be downloaded and installed. The basic functionality of the program for lexicons and texts is described, and its strengths and weaknesses are reviewed. Its underlying data format is briefly discussed, and Toolbox processing capabilities of NLTK are introduced, showing ways in which it can be used to extend the functionality of Toolbox. This is illustrated with a few simple scripts that demonstrate basic data management tasks relevant to language documentation, such as printing out the contents of a lexicon as HTML.
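A small script of the kind the article describes, sketched with NLTK's Toolbox support: read the sample Rotokas lexicon bundled with NLTK and print a few entries as an HTML table. The field markers lx/ps/ge follow the NLTK book's Toolbox examples; the entry indices are illustrative.

```python
# Sketch: render part of a Toolbox lexicon as an HTML table using NLTK.
import nltk
from nltk.corpus import toolbox

nltk.download("toolbox", quiet=True)
lexicon = toolbox.xml("rotokas.dic")      # sample Toolbox lexicon shipped with NLTK

html = "<table>\n"
for entry in lexicon[70:80]:
    lx, ps, ge = (entry.findtext(f) for f in ("lx", "ps", "ge"))
    html += "  <tr><td>%s</td><td>%s</td><td>%s</td></tr>\n" % (lx, ps, ge)
html += "</table>"
print(html)
```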
Article
Bresnan et al. (2007) show that a statistical model can predict United States (US) English speakers’ syntactic choices with ‘give’-type verbs extremely accurately. They argue that these results are consistent with probabilistic models of grammar, which assume that grammar is quantitative and learned from exposure to other speakers. Such a model would also predict syntactic differences across time and space, reflected not only in the use of clear dialectal features or clear-cut changes in progress, but also in subtle factors such as the relative importance of conditioning factors and changes over time in speakers’ preferences between equally well-formed variants. This paper investigates these predictions by comparing the grammar of phrases involving ‘give’ in New Zealand (NZ) and US English. We find that the grammar developed by Bresnan et al. for US English generalizes remarkably well to NZ English. NZ English is, however, subtly different, in that NZ English speakers appear to be more sensitive to the role of animacy. Further, we investigate changes over time in NZ English and find that the overall behavior of ‘give’ phrases has subtly shifted. We argue that these subtle differences in space and time provide support for the gradient nature of grammar and are consistent with usage-based, probabilistic syntactic models.