Book
PDF Available

Natural Language Processing with Python

Authors: Steven Bird, Ewan Klein, Edward Loper
... We compute the term frequency-inverse document frequency (TF-IDF) representation for documents using TFIDFVectorizer [35]. Words in the NLTK English stopword list [3], words appearing in fewer than 5 documents, and words appearing in more than 70% of the documents were removed. We use the tokenizer pattern [a-zA-Z]+ and limit the vocabulary size to 5000. ...
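The excerpt above pins down a concrete TF-IDF setup. Below is a minimal sketch of that configuration, assuming scikit-learn's TfidfVectorizer and NLTK's stopword corpus; the 20 Newsgroups loader serves only as a stand-in corpus, and the cited work's exact preprocessing may differ.

```python
# Sketch of the TF-IDF configuration described in the excerpt (assumed
# libraries: scikit-learn and NLTK; the corpus shown here is a stand-in).
import nltk
from nltk.corpus import stopwords
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download("stopwords", quiet=True)

documents = fetch_20newsgroups(subset="train",
                               remove=("headers", "footers", "quotes")).data

vectorizer = TfidfVectorizer(
    stop_words=stopwords.words("english"),  # NLTK English stopword list
    token_pattern=r"[a-zA-Z]+",             # tokenizer pattern from the excerpt
    min_df=5,                               # drop words in fewer than 5 documents
    max_df=0.7,                             # drop words in more than 70% of documents
    max_features=5000,                      # cap the vocabulary at 5000 terms
)
X = vectorizer.fit_transform(documents)     # documents-by-terms TF-IDF matrix
```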
... In this approach, we normalize each of the columns of the representation matrix to produce Ŝ. Now, in either approach, we apply a metric P which measures the association between the i-th topic-documents association Ŝ_i and the best ground-truth subgroup-documents association M_I (in the 20 Newsgroups data set, each document belongs to only one subgroup). ...
Preprint
We propose several new models for semi-supervised nonnegative matrix factorization (SSNMF) and provide motivation for SSNMF models as maximum likelihood estimators given specific distributions of uncertainty. We present multiplicative-update training methods for each new model and demonstrate the application of these models to classification, although they are flexible enough to support other supervised learning tasks. We illustrate the promise of these models and training methods on both synthetic and real data, and achieve high classification accuracy on the 20 Newsgroups dataset.
... We tokenize the training data with the NLTK tokenizer (Bird et al., 2009) and count unigram and bigram frequencies. The greedy corrector processes a sequence from left to right. ...
... 10,000 randomly selected articles each, and the remaining articles as a training set. The articles were split into paragraphs, and development and test paragraphs were further split into sentences with the NLTK sentence segmenter (Bird et al., 2009). All sequences were stripped of leading and trailing spaces, and empty sequences were removed. ...
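The two excerpts above describe standard NLTK preprocessing: word tokenization with n-gram counting, and sentence segmentation of paragraphs. A minimal sketch under those assumptions follows; the sample paragraph and the counting code are illustrative, not the authors' exact pipeline.

```python
# Sketch of sentence segmentation, tokenization, and unigram/bigram counting
# with NLTK; the sample paragraph is illustrative.
from collections import Counter

import nltk
from nltk import bigrams, sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)  # newer NLTK releases ship this as "punkt_tab"

paragraph = "This is a short example. It stands in for a training paragraph."

# Split a paragraph into sentences, strip surrounding spaces, drop empties.
sentences = [s.strip() for s in sent_tokenize(paragraph) if s.strip()]

# Tokenize each sentence and count unigram and bigram frequencies.
unigram_counts, bigram_counts = Counter(), Counter()
for sentence in sentences:
    tokens = word_tokenize(sentence)
    unigram_counts.update(tokens)
    bigram_counts.update(bigrams(tokens))

print(unigram_counts.most_common(5))
print(bigram_counts.most_common(5))
```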
Preprint
We consider the following tokenization repair problem: Given a natural language text with any combination of missing or spurious spaces, correct these. Spelling errors can be present, but correcting them is not part of the problem. For example, given: "Tispa per isabout token izaionrep air", compute "Tis paper is about tokenizaion repair". It is tempting to think of this problem as a special case of spelling correction or to treat the two problems together. We make a case that tokenization repair and spelling correction should and can be treated as separate problems. We investigate a variety of neural models as well as a number of strong baselines. We identify three main ingredients to high-quality tokenization repair: deep language models with a bidirectional component, training the models on text with spelling errors, and making use of the space information already present. Our best methods can repair all tokenization errors on 97.5% of the correctly spelled test sentences and on 96.0% of the misspelled test sentences. With all spaces removed from the given text (the scenario from previous work), the accuracy falls to 94.5% and 90.1%, respectively. We conduct a detailed error analysis.
... As we will focus on English lyrics in this study, we used the English stop words corpus from the Natural Language Toolkit (NLTK) [23] ...
Conference Paper
Full-text available
Psychology research has shown that song lyrics are a rich source of data, yet they are often overlooked in the field of MIR compared to audio. In this paper, we provide an initial assessment of the usefulness of features drawn from lyrics for various fields, such as MIR and Music Psychology. To do so, we assess the performance of lyric-based text features on 3 MIR tasks, in comparison to audio features. Specifically, we draw sets of text features from the fields of Natural Language Processing and Psychology. Further, we estimate their effect on performance while statistically controlling for the effect of audio features, by using a hierarchical regression statistical model. Lyric-based features show a small but statistically significant effect that anticipates further research. Implications and directions for future studies are discussed.
... An example of stop words: the, an, a, you, he, to, be. In this paper, we used the list of English stop words included with the Natural Language Toolkit [12]. ...
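Both excerpts above rely on NLTK's English stop word list; a minimal sketch of that filtering step follows, with an illustrative sentence.

```python
# Sketch of stop word removal using NLTK's English stop word list;
# the example sentence is illustrative.
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("stopwords", quiet=True)
nltk.download("punkt", quiet=True)

stop_words = set(stopwords.words("english"))  # includes "the", "an", "a", "you", "he", "to", "be", ...

tokens = word_tokenize("He wants you to be the one to read an article a day.")
content_tokens = [t for t in tokens if t.lower() not in stop_words]
print(content_tokens)
```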
Conference Paper
Security bug reports (SBR) depict potential security vulnerabilities in software systems. Bug tracking systems (BTS) usually contain huge numbers of bug reports, including security-related ones. Malicious attackers could exploit these SBRs. Hence, it is critical to pinpoint SBRs swiftly and correctly. In this work, we studied the security bug reports of the Chromium project. We looked into three main aspects of these bug reports, namely: how frequently they are reported, how quickly they get fixed, and whether LDA is effective in grouping these reports into known vulnerability types. We report our findings on these aspects.
... • Stemming: Stemming is the process of reducing a word to its base root form. We used the Porter Stemmer from NLTK (Bird et al., 2009) for stemming. Stemming is used in combination with the Naive Bayes classifier. ...
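A minimal sketch of the stemming step with NLTK's PorterStemmer; the example words are illustrative.

```python
# Sketch of stemming with NLTK's Porter stemmer; example words are illustrative.
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ["questions", "clarifying", "entries", "reviewed"]:
    print(word, "->", stemmer.stem(word))  # e.g. "questions" -> "question"
```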
... Learning-based Approach: These approaches learn specific patterns that can be used to identify sentences that are questions. In this study, we evaluated the following methods: (i) a Naive Bayes classifier trained on the NPS Chat Corpus, which consists of over 10,000 posts from instant messaging sessions (Bird et al., 2009). As these posts have been labeled with dialogue act types, such as "Statement" and "ynQuestion", we used the classifier without any further training. ...
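The classifier described above appears to follow the dialogue-act example from the NLTK book; the sketch below reproduces that example's feature extraction and train/test split, which may differ from the cited study's exact setup.

```python
# Sketch of a Naive Bayes dialogue-act classifier trained on the NPS Chat
# Corpus, following the NLTK book's example; features and split are from
# that example, not necessarily the cited study.
import nltk

nltk.download("nps_chat", quiet=True)
nltk.download("punkt", quiet=True)

posts = nltk.corpus.nps_chat.xml_posts()[:10000]

def dialogue_act_features(post_text):
    # Bag-of-words presence features over the lowercased tokens of a post.
    return {"contains({})".format(w.lower()): True
            for w in nltk.word_tokenize(post_text)}

featuresets = [(dialogue_act_features(p.text), p.get("class")) for p in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
print(classifier.classify(dialogue_act_features("can you clarify this?")))
```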
Preprint
In most clinical practice settings, there is no rigorous reviewing of the clinical documentation, resulting in inaccurate information captured in the patient medical records. The gold standard in clinical data capturing is achieved via "expert review", where clinicians can have a dialogue with a domain expert (reviewer) and ask them questions about data entry rules. Automatically identifying "real questions" in these dialogues could uncover ambiguities or common problems in data capturing in a given clinical setting. In this study, we proposed a novel multi-channel deep convolutional neural network architecture, namely Quest-CNN, for the purpose of separating real questions that expect an answer (information or help) about an issue from sentences that are not questions, as well as from questions referring to an issue mentioned in a nearby sentence (e.g., can you clarify this?), which we will refer to as "c-questions". We conducted a comprehensive performance comparison analysis of the proposed multi-channel deep convolutional neural network against other deep neural networks. Furthermore, we evaluated the performance of traditional rule-based and learning-based methods for detecting question sentences. The proposed Quest-CNN achieved the best F1 score both on a dataset of data entry-review dialogue in a dialysis care setting and on a general domain dataset.
Preprint
Student reviews often make reference to professors' physical appearances. Until recently, RateMyProfessors.com, the website this study focuses on, used a design feature to encourage a "hot or not" rating of college professors. In the wake of the recent #MeToo and #TimesUp movements, social awareness of the inappropriateness of these reviews has grown; however, objectifying comments remain and continue to be posted in this online context. We describe two supervised text classifiers for detecting objectifying commentary in professor reviews. We then ensemble these classifiers and use the resulting model to track objectifying commentary at scale. We measure correlations between objectifying commentary, changes to the review website interface, and teacher gender across a ten-year period.
Article
Full-text available
Human sleep/wake cycles follow a stable circadian rhythm associated with hormonal, emotional, and cognitive changes. Changes of this cycle are implicated in many mental health concerns. In fact, the bidirectional relation between major depressive disorder and sleep has been well documented. Despite a clear link between sleep disturbances and subsequent disturbances in mood, it is difficult to determine from self-reported data which specific changes of the sleep/wake cycle play the most important role in this association. Here we observe marked changes of activity cycles in millions of Twitter posts of 688 subjects who explicitly stated in unequivocal terms that they had received a (clinical) diagnosis of depression, as compared to the activity cycles of a large control group (n = 8791). Rather than a phase shift, as reported in other work, we find significant changes of activity levels in the evening and before dawn. Compared to the control group, depressed subjects were significantly more active from 7 PM to midnight and less active from 3 to 6 AM. Content analysis of tweets revealed a steady rise in rumination and emotional content from midnight to dawn among depressed individuals. These results suggest that diagnosis and treatment of depression may focus on modifying the timing of activity, reducing rumination, and decreasing social media use at specific hours of the day.
Preprint
Full-text available
The goal of this thesis is to present an algorithm that augments standard word-frequency algorithms by incorporating the synonyms of words into the process of counting frequencies. The thesis consists of three parts: the first explains why this change could be an improvement, the second describes the technologies used in the project and gives an in-depth explanation of the algorithm that was built, and the third applies the algorithm to real-world data and analyzes the results.
Article
Full-text available
New ways of documenting and describing language via electronic media coupled with new ways of distributing the results via the World Wide Web offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resource discovery and resource creation. This article describes the infrastructure that the Open Language Archives Community (OLAC) has built in order to address these problems. Its technical and usage infrastructures address problems of resource discovery by constructing a single virtual library of distributed resources. Its governance infrastructure addresses problems of resource creation by providing a mechanism through which the language-resource community can express its consensus on recommended best practices.
Conference Paper
Full-text available
It is a truism of literature that certain authors have a highly recognizable style. The concept of style underlies the authorship attribution techniques that have been applied to tasks such as identifying which of several authors wrote a particular news article. In this paper, we explore whether the works of authors of classic literature can be correctly identified with either of two approaches to attribution, using a collection of 634 texts by 55 authors. Our results show that these methods can be highly accurate, with errors primarily for authors where it might be argued that style is lacking. And did Marlowe write the works of Shakespeare? Our preliminary evidence suggests not.
Article
Full-text available
The process of documenting and describing the world's languages is undergoing radical transformation with the rapid uptake of new digital technologies for capture, storage, annotation and dissemination. However, uncritical adoption of new tools and technologies is leading to resources that are difficult to reuse and which are less portable than the conventional printed resources they replace. We begin by reviewing current uses of software tools and digital technologies for language documentation and description. This sheds light on how digital language documentation and description are created and managed, leading to an analysis of seven portability problems under the following headings: content, format, discovery, access, citation, preservation and rights. After characterizing each problem we provide a series of value statements, and this provides the framework for a broad range of best practice recommendations.
Article
Full-text available
In this study, we developed an algorithmic method to analyze late contrast-enhanced (CE) magnetic resonance (MR) images, revealing the so-called hibernating myocardium. The algorithm is based on an efficient and robust image registration algorithm. Using ...
Article
Full-text available
In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using three criteria that only require information about the candidate type itself and are independent of context: Abbreviations can be defined as a very tight collocation consisting of a truncated word and a final period, abbreviations are usually short, and abbreviations sometimes contain internal periods. We also show the potential of collocational evidence for two other important subtasks of sentence boundary disambiguation, namely the detection of initials and ordinal numbers. The proposed system has been tested extensively on eleven different languages and on different text genres. It achieves good results without any further amendments or language-specific resources. We evaluate its performance against three different baselines and compare it to other systems for sentence boundary detection proposed in the literature.
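This unsupervised approach is distributed with NLTK as the Punkt sentence tokenizer; below is a minimal sketch of using the pretrained English model and of training a model directly on raw text (the sample text is illustrative, and a real training corpus would be far larger).

```python
# Sketch of sentence boundary detection with NLTK's Punkt tokenizer;
# the sample text is illustrative.
import nltk
from nltk.tokenize.punkt import PunktSentenceTokenizer

nltk.download("punkt", quiet=True)  # pretrained Punkt model for English

text = "Dr. Smith arrived at 5 p.m. on Jan. 3. He left the next day."

# Use the pretrained English model ...
print(nltk.sent_tokenize(text))

# ... or train an unsupervised model directly on raw, domain-specific text
# (in practice this would be a large corpus, not a two-sentence string).
tokenizer = PunktSentenceTokenizer(text)
print(tokenizer.tokenize(text))
```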
Article
Full-text available
The quantification of lexical semantic relatedness has many applications in NLP, and many different measures have been proposed. We evaluate five of these measures, all of which use WordNet as their central resource, by comparing their performance in detecting and correcting real-word spelling errors. An information-content-based measure proposed by Jiang and Conrath is found superior to those proposed by Hirst and St-Onge, Leacock and Chodorow, Lin, and Resnik. In addition, we explain why distributional similarity is not an adequate proxy for lexical semantic relatedness.
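Several of these WordNet-based measures are exposed through NLTK's WordNet interface; a minimal sketch follows (the Hirst and St-Onge measure is omitted here, and the word pair and information-content corpus are illustrative).

```python
# Sketch of WordNet-based relatedness measures in NLTK, including the
# information-content-based Jiang-Conrath measure; word pair is illustrative.
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

nltk.download("wordnet", quiet=True)
nltk.download("wordnet_ic", quiet=True)

car = wn.synset("car.n.01")
truck = wn.synset("truck.n.01")

brown_ic = wordnet_ic.ic("ic-brown.dat")    # information content from the Brown corpus

print(car.jcn_similarity(truck, brown_ic))  # Jiang-Conrath (information-content based)
print(car.res_similarity(truck, brown_ic))  # Resnik
print(car.lin_similarity(truck, brown_ic))  # Lin
print(car.lch_similarity(truck))            # Leacock-Chodorow (path and depth based)
```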
Conference Paper
Full-text available
We introduce MaltParser, a data-driven parser generator for dependency parsing. Given a treebank in dependency format, MaltParser can be used to induce a parser for the language of the treebank. MaltParser supports several parsing algorithms and learning algorithms, and allows user-defined feature models, consisting of arbitrary combinations of lexical features, part-of-speech features and dependency features. MaltParser is freely available for research and educational purposes and has been evaluated empirically on Swedish, English, Czech, Danish and Bulgarian.
Article
Full-text available
This paper shows how fieldwork data can be managed using the program Toolbox together with the Natural Language Toolkit (NLTK) for the Python programming language. It provides background information about Toolbox and describes how it can be downloaded and installed. The basic functionality of the program for lexicons and texts is described, and its strengths and weaknesses are reviewed. Its underlying data format is briefly discussed, and the Toolbox processing capabilities of NLTK are introduced, showing ways in which they can be used to extend the functionality of Toolbox. This is illustrated with a few simple scripts that demonstrate basic data management tasks relevant to language documentation, such as printing out the contents of a lexicon as HTML.
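NLTK's Toolbox support parses Toolbox files into ElementTree objects; a minimal sketch follows, assuming the sample rotokas.dic file shipped with the NLTK data (field markers such as lx and ge are common Toolbox conventions but vary by project), with the HTML output reduced to simple table rows.

```python
# Sketch of reading a Toolbox lexicon with NLTK and printing headwords and
# glosses as HTML table rows; assumes the sample rotokas.dic from nltk_data
# and the lx/ge field markers used there.
import nltk
from nltk.corpus import toolbox

nltk.download("toolbox", quiet=True)

lexicon = toolbox.xml("rotokas.dic")  # parsed into an ElementTree element

# Each record's children are Toolbox fields (e.g. \lx lexeme, \ge English gloss).
for record in lexicon.findall("record")[:5]:
    lexeme = record.findtext("lx")
    gloss = record.findtext("ge")
    print("<tr><td>{}</td><td>{}</td></tr>".format(lexeme, gloss))
```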
Article
In addition to ordinary words and names, real text contains non-standard "words" (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary "letter-to-sound" rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with respect to their interpretation or pronunciation. In many applications, it is desirable to "normalize" text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. Typical technology for text normalization involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text) with the expected result that the techniques do not usually generalize well to new domains. The purpose of the work reported here is to take some initial steps towards addressing deficiencies in previous approaches to text normalization. We developed a taxonomy of NSWs on the basis of four rather distinct text types—news text, a recipes newsgroup, a hardware-product-specific newsgroup, and real-estate classified ads. We then investigated the application of several general techniques including n-gram language models, decision trees and weighted finite-state transducers to the range of NSW types, and demonstrated that a systematic treatment can lead to better results than have been obtained by the ad hoc treatments that have typically been used in the past. For abbreviation expansion in particular, we investigated both supervised and unsupervised approaches. We report results in terms of word-error rate, which is standard in speech recognition evaluations, but which has only occasionally been used as an overall measure in evaluating text normalization systems.
Article
Bresnan et al. (2007) show that a statistical model can predict United States (US) English speakers’ syntactic choices with ‘give’-type verbs extremely accurately. They argue that these results are consistent with probabilistic models of grammar, which assume that grammar is quantitative and learned from exposure to other speakers. Such a model would also predict syntactic differences across time and space which are reflected not only in the use of clear dialectal features or clear-cut changes in progress, but also in subtle factors such as the relative importance of conditioning factors, and changes over time in speakers’ preferences between equally well-formed variants. This paper investigates these predictions by comparing the grammar of phrases involving ‘give’ in New Zealand (NZ) and US English. We find that the grammar developed by Bresnan et al. for US English generalizes remarkably well to NZ English. NZ English is, however, subtly different, in that NZ English speakers appear to be more sensitive to the role of animacy. Further, we investigate changes over time in NZ English and find that the overall behavior of ‘give’ phrases has subtly shifted. We argue that these subtle differences in space and time provide support for the gradient nature of grammar, and are consistent with usage-based, probabilistic syntactic models.