Book
PDF Available

Natural Language Processing with Python

Authors: Steven Bird, Ewan Klein, Edward Loper
... For each unique sentence in these corpora, every root verb was assigned an index, which was then used to match arguments to their respective verbs. Using the NLTK library for Galician WordNet, a synset for each verbal lemma was identified and matched to its English PropBank role set (Bird et al., 2009; Palmer et al., 2005). This information was then used to assign one of five roles to each token: "r[idx]:root", "r[idx]:arg0", "r[idx]:arg1", "r[idx]:arg2" or "O", where idx designates the verb index and O designates a non-involved token. ...
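As a rough, hedged illustration of the lookup this excerpt describes, the sketch below uses NLTK's Open Multilingual Wordnet (which covers Galician under the language code 'glg') together with its PropBank frames. The Galician lemma 'comer' and the '.01' sense heuristic are illustrative assumptions, not the cited paper's actual procedure.

    # Sketch: map a Galician verbal lemma to candidate English PropBank rolesets.
    import nltk
    from nltk.corpus import wordnet as wn
    from nltk.corpus import propbank

    nltk.download("wordnet")
    nltk.download("omw-1.4")   # Open Multilingual Wordnet, includes Galician ('glg')
    nltk.download("propbank")

    def english_rolesets(galician_verb):
        """Collect PropBank rolesets for the English lemmas of a Galician verb."""
        rolesets = []
        for synset in wn.synsets(galician_verb, pos=wn.VERB, lang="glg"):
            for lemma in synset.lemma_names("eng"):
                try:
                    rolesets.append(propbank.roleset(lemma + ".01"))
                except ValueError:
                    continue   # no PropBank frame for this English lemma
        return rolesets

    for rs in english_rolesets("comer"):   # Galician 'to eat'
        print(rs.attrib["id"], "-", rs.attrib.get("name", ""))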
... • samenoun: Retrieves k sentences that contain at least one noun in common with the input sentence. We identify nouns using NLTK (Bird et al., 2009), as done by Amalvy et al. (2023). ...
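A minimal sketch of such a samenoun retriever over a plain list of candidate sentences; the toy corpus, the value of k, and the helper names are invented for illustration.

    # Sketch: retrieve up to k sentences sharing at least one noun with the query.
    import nltk
    from nltk import word_tokenize, pos_tag

    nltk.download("punkt")
    nltk.download("averaged_perceptron_tagger")

    def nouns(sentence):
        """Return the set of noun tokens (tags NN, NNS, NNP, NNPS) in a sentence."""
        return {tok.lower() for tok, tag in pos_tag(word_tokenize(sentence))
                if tag.startswith("NN")}

    def samenoun(query, corpus, k=3):
        query_nouns = nouns(query)
        return [s for s in corpus if nouns(s) & query_nouns][:k]

    corpus = ["The castle stood on a hill.",
              "A knight rode toward the castle.",
              "It rained all night."]
    print(samenoun("Who owns the castle?", corpus, k=2))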
... We extracted n-grams (1 ≤ n ≤ 4) with high TF-IDF scores (Sparck Jones, 1988), as well as noun phrases (Loria, 2018), as the Mode 3 candidates. We take all Mode 3 candidate pairs that contain an appropriately aligned term, stem the English terms using nltk (Bird et al., 2009), and then randomly select a maximum of ten term pairs for the Mode 3 glossary. ...
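The stemming step can be sketched with NLTK's PorterStemmer; the term pairs below are invented, and stemming only the English side follows the excerpt's description.

    # Sketch: stem the English side of candidate term pairs with NLTK's Porter stemmer.
    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    term_pairs = [("neural networks", "redes neuronais"),      # invented examples
                  ("training data", "datos de adestramento")]

    stemmed_pairs = [(" ".join(stemmer.stem(w) for w in en.split()), tgt)
                     for en, tgt in term_pairs]
    print(stemmed_pairs)   # [('neural network', 'redes neuronais'), ('train data', ...)]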
... We do not segment the IAs such that each IA contains a single word, because in a single fixation people can read a span of about 21 surrounding characters (Rayner, 1978), meaning that many short words are not fixated on, which would create difficulties for our desired analyses. Instead, we use the Natural Language Toolkit (NLTK) stopwords list (Bird et al., 2009) to define each IA such that stopwords share an IA with the closest non-stopword. Specifically, each stopword is combined with the closest non-stopword, with non-stopwords to the right being preferred in the case of a tie. ...
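One possible reading of this grouping rule in code, using NLTK's English stopword list; the tokenisation and the exact tie-breaking are our interpretation of the description.

    # Sketch: attach each stopword to the nearest non-stopword, preferring the
    # right-hand neighbour when the distances are equal.
    import nltk
    from nltk import word_tokenize
    from nltk.corpus import stopwords

    nltk.download("punkt")
    nltk.download("stopwords")

    STOP = set(stopwords.words("english"))

    def interest_areas(sentence):
        tokens = word_tokenize(sentence)
        content_idx = [i for i, t in enumerate(tokens) if t.lower() not in STOP]
        groups = {i: [i] for i in content_idx}          # host index -> token indices
        for i, tok in enumerate(tokens):
            if tok.lower() in STOP and content_idx:
                # nearest content word; on a distance tie, prefer the word to the right
                host = min(content_idx, key=lambda j: (abs(j - i), j < i))
                groups[host].append(i)
        return [[tokens[j] for j in sorted(groups[i])] for i in content_idx]

    print(interest_areas("The cat sat on the mat"))
    # [['The', 'cat'], ['sat', 'on'], ['the', 'mat']]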
... For synonym substitution (Syn.), we employ WordNet (Miller, 1995) and NLTK (Bird et al., 2009) to randomly select and replace words with their synonyms. ...
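A minimal sketch of WordNet-based synonym substitution with NLTK; pooling lemmas across all senses, the fixed random seed, and the 30% replacement rate are simplifying assumptions.

    # Sketch: randomly replace words with WordNet synonyms.
    import random
    import nltk
    from nltk import word_tokenize
    from nltk.corpus import wordnet as wn

    nltk.download("punkt")
    nltk.download("wordnet")

    def synonym_substitute(sentence, p=0.3, seed=0):
        rng = random.Random(seed)
        out = []
        for tok in word_tokenize(sentence):
            synonyms = {l.name().replace("_", " ")
                        for s in wn.synsets(tok) for l in s.lemmas()} - {tok}
            if synonyms and rng.random() < p:
                out.append(rng.choice(sorted(synonyms)))   # pick one synonym at random
            else:
                out.append(tok)
        return " ".join(out)

    print(synonym_substitute("The quick brown fox jumps over the lazy dog"))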
... This method is intentionally chosen as a lower-quality but still purposeful method of segmentation. The NLTK (Bird et al., 2009) implementation of TextTiling was used for this section. Window sizes for TextTiling were chosen arbitrarily to ensure at least two segments per file, with a value of 100 words used for the Choi 3-11 data set and a window of 20 words used for the Wiki-50 data set. ...
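A hedged sketch of running NLTK's TextTiling implementation; we assume the quoted "window size" corresponds to the tokenizer's pseudosentence parameter w, and document.txt is a placeholder for an input file that contains blank-line paragraph breaks.

    # Sketch: segment a document with NLTK's TextTilingTokenizer.
    import nltk
    from nltk.tokenize import TextTilingTokenizer

    nltk.download("stopwords")          # TextTiling uses a stopword list internally

    with open("document.txt") as f:     # placeholder input file
        text = f.read()

    tt = TextTilingTokenizer(w=20)      # e.g. 20 for Wiki-50, 100 for Choi 3-11
    segments = tt.tokenize(text)        # the text must contain blank-line paragraph breaks
    print(len(segments), "segments")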
... To achieve these evaluations, we parse our pretraining set into three kinds of structure. The first are constituency trees extracted from CoreNLP (Manning et al., 2014) and binarised using NLTK (Bird et al., 2009). The two other kinds are purely right-branching and balanced binary trees, which we extract using standard algorithms. ...
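Binarising a constituency tree with NLTK can be sketched as follows; the example tree is invented and the CoreNLP parsing step is not shown.

    # Sketch: convert a (possibly flat) constituency tree to Chomsky normal form.
    from nltk import Tree

    t = Tree.fromstring("(S (NP (DT the) (JJ big) (JJ red) (NN dog)) (VP (VBD barked)))")
    t.chomsky_normal_form()   # in-place, right-factored binarisation
    t.pretty_print()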
... Concretely, we use a dictionary that contains derivational morphology mappings (Wu and Yarowsky, 2020) to get the base form of the event predicate. We then construct a set of words for the predicate by including the synonyms for the base form and the original form (Bird et al., 2009) ... overlap coefficient, without stop words. ...
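A hedged sketch of the word-set construction and an overlap-coefficient comparison consistent with this excerpt; the one-entry base-form dictionary stands in for the derivational-morphology resource of Wu and Yarowsky (2020), which NLTK does not provide.

    # Sketch: build a synonym set for a predicate (base form + original form) and
    # compare two such sets with the overlap coefficient, ignoring stop words.
    import nltk
    from nltk.corpus import stopwords, wordnet as wn

    nltk.download("wordnet")
    nltk.download("stopwords")

    STOP = set(stopwords.words("english"))
    BASE_FORM = {"destruction": "destroy"}   # stand-in derivational mapping

    def predicate_word_set(predicate):
        forms = {predicate, BASE_FORM.get(predicate, predicate)}
        words = set(forms)
        for form in forms:
            for syn in wn.synsets(form):
                words.update(l.name().replace("_", " ") for l in syn.lemmas())
        return {w.lower() for w in words} - STOP

    def overlap_coefficient(a, b):
        return len(a & b) / min(len(a), len(b)) if a and b else 0.0

    print(overlap_coefficient(predicate_word_set("destruction"),
                              predicate_word_set("demolish")))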
... In the implementation, we employ several tools to handle each language. In detail, to tokenize a text, we use the Natural Language Toolkit (NLTK) package (Bird et al., 2009) for English, French, Italian, and Indonesian; the mecab-python3 package for Japanese; the pymecabko package for Korean; and character-level splitting for Chinese. Then, we use the same tool as the part-of-speech (POS) tagger for each non-Chinese language, and we directly remove all 1-gram Chinese entities. ...
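A hedged sketch of such a per-language dispatch; only the NLTK and Chinese branches are implemented here, the MeCab-based Japanese and Korean branches are stubbed out, and the two-letter language codes are our own convention.

    # Sketch: per-language tokenisation dispatch.
    import nltk
    from nltk import word_tokenize

    nltk.download("punkt")

    def tokenize(text, lang):
        if lang in {"en", "fr", "it", "id"}:
            return word_tokenize(text)                # NLTK for these four languages
        if lang == "zh":
            return list(text.replace(" ", ""))        # split Chinese into characters
        raise NotImplementedError(lang + ": use a MeCab-based tokenizer (not shown)")

    print(tokenize("Le chat dort sur le tapis.", "fr"))
    print(tokenize("我喜欢自然语言处理", "zh"))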
Article
Full-text available
New ways of documenting and describing language via electronic media coupled with new ways of distributing the results via the World Wide Web offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resource discovery and resource creation. This article describes the infrastructure that the Open Language Archives Community (OLAC) has built in order to address these problems. Its technical and usage infrastructures address problems of resource discovery by constructing a single virtual library of distributed resources. Its governance infrastructure addresses problems of resource creation by providing a mechanism through which the language-resource community can express its consensus on recommended best practices.
Article
Full-text available
In addition to ordinary words and names, real text contains non-standard "words" (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary "letter-to-sound" rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with respect to their interpretation or pronunciation. In many applications, it is desirable to "normalize" text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. Typical technology for text normalization involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text) with the expected result that the techniques do not usually generalize well to new domains. The purpose of the work reported here is to take some initial steps towards addressing deficiencies in previous approaches to text normalization. We developed a taxonomy of NSWs on the basis of four rather distinct text types: news text, a recipes newsgroup, a hardware-product-specific newsgroup, and real-estate classified ads. We then investigated the application of several general techniques including n-gram language models, decision trees and weighted finite-state transducers to the range of NSW types, and demonstrated that a systematic treatment can lead to better results than have been obtained by the ad hoc treatments that have typically been used in the past. For abbreviation expansion in particular, we investigated both supervised and unsupervised approaches. We report results in terms of word-error rate, which is standard in speech recognition evaluations, but which has only occasionally been used as an overall measure in evaluating text normalization systems.
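The paper's point is precisely that ad hoc rules do not generalize, but a toy, purely rule-based pass still illustrates what NSW normalization has to do; the abbreviation table and regular expressions below are invented.

    # Toy illustration only (not the paper's system): detect and expand a few NSW types.
    import re

    ABBREV = {"Dr.": "Doctor", "approx.": "approximately", "kg": "kilograms"}

    def normalize(text):
        out = []
        for tok in text.split():
            if tok in ABBREV:
                out.append(ABBREV[tok])               # abbreviation expansion
            elif re.fullmatch(r"\$\d+", tok):
                out.append(tok[1:] + " dollars")      # currency amount
            elif not re.fullmatch(r"[A-Za-z]+[.,]?", tok):
                out.append("<NSW:" + tok + ">")       # flag remaining NSWs for a model
            else:
                out.append(tok)
        return " ".join(out)

    print(normalize("Dr. Smith paid $50 for approx. 3 kg of rice"))
    # Doctor Smith paid 50 dollars for approximately <NSW:3> kilograms of rice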
Conference Paper
Full-text available
It is a truism of literature that certain authors have a highly recognizable style. The concept of style underlies the authorship attribution techniques that have been applied to tasks such as identifying which of several authors wrote a particular news article. In this paper, we explore whether the works of authors of classic literature can be correctly identified with either of two approaches to attribution, using a collection of 634 texts by 55 authors. Our results show that these methods can be highly accurate, with errors primarily for authors where it might be argued that style is lacking. And did Marlowe write the works of Shakespeare? Our preliminary evidence suggests not.
Article
Full-text available
The process of documenting and describing the world's languages is undergoing radical transformation with the rapid uptake of new digital technologies for capture, storage, annotation and dissemination. However, uncritical adoption of new tools and technologies is leading to resources that are difficult to reuse and which are less portable than the conventional printed resources they replace. We begin by reviewing current uses of software tools and digital technologies for language documentation and description. This sheds light on how digital language documentation and description are created and managed, leading to an analysis of seven portability problems under the following headings: content, format, discovery, access, citation, preservation and rights. After characterizing each problem we provide a series of value statements, and this provides the framework for a broad range of best practice recommendations.
Article
Full-text available
In this study, we developed an algorithmic method to analyze late contrast-enhanced (CE) magnetic resonance (MR) images, revealing the so-called hibernating myocardium. The algorithm is based on an efficient and robust image registration algorithm. Using ...
Article
Full-text available
In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using three criteria that only require information about the candidate type itself and are independent of context: Abbreviations can be defined as a very tight collocation consisting of a truncated word and a final period, abbreviations are usually short, and abbreviations sometimes contain internal periods. We also show the potential of collocational evidence for two other important subtasks of sentence boundary disambiguation, namely the detection of initials and ordinal numbers. The proposed system has been tested extensively on eleven different languages and on different text genres. It achieves good results without any further amendments or language-specific resources. We evaluate its performance against three different baselines and compare it to other systems for sentence boundary detection proposed in the literature.
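The sentence tokenizer distributed with NLTK (the punkt module) implements this approach, so training and applying it can be sketched briefly; corpus.txt is a placeholder for a large raw text in the target language.

    # Sketch: unsupervised sentence-boundary training with NLTK's punkt module.
    from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

    raw_corpus = open("corpus.txt").read()     # placeholder training text

    trainer = PunktTrainer()
    trainer.INCLUDE_ALL_COLLOCS = True         # also learn collocations such as initials
    trainer.train(raw_corpus, finalize=True)

    tokenizer = PunktSentenceTokenizer(trainer.get_params())
    print(tokenizer.tokenize("Dr. Watson met Mr. Holmes. They talked."))
    # Whether 'Dr.' and 'Mr.' are treated as abbreviations depends on the training text.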
Article
Full-text available
The quantification of lexical semantic relatedness has many applications in NLP, and many different measures have been proposed. We evaluate five of these measures, all of which use WordNet as their central resource, by comparing their performance in detecting and correcting real-word spelling errors. An information-content-based measure proposed by Jiang and Conrath is found superior to those proposed by Hirst and St-Onge, Leacock and Chodorow, Lin, and Resnik. In addition, we explain why distributional similarity is not an adequate proxy for lexical semantic relatedness.
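Several of these WordNet-based measures ship with NLTK's wordnet interface (Hirst-St-Onge does not), so a quick comparison is easy to sketch; the synset pair and the Brown information-content file are illustrative choices.

    # Sketch: compare WordNet similarity measures in NLTK for one synset pair.
    import nltk
    from nltk.corpus import wordnet as wn, wordnet_ic

    nltk.download("wordnet")
    nltk.download("wordnet_ic")

    brown_ic = wordnet_ic.ic("ic-brown.dat")
    car, bike = wn.synset("car.n.01"), wn.synset("bicycle.n.01")

    print("Leacock-Chodorow:", car.lch_similarity(bike))
    print("Lin:             ", car.lin_similarity(bike, brown_ic))
    print("Resnik:          ", car.res_similarity(bike, brown_ic))
    print("Jiang-Conrath:   ", car.jcn_similarity(bike, brown_ic))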
Conference Paper
Full-text available
We introduce MaltParser, a data-driven parser generator for dependency parsing. Given a treebank in dependency format, MaltParser can be used to induce a parser for the language of the treebank. MaltParser supports several parsing algorithms and learning algorithms, and allows user-defined feature models, consisting of arbitrary combinations of lexical features, part-of-speech features and dependency features. MaltParser is freely available for research and educational purposes and has been evaluated empirically on Swedish, English, Czech, Danish and Bulgarian.
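NLTK provides a wrapper around MaltParser, so driving a pre-trained model can be sketched roughly as below; this assumes Java is installed, that a MaltParser distribution has been unpacked locally (the directory name is an example), and that the named English model file is present.

    # Sketch: dependency parsing with a pre-trained MaltParser model via NLTK.
    from nltk.parse.malt import MaltParser

    mp = MaltParser("maltparser-1.9.2",           # path to the unpacked distribution
                    "engmalt.linear-1.7.mco")     # pre-trained English model file
    graph = mp.parse_one("A knight rode toward the castle .".split())
    print(graph.tree())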
Article
Full-text available
This paper shows how fieldwork data can be managed using the program Toolbox together with the Natural Language Toolkit (NLTK) for the Python programming language. It provides background information about Toolbox and describes how it can be downloaded and installed. The basic functionality of the program for lexicons and texts is described, and its strengths and weaknesses are reviewed. Its underlying data format is briefly discussed, and the Toolbox processing capabilities of NLTK are introduced, showing ways in which it can be used to extend the functionality of Toolbox. This is illustrated with a few simple scripts that demonstrate basic data management tasks relevant to language documentation, such as printing out the contents of a lexicon as HTML.
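As a flavour of the scripts the article describes, here is a minimal sketch that renders a few fields of NLTK's bundled Rotokas Toolbox lexicon as an HTML table; the choice of fields (lx, ps, ge) and the five-entry slice are illustrative.

    # Sketch: print part of a Toolbox lexicon as an HTML table using NLTK.
    import nltk
    from nltk.corpus import toolbox

    nltk.download("toolbox")

    lexicon = toolbox.xml("rotokas.dic")            # parse the Toolbox file as XML

    rows = ["<table>"]
    for entry in lexicon.findall("record")[:5]:     # first five lexical entries
        lx = entry.findtext("lx") or ""             # lexeme
        ps = entry.findtext("ps") or ""             # part of speech
        ge = entry.findtext("ge") or ""             # English gloss
        rows.append("  <tr><td>%s</td><td>%s</td><td>%s</td></tr>" % (lx, ps, ge))
    rows.append("</table>")
    print("\n".join(rows))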
Article
Bresnan et al. (2007) show that a statistical model can predict United States (US) English speakers’ syntactic choices with ‘give’-type verbs extremely accurately. They argue that these results are consistent with probabilistic models of grammar, which assume that grammar is quantitative and learned from exposure to other speakers. Such a model would also predict syntactic differences across time and space which are reflected not only in the use of clear dialectal features or clear-cut changes in progress, but also in subtle factors such as the relative importance of conditioning factors, and changes over time in speakers’ preferences between equally well-formed variants. This paper investigates these predictions by comparing the grammar of phrases involving ‘give’ in New Zealand (NZ) and US English. We find that the grammar developed by Bresnan et al. for US English generalizes remarkably well to NZ English. NZ English is, however, subtly different, in that NZ English speakers appear to be more sensitive to the role of animacy. Further, we investigate changes over time in NZ English and find that the overall behavior of ‘give’ phrases has subtly shifted. We argue that these subtle differences in space and time provide support for the gradient nature of grammar, and are consistent with usage-based, probabilistic syntactic models.