Book
PDF Available

Natural Language Processing with Python

Authors: Steven Bird, Ewan Klein, Edward Loper
... Recovery Rate (RR) is the percentage of tokens in the private dataset recovered by different attack methods. We filtered punctuation, special tokens, and NLTK's stop words (Bird et al., 2009) from the private dataset. Attack Accuracy (Acc.) is the classification accuracy of the evaluation classifier on the inverted texts. ...
... The larger the distance is, the stronger the correlation between the two lists. When ranking the tokens in each dataset, we filtered out the special tokens, punctuation, and NLTK stop words (Bird et al., 2009). From the results in Table 6, we can see that word ranking and frequency are close between the private and public datasets. ...
Preprint
Full-text available
Text classification has become widely used in various natural language processing applications like sentiment analysis. Current applications often use large transformer-based language models to classify input texts. However, there is a lack of systematic study on how much private information can be inverted when publishing models. In this paper, we formulate Text Revealer, the first model inversion attack for text reconstruction against text classification with transformers. Our attacks faithfully reconstruct private texts included in training data with access to the target model. We leverage an external dataset and GPT-2 to generate fluent text resembling the target domain, and then perturb its hidden state optimally with feedback from the target model. Our extensive experiments demonstrate that our attacks are effective for datasets with different text lengths and can reconstruct private texts with accuracy.
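The excerpts above filter punctuation, special tokens, and NLTK stop words before computing the recovery rate. A minimal sketch of that kind of filtering and metric, assuming BERT-style special tokens and illustrative example texts (not the paper's data):

    import string
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Requires nltk.download('punkt') and nltk.download('stopwords').
    STOP_WORDS = set(stopwords.words('english'))
    SPECIAL_TOKENS = {'[cls]', '[sep]', '[pad]', '[unk]'}  # assumed BERT-style specials

    def content_tokens(text):
        """Lowercase, tokenize, and drop punctuation, special tokens and stop words."""
        tokens = word_tokenize(text.lower())
        return {t for t in tokens
                if t not in STOP_WORDS
                and t not in SPECIAL_TOKENS
                and not all(ch in string.punctuation for ch in t)}

    def recovery_rate(private_text, inverted_text):
        """Fraction of the filtered private-text tokens that reappear in the inverted text."""
        private = content_tokens(private_text)
        return len(private & content_tokens(inverted_text)) / max(len(private), 1)

    print(recovery_rate("The film was a complete disaster.",
                        "a disaster film , completely unwatchable"))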
... Lemmatization is performed to generalize words with the same word stem. The methods mentioned above and the stop-word list are part of the Python package nltk (Bird et al., 2009). ...
Preprint
Unsupervised sentiment analysis is traditionally performed by counting those words in a text that are stored in a sentiment lexicon and then assigning a label depending on the proportion of positive and negative words registered. While these "counting" methods are considered to be beneficial as they rate a text deterministically, their classification rates decrease when the analyzed texts are short or the vocabulary differs from what the lexicon considers default. The model proposed in this paper, called Lex2Sent, is an unsupervised sentiment analysis method that improves on the classification of sentiment lexicon methods. For this purpose, a Doc2Vec model is trained to determine the distances between document embeddings and the embeddings of the positive and negative parts of a sentiment lexicon. These distances are then evaluated for multiple executions of Doc2Vec on resampled documents and are averaged to perform the classification task. For the three benchmark datasets considered in this paper, the proposed Lex2Sent outperforms every evaluated lexicon, including state-of-the-art lexica like VADER or the Opinion Lexicon, in terms of classification rate.
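The citing excerpt above describes lemmatization and stop-word removal with the nltk package; a minimal preprocessing sketch along those lines, where the choice of lemmatizer settings and the example sentence are assumptions:

    from nltk.corpus import stopwords
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize

    # Requires nltk.download('punkt'), nltk.download('wordnet'), nltk.download('stopwords').
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words('english'))

    def preprocess(text):
        """Tokenize, drop stop words and non-alphabetic tokens, and lemmatize the rest."""
        tokens = word_tokenize(text.lower())
        return [lemmatizer.lemmatize(tok) for tok in tokens
                if tok.isalpha() and tok not in stop_words]

    print(preprocess("The movies were surprisingly entertaining."))
    # -> ['movie', 'surprisingly', 'entertaining']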
... Although Maki et al. make available a database of precomputed distances for around 50,000 word pairs, it covers only approximately 10-15% of most of the similarity datasets we set out to model here. We therefore opted to recompute Jiang-Conrath distances on WordNet using the implementation in NLTK (Natural Language Toolkit) version 3.2 (Bird et al., 2009), which covered 3776 word pairs (WordNet distance M = 11.39, SD = 6.06). ...
Article
Full-text available
Experimental design and computational modelling across the cognitive sciences often rely on measures of semantic similarity between concepts. Traditional measures of semantic similarity are typically derived from distance in taxonomic databases (e.g. WordNet), databases of participant-produced semantic features, or corpus-derived linguistic distributional similarity (e.g. CBOW), all of which are theoretically problematic in their lack of grounding in sensorimotor experience. We present a new measure of sensorimotor distance between concepts, based on multidimensional comparisons of their experiential strength across 11 perceptual and action-effector dimensions in the Lancaster Sensorimotor Norms. We demonstrate that, in modelling human similarity judgements, sensorimotor distance has comparable explanatory power to other measures of semantic similarity, explains variance in human judgements which is missed by other measures, and does so with the advantages of remaining both grounded and computationally efficient. Moreover, sensorimotor distance is equally effective for both concrete and abstract concepts. We further introduce a web-based tool ( https://lancaster.ac.uk/psychology/smdistance ) for easily calculating and visualising sensorimotor distance between words, featuring coverage of nearly 800 million word pairs. Supplementary materials are available at https://osf.io/d42q6/ .
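The excerpt above recomputes Jiang-Conrath distances over WordNet with NLTK's implementation. A minimal sketch of that computation, taking the distance as the reciprocal of NLTK's JCN similarity; the restriction to noun synsets and the use of Brown-corpus information content are assumptions:

    from nltk.corpus import wordnet as wn, wordnet_ic

    # Requires nltk.download('wordnet') and nltk.download('wordnet_ic').
    brown_ic = wordnet_ic.ic('ic-brown.dat')  # assumed IC source: Brown corpus counts

    def jcn_distance(word1, word2):
        """Smallest Jiang-Conrath distance over all noun-synset pairings of two words.
        NLTK exposes JCN as a similarity, so the distance is taken as its reciprocal."""
        best = float('inf')
        for s1 in wn.synsets(word1, pos=wn.NOUN):
            for s2 in wn.synsets(word2, pos=wn.NOUN):
                sim = s1.jcn_similarity(s2, brown_ic)
                if sim > 0:
                    best = min(best, 1.0 / sim)
        return best

    print(jcn_distance('car', 'bicycle'))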
... 1. We filter out tokens that are ambiguous, i.e. tokens that can be either a noun or a verb. We used WordNet (Miller, 1995), as implemented in the NLTK library (Bird et al., 2009) for Python 3, to check that a word was not classified as both a noun and a verb, i.e. that there was no synset in the other category. ...
Preprint
Both humans and neural language models are able to perform subject-verb number agreement (SVA). In principle, semantics shouldn't interfere with this task, which only requires syntactic knowledge. In this work we test whether meaning interferes with this type of agreement in English in syntactic structures of various complexities. To do so, we generate both semantically well-formed and nonsensical items. We compare the performance of BERT-base to that of humans, obtained with a psycholinguistic online crowdsourcing experiment. We find that BERT and humans are both sensitive to our semantic manipulation: They fail more often when presented with nonsensical items, especially when their syntactic structure features an attractor (a noun phrase between the subject and the verb that does not have the same number as the subject). We also find that the effect of meaningfulness on SVA errors is stronger for BERT than for humans, showing higher lexical sensitivity of the former on this task.
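The excerpt above uses WordNet through NLTK to keep only tokens that have noun senses but no verb senses. A small sketch of that check; the example words are illustrative:

    from nltk.corpus import wordnet as wn

    # Requires nltk.download('wordnet').
    def is_unambiguous_noun(word):
        """True if WordNet lists noun synsets for the word but no verb synsets."""
        return bool(wn.synsets(word, pos=wn.NOUN)) and not wn.synsets(word, pos=wn.VERB)

    print(is_unambiguous_noun('table'))    # False: 'table' also has verb senses
    print(is_unambiguous_noun('teacher'))  # True: noun senses only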
... Different from emotion and sentiment analysis, all utterances were used for the topic modeling analysis. In addition to the stop words provided by the NLTK Python package [15], we added 48 stop words. The words in the data were transformed into numeric representations using term frequency-inverse document frequency (TF-IDF). ...
Preprint
Transition to Adulthood is an essential life stage for many families. Prior research has shown that young people with intellectual or developmental disabilities (IDD) face more challenges than their peers. This study explores how to use natural language processing (NLP) methods, especially unsupervised machine learning, to assist psychologists in analyzing emotions and sentiments, and how to use topic modeling to identify common issues and challenges that young people with IDD and their families face. Additionally, the results were compared to those obtained from young people without IDD who were in transition to adulthood. The findings showed that NLP methods can be very useful for psychologists to analyze emotions, conduct cross-case analysis, and summarize key topics from conversational data. Our Python code is available at https://github.com/mlaricheva/emotion_topic_modeling.
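The excerpt above extends NLTK's stop-word list with 48 additional stop words before computing TF-IDF. A sketch of that pipeline using scikit-learn's vectorizer; the extra stop words and utterances below are placeholders, not the paper's list or data:

    from nltk.corpus import stopwords
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Requires nltk.download('stopwords'). The paper's 48 extra stop words are not
    # reproduced here; the list below is a hypothetical stand-in.
    extra_stop_words = ['um', 'uh', 'yeah', 'okay']
    stop_words = sorted(set(stopwords.words('english')) | set(extra_stop_words))

    utterances = ["Um, the transition planning meeting was okay.",
                  "Yeah, we talked about finding a job after graduation."]
    vectorizer = TfidfVectorizer(stop_words=stop_words)
    tfidf = vectorizer.fit_transform(utterances)
    print(vectorizer.get_feature_names_out())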
Article
Randomized prospective studies represent the gold standard for experimental design. In this paper, we present a randomized prospective study to validate the benefits of combining rule-based and data-driven natural language understanding methods in a virtual patient dialogue system. The system uses a rule-based pattern matching approach together with a machine learning (ML) approach in the form of a text-based convolutional neural network, combining the two methods with a simple logistic regression model to choose between their predictions for each dialogue turn. In an earlier, retrospective study, the hybrid system yielded a nearly 50% error reduction on our initial data, in part due to the differential performance between the two methods as a function of label frequency. Given these gains, and considering that our hybrid approach is unique among virtual patient systems, we compare the hybrid system to the rule-based system by itself in a randomized prospective study. We evaluate 110 unique medical student subjects interacting with the system over 5,296 conversation turns, to verify whether similar gains are observed in a deployed system. This prospective study broadly confirms the findings from the earlier one but also highlights important deficits in our training data. The hybrid approach still improves over either rule-based or ML approaches individually, even handling unseen classes with some success. However, we observe that live subjects ask more out-of-scope questions than expected. To better handle such questions, we investigate several modifications to the system combination component. These show significant overall accuracy improvements and modest F1 improvements on out-of-scope queries in an offline evaluation. We provide further analysis to characterize the difficulty of the out-of-scope problem that we have identified, as well as to suggest future improvements over the baseline we establish here.
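A minimal sketch of the kind of combination step described here, using a logistic regression "chooser" over the two subsystems' confidence scores; the feature set, toy data, and function names are assumptions, not the authors' implementation:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Hypothetical training data: for each dialogue turn, the rule-based match score,
    # the CNN's softmax confidence, and whether the ML prediction was the right choice.
    X = np.array([[0.90, 0.40], [0.20, 0.95], [0.10, 0.85], [0.80, 0.30]])
    y = np.array([0, 1, 1, 0])  # 1 = prefer the ML system's label for this turn

    chooser = LogisticRegression().fit(X, y)

    def combine(rule_label, ml_label, rule_score, ml_conf):
        """Return the ML label when the chooser prefers it, otherwise the rule label."""
        return ml_label if chooser.predict([[rule_score, ml_conf]])[0] else rule_label

    print(combine("asks_about_pain", "asks_about_medication", 0.15, 0.90))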
Chapter
Searchable Encryption schemes provide secure search over encrypted databases while allowing admitted information leakages. Generally, the leakages can be categorized into access and volume pattern. In most existing SE schemes, these leakages are caused by practical designs but are considered an acceptable price to achieve high search efficiency. Recent attacks have shown that such leakages could be easily exploited to retrieve the underlying keywords for search queries. Under the umbrella of attacking SE, we design a new Volume and Access Pattern Leakage-Abuse Attack (VAL-Attack) that improves the matching technique of LEAP (CCS ’21) and exploits both the access and volume patterns. Our proposed attack only leverages leaked documents and the keywords present in those documents as auxiliary knowledge and can effectively retrieve document and keyword matches from leaked data. Furthermore, the recovery performs without false positives. We further compare VAL-Attack with two recent well-defined attacks on several real-world datasets to highlight the effectiveness of our attack and present the performance under popular countermeasures.
Keywords: Searchable encryption, Access pattern, Volume pattern, Leakage, Attack
Chapter
The COVID-19 pandemic brought upon a plethora of misinformation from fake news articles and posts on social media platforms. This necessitates the task of identifying whether a particular piece of information about COVID-19 is legitimate or not. However, with excessive misinformation spreading rapidly over the internet, manual verification of sources becomes infeasible. Several studies have already explored the use of machine learning towards automating COVID-19 misinformation detection. This paper will investigate COVID-19 misinformation detection in three parts. First, we identify the common themes found in COVID-19 misinformation data using Latent Dirichlet Allocation (LDA). Second, we use CatBoost as a classifier for detecting misinformation and compare its performance against other classifiers such as SVM, XGBoost, and LightGBM. Lastly, we highlight CatBoost’s most important features and decision-making mechanism using Shapley values.
Keywords: COVID-19, Misinformation detection, CatBoost, Latent Dirichlet Allocation, Explainable AI
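A compact sketch of the two-part pipeline described in this chapter, LDA for themes and CatBoost for classification; the toy posts, feature choices, and hyperparameters are placeholders, not the authors' setup:

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.decomposition import LatentDirichletAllocation
    from catboost import CatBoostClassifier

    # Toy data standing in for a labelled corpus of COVID-19 claims (1 = misinformation).
    posts = ["5G towers spread the coronavirus",
             "Vaccines were tested in randomized clinical trials",
             "Drinking bleach cures COVID-19",
             "Masks reduce transmission in indoor settings"]
    labels = [1, 0, 1, 0]

    # Part 1: surface common themes with LDA over word counts.
    counts = CountVectorizer(stop_words='english').fit_transform(posts)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

    # Part 2: classify misinformation with CatBoost over TF-IDF features.
    tfidf = TfidfVectorizer(stop_words='english')
    X = tfidf.fit_transform(posts)
    clf = CatBoostClassifier(iterations=100, verbose=False).fit(X, labels)
    print(clf.predict(tfidf.transform(["Garlic water cures the virus"])))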
Article
Full-text available
New ways of documenting and describing language via electronic media coupled with new ways of distributing the results via the World Wide Web offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resource discovery and resource creation. This article describes the infrastructure that the Open Language Archives Community (OLAC) has built in order to address these problems. Its technical and usage infrastructures address problems of resource discovery by constructing a single virtual library of distributed resources. Its governance infrastructure addresses problems of resource creation by providing a mechanism through which the language-resource community can express its consensus on recommended best practices.
Article
Full-text available
In addition to ordinary words and names, real text contains non-standard "words" (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary "letter-to-sound" rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with respect to their interpretation or pronunciation. In many applications, it is desirable to "normalize" text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. Typical technology for text normalization involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text) with the expected result that the techniques do not usually generalize well to new domains. The purpose of the work reported here is to take some initial steps towards addressing deficiencies in previous approaches to text normalization. We developed a taxonomy of NSWs on the basis of four rather distinct text types: news text, a recipes newsgroup, a hardware-product-specific newsgroup, and real-estate classified ads. We then investigated the application of several general techniques including n-gram language models, decision trees and weighted finite-state transducers to the range of NSW types, and demonstrated that a systematic treatment can lead to better results than have been obtained by the ad hoc treatments that have typically been used in the past. For abbreviation expansion in particular, we investigated both supervised and unsupervised approaches. We report results in terms of word-error rate, which is standard in speech recognition evaluations, but which has only occasionally been used as an overall measure in evaluating text normalization systems.
Conference Paper
Full-text available
It is a truism of literature that certain authors have a highly recognizable style. The concept of style underlies the authorship attribution techniques that have been applied to tasks such as identifying which of several authors wrote a particular news article. In this paper, we explore whether the works of authors of classic literature can be correctly identified with either of two approaches to attribution, using a collection of 634 texts by 55 authors. Our results show that these methods can be highly accurate, with errors primarily for authors where it might be argued that style is lacking. And did Marlowe write the works of Shakespeare? Our preliminary evidence suggests not.
Article
Full-text available
The process of documenting and describing the world's languages is undergoing radical transformation with the rapid uptake of new digital technologies for capture, storage, annotation and dissemination. However, uncritical adoption of new tools and technologies is leading to resources that are difficult to reuse and which are less portable than the conventional printed resources they replace. We begin by reviewing current uses of software tools and digital technologies for language documentation and description. This sheds light on how digital language documentation and description are created and managed, leading to an analysis of seven portability problems under the following headings: content, format, discovery, access, citation, preservation and rights. After characterizing each problem we provide a series of value statements, and this provides the framework for a broad range of best practice recommendations.
Article
Full-text available
In this study, we developed an algorithmic method to analyze late contrast-enhanced (CE) magnetic resonance (MR) images, revealing the so-called hibernating myocardium. The algorithm is based on an efficient and robust image registration algorithm. Using ...
Article
Full-text available
In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using three criteria that only require information about the candidate type itself and are independent of context: Abbreviations can be defined as a very tight collocation consisting of a truncated word and a final period, abbreviations are usually short, and abbreviations sometimes contain internal periods. We also show the potential of collocational evidence for two other important subtasks of sentence boundary disambiguation, namely the detection of initials and ordinal numbers. The proposed system has been tested extensively on eleven different languages and on different text genres. It achieves good results without any further amendments or language-specific resources. We evaluate its performance against three different baselines and compare it to other systems for sentence boundary detection proposed in the literature.
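NLTK's punkt tokenizer implements this approach. A minimal sketch of using the pretrained English model and of training a new model on raw text; the example sentence is illustrative, and newer NLTK releases may also require the punkt_tab resource:

    import nltk
    from nltk.tokenize import sent_tokenize
    from nltk.tokenize.punkt import PunktSentenceTokenizer

    nltk.download('punkt', quiet=True)

    text = "Dr. Smith arrived at 10 a.m. He left around noon."

    # Pretrained English Punkt model.
    print(sent_tokenize(text))

    # Unsupervised training on raw text; in practice this would be a large domain corpus.
    custom = PunktSentenceTokenizer(text)
    print(custom.tokenize(text))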
Article
Full-text available
The quantification of lexical semantic relatedness has many applications in NLP, and many different measures have been proposed. We evaluate five of these measures, all of which use WordNet as their central resource, by comparing their performance in detecting and correcting real-word spelling errors. An information-content-based measure proposed by Jiang and Conrath is found superior to those proposed by Hirst and St-Onge, Leacock and Chodorow, Lin, and Resnik. In addition, we explain why distributional similarity is not an adequate proxy for lexical semantic relatedness.
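Several of the evaluated measures have WordNet-based implementations in NLTK (Hirst-St-Onge does not). A short sketch comparing them on one synset pair, with Brown-corpus information content as an assumed IC source:

    from nltk.corpus import wordnet as wn, wordnet_ic

    # Requires nltk.download('wordnet') and nltk.download('wordnet_ic').
    brown_ic = wordnet_ic.ic('ic-brown.dat')

    s1, s2 = wn.synset('car.n.01'), wn.synset('bicycle.n.01')

    print('Leacock-Chodorow:', s1.lch_similarity(s2))
    print('Resnik:          ', s1.res_similarity(s2, brown_ic))
    print('Lin:             ', s1.lin_similarity(s2, brown_ic))
    print('Jiang-Conrath:   ', s1.jcn_similarity(s2, brown_ic))
    # Hirst-St-Onge has no NLTK counterpart and is omitted here.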
Conference Paper
Full-text available
We introduce MaltParser, a data-driven parser generator for dependency parsing. Given a treebank in dependency format, MaltParser can be used to induce a parser for the language of the treebank. MaltParser supports several parsing algorithms and learning algorithms, and allows user-defined feature models, consisting of arbitrary combinations of lexical features, part-of-speech features and dependency features. MaltParser is freely available for research and educational purposes and has been evaluated empirically on Swedish, English, Czech, Danish and Bulgarian.
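NLTK provides a wrapper around MaltParser. A hedged sketch of invoking it, where both paths are placeholders and the MaltParser Java distribution plus a pretrained .mco model must be obtained separately from http://www.maltparser.org/:

    from nltk.parse.malt import MaltParser

    # Placeholder paths: point these at a local MaltParser install and model file.
    mp = MaltParser('/opt/maltparser-1.9.2', '/opt/engmalt.linear-1.7.mco')

    graph = mp.parse_one('I saw the cat on the mat .'.split())
    print(graph.to_conll(4))   # one token per line in CoNLL format
    print(graph.tree())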
Article
Full-text available
This paper shows how fieldwork data can be managed using the program Toolbox together with the Natural Language Toolkit (NLTK) for the Python programming language. It provides background information about Toolbox and describes how it can be downloaded and installed. The basic functionality of the program for lexicons and texts is described, and its strengths and weaknesses are reviewed. Its underlying data format is briefly discussed, and Toolbox processing capabilities of NLTK are introduced, showing ways in which it can be used to extend the functionality of Toolbox. This is illustrated with a few simple scripts that demonstrate basic data management tasks relevant to language documentation, such as printing out the contents of a lexicon as HTML.
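A small sketch of the kind of task described here, reading a Toolbox lexicon through NLTK's toolbox corpus reader and emitting it as HTML; the Rotokas dictionary and the chosen fields (lx, ge) are simply the sample data distributed with NLTK:

    from nltk.corpus import toolbox

    # Requires nltk.download('toolbox'); rotokas.dic ships with the NLTK data.
    lexicon = toolbox.xml('rotokas.dic')

    # Print lexeme / gloss pairs from the lexicon as a simple HTML table.
    print('<table>')
    for record in lexicon.findall('record'):
        lexeme = record.findtext('lx') or ''
        gloss = record.findtext('ge') or ''
        print(f'  <tr><td>{lexeme}</td><td>{gloss}</td></tr>')
    print('</table>')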
Article
Bresnan et al. (2007) show that a statistical model can predict United States (US) English speakers’ syntactic choices with ‘give’-type verbs extremely accurately. They argue that these results are consistent with probabilistic models of grammar, which assume that grammar is quantitative, and learned from exposure to other speakers. Such a model would also predict syntactic differences across time and space which are reflected not only in the use of clear dialectal features or clear-cut changes in progress, but also in subtle factors such as the relative importance of conditioning factors, and changes over time in speakers’ preferences between equally well-formed variants. This paper investigates these predictions by comparing the grammar of phrases involving ‘give’ in New Zealand (NZ) and US English. We find that the grammar developed by Bresnan et al. for US English generalizes remarkably well to NZ English. NZ English is, however, subtly different, in that NZ English speakers appear to be more sensitive to the role of animacy. Further, we investigate changes over time in NZ English and find that the overall behavior of ‘give’ phrases has subtly shifted. We argue that these subtle differences in space and time provide support for the gradient nature of grammar, and are consistent with usage-based, probabilistic syntactic models.