Book
PDF Available

Natural Language Processing with Python

Authors: Steven Bird, Ewan Klein, Edward Loper
... For retrieval of the definitions, we accessed WordNet through the Natural Language Toolkit (NLTK, Bird et al., 2009) and Merriam Webster Online as well as Dictionary.com through HTTP requests. ...
... Number of tokens t: We also experimented with limiting the number of tokens within a given definition to see whether definitively gendered terms were more likely to be mentioned earlier in a given definition. The definitions were tokenized using NLTK (Bird et al., 2009). We took the first t tokens of each definition. ...
... The articles were then cleaned and tokenized into sentences using NLTK (Bird et al., 2009) and subsequently processed with spaCy to obtain part-of-speech (POS) tags for each word. All singular and plural nouns (POS tags: NN, NNS) were then extracted and analysed for lexical gender. ...
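The excerpts above describe a common NLTK/spaCy preprocessing pipeline. As a minimal sketch (the sample text and variable names are ours; the `punkt` and `wordnet` downloads assume a standard NLTK data setup):

```python
import nltk
import spacy
from nltk.corpus import wordnet as wn

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)
nlp = spacy.load("en_core_web_sm")

# Retrieve dictionary definitions for a noun from WordNet.
definitions = [s.definition() for s in wn.synsets("chairwoman", pos=wn.NOUN)]

# Keep only the first t tokens of a definition.
t = 5
first_t = nltk.word_tokenize(definitions[0])[:t]

# Sentence-tokenize with NLTK, POS-tag with spaCy, extract NN/NNS nouns.
text = "The chairwoman addressed the committee. Her brother took notes."
nouns = []
for sent in nltk.sent_tokenize(text):
    nouns.extend(tok.text for tok in nlp(sent) if tok.tag_ in ("NN", "NNS"))
print(first_t, nouns)
```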
Preprint
Full-text available
This paper presents a new method for automatically detecting words with lexical gender in large-scale language datasets. Currently, the evaluation of gender bias in natural language processing relies on manually compiled lexicons of gendered expressions, such as pronouns ('he', 'she', etc.) and nouns with lexical gender ('mother', 'boyfriend', 'policewoman', etc.). However, manual compilation of such lists can lead to static information if they are not periodically updated and often involve value judgments by individual annotators and researchers. Moreover, terms not included in the list fall out of the range of analysis. To address these issues, we devised a scalable, dictionary-based method to automatically detect lexical gender that can provide a dynamic, up-to-date analysis with high coverage. Our approach reaches over 80% accuracy in determining the lexical gender of nouns retrieved randomly from a Wikipedia sample and when testing on a list of gendered words used in previous research.
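The paper's exact algorithm is not reproduced here, but a toy sketch of the dictionary-based idea might look as follows; the seed lists, the voting rule, and the optional first-t truncation are our illustrative assumptions:

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

FEMALE_SEEDS = {"woman", "female", "girl", "she", "her"}
MALE_SEEDS = {"man", "male", "boy", "he", "his"}

def lexical_gender(noun, t=None):
    """Classify a noun as 'female', 'male', or 'neutral' from its definitions."""
    f = m = 0
    for synset in wn.synsets(noun, pos=wn.NOUN):
        tokens = nltk.word_tokenize(synset.definition().lower())
        if t is not None:
            tokens = tokens[:t]  # optionally limit to the first t tokens
        f += sum(tok in FEMALE_SEEDS for tok in tokens)
        m += sum(tok in MALE_SEEDS for tok in tokens)
    if f > m:
        return "female"
    if m > f:
        return "male"
    return "neutral"

print(lexical_gender("policewoman"))  # expected: 'female'
```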
... Tokenization is the process of decomposing a sentence into words (a.k.a. tokens) [44]. In the next step, we perform stemming [45] and lemmatization [46]. In the process of stemming, we remove the suffixes from a word. ...
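A minimal sketch of the tokenize, stem, and lemmatize steps just mentioned, using NLTK's PorterStemmer and WordNetLemmatizer (one of several possible stemmer/lemmatizer choices; the sample sentence is ours):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

tokens = nltk.word_tokenize("The studies were tokenized into words")
stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
print([stemmer.stem(t) for t in tokens])          # suffix stripping: 'studi', ...
print([lemmatizer.lemmatize(t) for t in tokens])  # dictionary forms where known: 'study', ...
```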
Article
Full-text available
This paper examines the transition in the cyber-security discipline induced by the ongoing COVID-19 pandemic. Using classical information retrieval techniques, more than twenty thousand documents are analyzed for cyber content. In particular, we build topic models using the Latent Dirichlet Allocation (LDA) unsupervised machine learning algorithm. The literature corpus is built through a uniform keyword search process on scholarly and non-scholarly platforms, filtered to the years 2010-2021. To qualitatively assess the impact of the COVID-19 pandemic on cyber-security and perform a trend analysis of key themes, we organize the entire corpus into various (combinations of) categories based on time period and on whether the literature has undergone a peer review process. Based on the weighted distribution of keywords in the aggregated corpus, we identify the key themes. While in the pre-COVID-19 period the topics of cyber-threats to technology, privacy policy, and blockchain remained popular, in the post-COVID-19 period the focus has shifted to challenges brought directly or indirectly by the pandemic. In particular, we observe the post-COVID-19 cyber-security themes of privacy in healthcare, cyber insurance, and cyber risks in the supply chain gaining recognition. A few cyber-topics, such as malware and control-system security, remain important in perpetuity. We believe our work represents the evolving nature of the cyber-security discipline and reaffirms the need to tailor appropriate interventions by noting the key trends.
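As an illustration of the LDA step (not the study's actual corpus or parameters), a minimal topic-model sketch with gensim:

```python
from gensim import corpora, models

# Toy pre-tokenized corpus standing in for the cyber-security literature.
docs = [
    ["ransomware", "attack", "hospital", "privacy"],
    ["blockchain", "privacy", "policy", "regulation"],
    ["pandemic", "remote", "work", "phishing", "attack"],
]
dictionary = corpora.Dictionary(docs)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(bow_corpus, num_topics=2, id2word=dictionary,
                      random_state=0, passes=10)
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```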
... The acoustic stimuli used in the experiment were single sentences presented in speech-shaped noise. The sentences were semantically unpredictable and were generated using Python's Natural Language Toolkit (Beysolow II 2018; Bird et al. 2009). Each sentence (e.g. ...
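How the cited sentences were generated is not specified beyond the use of NLTK; one plausible sketch is expansion of a small context-free grammar, where the grammar below is our toy example:

```python
import random
from nltk import CFG
from nltk.parse.generate import generate

# A tiny grammar whose expansions are grammatical but semantically unpredictable.
grammar = CFG.fromstring("""
S -> NP VP
NP -> Det N
VP -> V NP
Det -> 'the' | 'a'
N -> 'table' | 'cloud' | 'violin'
V -> 'paints' | 'follows'
""")
sentences = [" ".join(s) for s in generate(grammar)]
print(random.choice(sentences))  # e.g. 'the cloud paints a violin'
```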
Thesis
Full-text available
Humans are highly skilled at the analysis of complex auditory scenes. In particular, the human auditory system is characterized by incredible robustness to noise and can nearly effortlessly isolate the voice of a specific talker from even the busiest of mixtures. However, the neural mechanisms underlying these remarkable properties remain poorly understood. This is mainly due to the inherent complexity of speech signals and the multi-stage, intricate processing performed in the human auditory system. Understanding these neural mechanisms underlying speech perception is of interest for clinical practice, brain-computer interfacing and automatic speech processing systems. In this thesis, we developed computational models characterizing neural speech processing across different stages of the human auditory pathways. In particular, we studied the active role of slow cortical oscillations in speech-in-noise comprehension through a spiking neural network model for encoding spoken sentences. The neural dynamics of the model during noisy speech encoding reflected speech comprehension of young, normal-hearing adults. The proposed theoretical model was validated by predicting the effects of non-invasive brain stimulation on speech comprehension in an experimental study involving a cohort of volunteers. Moreover, we developed a modelling framework for detecting the early, high-frequency neural response to uninterrupted speech in non-invasive neural recordings. We applied the method to investigate the top-down modulation of this response by the listener's selective attention and by linguistic properties of different words from a spoken narrative. We found that in both cases the detected responses, of predominantly subcortical origin, were significantly modulated, which supports the functional role of feedback between higher- and lower-level stages of the auditory pathways in speech perception. The proposed computational models shed light on some of the poorly understood neural mechanisms underlying speech perception. The developed methods can be readily employed in future studies involving a range of experimental paradigms beyond those considered in this thesis.
... • NLTK [45] contains implementations of SnowballStemmer [6] and a stop-word dictionary. ...
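A minimal sketch of the two NLTK components named in the excerpt, shown here for English:

```python
import nltk
from nltk.stem.snowball import SnowballStemmer
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)

stemmer = SnowballStemmer("english")
stop = set(stopwords.words("english"))
words = ["running", "the", "programs", "easily"]
# Drop stop words, then stem what remains.
print([stemmer.stem(w) for w in words if w not in stop])
# ['run', 'program', 'easili']
```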
Thesis
Full-text available
This paper considers the application of natural language processing methods to subject work programs, project descriptions, vacancies, and other documents, in order to compile a student profile and make recommendations on choosing various activities. The general methods used are keyword extraction and fuzzy string matching. A benchmark was also developed to estimate the "quality" of the extracted keywords. The resulting tool is integrated with the Informational System "Partner's Account".
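As a sketch of the fuzzy string matching step (the paper's actual matcher and cutoff are not specified; this version uses only the Python standard library):

```python
from difflib import SequenceMatcher

def fuzzy_match(keyword, vocabulary, cutoff=0.8):
    """Return vocabulary entries whose similarity to `keyword` meets the cutoff."""
    return [term for term in vocabulary
            if SequenceMatcher(None, keyword.lower(), term.lower()).ratio() >= cutoff]

print(fuzzy_match("machine learning", ["Machine Learning", "data mining"]))
# ['Machine Learning']
```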
Article
Multi-turn response selection is a key issue in retrieval-based chatbots and has attracted considerable attention in the NLP (Natural Language Processing) field. So far, researchers have developed many solutions that can select appropriate responses for multi-turn conversations. However, these works still suffer from the semantic mismatch problem that arises when responses and context share similar words with different meanings. In this paper, we propose a novel chatbot model based on Semantic Awareness Matching, called SAM. SAM can capture both similarity and semantic features in the context through a two-layer matching network. Appropriate responses are selected according to the matching probability produced by aggregating the two feature types. In the evaluation, we pick four widely-used datasets and compare SAM's performance to that of twelve other models. Experiment results show that SAM achieves substantial improvements, with up to 1.5% R10@1 on Ubuntu Dialogue Corpus V2, 0.5% R10@1 on Douban Conversation Corpus, and 1.3% R10@1 on E-commerce Corpus.
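The R10@1 metric reported above is recall at position 1 among 10 candidate responses. A minimal sketch, assuming one list of model scores per context with the ground-truth response at index 0 (three candidates shown here for brevity rather than ten):

```python
def recall_at_1(score_lists):
    """score_lists: one list of candidate scores per context, truth at index 0."""
    hits = sum(scores.index(max(scores)) == 0 for scores in score_lists)
    return hits / len(score_lists)

print(recall_at_1([[0.9, 0.1, 0.3], [0.2, 0.8, 0.1]]))  # 0.5
```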
Article
Full-text available
This study aims to compare physical and online travel experiences in the context of memorability and to reveal the differences between the two. The digitalization of travel is described as individuals visiting a place digitally, with the help of digital tools, rather than visiting it physically. In the first part of this study, which combines qualitative and quantitative research methods, viewer comments on the two most-watched walking-tour videos on YouTube were examined using sentiment and text analysis, two social network analysis methods. In the second part, focus group interviews with a qualitative, experimental design were conducted. The focus group participants consisted of people who had physically traveled to the two sampled cities (control group) and people who had traveled to them online (experimental group). The sentiment analysis found that positive and neutral expressions outnumbered negative ones, while the text analysis showed that individuals regarded the online walking tour as a travel experience. The focus group interviews revealed that, in the online travel experience, the video's image quality and the use of natural sound were effective in remembering the trip, whereas the memories of the physical travel experience generally concerned human interactions and social life. In both travel experiences, information and images about the place previously acquired through media strengthened individuals' recollection of their travel experiences.
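As an illustration of comment-level sentiment scoring (the study analyzed Turkish YouTube comments with unspecified tools; NLTK's English-only VADER analyzer is used here purely as a sketch of the method):

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)

sia = SentimentIntensityAnalyzer()
comments = ["This walking tour feels like I am really there!",
            "The video quality was disappointing."]
for c in comments:
    # compound ranges from -1 (negative) to +1 (positive)
    print(sia.polarity_scores(c)["compound"], c)
```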
Article
Introduction: Although previous studies have consistently demonstrated that physicians are more likely than non-physicians to experience work-related stressors prior to suicide, the specific nature of these stressors remains unknown. The current study aimed to better characterize job-related problems prior to physician suicide. Methods: The study utilized a mixed methods approach combining thematic analysis and natural language processing to develop themes representing death investigation narratives of 200 physician suicides with implicated job problems in the National Violent Death Reporting System database between 2003 and 2018. Results: Through thematic analysis, six overarching themes were identified: incapacity to work due to deterioration of physical health, substance use jeopardizing employment, interaction between mental health and work-related issues, relationship conflict affecting work, legal problems leading to work-related stress, and increased financial stress. Natural language processing analysis confirmed five of these themes and elucidated important subthemes. Conclusions: This is the first known study that integrated thematic analysis and natural language processing to characterize work-related stressors preceding physician suicide. The findings highlight the importance of bolstering systemic support for physicians experiencing job problems associated with their physical and mental health, substance use, relationships, legal matters, and finances in suicide prevention efforts.
Article
Full-text available
New ways of documenting and describing language via electronic media, coupled with new ways of distributing the results via the World Wide Web, offer a degree of access to language resources that is unparalleled in history. At the same time, the proliferation of approaches to using these new technologies is causing serious problems relating to resource discovery and resource creation. This article describes the infrastructure that the Open Language Archives Community (OLAC) has built in order to address these problems. Its technical and usage infrastructures address problems of resource discovery by constructing a single virtual library of distributed resources. Its governance infrastructure addresses problems of resource creation by providing a mechanism through which the language-resource community can express its consensus on recommended best practices.
Conference Paper
Full-text available
It is a truism of literature that certain authors have a highly recognizable style. The concept of style underlies the authorship attribution techniques that have been applied to tasks such as identifying which of several authors wrote a particular news article. In this paper, we explore whether the works of authors of classic literature can be correctly identified with either of two approaches to attribution, using a collection of 634 texts by 55 authors. Our results show that these methods can be highly accurate, with errors primarily for authors where it might be argued that style is lacking. And did Marlowe write the works of Shakespeare? Our preliminary evidence suggests not.
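The two attribution approaches compared in the paper are not reproduced here, but a toy function-word-frequency profile with cosine similarity conveys the general idea; the word list and texts are our own:

```python
from collections import Counter
from math import sqrt

FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "is", "was"]

def profile(text):
    """Relative frequencies of common function words in a text."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in FUNCTION_WORDS) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))) or 1
    return dot / norm

def attribute(unknown, known):
    """known: {author: sample text}; return the most similar author profile."""
    p = profile(unknown)
    return max(known, key=lambda a: cosine(p, profile(known[a])))

print(attribute("the sea and the sky", {"A": "the of and", "B": "it is was"}))  # 'A'
```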
Article
Full-text available
The process of documenting and describing the world's languages is undergoing radical transformation with the rapid uptake of new digital technologies for capture, storage, annotation and dissemination. However, uncritical adoption of new tools and technologies is leading to resources that are difficult to reuse and which are less portable than the conventional printed resources they replace. We begin by reviewing current uses of software tools and digital technologies for language documentation and description. This sheds light on how digital language documentation and description are created and managed, leading to an analysis of seven portability problems under the following headings: content, format, discovery, access, citation, preservation and rights. After characterizing each problem we provide a series of value statements, and this provides the framework for a broad range of best practice recommendations.
Article
Full-text available
In this study, we developed an algorithmic method to analyze late contrast-enhanced (CE) magnetic resonance (MR) images, revealing the so-called hibernating myocardium. The algorithm is based on an efficient and robust image registration algorithm. Using ...
Article
Full-text available
In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using three criteria that only require information about the candidate type itself and are independent of context: Abbreviations can be defined as a very tight collocation consisting of a truncated word and a final period, abbreviations are usually short, and abbreviations sometimes contain internal periods. We also show the potential of collocational evidence for two other important subtasks of sentence boundary disambiguation, namely the detection of initials and ordinal numbers. The proposed system has been tested extensively on eleven different languages and on different text genres. It achieves good results without any further amendments or language-specific resources. We evaluate its performance against three different baselines and compare it to other systems for sentence boundary detection proposed in the literature.
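This system is the basis of NLTK's Punkt sentence tokenizer, so the approach can be sketched directly against that API (the training-corpus file name is hypothetical):

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Unsupervised training on raw, unannotated text from the target language/genre.
raw_training_text = open("corpus.txt").read()  # hypothetical corpus file
tokenizer = PunktSentenceTokenizer(raw_training_text)
print(tokenizer.tokenize("Dr. Smith arrived at 5 p.m. He left soon after."))
```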
Article
Full-text available
The quantification of lexical semantic relatedness has many applications in NLP, and many different measures have been proposed. We evaluate five of these measures, all of which use WordNet as their central resource, by comparing their performance in detecting and correcting real-word spelling errors. An information-content-based measure proposed by Jiang and Conrath is found superior to those proposed by Hirst and St-Onge, Leacock and Chodorow, Lin, and Resnik. In addition, we explain why distributional similarity is not an adequate proxy for lexical semantic relatedness.
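All five measures are available through NLTK's WordNet interface; a minimal sketch, using Brown-corpus information-content counts for the IC-based measures:

```python
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

nltk.download("wordnet", quiet=True)
nltk.download("wordnet_ic", quiet=True)

dog, cat = wn.synset("dog.n.01"), wn.synset("cat.n.01")
brown_ic = wordnet_ic.ic("ic-brown.dat")  # information-content file

print(dog.jcn_similarity(cat, brown_ic))  # Jiang & Conrath
print(dog.lin_similarity(cat, brown_ic))  # Lin
print(dog.res_similarity(cat, brown_ic))  # Resnik
print(dog.lch_similarity(cat))            # Leacock & Chodorow
```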
Conference Paper
Full-text available
We introduce MaltParser, a data-driven parser generator for dependency parsing. Given a treebank in dependency format, MaltParser can be used to induce a parser for the language of the treebank. MaltParser supports several parsing algorithms and learning algorithms, and allows user-defined feature models, consisting of arbitrary combinations of lexical features, part-of-speech features and dependency features. MaltParser is freely available for research and educational purposes and has been evaluated empirically on Swedish, English, Czech, Danish and Bulgarian.
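NLTK ships a wrapper for MaltParser; a hedged sketch follows, where the install directory and pre-trained model name are assumptions about a local setup, and a Java runtime with the MaltParser jar is required:

```python
from nltk.parse.malt import MaltParser

# Hypothetical local paths: MaltParser install directory and a pre-trained model.
mp = MaltParser("maltparser-1.9.2", "engmalt.linear-1.7.mco")
graph = mp.parse_one("A data-driven parser learns from a treebank .".split())
print(graph.tree())
```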
Article
Full-text available
This paper shows how fieldwork data can be managed using the program Toolbox together with the Natural Language Toolkit (NLTK) for the Python programming language. It provides background information about Toolbox and describes how it can be downloaded and installed. The basic functionality of the program for lexicons and texts is described, and its strengths and weaknesses are reviewed. Its underlying data format is briefly discussed, and Toolbox processing capabilities of NLTK are introduced, showing ways in which it can be used to extend the functionality of Toolbox. This is illustrated with a few simple scripts that demonstrate basic data management tasks relevant to language documentation, such as printing out the contents of a lexicon as HTML.
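A minimal sketch of the kind of script described, reading the Rotokas sample lexicon bundled with NLTK's data and printing a simple HTML table (the field tags `lx` and `ge` follow Toolbox conventions):

```python
import nltk
from nltk.corpus import toolbox

nltk.download("toolbox", quiet=True)

lexicon = toolbox.xml("rotokas.dic")
rows = []
for entry in lexicon.findall("record")[:10]:
    lx = entry.findtext("lx") or ""  # lexeme field
    ge = entry.findtext("ge") or ""  # gloss field
    rows.append(f"<tr><td>{lx}</td><td>{ge}</td></tr>")
print("<table>\n" + "\n".join(rows) + "\n</table>")
```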
Article
In addition to ordinary words and names, real text contains non-standard “words” (NSWs), including numbers, abbreviations, dates, currency amounts and acronyms. Typically, one cannot find NSWs in a dictionary, nor can one find their pronunciation by an application of ordinary “letter-to-sound” rules. Non-standard words also have a greater propensity than ordinary words to be ambiguous with respect to their interpretation or pronunciation. In many applications, it is desirable to “normalize” text by replacing the NSWs with the contextually appropriate ordinary word or sequence of words. Typical technology for text normalization involves sets of ad hoc rules tuned to handle one or two genres of text (often newspaper-style text) with the expected result that the techniques do not usually generalize well to new domains. The purpose of the work reported here is to take some initial steps towards addressing deficiencies in previous approaches to text normalization. We developed a taxonomy of NSWs on the basis of four rather distinct text types—news text, a recipes newsgroup, a hardware-product-specific newsgroup, and real-estate classified ads. We then investigated the application of several general techniques including n-gram language models, decision trees and weighted finite-state transducers to the range of NSW types, and demonstrated that a systematic treatment can lead to better results than have been obtained by the ad hoc treatments that have typically been used in the past. For abbreviation expansion in particular, we investigated both supervised and unsupervised approaches. We report results in terms of word-error rate, which is standard in speech recognition evaluations, but which has only occasionally been used as an overall measure in evaluating text normalization systems.
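The paper's models (n-gram language models, decision trees, WFSTs) are not sketched here, but a toy regex classifier conveys the taxonomy idea; category names and patterns are illustrative simplifications:

```python
import re

# Ordered patterns: first match wins.
PATTERNS = [
    ("CURRENCY", re.compile(r"^\$\d[\d,]*(\.\d+)?$")),
    ("DATE",     re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
    ("NUMBER",   re.compile(r"^\d[\d,]*(\.\d+)?$")),
    ("ABBR",     re.compile(r"^(?:[A-Za-z]\.)+$|^[A-Za-z]{1,6}\.$|^[A-Z]{2,5}$")),
]

def classify_nsw(token):
    """Assign a token to a coarse NSW category, or ORDINARY if none match."""
    for label, pattern in PATTERNS:
        if pattern.match(token):
            return label
    return "ORDINARY"

for tok in ["$3,000", "12/05/1998", "approx.", "IBM", "house"]:
    print(tok, "->", classify_nsw(tok))
```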
Article
Bresnan et al. (2007) show that a statistical model can predict United States (US) English speakers’ syntactic choices with ‘give’-type verbs extremely accurately. They argue that these results are consistent with probabilistic models of grammar, which assume that grammar is quantitative and learned from exposure to other speakers. Such a model would also predict syntactic differences across time and space which are reflected not only in the use of clear dialectal features or clear-cut changes in progress, but also in subtle factors such as the relative importance of conditioning factors, and changes over time in speakers’ preferences between equally well-formed variants. This paper investigates these predictions by comparing the grammar of phrases involving ‘give’ in New Zealand (NZ) and US English. We find that the grammar developed by Bresnan et al. for US English generalizes remarkably well to NZ English. NZ English is, however, subtly different, in that NZ English speakers appear to be more sensitive to the role of animacy. Further, we investigate changes over time in NZ English and find that the overall behavior of ‘give’ phrases has subtly shifted. We argue that these subtle differences in space and time provide support for the gradient nature of grammar, and are consistent with usage-based, probabilistic syntactic models.
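A hedged sketch of the kind of probabilistic model at issue: a logistic regression predicting the double-object variant (‘give her the book’) versus the prepositional dative (‘give the book to her’) from a few conditioning factors. The features and toy data are our assumptions, not Bresnan et al.'s model:

```python
from sklearn.linear_model import LogisticRegression

# Features: [recipient_is_animate, recipient_is_pronoun, theme_longer_than_recipient]
X = [[1, 1, 1], [1, 0, 1], [0, 0, 0], [0, 0, 1], [1, 1, 0], [0, 1, 0]]
y = [1, 1, 0, 0, 1, 0]  # 1 = double object, 0 = prepositional dative

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[1, 1, 1]])[0][1])  # P(double object | features)
```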