
Sandra Kübler
- PhD.
- Professor at Indiana University Bloomington
About
145
Publications
38,861
Reads
3,585
Citations
Introduction
Sandra Kübler is a professor of Computational Linguistics at the Department of Linguistics, Indiana University Bloomington. Sandra does research in computational linguistics and computational corpus linguistics. She is also interested in the interface of computational/corpus linguistics and digital humanities.
Additional affiliations
August 2006 - present
Publications (145)
A vast amount of scholarly work is published daily, yet much of it remains inaccessible to the general public due to dense jargon and complex language. To address this challenge in science communication, we introduce a reinforcement learning framework that fine-tunes a language model to rewrite scholarly abstracts into more comprehensible versions....
Syntactic parsing is one of the areas in Natural Language Processing. The development of large-scale multilingual language models has enabled cross-lingual parsing approaches, which allow us to develop parsers for languages that do not have treebanks available. However, these approaches rely on the assumption that languages share orthographic repr...
A moral panic animated by conspiracy theories alleging ritual sex abuse swept through the United States in the 1980s. During that “Satanic Panic,” as it came to be known, people expressed fears of social change regarding gender and sexuality. Beginning in 2022, conservative politicians, pundits, and pastors in the United States levied similar accus...
A recent U.S. survey explored how media exposure impacts people’s beliefs in COVID-19 conspiracy theories and vaccine behavior. This study found that individuals frequently exposed to fringe social media tend to hold stronger beliefs in COVID-19 conspiracy theories and exhibit vaccine hesitancy. They also tend to possess dark personality traits, mi...
The “White Replacement” conspiracy theory, that governments and corporations are “replacing” white people, is linked to several mass shootings. Given its recent ubiquity in elite rhetoric, concerns have arisen about the popularity of this conspiracy theory among the United States mass public. Further, political scientists have noted a need to under...
We often assume that annotation tasks, such as annotating for the presence of conspiracy theories, can be annotated with hard labels, without definitions or guidelines. Our annotation experiments, comparing students and experts, show that there is little agreement on basic annotations even among experts. For this reason, we conclude that we need t...
In this position paper, we argue for a holistic perspective on threat analysis and other studies of state-sponsored or state-aligned eCrime groups. Specifically, we argue that understanding eCrime requires approaching it as a sociotechnical system and that studying such a system requires combining linguistic, regional, professional, and technical e...
Story retelling is a fundamental medium for the transmission of information between individuals and among social groups. Besides conveying factual information, stories also contain affective information. Though natural language processing techniques have advanced considerably in recent years, the extent to which machines can be trained to identify...
Native Language Identification is one of the growing subfields in Natural Language Processing (NLP). The task of Native Language Identification (NLI) is mainly concerned with predicting the native language of an author’s writing in a second language. In this paper, we investigate the performance of two types of features: content-based features vs....
This paper examines the effectiveness of different feature representations of audio data in accurately classifying discourse meaning in Spanish. The task involves determining whether an utterance is a declarative sentence, an interrogative, an imperative, etc. We explore how pitch contour can be represented for a discourse-meaning classification ta...
Understanding the individual-level characteristics associated with conspiracy theory beliefs is vital to addressing and combatting those beliefs. While researchers have identified numerous psychological and political characteristics associated with conspiracy theory beliefs, the generalizability of those findings is uncertain because they are typic...
This paper uses the tools of distributional semantics to investigate the semantic change of algo from a noun meaning ‘goods, possessions’ and an indefinite pronoun ‘something’ in the Medieval/Classical period of Spanish to an indefinite pronoun and degree adverb ‘a bit’ in contemporary Spanish. We compare the results of a previous corpus-based stud...
The overarching goal of the current study is thus to explore how malicious actors frame conspiratorial rhetoric where certain tropes and modes of emplotment formed in tweets cultivate a polarized echo chamber provoking affective responses of susceptible individuals. This study provides evidence linking past CTs with contemporary narratives broadly...
Word embeddings have recently been applied to detect and explore changes in word meaning on large historical corpora. While word embeddings are useful in many Natural Language Processing tasks, there are a number of questions that need to be addressed concerning accuracy and applicability of these methods for historical data. There is a scarce lite...
Abusive language detection has become an important tool for the cultivation of safe on-line platforms. We investigate the interaction of annotation quality and classifier performance. We use a new, fine-grained annotation scheme that allows us to distinguish between abusive language and colloquial uses of profanity that are not meant to harm. Our r...
Despite the tremendous recent progress on natural language inference (NLI), driven largely by large-scale investment in new datasets (e.g., SNLI, MNLI) and advances in modeling, most progress has been limited to English due to a lack of reliable datasets for most of the world's languages. In this paper, we present the first large-scale NLI dataset...
Translations are generally assumed to share universal features that distinguish them from texts that are originally written in the same language. Thus, we can argue that these translations constitute their own variety of a language, often called translationese. However, translations are also influenced by their source languages and thus show differ...
We present a new logic-based inference engine for natural language inference (NLI) called MonaLog, which is based on natural logic and the monotonicity calculus. In contrast to existing logic-based approaches, our system is intentionally designed to be as lightweight as possible, and operates using a small set of well-known (surface-level) monotoni...
This paper describes the UM-IU@LING's system for the SemEval 2019 Task 6: OffensEval. We take a mixed approach to identify and categorize hate speech in social media. In subtask A, we fine-tuned a BERT based classifier to detect abusive content in tweets, achieving a macro F1 score of 0.8136 on the test data, thus reaching the 3rd rank out of 103 s...
In this paper, we discuss our efforts to build a corpus for Laiholh, also called Hakha Chin. Laiholh is spoken in Chin State in Western Myanmar, in parts of India and Bangladesh, and in several Burmese refugee communities in the US. Indiana, for example, is home to about 25,000 Burmese refugees. The ultimate goal of our team is to contribute to the...
The Universal Morphology UniMorph project is a collaborative effort to improve how NLP handles complex morphology across the world's languages. The project releases annotated morphological data using a universal tagset, the UniMorph schema. Each inflected form is associated with a lemma, which typically carries its underlying lexical meaning, and a...
More than 20,000 Burmese refugees live in Indiana at the time of this writing. Most are originally from Chin State in western Myanmar and speak under-documented Tibeto-Burman languages from the Kuki-Chin branch. For the most widely spoken of them—Hakha Chin, also known as Laiholh or Hakha Lai—syntactic and morphological work exists but basic acoust...
This paper is concerned with the question of whether we can predict the future impact of a paper based on the text of the paper. We create a corpus of papers in computational linguistics, and we create gold standard impact annotations by using their Google Scholar citation counts. We use supervised classification approaches to automatically predict...
We present a machine learning approach to distinguish texts translated to Chinese (by humans) from texts originally written in Chinese, with a focus on a wide range of syntactic features. Using Support Vector Machines (SVMs) as classifier on a genre-balanced corpus in translation studies of Chinese, we find that constituent parse trees and dependen...
We investigate feature selection methods for machine learning approaches in sentiment analysis. More specifically, we use data from the cooking platform Epicurious and attempt to predict ratings for recipes based on user reviews. In machine learning approaches to such tasks, it is a common approach to use word or part-of-speech n-grams. This resul...
The CoNLL-SIGMORPHON 2017 shared task on supervised morphological generation required systems to be trained and tested in each of 52 typologically diverse languages. In sub-task 1, submitted systems were asked to predict a specific inflected form of a given lemma. In sub-task 2, systems were given a lemma and some of its specific inflected forms, a...
German is a language that is closely related to English but has a richer morphology and freer word order than English. Additionally, German has four existing major treebanks, which differ considerably in their syntactic annotation schemes. All treebanks use a combination of constituent structure and grammatical functions, but the decisions with reg...
This paper presents an overview of challenges and approaches in multilingual coreference resolution. We give an introduction to the terminology used in the field as well as to the basic approaches to resolving coreference relations and discuss the challenges that a system faces when new languages are added. Current systems show that multilinguality...
We investigate whether non-configurational languages, which display more word order variation than configurational ones, require more training data for a phenomenon to be parsed successfully. We perform a tightly controlled study comparing the dative alternation for English (a configurational language), German, and Russian (both non-configurationa...
Accessing historical texts is often a challenge because readers either do not know the historical language, or they are challenged by the technological hurdle when such texts are available digitally. Merging corpus linguistic methods and digital technology can provide novel ways of representing historical texts digitally and providing a simpler acc...
In this article, we introduce the task of word-based language identification in multilingual texts, in which every word needs to be classified with regard to its language. This task is necessary for multilingual texts in which language switches can occur within sentences, often more than once, as is the case in the texts in The Chymistry of Isaac N...
We describe the IUCL+ system for the shared task of the First Workshop on Computational Approaches to Code Switching (Solorio et al., 2014), in which participants were challenged to label each word in Twitter texts as a named entity or one of two candidate languages. Our system combines character n-gram probabilities, lexical probabilities, word la...
This paper focuses on creating a historical parallel corpus of Old Occitan and English. The 13th century Occitan narrative poem Roman de Flamenca holds a unique position in Provençal literature, and is a “universally acknowledged masterpiece of Old Occitan narrative” (Fleischmann, 1995). We show how historical investigations may benefit from such a...
It is well known that word aligned parallel corpora are valuable linguistic resources. Since many factors affect automatic alignment quality, manual post-editing may be required in some applications. While there are several state-of-the-art word-aligners, such as GIZA++ and Berkeley, there is no simple visual tool that would enable correcting and e...
SAMAR is a system for subjectivity and sentiment analysis (SSA) for Arabic social media genres. Arabic is a morphologically rich language, which presents significant complexities for standard approaches to building SSA systems designed for the English language. Apart from the difficulties presented by the social media genres processing, the Arabic...
In this paper, we develop an approach to automatically predict user ratings for recipes at Epicurious.com, based on the recipes’ reviews. We investigate two distributional methods for feature selection, Information Gain and Bi-Normal Separation; we also compare distributionally selected features to linguistically motivated features and two types of...
We describe the Indiana University system for SemEval Task 5, the L2 writing assistant task, as well as some extensions to the system that were completed after the main evaluation. Our team submitted translations for all four language pairs in the evaluation, yielding the top scores for English-German. The system is based on combining several infor...
This paper describes an ongoing effort to digitize and annotate the corpus of Le Roman de Flamenca, a 13th-century romance written in Old Occitan. The goal of this project is twofold: The first objective is to digitize one of the earliest editions of the text and to create an interactive online database that will allow parallel access to a glossary...
This paper reports on the first shared task on statistical parsing of morphologically rich languages (MRLs). The task features data sets from nine languages, each available both in constituency and dependency annotation. We report on the preparation of the data sets, on the proposed parsing scenarios, and on the evaluation metrics for parsing MRLs...
Situated dialogue corpora are invaluable resources for understanding the complex relationships among language, perception, and action. Accomplishing shared goals in the real world can often only be achieved via dynamic negotiation processes based on the interactants' common ground. In this paper, we investigate ways of systematically capturing stru...
Parsing is a key task in natural language processing. It involves predicting, for each natural language sentence, an abstract representation of the grammatical entities in the sentence and the relations between these entities. This representation provides an interface to compositional semantics and to the notions of “who did what to whom.” The last...
This work introduces a machine learning approach to the identification of mention heads needed for multilingual coreference resolution (MCR). We evaluate the method and compare it to a heuristic baseline and a rule-based approach, which are widely used in coreference resolution systems. We use the CoNLL-2012 shared task data sets, which include dat...
Currently, the database is not open to public use. If you would like to have the data file, please contact the authors.
We compare two different methods in domain adaptation applied to constituent parsing: parser combination and co-training, each used to transfer information from the source domain of news to the target domain of natural dialogs, in a setting without annotated data. Both methods outperform the baselines and reach similar results. Parser combination p...
In this paper, we present ASMA, a fast and efficient system for automatic segmentation and fine grained part of speech (POS) tagging of Modern Standard Arabic (MSA). ASMA performs segmentation both of agglutinative and of inflectional morphological boundaries within a word. In this work, we compare ASMA to two state of the art suites of MSA tools:...
This paper investigates incremental part of speech tagging for speech transcripts that contain multilingual intrasentential code-mixing, and compares the accuracy of a monolithic tagging model trained on a heterogeneous-language dataset to a model that switches between two homogeneous-language tagging models dynamically using word-by-word language...
This paper presents an investigation of part of speech (POS) tagging for Arabic as it occurs naturally, i.e. unvocalized text (without diacritics). We also do not assume any prior tokenization, although this was used previously as a basis for POS tagging. Arabic is a morphologically complex language, i.e. there is a high number of inflections per w...
This paper describes the implementation of a resource-light approach, cross-language transfer, to build and annotate a historical corpus for Old Occitan. Our approach transfers morpho-syntactic and syntactic annotation from resource-rich source languages, Old French and Catalan, to a genetically related target language, Old Occitan. The present cor...
The current work presents the participation of UBIU (Zhekova and Kübler, 2010) in the CoNLL-2012 Shared Task: Modeling Multilingual Unrestricted Coreference in OntoNotes (Pradhan et al., 2012). Our system deals with all three languages: Arabic, Chinese and English. The system results show that UBIU works reliably across all three languages, reachin...
Finding coordinations provides useful information for many NLP endeavors. However, the task has not received much attention in the literature. A major reason for that is that the annotation of major treebanks does not reliably annotate coordination. This makes it virtually impossible to detect coordinations in which two conjuncts are separated by p...
Effectively combining multiple (and complementary) sources of information is becoming one of the most promising paths for increased accuracy and more detailed analysis in numerous applications. Neuroscience, business analytics, military intelligence, and sociology are among the areas that could significantly benefit from properly processing diverse...
We develop a system for predicting the level of language learners, using only a small amount of targeted language data. In particular, we focus on learners of Hebrew and predict level based on restricted placement exam exercises. As with many language teaching situations, a major problem is data sparsity, which we account for in our feature selecti...
Due to the complexity of emotions in suicide notes and the subtle nature of sentiments, this study proposes a fusion approach to tackle the challenge of sentiment classification in suicide notes: leveraging WordNet-based lexicons, manually created rules, character-based n-grams, and other linguistic features. Although our results are not satisfying...
In this paper, we introduce the syntactic annotation of the CReST corpus, a corpus of natural language dialogues obtained from humans performing a cooperative, remote search task. The corpus contains the speech signals as well as transcriptions of the dialogues, which are additionally annotated for dialogue structure, disfluencies, and for syntax....
We propose an unsupervised training method to guide the learning of Malay derivational morphology from a set of morphological segmentations produced by a naïve morphological analyzer. Using a morphology-based language model, we first estimate the probability of a given segmentation. We train the model with EM to find the segmentation that maximizes...
In many contexts, one is confronted with the problem of extracting information from large amounts of different types of soft data (e.g., text) and hard data (e.g., from physics-based sensing systems). In handling hard data, signal and data processing offers a wealth of methods related to modeling, estimation, tracking, and inference tasks. However, s...
This paper presents an empirical study on the influence of singletons on the evaluation of coreference resolution systems. We present results on two English data sets used in the SEMEVAL 2010 shared task 1 and the CONLL 2011 shared task using the scorers of both shared tasks. We show that singletons, both in the gold standard and in the system outp...
Part of speech tagging accuracy deteriorates severely when a tagger is used out of domain. We investigate a fast method for domain adaptation, which provides additional in-domain training data from an unannotated data set by applying POS taggers with different biases to the unannotated data set and then choosing the set of sentences on which the ta...
The standard ParsEval metrics alone are often not sufficient for evaluating parsers integrated in natural language understanding systems. We propose to augment intrinsic parser evaluations by extrinsic measures in the context of human-robot interaction using a corpus from a human cooperative search task. We compare a constituent with a dependency p...
Computational approaches to morphology and syntax provides an overview of the fields of computational morphology and syntax (i.e. parsing), covering both classic techniques and the state of the art. After a general introduction, the book divides cleanly into two halves, ‘Computational approaches to morphology’ and ‘Computational approaches to synta...
Research on opinion detection has shown that a large number of opinion-labeled data are necessary for capturing subtle opinions. However, opinion-labeled data, especially at the sub-document level, are often limited. This paper describes the application of Semi-Supervised Learning (SSL) to automatically produce more labeled data and explores the po...
The term Morphologically Rich Languages (MRLs) refers to languages in which significant information concerning syntactic units and relations is expressed at word-level. There is ample evidence that the application of readily available statistical parsing models to such languages is susceptible to serious performance degradation. The first workshop...
In this paper, we compare two novel methods for part of speech tagging of Arabic without the use of gold standard word segmentation but with the full POS tagset of the Penn Arabic Treebank. The first approach uses complex tags without any word segmentation, the second approach is segmentation-based, using a machine learning segmenter. Surprisingly, w...
We present UBIU, a language independent system for detecting full coreference chains, composed of named entities, pronouns, and full noun phrases which makes use of memory based learning and a feature model following Rahman and Ng (2009). UBIU is evaluated on the task "Coreference Resolution in Multiple Languages" (SemEval Task 1 (Recasens et al.,...
The work reviewed here is a collection of thirteen chapters that document the creation, annotation, and investigation of a corpus of Xhosa English. All but two of the chapters are loosely based on previously published articles.
In the first two chapters, Vivian de Klerk describes the language situation in South Africa. Because of its recent history...