Eric K. Ringger

  • Ph.D., University of Rochester, Computer Science
  • Professor at Brigham Young University

About

81
Publications
29,410
Reads
1,932
Citations
Introduction
Eric Ringger is an Associate Professor of Computer Science at Brigham Young University. He is director of the Natural Language Processing (NLP) Lab (http://nlp.cs.byu.edu/) and is working to solve the problem of machine-assisted exploratory textual data analysis. His research contributes to NLP, text mining with topic models, lightly supervised machine learning -- including cost-conscious active learning -- and machine assistance for human language annotation tasks.
Current institution
Brigham Young University
Current position
  • Professor
Additional affiliations
July 2005 - April 2011
Brigham Young University
Position
  • Professor (Assistant)
April 2011 - present
Brigham Young University
Position
  • Professor (Associate)
July 1997 - July 2005
Microsoft
Position
  • Researcher

Publications

Publications (81)
Article
Full-text available
The application of Deep Neural Networks for ranking in search engines may obviate the need for the extensive feature engineering common to current learning-to-rank methods. However, we show that combining simple relevance matching features like BM25 with existing Deep Neural Net models often substantially improves the accuracy of these models, indi...
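The BM25 feature mentioned above is a standard term-weighting relevance score. As a reminder of what such a simple relevance-matching feature computes, here is a minimal sketch (the corpus, function name, and parameter values are illustrative, not from the paper):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc_terms, corpus, k1=1.5, b=0.75):
    """Score one document against a query with Okapi BM25.

    corpus: list of tokenized documents, used for IDF and average length.
    """
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1.0)  # smoothed IDF
        f = tf[term]
        denom = f + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * (f * (k1 + 1)) / denom
    return score

corpus = [["deep", "neural", "ranking"],
          ["bm25", "ranking", "baseline"],
          ["neural", "networks", "vision"]]
s = bm25_score(["neural", "ranking"], corpus[0], corpus)
```

In a hybrid ranker of the kind the abstract describes, a score like `s` would be appended to the neural model's feature vector rather than used alone.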
Conference Paper
Full-text available
Corpus labeling projects frequently use low-cost workers from microtask marketplaces; however, these workers are often inexperienced or have misaligned incentives. Crowdsourcing models must be robust to the resulting systematic and non-systematic inaccuracies. We introduce a novel crowdsourcing model that adapts the discrete supervised topic model...
Conference Paper
Full-text available
In modern practice, labeling a dataset often involves aggregating annotator judgments obtained from crowdsourcing. State-of-the-art aggregation is performed via inference on probabilistic models, some of which are data-aware, meaning that they leverage features of the data (e.g., words in a document) in addition to annotator judgments. Previous wor...
Conference Paper
Full-text available
Return-on-Investment (ROI) is a cost-conscious approach to active learning (AL) that considers both estimates of cost and of benefit in active sample selection. We investigate the theoretical conditions for successful cost-conscious AL using ROI by examining the conditions under which ROI would optimize the area under the cost/benefit curve. We the...
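The ROI criterion described above selects samples by estimated benefit per unit cost. A toy sketch of the selection rule (the benefit and cost functions below are placeholders, not the paper's estimators):

```python
def roi_select(candidates, benefit_fn, cost_fn):
    """Pick the candidate maximizing estimated benefit per unit cost,
    the Return-on-Investment heuristic for cost-conscious active learning."""
    return max(candidates, key=lambda x: benefit_fn(x) / cost_fn(x))

# Hypothetical pool: model uncertainty as a benefit proxy,
# sentence length as a cost proxy (both stand-ins).
candidates = [
    {"id": "a", "uncertainty": 0.9, "length": 30},  # high benefit, high cost
    {"id": "b", "uncertainty": 0.6, "length": 10},  # best benefit/cost ratio
    {"id": "c", "uncertainty": 0.2, "length": 5},
]
best = roi_select(candidates,
                  benefit_fn=lambda x: x["uncertainty"],
                  cost_fn=lambda x: x["length"])
```

Note that the most uncertain sample ("a") is not chosen; ROI trades raw benefit against annotation cost, which is exactly the behavior the paper analyzes.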
Conference Paper
Full-text available
Data annotation in modern practice often involves multiple, imperfect human annotators. Multiple annotations can be used to infer estimates of the ground-truth labels and to estimate individual annotator error characteristics (or reliability). We introduce MOMRESP, a model that improves upon item response models to incorporate information from both...
Conference Paper
Full-text available
We describe an under-studied problem in language resource management: that of providing automatic assistance to annotators working in exploratory settings. When no satisfactory tagset already exists, such as in under-resourced or undocumented languages, it must be developed iteratively while annotating data. This process naturally gives rise to a s...
Conference Paper
Full-text available
The task of corpus-dictionary linkage (CDL) is to annotate each word in a corpus with a link to an appropriate dictionary entry that documents the sense and usage of the word. Corpus-dictionary linked resources include concordances, dictionaries with word usage examples, and corpora annotated with lemmas or word senses. Such CDL resources are essen...
Conference Paper
Full-text available
As the digitization of historical documents, such as newspapers, becomes more common, the archive patron's need for accurate digital text from those documents increases. Building on our earlier work, the contributions of this paper are: 1. demonstrating the applicability of novel methods for correcting optical character recognition (OCR) on...
Article
Full-text available
Machine assistance is vital to managing the cost of corpus annotation projects. Identifying effective forms of machine assistance through principled evaluation is particularly important and challenging in under-resourced domains and highly heterogeneous corpora, as the quality of machine assistance varies. We perform a fine-grained evaluation of tw...
Conference Paper
We present computational models capable of understanding and conveying concepts based on word associations. We discover word associations automatically using corpus-based semantic models with Wikipedia as the corpus. The best model effectively combines corpus-based models with preexisting databases of free association norms gathered from human volu...
Conference Paper
Full-text available
Our previous work has shown that the error correction of optical character recognition (OCR) on degraded historical machine-printed documents is improved with the use of multiple information sources and multiple OCR hypotheses including from multiple document image binarizations. The contributions of this paper are in demonstrating how diversity...
Conference Paper
Full-text available
For noisy, historical documents, a high optical character recognition (OCR) word error rate (WER) can render the OCR text unusable. Since image binarization is often the method used to identify foreground pixels, a significant body of research has sought to improve image-wide binarization directly. Instead of relying on any one imperfect binarizati...
Article
Supervised topic models are promising tools for text analytics that simultaneously model topical patterns in document collections and relationships between those topics and document metadata, such as timestamps. We examine empirically the effect of OCR noise on the ability of supervised topic models to produce high quality output through a series o...
Chapter
Despite popular use of Latent Dirichlet Allocation (LDA) for automatic discovery of latent topics in document corpora, such topics lack connections with relevant knowledge sources such as Wikipedia, and they can be difficult to interpret due to the lack of meaningful topic labels. Furthermore, the topic analysis suffers from a lack of identifiabili...
Conference Paper
Full-text available
Manual annotation of large textual corpora can be cost-prohibitive, especially for rare and under-resourced languages. One potential solution is pre-annotation: asking human annotators to correct sentences that have already been annotated, usually by a machine. Another potential solution is correction propagation: using annotator corrections to dyn...
Conference Paper
Full-text available
Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topic...
Conference Paper
Full-text available
Artifact-based research provides a mechanism whereby researchers may study the creation of software yet avoid many of the difficulties of direct observation and experimentation. However, there are still many challenges that can affect the quality of artifact-based studies, especially those studies examining software evolution. Large commits, which...
Conference Paper
Full-text available
We present an analysis of developer communication in the Apache HTTP Server project. Using topic modeling techniques we expose latent conceptual sub-communities arising from developer specialization within the greater developer population. However, we found that among the major contributors to the project, very little specialization exists. We pres...
Conference Paper
Full-text available
This paper presents a novel method for improving optical character recognition (OCR). The method employs the progressive alignment of hypotheses from multiple OCR engines followed by final hypothesis selection using maximum entropy classification methods. The maximum entropy models are trained on a synthetic calibration data set. Although progressi...
Conference Paper
Full-text available
Optical character recognition (OCR) systems differ in the types of errors they make, particularly in recognizing characters from degraded or poor quality documents. The problem is how to correct these OCR errors, which is the first step toward more effective use of the documents in digital libraries. This paper demonstrates the degree to which the...
Conference Paper
Full-text available
Expert human input can contribute in various ways to facilitate automatic annotation of natural language text. For example, a part-of-speech tagger can be trained on labeled input provided offline by experts. In addition, expert input can be solicited by way of active learning to make the most of annotator expertise. However, hiring individuals to...
Conference Paper
Full-text available
We introduce CCASH (Cost-Conscious Annotation Supervised by Humans), an extensible web application framework for cost-efficient annotation. CCASH provides a framework in which cost-efficient annotation methods such as Active Learning can be explored via user studies and afterwards applied to large annotation projects. CCASH's architecture is descr...
Conference Paper
Full-text available
Named entity recognition applied to scanned and OCRed historical documents can contribute to the discoverability of historical information. However, entity recognition from some historical corpora is much more difficult than from natively digital text because of the marked presence of word errors and absence of page layout information. How difficul...
Conference Paper
Full-text available
We are interested in diacritizing Semitic languages, especially Syriac, using only diacritized texts. Previous methods have required the use of tools such as part-of-speech taggers, segmenters, morphological analyzers, and linguistic rules to produce state-of-the-art results. We present a low-resource, data-driven, and language-independent approac...
Conference Paper
Full-text available
Models of latent document semantics such as the mixture of multinomials model and Latent Dirichlet Allocation have received substantial attention for their ability to discover topical semantics in large collections of text. In an effort to apply such models to noisy optical character recognition (OCR) text output, we endeavor to understand the effe...
Conference Paper
Full-text available
We define a probabilistic morphological analyzer using a data-driven approach for Syriac in order to facilitate the creation of an annotated corpus. Syriac is an under-resourced Semitic language for which there are no available language tools such as morphological analyzers. We introduce novel probabilistic models for segmentation, dictionary linka...
Conference Paper
Full-text available
Individual optical character recognition (OCR) engines vary in the types of errors they commit in recognizing text, particularly poor quality text. By aligning the output of multiple OCR engines and taking advantage of the differences between them, the error rate based on the aligned lattice of recognized words is significantly lower than the i...
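One simple way to exploit an aligned lattice of multiple OCR outputs is position-wise voting over the aligned hypotheses. A toy sketch of the idea (the paper's lattice-based combination is more sophisticated; names here are mine):

```python
from collections import Counter

def vote_aligned(hypotheses):
    """Combine aligned OCR hypotheses by majority vote at each position.

    hypotheses: one token list per engine, all the same length after
    alignment; '' marks an epsilon/gap inserted by the aligner.
    """
    out = []
    for column in zip(*hypotheses):
        token, _ = Counter(column).most_common(1)[0]
        if token:                 # drop positions where the vote is a gap
            out.append(token)
    return out

# Three hypothetical engines, each wrong in a different place:
engines = [
    ["the", "qu1ck", "brown", "fox"],
    ["the", "quick", "brovn", "fox"],
    ["tne", "quick", "brown", "fox"],
]
combined = vote_aligned(engines)
```

Because the engines' errors fall at different positions, the majority reading at each aligned column recovers the clean text even though no single engine produced it.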
Conference Paper
Full-text available
In this paper, we consider a sentiment regression problem: summarizing the overall sentiment of a review with a real-valued score. Empirical results on a set of labeled reviews show that real-valued sentiment modeling is feasible, as several algorithms improve upon baseline performance. We also analyze performance as the granularity of the classifi...
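The sentiment-regression framing above maps a review to a real-valued score rather than a discrete class. A minimal sketch with one hand-built feature and ordinary least squares (the feature, data, and function name are made up; the paper evaluates several real regression algorithms):

```python
def fit_linear(xs, ys):
    """Ordinary least squares for a single feature: y ~ a*x + b."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var = sum((x - mx) ** 2 for x in xs)
    a = cov / var
    return a, my - a * mx

# Toy feature: count of positive minus negative lexicon words per review.
xs = [3, 1, 0, -2, -3]            # invented polarity counts
ys = [4.5, 3.5, 3.0, 1.5, 1.0]    # invented gold star ratings
a, b = fit_linear(xs, ys)
```

A positive slope `a` means more positive lexicon hits predict a higher real-valued sentiment score, which is the kind of relationship the regression models in the paper learn from richer features.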
Conference Paper
Full-text available
Model-based algorithms are emerging as a preferred method for document clustering. As computing resources improve, methods such as Gibbs sampling have become more common for parameter estimation in these models. Gibbs sampling is well understood for many applications, but has not been extensively studied for use in document clustering. We explore t...
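For the mixture-of-multinomials clustering model mentioned above, one sweep of collapsed Gibbs sampling resamples each document's cluster label from its conditional given all other labels. A sketch under standard Dirichlet priors (all names and hyperparameters here are mine, not the paper's):

```python
import math
import random
from collections import Counter

def gibbs_sweep(docs, labels, K, V, alpha=1.0, beta=0.1, rng=random):
    """One collapsed Gibbs sweep for mixture-of-multinomials clustering.

    docs: list of token lists; labels: current cluster assignment per doc;
    K: number of clusters; V: vocabulary size.  Mutates and returns labels.
    """
    sizes = Counter(labels)
    word_counts = [Counter() for _ in range(K)]
    totals = [0] * K
    for doc, z in zip(docs, labels):
        word_counts[z].update(doc)
        totals[z] += len(doc)
    for i, doc in enumerate(docs):
        z = labels[i]
        sizes[z] -= 1                    # remove doc i from the counts
        word_counts[z].subtract(doc)
        totals[z] -= len(doc)
        logps = []
        for k in range(K):               # log p(z_i = k | everything else)
            lp = math.log(sizes[k] + alpha)
            seen = Counter()
            for pos, w in enumerate(doc):    # collapsed multinomial predictive
                lp += math.log((word_counts[k][w] + seen[w] + beta)
                               / (totals[k] + pos + V * beta))
                seen[w] += 1
            logps.append(lp)
        m = max(logps)                   # sample the new label
        weights = [math.exp(lp - m) for lp in logps]
        r = rng.random() * sum(weights)
        z_new = K - 1
        for k, wgt in enumerate(weights):
            r -= wgt
            if r <= 0:
                z_new = k
                break
        labels[i] = z_new
        sizes[z_new] += 1
        word_counts[z_new].update(doc)
        totals[z_new] += len(doc)
    return labels

docs = [["a", "a", "b"], ["a", "b", "b"], ["x", "y"], ["y", "x", "x"]]
labels = gibbs_sweep(docs, [0, 1, 0, 1], K=2, V=4, rng=random.Random(0))
```

Running many such sweeps and inspecting how the label chain mixes is exactly the kind of behavior the paper studies for document clustering.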
Conference Paper
Full-text available
Fixed, limited budgets often constrain the amount of expert annotation that can go into the construction of annotated corpora. Estimating the cost of annotation is the first step toward using annotation resources wisely. We present here a study of the cost of annotation. This study includes the participation of annotators at various skill levels an...
Conference Paper
Full-text available
Traditional Active Learning (AL) techniques assume that the annotation of each datum costs the same. This is not the case when annotating sequences; some sequences will take longer than others. We show that the AL technique which performs best depends on how cost is measured. Applying an hourly cost model based on the results of an annotation u...
Conference Paper
Full-text available
In the construction of a part-of-speech annotated corpus, we are constrained by a fixed budget. A fully annotated corpus is required, but we can afford to label only a subset. We train a Maximum Entropy Markov Model tagger from a labeled subset and automatically tag the remainder. This paper addresses the question of where to focus our manual taggi...
Conference Paper
In corpus creation human annotation is expensive. Annotation costs can be minimized through machine learning and active learning, however there are many complex interactions among the machine learner, the active learning technique, the annotation cost, human annotation accuracy, the annotator user interface, and several other elements of the proces...
Conference Paper
Full-text available
We develop dependency parsers for Arabic, English, Chinese, and Czech using Bayes Point Machines, a training algorithm which is as easy to implement as the perceptron yet competitive with large margin methods. We achieve results comparable to state-of-the-art in English and Czech, and report the first directed dependency parsing accuracies...
Conference Paper
Full-text available
We present a prototype system, code-named Pulse, for mining topics and sentiment orientation jointly from free text customer feedback. We describe the application of the prototype system to a database of car reviews. Pulse enables the exploration of large quantities of customer free text. The user can examine customer opinion "at a glance" or...
Conference Paper
Full-text available
Email has evolved from a mere communication system to a general tool for organizing workflow (Whittaker and Sidner, 1996; Cadiz et al., 2001; Bellotti et al., 2003). Widely used email clients address this issue by providing additional functionality, including contact management, calendaring and task lists (i.e., "to do" lists). However, some of the...
Article
Full-text available
We present several statistical models of syntactic constituent order for sentence realization. We compare several models, including simple joint models inspired by existing statistical parsing models, and several novel conditional models. The conditional models leverage a large set of linguistic features without manual feature selection. We apply a...
Article
Full-text available
We describe SmartMail, a prototype system for automatically identifying action items (tasks) in email messages. SmartMail presents the user with a task-focused summary of a message. The summary consists of a list of action items extracted from the message. The user can add these action items to their "to do" list.
Article
Full-text available
We describe the adaptation to French of a machine-learned sentence realization system called Amalgam that was originally developed to be as language independent as possible and was first implemented for German. We discuss the development of the French implementation with particular attention to the degree to which the original system could be reuse...
Article
We show that it is possible to learn the contexts for linguistic operations which map a semantic representation to a surface syntactic tree in sentence realization with high accuracy. We cast the problem of learning the contexts for the linguistic operations as classification tasks, and apply straightforward machine learning techniques, such as dec...
Article
Full-text available
We present an overview of Amalgam, a sentence realization module that combines machine-learned and knowledge-engineered components to produce natural language sentences from logical form inputs.
Conference Paper
Full-text available
We show that it is possible to learn the contexts for linguistic operations which map a semantic representation to a surface syntactic tree in sentence realization with high accuracy. We cast the problem of learning the contexts for the linguistic operations as classification tasks, and apply straightforward machine learning techniques, such as dec...
Conference Paper
Full-text available
We profile the occurrence of clausal extraposition in corpora from different domains and demonstrate that extraposition is a pervasive phenomenon in German that must be addressed in German sentence realization. We present two different approaches to the modeling of extraposition, both based on machine learned decision tree classifiers. The two appr...
Article
Full-text available
We describe a punctuation insertion model used in the sentence realization module of a natural language generation system for English and German. The model is based on a decision tree classifier that uses linguistically sophisticated features. The classifier outperforms a word n-gram model trained on the same data.
Article
Full-text available
The main goal of the present work is to explore the use of rich lexical information in language modelling. We reformulated the task of a language model from predicting the next word given its history to predicting simultaneously both the word and a tag encoding various types of lexical information. Using part-of-speech tags and syntactic/semantic f...
Article
Full-text available
We present a post-processing technique for correcting errors committed by an arbitrary continuous speech recognizer. The technique leverages our observation that consistent recognition errors arising from mismatched training and usage conditions can be modeled and corrected. We have implemented a post-processor called SPEECHPP to correct word-level err...
Article
Full-text available
Data sparseness has been regularly indicted as the primary problem in statistical language modelling. We go one step further to consider the situation when no text data is available for the target domain. We present two techniques for building efficient language models quickly for new domains. The first technique is based on using a context-free gr...
Article
Full-text available
This paper presents a new technique for overcoming several types of speech recognition errors by post-processing the output of a continuous speech recognizer. The post-processor output contains fewer errors, thereby making interpretation by higher-level modules, such as a parser, in a speech understanding system more reliable. The primary advantage...
Article
Full-text available
This paper describes a system that leads us to believe in the feasibility of constructing natural spoken dialogue systems in task-oriented domains. It specifically addresses the issue of robust interpretation of speech in the presence of recognition errors. Robustness is achieved by a combination of statistical error post-correction, syntactically-...
Conference Paper
Full-text available
A major hindrance to rendering spoken dialog systems capable of ongoing, continuous listening without requiring a push-to-talk device is the problem of distinguishing speech which is intended for the system from that which is overheard. We present a decision-theoretic approach to this problem that exploits Bayesian models of spoken dialog at four...
Article
Thesis (Ph. D.)--University of Rochester. Dept. of Computer Science, 2000. Simultaneously published in the Technical Report series. The focus of this thesis is to improve the ability of a computational system to understand spoken utterances in a dialogue with a human. Available computational methods for word recognition do not perform as well on sp...
Article
Spoken dialogue: The AGS demonstrator. In Proceedings of the International Conference on Spoken Language Processing, 1996. There are many more integrated spoken dialogue systems, mainly in Eurospeech and ICSLP proceedings. 2 Relevant Journals and Conferences: Computational Linguistics, MIT Press; International Journal for Human Computer...
Article
This paper describes a system that leads us to believe in the feasibility of constructing natural spoken dialogue systems in task-oriented domains. It specifically addresses the issue of robust interpretation of speech in the presence of recognition errors. Robustness is achieved by a combination of statistical error post-correction, syntactically-...
Article
Full-text available
We have implemented a post-processor called SPEECHPP to correct word-level errors committed by an arbitrary speech recognizer. Applying a noisy-channel model, SPEECHPP uses a Viterbi beam-search that employs language and channel models. Previous work demonstrated that a simple word-for-word channel model was sufficient to yield substantial increases i...
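The noisy-channel formulation above scores corrections by a channel model and a language model together. A single-word sketch of the idea (using negative edit distance as a stand-in channel score; SPEECHPP's channel and language models are learned from data, and it searches over whole word sequences with a Viterbi beam, not one word at a time):

```python
import math

def edit_distance(a, b):
    """Plain Levenshtein distance (insert/delete/substitute, unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def correct_word(observed, lexicon, lm_logp):
    """Noisy-channel correction: argmax_w log P(observed | w) + log P(w).
    The channel term is approximated here as -edit_distance (an assumption)."""
    return max(lexicon, key=lambda w: -edit_distance(observed, w) + lm_logp(w))

# Toy unigram language model (probabilities invented for illustration):
lm = {"the": math.log(0.90), "tea": math.log(0.05), "ten": math.log(0.05)}
fixed = correct_word("teh", ["the", "tea", "ten"], lambda w: lm[w])
```

Note the interplay the abstract relies on: "tea" is closer to "teh" under the channel model alone, but the language model's strong prior for "the" overrides it.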
Article
Full-text available
This document describes the design and implementation of TRAINS-96, a prototype mixedinitiative planning assistant system. The TRAINS-96 system helps a human manager solve routing problems in a simple transportation domain. It interacts with the human using spoken, typed, and graphical input and generates spoken output and graphical map displays. T...
Conference Paper
Full-text available
This paper describes a system that leads us to believe in the feasibility of constructing natural spoken dialogue systems in task-oriented domains. It specifically addresses the issue of robust interpretation of speech in the presence of recognition errors. Robustness is achieved by a combination of statistical error post-correction, syntactically-...
Preprint
This paper describes a system that leads us to believe in the feasibility of constructing natural spoken dialogue systems in task-oriented domains. It specifically addresses the issue of robust interpretation of speech in the presence of recognition errors. Robustness is achieved by a combination of statistical error post-correction, syntactically-...
Technical Report
Full-text available
This document describes the design and implementation of TRAINS-96, a prototype mixed-initiative planning assistant system. The TRAINS-96 system helps a human manager solve routing problems in a simple transportation domain. It interacts with the human using spoken, typed, and graphical input and generates spoken output and graphical map displays....
Article
Full-text available
The focus of this thesis proposal is to improve the ability of a computational system to understand spoken utterances in a dialogue with a human. Available computational methods for word recognition do not perform as well on spontaneous speech as we would hope. Even a state of the art recognizer achieves slightly worse than 70% word accuracy on (ne...
Article
Full-text available
We describe the automatic conversion of English Penn Treebank (PTB) annotations into Language Neutral Syntax (LNS) (Campbell and Suzuki, 2002a,b). In this paper, we describe LNS and why it is useful, describe the conversion algorithm, present an evaluation of the conversion, and discuss some uses of the converted annotations and the potential for e...
Article
Full-text available
Named entity recognition from scanned and OCRed historical documents can contribute to historical research. However, entity recognition from historical documents is more difficult than from natively digital data because of the presence of word errors and the absence of complete formatting information. We apply four extraction algorithms to vario...
Article
Full-text available
We present a series of models for doing statistical machine translation based on labeled semantic dependency graphs. We describe how these models were employed to augment an existing example-based MT system, and present results showing that doing so led to a significant improvement in translation quality as measured by the BLEU metric.
Article
Full-text available
This paper presents the French implementation of Amalgam, a machine-learned sentence realization system. It presents in some detail two of the machine-learned models employed in Amalgam and shows how linguistic intuition and knowledge can be combined with statistical techniques to improve the performance of the models.
Article
Full-text available
Topic models have been shown to reveal the semantic content in large corpora. Many individualized visualizations of topic models have been reported in the literature, showing the potential of topic models to give valuable insight into a corpus. However, good, general tools for browsing the entire output of a topic model along with the analyzed co...
Article
Full-text available
This paper describes a method for conducting evaluations of Treebank and non-Treebank parsers alike against the English language U. Penn Treebank (Marcus et al., 1993) using a metric that focuses on the accuracy of relatively non-controversial aspects of parse structure. Our conjecture is that if we focus on maximal projections of heads (MPH), we a...
