About
64 Publications
23,444 Reads
5,755 Citations (since 2017)
Publications (64)
Selecting a birth control method is a complex healthcare decision. While birth control methods provide important benefits, they can also cause unpredictable side effects and be stigmatized, leading many people to seek additional information online, where they can find reviews, advice, hypotheses, and experiences of other birth control users. Howeve...
Transformers allow attention between all pairs of tokens, but there is reason to believe that most of these connections - and their quadratic time and memory - may not be necessary. But which ones? We evaluate the impact of sparsification patterns with a series of ablation experiments. First, we compare masks based on syntax, lexical similarity, an...
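The experimental device is easy to picture: ordinary scaled dot-product attention with a boolean mask deciding which token pairs survive. A minimal sketch, assuming numpy, with a local-window pattern standing in for one of the evaluated masks (all names here are illustrative, not from the paper):

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_attention(Q, K, V, mask):
    # mask[i, j] == True means query token i may attend to key token j.
    scores = (Q @ K.T) / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -1e9)  # block disallowed pairs
    return softmax(scores) @ V

def window_mask(n, w):
    # One candidate sparsification pattern: attend only within distance w.
    idx = np.arange(n)
    return np.abs(idx[:, None] - idx[None, :]) <= w

rng = np.random.default_rng(0)
n, d = 8, 16
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
out = masked_attention(Q, K, V, window_mask(n, w=2))

Swapping in masks built from syntax or lexical similarity, as the abstract describes, only changes the mask constructor.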
Across many data domains, co-occurrence statistics about the joint appearance of objects are powerfully informative. By transforming unsupervised learning problems into decompositions of co-occurrence statistics, spectral methods provide transparent and efficient algorithms for posterior inference such as latent topic analysis and community dete...
We explore Boccaccio's Decameron to see how digital humanities tools can be used for tasks that have limited data in a language no longer in contemporary use: medieval Italian. We focus our analysis on the question: Do the different storytellers in the text exhibit distinct personalities? To answer this question, we curate and release a dataset bas...
Much of the progress in contemporary NLP has come from learning representations, such as masked language model (MLM) contextual embeddings, that turn challenging problems into simple classification tasks. But how do we quantify and explain this effect? We adapt general tools from computational learning theory to fit the specific characteristics of...
Through a computational reading of the online book reviewing community LibraryThing, we examine the dynamics of a collaborative tagging system and learn how its users refine and redefine literary genres. LibraryThing tags are overlapping and multi-dimensional, created in a shared space by thousands of users, including readers, bookstore owners, and...
Social scientists are using computational tools to expand their content research beyond what is humanly readable. This often requires filtering corpora for complex research concepts. The commonly used off-the-shelf filtering techniques are untested at this task. Dictionaries may not recognize language outside of investigators’ expectations and thre...
Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations. Such granular annotation is rare, expensive, and unavailable in most domain-specific contexts. In contrast, unlabeled multi-image, multi-sentence documents are abundant. Can lexical grounding be learned from s...
Clustering token-level contextualized word representations produces output that shares many similarities with topic models for English text collections. Unlike clusterings of vocabulary-level word embeddings, the resulting models more naturally capture polysemy and can be used as a way of organizing documents. We evaluate token clusterings trained...
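A minimal sketch of the general recipe, assuming a BERT encoder from the transformers library and k-means from scikit-learn as the clustering step (the paper's exact models and hyperparameters may differ):

import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.cluster import KMeans

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

docs = ["The river bank flooded.", "The bank raised interest rates."]
vecs, words = [], []
for doc in docs:
    inputs = tok(doc, return_tensors="pt")
    with torch.no_grad():
        hidden = enc(**inputs).last_hidden_state[0]  # (tokens, dim)
    tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    for vec, word in zip(hidden, tokens):
        if word not in ("[CLS]", "[SEP]"):
            vecs.append(vec.numpy())
            words.append(word)

labels = KMeans(n_clusters=2, n_init=10).fit_predict(vecs)
# The two occurrences of "bank" can land in different clusters, which is
# how token-level clusterings capture polysemy.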
In this article we describe our experiences with computational text analysis involving rich social and cultural concepts. We hope to achieve three primary goals. First, we aim to shed light on thorny issues not always at the forefront of discussions about computational text analysis methods. Second, we hope to provide a set of key questions that ca...
Objective
The pain numerical rating scale (NRS) is widely used in pain research and clinical settings to represent pain intensity. For an individual with chronic pain, NRS reporting requires representation of a complex subjective state as a numeral. To evaluate the process of NRS reporting, this study examined the relationship between reported pain...
Birth stories have become increasingly common on the internet, but they have received little attention as a computational dataset. These unsolicited, publicly posted stories provide rich descriptions of decisions, emotions, and relationships during a common but sometimes traumatic medical experience. These personal details can be illuminating for m...
In this article we describe our experiences with computational text analysis. We hope to achieve three primary goals. First, we aim to shed light on thorny issues not always at the forefront of discussions about computational text analysis methods. Second, we hope to provide a set of best practices for working with thick social and cultural concept...
Images and text co-occur everywhere on the web, but explicit links between images and sentences (or other intra-document textual units) are often not annotated by users. We present algorithms that successfully discover image-sentence relationships without relying on any explicit multimodal annotation. We explore several variants of our approach on...
Word embeddings are increasingly being used as a tool to study word associations in specific corpora. However, it is unclear whether such embeddings reflect enduring properties of language or if they are sensitive to inconsequential variations in the source documents. We find that nearest-neighbor distances are highly sensitive to small changes in...
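One way to picture the stability test: retrain embeddings on bootstrap samples of the corpus and measure how much a query word's nearest-neighbor set moves. A toy sketch assuming gensim's Word2Vec; the corpus and query word are invented placeholders:

import random
from gensim.models import Word2Vec

corpus = [["government", "policy", "tax", "law"],
          ["government", "election", "vote", "law"],
          ["tax", "policy", "economy", "vote"]] * 200

def neighbors(docs, seed, query="government", k=3):
    model = Word2Vec(docs, vector_size=25, min_count=1, seed=seed, workers=1)
    return {w for w, _ in model.wv.most_similar(query, topn=k)}

sets = []
for s in range(5):
    random.seed(s)
    boot = [random.choice(corpus) for _ in range(len(corpus))]  # bootstrap sample
    sets.append(neighbors(boot, seed=s))

pairs = [(a, b) for i, a in enumerate(sets) for b in sets[i + 1:]]
print(sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs))
# Low average Jaccard overlap signals unstable nearest neighbors.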
Multimodal machine learning algorithms aim to learn visual-textual correspondences. Previous work suggests that concepts with concrete visual manifestations may be easier to learn than concepts with abstract ones. We give an algorithm for automatically computing the visual concreteness of words and topics within multimodal datasets. We apply the ap...
In this paper we study the effects of a radical right party's entry into a national parliament on parliamentary discourse. We follow the classification developed by Meguid (2008) and use a probabilistic topic model approach to analyze the 300,000 speeches delivered in the Swedish parliament between 1994 and 2017. Our results indicate that immigration...
Spectral topic modeling algorithms operate on matrices/tensors of word co-occurrence statistics to learn topic-specific word distributions. This approach removes the dependence on the original documents and produces substantial gains in efficiency and provable topic inference, but at a cost: the model can no longer provide information about the top...
The anchor words algorithm performs provably efficient topic model inference by finding an approximate convex hull in a high-dimensional word co-occurrence space. However, the existing greedy algorithm often selects poor anchor words, reducing topic quality and interpretability. Rather than finding an approximate convex hull in a high-dimensional s...
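For reference, the greedy farthest-point step this work improves on fits in a few lines; the co-occurrence rows below are a random stand-in for real statistics:

import numpy as np

def greedy_anchors(Q, k):
    # Q: rows are words' normalized co-occurrence profiles.
    R = Q.astype(float).copy()
    anchors = []
    for _ in range(k):
        norms = (R * R).sum(axis=1)
        a = int(np.argmax(norms))      # farthest remaining point
        anchors.append(a)
        b = R[a] / np.sqrt(norms[a])   # unit vector toward this anchor
        R -= np.outer(R @ b, b)        # project it out of every row
    return anchors

rng = np.random.default_rng(0)
Q = rng.dirichlet(np.ones(100), size=500)
print(greedy_anchors(Q, k=5))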
Duplicate documents are a pervasive problem in text datasets and can have a strong effect on unsupervised models. Methods to remove duplicate texts are typically heuristic or very expensive, so it is vital to know when and why they are needed. We measure the sensitivity of two latent semantic methods to the presence of different levels of document...
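A minimal sketch of the kind of deduplication whose presence or absence the study measures: exact matching by hash, plus a cheap shingle-overlap test for near duplicates (the shingle size and 0.8 threshold are illustrative):

import hashlib

def shingles(text, n=5):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def is_near_dup(a, b, threshold=0.8):
    sa, sb = shingles(a), shingles(b)
    if not sa or not sb:
        return a == b
    return len(sa & sb) / len(sa | sb) >= threshold  # Jaccard similarity

seen, kept = set(), []
for doc in ["the cat sat on the mat again today",
            "the cat sat on the mat again today",
            "a completely different document about topic models"]:
    h = hashlib.sha1(doc.encode()).hexdigest()
    if h not in seen and not any(is_near_dup(doc, d) for d in kept):
        seen.add(h)
        kept.append(doc)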
How can a single person understand what's going on in a collection of millions of documents? This is an increasingly common problem: sifting through an organization's e-mails, understanding a decade worth of newspapers, or characterizing a scientific field's research. Topic models are a statistical framework that helps users understand large documen...
How can a single person understand what’s going on in a collection of millions of documents? This is an increasingly widespread problem: sifting through an organization’s e-mails, understanding a decade worth of newspapers, or characterizing a scientific field’s research. This monograph explores the ways that humans and computers make sense of docu...
Rule-based stemmers such as the Porter stemmer are frequently used to preprocess English corpora for topic modeling. In this work, we train and evaluate topic models on a variety of corpora using several different stemming algorithms. We examine several different quantitative measures of the resulting models, including likelihood, coherence, model...
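The preprocessing variants under comparison are easy to reproduce with standard tooling; a sketch using NLTK's Porter and Snowball stemmers (the paper evaluates more algorithms than shown here):

from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

tokens = ["running", "studies", "relational", "organization"]
variants = {
    "none": tokens,
    "porter": [porter.stem(t) for t in tokens],
    "snowball": [snowball.stem(t) for t in tokens],
}
for name, toks in variants.items():
    print(name, toks)  # each variant induces a different model vocabulary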
Admixture models are a ubiquitous approach to capture latent population structure in genetic samples. Despite the widespread application of admixture models, little thought has been devoted to the quality of the model fit or the accuracy of the estimates of parameters of interest for a particular study. Here we develop methods for validating admixt...
External factors such as author gender, author nationality, and date of publication can affect both the choice of literary themes in novels and the expression of those themes, but the extent of this association is difficult to quantify. In this work, we apply statistical methods to identify and extract hundreds of topics (themes) from a corpus of 1...
Topic models provide a useful method for dimensionality reduction and exploratory data analysis in large text corpora. Most approaches to topic model inference have been based on a maximum likelihood objective. Efficient algorithms exist that approximate this objective, but they have no provable guarantees. Recently, algorithms have been introduced...
We present a hybrid algorithm for Bayesian topic models that combines the efficiency of sparse Gibbs sampling with the scalability of online stochastic inference. We used our algorithm to analyze a corpus of 1.2 million books (33 billion words) with thousands of topics. Our approach reduces the bias of variational inference and generalizes to many...
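A deliberately toy sketch of the interleaving: Gibbs-sample topic assignments within a minibatch of documents, then fold the minibatch's topic-word counts into the global estimate with a decaying step size, as in stochastic inference. All sizes, constants, and simplifications are invented, not the paper's implementation:

import numpy as np

rng = np.random.default_rng(0)
V, K, D = 1000, 20, 100000    # vocab size, topics, pretend corpus size
topic_word = np.ones((K, V))  # global topic-word statistics

def gibbs_pass(doc, topic_word, iters=5, alpha=0.1):
    # Gibbs sampling over one document's topic assignments, holding the
    # global topic-word estimate fixed for the duration of the pass.
    z = rng.integers(K, size=len(doc))
    counts = np.bincount(z, minlength=K).astype(float)
    phi = topic_word / topic_word.sum(axis=1, keepdims=True)
    for _ in range(iters):
        for i, w in enumerate(doc):
            counts[z[i]] -= 1
            p = (counts + alpha) * phi[:, w]
            z[i] = rng.choice(K, p=p / p.sum())
            counts[z[i]] += 1
    return z

for t in range(1, 11):                  # minibatch steps
    rho = (t + 10) ** -0.7              # decaying step size
    batch = [rng.integers(V, size=50) for _ in range(8)]
    stats = np.zeros((K, V))
    for doc in batch:
        for w, k in zip(doc, gibbs_pass(doc, topic_word)):
            stats[k, w] += 1
    scale = D / len(batch)              # rescale minibatch to corpus size
    topic_word = (1 - rho) * topic_word + rho * (0.01 + scale * stats)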
Although fully generative models have been successfully used to model the contents of text documents, they are often awkward to apply to combinations of text data and document metadata. In this paper we propose a Dirichlet-multinomial regression (DMR) topic model that includes a log-linear prior on document-topic distributions that is a function of...
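The prior itself is compact: each document's Dirichlet parameters are a log-linear function of its metadata, alpha_dk = exp(lambda_k . x_d + b_k). A sketch with invented feature names and dimensions:

import numpy as np

def dmr_alpha(x, Lam, b):
    # x: (F,) document features (e.g., publication year, venue indicators)
    # Lam: (K, F) per-topic feature weights, b: (K,) intercepts
    return np.exp(Lam @ x + b)

K, F = 10, 3
rng = np.random.default_rng(0)
alpha = dmr_alpha(rng.normal(size=F), rng.normal(size=(K, F)), np.zeros(K))
theta = rng.dirichlet(alpha)  # document-topic proportions drawn per document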
Concept taxonomies such as MeSH, the ACM Computing Classification System, and the NY Times Subject Headings are frequently used to help organize data. They typically consist of a set of concept names organized in a hierarchy. However, these names and structure are often not sufficient to fully capture the intended meaning of a taxonomy node, and pa...
More than a century of modern Classical scholarship has created a vast archive of journal publications that is now becoming available online. Most of this work currently receives little, if any, attention. The collection is too large to be read by any single person and mostly not of sufficient interest to warrant traditional close reading. This art...
A database of objects discovered in houses in the Roman city of Pompeii provides a unique view of ordinary life in an ancient city. Experts have used this collection to study the structure of Roman households, exploring the distribution and variability of tasks in architectural spaces, but such approaches are necessarily affected by modern cultural...
We develop a scalable algorithm for posterior inference of overlapping communities in large networks. Our algorithm is based on stochastic variational inference in the mixed-membership stochastic blockmodel (MMSB). It naturally interleaves subsampling the network with estimating its community structure. We apply our algorithm on ten large, real-wo...
Latent variable models have the potential to add value to large document collections by discovering interpretable, low-dimensional subspaces. In order for people to use such models, however, they must trust them. Unfortunately, typical dimensionality reduction methods for text, such as latent Dirichlet allocation, often produce low-dimensional subs...
Real document collections do not fit the independence assumptions asserted by most statistical topic models, but how badly do they violate them? We present a Bayesian method for measuring how well a topic model fits a corpus. Our approach is based on posterior predictive checking, a method for diagnosing Bayesian models in user-defined ways. Our me...
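A toy sketch of the checking loop, with a stand-in discrepancy (entropy of one topic's tokens over documents) rather than the paper's exact statistic:

import numpy as np

rng = np.random.default_rng(0)

def discrepancy(doc_ids, n_docs):
    # How evenly a topic's tokens spread over documents.
    p = np.bincount(doc_ids, minlength=n_docs) / len(doc_ids)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

n_docs, n_tokens = 50, 500
observed = rng.integers(10, size=n_tokens)  # pretend tokens cluster in few docs
obs = discrepancy(observed, n_docs)

reps = [discrepancy(rng.integers(n_docs, size=n_tokens), n_docs)
        for _ in range(200)]                # replications under the model
print(np.mean([r <= obs for r in reps]))    # posterior predictive p-value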
Topic models provide a powerful tool for analyzing large text collections by representing high dimensional data in a low dimensional subspace. Fitting a topic model given a set of training documents requires approximate inference techniques that are computationally expensive. With today's large-scale, constantly expanding document collections, it...
A natural evaluation metric for statistical topic models is the probability of held-out documents given a trained model. While exact computation of this probability is intractable, several estimators for this probability have been used in the topic modeling literature, including the harmonic mean method and empirical likelihood method. In thi...
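The harmonic mean method mentioned here is simple enough to sketch: approximate p(w) by the harmonic mean of p(w | z_s) over posterior samples z_s. The per-sample log-likelihoods below are invented placeholders:

import numpy as np

def harmonic_mean_log_evidence(log_liks):
    # log p(w) ~ -log mean_s exp(-log p(w | z_s)), computed stably.
    neg = -np.asarray(log_liks, dtype=float)
    m = neg.max()
    return -(m + np.log(np.exp(neg - m).mean()))

log_p_w_given_z = [-310.2, -305.7, -312.9, -308.4]  # from a pretend sampler
print(harmonic_mean_log_evidence(log_p_w_given_z))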
Topic models are a useful tool for analyzing large text collections, but have previously been applied in only monolingual, or at most bilingual, contexts. Meanwhile, massive collections of interlinked documents in dozens of languages, such as Wikipedia, are now widely available, calling for tools that can characterize content in many langua...
Implementations of topic models typically use symmetric Dirichlet priors with fixed concentration parameters, with the implicit assumption that such "smoothing parameters" have little practical effect. In this paper, we explore several classes of structured priors for topic models. We find that an asymmetric Dirichlet prior over the document-topic...
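The headline configuration, an asymmetric prior over document-topic proportions with a symmetric prior over topic-word distributions, can be tried with off-the-shelf tooling; a sketch using gensim (a library choice of this sketch, not of the paper):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["topic", "model", "prior"], ["asymmetric", "prior", "helps"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(t) for t in texts]

lda = LdaModel(corpus, id2word=dictionary, num_topics=2,
               alpha="asymmetric",  # asymmetric document-topic prior
               eta="symmetric")     # symmetric topic-word prior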
As network-enabled scholarship produces huge quantities of formal and informal research outputs in a variety of formats and varying levels of access, it is "enhanced" science that will facilitate the discovery, selection, and analysis of information that are a necessary part of the scientific research cycle, particularly among interdisciplinary rese...
An essential part of an expert-finding task, such as matching reviewers to submitted papers, is the ability to model the expertise of a person based on documents. We evaluate several measures of the association between an author in an existing collection of research papers and a previously unseen document. We compare two language model based appr...
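One of the simplest association measures of this kind scores the unseen document under each author's smoothed unigram language model; a toy sketch with invented authors and an illustrative smoothing choice:

import math
from collections import Counter

def author_lm(papers):
    counts = Counter(w for p in papers for w in p.lower().split())
    return counts, sum(counts.values())

def log_score(doc, counts, total, vocab_size, mu=0.1):
    # Additively smoothed query likelihood of the document's words.
    return sum(math.log((counts[w] + mu) / (total + mu * vocab_size))
               for w in doc.lower().split())

authors = {"smith": ["topic models for text", "gibbs sampling methods"],
           "jones": ["image retrieval with deep features"]}
models = {a: author_lm(ps) for a, ps in authors.items()}
vocab = {w for counts, _ in models.values() for w in counts}
doc = "sampling methods for topic models"
print(max(models, key=lambda a: log_score(doc, *models[a], len(vocab))))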
The four-level pachinko allocation model (PAM) (Li & McCallum, 2006) represents correlations among topics using a DAG structure. It does not, however, represent a nested hierarchy of topics, with some topical word distributions representing the vocabulary that is shared among several more specific topics. This paper presents hierarchical PAM...
Large scale library digitization projects such as the Open Content Alliance are producing vast quantities of text, but little has been done to organize this data. Subject headings inherited from card catalogs are useful but limited, while full-text indexing is most appropriate for readers who already know exactly what they want. Statistical topic...
When browsing a digital library of research papers, it is natural to ask which authors are most influential in a particular topic. We present a probabilistic model that ranks authors based on their influence in particular areas of scientific research. This model combines several sources of information: citation information between documents as repres...
Databases constructed automatically through web mining and information extraction often overlap with databases constructed and curated by hand. These two types of databases are complementary: automatic extraction provides increased scope, while curated databases provide increased accuracy. The uncertain nature of such integration tasks suggests t...
This paper describes several incunabular assumptions that impose upon early digital libraries the limitations drawn from print, and argues for a design strategy aimed at providing customization and personalization services that go beyond the limiting models of print distribution, based on services and experiments developed for the Greco-Roman col...
Measurements of the impact and history of research literature provide a useful complement to scientific digital library collections. Bibliometric indicators have been extensively studied, mostly in the context of journals. However, journal-based metrics poorly capture topical distinctions in fast-moving fields, and are increasingly problematic with...
IFLA's Functional Requirements for Bibliographic Records (FRBR) lay the foundation for a new generation of cataloging systems that recognize the difference between a particular work (e.g., Moby Dick), diverse expressions of that work (e.g., translations into German, Japanese and other languages), different versions of the same basic text (e.g., the...
One of the criticisms library users often make of catalogs is that they rarely include information below the bibliographic level. It is generally impossible to search a catalog for the titles and subjects of particular chapters or volumes. There has been no way to add this information to catalog records without exponentially increasing the workload...
The SCALE - services for a customizable authority linking environment - project is developing tools to help integrate collections within the National Science Foundation's National Science Digital Library (NSDL) through terminological linking. These tools transparently provide reading support to NSDL collections by automatically linking phrases in t...
Previous work on probabilistic topic models has either focused on models with relatively simple conjugate priors that support Gibbs sampling or models with non-conjugate priors that typically require variational inference. Gibbs sampling is more accurate than variational inference and better supports the construction of composite models. We present...
Understanding the structure and dynamics of the job market is important both from the local perspective of individual job hunters and from the global perspective of economists and policy makers. In this paper, we explore such questions by analyzing the text of a corpus of resumes and their job transitions. We first demonstrate the use of a statistica...
Large text collections are useful in social science research, but building reliable predictive models is difficult. Researchers must either deal directly with sparse, noisy, high dimensional language data or use latent variable models to infer more tractable lower dimensional patterns. For conclusions based on latent variable models to be reliable,...
Projects
Project (1)