
Rebecca Steorts - Duke University
About
51 Publications
5,219 Reads
877 Citations
Publications (51)
Entity resolution is the process of merging and removing duplicate records from multiple data sources, often in the absence of unique identifiers. Bayesian models for entity resolution allow one to include a priori information, quantify uncertainty in important applications, and directly estimate a partition of the records. Markov chain Monte Carlo...
The availability of both structured and unstructured databases, such as electronic health data, social media data, patent data, and surveys that are often updated in real time, among others, has grown rapidly over the past decade. With this expansion, the statistical and methodological questions around data integration, or rather merging multiple d...
Entity resolution (record linkage or deduplication) is the process of identifying and linking duplicate records in databases. In this paper, we propose a Bayesian graphical approach for entity resolution that links records to latent entities, where the prior representation on the linkage structure is exchangeable. First, we adopt a flexible and tra...
Given that data is readily available at our fingertips, data scientists face a difficult task when integrating multiple databases. This task is commonly known as entity resolution, which combines structured databases that may refer to the same individual (business, etc.), where unique identifiers are not available, are only partially available, or...
Whether the goal is to estimate the number of people that live in a congressional district, to estimate the number of individuals that have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications have a common theme—integrating information from multiple sources. Before such questions can be...
Entity resolution (ER), comprising record linkage and deduplication, is the process of merging noisy databases in the absence of unique identifiers to remove duplicate entities. One major challenge of analysis with linked data is identifying a representative record among determined matches to pass to an inferential or predictive task, referred to a...
The quantification of modern slavery has received increased attention recently as organizations have come together to produce global estimates, where multiple systems estimation (MSE) is often used to this end. Echoing a long‐standing controversy, disagreements have re‐surfaced regarding the underlying MSE assumptions, the robustness of MSE methodo...
Statistical agencies are often asked to produce small area estimates (SAEs) for positively skewed variables. When domain sample sizes are too small to support direct estimators, effects of skewness of the response variable can be large. As such, it is important to appropriately account for the distribution of the response variable given available a...
Traditional Bayesian random partition models assume that the size of each cluster grows linearly with the number of data points. While this is appealing for some applications, this assumption is not appropriate for other tasks such as entity resolution, modeling of sparse networks, and DNA sequencing tasks. Such applications require models that yie...
Entity resolution (ER; also known as record linkage or de-duplication) is the process of merging noisy databases, often in the absence of unique identifiers. A major advancement in ER methodology has been the application of Bayesian generative models, which provide a natural framework for inferring latent entities with rigorous quantification of un...
Often data analysts use probabilistic record linkage techniques to match records across two data sets. Such matching can be the primary goal, or it can be a necessary step to analyze relationships among the variables in the data sets. We propose a Bayesian hierarchical model that allows data analysts to perform simultaneous linear regression and pr...
Entity resolution (ER) is becoming an increasingly important task across many domains (e.g., official statistics, human rights, medicine, etc.), where databases contain duplications of entities that need to be removed for later inferential and prediction tasks. Motivated by scaling to large data sets and providing uncertainty propagation, we propos...
Small area estimation is concerned with methodology for estimating population parameters associated with a geographic area defined by a cross-classification that may also include non-geographic dimensions. In this paper, we develop constrained estimation methods for small area problems: those requiring smoothness with respect to similarity across a...
Background. Identification of patients at risk of deteriorating during their hospitalization is an important concern. However, many off-the-shelf scores have poor in-center performance. In this article, we report our experience developing, implementing, and evaluating an in-hospital score for deterioration. Methods. We abstracted 3 years of data (2014–...
Appointment no-shows have a negative impact on patient health and have caused substantial loss in resources and revenue for health care systems. Intervention strategies to reduce no-show rates can be more effective if targeted to the subpopulations of patients with higher risk of not showing to their appointments. We use electronic health records (...
Entity resolution (ER) (record linkage or de-duplication) is the process of merging together noisy databases, often in the absence of a unique identifier. A major advancement in ER methodology has been the application of Bayesian generative models. Such models provide a natural framework for clustering records to unobserved (latent) entities, while...
This is a review article that unifies several important examples using constrained optimisation techniques. The basic tools are three simple mathematical optimisation results subject to certain constraints. Applications include calibration, benchmarking in small area estimation and imputation. A final illustration is constrained optimisation under...
Record linkage (de-duplication or entity resolution) is the process of merging noisy databases to remove duplicate entities. While record linkage removes duplicate entities from such databases, the downstream task is any inferential, predictive, or post-linkage task on the linked data. One goal of the downstream task is obtaining a larger reference...
Entity resolution seeks to merge databases so as to remove duplicate entries when unique identifiers are typically unknown. We review modern blocking approaches for entity resolution, focusing on those based upon locality sensitive hashing (LSH). First, we introduce k-means locality sensitive hashing (KLSH), which is based upon the information retr...
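The KLSH approach mentioned above can be illustrated with a minimal sketch (an illustration under stated assumptions, not the paper's implementation): represent each record as a TF-IDF vector of character shingles, cluster the vectors with k-means, and treat each cluster as a block so that comparisons happen only within blocks. The toy records, shingle size, and number of clusters below are hypothetical, and scikit-learn is assumed to be available.

# Minimal sketch of KLSH-style blocking (illustrative only): records falling in
# the same k-means cluster of a character-shingle TF-IDF space form a block,
# and candidate pairs are generated only within blocks.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

records = [
    "maria garcia 1984 durham nc",
    "maria garcia 1984 durham north carolina",
    "john smith 1970 raleigh nc",
    "jon smyth 1970 raleigh north carolina",
]

# Character 3-grams stand in for the shingling step; these settings are
# illustrative choices, not the configuration used in the paper.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(records)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

blocks = defaultdict(list)
for idx, label in enumerate(kmeans.labels_):
    blocks[label].append(idx)

# All-pairs comparisons are restricted to records sharing a block.
candidate_pairs = [
    (i, j)
    for members in blocks.values()
    for a, i in enumerate(members)
    for j in members[a + 1:]
]
print(candidate_pairs)

The point of the sketch is only the shape of the computation: blocking replaces one quadratic all-pairs comparison with much smaller within-block comparisons.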
Record linkage (entity resolution or de-deduplication) is the process of merging noisy databases to remove duplicate entities. While record linkage removes duplicate entities from the data, many researchers are interested in performing inference, prediction or post-linkage analysis on the linked data, which we call the downstream task. Depending on...
In decision theory, we start with the end in mind – how does one use inference results to make decisions and what consequences do these decisions have? Such statistical decision problems are becoming increasingly important as data are readily available at our fingertips. We outline the fundamental ideas of loss for decision theoretic tasks, Bayesia...
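As a point of reference for the loss-based framing above, the standard Bayesian decision-theoretic recipe can be written compactly (textbook notation, not a formula taken from this article): the optimal action minimizes posterior expected loss.

% Standard Bayes rule (textbook form): given a loss L(theta, a) and posterior
% pi(theta | x), the Bayes action minimizes the posterior expected loss.
\[
  \delta^{*}(x) \;=\; \arg\min_{a}\; \mathbb{E}\bigl[L(\theta, a)\mid x\bigr]
  \;=\; \arg\min_{a}\int L(\theta, a)\,\pi(\theta \mid x)\,d\theta .
\]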
Entity resolution identifies and removes duplicate entities in large, noisy databases and has grown in both usage and new developments as a result of increased data availability. Nevertheless, entity resolution has tradeoffs regarding assumptions of the data generation process, error rates, and computational scalability that make it a difficult tas...
Record linkage involves merging records in large, noisy databases to remove duplicate entities. It has become an important area because of its widespread occurrence in bibliometrics, public health, official statistics production, political science, and beyond. Traditional linkage methods directly linking records to one another are computationally i...
Most generative models for clustering implicitly assume that the number of data points in each cluster grows linearly with the total number of data points. Finite mixture models, Dirichlet process mixture models, and Pitman--Yor process mixture models make this assumption, as do all other infinitely exchangeable clustering models. However, for some...
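The requirement hinted at in the truncated sentence is often summarized by the microclustering property; the informal statement below is a paraphrase in standard notation, not a quotation from the paper.

% Informal microclustering property (paraphrased): writing M_n for the size of
% the largest cluster in a random partition of n records, cluster sizes should
% grow sublinearly in n, in contrast to infinitely exchangeable models where
% the largest cluster grows linearly in n.
\[
  \frac{M_n}{n} \;\xrightarrow{\;p\;}\; 0 \qquad \text{as } n \to \infty .
\]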
A plethora of networks is being collected in a growing number of fields, including disease transmission, international relations, social interactions, and others. As data streams continue to grow, the complexity associated with these highly multidimensional connectivity data presents new challenges. In this paper, we focus on time-varying interconn...
Databases often contain corrupted, degraded, and noisy data with duplicate entries across and within each database. Such problems arise in citations, medical databases, genetics, human rights databases, and a variety of other applied settings. The target of statistical inference can be viewed as an unsupervised problem of determining the edges of a...
Estimation of death counts and associated standard errors is of great importance in armed conflict such as the ongoing violence in Syria, as well as historical conflicts in Guatemala, Perú, Colombia, Timor Leste, and Kosovo. For example, statistical estimates of death counts were cited as important evidence in the trial of General Efraín Ríos...
We congratulate the authors for a stimulating and valuable manuscript, providing a careful review of the state-of-the-art in cross-sectional and time-series benchmarking procedures for small area estimation. They develop a novel two-stage benchmarking method for hierarchical time series models, where they evaluate their procedure by estimating mont...
We develop constrained Bayesian estimation methods for small area problems: those requiring smoothness with respect to similarity across areas, such as geographic proximity or clustering by covariates; and benchmarking constraints, requiring (weighted) means of estimates to agree across levels of aggregation. We develop methods for constrained esti...
Bayesian entity resolution merges together multiple, noisy databases and returns the minimal collection of unique individuals represented, together with their true, latent record values. Bayesian methods allow flexible generative models that share power across databases as well as principled quantification of uncertainty for queries of the final, r...
Record linkage seeks to merge databases and to remove duplicates when unique identifiers are not available. Most approaches use blocking techniques to reduce the computational complexity associated with record linkage. We review traditional blocking techniques, which typically partition the records according to a set of field attributes, and consid...
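A minimal sketch of the traditional field-based blocking just described (field names and key choice are hypothetical): partition records by a deterministic key built from a few attributes and compare records only within each partition.

# Minimal sketch of traditional blocking (illustrative only): records are
# partitioned by a deterministic key built from field attributes, and record
# comparisons are restricted to each partition.
from collections import defaultdict

records = [
    {"surname": "Garcia", "birth_year": 1984, "city": "Durham"},
    {"surname": "Garcia", "birth_year": 1984, "city": "Durham, NC"},
    {"surname": "Smith",  "birth_year": 1970, "city": "Raleigh"},
]

def blocking_key(rec):
    # Hypothetical key: first letter of surname plus birth year. In practice
    # the key is chosen from fields believed to be recorded with few errors.
    return (rec["surname"][0].upper(), rec["birth_year"])

blocks = defaultdict(list)
for i, rec in enumerate(records):
    blocks[blocking_key(rec)].append(i)

candidate_pairs = [
    (i, j)
    for members in blocks.values()
    for a, i in enumerate(members)
    for j in members[a + 1:]
]
print(candidate_pairs)  # only within-block pairs are ever compared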
Carroll describes an innovative model developed in Zhang et al. (2011) for estimating dietary consumption patterns in children, and a successful Bayesian solution for inferring the features of the model. The original authors went to great lengths to achieve valid frequentist inference via a Bayesian analysis that simplified the computational comple...
We propose a novel unsupervised approach for linking records across arbitrarily many files, while simultaneously detecting duplicate records within files. Our key innovation is to represent the pattern of links between records as a bipartite graph, in which records are directly linked to latent true individuals, and only indirectly linked to...
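The bipartite representation described in this abstract can be sketched as a simple data structure (a toy illustration under assumed names, not the paper's model): each record stores the label of its latent individual, and two records are coreferent exactly when their labels coincide.

# Toy sketch of a bipartite linkage structure: records on one side are linked
# directly to latent individuals on the other side; coreference between two
# records is the induced, indirect relation of sharing a latent individual.
from itertools import combinations

# linkage[record] = latent individual label (a made-up assignment, not the
# output of any particular sampler).
linkage = {
    ("file_A", 0): "latent_1",
    ("file_A", 1): "latent_2",
    ("file_B", 0): "latent_1",   # matches ("file_A", 0) across files
    ("file_B", 1): "latent_2",   # matches ("file_A", 1) across files
    ("file_B", 2): "latent_3",   # a singleton with no duplicates
}

def coreferent(r, s):
    # Records refer to the same individual iff they share a latent label.
    return linkage[r] == linkage[s]

matches = [(r, s) for r, s in combinations(linkage, 2) if coreferent(r, s)]
print(matches)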
Functional neuroimaging measures how the brain responds to complex stimuli. However, sample sizes are modest compared to the vast numbers of parameters to be estimated, noise is substantial, and stimuli are high-dimensional. Hence, direct estimates are inherently imprecise and call for regularization. We compare a suite of approaches which regulari...
There has been recent growth in small area estimation due to the need for more precise estimation of small geographic areas, which has led to groups such as the U.S. Census Bureau, Google, and the RAND corporation utilizing small area estimation procedures. We develop novel two-stage benchmarking methodology using a single weighted squared error lo...
The PITCHf/x database has allowed the statistical analysis of Major League Baseball (MLB) to flourish since its introduction in late 2006. Using PITCHf/x, pitches have been classified by hand, requiring considerable effort, or using neural network clustering and classification, which is often difficult to interpret. To address these issues, we u...
We consider benchmarked empirical Bayes (EB) estimators under the basic area-level model of Fay and Herriot while requiring the standard benchmarking constraint. In this paper we determine the excess mean squared error (MSE) from constraining the estimates through benchmarking. We show that the increase due to benchmarking is O(m^{-1}), where m is...
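For context, a standard statement of the Fay-Herriot area-level model and a common weighted-mean benchmarking constraint is sketched below in textbook notation; the paper's exact formulation and choice of weights may differ.

% Fay--Herriot area-level model for areas i = 1, ..., m, with known sampling
% variances D_i, together with a weighted-mean benchmarking constraint tying
% the benchmarked estimators to the direct estimates.
\[
  y_i = \theta_i + e_i, \quad e_i \sim N(0, D_i), \qquad
  \theta_i = x_i^{\top}\beta + u_i, \quad u_i \sim N(0, A),
\]
\[
  \sum_{i=1}^{m} w_i\,\hat\theta_i^{\mathrm{bench}} \;=\; \sum_{i=1}^{m} w_i\, y_i .
\]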
It is well-known that small area estimation needs explicit or at least implicit use of models (cf. Rao in Small Area Estimation, Wiley, New York, 2003). These model-based estimates can differ widely from the direct estimates, especially for areas with very low sample sizes. While model-based small area estimates are very useful, one potential diffi...