About
Publications: 247
Reads: 69,099
Citations: 54,153
Introduction
If you'd like copies of any of my publications, papers, talks, patents, and more, please have a look at my website at http://GaryKing.org. I do not use ResearchGate other than this note.
Gary King
Publications (247)
To deter gerrymandering, many state constitutions require legislative districts to be “compact.” Yet, the law offers few precise definitions other than “you know it when you see it,” which effectively implies a common understanding of the concept. In contrast, academics have shown that compactness has multiple dimensions and have generated many con...
Significance
To the best of our knowledge, this is the first prediction of child anthropometric failure estimates for 597,121 villages—the smallest local governance unit—in India. While prior child nutrition policies and programs in India focused on districts for planning, implementation, and monitoring, we highlight that a majority of the geograph...
While digital trace data from sources like search engines hold enormous potential for tracking and understanding human behavior, these streams of data lack information about the actual experiences of those individuals generating the data. Moreover, most current methods ignore or under-utilize human processing capabilities that allow humans to solve...
Data sharing, research ethics, and incentives must improve
Researchers who generate data often optimize efficiency and robustness by choosing stratified over simple random sampling designs. Yet, all theories of inference proposed to justify matching methods are based on simple random sampling. This is all the more troubling because, although these theories require exact matching, most matching applications...
The Chinese government has long been suspected of hiring as many as 2 million people to surreptitiously insert huge numbers of pseudonymous and other deceptive writings into the stream of real social media posts, as if they were the genuine opinions of ordinary people. Many academics, and most journalists and activists, claim that these so-called 5...
We provide an overview of PSI ("a Private data Sharing Interface"), a system we are developing to enable researchers in the social sciences and other fields to share and explore privacy-sensitive datasets with the strong privacy protections of differential privacy.
Political science is a collective enterprise. The community of empirically oriented political scientists needs access to the data sets necessary to reproduce existing studies in order to understand, evaluate, and especially build on this work. Unfortunately, the rules we currently follow do not encour...
Although social scientists devote considerable effort to mitigating measurement error during data collection, they often ignore the issue during data analysis. And although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model dep...
We extend a unified and easy-to-use approach to measurement error and missing data. In our companion article, Blackwell, Honaker, and King give an intuitive overview of the new technique, along with practical suggestions and empirical applications. Here, we offer more precise technical details, more sophisticated measurement error model specificati...
We offer an evaluation of the Social Security Administration demographic and financial forecasts used to assess the long-term solvency of the Social Security Trust Funds. This same forecasting methodology is also used in evaluating policy proposals put forward by Congress to modify the Social Security program. Ours is the first evaluation to compar...
The vast majority of social science research uses small (megabyte- or gigabyte-scale) datasets. These fixed-scale datasets are commonly downloaded to the researcher’s computer where the analysis is performed. The data can be shared, archived, and cited with well-established technologies, such as the Dataverse Project, to support the published resul...
Censorship of social media in China
Figuring out how many and which social media comments are censored by governments is difficult because those comments, by definition, cannot be read. King et al. have posted comments to social media sites in China and then waited to see which of these never appeared, which appeared and were then removed, and whic...
We thank Broniatowski, Paul, and Dredze for giving us the opportunity to reemphasize the potential of big data and make the more obvious point that not all big data projects have the problems currently plaguing Google Flu Trends (GFT), nor are these problems inherent to the field in general.
“Robust standard errors” are used in a vast array of scholarship to correct standard errors for model misspecification. However, when misspecification is bad enough to make classical and robust standard errors diverge, assuming that it is nevertheless not so bad as to bias everything else requires considerable optimism. And even if the optimism is...
Large errors in flu prediction were largely avoidable, which offers lessons for the use of big data.
In response to its poor performance during the 2012-2013 flu season, Google Flu Trends (GFT) engineers announced a redesign of the GFT algorithm. Two changes were made: (1) dampening anomalous media spikes and (2) using ElasticNet, rather than regression, for estimation. This paper identifies several problems that persist in the new algorithm. Firs...
As the push towards electronic storage, publication, curation, and discoverability of research data collected in multiple research domains has grown, so too have the massive numbers of small to medium datasets that are highly distributed and not easily discoverable - a region of data that is sometimes referred to as the long tail of science. The ra...
We marshal discoveries about human behavior and learning from social science research and show how these can be used to improve teaching and learning. The discoveries are easily stated as three social science generalizations: (1) social connections motivate, (2) teaching teaches the teacher, and (3) instant feedback improves learning. We show how t...
A method for selecting clusterings to classify a predetermined data set of numerical data comprises five steps. First, a plurality of known clustering methods are applied, one at a time, to the data set to generate clusterings for each method. Second, a metric space of clusterings is generated using a metric that measures the similarity between two...
We offer the first large scale, multiple source analysis of the outcome of what may be the most extensive effort to selectively censor human expression ever implemented. To do this, we have devised a system to locate, download, and analyze the content of millions of social media posts originating from nearly 1,400 different social media services al...
The social sciences are undergoing a dramatic transformation from studying problems to solving them; from making do with a small number of sparse data sets to analyzing increasing quantities of diverse, highly informative data; from isolated scholars toiling away on their own to larger scale, collaborative, interdisciplinary, lab-style res...
We discuss a method for improving causal inferences called "Coarsened Exact Matching" (CEM), and the new "Monotonic Imbalance Bounding" (MIB) class of matching methods from which CEM is derived. We summarize what is known about CEM and MIB, derive and illustrate several new desirable statistical properties of CEM, and then propose a variety of usefu...
In the election for President of the United States, the Electoral College is the body whose members vote to elect the President directly. Each state sends a number of delegates equal to its total number of representatives and senators in Congress; all but two states (Nebraska and Maine) assign electors pledged to the candidate that wins the state’s...
The financial viability of Social Security, the single largest U.S. government program, depends on accurate forecasts of the solvency of its intergenerational trust fund. We begin by detailing information necessary for replicating the Social Security Administration's (SSA's) forecasting procedures, which until now has been unavailable in the public...
Classic (or ‘‘cumulative’’) case-control sampling designs do not admit inferences about quantities of interest other than risk ratios and then only by making the rare events assumption. Probabilities, risk differences, number needed
Matching methods for causal inference selectively prune observations from the data in order to reduce model dependence. They are successful when simultaneously maximizing balance (between the treated and control groups on the pre-treatment covariates) and the number of observations remaining in the data set. However, existing matching methods eith...
Amelia II is a complete R package for multiple imputation of missing data. The package implements a new expectation-maximization with bootstrapping algorithm that works faster, with larger numbers of variables, and is far easier to use, than various Markov chain Monte Carlo approaches, but gives essentially the same answers. The program also improv...
Convexity of the log likelihood function.
(DOC)
A sample of 5 deconvolved influenza symptomatic incidence curves (as described in section S3) against the original one (black). Symptom profiles (2b), Method 2.
(TIF)
Adding an “unobserved” non-flu outbreak with atypical symptoms (as described in section S3). Symptomatic influenza incidence (black), “regular” non-flu incidence (red) and “outbreak” non-flu incidence (dashed red).
(TIF)
Symptom profile distribution for PCR-negative, symptomatic household contacts.
(DOC)
Influenza incidence proxy on the University of Michigan campus.
(DOC)
Two samples of 5 deconvolved influenza symptomatic incidence curves (as described in section S3) against the original one (black). (A) Method 1 deconvolution. (B) Method 2 deconvolution.
(TIF)
Symptom profile distribution for PCR negative, symptomatic household contacts from [12] (A). Non-flu symptom profile distribution from the main body of the text (B).
(TIF)
Weekly incidence proxy on the University of Michigan campus, inferred from survey and virological testing data in [21].
(TIF)
Influenza A and B symptom profiles.
(DOC)
Numerical simulations for the deconvolution process.
(DOC)
The effect of an “unobserved” non-flu outbreak with atypical symptoms (as described in section S3) on the deconvolution process. A sample of 5 deconvolved influenza symptomatic incidence curves against the original one (black). Symptom profiles (2b), Method 2. Flu and non-flu incidence curves given by Figure S3.
(TIF)
We introduce a method for estimating incidence curves of several co-circulating infectious pathogens, where each infection has its own probabilities of particular symptom profiles. Our deconvolution method utilizes weekly surveillance data on symptoms from a defined population as well as additional data on symptoms from a sample of virologically co...
Population mortality forecasts are widely used for allocating public health expenditures, setting research priorities, and evaluating the viability of public pensions, private pensions, and health care financing systems. Although we know a great deal about patterns in and causes of mortality, most forecasts are still based on simple linear extrapol...
MatchIt implements the suggestions of Ho, Imai, King, and Stuart (2007) for improving parametric statistical models by preprocessing data with nonparametric matching methods. MatchIt implements a wide range of sophisticated matching methods, making it possible to greatly reduce the dependence of causal inferences on hard-to-justify, but commonly ma...
Social scientists typically devote considerable effort to mitigating measurement error during data collection but then ignore the issue during data analysis. Although many statistical methods have been proposed for reducing measurement error-induced biases, few have been widely used because of implausible assumptions, high levels of model dependen...
We introduce a new “Monotonic Imbalance Bounding” (MIB) class of matching methods for causal inference with a surprisingly large number of attractive statistical properties. MIB generalizes and extends in several new directions the only existing class, “Equal Percent Bias Reducing” (EPBR), which is designed to satisfy weaker properties and only in...
We develop a computer-assisted method for the discovery of insightful conceptualizations, in the form of clusterings (i.e., partitions) of input objects. Each of the numerous fully automated methods of cluster analysis proposed in statistics, computer science, and biology optimize a different objective function. Almost all are well defined, but how...
Massive increases in the availability of informative social science data are making dramatic progress possible in analyzing, understanding, and addressing many major societal problems. Yet the same forces pose severe challenges to the scientific infrastructure supporting data sharing, data management, informatics, statistical methodology, and resea...
We highlight common problems in the application of random treatment assignment in large-scale program evaluation. Random assignment is the defining feature of modern experimental design, yet errors in design, implementation, and analysis often result in real-world applications not benefiting from its advantages. The errors discussed here cover the...
When respondents use the ordinal response categories of standard survey questions in different ways, the validity of analyses based on the resulting data can be biased. Anchoring vignettes is a survey design technique intended to correct for some of these problems. The anchors package in R includes methods for evaluating and choosing anchoring vign...
Verbal autopsy analyses are widely used for estimating cause-specific mortality rates (CSMR) in the vast majority of the world without high-quality medical death registration. Verbal autopsies -- survey interviews with the caretakers of imminent decedents -- stand in for medical examinations or physical autopsies, which are infeasible or culturally...
We report the results of several randomized survey experiments designed to evaluate two intended improvements to anchoring vignettes, an increasingly common technique used to improve interpersonal comparability in survey research. This technique asks for respondent self-assessments followed by assessments of hypothetical people described in vignett...
The recent subprime mortgage crisis has brought to the forefront the possibility of discriminatory lending. Many, including groups like the NAACP, claim that women and minorities were disproportionately awarded subprime rates on their home purchase loans. Using the millions of observations collected by the federal government in 2006 through the...
Additional tables, containing Tables S1-S4.
Applications of modern methods for analyzing data with missing values, based primarily on multiple imputation, have in the last half-decade become common in American politics and political behavior. Scholars in these fields have thus increasingly avoided the biases and inefficiencies caused by ad hoc methods like listwise deletion and best gu...
Incomplete information on death certificates makes recorded cause-of-death data less useful for public health monitoring and planning. Certifying physicians sometimes list only the mode of death without indicating the underlying disease or diseases that led to the death. Inconsistent cause-of-death assignment among cardiovascular causes of death is...
The increasing availability of digitized text presents enormous opportunities for social scientists. Yet hand coding many blogs, speeches, government records, newspapers, or other sources of unstructured text is infeasible. Although computer scientists have methods for automated content analysis, most are optimized to classify individual documents,...
This paper shows that increases in the minimum wage rate can have ambiguous effects on the working hours and welfare of employed workers in competitive labor markets. The reason is that employers may not comply with the minimum wage legislation and instead pay a lower subminimum wage rate. If workers are risk neutral, we prove that working hours an...
Social science data are an unusual part of the past, present, and future of digital preservation. They are both an unqualified success, due to long-lived and sustainable archival organizations, and in need of further development because not all digital content is being preserved. This article is about the Data Preservation Alliance for the Social S...
We address a major discrepancy in matching methods for causal inference in observational data. Since these data are typically plentiful, the goal of matching is to reduce bias and only secondarily to keep variance low. However, most matching methods seem designed for the opposite goal, guaranteeing sample size ex ante but limiting bias by controll...
Rejoinder to "The Essential Role of Pair Matching in Cluster-Randomized Experiments, with Application to the Mexican Universal Health Insurance Evaluation" [arXiv:0910.3752]. Published at http://dx.doi.org/10.1214/09-STS274REJ in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.ims...
This program is designed to improve causal inference via a method of matching that is widely applicable in observational data and easy to understand and use (if you understand how to draw a histogram, you will understand this method). The program implements the coarsened exact matching (CEM) algorithm, described below. CEM may be used alone or in c...
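As a rough illustration of the idea described in this abstract, the following minimal Python sketch coarsens each covariate into bins, groups units by their coarsened values, and prunes any stratum that lacks either a treated or a control unit. It is not the program's own code: the function names (`coarsen`, `cem_match`), the data layout, the cutpoints, and the example units are all hypothetical, and refinements such as stratum weights are omitted.

```python
from collections import defaultdict


def coarsen(value, cutpoints):
    """Return the index of the bin that `value` falls into.

    `cutpoints` is a sorted list of bin boundaries; a value equal to a
    boundary is placed in the bin above it.
    """
    return sum(value >= c for c in cutpoints)


def cem_match(units, cutpoints):
    """Sketch of coarsened exact matching (hypothetical helper).

    units: list of dicts with a boolean "treated" key and one key per covariate.
    cutpoints: dict mapping covariate name -> sorted list of bin boundaries.
    Returns the units in strata that contain at least one treated and one
    control unit; units in all other strata are pruned.
    """
    strata = defaultdict(list)
    for unit in units:
        # Units are matched exactly on the tuple of coarsened covariates.
        signature = tuple(
            coarsen(unit[name], cuts) for name, cuts in sorted(cutpoints.items())
        )
        strata[signature].append(unit)

    matched = []
    for members in strata.values():
        has_treated = any(u["treated"] for u in members)
        has_control = any(not u["treated"] for u in members)
        if has_treated and has_control:
            matched.extend(members)
    return matched


# Hypothetical example data: two covariates, four units.
units = [
    {"id": 1, "treated": True, "age": 23, "income": 31},
    {"id": 2, "treated": False, "age": 27, "income": 35},
    {"id": 3, "treated": True, "age": 52, "income": 80},
    {"id": 4, "treated": False, "age": 41, "income": 33},
]
cutpoints = {"age": [30, 50], "income": [40, 70]}

matched = cem_match(units, cutpoints)
matched_ids = sorted(u["id"] for u in matched)
print(matched_ids)
```

In this toy data, units 1 and 2 share the same coarsened stratum and are retained, while units 3 and 4 each sit alone in a stratum and are pruned. Coarser cutpoints keep more units at the cost of looser matches, which is the trade-off the user controls.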
We assessed aspects of Seguro Popular, a programme aimed to deliver health insurance, regular and preventive medical care, medicines, and health facilities to 50 million uninsured Mexicans.
We randomly assigned treatment within 74 matched pairs of health clusters (i.e., health facility catchment areas) representing 118,569 households in seven Mexican s...
A field is emerging that leverages the capacity to collect and analyze data at a scale that may reveal patterns of individual and group behaviors.
A basic feature of many field experiments is that investigators are only able to randomize clusters of individuals (such as households, communities, firms, medical practices, schools or classrooms) even when the individual is the unit of interest. To recoup the resulting efficiency loss, some studies pair similar clusters and randomize treatment with...
In this article, we introduce a Stata implementation of coarsened exact matching, a new method for improving the estimation of causal effects by reducing imbalance in covariates between treated and control groups. Coarsened exact matching is faster, is easier to use and understand, requires fewer assumptions, is more easily automated, and possess...
Conflict scholars have devoted considerable attention to the natural resource curse, and specifically to connections between natural resources, state weakness, and civil war. Many have posited a state weakness mechanism, namely that significant oil production causes state weakness, and state weakness consequently increases the likelihood of civil war onse...
»Maßnahmen zur Langzeitarchivierung von Umfragedaten in den USA«. Social science data collected in the United States, both historically and at present, have often not been placed in any public archive, even when the data collection was supported by government grants. The availability of the data for future use is, therefore, in jeopardy. Enforcing...
In response to the data-based measures of model dependence proposed in King and Zeng (2006), Sambanis and Michaelides (2008) propose alternative measures that rely upon assumptions untestable in observational data. If these assumptions are correct, then their measures are appropriate and ours, based solely on the empirical data, may be too conserva...
We introduce a new “Monotonic Imbalance Bounding” (MIB) class of matching methods for causal inference that satisfies several important in-sample properties. MIB generalizes and extends in several new directions the only existing class, “Equal Percent Bias Reducing” (EPBR), which is designed to satisfy weaker properties and only in expectation....
The enormous Nazi voting literature rarely builds on modern statistical or economic research. By adding these approaches, we find that the most widely accepted existing theories of this era cannot distinguish the Weimar elections from almost any others in any country. Via a retrospective voting account, we show that voters most hurt by the depressi...
Mexican immigrants were historically clustered in a few cities, mainly in California and Texas. During the past 15 years, however, arrivals from Mexico established sizeable immigrant communities in many “new” cities. We explore the causes and consequences of the widening geographic diffusion of Mexican immigrants. A combination of demand-pull and s...
We describe some progress toward a common framework for statistical analysis and software development built on and within the R language, including R's numerous existing packages. The framework we have developed offers a simple unified structure and syntax that can encompass a large fraction of statistical procedures already implemented in R, witho...
Deliberative theorists emphasize that citizens' capacity to become informed when given a motive and the opportunity to participate in politics is important for democratic citizenship. We assess this capacity among citizens using a deliberative field experiment. In the summer of 2006, we conducted a field experiment in which we recruited twelve cur...
If ignored, non-compliance with a treatment and nonresponse on outcome measures can bias estimates of treatment effects in a randomized experiment. To identify treatment effects in the case where compliance and response are conditioned on unobservables, we propose the parametric generalized endogenous treatment (GET) model. As a multilevel random...
Verbal autopsy procedures are widely used for estimating cause-specific mortality in areas without medical death certification. Data on symptoms reported by caregivers along with the cause of death are collected from a medical facility, and the cause-of-death distribution is estimated in the population where only symptom data are available. Current...
At one point during the 1988 campaign, Michael Dukakis was ahead in the public opinion polls by 17 percentage points, but he eventually lost the election by 8 percent. Walter Mondale was ahead in the polls by 4 percent during the 1984 campaign but lost the election in a landslide. During June and July of 1992, Clinton, Bush, and Perot each had turn...
Political scientists estimate average causal effects with regression or matching techniques, but both techniques require the user to choose a set of matching or conditioning variables. In this paper, we show that the standard advice from both frameworks on how to choose an identifying set of variables is often insufficient and at times misleading....
We attempt to clarify, and suggest how to avoid, several serious misunderstandings about and fallacies of causal inference. These issues concern some of the most fundamental advantages and disadvantages of each basic research design. Problems include improper use of hypothesis tests for covariate balance between the treated and control groups, and...
King, Alt, Burns, and Laver (1990) proposed and estimated a unified model in which cabinet durations depended on seven explanatory variables reflecting features of the cabinets and the bargaining environments in which they formed, along with a stochastic component in which the risk of a cabinet falling was treated as a constant across its tenure. T...
The dramatic increase in the electoral advantage of incumbency has sparked widespread interest among congressional researchers over the last 15 years. Although many scholars have studied the advantages of incumbency for incumbents, few have analyzed its effects on the underlying electoral system. We examine the influence of the incumbency advantage...
We derive a unified statistical method with which one can produce substantially improved definitions and estimates of almost any feature of two-party electoral systems that can be defined based on district vote shares. Our single method enables one to calculate more efficient estimates, with more trustworthy assessments of their uncertainty, than e...
As most political scientists know, the outcome of the American presidential election can be predicted within a few percentage points (in the popular vote), based on information available months before the election. Thus, the general campaign for president seems irrelevant to the outcome (except in very close elections), despite all the media covera...
We demonstrate that the expected value and variance commonly given for a well-known probability distribution are incorrect. We also provide corrected versions and report changes in a computer program to account for the known practical uses of this distribution. The probability distribution in question, named the continuous parameter binomial (CPB)...
Receiving five serious reviews in this symposium is gratifying and confirms our belief that research design should be a priority for our discipline. We are pleased that our five distinguished reviewers appear to agree with our unified approach to the logic of inference in the social sciences, and with our fundamental point: that good quantitative a...
Ecological inference, as traditionally defined, is the process of using aggregate (i.e., ecological) data to infer discrete individual-level relationships of interest when individual-level data are not available. Existing methods of ecological inference generate very inaccurate conclusions about the empirical world, which thus gives rise to the eco...