Better Estimates from Binned Income Data: Interpolated CDFs and Mean-Matching

Abstract

Researchers often estimate income statistics from summaries that report the number of incomes in bins such as $0-10,000, $10,001-20,000,...,$200,000+. Some analysts assign incomes to bin midpoints, but this treats income as discrete. Other analysts fit a continuous parametric distribution, but the distribution may not fit well. We implement nonparametric continuous distributions that fit the bin counts perfectly by interpolating the cumulative distribution function (CDF). We also show how both midpoints and interpolated CDFs can be constrained to reproduce the mean of income when it is known. We compare the methods' accuracy in estimating the Gini coefficients of all 3,221 US counties. Fitting parametric distributions is very slow. Fitting interpolated CDFs is much faster and slightly more accurate. Both interpolated CDF estimates and midpoint estimates improve dramatically if constrained to match a known mean. We have implemented interpolated CDFs in the binsmooth package for R. We have implemented the midpoint method in the rpme command for Stata. Both implementations can be constrained to match a known mean.
arXiv:1709.09705v1 [stat.ME] 27 Sep 2017
Paul T. von Hippel
LBJ School of Public Affairs, University of Texas at Austin
and
David J. Hunter, McKalie Drown
Department of Mathematics and Computer Science, Westmont College
September 29, 2017
Keywords: Gini, inequality, income brackets, grouped data
Drown is grateful for support from a Tensor Grant of the Mathematical Association of America.
1 Introduction
Surveys often ask respondents to report income in brackets or bins, such as $0-10,000,
$10,000-20,000,. . . ,$200,000+. Even if respondents report exact incomes, surveys may bin
incomes before publication, either to protect privacy or to summarize the income distribu-
tion compactly with the number of incomes in each bin. Table 1 gives a binned summary
of household incomes in Nantucket, the richest county in the US.
Table 1: Household incomes in Nantucket
Min         Max         Households   Cumulative distribution
$0          $10,000     165          5%
$10,000     $15,000     109          8%
$15,000     $20,000     67           9%
$20,000     $25,000     147          13%
$25,000     $30,000     114          17%
$30,000     $35,000     91           19%
$35,000     $40,000     148          23%
$40,000     $45,000     44           24%
$45,000     $50,000     121          28%
$50,000     $60,000     159          32%
$60,000     $75,000     358          42%
$75,000     $100,000    625          59%
$100,000    $125,000    338          69%
$125,000    $150,000    416          80%
$150,000    $200,000    200          86%
$200,000                521          100%
Note. Each bin’s population is estimated from a 1-in-8 sample of
households who took the American Community Survey in 2006-10.
Incomes are in 2010 dollars.
Binning presents challenges to investigators who want to estimate simple summary
statistics such as the mean, median, or standard deviation, or inequality statistics such as
the Gini coefficient, the Theil index, the coefficient of variation, or the mean log devia-
tion. Researchers have implemented several methods for calculating estimates from binned
incomes.
The simplest and most popular approach is to assign each case to the midpoint of
its bin—using a robust pseudo-midpoint for the top bin, whose upper bound is typically
undefined (e.g., Table 1). The midpoint approach is easy to implement and runs quickly.
Midpoint estimates are also “bin-consistent” (von Hippel et al., 2015) in the sense that they
approach their estimands if the bins are sufficiently numerous and narrow. The weakness
of the midpoint approach is that it treats income as a discrete variable.
Another approach is to fit the bin counts to a continuous parametric distribution. Pop-
ular distributions include 2-, 3-, and 4-parameter distributions from the generalized beta
family, which includes the Pareto, lognormal, Weibull, Dagum, and other distributions
(McDonald and Ransom, 2008). One implementation fits up to 10 distributions and selects
the one that fits best according to an information criterion (AIC or BIC). An alternative is
to use the AIC or BIC to calculate a weighted average of income statistics across several
candidate distributions (von Hippel et al., 2015).
A strength of the parametric approach is that it treats income as continuous. A weakness
is that even the best-fitting parametric distribution may not fit the bin counts particularly
well. If the fit is poor, the parametric approach is not bin-consistent. Even with an infinite
number of bins, each infinitesimally narrow, a parametric distribution may produce poor
estimates if it is not a good fit to the underlying distribution of income.
A practical weakness of the parametric approach is that it is typically implemented using
iterative methods which can be slow. The speed of a parametric fit may be acceptable if
you fit a single distribution to a single binned dataset, but runtimes of hours are possible
if you fit several distributions to thousands of binned datasets—such as every county or
school district in the US. Other computational issues include nonconvergence and undefined
estimates. These issues are rare but inevitable when you run thousands of binned datasets
(von Hippel et al., 2015).
Neither method is uniformly better than the other. With many bins the midpoint
method is better because of bin-consistency, while with fewer than 8 bins the parametric
approach is better because of its smoothness. Empirically, the parametric and midpoint
approaches produce similarly accurate estimates from typical US income data with 15 to 25
bins. Both methods typically estimate the Gini within a few percentage points of its true
value. This is accurate enough for many purposes, but can lead to errors when estimating
small differences or changes such as the 5% increase in the Gini of US family income that
occurred between 1970 and 1980 (von Hippel et al., 2015).
A potential improvement is to fit binned incomes to a flexible nonparametric curve.
Like parametric methods, a nonparametric curve treats income as continuous. Like the
midpoint method, a nonparametric curve can be bin-consistent and fit the bin counts as
closely as we like.
Unfortunately, past nonparametric approaches have been disappointing. A nonparamet-
ric approach using kernel density estimation had substantial bias under some circumstances
(Minoiu and Reddy, 2012). An approach using a spline to model the log of the density
(Kooperberg and Stone, 1992) had even greater bias (von Hippel et al., 2012), though it is
not clear whether the bias came from the method or its implementation in software.
In this paper, we implement and test a nonparametric continuous method that outper-
forms its predecessors. The method, which we call CDF interpolation, is straightforward
and runs quickly because it simply connects points on the empirical cumulative distribution
function (CDF). The method can connect the points using line segments or cubic splines.
When cubic splines are used, the method is similar to “histospline” or “histopolation”
methods which fit a spline to a histogram (Wahba, 1976; Morandi and Costantini, 1989).
But histosplines are limited to histograms which have bins of equal width (Wang, 2015).
CDF interpolation is a more general approach that can handle income data where the bins
have unequal width and the top bin has no upper bound (e.g., Table 1).
We have implemented CDF interpolation in our R package binsmooth (Hunter and Drown,
2016), which is available for download from the Comprehensive R Archive Network (CRAN).
Our results will show that statistics estimated with CDF interpolation are slightly more
accurate than estimates obtained using midpoints or parametric distributions. However,
the difference between methods is dwarfed by the improvement we get if we use the grand
mean of income, which the US Census Bureau often reports alongside the bin counts. If we
constrain either the interpolated CDF or the midpoint method to match a known mean,
we get dramatically better estimates of the Gini. Our binsmooth package can constrain an
interpolated CDF to match a known mean, and our new release of the rpme command for
Stata can constrain the midpoint method to match a known mean as well.
In the rest of this paper, we define the midpoint, parametric, and interpolated CDF
methods more precisely, then compare the accuracy of estimates in binned data summariz-
ing household incomes within US counties. We also show how much estimates can improve
if we have the mean as well as the bin counts.
2 Methods
2.1 Binned data
A binned data set, such as Table 1, consists of counts \(n_1, n_2, \ldots, n_B\) specifying the number
of cases in each of \(B\) bins. Each bin \(b\) is defined by an interval \([l_b, u_b)\), \(b = 1, \ldots, B\),
where \(l_b\) and \(u_b\) are the lower and upper bounds. The bottom bin often starts at zero
(\(l_1 = 0\)), and the top bin may have no upper bound (\(u_B = \infty\)). The total number of cases is
\(T = n_1 + n_2 + \cdots + n_B\).
2.2 A midpoint method
The oldest and simplest way to analyze binned data is the midpoint method, which within
each bin \(b = 1, \ldots, B\) assigns the \(n_b\) incomes to the bin midpoint \(m_b = (l_b + u_b)/2\) (Heitjan,
1989). Then statistics such as the Gini can be calculated by applying sample formulas to
the midpoints \(m_b\) weighted by the counts \(n_b\).
For the top bin, which has no upper bound, we must define some pseudo-midpoint. The
traditional choice is \(\mu_B = l_B \alpha / (\alpha - 1)\), which would define the arithmetic mean of the top
bin if top-bin incomes followed a Pareto distribution with shape parameter \(\alpha > 1\) (Henson,
1967). The problem with this choice is that the arithmetic mean of a Pareto distribution
is undefined if \(\alpha \le 1\) and grows arbitrarily large as \(\alpha\) approaches 1 from above.
A more robust choice is the harmonic mean \(h_B = l_B (1 + 1/\alpha)\), which is defined for all
\(\alpha > 0\) (von Hippel et al., 2015). We use the harmonic mean in this article. We estimate \(\alpha\)
by fitting a Pareto distribution to the top two bins and calculating the maximum likelihood
estimate from the following formula (Quandt, 1966):
\[ \hat{\alpha} = \frac{\ln((n_{B-1} + n_B)/n_B)}{\ln(l_B / l_{B-1})} \tag{1} \]
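As a rough illustration of the midpoint method with a robust top-bin pseudo-midpoint, the sketch below applies equation (1) and the harmonic mean to the Nantucket counts in Table 1. The function names are ours, and the Gini is computed with the standard discrete Lorenz-curve formula, which is an assumption about how the sample formula is applied rather than a description of the rpme code.

```python
import math

# Nantucket bins from Table 1: lower bounds and household counts.
lower = [0, 10000, 15000, 20000, 25000, 30000, 35000, 40000,
         45000, 50000, 60000, 75000, 100000, 125000, 150000, 200000]
counts = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]


def pareto_alpha(lower, counts):
    """Equation (1): Pareto shape estimated from the top two bins."""
    n_top, n_next = counts[-1], counts[-2]
    return math.log((n_next + n_top) / n_top) / math.log(lower[-1] / lower[-2])


def robust_midpoints(lower, counts):
    """Bin midpoints, with the harmonic-mean pseudo-midpoint for the top bin."""
    mids = [(lower[b] + lower[b + 1]) / 2 for b in range(len(lower) - 1)]
    alpha = pareto_alpha(lower, counts)
    mids.append(lower[-1] * (1 + 1 / alpha))
    return mids


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def weighted_gini(values, weights):
    """Gini of a discrete distribution; values must be sorted ascending."""
    total = sum(weights)
    income = sum(v * w for v, w in zip(values, weights))
    lorenz_prev, area = 0.0, 0.0
    for v, w in zip(values, weights):
        lorenz = lorenz_prev + v * w / income
        area += (w / total) * (lorenz_prev + lorenz)
        lorenz_prev = lorenz
    return 1 - area


mids = robust_midpoints(lower, counts)
print(round(pareto_alpha(lower, counts), 3))
print(round(weighted_mean(mids, counts)))
print(round(weighted_gini(mids, counts), 3))
```

On these counts the sketch gives a mean of roughly $121,500 and a Gini of roughly .464, in line with the midpoint (RPME) row of Table 2.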
If the survey provides the grand mean \(\mu\), then we need not assume that top-bin incomes
follow a Pareto distribution. Instead, we can calculate
\[ \hat{\mu}_B = \frac{1}{n_B} \left( T\mu - \sum_{b=1}^{B-1} n_b m_b \right) \tag{2} \]
which would be the mean of the top bin if the means of the lower bins were the midpoints
\(m_b\). Then \(\hat{\mu}_B\) can serve as a pseudo-midpoint for the top bin.
In 4 percent of US counties, it happens that \(\hat{\mu}_B\) is slightly less than \(l_B\). This is infelicitous,
but \(\hat{\mu}_B\) can still be used, and the resulting Gini estimates are not necessarily bad. (It
might be a little better to set \(\hat{\mu}_B = l_B\) and move the other midpoints slightly to the left,
but this would affect only 4 percent of counties, and those only slightly since the affected
counties typically have few cases in the top bin.)
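The mean-matched pseudo-midpoint of equation (2) can be sketched in a few lines. The example uses the Nantucket counts from Table 1 and the published mean from Table 2; the function names are ours, and the discrete Lorenz-curve Gini is an assumed choice of sample formula.

```python
# Nantucket bins from Table 1 and the published mean from Table 2.
lower = [0, 10000, 15000, 20000, 25000, 30000, 35000, 40000,
         45000, 50000, 60000, 75000, 100000, 125000, 150000, 200000]
counts = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]
known_mean = 137811


def mean_matched_midpoints(lower, counts, known_mean):
    """Midpoints for the lower bins plus the top-bin value from equation (2)."""
    mids = [(lower[b] + lower[b + 1]) / 2 for b in range(len(lower) - 1)]
    total = sum(counts)
    lower_income = sum(n * m for n, m in zip(counts[:-1], mids))
    mids.append((total * known_mean - lower_income) / counts[-1])
    return mids


def weighted_gini(values, weights):
    """Gini of a discrete distribution; values must be sorted ascending."""
    total = sum(weights)
    income = sum(v * w for v, w in zip(values, weights))
    lorenz_prev, area = 0.0, 0.0
    for v, w in zip(values, weights):
        lorenz = lorenz_prev + v * w / income
        area += (w / total) * (lorenz_prev + lorenz)
        lorenz_prev = lorenz
    return 1 - area


mids = mean_matched_midpoints(lower, counts, known_mean)
implied_mean = sum(n * m for n, m in zip(counts, mids)) / sum(counts)
print(round(mids[-1]), round(weighted_gini(mids, counts), 3))
```

By construction the weighted mean of these midpoints equals the known mean. On the Nantucket counts the Gini comes out near .510, the mean-matched midpoint value in Table 2.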
The midpoint methods described in this section are implemented by the rpme command
for Stata (Duan and von Hippel, 2015), where rpme stands for “robust Pareto midpoint
estimator” (von Hippel and Powers, 2017). Except for mean-matching, the approach is
also implemented in the binequality package for R (Scarpino et al., 2017).
2.3 Fitting parametric distributions
The weakness of the midpoint method is that it treats income as discrete. An alternative
is to model income as a continuous variable \(X\) that fits some parametric CDF \(F(X|\theta)\).
Here \(\theta\) is a vector of parameters, which can be estimated by iteratively maximizing the log
likelihood:
\[ \ell(\theta|X) = \ln \prod_{b=1}^{B} \left( P(l_b < X < u_b) \right)^{n_b} \tag{3} \]
where \(P(l_b < X < u_b) = F(u_b) - F(l_b)\) is the probability, according to the fitted distribution,
that an income is in the bin \([l_b, u_b)\).¹
While any parametric distribution can be considered, in practice it is hard to fit a distri-
bution unless the number of parameters is small compared to the number of well-populated
bins. Most investigators favor 2-, 3-, and 4-parameter distributions from the generalized
beta family (McDonald and Ransom, 2008), which includes the following 10 distributions:
the log normal, the log logistic, the Pareto (type 2), the gamma and generalized gamma,
the beta 2 and generalized beta (type 2), the Dagum, the Singh-Maddala, and the Weibull.
A priori it is hard to know which distribution, if any, will fit well. In fact, every
distribution from the generalized beta family can often be rejected by the following
goodness-of-fit likelihood ratio statistic (von Hippel et al., 2015):
\[ G^2 = -2 \left( \hat{\ell} - \sum_{b=1}^{B} n_b \ln(n_b/T) \right) \tag{4} \]
Under the null hypothesis that the fitted distribution is the true distribution of income,
\(G^2\) would follow a chi-square distribution with \(B^* - k\) degrees of freedom, where \(B^* =
\min(B_{>0}, B - 1)\) and \(B_{>0}\) is the number of bins with nonzero counts. We reject the null
hypothesis if \(G^2\) is extreme with respect to the null distribution, and in empirical data it is
common to reject every distribution in the generalized beta family (von Hippel et al., 2015).
In addition, some distributions may fail to converge, or may converge on parameter values
that imply that the mean or variance of the distribution is undefined (von Hippel et al.,
2015).
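To make equations (3) and (4) concrete, the sketch below evaluates the binned log likelihood and the \(G^2\) statistic for a lognormal distribution on the Nantucket counts. The parameter values are illustrative guesses rather than maximum likelihood estimates, the function names are ours, and a real fit would maximize the log likelihood iteratively.

```python
import math

# Nantucket bins from Table 1: lower bounds and household counts.
lower = [0, 10000, 15000, 20000, 25000, 30000, 35000, 40000,
         45000, 50000, 60000, 75000, 100000, 125000, 150000, 200000]
counts = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]


def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal distribution; 0 at x <= 0 and 1 at x = infinity."""
    if x <= 0:
        return 0.0
    if math.isinf(x):
        return 1.0
    return 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))


def binned_loglik(lower, counts, cdf):
    """Equation (3): sum of n_b * ln(F(u_b) - F(l_b)); the top bin is unbounded."""
    upper = lower[1:] + [math.inf]
    return sum(n * math.log(cdf(u) - cdf(l))
               for l, u, n in zip(lower, upper, counts) if n > 0)


def g_squared(loglik, counts):
    """Equation (4): likelihood ratio against the saturated model."""
    total = sum(counts)
    saturated = sum(n * math.log(n / total) for n in counts if n > 0)
    return -2 * (loglik - saturated)


def degrees_of_freedom(counts, k):
    """min(number of nonzero bins, B - 1) minus the number of parameters k."""
    nonzero = sum(1 for n in counts if n > 0)
    return min(nonzero, len(counts) - 1) - k


# Illustrative parameter values, not maximum likelihood estimates.
mu, sigma = math.log(80000), 1.0
loglik = binned_loglik(lower, counts, lambda x: lognormal_cdf(x, mu, sigma))
print(round(g_squared(loglik, counts), 1), degrees_of_freedom(counts, k=2))
```

Because the saturated model reproduces the observed bin fractions exactly, \(G^2\) is never negative; larger values indicate a worse fit to the bin counts.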
A solution is to fit all 10 distributions, screen out any with undefined moments, and
among those remaining select the best fitting according to the Akaike or Bayes information
criterion:
\[ \mathrm{AIC} = 2k - 2\hat{\ell} \]
\[ \mathrm{BIC} = \ln(T)\,k - 2\hat{\ell} \]
¹ This formula works for the top bin \(B\); then \(u_B = \infty\) and \(F(u_B) = 1\). Some older articles add the
following constant to the likelihood: \(\ln(T!) - \sum_{b=1}^{B} \ln(n_b!)\). This is not wrong, but it is unnecessary since
adding a constant does not change the parameter values at which the likelihood is maximized (Edwards,
1972).
where \(k\) is the number of parameters and \(\hat{\ell}\) is the maximized log likelihood.
Instead of selecting a single best-fitting distribution, one can average estimates across
several candidate distributions weighted proportionately to \(\exp(-\mathrm{AIC}/2)\) or \(\exp(-\mathrm{BIC}/2)\),
an approach known as model averaging. In general, model averaging produces better
estimates than model selection (Burnham and Anderson, 2004), but when modeling binned
incomes the advantage of model averaging is negligible (von Hippel et al., 2015).
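A minimal sketch of model selection and model averaging follows. The log likelihoods and Gini values are hypothetical numbers chosen only to show the arithmetic; they do not come from any fit reported here, and the function names are ours.

```python
import math


def aic(loglik, k):
    return 2 * k - 2 * loglik


def bic(loglik, k, total):
    return math.log(total) * k - 2 * loglik


def model_weights(criteria):
    """Weights proportional to exp(-criterion / 2), normalized to sum to 1."""
    best = min(criteria)  # subtracting the minimum avoids underflow
    raw = [math.exp(-(c - best) / 2) for c in criteria]
    return [r / sum(raw) for r in raw]


# Hypothetical fits: (name, maximized log likelihood, parameters, Gini).
fits = [("dagum", -9000.0, 3, 0.46),
        ("gamma", -9004.0, 2, 0.44),
        ("generalized gamma", -9001.5, 3, 0.45)]
total = 3623  # number of cases T

aic_weights = model_weights([aic(ll, k) for _, ll, k, _ in fits])
bic_weights = model_weights([bic(ll, k, total) for _, ll, k, _ in fits])

# Model selection keeps the single best fit; model averaging blends all fits.
selected = fits[aic_weights.index(max(aic_weights))][0]
averaged_gini = sum(w * g for w, (_, _, _, g) in zip(aic_weights, fits))
print(selected, round(averaged_gini, 3))
```

Only differences in AIC or BIC matter for the weights, which is why the smallest criterion value can be subtracted before exponentiating.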
Although it sounds broad-minded to fit 10 different distributions, there is limited di-
versity in the generalized beta family. All the distributions in the generalized beta family
are unimodal and skewed to the right. Some distributions are quite similar (e.g., Dagum
and generalized beta), and others rarely fit well (e.g., log normal, log logistic, Pareto).
So in practice the range of viable and contrasting distributions in the generalized beta
family is small; you can fit just 3 well-chosen distributions (e.g., Dagum, gamma, and gen-
eralized gamma) and get estimates almost as good as those obtained from fitting all 10
(von Hippel et al., 2015).
Some income statistics are functions of the distributional parameters θ, but the functions
for some statistics are unknown or hard to calculate. As a general solution, it is easier to
calculate income statistics by numeric integration.
When the grand mean is available, the parameters could in theory be constrained to
match the grand mean as well as approximate the bin counts. This would be difficult,
though, since for distributions in the generalized beta family the mean is a complicated
nonlinear function of the parameters. We have not attempted to constrain our parametric
distributions to match a known mean.
The methods discussed in this section are implemented in the binequality package for
R (Scarpino et al., 2017) and the mgbe command for Stata (Duan and von Hippel, 2015).
Here mgbe stands for “multi-model generalized beta estimator” (von Hippel et al., 2015).
2.4 Interpolated CDFs
Since parametric distributions may fit poorly, an alternative is to define a flexible nonpara-
metric curve that fits the bin counts exactly.
As illustrated in the last column of Table 1, the binned data define \(B\) discrete points on
the empirical cumulative distribution function (CDF), i.e., \((u_b, \hat{F}(u_b))\), \(b = 1, \ldots, B - 1\),
where \(\hat{F}(u_b) = (n_1 + n_2 + \cdots + n_b)/T\) is the fraction of incomes less than \(u_b\), and \(F(0) = 0\).²
To estimate a continuous CDF \(\hat{F}(x)\), \(x > 0\), we just “connect the dots.” That is, we
define a continuous nondecreasing function that interpolates between the \(B\) points of the
empirical CDF.
The estimated probability density function (PDF) is just the derivative of the interpolated
CDF \(\hat{F}(x)\). Note that the estimated PDF “preserves areas” (Morandi and Costantini,
1989), i.e., preserves bin counts, in the sense that the model probability that an income
will fall in each bin, \(\hat{P}(l_b < x < u_b) = \hat{F}(u_b) - \hat{F}(l_b)\), is equal to the observed fraction of
incomes that are in that bin, \(n_b/T\).
The shape of the PDF depends on the function that interpolates the CDF.
If the CDF is interpolated by line segments, then the CDF is polygonal, and the PDF
is a step function that is discontinuous at the bin boundaries.
If the CDF is interpolated by a monotone cubic spline, then the CDF is smooth,
and the PDF is piecewise quadratic, i.e., continuous at the bin boundaries and
quadratic between them.
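The linear case can be sketched directly. The code below builds the empirical CDF points for the bounded Nantucket bins, connects them with line segments, and returns the implied step PDF. It is a simplified illustration with names of our choosing, not the binsmooth implementation, and it leaves the unbounded top bin aside.

```python
import bisect

# Nantucket bin edges up to the lower bound of the top bin, and all counts.
edges = [0, 10000, 15000, 20000, 25000, 30000, 35000, 40000,
         45000, 50000, 60000, 75000, 100000, 125000, 150000, 200000]
counts = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]
total = sum(counts)

# Empirical CDF at each edge: F(0) = 0, then cumulative fractions.
cdf_at_edges = [0.0]
for n in counts[:-1]:
    cdf_at_edges.append(cdf_at_edges[-1] + n / total)


def segment(x):
    """Index i such that edges[i - 1] <= x <= edges[i]."""
    if not edges[0] <= x <= edges[-1]:
        raise ValueError("x is outside the bounded bins")
    return min(bisect.bisect_right(edges, x), len(edges) - 1)


def interpolated_cdf(x):
    """Piecewise-linear CDF through the empirical CDF points."""
    i = segment(x)
    x0, x1 = edges[i - 1], edges[i]
    f0, f1 = cdf_at_edges[i - 1], cdf_at_edges[i]
    return f0 + (f1 - f0) * (x - x0) / (x1 - x0)


def step_pdf(x):
    """Derivative of the linear CDF: constant within each bounded bin."""
    i = segment(x)
    return (cdf_at_edges[i] - cdf_at_edges[i - 1]) / (edges[i] - edges[i - 1])


print(round(interpolated_cdf(12500), 4), step_pdf(12500))
```

The difference of the interpolated CDF across any bounded bin equals that bin's observed fraction, which is the area-preserving property described above.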
There remains a question of how to shape the CDF in the top bin, which typically has
no upper bound.
If the grand mean of income is known, then the shape of the CDF in the top bin can
be constrained to match the known mean. The CDF of the top bin can be rectangular,
exponential, or Pareto.³
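For a rectangular tail on a step PDF, mean-matching reduces to choosing the upper bound of the top bin. The sketch below assumes incomes are uniform within every bin, so each lower bin's mean is its midpoint and the top bin must have the mean given by equation (2). This is our simplified reading of the rectangular case, with hypothetical function names, and not necessarily how binsmooth computes it.

```python
# Nantucket bins from Table 1 and the published mean from Table 2.
lower = [0, 10000, 15000, 20000, 25000, 30000, 35000, 40000,
         45000, 50000, 60000, 75000, 100000, 125000, 150000, 200000]
counts = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]
known_mean = 137811


def rectangular_tail_bound(lower, counts, known_mean):
    """Upper bound of a uniform top bin that makes a step PDF match the mean."""
    total = sum(counts)
    mids = [(lower[b] + lower[b + 1]) / 2 for b in range(len(lower) - 1)]
    lower_income = sum(n * m for n, m in zip(counts[:-1], mids))
    top_mean = (total * known_mean - lower_income) / counts[-1]
    if top_mean <= lower[-1]:
        raise ValueError("known mean is too small for a tail above the top bound")
    # A uniform bin on [l_B, u] has mean (l_B + u) / 2, so u = 2 * mean - l_B.
    return 2 * top_mean - lower[-1]


def step_pdf_mean(edges, counts):
    """Mean of a step PDF: each bin contributes its midpoint times its share."""
    total = sum(counts)
    return sum(n * (edges[b] + edges[b + 1]) / 2
               for b, n in enumerate(counts)) / total


upper = rectangular_tail_bound(lower, counts, known_mean)
edges = lower + [upper]
print(round(upper), round(step_pdf_mean(edges, counts)))
```

The `ValueError` branch corresponds to the case where the known mean is too small to be matched by adjusting the tail alone.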
In 4 percent of US counties we cannot reproduce the known mean simply by constraining
the tail, because the known mean is already less than the mean of the lower \(B - 1\) bins
without the tail. In that case, we make an ad hoc adjustment by shrinking the bin boundaries
toward the origin, that is, by replacing \((l_b, u_b)\) with \((s l_b, s u_b)\), where the shrinkage
factor \(s < 1\) is chosen so that a small tail can be added, as described above, to reproduce
the grand mean. The shrinkage factor is rarely less than .995.
² If the binned incomes come from a sample (as they do in Table 1), then the estimate \(\hat{F}(u_b)\) may differ
from the true CDF \(F(u_b)\) because of sampling error.
³ In our implementation the exponential and Pareto tails are approximated by a sequence of rectangles
of decreasing heights.
If the mean income is not known, we use an ad hoc estimate. We obtain that estimate
by temporarily setting the upper bound of the top bin to \(u_B = 2 l_B\) and calculating the mean
of a step PDF fit to all \(B\) bins. Then we unbound the top bin and proceed as though the
mean were known.
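The ad hoc mean estimate is simple enough to compute by hand. A short sketch on the Nantucket counts, with a function name of our choosing:

```python
# Nantucket bins from Table 1: lower bounds and household counts.
lower = [0, 10000, 15000, 20000, 25000, 30000, 35000, 40000,
         45000, 50000, 60000, 75000, 100000, 125000, 150000, 200000]
counts = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]


def ad_hoc_mean(lower, counts):
    """Mean of a step PDF after temporarily closing the top bin at 2 * l_B."""
    edges = lower + [2 * lower[-1]]
    total = sum(counts)
    return sum(n * (edges[b] + edges[b + 1]) / 2
               for b, n in enumerate(counts)) / total


print(round(ad_hoc_mean(lower, counts), 1))
```

For Nantucket this gives a value within a dollar of the $110,419 mean reported for the interpolated CDF methods in Table 2.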
Income statistics are estimated by applying numerical integration to functions of the
fitted PDF or CDF.
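For a polygonal CDF the integrals can even be written in closed form. The sketch below uses two standard identities for a nonnegative variable, mean = ∫(1 − F) dx and Gini = ∫F(1 − F) dx / mean, and assumes the top bin has already been closed with a rectangular tail (here at \(2 l_B\), as in the ad hoc step). The function name is ours; binsmooth may integrate differently, and other tail shapes give somewhat different Ginis.

```python
# Nantucket bin edges with the top bin closed at 2 * l_B, and the bin counts.
edges = [0, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000,
         50000, 60000, 75000, 100000, 125000, 150000, 200000, 400000]
counts = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]


def mean_and_gini(edges, counts):
    """Mean and Gini of a polygonal CDF that starts at edges[0] = 0.

    Uses mean = integral of (1 - F) and Gini = integral of F * (1 - F) / mean.
    Because F is linear on each bin, both integrals have closed forms.
    """
    total = sum(counts)
    mean_integral, gini_integral, a = 0.0, 0.0, 0.0
    for b, n in enumerate(counts):
        width = edges[b + 1] - edges[b]
        c = a + n / total  # CDF at the bin's upper edge
        mean_integral += width * (1 - (a + c) / 2)
        gini_integral += width * ((a + c) / 2 - (a * a + a * c + c * c) / 3)
        a = c
    return mean_integral, gini_integral / mean_integral


mean, gini = mean_and_gini(edges, counts)
print(round(mean), round(gini, 3))
```

As a sanity check, a single bin on [0, 1] describes a uniform distribution, for which the function returns a mean of 1/2 and a Gini of 1/3.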
The methods in this section are implemented by our binsmooth package for R (Hunter and Drown,
2016). Within the binsmooth package, the stepbins function implements a step-function
PDF (and polygonal CDF), while the splinebins function implements a cubic spline CDF
(and piecewise quadratic PDF).
2.5 Recursive subdivision
Another way to obtain a smooth PDF estimate that preserves bin areas is to subdivide
the bins into smaller bins, and then adjust the heights of the subdivided bins to shorten
the jumps at the bin boundaries. This method, recursive subdivision, is implemented by
the recbins function in the binsmooth package for R (Hunter and Drown, 2016). Recursive
subdivision is slower and more computationally intensive than CDF interpolation and yields
PDF estimates that are practically identical. We discuss the details of recursive subdivision
in the Appendix.
3 Data and Results
Between 2006 and 2010, the American Community Survey (ACS) took a 1-in-8 sample of US
households (Census Bureau, 2014). Household incomes were inflated to 2010 dollars and
summarized in binned income tables for each of the 3,221 US counties. The published bin
counts are estimates of the population counts. We can approximate the sample counts by
dividing the population counts by 8. Dividing counts by a constant makes no difference to
any of the statistics, except for the BIC and \(G^2\) statistics that are used when fitting parametric
distributions.
The Census also published means and Gini coefficients for each county (Bee, 2012).
These statistics were estimated from exact incomes before binning, and so are more accu-
rate than any estimate that could be calculated from the binned data. They are sample
estimates which may differ from population values, but they remain a useful standard of
comparison for our binned-data estimates.
3.1 Results for Nantucket
Table 1 summarized the binned incomes for Nantucket County. Figure 1 fits several distri-
butions to Nantucket. The purple curve is the Dagum distribution, which fits Nantucket
better than alternatives in the generalized beta family. Yet the Dagum distribution fails
the \(G^2\) goodness-of-fit test, and visually the Dagum distribution fits the bin counts poorly.
Its fit is reasonable for incomes less than $70,000, but between $70,000 and $150,000 it
underestimates the density, and above $150,000 it overestimates the density.
Figure 1: Different PDFs for the Nantucket data.
[Figure not reproduced; the horizontal axis runs from 0 to 500,000.]
The Dagum distribution is drawn in purple, the quadratic spline in blue, and the step function in black.
Gray spikes illustrate the midpoint method.
The midpoint method is illustrated by gray spikes at the bin midpoints, whose heights
are proportional to the bin counts. The black step function is the PDF implied by a linear
interpolation of the CDF, and the blue curve is the piecewise quadratic PDF implied by a
cubic spline interpolation of the CDF. Both fit the bin counts perfectly; in fact, their jagged
appearance suggests they may overfit the data, a possibility that we will return to later.
The step PDF looks slightly less volatile than the piecewise quadratic PDF, suggesting
that the step PDF may be less overfit.
Except for the Dagum distribution, all the methods in Figure 1 are calibrated to repro-
duce the grand mean.
Table 2 summarizes the Nantucket estimates. The true mean is $137,811 and the true
Gini is .547. All the methods yield underestimates. When fit without knowledge of the
true mean, every method underestimates the mean by 12 to 20 percent and the Gini by
15 to 21 percent. Remarkably, the simple midpoint method comes closer than its more
sophisticated competitors.
Table 2: Estimates for Nantucket
                     Estimate                   Mean       Gini
                     True                       $137,811   .547
Without true mean    Midpoint (RPME)            $121,506   .464
                     Parametric (MGBE)          $112,960   .453
                     Histospline (linear CDF)   $110,419   .438
                     Histospline (cubic CDF)    $110,419   .433
Matching true mean   Midpoint                              .510
                     Histospline (linear CDF)              .537
                     Histospline (cubic CDF)               .525
When given the true mean, the methods do much better. They still underestimate the
Gini, but only by 2 to 7 percent. The closest estimate is obtained by linear interpolation
of the CDF. A smoother cubic spline interpolation does a little worse, but still better than
the midpoint method.
Although the estimates for Nantucket are less accurate than the estimates for most
other counties, the relative performance of different methods in Nantucket is typical of
what we will see elsewhere.
3.2 Results for all US counties
Figure 2 evaluates all county Gini estimates graphically by plotting the estimated Gini
\(\hat{\theta}_j\) of each county against the published Gini \(\theta_j\). In the bottom row, where the methods
are constrained to match the published mean, the Gini estimates are close to a diagonal
reference line (\(\hat{\theta}_j = \theta_j\)) indicating nearly perfect estimation. In the top row, where the
methods do not match the mean, the estimates are more scattered, indicating lower accuracy.
The parametric estimates are about as good as the interpolated CDF estimates. The
midpoint estimates are noticeably worse when the mean is unknown, but very accurate
when constrained to match a known mean.
Figure 2: Accuracy of Gini coefficients estimated by different methods, with and without
mean-matching.
[Figure not reproduced. Scatterplots of estimated Ginis against true Ginis from 3,221 US counties,
with both axes running from .2 to .7. Panels without mean matching: midpoint estimates, parametric
estimates, interpolated CDF (linear), interpolated CDF (cubic). Panels with mean matching: midpoint
(rpme_truemean), interpolated CDF (linear; stepbins_truemean), interpolated CDF (cubic;
splinebins_truemean).]
We can summarize the accuracy of Gini estimates in several ways. For a single county
\(j\), the percent estimation error is \(e_j = 100 \times (\hat{\theta}_j - \theta_j)/\theta_j\). Then across all counties, the
percent relative bias is the mean of \(e_j\), the percent root mean squared error (RMSE) is the
square root of the mean of \(e_j^2\), and the reliability is the squared correlation between \(\theta_j\) and
\(\hat{\theta}_j\).
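These summaries are straightforward to compute. The sketch below uses made-up Ginis for four hypothetical counties, purely to show the arithmetic; the function names are ours.

```python
import math


def percent_errors(estimates, truths):
    """Percent estimation error e_j for each county."""
    return [100 * (est - true) / true for est, true in zip(estimates, truths)]


def relative_bias(errors):
    return sum(errors) / len(errors)


def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def reliability(estimates, truths):
    """Squared Pearson correlation between estimates and true values."""
    n = len(truths)
    mean_e, mean_t = sum(estimates) / n, sum(truths) / n
    cov = sum((e - mean_e) * (t - mean_t) for e, t in zip(estimates, truths))
    var_e = sum((e - mean_e) ** 2 for e in estimates)
    var_t = sum((t - mean_t) ** 2 for t in truths)
    return cov * cov / (var_e * var_t)


# Made-up Ginis for four hypothetical counties, for illustration only.
truths = [0.40, 0.45, 0.50, 0.55]
estimates = [0.39, 0.45, 0.48, 0.54]
errors = percent_errors(estimates, truths)
print(round(relative_bias(errors), 2), round(rmse(errors), 2),
      round(reliability(estimates, truths), 3))
```

Applied to a single county, `percent_errors([0.464], [0.547])` gives about -15, the midpoint error for Nantucket without mean-matching (Table 2).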
Table 3 summarizes our findings. If the estimators ignore the published county means,
then estimated Ginis have biases between 0% and -3%, RMSEs between 3% and 4%, and
reliabilities between 82% and 89%. The interpolated CDF estimates have the best bias,
the best RMSE, and the second best reliability, and they are just as good with linear
interpolation as with cubic spline interpolation. The parametric estimates have the best
reliability, but the worst bias, the worst RMSE, and by far the worst runtime, at 4.5 hours.
Table 3: Speed and accuracy of Gini estimates for 3,221 US counties.
                     Estimator                          Bias   RMSE   Reliability   Runtime
Without true mean    Midpoint                           -2%    4%     82%           4 sec
                     Parametric                         -3%    4%     89%           4.5 hr
                     CDF interpolation (linear)         0%     3%     88%           40 sec
                     CDF interpolation (cubic spline)   -1%    3%     88%           2 min
Matching true mean   Midpoint                           -1%    2%     98%           7 sec
                     CDF interpolation (linear)         0%     1%     99%           36 sec
                     CDF interpolation (cubic spline)   -1%    1%     99%           2 min
Note. RMSE = root mean squared error. Runtimes on an Intel Core i7 processor with a speed of 3.6 GHz.
When the methods are constrained to match the published county means, the estimates
improve dramatically. The bias shrinks to 0-1%, the RMSE shrinks to 1-2%, and the
reliability grows to 98-99%. The midpoint estimates are excellent, and the interpolated
CDF estimates even better, and just as good with linear interpolation as with cubic spline
interpolation.
The differences among the methods are much smaller than the improvement that comes
from constraining any method to match the mean. Of course, this observation is only helpful
when the mean is known.
4 Conclusion
CDF interpolation produces estimates that are at least a little better than midpoint or
parametric estimates, whether the true mean is known or not. And CDF interpolation
runs much faster than parametric estimation, though not as fast as midpoint estimation.
We initially suspected that cubic spline interpolation would improve on simple linear
interpolation, but empirically this turns out to be false. In estimating county Ginis, linear
CDF interpolation was at least as accurate as cubic spline interpolation.
The accuracy of linear CDF interpolation is remarkable, since it implies a step function
for the PDF. Step PDFs seem clearly unrealistic, especially in the top and bottom bins
where the step function is flat while the true distribution likely has an upward or downward
slope (Cloutier, 1995). Our implementation of the step PDF permits a downward Pareto
or exponential slope in the top bin, but this makes little difference to the Gini estimate.
It would be straightforward to also permit an upward slope in the bottom bin (Jargowsky,
1996; Cloutier, 1995), but perhaps this would make little difference, either. After all, the
cubic spline CDF yields a bottom bin that slopes upward, yet its Gini estimates are no
better.
The differences in accuracy among the methods are small, and they are dwarfed by
the improvement in accuracy that comes from knowing the grand mean. By constraining
binned-data methods to match a known mean, we can get county Gini estimates
that are typically within 1-2% of the estimates we would get if the data were not binned.
Our binsmooth package for R can constrain interpolated CDFs to match a known mean,
and our rpme command for Stata can constrain the top-bin midpoint to match a known
mean as well. We have not constrained our parametric distributions to match a known
mean, and we believe it would be difficult.
While the mean-constrained estimates are very accurate, there may be room for im-
provement when the mean is unknown. Perhaps the most promising idea for improvement
is smoothing. As we noticed in Figure 1, interpolated CDFs can be a bit
jagged and may “overfit” the sample in the sense that they find nooks and crannies that
might not appear in another sample or in the population. Likewise interpolated CDFs may
be overfit to a specific set of bin boundaries. If the fitted CDF were a little smoother and
did not quite preserve the counts of the least populous bins, it might fit the population and
other samples (perhaps with different bin boundaries) a little better.
References
Bee, A. (2012, nov). Multimodel inference. Technical Report 2.
Burnham, K. P. and D. R. Anderson (2004, nov). Multimodel inference. Sociological
Methods & Research 33 (2), 261–304.
Cloutier, N. R. (1995). Lognormal extrapolation and income estimation for poor black
families. Journal of Regional Science 35 (1), 165–171.
Duan, Y. and P. T. von Hippel (2015). mgbe Multimodel Generalized Beta Estimator.
Stata command version 1.0.
Edwards, A. W. F. (1972). Likelihood. Johns Hopkins University Press.
Heitjan, D. F. (1989, may). [inference from grouped continuous data: A review]: Rejoinder.
Statistical Science 4 (2), 182–183.
Henson, M. (1967). Trends in the income of families and persons in the United States,
1947-1964. U. S. Dept. of Commerce, Bureau of the Census.
Hunter, D. J. and M. Drown (2016). binsmooth: Generate PDFs and CDFs from Binned
Data. R package version 0.1.0.
Jargowsky, P. A. (1996, dec). Take the money and run: Economic segregation in U.S.
metropolitan areas. American Sociological Review 61 (6), 984.
Kooperberg, C. and C. J. Stone (1992, dec). Logspline density estimation for censored
data. Journal of Computational and Graphical Statistics 1 (4), 301.
McDonald, J. B. and M. Ransom (2008). The generalized beta distribution as a model for
the distribution of income: Estimation of related measures of inequality. In Modeling
Income Distributions and Lorenz Curves, pp. 147–166. Springer New York.
Minoiu, C. and S. G. Reddy (2012, apr). Kernel density estimation on grouped data: the
case of poverty assessment. The Journal of Economic Inequality 12 (2), 163–189.
Morandi, R. and P. Costantini (1989, mar). Piecewise monotone quadratic histosplines.
SIAM Journal on Scientific and Statistical Computing 10 (2), 397–406.
Quandt, R. E. (1966, dec). Old and new methods of estimation and the Pareto distribution.
Metrika 10 (1), 55–82.
Scarpino, S. V., P. T. von Hippel, and I. Holas (2017). binequality: Methods for Analyzing
Binned Income Data. R package version 1.02.
United States Census Bureau (2014). American FactFinder. Retrieved from
http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml on 6/16/2014.
von Hippel, P. T., I. Holas, and S. V. Scarpino (2012). Estimation with binned data.
von Hippel, P. T. and D. A. Powers (2017). rpme Robust Pareto midpoint estimator.
Stata command version 2.0.
von Hippel, P. T., S. V. Scarpino, and I. Holas (2015). Robust estimation of inequality
from binned incomes. Sociological Methodology.
Wahba, G. (1976). Histosplines with knots which are order statistics. Journal of the Royal
Statistical Society. Series B (Methodological) 38 (2), 140–151.
Wang, B. (2015). bda: Density Estimation for Grouped Data. R package version 5.1.6.
A Appendix: PDF smoothing by recursive subdivision
Recursive subdivision is another way to smooth the fitted PDF. It is a little more
computationally intensive than fitting a spline to the CDF and produces very similar results.
Recursive subdivision is implemented by the recbins function in our binsmooth package for
R.
A slight change of notation will be helpful. Since the upper bound $u_b$ of each bin is
equal to the lower bound $l_{b+1}$ of the next, we can think of the bins as having a set of “edges”
$e_0, e_1, \ldots, e_B$, where $e_0 = 0$, and the other $e_b = u_b$ are the upper bounds of bins $1, \ldots, B$.
Start by fitting a step PDF. Let $h_b$ be the height of the step PDF in the bin $[e_b, e_{b+1})$.
Given parameters $\varepsilon_1 \in (0, 0.5)$ and $\varepsilon_2 \in (0, 1)$, the subdivision process begins by introducing
new bin edges $l$ and $r$ between $e_b$ and $e_{b+1}$ such that $(l + r)/2 = (e_b + e_{b+1})/2$ and
$r - l = (e_{b+1} - e_b)\varepsilon_2$. The height of the new bin on the left with edges $e_b$ and $l$ is then shifted
vertically by $(h_{b-1} - h_b)\varepsilon_1$, while the height of the new bin on the right with edges $r$ and
$e_{b+1}$ is shifted vertically by $(h_{b+1} - h_b)\varepsilon_1$. Finally, the new middle bin with edges $l$ and $r$
is shifted vertically so that the area of the three new bins equals the area of the original
bin.^4 See Figure 3.
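The area constraint pins down the middle height in closed form. The derivation below is a sketch under one assumption: that "shifted by" means the stated amount is added to $h_b$. The names $h_L$, $h_M$, $h_R$ (heights of the left, middle, and right sub-bins) and $w = e_{b+1} - e_b$ are introduced here for illustration and do not appear in the original notation.

```latex
\begin{align*}
h_L &= h_b + (h_{b-1} - h_b)\,\varepsilon_1, \qquad
h_R = h_b + (h_{b+1} - h_b)\,\varepsilon_1, \\
r - l &= \varepsilon_2 w, \qquad
l - e_b = e_{b+1} - r = \tfrac{1}{2}(1 - \varepsilon_2)\,w, \\
h_b\,w &= \tfrac{1}{2}(1 - \varepsilon_2)\,w\,(h_L + h_R) + \varepsilon_2\,w\,h_M \\
\Longrightarrow\quad
h_M &= h_b - \frac{(1 - \varepsilon_2)\,\varepsilon_1}{2\,\varepsilon_2}\,
       \bigl(h_{b-1} + h_{b+1} - 2h_b\bigr).
\end{align*}
```

Under this reading, the middle height falls below $h_b$ when the bin is lower than the average of its neighbors and rises above it otherwise, and it can become negative only when the neighbors are much taller than the bin itself.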
In order for the above formulas to apply to the top and bottom bins, we create pseudo-
bins above and below them with heights of zero. This ensures that the subdivided PDF
will tend toward a height of zero at the lower edge of the bottom bin and the upper edge
of the top bin.
The smoothed PDF is obtained from the step PDF by applying the subdivision process
to each bin, then applying the process again to each subdivided bin, and so on, until the
desired level of smoothness is reached. In practice, three rounds of subdivision are sufficient
to produce a reasonably smooth PDF, and we found that choosing $\varepsilon_1 = 0.25$ and $\varepsilon_2 = 0.75$
produced nicely smoothed PDFs from most empirical data sets. Figure 4 shows the result
of recursive subdivision for the Nantucket data set.
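The procedure described in this appendix can be sketched in a few lines of Python. This is an illustrative sketch, not the recbins implementation from binsmooth. The function names (subdivide_once, recursive_subdivision, step_area, step_mean) and the toy bin edges and heights are hypothetical. It assumes that the top bin has a finite upper edge, that "shifted by" means "added to the current height", and that a bin is left undivided when the middle height would be negative.

```python
def subdivide_once(edges, heights, eps1=0.25, eps2=0.75):
    """Apply one round of three-way subdivision to a step PDF.

    edges   -- bin edges e_0 < e_1 < ... < e_B (finite top edge assumed)
    heights -- heights[b] is the PDF height on [edges[b], edges[b + 1])
    """
    # Zero-height pseudo-bins below the bottom bin and above the top bin.
    padded = [0.0] + list(heights) + [0.0]
    new_edges = [edges[0]]
    new_heights = []
    for b, h in enumerate(heights):
        lo, hi = edges[b], edges[b + 1]
        width = hi - lo
        h_prev, h_next = padded[b], padded[b + 2]
        # New edges l and r, centered in the bin, with r - l = eps2 * width.
        mid = (lo + hi) / 2
        l = mid - eps2 * width / 2
        r = mid + eps2 * width / 2
        # Side heights move toward the neighboring bins' heights.
        h_left = h + (h_prev - h) * eps1
        h_right = h + (h_next - h) * eps1
        # Middle height chosen so the three sub-bins keep the bin's area.
        side_width = (1 - eps2) * width / 2
        h_mid = (h * width - side_width * (h_left + h_right)) / (eps2 * width)
        if h_mid < 0:
            # Leave the bin undivided, as in the footnote to this appendix.
            new_edges.append(hi)
            new_heights.append(h)
        else:
            new_edges += [l, r, hi]
            new_heights += [h_left, h_mid, h_right]
    return new_edges, new_heights


def recursive_subdivision(edges, heights, rounds=3, eps1=0.25, eps2=0.75):
    """Repeat the subdivision step; three rounds is the suggested default."""
    for _ in range(rounds):
        edges, heights = subdivide_once(edges, heights, eps1, eps2)
    return edges, heights


def step_area(edges, heights):
    """Total area under a step function."""
    return sum(h * (edges[i + 1] - edges[i]) for i, h in enumerate(heights))


def step_mean(edges, heights):
    """Mean of a step PDF whose total area is 1."""
    return sum(h * (edges[i + 1] ** 2 - edges[i] ** 2) / 2
               for i, h in enumerate(heights))


if __name__ == "__main__":
    # Toy step PDF (made-up numbers): three bins with total area 1.
    toy_edges = [0.0, 10.0, 20.0, 40.0]
    toy_heights = [0.03, 0.05, 0.01]
    e, h = recursive_subdivision(toy_edges, toy_heights)
    print("bins after smoothing:", len(h))
    print("area before/after:", step_area(toy_edges, toy_heights), step_area(e, h))
    print("mean before/after:", step_mean(toy_edges, toy_heights), step_mean(e, h))
```

Comparing step_mean before and after smoothing gives a quick check of how far the mean drifts for a given data set, since the area in each bin is preserved but the mean is not constrained.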
Unfortunately, if the original step PDF was constrained to match a known mean, the
subdivision process may cause the mean to deviate slightly. But the estimated Gini typically
remains quite accurate.

^4 We do not divide the bin in those rare cases where this subdivision algorithm yields a middle bin with
negative height.

Figure 3: Bin subdivision. Each original bin is replaced by three new bins (bold, dashed)
such that the bin area is preserved. Labels in the figure: the edges $e_b$, $l$, $r$, $e_{b+1}$; the middle-bin
width $(e_{b+1} - e_b)\varepsilon_2$; and the height shifts $(h_{b+1} - h_b)\varepsilon_1$ and $|(h_{b-1} - h_b)\varepsilon_1|$.
Figure 4: Recursively subdivided step function for the Nantucket data set. The original
step function is shown in the background. Notice that the smoothing process preserves the
area in each bin.