Available via license: CC BY 4.0


arXiv:1709.09705v1 [stat.ME] 27 Sep 2017

Better estimates from binned income data:

Interpolated CDFs and mean-matching

Paul T. von Hippel

LBJ School of Public Aﬀairs, University of Texas at Austin

and

David J. Hunter, McKalie Drown∗

Department of Mathematics and Computer Science, Westmont College

September 29, 2017

Abstract

Researchers often estimate income statistics from summaries that report the num-

ber of incomes in bins such as $0-10,000, $10,001-20,000,. . . ,$200,000+. Some ana-

lysts assign incomes to bin midpoints, but this treats income as discrete. Other

analysts ﬁt a continuous parametric distribution, but the distribution may not ﬁt

well.

We implement nonparametric continuous distributions that ﬁt the bin counts

perfectly by interpolating the cumulative distribution function (CDF). We also show

how both midpoints and interpolated CDFs can be constrained to reproduce the

mean of income when it is known.

We compare the methods’ accuracy in estimating the Gini coeﬃcients of all 3,221

US counties. Fitting parametric distributions is very slow. Fitting interpolated CDFs

is much faster and slightly more accurate. Both interpolated CDF estimates and

midpoint estimates improve dramatically if constrained to match a known mean.

We have implemented interpolated CDFs in the binsmooth package for R. We

have implemented the midpoint method in the rpme command for Stata. Both im-

plementations can be constrained to match a known mean.

Keywords: Gini, inequality, income brackets, grouped data

∗Drown is grateful for support from a Tensor Grant of the Mathematical Association of America.


1 Introduction

Surveys often ask respondents to report income in brackets or bins, such as $0-10,000,

$10,000-20,000,. . . ,$200,000+. Even if respondents report exact incomes, surveys may bin

incomes before publication, either to protect privacy or to summarize the income distribu-

tion compactly with the number of incomes in each bin. Table 1 gives a binned summary

of household incomes in Nantucket, the richest county in the US.

Table 1: Household incomes in Nantucket

Min        Max        Households   Cumulative distribution
$0         $10,000    165          5%
$10,000    $15,000    109          8%
$15,000    $20,000    67           9%
$20,000    $25,000    147          13%
$25,000    $30,000    114          17%
$30,000    $35,000    91           19%
$35,000    $40,000    148          23%
$40,000    $45,000    44           24%
$45,000    $50,000    121          28%
$50,000    $60,000    159          32%
$60,000    $75,000    358          42%
$75,000    $100,000   625          59%
$100,000   $125,000   338          69%
$125,000   $150,000   416          80%
$150,000   $200,000   200          86%
$200,000   (none)     521          100%

Note. Each bin’s population is estimated from a 1-in-8 sample of

households who took the American Community Survey in 2006-10.

Incomes are in 2010 dollars.

Binning presents challenges to investigators who want to estimate simple summary

statistics such as the mean, median, or standard deviation, or inequality statistics such as


the Gini coeﬃcient, the Theil index, the coeﬃcient of variation, or the mean log devia-

tion. Researchers have implemented several methods for calculating estimates from binned

incomes.

The simplest and most popular approach is to assign each case to the midpoint of

its bin—using a robust pseudo-midpoint for the top bin, whose upper bound is typically

undeﬁned (e.g., Table 1). The midpoint approach is easy to implement and runs quickly.

Midpoint estimates are also “bin-consistent” (von Hippel et al., 2015) in the sense that they

approach their estimands if the bins are suﬃciently numerous and narrow. The weakness

of the midpoint approach is that it treats income as a discrete variable.

Another approach is to ﬁt the bin counts to a continuous parametric distribution. Pop-

ular distributions include 2-, 3-, and 4-parameter distributions from the generalized beta

family, which includes the Pareto, lognormal, Weibull, Dagum, and other distributions

(McDonald and Ransom, 2008). One implementation ﬁts up to 10 distributions and selects

the one that ﬁts best according to an information criterion (AIC or BIC). An alternative is

to use the AIC or BIC to calculate a weighted average of income statistics across several

candidate distributions (von Hippel et al., 2015).

A strength of the parametric approach is that it treats income as continuous. A weakness

is that even the best-ﬁtting parametric distribution may not ﬁt the bin counts particularly

well. If the ﬁt is poor, the parametric approach is not bin-consistent. Even with an inﬁnite

number of bins, each inﬁnitesimally narrow, a parametric distribution may produce poor

estimates if it is not a good ﬁt to the underlying distribution of income.

A practical weakness of the parametric approach is that it is typically implemented using

iterative methods which can be slow. The speed of a parametric ﬁt may be acceptable if

you ﬁt a single distribution to a single binned dataset, but runtimes of hours are possible

if you ﬁt several distributions to thousands of binned datasets—such as every county or

school district in the US. Other computational issues include nonconvergence and undeﬁned

estimates. These issues are rare but inevitable when you run thousands of binned datasets

(von Hippel et al., 2015).

Neither method is uniformly better than the other. With many bins the midpoint

method is better because of bin-consistency, while with fewer than 8 bins the parametric


approach is better because of its smoothness. Empirically, the parametric and midpoint

approaches produce similarly accurate estimates from typical US income data with 15 to 25

bins. Both methods typically estimate the Gini within a few percentage points of its true

value. This is accurate enough for many purposes, but can lead to errors when estimating

small diﬀerences or changes such as the 5% increase in the Gini of US family income that

occurred between 1970 and 1980 (von Hippel et al., 2015).

A potential improvement is to ﬁt binned incomes to a ﬂexible nonparametric curve.

Like parametric methods, a nonparametric curve treats income as continuous. Like the

midpoint method, a nonparametric curve can be bin-consistent and ﬁt the bin counts as

closely as we like.

Unfortunately, past nonparametric approaches have been disappointing. A nonparamet-

ric approach using kernel density estimation had substantial bias under some circumstances

(Minoiu and Reddy, 2012). An approach using a spline to model the log of the density

(Kooperberg and Stone, 1992) had even greater bias (von Hippel et al., 2012), though it is

not clear whether the bias came from the method or its implementation in software.

In this paper, we implement and test a nonparametric continuous method that outper-

forms its predecessors. The method, which we call CDF interpolation, is straightforward

and runs quickly because it simply connects points on the empirical cumulative distribution

function (CDF). The method can connect the points using line segments or cubic splines.

When cubic splines are used, the method is similar to “histospline” or “histopolation”

methods which ﬁt a spline to a histogram (Wahba, 1976; Morandi and Costantini, 1989).

But histosplines are limited to histograms which have bins of equal width (Wang, 2015).

CDF interpolation is a more general approach that can handle income data where the bins

have unequal width and the top bin has no upper bound (e.g., Table 1).

We have implemented CDF interpolation in our R package binsmooth (Hunter and Drown,

2016), which is available for download from the Comprehensive R Archive Network (CRAN).

Our results will show that statistics estimated with CDF interpolation are slightly more

accurate than estimates obtained using midpoints or parametric distributions. However,

the diﬀerence between methods is dwarfed by the improvement we get if we use the grand

mean of income, which the US Census Bureau often reports alongside the bin counts. If we


constrain either the interpolated CDF or the midpoint method to match a known mean,

we get dramatically better estimates of the Gini. Our binsmooth package can constrain an

interpolated CDF to match a known mean, and our new release of the rpme command for

Stata can constrain the midpoint method to match a known mean as well.

In the rest of this paper, we deﬁne the midpoint, parametric, and interpolated CDF

methods more precisely, then compare the accuracy of estimates in binned data summariz-

ing household incomes within US counties. We also show how much estimates can improve

if we have the mean as well as the bin counts.

2 Methods

2.1 Binned data

A binned data set, such as Table 1, consists of counts \(n_1, n_2, \ldots, n_B\) specifying the number of cases in each of \(B\) bins. Each bin \(b\) is defined by an interval \([l_b, u_b)\), \(b = 1, \ldots, B\), where \(l_b\) and \(u_b\) are the lower and upper bounds. The bottom bin often starts at zero (\(l_1 = 0\)), and the top bin may have no upper bound (\(u_B = \infty\)). The total number of cases is \(T = n_1 + n_2 + \cdots + n_B\).
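To make the notation concrete, the following minimal sketch encodes Table 1 in Python and computes the total and the cumulative distribution. It is an illustration only and is not part of the binsmooth, binequality, or rpme software; the names `LOWER`, `UPPER`, and `COUNTS` are our own assumptions.

```python
import math

# Table 1 (Nantucket): lower bounds l_b, upper bounds u_b, and counts n_b.
# The top bin has no upper bound, so u_B is infinity.
LOWER = [0, 10_000, 15_000, 20_000, 25_000, 30_000, 35_000, 40_000,
         45_000, 50_000, 60_000, 75_000, 100_000, 125_000, 150_000, 200_000]
UPPER = LOWER[1:] + [math.inf]
COUNTS = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]

B = len(COUNTS)   # number of bins
T = sum(COUNTS)   # total number of cases

# Cumulative fraction of cases below each upper bound u_b.
cumulative = []
running = 0
for n_b in COUNTS:
    running += n_b
    cumulative.append(running / T)

for l_b, u_b, n_b, F_b in zip(LOWER, UPPER, COUNTS, cumulative):
    print(f"[{l_b:>7}, {u_b:>7})  n={n_b:>4}  cumulative={F_b:.0%}")
```

The printed cumulative percentages can be compared with the last column of Table 1.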

2.2 A midpoint method

The oldest and simplest way to analyze binned data is the midpoint method, which within each bin \(b = 1, \ldots, B\) assigns the \(n_b\) incomes to the bin midpoint \(m_b = (l_b + u_b)/2\) (Heitjan, 1989). Then statistics such as the Gini can be calculated by applying sample formulas to the midpoints \(m_b\) weighted by the counts \(n_b\).

For the top bin, which has no upper bound, we must define some pseudo-midpoint. The traditional choice is \(\mu_B = l_B \alpha/(\alpha - 1)\), which would define the arithmetic mean of the top bin if top-bin incomes followed a Pareto distribution with shape parameter \(\alpha > 1\) (Henson, 1967). The problem with this choice is that the arithmetic mean of a Pareto distribution is undefined if \(\alpha \le 1\) and grows arbitrarily large as \(\alpha\) approaches 1 from above.

A more robust choice is the harmonic mean \(h_B = l_B(1 + 1/\alpha)\), which is defined for all \(\alpha > 0\) (von Hippel et al., 2015). We use the harmonic mean in this article. We estimate \(\alpha\) by fitting a Pareto distribution to the top two bins and calculating the maximum likelihood estimate from the following formula (Quandt, 1966):

\[ \hat{\alpha} = \frac{\ln\bigl((n_{B-1} + n_B)/n_B\bigr)}{\ln(l_B/l_{B-1})} \tag{1} \]
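The robust midpoint calculation can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the rpme or binequality code: it assumes contiguous bins (each upper bound equals the next lower bound), uses the population form of the Gini formula with no small-sample correction, and all function names are hypothetical.

```python
import math

LOWER = [0, 10_000, 15_000, 20_000, 25_000, 30_000, 35_000, 40_000,
         45_000, 50_000, 60_000, 75_000, 100_000, 125_000, 150_000, 200_000]
COUNTS = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]

def pareto_alpha(lower, counts):
    """Pareto shape estimated from the counts and lower bounds of the top two bins."""
    n_top, n_next = counts[-1], counts[-2]
    return math.log((n_next + n_top) / n_top) / math.log(lower[-1] / lower[-2])

def midpoints(lower, counts):
    """Bin midpoints, with the harmonic-mean pseudo-midpoint in the top bin."""
    mids = [(lo + hi) / 2 for lo, hi in zip(lower[:-1], lower[1:])]
    alpha = pareto_alpha(lower, counts)
    mids.append(lower[-1] * (1 + 1 / alpha))
    return mids

def weighted_mean(values, counts):
    return sum(n * x for x, n in zip(values, counts)) / sum(counts)

def weighted_gini(values, counts):
    """Gini of a discrete distribution; values must be in increasing order.

    Uses G = 1 - sum_b p_b * (L_{b-1} + L_b), where p_b is the share of cases
    in bin b and L_b is the cumulative share of income through bin b.
    """
    total_n = sum(counts)
    total_income = sum(n * x for x, n in zip(values, counts))
    gini, cum_income = 1.0, 0.0
    for x, n in zip(values, counts):
        prev = cum_income
        cum_income += n * x
        gini -= (n / total_n) * (prev + cum_income) / total_income
    return gini

alpha_hat = pareto_alpha(LOWER, COUNTS)
m = midpoints(LOWER, COUNTS)
print(f"alpha = {alpha_hat:.3f}, top pseudo-midpoint = {m[-1]:,.0f}")
print(f"mean = {weighted_mean(m, COUNTS):,.0f}, Gini = {weighted_gini(m, COUNTS):.3f}")
```

On the Table 1 counts, this sketch yields a mean near $121,506 and a Gini near .464, in line with the midpoint (RPME) row of Table 2.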

If the survey provides the grand mean \(\mu\), then we need not assume that top-bin incomes follow a Pareto distribution. Instead, we can calculate

\[ \hat{\mu}_B = \frac{1}{n_B}\Bigl(T\mu - \sum_{b=1}^{B-1} n_b m_b\Bigr) \tag{2} \]

which would be the mean of the top bin if the means of the lower bins were the midpoints \(m_b\). Then \(\hat{\mu}_B\) can serve as a pseudo-midpoint for the top bin.
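As a worked example, the sketch below chooses the top-bin pseudo-midpoint so that the count-weighted mean of all midpoints equals a known grand mean, using the Nantucket counts from Table 1 and the published mean from Table 2. It is an illustration under our own assumptions (contiguous bins, hypothetical function name), not the rpme implementation.

```python
LOWER = [0, 10_000, 15_000, 20_000, 25_000, 30_000, 35_000, 40_000,
         45_000, 50_000, 60_000, 75_000, 100_000, 125_000, 150_000, 200_000]
COUNTS = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]
TRUE_MEAN = 137_811  # published mean for Nantucket (Table 2)

def mean_matched_midpoints(lower, counts, grand_mean):
    """Midpoints for the lower B-1 bins plus a mean-matching top-bin value."""
    mids = [(lo + hi) / 2 for lo, hi in zip(lower[:-1], lower[1:])]
    total = sum(counts)
    lower_income = sum(n * mid for n, mid in zip(counts[:-1], mids))
    # Income left over for the top bin, divided by the top-bin count n_B.
    mids.append((total * grand_mean - lower_income) / counts[-1])
    return mids

m = mean_matched_midpoints(LOWER, COUNTS, TRUE_MEAN)
implied_mean = sum(n * x for n, x in zip(COUNTS, m)) / sum(COUNTS)
print(f"top-bin pseudo-midpoint = {m[-1]:,.0f}")
print(f"implied grand mean      = {implied_mean:,.0f}")
```

By construction the implied grand mean equals the supplied mean; the resulting top-bin value can then be passed to any weighted Gini formula in place of the Pareto-based pseudo-midpoint.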

In 4 percent of US counties, it happens that \(\hat{\mu}_B\) is slightly less than \(l_B\). This is infelicitous, but \(\hat{\mu}_B\) can still be used, and the resulting Gini estimates are not necessarily bad. (It might be a little better to set \(\hat{\mu}_B = l_B\) and move the other midpoints slightly to the left, but this would affect only 4 percent of counties, and those only slightly since the affected counties typically have few cases in the top bin.)

The midpoint methods described in this section are implemented by the rpme command

for Stata (Duan and von Hippel, 2015), where rpme stands for “robust Pareto midpoint

estimator” (von Hippel and Powers, 2017). Except for mean-matching, the approach is

also implemented in the binequality package for R (Scarpino et al., 2017).

2.3 Fitting parametric distributions

The weakness of the midpoint method is that it treats income as discrete. An alternative is to model income as a continuous variable \(X\) that fits some parametric CDF \(F(X \mid \theta)\). Here \(\theta\) is a vector of parameters, which can be estimated by iteratively maximizing the log likelihood:

\[ \ell(\theta \mid X) = \ln \prod_{b=1}^{B} \bigl(P(l_b < X < u_b)\bigr)^{n_b} \tag{3} \]

where \(P(l_b < X < u_b) = F(u_b) - F(l_b)\) is the probability, according to the fitted distribution, that an income is in the bin \([l_b, u_b)\).¹
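The binned log likelihood is simple to evaluate once a CDF is available. The sketch below does so for a lognormal distribution, chosen here only because its CDF can be written with the standard library; the parameter values are arbitrary trial values, not fitted estimates, and the function names are our own. A real fit would pass this function to an iterative optimizer.

```python
import math

LOWER = [0, 10_000, 15_000, 20_000, 25_000, 30_000, 35_000, 40_000,
         45_000, 50_000, 60_000, 75_000, 100_000, 125_000, 150_000, 200_000]
UPPER = LOWER[1:] + [math.inf]
COUNTS = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]

def lognormal_cdf(x, mu, sigma):
    """CDF of a lognormal distribution, with F(0) = 0 and F(inf) = 1."""
    if x <= 0:
        return 0.0
    if math.isinf(x):
        return 1.0
    return 0.5 * (1 + math.erf((math.log(x) - mu) / (sigma * math.sqrt(2))))

def binned_loglik(cdf, lower, upper, counts):
    """Sum over bins of n_b * ln(F(u_b) - F(l_b)); empty bins contribute zero."""
    total = 0.0
    for l_b, u_b, n_b in zip(lower, upper, counts):
        if n_b > 0:
            total += n_b * math.log(cdf(u_b) - cdf(l_b))
    return total

# Evaluate the log likelihood at two arbitrary trial parameter values.
for mu, sigma in [(11.0, 1.0), (11.3, 0.9)]:
    ll = binned_loglik(lambda x: lognormal_cdf(x, mu, sigma), LOWER, UPPER, COUNTS)
    print(f"mu={mu}, sigma={sigma}: log likelihood = {ll:.1f}")
```

Because the top bin uses \(F(\infty) = 1\), the same expression handles bounded and unbounded bins without a special case.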

While any parametric distribution can be considered, in practice it is hard to ﬁt a distri-

bution unless the number of parameters is small compared to the number of well-populated

bins. Most investigators favor 2-, 3-, and 4-parameter distributions from the generalized

beta family (McDonald and Ransom, 2008), which includes the following 10 distributions:

the log normal, the log logistic, the Pareto (type 2), the gamma and generalized gamma,

the beta 2 and generalized beta (type 2), the Dagum, the Singh-Maddala, and the Weibull.

A priori it is hard to know which distribution, if any, will fit well. In fact, every distribution from the generalized beta family can often be rejected by the following goodness-of-fit likelihood ratio statistic (von Hippel et al., 2015):

\[ G^2 = -2\Bigl(\hat{\ell} - \sum_{b=1}^{B} n_b \ln(n_b/T)\Bigr) \tag{4} \]

Under the null hypothesis that the fitted distribution is the true distribution of income, \(G^2\) would follow a chi-square distribution with \(B^* - k\) degrees of freedom, where \(k\) is the number of fitted parameters, \(B^* = \min(B_{>0}, B - 1)\), and \(B_{>0}\) is the number of bins with nonzero counts. We reject the null hypothesis if \(G^2\) is extreme with respect to the null distribution, and in empirical data it is common to reject every distribution in the generalized beta family (von Hippel et al., 2015).

In addition, some distributions may fail to converge, or may converge on parameter values

that imply that the mean or variance of the distribution is undeﬁned (von Hippel et al.,

2015).

A solution is to ﬁt all 10 distributions, screen out any with undeﬁned moments, and

among those remaining select the best ﬁtting according to the Akaike or Bayes information

criterion:

\[ \mathrm{AIC} = 2k - 2\hat{\ell} \]

\[ \mathrm{BIC} = \ln(T)\,k - 2\hat{\ell} \]

where \(k\) is the number of parameters and \(\hat{\ell}\) is the maximized log likelihood.

¹ This formula works for the top bin \(B\); then \(u_B = \infty\) and \(F(u_B) = 1\). Some older articles add the following constant to the log likelihood: \(\ln(T!) - \sum_{b=1}^{B} \ln(n_b!)\). This is not wrong, but it is unnecessary since adding a constant does not change the parameter values at which the likelihood is maximized (Edwards, 1972).

Instead of selecting a single best-fitting distribution, one can average estimates across several candidate distributions weighted proportionately to \(\exp(-\mathrm{AIC}/2)\) or \(\exp(-\mathrm{BIC}/2)\), an approach known as model averaging. In general, model averaging produces better estimates than model selection (Burnham and Anderson, 2004), but when modeling binned incomes the advantage of model averaging is negligible (von Hippel et al., 2015).
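The difference between model selection and model averaging can be sketched as follows. The log likelihoods, parameter counts, Gini values, and sample size below are made-up numbers for illustration only, and the function names are our own; this is not the mgbe or binequality code.

```python
import math

def aic(loglik, k):
    return 2 * k - 2 * loglik

def bic(loglik, k, n_cases):
    return math.log(n_cases) * k - 2 * loglik

def averaging_weights(criteria):
    """Weights proportional to exp(-criterion / 2), normalized to sum to 1.

    Subtracting the minimum first avoids numerical underflow and leaves
    the normalized weights unchanged.
    """
    best = min(criteria)
    raw = [math.exp(-(c - best) / 2) for c in criteria]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical fits: (maximized log likelihood, number of parameters, Gini).
fits = {
    "Dagum": (-9000.0, 3, 0.452),
    "gamma": (-9006.0, 2, 0.441),
    "generalized gamma": (-9003.0, 3, 0.447),
}
n_cases = 453  # hypothetical number of sampled households

bics = [bic(ll, k, n_cases) for ll, k, _ in fits.values()]
weights = averaging_weights(bics)

selected = min(zip(bics, fits))[1]  # model selection: smallest BIC
averaged_gini = sum(w * g for w, (_, _, g) in zip(weights, fits.values()))

print("selected model:", selected)
print("weights:", [round(w, 3) for w in weights])
print(f"model-averaged Gini = {averaged_gini:.3f}")
```

When one candidate has a much smaller criterion value than the others, its weight approaches 1 and the averaged estimate approaches the selected model's estimate, which is one way to see why averaging may add little.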

Although it sounds broad-minded to ﬁt 10 diﬀerent distributions, there is limited di-

versity in the generalized beta family. All the distributions in the generalized beta family

are unimodal and skewed to the right. Some distributions are quite similar (e.g., Dagum

and generalized beta), and others rarely ﬁt well (e.g., log normal, log logistic, Pareto).

So in practice the range of viable and contrasting distributions in the generalized beta

family is small; you can ﬁt just 3 well-chosen distributions (e.g., Dagum, gamma, and gen-

eralized gamma) and get estimates almost as good as those obtained from ﬁtting all 10

(von Hippel et al., 2015).

Some income statistics are functions of the distributional parameters θ, but the functions

for some statistics are unknown or hard to calculate. As a general solution, it is easier to

calculate income statistics by numeric integration.

When the grand mean is available, the parameters could in theory be constrained to

match the grand mean as well as approximate the bin counts. This would be diﬃcult,

though, since for distributions in the generalized beta family the mean is a complicated

nonlinear function of the parameters. We have not attempted to constrain our parametric

distributions to match a known mean.

The methods discussed in this section are implemented in the binequality package for

R (Scarpino et al., 2017) and the mgbe command for Stata (Duan and von Hippel, 2015).

Here mgbe stands for “multi-model generalized beta estimator” (von Hippel et al., 2015).

2.4 Interpolated CDFs

Since parametric distributions may ﬁt poorly, an alternative is to deﬁne a ﬂexible nonpara-

metric curve that ﬁts the bin counts exactly.

As illustrated in the last column of Table 1, the binned data define \(B\) discrete points on the empirical cumulative distribution function (CDF), i.e., \((u_b, \hat{F}(u_b))\), \(b = 1, \ldots, B - 1\), where \(\hat{F}(u_b) = (n_1 + n_2 + \cdots + n_b)/T\) is the fraction of incomes less than \(u_b\), and \(F(0) = 0\).²

To estimate a continuous CDF \(\hat{F}(x)\), \(x > 0\), we just “connect the dots.” That is, we define a continuous nondecreasing function that interpolates between the \(B\) points of the empirical CDF.

The estimated probability density function (PDF) is just the derivative of the interpolated CDF \(\hat{F}(x)\). Note that the estimated PDF “preserves areas” (Morandi and Costantini, 1989), i.e., preserves bin counts, in the sense that the model probability that an income will fall in each bin, \(\hat{P}(l_b < x < u_b) = \hat{F}(u_b) - \hat{F}(l_b)\), is equal to the observed fraction of incomes that are in that bin, \(n_b/T\).
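The area-preserving property can be checked directly with a linear interpolation of the empirical CDF. The sketch below is a minimal illustration, not the binsmooth code: it assumes, for simplicity, that the open top bin is closed at twice its lower bound, and the function names are hypothetical.

```python
import bisect

LOWER = [0, 10_000, 15_000, 20_000, 25_000, 30_000, 35_000, 40_000,
         45_000, 50_000, 60_000, 75_000, 100_000, 125_000, 150_000, 200_000]
COUNTS = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]

def cdf_knots(lower, counts, top_upper):
    """Return bin edges and cumulative fractions: the 'dots' to be connected."""
    total = sum(counts)
    xs = list(lower) + [top_upper]
    ys, running = [0.0], 0
    for n_b in counts:
        running += n_b
        ys.append(running / total)
    return xs, ys

def linear_cdf(x, xs, ys):
    """Polygonal CDF: straight line segments between consecutive knots."""
    if x <= xs[0]:
        return 0.0
    if x >= xs[-1]:
        return 1.0
    i = bisect.bisect_right(xs, x) - 1  # xs[i] <= x < xs[i + 1]
    frac = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + frac * (ys[i + 1] - ys[i])

# Assumption: temporary finite upper bound for the top bin.
xs, ys = cdf_knots(LOWER, COUNTS, top_upper=2 * LOWER[-1])

# The interpolated CDF reproduces every bin's observed fraction n_b / T.
for l_b, u_b, n_b in zip(xs[:-1], xs[1:], COUNTS):
    model_prob = linear_cdf(u_b, xs, ys) - linear_cdf(l_b, xs, ys)
    assert abs(model_prob - n_b / sum(COUNTS)) < 1e-12

print(f"F(62,500) = {linear_cdf(62_500, xs, ys):.3f}")
```

Any other nondecreasing interpolant through the same knots, such as a monotone cubic spline, would pass the same check, because the check depends only on the CDF values at the bin edges.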

The shape of the PDF depends on the function that interpolates the CDF.

• If the CDF is interpolated by line segments, then the CDF is polygonal, and the PDF is a step function that is discontinuous at the bin boundaries.

• If the CDF is interpolated by a monotone cubic spline, then the CDF is smooth, and the PDF is piecewise quadratic, i.e., continuous at the bin boundaries and quadratic between them.

There remains a question of how to shape the CDF in the top bin, which typically has

no upper bound.

If the grand mean of income is known, then the shape of the CDF in the top bin can

be constrained to match the known mean. The CDF of the top bin can be rectangular,

exponential, or Pareto.³

In 4 percent of US counties we cannot reproduce the known mean simply by constraining the tail, because the known mean is already less than the mean of the lower \(B - 1\) bins without the tail. In that case, we make an ad hoc adjustment by shrinking the bin boundaries toward the origin, that is, by replacing \((l_b, u_b)\) with \((s l_b, s u_b)\), where the shrinkage factor \(s < 1\) is chosen so that a small tail can be added, as described above, to reproduce the grand mean. The shrinkage factor is rarely less than .995.

² If the binned incomes come from a sample (as they do in Table 1), then the estimate \(\hat{F}(u_b)\) may differ from the true CDF \(F(u_b)\) because of sampling error.

³ In our implementation the exponential and Pareto tails are approximated by a sequence of rectangles of decreasing heights.

If the mean income is not known, we use an ad hoc estimate. We obtain that estimate by temporarily setting the upper bound of the top bin to \(u_B = 2 l_B\) and calculating the mean of a step PDF fit to all \(B\) bins. Then we unbound the top bin and proceed as though the mean were known.
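The ad hoc mean is easy to compute by hand, because under a step PDF incomes are uniform within each bin and each bin's mean is its midpoint. The following sketch, with a hypothetical function name and assuming contiguous bins, applies this to the Table 1 counts.

```python
LOWER = [0, 10_000, 15_000, 20_000, 25_000, 30_000, 35_000, 40_000,
         45_000, 50_000, 60_000, 75_000, 100_000, 125_000, 150_000, 200_000]
COUNTS = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]

def adhoc_mean(lower, counts):
    """Mean of a step PDF after temporarily closing the top bin at 2 * l_B."""
    edges = list(lower) + [2 * lower[-1]]
    total = sum(counts)
    # Uniform density within each bin, so each bin's mean is its midpoint.
    income = sum(n * (lo + hi) / 2
                 for n, lo, hi in zip(counts, edges[:-1], edges[1:]))
    return income / total

print(f"ad hoc mean = {adhoc_mean(LOWER, COUNTS):,.0f}")
```

For Nantucket this gives about $110,419, the mean shown for the interpolated CDF rows of Table 2 when the true mean is not used.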

Income statistics are estimated by applying numerical integration to functions of the

ﬁtted PDF or CDF.
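One way to carry out such an integration for a polygonal CDF is sketched below. It uses two standard identities for a nonnegative income: the mean is the integral of \(1 - F(x)\), and the Gini is the integral of \(F(x)(1 - F(x))\) divided by the mean. The sketch is our own illustration, not the binsmooth code, and it assumes a rectangular top bin closed at \(2 l_B\), so its Gini need not reproduce any particular row of Table 2.

```python
LOWER = [0, 10_000, 15_000, 20_000, 25_000, 30_000, 35_000, 40_000,
         45_000, 50_000, 60_000, 75_000, 100_000, 125_000, 150_000, 200_000]
COUNTS = [165, 109, 67, 147, 114, 91, 148, 44,
          121, 159, 358, 625, 338, 416, 200, 521]

def simpson(f, a, b):
    """Simpson's rule on one interval; exact for polynomials up to degree 3."""
    return (b - a) / 6 * (f(a) + 4 * f((a + b) / 2) + f(b))

def mean_and_gini(xs, ys):
    """Mean and Gini of a nonnegative income whose CDF is polygonal.

    xs are the bin edges and ys the CDF values at those edges.
    """
    mean = float(xs[0])  # F is zero below the first edge
    numerator = 0.0
    for x0, x1, y0, y1 in zip(xs[:-1], xs[1:], ys[:-1], ys[1:]):
        F = lambda x: y0 + (y1 - y0) * (x - x0) / (x1 - x0)
        mean += simpson(lambda x: 1 - F(x), x0, x1)
        numerator += simpson(lambda x: F(x) * (1 - F(x)), x0, x1)
    return mean, numerator / mean

# Assumption: close the top bin at 2 * l_B so the polygonal CDF is bounded.
xs = LOWER + [2 * LOWER[-1]]
ys = [0.0]
for n_b in COUNTS:
    ys.append(ys[-1] + n_b / sum(COUNTS))

mean, gini = mean_and_gini(xs, ys)
print(f"mean = {mean:,.0f}, Gini = {gini:.3f}")
```

Because the integrands are at most quadratic on each segment of a polygonal CDF, Simpson's rule is exact here; a spline CDF would need a finer grid or an adaptive rule.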

The methods in this section are implemented by our binsmooth package for R (Hunter and Drown,

2016). Within the binsmooth package, the stepbins function implements a step-function

PDF (and polygonal CDF), while the splinebins function implements a cubic spline CDF

(and piecewise quadratic PDF).

2.5 Recursive subdivision

Another way to obtain a smooth PDF estimate that preserves bin areas is to subdivide

the bins into smaller bins, and then adjust the heights of the subdivided bins to shorten

the jumps at the bin boundaries. This method, recursive subdivision, is implemented by

the recbins function in the binsmooth package for R (Hunter and Drown, 2016). Recursive

subdivision is slower and more computationally intensive than CDF interpolation and yields

PDF estimates that are practically identical. We discuss the details of recursive subdivision

in the Appendix.

3 Data and Results

Between 2006 and 2010, the American Community Survey (ACS) took a 1-in-8 sample of US

households (Census Bureau, 2014). Household incomes were inﬂated to 2010 dollars and

summarized in binned income tables for each of the 3,221 US counties. The published bin

counts are estimates of the population counts. We can approximate the sample counts by

dividing the population counts by 8. Dividing counts by a constant makes no difference to any of the statistics, except for the BIC and \(G^2\) statistics that are used when fitting parametric distributions.

The Census also published means and Gini coeﬃcients for each county (Bee, 2012).

These statistics were estimated from exact incomes before binning, and so are more accu-

rate than any estimate that could be calculated from the binned data. They are sample

estimates which may diﬀer from population values, but they remain a useful standard of

comparison for our binned-data estimates.

3.1 Results for Nantucket

Table 1 summarized the binned incomes for Nantucket County. Figure 1 ﬁts several distri-

butions to Nantucket. The purple curve is the Dagum distribution, which ﬁts Nantucket

better than alternatives in the generalized beta family. Yet the Dagum distribution fails the \(G^2\) goodness-of-fit test, and visually the Dagum distribution fits the bin counts poorly.

Its ﬁt is reasonable for incomes less than $70,000, but between $70,000 and $150,000 it

underestimates the density, and above $150,000 it overestimates the density.

Figure 1: Different PDFs for the Nantucket data. The Dagum distribution is drawn in purple, the quadratic spline in blue, and the step function in black. Gray spikes illustrate the midpoint method. [Figure not reproduced; the horizontal axis runs from 0 to 500,000.]

The midpoint method is illustrated by gray spikes at the bin midpoints, whose heights are proportional to the bin counts. The black step function is the PDF implied by a linear interpolation of the CDF, and the blue curve is the piecewise quadratic PDF implied by a cubic spline interpolation of the CDF. Both fit the bin counts perfectly; in fact, their jagged appearance suggests they may overfit the data, a possibility that we will return to later.

The step PDF looks slightly less volatile than the piecewise quadratic PDF, suggesting

that the step PDF may be less overﬁt.

Except for the Dagum distribution, all the methods in Figure 1 are calibrated to repro-

duce the grand mean.

Table 2 summarizes the Nantucket estimates. The true mean is $137,811 and the true

Gini is .547. All the methods yield underestimates. When ﬁt without knowledge of the

true mean, every method underestimates the mean by 12 to 20 percent and the Gini by

15 to 21 percent. Remarkably, the simple midpoint method comes closer than its more

sophisticated competitors.

Table 2: Estimates for Nantucket

Condition            Estimate                    Mean        Gini
True                                             $137,811    .547
Without true mean    Midpoint (RPME)             $121,506    .464
Without true mean    Parametric (MGBE)           $112,960    .453
Without true mean    Histospline (linear CDF)    $110,419    .438
Without true mean    Histospline (cubic CDF)     $110,419    .433
Matching true mean   Midpoint                    (matched)   .510
Matching true mean   Histospline (linear CDF)    (matched)   .537
Matching true mean   Histospline (cubic CDF)     (matched)   .525

When given the true mean, the methods do much better. They still underestimate the

Gini, but only by 2 to 7 percent. The closest estimate is obtained by linear interpolation

of the CDF. A smoother cubic spline interpolation does a little worse, but still better than

the midpoint method.

Although the estimates for Nantucket are less accurate than the estimates for most

other counties, the relative performance of diﬀerent methods in Nantucket is typical of

what we will see elsewhere.


3.2 Results for all US counties

Figure 2 evaluates all county Gini estimates graphically by plotting the estimated Gini \(\hat{\theta}_j\) of each county against the published Gini \(\theta_j\). In the bottom row, where the methods are constrained to match the published mean, the Gini estimates are close to a diagonal reference line (\(\hat{\theta}_j = \theta_j\)) indicating nearly perfect estimation. In the top row, where the methods do not match the mean, the estimates are more scattered, indicating lower accuracy. The parametric estimates are about as good as the interpolated CDF estimates. The midpoint estimates are noticeably worse when the mean is unknown, but very accurate when constrained to match a known mean.

Figure 2: Accuracy of Gini coefficients estimated by different methods, with and without mean-matching. [Figure not reproduced: scatterplots of estimates against true Ginis from 3,221 US counties, with both axes running from .2 to .7. Top row, without mean matching: midpoint estimates, parametric estimates, interpolated CDF (linear), interpolated CDF (cubic). Bottom row, with mean matching: midpoint (rpme), interpolated CDF (linear, stepbins), interpolated CDF (cubic, splinebins).]

We can summarize the accuracy of Gini estimates in several ways. For a single county \(j\), the percent estimation error is \(e_j = 100 \times (\hat{\theta}_j - \theta_j)/\theta_j\). Then across all counties, the percent relative bias is the mean of \(e_j\), the percent root mean squared error (RMSE) is the square root of the mean of \(e_j^2\), and the reliability is the squared correlation between \(\theta_j\) and \(\hat{\theta}_j\).
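These three summaries can be computed as in the following sketch. The function name is our own, and the county Ginis in the example are made-up values for illustration, not data from the paper.

```python
import math

def accuracy(estimates, truths):
    """Percent relative bias, percent RMSE, and reliability (squared correlation)."""
    errors = [100 * (est - true) / true for est, true in zip(estimates, truths)]
    n = len(errors)
    bias = sum(errors) / n
    rmse = math.sqrt(sum(err ** 2 for err in errors) / n)

    mean_est = sum(estimates) / n
    mean_true = sum(truths) / n
    cov = sum((e - mean_est) * (t - mean_true) for e, t in zip(estimates, truths))
    var_est = sum((e - mean_est) ** 2 for e in estimates)
    var_true = sum((t - mean_true) ** 2 for t in truths)
    reliability = cov ** 2 / (var_est * var_true)
    return bias, rmse, reliability

# Hypothetical published and estimated Ginis for five counties.
true_gini = [0.40, 0.43, 0.45, 0.48, 0.55]
est_gini = [0.39, 0.43, 0.44, 0.46, 0.52]

bias, rmse, reliability = accuracy(est_gini, true_gini)
print(f"bias = {bias:.1f}%, RMSE = {rmse:.1f}%, reliability = {reliability:.0%}")
```

Note that reliability can be high even when bias is large, since a squared correlation is unaffected by a constant proportional error; this is why the three summaries are reported together.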

Table 3 summarizes our ﬁndings. If the estimators ignore the published county means,

then estimated Ginis have biases between 0% and -3%, RMSEs between 3% and 4%, and

reliabilities between 82% and 89%. The interpolated CDF estimates have the best bias,

the best RMSE, and the second best reliability, and they are just as good with linear

interpolation as with cubic spline interpolation. The parametric estimates have the best

reliability, but the worst bias, the worst RMSE, and by far the worst runtime, at 4.5 hours.

Table 3: Speed and accuracy of Gini estimates for 3,221 US counties.

Condition            Estimator                          Bias   RMSE   Reliability   Runtime
Without true mean    Midpoint                           -2%    4%     82%           4 sec
Without true mean    Parametric                         -3%    4%     89%           4.5 hr
Without true mean    CDF interpolation (linear)         0%     3%     88%           40 sec
Without true mean    CDF interpolation (cubic spline)   -1%    3%     88%           2 min
Matching true mean   Midpoint                           -1%    2%     98%           7 sec
Matching true mean   CDF interpolation (linear)         0%     1%     99%           36 sec
Matching true mean   CDF interpolation (cubic spline)   -1%    1%     99%           2 min

Note. RMSE = root mean squared error. Runtimes on an Intel i7 Core processor with a speed of 3.6 GHz.

When the methods are constrained to match the published county means, the estimates

improve dramatically. The bias shrinks to 0-1%, the RMSE shrinks to 1-2%, and the

reliability grows to 98-99%. The midpoint estimates are excellent, and the interpolated

CDF estimates even better, and just as good with linear interpolation as with cubic spline

interpolation.

The diﬀerences among the methods are much smaller than the improvement that comes

from constraining any method to match the mean. Of course, this observation is only helpful

when the mean is known.


4 Conclusion

CDF interpolation produces estimates that are at least a little better than midpoint or

parametric estimates, whether the true mean is known or not. And CDF interpolation

runs much faster than parametric estimation, though not as fast as midpoint estimation.

We initially suspected that cubic spline interpolation would improve on simple linear

interpolation, but empirically this turns out to be false. In estimating county Ginis, linear

CDF interpolation was at least as accurate as cubic spline interpolation.

The accuracy of linear CDF interpolation is remarkable, since it implies a step function

for the PDF. Step PDFs seem clearly unrealistic, especially in the top and bottom bins

where the step function is ﬂat while the true distribution likely has an upward or downward

slope (Cloutier, 1995). Our implementation of the step PDF permits a downward Pareto

or exponential slope in the top bin, but this makes little diﬀerence to the Gini estimate.

It would be straightforward to also permit an upward slope in the bottom bin (Jargowsky,

1996; Cloutier, 1995), but perhaps this would make little diﬀerence, either. After all, the

cubic spline CDF yields a bottom bin that slopes upward, yet its Gini estimates are no

better.

The diﬀerences in accuracy among the methods are small, and they are dwarfed by

the improvement in accuracy that comes from knowing the grand mean. By constraining

binned-data methods to match a known mean, we can get county Gini estimates

that are typically within 1-2% of the estimates we would get if the data were not binned.

Our binsmooth package for R can constrain interpolated CDFs to match a known mean,

and our rpme command for Stata can constrain the top-bin midpoint to match a known

mean as well. We have not constrained our parametric distributions to match a known

mean, and we believe it would be diﬃcult.

While the mean-constrained estimates are very accurate, there may be room for im-

provement when the mean is unknown. Perhaps the most promising idea for improvement

is smoothing. As we noticed in Figure 1, interpolated CDFs can be a bit

jagged and may “overﬁt” the sample in the sense that they ﬁnd nooks and crannies that

might not appear in another sample or in the population. Likewise interpolated CDFs may

be overﬁt to a speciﬁc set of bin boundaries. If the ﬁtted CDF were a little smoother and


did not quite preserve the counts of the least populous bins, it might ﬁt the population and

other samples (perhaps with diﬀerent bin boundaries) a little better.

References

Bee, A. (2012, nov). Multimodel inference. Technical Report 2.

Burnham, K. P. and D. R. Anderson (2004, nov). Multimodel inference. Sociological

Methods & Research 33 (2), 261–304.

Cloutier, N. R. (1995). Lognormal extrapolation and income estimation for poor black

families. Journal of Regional Science 35 (1), 165–171.

Duan, Y. and P. T. von Hippel (2015). mgbe – Multimodel Generalized Beta Estimator.

Stata command version 1.0.

Edwards, A. W. F. (1972). Likelihood. Johns Hopkins University Press.

Heitjan, D. F. (1989, may). [inference from grouped continuous data: A review]: Rejoinder.

Statistical Science 4 (2), 182–183.

Henson, M. (1967). Trends in the income of families and persons in the United States,

1947-1964. U. S. Dept. of Commerce, Bureau of the Census.

Hunter, D. J. and M. Drown (2016). binsmooth: Generate PDFs and CDFs from Binned

Data. R package version 0.1.0.

Jargowsky, P. A. (1996, dec). Take the money and run: Economic segregation in U.S.

metropolitan areas. American Sociological Review 61 (6), 984.

Kooperberg, C. and C. J. Stone (1992, dec). Logspline density estimation for censored

data. Journal of Computational and Graphical Statistics 1 (4), 301.

McDonald, J. B. and M. Ransom (2008). The generalized beta distribution as a model for

the distribution of income: Estimation of related measures of inequality. In Modeling

Income Distributions and Lorenz Curves, pp. 147–166. Springer New York.


Minoiu, C. and S. G. Reddy (2012, apr). Kernel density estimation on grouped data: the

case of poverty assessment. The Journal of Economic Inequality 12 (2), 163–189.

Morandi, R. and P. Costantini (1989, mar). Piecewise monotone quadratic histosplines.

SIAM Journal on Scientiﬁc and Statistical Computing 10 (2), 397–406.

Quandt, R. E. (1966, dec). Old and new methods of estimation and the Pareto distribution.

Metrika 10 (1), 55–82.

Scarpino, S. V., P. T. von Hippel, and I. Holas (2017). binequality: Methods for Analyzing

Binned Income Data. R package version 1.02.

United States Census Bureau (2014). American FactFinder. Retrieved from

http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml on 6/16/2014.

von Hippel, P. T., I. Holas, and S. V. Scarpino (2012). Estimation with binned data.

von Hippel, P. T. and D. A. Powers (2017). rpme – Robust Pareto midpoint estimator.

Stata command version 2.0.

von Hippel, P. T., S. V. Scarpino, and I. Holas (2015). Robust estimation of inequality

from binned incomes. Sociological Methodology.

Wahba, G. (1976). Histosplines with knots which are order statistics. Journal of the Royal

Statistical Society. Series B (Methodological) 38 (2), 140–151.

Wang, B. (2015). bda: Density Estimation for Grouped Data. R package version 5.1.6.


A Appendix: PDF smoothing by recursive subdivision

Recursive subdivision is another way to smooth the ﬁtted PDF. It is a little more com-

putationally intensive than ﬁtting a spline to the CDF and produces very similar results.

Recursive subdivision is implemented by the recbins function in our binsmooth package for

R.

A slight change of notation will be helpful. Since the upper bound \(u_b\) of each bin is equal to the lower bound \(l_{b+1}\) of the next, we can think of the bins as having a set of “edges” \(e_0, e_1, \ldots, e_B\), where \(e_0 = 0\), and the other \(e_b = u_b\) are the upper bounds of bins \(1, \ldots, B\).

Start by fitting a step PDF. Let \(h_b\) be the height of the step PDF in the bin \([e_b, e_{b+1})\). Given parameters \(\varepsilon_1 \in (0, 0.5)\) and \(\varepsilon_2 \in (0, 1)\), the subdivision process begins by introducing new bin edges \(l\) and \(r\) between \(e_b\) and \(e_{b+1}\) such that \((l + r)/2 = (e_b + e_{b+1})/2\) and \(r - l = (e_{b+1} - e_b)\varepsilon_2\). The height of the new bin on the left with edges \(e_b\) and \(l\) is then shifted vertically by \((h_{b-1} - h_b)\varepsilon_1\), while the height of the new bin on the right with edges \(r\) and \(e_{b+1}\) is shifted vertically by \((h_{b+1} - h_b)\varepsilon_1\). Finally, the height of the new middle bin with edges \(l\) and \(r\) is shifted vertically so that the area of the three new bins equals the area of the original bin.⁴ See Figure 3.

In order for the above formulas to apply to the top and bottom bins, we create pseudo-

bins above and below them with heights of zero. This ensures that the subdivided PDF

will tend toward a height of zero at the lower edge of the bottom bin and the upper edge

of the top bin.

The smoothed PDF is obtained from the step PDF by applying the subdivision process

to each bin, then applying the process again to each subdivided bin, and so on, until the

desired level of smoothness is reached. In practice, three rounds of subdivision are sufficient to produce a reasonably smooth PDF, and we found that choosing \(\varepsilon_1 = 0.25\) and \(\varepsilon_2 = 0.75\) produced nicely smoothed PDFs from most empirical data sets. Figure 4 shows the result of recursive subdivision in Nantucket.
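The subdivision process can be sketched as follows. This is our own reading of the description above, not the binsmooth source: we assume that each side bin's height is the original height plus the stated shift, that all bins in a round are subdivided using the heights from the start of that round, and that a bin is left undivided if its middle bin would have negative height. The function names and the toy step PDF are hypothetical.

```python
def subdivide(edges, heights, eps1=0.25, eps2=0.75):
    """One round of subdivision: each bin becomes three bins with the same total area."""
    padded = [0.0] + list(heights) + [0.0]  # pseudo-bins of height zero
    new_edges, new_heights = [edges[0]], []
    for b, h in enumerate(heights):
        e_lo, e_hi = edges[b], edges[b + 1]
        h_prev, h_next = padded[b], padded[b + 2]
        width = e_hi - e_lo
        mid_width = width * eps2
        side_width = (width - mid_width) / 2
        h_left = h + (h_prev - h) * eps1    # pulled toward the left neighbor
        h_right = h + (h_next - h) * eps1   # pulled toward the right neighbor
        # Middle height chosen so the three areas sum to the original area.
        h_mid = (h * width - side_width * (h_left + h_right)) / mid_width
        if h_mid < 0:
            # Leave this bin undivided rather than create a negative height.
            new_edges.append(e_hi)
            new_heights.append(h)
            continue
        new_edges += [e_lo + side_width, e_hi - side_width, e_hi]
        new_heights += [h_left, h_mid, h_right]
    return new_edges, new_heights

def area(edges, heights):
    return sum(h * (hi - lo) for h, lo, hi in zip(heights, edges[:-1], edges[1:]))

# Toy step PDF with three bins of unequal width and total area 1.
edges, heights = [0.0, 1.0, 2.0, 4.0], [0.2, 0.5, 0.15]
for _ in range(3):  # three rounds of subdivision
    edges, heights = subdivide(edges, heights)

print(f"{len(heights)} bins after three rounds, total area = {area(edges, heights):.6f}")
```

Because every subdivided bin keeps its own area, the total area and the area within each original bin are unchanged after any number of rounds.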

Unfortunately, if the original step PDF was constrained to match a known mean, the subdivision process may cause the mean to deviate slightly. But the estimated Gini typically remains quite accurate.

⁴ We do not divide the bin in those rare cases where this subdivision algorithm yields a middle bin with negative height.

Figure 3: Bin subdivision. Each original bin is replaced by three new bins (bold, dashed) such that the bin area is preserved. [Figure not reproduced; its labels mark the edges \(e_b\), \(l\), \(r\), and \(e_{b+1}\), the middle-bin width \((e_{b+1} - e_b)\varepsilon_2\), and the height shifts \(|(h_{b-1} - h_b)\varepsilon_1|\) and \((h_{b+1} - h_b)\varepsilon_1\).]


Figure 4: Recursively subdivided step function for the Nantucket data set. The original

step function is shown in the background. Notice that the smoothing process preserves the

area in each bin.
