
Models of Income Distributions for Knowledge Discovery

Abstract

Descriptions of income distributions using a single distribution, such as the Lognormal or Gamma, often describe the tails of the distribution poorly [1]. This led to separate models for the upper and lower parts of income distributions [2]. For example, [3-5] describe the high-income region with Pareto power laws. Other authors model the low-to-medium income region using Exponential [6], Lognormal [7] or Gamma distributions [8, 9]. The high-income range is often modeled using the cumulative distribution function (cdf) [10], whereas the low-to-medium income regions are modeled using the probability density function (pdf) [11]. Usually no systematic limits between low, medium and high income are defined [3]. A goal for a valid and suitable model of income distributions is to derive a theory of the mechanisms which operate in a society (Computational Social Science) and which explains the observed distribution [12]. Here a model for income distributions as a mixture of components is proposed. The model is derived using the Pareto Density Estimation (PDE) [14] for an estimation of the pdf. PDE has been designed in particular to identify groups/classes in a dataset [13]. Precise limits for the classes can be calculated using Bayes' theorem. Our model suggests that there are different groups/classes in a society, which contribute to the total distribution of income in their own way. The approach is demonstrated on several real-world data sets, including actual income data from Germany.

1. Dagum, C., New Model of personal Income-Distribution-Specification and Estimation, Economie appliquée, 30(3), pp 413-437, 1977.
2. Richmond, P., et al., A Review of empirical Studies and Models of Income Distributions in Society. Wiley, Berlin, 2006.
3. Chatterjee, A., B.K. Chakrabarti, and R.B. Stinchcombe, Master Equation for a kinetic Model of a Trading Market and its analytic Solution. Phys Rev E Stat Nonlin Soft Matter Phys. 72(2 Pt 2), p. 026126, 2005.
4. Levy, M. and S. Solomon, New Evidence for the power-law Distribution of Wealth, Physica A: Statistical Mechanics and its Applications, 242(1), pp 90-94, 1997.
5. Drăgulescu, A. and V.M. Yakovenko, Exponential and power-law Probability Distributions of Wealth and Income in the United Kingdom and the United States. Physica A: Statistical Mechanics and its Applications, 299(1), pp 213-221, 2001.
6. Chakrabarti, A.S. and B.K. Chakrabarti, Statistical theories of income and wealth distribution. Economics: The Open-Access, Open-Assessment E-Journal. 4, p. 4, 2010.
7. Clementi, F. and M. Gallegati, Pareto’s law of Income Distribution: Evidence for Germany, the United Kingdom, and the United States, in Econophysics of wealth distributions, Springer, New York, pp 3-14, 2005.
8. Ferrero, J.C., The statistical Distribution of Money and the Rate of Money Transference. Physica A: Statistical Mechanics and its Applications, 341, pp 575-585, 2004.
9. Scafetta, N., S. Picozzi, and B.J. West, An out-of-equilibrium Model of the Distributions of Wealth. Quantitative Finance. 4(3), pp 353-364, 2004.
10. Di Matteo, T., T. Aste, and S. Hyde, Exchanges in complex Networks: Income and Wealth Distributions, arXiv preprint, 2003.
11. Drăgulescu, A. and V.M. Yakovenko, Evidence for the exponential Distribution of Income in the USA. The European Physical Journal B-Condensed Matter and Complex Systems, 20(4), pp 585-589, 2001.
12. Cioffi-Revilla, C., Introduction to Computational Social Science: Principles and Applications. Springer, Berlin, 2013.
13. Ultsch, A., Pareto Density Estimation: A Density Estimation for Knowledge Discovery, in Innovations in classification, data science, and information systems, Springer, New York, pp 91-100, 2005.
Models of Income Distributions for Knowledge Discovery
M. Thrun, Prof. Dr. Ultsch
Thrun - Databionics
University of Marburg
Income Distributions
1. Always positively skewed, with a single mode and a long tail [Kakwani 1980, p.14]
2. Properties of income are defined by various distributions
Models often separate the upper from the lower part
i. No systematic limit between low and high income
ii. Different methods for low and high income
Examples for Models of Income I
Low income: various distributions, e.g.
Approximation of probability density function (pdf)
Log-Normal distribution [Clementi and Gallegati, 2005]
Exponential distribution [Chakrabarti, 2006]
Gamma distribution [Ferrero 2004, Scafetta 2004]
Boltzmann-Gibbs distribution [Drăgulescu/Yakovenko 2001]
Examples for Models of Income II
High income: Pareto distribution
Approximation of cumulative distribution function (cdf)
Pareto Power Law distribution [Chatterjee et al. 2005, Levy and Solomon 1997]
Covers about 40% of income [Kakwani 1980, p.20]
Dataset
Gross annual income of the German population in 2001
The Campus-File of the income tax statistics 2001 for public use [EVAS 73111] provides a detailed overview and discloses a 1% sample
The income dataset is preprocessed:
Through the anonymization process, incomes higher than 500 000 were oversampled
=> we downsampled them to 1%
From now on: Income := gross annual income in Germany
Two examples are presented next...
Low Income with pdf
BLACK: histogram of Income
RED: Boltzmann-Gibbs distribution, $p(x) = \frac{1}{T} e^{-x/T}$
Regression and fit over the range 0-126000 Euro
Figure: pdf estimation of Income
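The low-income fit can be illustrated with a minimal sketch, assuming an exponential (Boltzmann-Gibbs) density p(x) = (1/T) e^(-x/T) with a single scale parameter T. The function names, the synthetic data and the use of the maximum-likelihood estimate (the sample mean) are assumptions made for illustration; the slide itself fits by regression on a histogram, which is not reproduced here.

```python
import math
import random

def fit_exponential_scale(incomes):
    """Maximum-likelihood estimate of T for p(x) = (1/T) * exp(-x/T)."""
    if not incomes:
        raise ValueError("need at least one observation")
    return sum(incomes) / len(incomes)

def exponential_pdf(x, scale):
    """Density of the exponential distribution with scale T, zero for x < 0."""
    return math.exp(-x / scale) / scale if x >= 0 else 0.0

# Synthetic stand-in for incomes in the fitted range (not the German data).
rng = random.Random(0)
true_scale = 30000.0
sample = [rng.expovariate(1.0 / true_scale) for _ in range(20000)]

estimated_scale = fit_exponential_scale(sample)
print(f"estimated T: {estimated_scale:.0f} Euro")
```

A fit of this kind only makes sense on a bounded range; where that range should end is exactly the open question raised on the following slides.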
High Income modeled with CCDF
BLUE: follows a power law with $\mathrm{CCDF}(x) = c\,x^{-\alpha}$
RED: linear regression of data with $\log \mathrm{CCDF}(x) = \log c - \alpha \log x$
BLACK: Income
Log/log plot of $\mathrm{CCDF}(x) = 1 - \mathrm{CDF}(x)$
Regression begins somewhere at 10.25000 Euro
Imprecise fit for >10.500000 Euro
Figure: Complementary Cumulative Distribution Function (CCDF)
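The log/log regression of the CCDF can be sketched as follows. This is a hedged illustration on synthetic Pareto data: the helper names, the exponent `alpha_true` and the lower bound `x_min` are assumptions, and an ordinary least-squares line over all points stands in for whatever regression range was used on the slide.

```python
import math
import random

def empirical_ccdf(values):
    """Return the sorted values and the empirical CCDF P(X >= x) at each one."""
    xs = sorted(values)
    n = len(xs)
    return xs, [(n - rank) / n for rank in range(n)]

def fit_loglog_line(xs, ccdf):
    """Least-squares line log10(ccdf) = intercept + slope * log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(p) for p in ccdf]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Synthetic high incomes following a Pareto law (not the German data).
rng = random.Random(1)
alpha_true = 2.0
x_min = 100000.0
sample = [x_min * rng.paretovariate(alpha_true) for _ in range(20000)]

xs, ccdf = empirical_ccdf(sample)
slope, intercept = fit_loglog_line(xs, ccdf)
alpha_hat = -slope  # a power law CCDF(x) = c * x**(-alpha) has slope -alpha
print(f"estimated exponent: {alpha_hat:.2f}")
```

On real data the result depends strongly on where the regression starts, which is the weakness the next slide points out.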
Problems
Limit for the range of low income unclear
The right choice of the number and width of bins is critical for the fit of the pdf
Kernel density estimation with fixed radius
No clear starting point for high income
Log-linear approximation imprecise for very high income
=> No systematic limit between low and high income
New Approach
i. Data are logarithmically transformed
BoxCox λ = 0.2 with < 0.01 [Asar et al. 2015]
ii. Pdf through Pareto density estimation (PDE) [Ultsch 2005]
iii. Mixture of Gaussians with Toolbox
iv. Visual and statistical verification of model
-> Toolbox "Multimodal" available in R on CRAN (http://cran.r-project.org/)
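Step i can be illustrated with the standard Box-Cox power transform, which reduces to the natural logarithm for λ = 0. The sketch below is an assumption about how such a transform is applied, written in Python rather than with the R toolbox named above; the function names and the four example incomes are illustrative only.

```python
import math

def box_cox(x, lam):
    """Box-Cox power transform of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive data")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def inverse_box_cox(y, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)

# Illustrative incomes in Euro, transformed with the lambda quoted on the slide.
incomes = [500.0, 6000.0, 30000.0, 190000.0]
transformed = [box_cox(v, 0.2) for v in incomes]
restored = [inverse_box_cox(t, 0.2) for t in transformed]
print([round(t, 2) for t in transformed])
```

Because the transform is monotone, boundaries found on the transformed scale can be mapped back to Euro values with the inverse.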
Estimation of pdf
Kernel density estimation with variable radius
-> PDE is designed in particular to identify groups in data [Ultsch 2005]
How to estimate the density states within?
Gaussian Mixture Model (GMM)
EM-algorithm [Press 2007] estimates a log Gaussian mixture of four density states (components)
Blue: Components $N(x \mid m_i, s_i)$
Red: $\mathrm{GMM}(x) = \sum_{i=1}^{L} w_i\, N(x \mid m_i, s_i)$, with $\sum_{i=1}^{L} w_i = 1$
How do we calculate limits between components?
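The EM step can be sketched for a one-dimensional Gaussian mixture as below. This is a minimal textbook version, not the toolbox implementation: the function names, the fixed number of iterations, the starting values and the two-component synthetic data are all assumptions (the slide fits four components to log income).

```python
import math
import random

def gaussian_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def em_gmm_1d(data, means, sds, weights, iterations=100):
    """Fit a 1-D Gaussian mixture by EM, starting from the given parameters."""
    means, sds, weights = list(means), list(sds), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [weights[i] * gaussian_pdf(x, means[i], sds[i]) for i in range(k)]
            total = sum(joint) or 1e-300
            resp.append([j / total for j in joint])
        # M-step: re-estimate weight, mean and standard deviation per component.
        for i in range(k):
            n_i = sum(r[i] for r in resp)
            if n_i < 1e-12:
                continue  # leave an empty component unchanged
            weights[i] = n_i / len(data)
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            sds[i] = max(math.sqrt(var), 1e-6)
    return means, sds, weights

# Synthetic two-group data on an arbitrary scale (not the income data).
rng = random.Random(2)
data = [rng.gauss(0.0, 1.0) for _ in range(1500)] + [rng.gauss(6.0, 1.0) for _ in range(500)]
means, sds, weights = em_gmm_1d(data, means=[-1.0, 4.0], sds=[2.0, 2.0], weights=[0.5, 0.5])
print([round(m, 2) for m in means], [round(w, 2) for w in weights])
```

EM only finds a local optimum, so the starting values matter; this is one reason the slides pair the fit with visual and statistical verification.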
Application of Bayes Theorem
From the likelihood to generate data in a component c of the mixture, the conditional p(x|c), we calculate the posterior p(c|x)
Blue: Components
Red: GMM(x)
Example: Let's look at the red window with component 1 and component 2
Figure labels: i=1, i=2, i=3, i=4
First Boundary in GMM
$\mathrm{GMM}(x) = \sum_{i} w_i\, N(x \mid m_i, s_i) = \sum_{i} p(c_i)\, p(x \mid c_i)$
(Details, see Bayes theorem)
Posterior = 50%
Mixture Components:
Orange: $w_1\, N(x \mid m_1, s_1)$
Green: $w_2\, N(x \mid m_2, s_2)$
Figure panels: Posterior, i=1; Posterior, i=2
Exact Boundaries
Green: Calculated posteriors of the mixture components, $p(c_i \mid x)$, $i = 1, \dots, 4$
Posterior = 50%
Bayes Boundary between i = 1 and i = 2 (magenta)
Figure labels: i=1, i=2, i=3, i=4
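The boundary calculation can be sketched as follows: compute the posteriors from the weighted component densities, then search between two neighbouring component means for the point where the two weighted densities are equal. If the remaining components are negligible there, both posteriors are 50% at that point. The function names, the bisection search and the two-component example parameters are assumptions for illustration.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def posteriors(x, weights, means, sds):
    """Posterior p(c_i | x) for every component, by Bayes' theorem."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(joint)
    return [j / total for j in joint]

def bayes_boundary(i, j, weights, means, sds, tol=1e-10):
    """Bisection between means[i] < means[j] for equal weighted densities."""
    def diff(x):
        return (weights[i] * gaussian_pdf(x, means[i], sds[i])
                - weights[j] * gaussian_pdf(x, means[j], sds[j]))
    lo, hi = means[i], means[j]
    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("no sign change between the two means")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if diff(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Hypothetical two-component mixture on a log-like scale.
weights, means, sds = [0.3, 0.7], [2.0, 5.0], [1.0, 1.0]
boundary = bayes_boundary(0, 1, weights, means, sds)
post = posteriors(boundary, weights, means, sds)
print(round(boundary, 4), [round(p, 3) for p in post])
```

A boundary found on the transformed scale still has to be mapped back to Euro before it can be read as an income limit.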
GMM result for Income
Black = pdf(log(Data))
Magenta=Bayes Boundaries
Red=GMM
Blue=Components
Range:
1. Group: 0-1100 Euro
2. Group: 1100-12000 Euro
3. Group: 12000 -139000 Euro
4. Group: > 139000 Euro
Knowledge from Income Distribution
No.  Group         M ± AMAD in Euro    Population in %
1    Unemployed    500 ± 550           3
2    Low earners   6000 ± 5000         24
3    Middle class  30 000 ± 20 000     72
4    Upper class   190 000 ± 75 000    1
Model suggests different classes in society
Verification
Statistical testing: chi-squared test: p < .001
Visually: QQ plot
Compares two distributions by using n quantiles
Empirical distribution vs known distribution
If straight line: distributions equal
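The QQ comparison can be sketched by pairing empirical quantiles with the quantiles of a known distribution and checking how close the pairs lie to a straight line. In this hedged example the known distribution is a single Gaussian from Python's standard `statistics.NormalDist`, not the fitted mixture; the function names, the interpolation rule and the use of a correlation coefficient as a straightness measure are assumptions.

```python
import math
import random
from statistics import NormalDist

def qq_points(sample, dist, n_quantiles=99):
    """Pairs (theoretical quantile, empirical quantile) at evenly spaced probabilities."""
    ordered = sorted(sample)
    points = []
    for k in range(1, n_quantiles + 1):
        p = k / (n_quantiles + 1)
        # Empirical quantile by linear interpolation between order statistics.
        pos = p * (len(ordered) - 1)
        lower = int(math.floor(pos))
        upper = min(lower + 1, len(ordered) - 1)
        frac = pos - lower
        empirical = ordered[lower] * (1.0 - frac) + ordered[upper] * frac
        points.append((dist.inv_cdf(p), empirical))
    return points

def pearson_r(points):
    """Correlation of the QQ pairs; values near 1 indicate a straight line."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Synthetic sample drawn from the same Gaussian it is compared against.
rng = random.Random(3)
sample = [rng.gauss(10.0, 2.0) for _ in range(5000)]
points = qq_points(sample, NormalDist(mu=10.0, sigma=2.0))
r = pearson_r(points)
print(f"QQ correlation: {r:.4f}")
```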
Inequality: A Property of Distributions
Instead of comparing income data by pdf or cdf, use the ABCplot [Ultsch, Lötsch 2015]
Graphical representation of an upturned Lorenz curve L(p)
Equals: ABC(p) = 1 - L(1-p)
BUT: Compares the inequality of the data to the uniform distribution instead of the identity distribution
The distribution is more unequal (more skewed) if its curve lies above that of the uniform distribution
ABCanalysis on CRAN
For L(p) see [Kakwani 1980, p.30 ff]
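The relation ABC(p) = 1 - L(1-p) can be checked on a small example. The sketch below computes an empirical Lorenz curve and flips it into the ABC curve; the function names and the four-value data set are assumptions, and this is not the ABCanalysis package.

```python
def lorenz_curve(values):
    """Points (p, L(p)): share of the total held by the poorest fraction p."""
    ordered = sorted(values)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for k, v in enumerate(ordered, start=1):
        running += v
        points.append((k / len(ordered), running / total))
    return points

def abc_curve(values):
    """Points (p, ABC(p)) with ABC(p) = 1 - L(1 - p)."""
    # A Lorenz point (q, L(q)) becomes the ABC point (1 - q, 1 - L(q)).
    return sorted((1.0 - q, 1.0 - share) for q, share in lorenz_curve(values))

# Toy data: three equal small incomes and one large income.
values = [1.0, 1.0, 1.0, 7.0]
abc = abc_curve(values)
print(abc)
```

Read this way, ABC(p) is the share of the total held by the richest fraction p: in the toy data the top 25% hold 70% of the total.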
Summary
Previous models have the following disadvantages:
No systematic limit between high and low income
Inconsistent analysis methods: pdf vs cdf
Do not explain the whole range of income
Our model offers:
Simple mathematical foundation (Bayes)
Good fit over the whole range of income
Easily understandable and reproducible
Open problem:
Which parameters of the log-transformed income describe the income distribution itself?
Sources
N. Kakwani, Income Inequality and Poverty, A World Bank Research Publication, Oxford University Press, Oxford 1980
Chatterjee, A., Chakrabarti, B. K., & Stinchcombe, R. B. (2005). Master equation for a kinetic model of a trading market and its analytic solution. Physical Review E, 72(2), 026126.
Drăgulescu, A., & Yakovenko, V. M. (2001). Evidence for the exponential distribution of income in the USA. The European Physical Journal B-Condensed Matter and Complex Systems, 20(4), 585-589.
Lohn- und Einkommensteuerstatistik (EVAS 73111), Statistische Ämter des Bundes und der Länder, http://www.forschungsdatenzentrum.de/bestand/lest/index.asp, 11.2014 15:15
Campus-File der Einkommensteuerstatistik 2001, Statistisches Bundesamt, Gruppe „Steuern“, VID-37313100-04, Wiesbaden, Jan 2008
Asar, O., Ilk, O., Dag, O. (2015). Estimating Box-Cox Power Transformation Parameter via Goodness of Fit Tests. Accepted to be published in Communications in Statistics - Simulation and Computation
Press, W.H., Numerical recipes 3rd edition: The art of scientific computing. 2007: Cambridge university press.
Duda, R.O., P.E. Hart, and D.G. Stork, Pattern classification. 2nd. Edition. New York, 2001, p 512 ff
Ultsch, A., Pareto Density Estimation: A Density Estimation for Knowledge Discovery, in Innovations in classification, data science, and information systems, Springer, New York, pp 91-100, 2005.
Chakrabarti, A.S. and B.K. Chakrabarti, Statistical theories of income and wealth distribution. Economics: The Open-Access, Open-Assessment E-Journal. 4, p. 4, 2010.
Clementi, F. and M. Gallegati, Pareto’s law of Income Distribution: Evidence for Germany, the United Kingdom, and the United States, in Econophysics of wealth distributions, Springer, New York, pp 3-14, 2005.
Ferrero, J.C., The statistical Distribution of Money and the Rate of Money Transference. Physica A: Statistical Mechanics and its Applications, 341, pp 575-585, 2004.
Scafetta, N., S. Picozzi, and B.J. West, An out-of-equilibrium Model of the Distributions of Wealth. Quantitative Finance. 4(3), pp 353-364, 2004.
Thank you for listening. Any questions?
Boundaries by using Bayes Theorem
$p(c_i \mid x) = \frac{p(c_i)\, p(x \mid c_i)}{\sum_{j=1}^{L} p(c_j)\, p(x \mid c_j)}$
Posterior $p(c_i \mid x)$: probability that data x is in class $c_i$, with $\sum_{i=1}^{L} p(c_i \mid x) = 1$
Conditional probability $p(x \mid c_i)$: likelihood to generate data in this class
Prior $p(c_i)$: probability to choose a class, with $\sum_{i=1}^{L} p(c_i) = 1$
Normalization: equals $\sum_{j=1}^{L} w_j\, N(x \mid m_j, s_j)$
Chi-squared test
Let m be the number of bins, $E_i$ the expected and $O_i$ the observed count in bin i; then the test statistic (C2V) is
$\chi^2 = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i}$
The degrees of freedom are m-2
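The test statistic can be computed directly from binned counts. The sketch below uses made-up observed and expected counts for four bins and the m-2 degrees of freedom stated above; the function name and the numbers are assumptions, and turning the statistic into a p-value (which needs the chi-squared distribution function) is left out.

```python
def chi_squared_statistic(observed, expected):
    """Sum over bins of (O_i - E_i)^2 / E_i."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same number of bins")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts per bin: data histogram vs. counts implied by a fitted model.
observed = [18, 55, 80, 47]
expected = [20.0, 50.0, 85.0, 45.0]

statistic = chi_squared_statistic(observed, expected)
degrees_of_freedom = len(observed) - 2  # m - 2, as stated on the slide
print(f"chi-squared = {statistic:.3f} with {degrees_of_freedom} degrees of freedom")
```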
Definition Gaussian (pdf)
$N(x \mid m, s) = \frac{1}{\sqrt{2\pi}\, s} \exp\left(-\frac{(x - m)^2}{2 s^2}\right)$
Figure: Gaussian pdf with axis marks at m, m ± SD, m ± 2*SD and m ± 3*SD
Source of Img: http://matheguru.com/images/normalverteilung_68-95-99.png
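As a small worked check of the definition, the sketch below evaluates the Gaussian pdf and the probability mass within one, two and three standard deviations of the mean, using the error function from Python's `math` module. The function names are assumptions made for illustration.

```python
import math

def gaussian_pdf(x, m, s):
    """Gaussian density with mean m and standard deviation s."""
    return math.exp(-((x - m) ** 2) / (2.0 * s * s)) / (math.sqrt(2.0 * math.pi) * s)

def mass_within(k):
    """Probability that a Gaussian value lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(f"within m ± {k}*SD: {100.0 * mass_within(k):.2f}%")
```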