
All content in this area was uploaded by Michael Christoph Thrun on Sep 06, 2015

Content may be subject to copyright.

Models of Income Distributions for Knowledge Discovery

M. Thrun, Prof. Dr. Ultsch

Thrun - Databionics

University of Marburg

Income Distributions

1. Always positively skewed, with a single mode and a long tail [Kakwani 1980, p. 14]
2. Properties of income are described by various distributions
Models often separate the lower and upper parts of the distribution:
i. No systematic limit between low and high income
ii. Different methods for low and high income


Examples for Models of Income I

Low income: various distributions, e.g.

Approximation of probability density function (pdf)

Log-Normal distribution [Clementi and Gallegati, 2005]

Exponential distribution [Chakrabarti, 2006]

Gamma distribution [Ferrero 2004, Scafetta 2004]

Boltzmann-Gibbs distribution [Drăgulescu/Yakovenko 2001]


Examples for Models of Income II

High income: Pareto distribution
Approximation of cumulative distribution function (cdf)
Pareto Power Law distribution [Chatterjee et al. 2005, Levy and Solomon 1997]

Covers about 40% of income [Kakwani 1980, p.20]


Dataset

Gross annual income of the German population in 2001
The Campus-File of the income tax statistics 2001 for public use [EVAS 73111] gives a detailed overview and discloses a 1% sample
The income dataset is preprocessed:
Through the anonymization process, incomes higher than 500 000 were oversampled
=> we downsampled them to 1%
From now on: Income := gross annual income in Germany
Two examples follow.


Low Income with pdf

• BLACK: histogram of Income
• RED: Boltzmann-Gibbs fit, an exponential density of the form $p(x) = \frac{1}{T} e^{-x/T}$
• Regression and fit over the range 0-126 000 Euro


pdf estimation of Income
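The histogram-based fit of an exponential (Boltzmann-Gibbs) density can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the scale `true_scale` is a hypothetical value, and the function name `fit_exponential_scale` is not from the slides. It relies on the fact that for $p(x) = \frac{1}{T} e^{-x/T}$ the logarithm of the bin counts is linear in x with slope $-1/T$.

```python
import math
import random


def fit_exponential_scale(values, lo, hi, n_bins):
    """Estimate T in p(x) = (1/T) * exp(-x/T) from a histogram on [lo, hi).

    For an exponential density, log(bin count) is linear in x with slope -1/T,
    so a least-squares line through the log counts yields T.
    """
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v < hi:
            counts[min(int((v - lo) / width), n_bins - 1)] += 1

    # Use only non-empty bins, since log(0) is undefined.
    xs = [lo + (k + 0.5) * width for k, c in enumerate(counts) if c > 0]
    ys = [math.log(c) for c in counts if c > 0]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -1.0 / slope


random.seed(0)
true_scale = 20000.0  # hypothetical scale, NOT taken from the slides
incomes = [random.expovariate(1.0 / true_scale) for _ in range(20000)]
estimated_scale = fit_exponential_scale(incomes, 0.0, 126000.0, 30)
print(round(estimated_scale))
```

Rerunning the fit with a different `n_bins` or a different upper limit changes the estimate, which is the sensitivity to binning and range that such a regression has to cope with.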


High Income modeled with CCDF
BLUE: follows a power law with $\mathrm{CCDF}(x) = c\,x^{-\alpha}$
RED: linear regression of the data with $\log \mathrm{CCDF}(x) = \log c - \alpha \log x$
BLACK: Income
• Log/log plot of $\mathrm{CCDF}(x) = 1 - \mathrm{CDF}(x)$
• Regression begins somewhere at 10.25000 Euro
• Imprecise fit for >10.500000 Euro


Complementary Cumulative

Distribution Function (CCDF)
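A minimal sketch of the log/log regression on the CCDF, assuming synthetic Pareto-distributed data. The threshold `x_min`, the exponent `true_alpha`, and the function names are hypothetical and chosen only for illustration. The empirical CCDF is taken here as the fraction of observations greater than or equal to x, so that its logarithm is defined at every data point.

```python
import math
import random


def empirical_ccdf(values):
    """Return the sorted values and the fraction of observations >= each one.

    Assumes continuous data without ties.
    """
    xs = sorted(values)
    n = len(xs)
    return xs, [(n - k) / n for k in range(n)]


def fit_tail_exponent(values, x_min):
    """Least-squares slope of log10 CCDF against log10 x for all x >= x_min.

    Returns alpha, the negated slope of the fitted line.
    """
    xs, ccdf = empirical_ccdf(values)
    pts = [(math.log10(x), math.log10(c)) for x, c in zip(xs, ccdf) if x >= x_min]
    mean_u = sum(u for u, _ in pts) / len(pts)
    mean_v = sum(v for _, v in pts) / len(pts)
    slope = (sum((u - mean_u) * (v - mean_v) for u, v in pts)
             / sum((u - mean_u) ** 2 for u, _ in pts))
    return -slope


random.seed(0)
x_min = 100000.0   # hypothetical start of the high-income range
true_alpha = 2.0   # hypothetical Pareto exponent
high_incomes = [x_min * random.paretovariate(true_alpha) for _ in range(20000)]
estimated_alpha = fit_tail_exponent(high_incomes, x_min)
print(round(estimated_alpha, 2))
```

The estimate depends on where the regression starts: moving `x_min` changes which points enter the fit, which mirrors the difficulty of choosing a starting point for the high-income range.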


Problems

Limit for the range of low income is unclear
The choice of number and width of bins is critical for the fit of the pdf
Kernel density estimation with fixed radius
No clear starting point for high income
Log-linear approximation is imprecise for very high income
=> No systematic limit between low and high income


New Approach

i. Data logarithmically transformed: Box-Cox with λ = 0.2 with < 0.01 [Asar et al. 2015]
ii. Pdf estimated through Pareto density estimation (PDE) [Ultsch 2005]
iii. Mixture of Gaussians fitted with the toolbox
iv. Visual and statistical verification of the model
-> Toolbox „Multimodal“ available in R on CRAN (http://cran.r-project.org/)
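Step i can be illustrated with the standard Box-Cox power transformation, $y = (x^{\lambda} - 1)/\lambda$ for $\lambda \neq 0$ and $y = \log x$ for $\lambda = 0$. The sketch below uses λ = 0.2 from the slide; the function names and the sample values are illustrative assumptions, and the code assumes strictly positive incomes.

```python
import math


def box_cox(x, lam):
    """Box-Cox power transformation of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def inverse_box_cox(y, lam):
    """Map a transformed value back to the original scale."""
    if lam == 0:
        return math.exp(y)
    return (lam * y + 1.0) ** (1.0 / lam)


lam = 0.2  # value stated on the slide
sample_incomes = [500.0, 6000.0, 30000.0, 190000.0]  # illustrative inputs only
transformed = [box_cox(x, lam) for x in sample_incomes]
for x, y in zip(sample_incomes, transformed):
    print(x, round(y, 3))
```

The transformation is monotone, so boundaries found on the transformed scale can be mapped back to Euro with `inverse_box_cox`.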


Estimation of pdf

Kernel density estimation with variable radius
-> PDE is designed in particular to identify groups in data [Ultsch 2005]
How to estimate density states within?


Gaussian Mixture Model (GMM)

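EM algorithm [Press 2007] estimates a log-Gaussian mixture of four density states (components)
Blue: components $w_i N(x \mid m_i, s_i)$, with weight $w_i$, mean $m_i$ and standard deviation $s_i$
Red: $GMM(x) = \sum_{i=1}^{L} w_i N(x \mid m_i, s_i)$, with $\sum_{i=1}^{L} w_i = 1$
How do we calculate limits between components?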
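A minimal sketch of a textbook EM iteration for a one-dimensional Gaussian mixture, assuming synthetic data with two components. It is a generic illustration, not the implementation used in the toolbox; the initialization at empirical quantiles, the fixed number of iterations, and all names are assumptions made for this example.

```python
import math
import random


def normal_pdf(x, m, s):
    """Gaussian density with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (math.sqrt(2.0 * math.pi) * s)


def gmm_pdf(x, weights, means, sds):
    """GMM(x) = sum_i w_i * N(x | m_i, s_i)."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))


def fit_gmm_em(data, n_components, n_iter=100):
    """Fit weights, means and standard deviations with plain EM."""
    xs = sorted(data)
    n = len(xs)
    mean_all = sum(xs) / n
    sd_all = math.sqrt(sum((x - mean_all) ** 2 for x in xs) / n)
    # Initialization (an assumption): equal weights, means at evenly spaced quantiles.
    weights = [1.0 / n_components] * n_components
    means = [xs[int((i + 0.5) * n / n_components)] for i in range(n_components)]
    sds = [sd_all] * n_components
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point.
        resp = []
        for x in xs:
            parts = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
            total = sum(parts) or 1e-300
            resp.append([p / total for p in parts])
        # M-step: re-estimate the parameters from the weighted data.
        for i in range(n_components):
            r_sum = sum(r[i] for r in resp)
            if r_sum <= 0:
                continue
            weights[i] = r_sum / n
            means[i] = sum(r[i] * x for r, x in zip(resp, xs)) / r_sum
            var = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, xs)) / r_sum
            sds[i] = max(math.sqrt(var), 1e-6)
    return weights, means, sds


random.seed(1)
# Synthetic data: 30% from N(0, 1) and 70% from N(6, 1); hypothetical values.
data = [random.gauss(0.0, 1.0) if random.random() < 0.3 else random.gauss(6.0, 1.0)
        for _ in range(2000)]
weights, means, sds = fit_gmm_em(data, 2)
print([round(w, 2) for w in weights], [round(m, 2) for m in means])
```

For the income data, the same iteration would run on the transformed incomes with four components; EM only finds a local optimum, so the result can depend on the initialization.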


Application of Bayes Theorem

From the likelihood of generating data x in a component $c_i$ of the mixture, the conditional $p(x \mid c_i)$, we calculate the posterior $p(c_i \mid x)$
Blue: Components
Red: GMM(x)
Example: Let's look at the red window with component 1 and component 2


First Boundary in GMM

$GMM(x) = \sum_{i} w_i N(x \mid m_i, s_i) = \sum_{i} p(c_i)\, p(x \mid c_i)$
(Details, see Bayes theorem)
Posterior = 50%
Mixture components:
Orange: $w_1, m_1, s_1$
Green: $w_2, m_2, s_2$
Figure: posterior for i=1 and posterior for i=2


Exact Boundaries

Green: calculated posteriors $p(c_i \mid x)$ of the mixture components, $i = 1, \dots, 4$
Posterior = 50%
⇒ Bayes boundary between $i = 1$ and $i = 2$ (magenta)
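The boundary at a posterior of 50% can be located numerically. The sketch below assumes that the posterior of component i is above 50% at its own mean and below 50% at the mean of the neighbouring component j, and then bisects between the two means. The mixture parameters and the function names are hypothetical and serve only as an example.

```python
import math


def normal_pdf(x, m, s):
    """Gaussian density with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (math.sqrt(2.0 * math.pi) * s)


def posteriors(x, weights, means, sds):
    """Bayes theorem: p(c_i | x) = w_i N(x | m_i, s_i) / GMM(x)."""
    parts = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds)]
    total = sum(parts)
    return [p / total for p in parts]


def bayes_boundary(i, j, weights, means, sds, tol=1e-9):
    """Find x between means[i] < means[j] where p(c_i | x) crosses 50%."""
    lo, hi = means[i], means[j]
    if lo >= hi:
        raise ValueError("expected means[i] < means[j]")
    if (posteriors(lo, weights, means, sds)[i] <= 0.5
            or posteriors(hi, weights, means, sds)[i] >= 0.5):
        raise ValueError("posterior does not cross 50% between the two means")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if posteriors(mid, weights, means, sds)[i] > 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical two-component mixture, not the fitted income model.
example_weights = [0.5, 0.5]
example_means = [0.0, 4.0]
example_sds = [1.0, 1.0]
boundary = bayes_boundary(0, 1, example_weights, example_means, example_sds)
print(round(boundary, 6))
```

With equal weights and equal standard deviations the boundary lies midway between the two means; unequal weights or widths shift it toward the component with the smaller weighted density.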


GMM result for Income

Black = pdf(log(Data))

Magenta=Bayes Boundaries

Red=GMM

Blue=Components

Range:

1. Group: 0-1100 Euro
2. Group: 1100-12 000 Euro
3. Group: 12 000-139 000 Euro
4. Group: > 139 000 Euro


Knowledge from Income Distribution

No. | Group | M ± AMAD in Euro | Population in %
1 | Unemployed | 500 ± 550 | 3
2 | Low earners | 6000 ± 5000 | 24
3 | Middle class | 30 000 ± 20 000 | 72
4 | Upper class | 190 000 ± 75 000 | 1
Model suggests different classes in society


Verification

Statistical testing: chi-squared test, p < .001
Visually: QQ plot
Compares two distributions using n quantiles
Empirical distribution vs. known distribution
If the points lie on a straight line: distributions are equal
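The quantile comparison behind a QQ plot can be sketched as follows. As an assumption made for this example, the known distribution is represented by a large sample drawn from it, so both sides are handled as samples; the data are synthetic and all names are illustrative. The Pearson correlation of the quantile pairs is used here as a simple numeric stand-in for judging straightness by eye.

```python
import math
import random


def quantile(sorted_values, p):
    """Empirical p-quantile by nearest rank, for 0 < p < 1."""
    n = len(sorted_values)
    return sorted_values[min(int(p * n), n - 1)]


def qq_points(sample_a, sample_b, n_quantiles):
    """Pairs of matching quantiles of two samples at n evenly spaced levels."""
    a, b = sorted(sample_a), sorted(sample_b)
    probs = [(k + 0.5) / n_quantiles for k in range(n_quantiles)]
    return [(quantile(a, p), quantile(b, p)) for p in probs]


def pearson_r(points):
    """Correlation of the quantile pairs; close to 1 for a straight line."""
    n = len(points)
    mean_a = sum(a for a, _ in points) / n
    mean_b = sum(b for _, b in points) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in points)
    var_a = sum((a - mean_a) ** 2 for a, _ in points)
    var_b = sum((b - mean_b) ** 2 for _, b in points)
    return cov / math.sqrt(var_a * var_b)


random.seed(2)
# Synthetic stand-ins: "empirical" data and a sample from the "known" model.
empirical = [random.gauss(10.0, 2.0) for _ in range(5000)]
model = [random.gauss(10.0, 2.0) for _ in range(5000)]
points = qq_points(empirical, model, 20)
print(round(pearson_r(points), 4))
```

A high correlation only shows that the points are close to some line; equality of the two distributions additionally requires that line to be the identity, so the quantile pairs themselves should also be inspected.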


Inequality – A Property of Distributions

• Instead of comparing income data by pdf or cdf, use the ABCplot [Ultsch, Lötsch 2015]
• Graphical representation of an upturned Lorenz curve L(p)
• Equals: ABC(p) = 1 - L(1-p)
• BUT: compares the inequality of the data to the uniform distribution instead of the identity distribution
• The distribution is more unequal (more skewed) if its curve lies above that of the uniform distribution
ABCanalysis on CRAN
For L(p) see [Kakwani 1980, p. 30 ff]
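The relation ABC(p) = 1 - L(1-p) can be illustrated with a small sketch. It evaluates the Lorenz curve L(p) as the share of the total held by the lowest fraction p of the sorted values; this is a plain-Python illustration with made-up numbers, not the ABCanalysis package, and the function names are assumptions.

```python
def lorenz(values, p):
    """Share of the total sum held by the lowest fraction p of the values."""
    xs = sorted(values)
    k = int(round(p * len(xs)))  # number of values in the lowest fraction
    return sum(xs[:k]) / sum(xs)


def abc_curve(values, p):
    """ABC(p) = 1 - L(1 - p): share held by the highest fraction p."""
    return 1.0 - lorenz(values, 1.0 - p)


toy_incomes = [1.0, 2.0, 3.0, 4.0]  # made-up values for illustration
for p in (0.25, 0.5, 0.75):
    print(p, lorenz(toy_incomes, p), abc_curve(toy_incomes, p))
```

For perfectly equal values both curves coincide with the diagonal; the more the ABC curve rises above it, the larger the share held by the top fraction.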


Summary

Previous models have the following disadvantages:
No systematic limit between high and low income
Inconsistent analysis methods: pdf vs. cdf
Do not explain the whole range of Income
Our model offers:
Simple mathematical foundation (Bayes)
Good fit over the whole range of Income
Easily understandable and reproducible
Open problem:
Which parameters of the log-transformed income describe the income distribution itself?


Sources

Kakwani, N., Income Inequality and Poverty, A World Bank Research Publication, Oxford University Press, Oxford, 1980.
Chatterjee, A., Chakrabarti, B. K., & Stinchcombe, R. B. (2005). Master equation for a kinetic model of a trading market and its analytic solution. Physical Review E, 72(2), 026126.
Drăgulescu, A., & Yakovenko, V. M. (2001). Evidence for the exponential distribution of income in the USA. The European Physical Journal B-Condensed Matter and Complex Systems, 20(4), 585-589.
Lohn- und Einkommensteuerstatistik (EVAS 73111), Statistische Ämter des Bundes und der Länder, http://www.forschungsdatenzentrum.de/bestand/lest/index.asp, 11.2014 15:15
Campus-File der Einkommensteuerstatistik 2001, Statistisches Bundesamt, Gruppe „Steuern“, VID-37313100-04, Wiesbaden, Jan 2008
Asar, O., Ilk, O., Dag, O. (2015). Estimating Box-Cox Power Transformation Parameter via Goodness of Fit Tests. Accepted to be published in Communications in Statistics - Simulation and Computation.
Press, W.H., Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press, 2007.
Duda, R.O., P.E. Hart, and D.G. Stork, Pattern Classification. 2nd Edition. New York, 2001, p. 512 ff.
Ultsch, A., Pareto Density Estimation: A Density Estimation for Knowledge Discovery, in Innovations in Classification, Data Science, and Information Systems, Springer, New York, pp. 91-100, 2005.
Chakrabarti, A.S. and B.K. Chakrabarti, Statistical theories of income and wealth distribution. Economics: The Open-Access, Open-Assessment E-Journal. 4, p. 4, 2010.
Clementi, F. and M. Gallegati, Pareto’s Law of Income Distribution: Evidence for Germany, the United Kingdom, and the United States, in Econophysics of Wealth Distributions, Springer, New York, pp. 3-14, 2005.
Ferrero, J.C., The statistical Distribution of Money and the Rate of Money Transference. Physica A: Statistical Mechanics and its Applications, 341, pp. 575-585, 2004.
Scafetta, N., S. Picozzi, and B.J. West, An out-of-equilibrium Model of the Distributions of Wealth. Quantitative Finance, 4(3), pp. 353-364, 2004.


Thank you for listening. Any questions?


Boundaries by using Bayes Theorem

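Bayes theorem for class (component) $c_i$, $i = 1, \dots, L$:
$p(c_i \mid x) = \frac{p(x \mid c_i)\, p(c_i)}{\sum_{j=1}^{L} p(x \mid c_j)\, p(c_j)}$
Posterior $p(c_i \mid x)$: probability that data x is in class $c_i$, with $\sum_{i=1}^{L} p(c_i \mid x) = 1$
Conditional probability $p(x \mid c_i)$: likelihood to generate data x in this class, equals $N(x \mid m_i, s_i)$
Prior $p(c_i)$: probability to choose a class, with $\sum_{i=1}^{L} p(c_i) = 1$
Normalization $\sum_{j=1}^{L} p(x \mid c_j)\, p(c_j)$: equals GMM(x)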


Chi-squared test
Let m be the number of bins, $E_i$ the expected and $O_i$ the observed count in bin i; then the test statistic (C2V) is
$C2V = \sum_{i=1}^{m} \frac{(O_i - E_i)^2}{E_i}$
The number of degrees of freedom is m-2
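The test statistic can be computed directly from binned counts. The sketch below uses the standard Pearson form of the chi-squared statistic and the m-2 degrees of freedom stated above; the counts are made-up numbers, and the function names are assumptions for illustration.

```python
def chi_squared_statistic(observed, expected):
    """Sum over bins of (O_i - E_i)^2 / E_i; expected counts must be positive."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected need the same number of bins")
    if any(e <= 0 for e in expected):
        raise ValueError("expected counts must be positive")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def degrees_of_freedom(n_bins):
    """Degrees of freedom m - 2, as stated on the slide."""
    return n_bins - 2


observed_counts = [12.0, 18.0, 31.0, 39.0]  # made-up observed bin counts
expected_counts = [10.0, 20.0, 30.0, 40.0]  # made-up expected bin counts
statistic = chi_squared_statistic(observed_counts, expected_counts)
print(statistic, degrees_of_freedom(len(observed_counts)))
```

Turning the statistic into a p-value requires the chi-squared distribution function with the given degrees of freedom, which is left out here because the Python standard library does not provide it.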


Definition Gaussian (pdf)

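$N(x \mid m, s) = \frac{1}{\sqrt{2\pi}\, s} \exp\left(-\frac{(x - m)^2}{2 s^2}\right)$, with mean m and standard deviation s (SD)
Figure: Gaussian pdf with the axis marked at m, m ± SD, m ± 2*SD and m ± 3*SD
Source of Img: http://matheguru.com/images/normalverteilung_68-95-99.png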