Iterative Proportional Fitting
Theoretical Synthesis and Practical Limitations
Thesis submitted in accordance with the requirements of the University of Liverpool
for the degree of Doctor in Philosophy
by
Maja Založnik
November 2011
Abstract
Iterative proportional fitting (IPF) is described formally and historically and its advan-
tages and limitations are investigated through two practical simulation exercises using
UK census microdata. The theoretical review is unique in being comprehensive and
interdisciplinary. It is structured by progressing through three levels of understanding
IPF: contingency table analysis in classic applications, analysis using log-linear models
and finally understanding IPF as a method for maximizing entropy. An elaborate
methodological section develops the measures and technical tools for the analysis, and
explores the geographical aspects of the dataset by providing a unique and exhaus-
tive overview of the ecological fallacy, Simpson’s paradox and the modifiable areal unit
problem. The practical section investigates the behaviour of IPF under different sam-
pling scenarios and different data availability conditions using a large scale computer
simulation based on the UK Samples of Anonymised Records. By systematically and
comprehensively investigating the theoretical and practical issues related to IPF this
thesis supplements the fragmentary and piecemeal nature of the current literature and
does so in an accessible and interdisciplinary manner.
Acknowledgements
First and foremost I wish to thank my supervisors, Dr Paul Williamson and Dr Hill
Kulu for your incredible support and encouragement. Paul, thank you for repeatedly
reining me in when I went too far off on tangents and for showing such unwavering
support when I needed it most. I am particularly grateful to have had the chance to
know Prof. Bob Woods and the honour of having him examine my upgrade proposal.
I would have loved nothing more than to have had a chance to receive more input from
him.
I would like to thank the University of Liverpool and their Postgraduate Research
Studentship Fund and the Department of Geography for funding my tuition and stipend
without which this thesis would never have been possible. I am also indebted to the
RSAI-BIS for recognizing my work and giving me an added impetus at the start of my
research.
Thanks to Herbert Voss and the pstricks gang, to Alexey Koloydenko, and scores
of other lovely and helpful (and sometimes (rarely) arrogant but still helpful) people at
Nabble, stackoverflow.com, r-help@r-project.org, and the LaTeX Community. Zhenya and
Sasha, thanks to you too for not behaving irrationally and inexplicably, despite my
irrational and inexplicable need to anthropomorphize you.
Thank you Ivo for being the best office mate I could have hoped for, and also for
the stapler. Ashley, thanks for taking a shy Eastern European under your wing and
making her feel welcome. As for me, thanks for not annoying me too much. And to all
the postgraduates and staff in the department for your advice, support, encouragement
and/or companionship, especially Sarah, Becky, Andy, Bill, Richard, Bob, Andreas,
John, Tinho, Jayne, Claire, Susanne and Sandra.
Katja and Račka, thanks for years of friendship and support – it works long distance
as well! And thanks to the Garlic Mansion™ in all its versions, but in particular the
current one: Alex, Edu, Joel, Josh, Luis, Tiny and Fat Sophies, Steph – you kept me
sane and grounded.
Aleš and Lenča, this is of course all your fault. You not only made me want this,
you tricked me into thinking it was my idea. Well played. Pika, I hope you know what
you’re getting yourself into! And finally Matej, thank you for everything - without your
continued love, support and patience I would not be where I am today.
Ignorance is preferable to error, and he is less remote from the truth who
believes nothing than he who believes what is wrong . . .
Thomas Jefferson (1782)
Contents
Abstract i
Acknowledgements iii
1 Introduction 1
2 Notes on Notation 7
I IPF History and Practice 11
3 Early IPF Inventions and Applications 13
3.1 Introduction ————————————————————————— 14
3.2 Example of Classical IPF —————————————————— 16
3.2.1 Variational Independence of Marginal and Joint Distributions — 17
3.2.2 Odds Ratios as a Measure of Association——————————— 19
3.2.3 Combining Marginal and Joint Distributions ————————— 21
3.3 Formal Statement of the IPF Algorithm —————————— 25
3.4 IPF and Other Classical Applications ———————————— 26
3.4.1 IPF for Input-Output Matrices ——————————————— 26
3.4.2 IPF for Table Standardization ——————————————— 27
3.4.3 Testing for Hypothetical Patterns with IPF ————————— 33
3.5 Beyond Classical Applications of IPF ———————————— 35
4 IPF and Log-linear models 37
4.1 Introduction to Log-linear models ————————————— 38
4.2 Choice of Log-linear Parameters —————————————— 40
4.3 Log-linear Models as Generalized Linear Models —————— 45
4.4 Parameter Interpretation and Cell Fitting ————————— 47
4.4.1 Deviation Contrast Type Constraints ———————————— 47
4.4.2 Prescribed Interaction Models ——————————————— 48
4.5 Gravity Models of Spatial Interaction as Log-Linear Models 50
4.6 Maximum Entropy and Log-linear models —————————— 55
5 IPF for Maximum Entropy or Minimum Discrimination Information? 57
6 Estimating Cells in Three Dimensions 65
6.1 Original Data ———————————————————————— 65
6.2 Defining the Problem ———————————————————— 67
6.3 Estimating Cells with no Second Order Interaction————— 69
6.4 Estimating Cells with a Borrowed Second Order Interaction 71
6.5 Lessons from 3D Estimation ————————————————— 76
7 IPF in Geography — Applications and Limitations 77
II Methodology and Data 81
8 Measures and Methods 83
8.1 Introduction ————————————————————————— 83
8.2 Data ————————————————————————————— 83
8.3 Measuring the strength of bivariate associations —————— 87
8.3.1 Chi-square based measures ————————————————— 87
8.3.2 Proportional reduction in error (PRE) measures ——————— 89
8.3.3 The information-theoretic approach ————————————— 91
8.3.4 Choice of association descriptors —————————————— 93
8.4 Measuring goodness-of-fit —————————————————— 95
8.4.1 General distance based measures —————————————— 96
Proportion misclassified ———————————————— 97
Standardized root mean squared error —————————— 99
8.4.2 Z-scores ————————————————————————— 100
8.4.3 Pearson’s X² and the Power Divergence Family of Statistics —— 105
8.4.4 Exact Permutation Distributions of Goodness-of-fit Statistics —— 111
8.4.5 The importance of significance ——————————————— 113
8.5 Software Solutions ————————————————————— 118
8.6 Summary ——————————————————————————— 120
9 Small Area Microdata And Geographic Variation 123
9.1 Introduction ————————————————————————— 123
9.2 Levels of Geographic Variation ——————————————— 124
9.2.1 Variation of Association Strength between Local Authorities —— 124
9.2.2 Changes in association strength with geographic and geodemo-
graphic aggregation ———————————————————— 130
9.3 Ecological Fallacy and Simpson’s Paradox—————————— 132
9.3.1 Ecological Fallacy ————————————————————— 133
9.3.2 Simpson’s Paradox ———————————————————— 136
9.3.3 Correlation coefficient ——————————————————— 140
9.3.4 Simpson and Robinson combined? ————————————— 144
9.4 ‘Magnitude’ of Ecological Fallacy Effects and Occurrence
of Simpson’s Paradox in SAM ————————————————— 146
9.5 Summary ——————————————————————————— 156
III Applications 161
10 IPF and the Error of insufficient constraints 163
10.1 Model 1: From [A],[B],[C] to $\widehat{[ABC]}$ —————————————— 164
Geographic variation of model fit ———————————— 169
10.2 Model 2: From [AB],[C] to $\widehat{[ABC]}$ ——————————————— 172
Geographic variation of model fit ———————————— 175
10.3 Model 3: From [AC],[BC] to $\widehat{[ABC]}$ —————————————— 178
Geographic variation of model fit ———————————— 184
10.4 Model 4: From [AB],[AC],[BC] to $\widehat{[ABC]}$ ———————————— 185
Geographic variation of model fit ———————————— 191
10.5 Summary of results—————————————————————— 195
11 IPF and the Error of inaccurate constraints 205
11.1 Sampling Zeros ———————————————————————— 205
11.1.1 Selection of constant to add to empty cells —————————— 211
11.2 Model 4A: From [AB],[AC],[BC] and [ABC]′ to $\widehat{[ABC]}$ —————— 216
11.3 Model 4B: From [AB],[AC],[BC] and $[ABC_{GOR}]'$ or $[ABC_{Sg}]'$ to $\widehat{[ABC]}$ ——————————————————————— 220
11.4 Summary ——————————————————————————— 226
IV Evaluation 229
12 Conclusion 231
Bibliography 236
Appendices 251
A SAM Variable list 251
B List of Local authorities and their geographic and geodemographic
associations 263
C Robinson’s data on Nativity and Illiteracy 277
D R IPF code 279
E Summary statistics for the analysis in Chapter 10 283
F Sampling results: Model 4B-GOR vs. Model 4B-SG 287
G Sampling results: Model 4B-GOR and Model 4B-SG vs. Model 4 289
List of Figures
1.1 Schematic overview of most important IPF contributions ——————— 3
3.1 Tabular and graphical description of marginal distributions —————— 17
3.2 Three possible joint distributions consistent with known marginal distri-
butions ———————————————————————————— 18
3.3 The odds ratio and fourfold displays ———————————————— 21
3.4 Three possible marginal distributions with the same odds ratio ———— 22
3.5 Multiplying each column by a constant preserves the odds ratio ———— 23
3.6 Schematic overview of IPF ———————————————————— 24
3.7 Mosaic plots of British and Danish original data——————————— 29
3.8 Mosaic plots of British and Danish standardized data ———————— 31
3.9 Hypothetical and “normalized” endogamy data ——————————— 33
3.10 Fitting a hypothetical marriage pattern —————————————— 34
6.1 3D visualization of gender by LLTI for the three countries of the UK—— 67
6.2 Mosaic cube of LLTI by gender in countries of GB ————————— 68
6.3 Comparison of log-linear coefficients from the original data, the sample
data and the newly estimated data ———————————————— 74
8.1 Cramer’s V values for three crosstabulations across 373 LAs ————— 88
8.2 Adjusted Freeman Tukey values for three crosstabulations across 373 LAs 89
8.3 Goodman and Kruskal’s lambda values for three crosstabulations across
373 LAs ———————————————————————————— 90
8.4 Entropy coefficient values for three crosstabulations across 373 LAs —— 93
8.5 Average strength of association for 1596 pairs measured by the three
measures ———————————————————————————— 95
8.6 Binomial distribution for p = 0.005 and N = 1000 and normal approximation 103
8.7 Comparison of X² and Z² p-values ———————————————— 107
8.8 Power-divergence statistic for various λ values and corresponding p-values 110
8.9 Selected χ² distributions and densities of χ²/df ——————————— 116
8.10 Three normalizations of selected χ² distributions (left) and their respective critical values for p = 0.01 (right) ——————————————— 117
8.11 Pseudocode for IPF kernel ———————————————————— 119
9.1 Local authorities with weakest and strongest association when Freeman-
Tukey variation is largest————————————————————— 126
9.2 Association between central heating and housing indicator —————— 127
9.3 Local authorities with weakest and strongest association when entropy
variation is second smallest ———————————————————— 128
9.4 Local authorities with weakest and strongest association when Freeman-
Tukey variation is smallest ———————————————————— 129
9.5 Scale and zoning effects on entropy coefficients for the crosstabulation of
Communal establishment type by Accommodation self-contained ———— 131
9.6 Relative changes in standard deviations of measures of association strength after two types of aggregation (N=1596) —————————————— 133
9.7 Individual-level and division-level correlation of Nativity and Illiteracy — 136
9.8 3-D visualization of the original Robinson data ——————————— 137
9.9 Simpson’s paradox for categorical variables (based on Blyth, 1972) —— 138
9.10 Within-region correlation coefficients for Robinson’s data ——————— 140
9.11 Notation in three-dimensional example ——————————————— 144
9.12 Simpson’s paradox or Ecological Fallacy? —————————————— 146
9.13 Individual vs. ecological correlations at LA level (N=41,629) with examples of Simpson’s paradox shown in red —————————————— 148
9.14 Largest discrepancy between ecological (N=373) and individual correla-
tion (N=2,621,560) ——————————————————————— 150
9.15 Simpson’s paradox example found in SAM data ——————————— 151
9.16 Individual vs. ecological correlations at GOR level (N=41,629) ———— 152
9.17 Largest discrepancy between ecological (N=10) and individual correla-
tion (N=2,621,560) ——————————————————————— 153
9.18 Ecological correlations reverse at GOR (and Supergroup) and LA levels 154
9.19 Example of Simpson’s paradox at regional level ——————————— 155
10.1 Hierarchy of all possible three-dimensional models (adapted from Wickens, 1989, p.67) ————————————————————————— 163
10.2 Model 1: [1],[2],[3] →[123] with equation and degrees of freedom ——— 164
10.3 Goodness-of-fit statistics for Model 1 (N= 1596) —————————— 165
10.4 Type of communal establishment by Sex —————————————— 167
10.5 The two worst performing tables under Model 1 ——————————— 168
10.6 Individual cell contributions to lack-of-fit ————————————— 169
10.7 Correlation between FT² and goodness-of-fit under Model 1 ————— 170
10.8 Geographic variation of goodness-of-fit under Model 1 ———————— 170
10.9 Model 2 with equation and degrees of freedom ——————————— 172
10.10 Goodness-of-fit statistics for Model 2 (N = 1596) —————————— 173
10.11 The two worst performing tables under Model 2 ——————————— 174
10.12 Year last worked by Transport to work goodness-of-fit under first two models ————————————————————————————— 176
10.13 Country of birth - best and worst performing LAs under Model 2 and Model 1 ———————————————————————————— 178
10.14Model 3 with equation and degrees of freedom ——————————— 179
10.15Goodness-of-fit statistics for Model 3 (N= 1596) —————————— 179
10.16Goodness-of-fit statistics distribution for Models 1,2 and 3 (N= 1596) 180
10.17Type of communal establishment by Country of birth ————————— 181
10.18The two worst performing tables under Model 3 ——————————— 182
10.19Accommodation type by Country of birth fit under three models ———— 183
10.20Ethnic group by Religion goodness-of-fit under Model 3 and Model 1 —— 185
10.21Country of Birth by Ethnic group goodness-of-fit under Model 3 and
Model 1 ———————————————————————————— 186
10.22Model 4 with equation and degrees of freedom ——————————— 186
10.23Goodness-of-fit statistics for Model 4 (N= 1596) —————————— 187
10.24Relationship between Z-scores and p-values ————————————— 188
10.25Year last worked by Workplace —————————————————— 189
10.26The two worst performing tables under Model 4 ——————————— 190
10.27Family type by Number of employed adults in the household —————— 191
10.28Odds ratio analysis under high geographic variation ————————— 192
10.29Housing indicator by Central heating best improvement of fit ————— 194
10.30 Hierarchy of all possible three-dimensional models (adapted from Wickens, 1989, p.67) ————————————————————————— 196
10.31Relative factor effects for Accommodation type by Age (grey bars - partial
association, white bars - marginal association) ——————————— 199
10.32Relative factor effects in 56 tables where one of the variables is A=
Accommodation type1—————————————————————— 199
10.33All 56 [AB] effects for the top and bottom five variables as ranked by
their mean [AB] ———————————————————————— 203
11.1 Goodness-of-fit after substitution of zeros in [ABC ]′sample N=10,000
in Gender by Student by LA table (y axis is logarithmic)——————— 207
11.2 Effect of different constant added to Isle of Anglesey sample ————— 209
11.3 Effect of constant added to empty cells for four different GOR tables
(shades of grey correspond to sample sizes) ———————————— 213
11.4 Effect of constant added to empty cells for four different LA tables
(shades of grey correspond to sample sizes) ———————————— 214
11.5 Model 4A ——————————————————————————— 216
11.6 Summary goodness-of-fit results for nine different sizes of the [ABC]′
sample with Model 4 in red for comparison (N=1596) ———————— 217
11.7 Size of workforce by NS-SEC of FRP: Sutton plots for three sample sizes
before and after IPF. —————————————————————— 218
11.8 Only two tables where ∆falls monotonically with increased sample size 219
11.9 Improvement under 50% sample in Model 4A over Model 4 and propor-
tion empty cells ————————————————————————— 220
11.10 Model 4 with 3D sample taken at regional/Supergroup level ——— 221
11.11Goodness-of-fit for Size of workforce by NS-SEC of FRP under all three
sampling models ———————————————————————— 222
11.12Goodness-of-fit for Size of workforce by NS-SEC of FRP under all three
sampling models relative to average cell size ———————————— 222
11.13Summary goodness-of-fit results comparing all three sampled models
(N=1596) ——————————————————————————— 223
List of Tables
2.1 Notation of observed frequencies ————————————————— 8
3.1 List of selected marginal configurations——————————————— 16
3.2 Occupational mobility father-son frequency counts —————————— 29
3.3 Occupational mobility father-son standardized counts ———————— 30
4.1 Three hierarchical log-linear models in a 2 ×2 table ————————— 39
4.2 Multiplicative and additive log-linear parameters for two styles of con-
straints fitted to the saturated model example in Table 4.1 —————— 44
4.3 Relationships between cell frequencies and multiplicative/additive coef-
ficients ————————————————————————————— 48
4.4 Four Hypothetical intermarriage models used for testing endogamy pat-
terns—————————————————————————————— 49
5.1 Entropy of Possible Groupings of Six People ———————————— 58
6.1 LLTI by gender in countries of GB ———————————————— 66
6.2 Multiplicative coefficients, fully saturated model (deviation contrast)—— 69
6.3 Estimated full table with no CLG interaction ——————————— 72
6.4 Random sample of 1000 people and estimated full table using the sam-
ple’s CLG interaction —————————————————————— 73
6.5 Comparison of odds ratios of LLTI and Gender ——————————— 75
8.1 SAM Variable List (ONS, 2006) —————————————————— 84
8.2 Type of accommodation by Term time address in Bexley (Kent)———— 91
8.3 Ranking of three variable pairs by the three different measures ———— 94
8.4 Step by step calculation of z2—————————————————— 102
8.5 Calculation of exact p-value ($N = 4$, $k = 3$, $\hat{p}_1 = 0.2$, $\hat{p}_2 = 0.3$) ———— 113
9.1 Highest and lowest levels of variation of association strength ————— 125
9.2 Robinson’s Nativity and Illiteracy for US (Source: US Census (1931)) — 135
9.3 Summary statistics on the ecological fallacy and Simpson’s paradox at 4
different levels of aggregation (N=41,629) ————————————— 156
9.4 Variants of the ecological fallacy, Simpson’s paradox and the MAUP —— 158
10.1 Coefficients of determination (R2) ———————————————— 169
10.2 Ranking by geographic variation of Percent misclassified ( ∆) ———— 175
10.3 Highest ranking variables by Mean ∆and standard deviation across LAs 177
10.4 Coefficients of determination (R2) ———————————————— 184
10.5 Ranking by geographic variation of Percent misclassified ( ∆) ———— 184
10.6 National SAM table of Housing indicator by Accommodation self-contained. 189
10.7 Goodness-of-fit for Accommodation type by Age under all four models2— 196
10.8 A set of hierarchical models and their G2values ——————————— 198
10.9 Top and bottom five variables by strength of single factor effect ———— 201
10.10Top and bottom seven variables ranked by number of tables where [AB]>
[AC] —————————————————————————————— 201
10.11Top and bottom five variables by strength of geographic variation ([AC]) 202
11.1 Four tables used to investigate effects of adding different constants——— 212
11.2 Proportion of tables where regional sampling outperforms the geodemo-
graphic one - top and bottom five variables ————————————— 224
11.3 Proportion of tables where Model 4 outperforms regional (darker grey)
and Supergroup sampling (lighter grey) - top and bottom five variables— 225
Chapter 1
Introduction
For a relatively simple and intuitive algorithm, iterative proportional fitting (IPF) has
a surprisingly convoluted history. Although several authors have attempted to give
background information on the method, these accounts of the development of IPF have
invariably been only partial. IPF is a procedure applied to contingency tables and
hence has a wide range of applications in numerous fields where such crosstabulations
are used as a matter of course. This has meant that researchers working in fields ranging from engineering, demography, transportation research and economics to information and computing have ‘discovered’ the algorithm several times in what seems to have been an
essentially independent fashion. In parallel with these application driven discoveries the
theoretical development of statistical analysis of contingency tables also led to this same
algorithm and the related mathematical proofs, firmly embedding IPF in statistical
theory, but often shrouding it in inaccessible language and, in the modern computer age, concealing it from the view of practitioners by embedding it in the code of programmes that one only needs to run without understanding the mechanism behind them. As a result, the development path for IPF has so far led to a series of insights
into various aspects of its theoretical underpinnings and practical applications that are
wide in range, but piecemeal and fragmentary in nature. This thesis aims to draw
these existing insights together into a comprehensive theoretical account (Part I), and
to use this improved theoretical insight to shape the most fully developed systematic
investigation so far conducted into the technique’s practical limitations in real world
applications (Parts II-IV).
As its name suggests, IPF is an iterative procedure used to adjust or fit
contingency table cells to a set of constraints. It has now been proven that IPF
produces table cell values that are the unique maximum likelihood estimates under
the given constraints. Given these constraints, the IPF procedure also represents a
maximum entropy (i.e. minimum discrimination information) solution. Although it
has been used in a variety of applications, these can be classified into the following
groups:
(i.) Classical applications of IPF involve combining information from two or more
sources.
a. Updating tables e.g. a migration table from a previous census can be up-
dated with new marginal totals
b. Combining population and sample data e.g. when population totals are
known, the relationships between variables from a sample can be adjusted
to fit the table
c. Standardizing table margins e.g. making the table margins uniform so the
relationship between the variables is easier to establish.
(ii.) Fitting log-linear models where it is used for finding the maximum likelihood
estimate given specified model constraints.
(iii.) Hybrid applications are essentially classical applications with some of the data
missing - the estimation then involves implicitly fitting a log-linear model where
the missing data (variable relationships) are excluded from the model.
Figure 1.1 shows a schematic overview of some of the most important contributions
during the first 40 years of IPF history. The ten publications that are highlighted refer
to ten independent discoveries of the algorithm, and it is not unthinkable that there are
more. The remaining entries indicate publications that have made the strongest contri-
butions to the theoretical understanding of IPF including proofs of its convergence, its
relationship to maximum likelihood and log-linear modelling and to maximum entropy
(or minimum discrimination information) estimation.
All of the entries in Figure 1.1 are covered in the following text in an attempt to provide a comprehensive overview of the historical, practical and technical developments relevant to IPF. Inasmuch as is possible, the succeeding chapters are structured
to linearly follow the historical development and relevant literature alongside the math-
ematical and graphical elaboration of the procedure and associated concepts. Histor-
ical linearity is however suspended occasionally in favour of gradually increasing the
technical complexity of the narrative, which does not always follow the chronological
development of the field.
The central three chapters of Part I of this thesis covering classical IPF, log-linear
modelling and entropy maximization are organized as a hierarchy representing three
levels of understanding IPF. In the first IPF is treated simply as a tool. This level is
sufficient for many applications and in practice is often the only one considered. In
fact for two dimensional applications of IPF there is rarely a need to go beyond this
level. The second section allows a more abstract statement of problems where IPF
provides a solution. Log-linear models allow contingency tables of any dimension to
be unambiguously and systematically analysed and we feel are indispensable for high-
dimensional analysis where the intuitiveness of simple tables vanishes. The third level,
understanding IPF as a method for maximizing entropy, borders on the philosophical,
with maximum entropy presented as a fundamental principle of reasoning about prob-
ability. IPF is shown to provide the least biased estimate while taking into account
all known information. Appreciating IPF at this level as well, in addition to its cross-
disciplinary history, the basic workings of the procedure and its equivalent log-linear
specification, completes a comprehensive overview of this procedure. The many terms
and concepts discussed here are often used synonymously in the literature or discussed
only partially. Structuring the text this way - progressing through the three levels
mentioned - allows us a systematic conceptualization of IPF, one that is missing in the
wider literature.
[Figure 1.1 timeline, 1935-1975: Sheleikovskii (193?); Kruithof (1937) “Methode der dubbele factoren”; Deming & Stephan (1940) “Deming-Stephan algorithm”; Fratar (1954) “Cross-Fratar algorithm”; Romney (1957?); Brown (1959) “iterative scaling procedure”; Stone & Brown (1962) “R.A.S.”; Darroch (1962) “iterative scaling method”; Good (1963); Birch (1963); Furness (1965) “Furness iteration procedure”; Bishop (1967); Levine (1967?); Bregman (1967); Mosteller (1968) “Mostellerization”; Bishop et al. (1975)]
Figure 1.1: Schematic overview of most important contributions to the development of
IPF
To develop these three levels of understanding IPF, Part I is structured as follows:
As the first of the three central chapters, Chapter 3 covers the early or ‘classical’ ap-
plications of IPF. It describes one of the first and most famous applications of IPF by
Deming and Stephan before introducing elementary contingency table concepts such
as variational independence and odds ratios as a measure of association. IPF is then
illustrated in a full step-by-step example as well as being stated formally. Several types
of classical application are then described including table updating and table standard-
ization with examples from a variety of fields such as transportation and anthropology.
Chapter 4 is dedicated to log-linear analysis of contingency tables. Particular atten-
tion is devoted to the choice and interpretation of parameters. The models are further
described as a member of the generalized linear models family. Linking onwards to
maximum entropy this section also includes a review of so-called gravity models, which
can be seen as a special formulation of log-linear models, and are particularly interest-
ing for our purposes as they have been couched in the language of maximum entropy
from early on. This leads directly into Chapter 5 which gives a technical description
of the maximum entropy principle and its generalization as minimum discrimination
information, thereby completing the third level of the theoretical framework. Chapter
6 summarizes these three levels through a fully worked example of a three-dimensional
contingency table analysis. The data are analysed using three research scenarios build-
ing on the previously established principles of log-linear modelling and entropy maxi-
mization. Concluding this historical, practical and theoretical review, the final section
focuses on some possible limitations and problems associated with IPF, specifically
within the context of spatial applications; although these issues in fact generalize to
any high-dimensional problem. A brief review is given of recent geographical litera-
ture with regard to IPF as this provides a convenient starting point for the systematic
analysis of these issues that is attempted in Parts II and III.
In Part II we first focus on measures and methods in Chapter 8. The Small Area
Microdata dataset, used as a platform for the analysis presented in Part III, is briefly
described before two extensive sections focus on the choice of metrics for measuring the
strength of bivariate associations and goodness-of-fit. This chapter ends with a section
dedicated to introducing a bespoke IPF programme written in R, which is included
as Appendix D. The second chapter in this part investigates the levels of geographic
variation in the dataset and follows on from there to explore the issues of the modifiable
areal unit problem, ecological fallacy and Simpson’s paradox, both with regard to the
data as well as generally.
Part III is composed of two application chapters, which investigate the quality of
IPF estimates using the SAM data as a full population, which is sampled from in
various ways. This allows us to evaluate the algorithm’s performance under various
scenarios. Chapter 10 describes four different estimation models informed by various
levels of information. These are used to analyse how different variable combinations
behave under these circumstances and to what extent the geographical nature of the
data is relevant. Chapter 11 goes one step further: in addition to known and correct
information, the models now include sampled information as well. A systematic eval-
uation of sampling error for three-dimensional tables has not yet been attempted, and
this analysis is further expanded by using samples taken at higher aggregations as well.
This allows us to further explore the capacity of IPF to provide accurate data esti-
mates by borrowing strength from samples that have been aggregated either regionally
or based on geodemographic classification.
Investigating a single algorithm, one that we shall see is relatively simple and even
intuitive, will nevertheless compel us to explore very diverse topics, from the history
of statistics, applied engineering and even anthropology, to some of the most classic
and timeless issues of geography. Through the following investigation we will address
a great deficiency in the literature, one that does not however prevent researchers from
using IPF as a matter of course, but one that we believe prevents them from using it
effectively and efficiently.
Chapter 2
Notes on Notation
Consistent standards of notation are an ideal that is unfortunately rarely achieved even
within narrow fields of speciality. The principles of non-ambiguity and readability can
often work in opposite directions and their optimum balance is not always unique.
Furthermore, researchers in different fields that involve communicating mathematical
representations symbolically have adopted different styles and standards of notation,
complicating interdisciplinary linkage, while attempts at standardization are notori-
ously slow and contested.1
The notation in the literature reviewed in this research is by no means consistent,
not only due to the diversity of fields covered, but also due to parallel or unrelated
methodological developments in the same field. This can lead to such extreme cases
of indecision as found in an article by Wrigley (1980), who insistently duplicates all
the equations into ‘the notation of Goodman’ and ‘the notation of Fienberg’ versions,
while warning the reader that the same subscripts have different meanings in the various
equations.
Armed with the knowledge that a completely satisfactory and comprehensive nota-
tion is probably unattainable, the following is a description of the notational standards
adopted throughout the rest of this work in an attempt to accommodate contingency
table analysis, log-linear modelling, probability and information theory, while remaining as faithful as possible to semi-established conventions and without running out of symbols. Any
departures from these norms or shorthand notations will be explicitly noted in the text.
In a crosstabulation $A, B, C, \ldots$ refer to variables with $I, J, K, \ldots$ categories, indexed with the letters $i, j, k, \ldots$ where $i = 1, 2, \ldots, I$; $j = 1, 2, \ldots, J$, etc. These variable notations are used strictly whenever general formulations are attempted, although in specific examples a more mnemonic notation style is adopted for clarity.
Observed frequencies are denoted with $x$ and the appropriately indexed subscripts —
1In their proposal for a standardized notation in econometrics Abadir & Magnus (2002, p.89) note
somewhat despondently that 150 years after the introduction of the ‘=’ sign for equality, Bernoulli still
used the symbol ‘α’, and conclude that it will take the same amount of time before a common notation
is adopted in their field, which is probably too optimistic.
$x_{ijk}$ then denotes the observed frequency in cell $(A = i, B = j, C = k)$. The equivalent (unknown) population values are denoted with $n$ subscripted the same way, while the upper-case $N$ stands for the population total. Summation over an index, which is equivalent to the corresponding row or column total, is indicated by a $+$ sign in place of the summed-over subscript. So summing over variable $A$, which has $I$ categories, we have
$$\sum_i x_{ijk} = x_{1jk} + x_{2jk} + x_{3jk} + \cdots + x_{Ijk} = x_{+jk}$$
summing over the second variable $B$ and its $J$ categories is defined as
$$\sum_j x_{+jk} = x_{+1k} + x_{+2k} + x_{+3k} + \cdots + x_{+Jk} = x_{++k}$$
therefore the total number of observations in the table can be arrived at by summing over the third variable $C$ with $K$ categories
$$\sum_k x_{++k} = x_{++1} + x_{++2} + x_{++3} + \cdots + x_{++K} = x_{+++} = \sum_{i,j,k} x_{ijk} = N$$
These summations are perhaps easier to understand as marginal totals and Table 2.1
shows the notation in a two-dimensional table, where the first subscript corresponds to
summing the columns and the second to summing up the rows.
Table 2.1: Notation of observed frequencies in a crosstabulation of Variable $A$ ($i = 1, 2, \ldots, I$) and Variable $B$ ($j = 1, 2, \ldots, J$)

             B1     B2    ···    BJ   | Total
    A1      x11    x12    ···   x1J   |  x1+
    A2      x21    x22    ···   x2J   |  x2+
    ...     ...    ...    ···   ...   |  ...
    AI      xI1    xI2    ···   xIJ   |  xI+
    Total   x+1    x+2    ···   x+J   |  x++
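As a concrete illustration, the marginal totals behind the '+' notation can be computed directly in R; the following is a minimal sketch using an arbitrary, made-up 2 × 3 × 2 array standing in for a crosstabulation of A, B and C:

    # Made-up 2 x 3 x 2 array of observed frequencies x_ijk for variables A, B and C
    x <- array(c(12, 8, 5, 9, 7, 3,
                 10, 6, 4, 11, 2, 13),
               dim = c(2, 3, 2),
               dimnames = list(A = c("a1", "a2"),
                               B = c("b1", "b2", "b3"),
                               C = c("c1", "c2")))

    margin.table(x, c(2, 3))                # x_{+jk}: summed over A
    margin.table(x, 3)                      # x_{++k}: summed over A and B
    sum(x)                                  # x_{+++} = N, the table total
    addmargins(margin.table(x, c(1, 2)))    # two-way [AB] table with totals, as in Table 2.1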
Bracket notation is used to denote full tables and configurations of variables of lower dimensionality — so-called marginal configurations. These can range from one-dimensional arrays (‘edges’):
$$[A] = \{x_{1+}, x_{2+}, \ldots, x_{I+}\} = x_{i+}$$
to multidimensional tables produced by summing over one or more variables, e.g. in a four-way contingency table crosstabulating the variables $A$, $B$, $C$ and $D$ of dimensions $I \times J \times K \times L$, the marginal configuration achieved by summing over the $L$ categories of $D$:
$$[ABC] = \{x_{111+}, x_{112+}, \ldots, x_{IJK+}\} = x_{ijk+}.$$
The number of possible marginal configurations depends of course on the number of dimensions.² The number of possible configurations for each combination can be calculated using the binomial coefficient, e.g. the number of three-dimensional marginal configurations of a five-dimensional table is
$$\binom{5}{3} = \frac{5!}{3! \, (5-3)!} = 10$$
It should however be kept in mind that the highest level configurations automatically
imply all the lower levels: e.g. if [ABD] is known, the lower configurations of [AB],
[BD], [AD] and [A], [B] and [D] are simply the marginal sums that can be easily
calculated and their declaration is therefore redundant. In combining data from several
sources it will be useful to distinguish between the up-to-date, ‘known’ population
configurations and the auxiliary configurations e.g. from surveys or older data sources.
The latter, which are by definition less reliable, will be denoted by the prime symbol: $[AB]'$ or $x'_{ij}$. Estimated cell values use the same indexing as observed values, but are denoted with a hat symbol: $\widehat{[ABC]}$ or $\hat{x}_{ijk}$. Thus a typical problem could for example be stated as finding the best cell estimates $\widehat{[ABC]}$ from known margins $[AB]$, $[AC]$ and $[BC]$ and a sample $[ABC]'$.
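Both points, that the count of configurations is a binomial coefficient and that a higher-order configuration implies all the lower-order ones, can be checked in R with the made-up array x from the sketch above:

    choose(5, 3)   # = 10: three-dimensional marginal configurations of a five-way table

    # Lower-order configurations follow automatically from a higher one:
    # from the hypothetical [ABC] array x, the [AB], [BC] and [A] margins are implied
    AB <- margin.table(x, c(1, 2))       # [AB]
    BC <- margin.table(x, c(2, 3))       # [BC]
    margin.table(AB, 1)                  # [A] derived from [AB]
    margin.table(x, 1)                   # [A] derived directly: identical, so declaring it is redundant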
Proportions are indexed in the same way as cell frequencies:
$$P(x_{ij}) = \frac{x_{ij}}{x_{++}} = p_{ij}$$
and the probabilities in a table must of course sum up to unity:
$$\sum_{i,j} p_{ij} = p_{++} = 1$$
The same indexing rules apply to conditional probabilities, e.g.:
$$P(A_i \mid B_j) = \frac{x_{ij}}{x_{+j}} = p_{i|j}$$
is used to denote the conditional probability of a random individual being in category $i$ of $A$ given that they are in category $j$ of $B$.
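In R both kinds of proportion can be obtained with prop.table(); a minimal sketch with a hypothetical two-way table:

    # Hypothetical 2 x 2 crosstabulation of A (rows) by B (columns)
    xAB <- matrix(c(30, 20, 0, 50), nrow = 2,
                  dimnames = list(A = c("a1", "a2"), B = c("b1", "b2")))

    p <- prop.table(xAB)          # p_ij = x_ij / x_++
    sum(p)                        # p_++ = 1
    prop.table(xAB, margin = 2)   # p_{i|j}: proportions of A conditional on the category of B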
Other reserved symbols that can also be indexed appropriately if necessary are:
ω — odds,
θ — the odds ratio,
τ — multiplicative effect parameters,
λ = ln(τ) — additive effect parameters,
and their definitions will be introduced in the ensuing text.
²Table 3.1 in the next section shows the possible marginal configurations for two, three, four and five-way tables.
Part I
IPF History and Practice
Chapter 3
Early IPF Inventions and
Applications
Part I of this thesis is intended as a comprehensive overview of the IPF literature
and practice. Nevertheless, it cannot claim to be exhaustive. As will become clear
through the diversity and disparity of literature reviewed, there is a real possibility
that IPF applications presented under another name have been missed. Furthermore
some known applications have been intentionally omitted for the sake of clarity and
parsimony. Still this is, to our knowledge, the most complete review to date. The
second omission that should be mentioned is with regard to some of the more advanced
mathematical derivations and proofs. This overview cannot claim to be non-technical,
however the level of mathematical complexity builds up gradually and particular care
is taken that no steps are skipped in the derivations that are described. Parsimony
again requires some topics to be omitted, notably Lagrange multipliers, the Newton-
Raphson algorithm and several proofs of convergence of IPF. Wherever necessary the
appropriate references to the more advanced literature are given.
This chapter is intended as a comprehensive and non-technical introduction to
IPF. While more technical statistical frameworks will be introduced in the succeeding
sections, we first describe the methodology through the more accessible early historical
applications, whenever possible running in parallel with numerical and/or graphical
illustrations. This section also introduces several basic concepts that will be extended
in the remaining text.
Although it is not the first and not necessarily even the most important ‘invention’
of IPF, we start with the Deming and Stephan application of the method of iterative
proportions as it remains the most commonly cited example of this method and of
particular relevance to this thesis as it was also originally implemented on census data.
Some groundwork is necessary before describing a fully worked example of IPF, in
particular with regard to the concept of variational independence in a contingency table
and the measurement of associations using odds ratios. These ideas are introduced with
the help of dedicated graphical displays. A formal statement of the algorithm is then
given in its most general form. Having described the procedure, the chapter continues
with a historical overview of other early applications, which cover several disciplines
and a variety of approaches indicating the method’s flexibility and intuitiveness. It
is impossible to complete this overview without touching upon log-linear modelling,
however this is the topic of the next chapter.
3.1 Introduction
Contingency tables may seem like one of the most straightforward ways of dis-
playing quantitative data, yet their analysis has a comparatively short history (Stigler,
2002). Apart from a few applications using the simplest 2 ×2 tables, it was not until
the beginning of the 20th century that the field started developing and it took until
the 1960s before more complicated models of contingency table structure were being
proposed.
Early contingency table analysis involved testing the independence hypothesis by
comparing observed and expected frequencies using the chi-square goodness of fit test
developed by Pearson (1900) and, if the variables were not independent, investigat-
ing the relationships by looking at percentage differences or through odds ratios or some
other measure of association. It was not until 1935 that Bartlett introduced the con-
cept of second-order interaction in a 2 ×2×2 table—a table considered complex at the
time (Fienberg & Rinaldo, 2007). In fact when over 15 years later Lancaster expanded
Bartlett’s work for a general three-dimensional table, he also expressed his conviction
that “little use will ever be made of more than a three-dimensional classification” (1951,
p.247). This statement can only be appreciated with reference to the computational
facilities of the time although the renowned historian of statistics Stephen Stigler has
argued that this is a poor excuse (2002). There are numerous examples of authors in
the pre-computer era coping with incredibly burdensome calculations, so he concludes:
“[h]ad the need arisen I have no doubt that an enterprising statistician in the 19th cen-
tury would have employed iterative proportional fitting without the helpful instruction
of Deming and Stephan” (ibid., p. 566).
In fact at least one author did employ IPF before the 1940 publication of the
Deming and Stephan paper, which is frequently quoted as the first known use of the
procedure. Published in 1937 in a Dutch engineering journal, Kruithof’s de Methode
der dubbele factoren or double-factor method, was used to estimate telephone traffic
between telephone exchanges. This was an example of a ‘classical’ application of table
updating, where Kruithof used new data on total incoming and outgoing calls to revise
an older input-output table by iteratively rescaling the rows and the columns until
the new table summed up correctly (Kruithof, 1937). He did not however provide any
rationale for using the method, except that it was symmetrical and that it seemed to
work (ibid., E23).
The rationale provided by Deming and Stephan in their paper (1940) turned out to
be wrong, and was retracted by Stephan a few years later (Stephan, 1942). Regardless
of this fact, the article has become perhaps the best known exposition of the algorithm
and its practical ‘classical’ application. It describes a solution to the Bureau of the
Census’ problem of adjusting sample tables that were inconsistent with population
totals.
The Bureau first encountered this problem in preparation for the 1940 US census.
Faced with growing demands for population statistics and lacking the computational power and resources that, even today, make census data tabulation and release take years, the Bureau resolved the dilemma by creating an extended census questionnaire that
was applied only to every twentieth respondent, thereby creating a 5 % sample of full
census responses in addition to the 100 % basic census data (Stephan et al., 1940, p.
615). This left the Census Bureau with the task of adjusting the sample data to fit
consistently with the counts from the complete enumeration (ibid. p. 629) as opposed
to simply multiplying the tables by 20.
The task was therefore to make the new table as similar as possible to the original sample table while agreeing with the population totals, and Deming and Stephan’s reasoning led them to define this similarity by the sum of the squared differences, which should
be minimized. The reason for choosing least squares is given as the practical advantage
of uniqueness and the theoretical dignity of giving one kind of best estimate under ideal
sampling conditions (Deming & Stephan, 1940, p. 428). Finding the least squares
solution requires however solving a set of normal equations using Lagrange multipliers,
a task that quickly became unmanageable given the number of dimensions of these
tables, and this prospect led the two researchers to propose an approximation method.
Their solution was based on the fact that the simplest case of a two-dimensional
table and only one set of constraints can be solved by simply multiplying the cells by
the marginal ratio. They therefore proposed that this proportional adjustment could be
repeated for each set of constraints, and the process continued iteratively until all the
constraints were met. This procedure proved both intuitively simple and undemanding
to calculate. The conditions (marginal constraints) are met after relatively few itera-
tions and, as they note, “The final results coincide with the least squares solutions, which is thus accomplished without the use of normal equations”. The procedure took about 1/7th of the time normally required for a two-dimensional table, and was even more time
saving for higher dimensions. They termed the successive proportional adjustments
the method of iterative proportions and it was immediately adopted by the US Census
Bureau, which promptly renamed it as raking (Heyde and Seneta, 2001, p.488).
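A bare-bones sketch of the method of iterative proportions in R may help fix ideas. The seed table and target margins below are invented for illustration, and the function is a minimal two-dimensional version of the procedure just described; the fuller implementation used later in this thesis is given in Appendix D.

    # Minimal two-dimensional IPF (raking): rescale rows and columns in turn
    # until the table agrees with the target margins.
    ipf2d <- function(seed, row.targets, col.targets, tol = 1e-8, maxiter = 100) {
      fit <- seed
      for (iter in seq_len(maxiter)) {
        fit <- fit * (row.targets / rowSums(fit))                 # adjust to row margins
        fit <- sweep(fit, 2, col.targets / colSums(fit), "*")     # adjust to column margins
        if (max(abs(rowSums(fit) - row.targets),
                abs(colSums(fit) - col.targets)) < tol) break
      }
      fit
    }

    # Invented example: a small 'sample' table adjusted to new 'population' margins
    seed <- matrix(c(30, 20, 10, 40), nrow = 2)
    ipf2d(seed, row.targets = c(60, 40), col.targets = c(45, 55))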
As has been noted, Stephan quickly retracted the assertion that IPF provided the least squares solution (although it was close enough). In fact their method provided the solution that minimized the discrimination information distance; however, it would
take several decades to make this connection and for the underlying rationale of IPF
to be properly examined and its convergence properties proven. These developments
are inextricably linked to the development of log-linear models as well as the principle
of maximum entropy as the mathematical and philosophical basis of this estimation.
Both are described in the following sections, however to begin to understand these
developments we must first describe the algorithm itself, after first defining some basic
contingency table terminology.
3.2 Example of Classical IPF for Combining Population
and Sample Data
Both Kruithof and Deming and Stephan were faced with a similar scenario: They had
a set of up-to-date variable distributions [A] and [B] from one data source, and a full
crosstabulation of the variables from an older/smaller data source $[AB]'$. The question therefore is how to estimate $\widehat{[AB]}$, i.e. the most likely joint distribution of the variables that fits the known marginal distributions $[A]$ and $[B]$, while utilizing the information about their relationship contained in $[AB]'$.
This scenario scales up easily and Table 3.1 shows the possible marginal configura-
tions in tables ranging from two to five dimensions. Thus if [A], [B], [C] and [D] were
for example known and a sample was available crosstabulating $[ABCD]'$, a classical application of IPF would result in the estimated fitted crosstabulation $\widehat{[ABCD]}$. The
right-hand columns in the table always determine the lower order configurations on the
left, hence if [ABC] is known, [AB], [BC] and [AC] are also automatically known. This
however does not work in the opposite direction.
Table 3.1: List of marginal configurations for two to five-dimensional tables

    Dimensionality of (marginal) configuration
    1                 2                        3                        4                  Full
    [A][B]                                                                                 [AB]
    [A][B][C]         [AB][BC][AC]                                                         [ABC]
    [A][B][C][D]      [AB][AC][AD]             [ABC][ABD]                                  [ABCD]
                      [BC][BD][CD]             [ACD][BCD]
    [A][B][C][D][E]   [AB][AC][AD][AE][BC]     [ABC][ABD][ABE][ACD]     [ABCD][ABCE]       [ABCDE]
                      [BD][BE][CD][CE][DE]     [ACE][ADE][BCD][BCE]     [ABDE][ACDE]
                                               [BDE][CDE]               [BCDE]
3.2.1 Variational Independence of Marginal and Joint Distributions
A particular marginal distribution [A][B] can in fact result from (be compatible with)
several joint distributions [AB] (if we do not limit ourselves to integer solutions, there
are actually infinitely many possibilities). Thus knowing the marginal distribution of
two or more variables on a population tells us absolutely nothing about the relationship
between the variables.
             male   female
    rich       ?       ?       30
    poor       ?       ?       70
              50      50      100

    [Accompanying mosaic plot of the marginal distributions [G] and [W] omitted]
Figure 3.1: Tabular and graphical description of marginal distribution [G][W]
Figure 3.1 shows an example of an observed marginal distribution of gender [G] and
wealth [W]. In addition to presenting the data in tabular format, we introduce here
a graphical technique for visualizing tabular data - the mosaic plot. Figure 3.2 gives
three possible distributions of the data: [GW ]′, [GW ]′′ and [GW ]′′′ — three possible
associations between gender and wealth, all consistent with the ‘observed’ marginal
distributions. Again in addition to the tables, the data is also displayed in mosaic
plots, where tiles correspond to table cells and their sizes are proportional to the cell
frequencies1.
1Mosaic plots for visualising categorical data were originally proposed by Hartigan and Kleiner
(1984) and were further extended by Michael Friendly (1994; 1995b; 1998). In many of these ap-
plications the plots are enhanced using shading, patterns and different border styles to depict the
standardized residuals in a way that is intended to make the patterns stand out. For our purposes here
shading of the mosaic tiles is more prosaic and serves primarily in distinguishing variable categories to
reduce the need to clutter the plots with labels. Although the topic here does not require the distinc-
tion between dependent and independent variables, the convention is observed whereby the first split
(vertical) is made by the independent variable (e.g. gender), which allows comparison by width of the
areas, and the second split (horizontal) is made by the dependent variable (e.g. wealth) and shaded accordingly, to allow comparison of cell size and general ease of interpretation. In examples where the plots refer to square tables such as origin-destination matrices or endogamy tables where the row and column categories are the same, the shading was chosen to reflect the diagonal patterns (see for example Section 3.4).
    [GW]′            male   female
    rich               30       0       30
    poor               20      50       70
                       50      50      100

    [GW]′′           male   female
    rich                1      29       30
    poor               49      21       70
                       50      50      100

    [GW]′′′          male   female
    rich               15      15       30
    poor               35      35       70
                       50      50      100

    [Accompanying mosaic plots omitted]
Figure 3.2: Three possible joint distributions of [GW ] consistent with the marginal
distribution [G][W] from Figure 3.1
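The compatibility of all three joint distributions with the same margins is easily verified numerically; a short R sketch using the values from Figure 3.2:

    # The three joint distributions from Figure 3.2, all with margins [W] = (30, 70) and [G] = (50, 50)
    GW1 <- matrix(c(30, 20, 0, 50), nrow = 2,
                  dimnames = list(W = c("rich", "poor"), G = c("male", "female")))
    GW2 <- matrix(c(1, 49, 29, 21), nrow = 2, dimnames = dimnames(GW1))
    GW3 <- matrix(c(15, 35, 15, 35), nrow = 2, dimnames = dimnames(GW1))

    sapply(list(GW1, GW2, GW3), rowSums)   # identical [W] margins: 30, 70
    sapply(list(GW1, GW2, GW3), colSums)   # identical [G] margins: 50, 50

    # The independence table [GW]''' is just the outer product of the margins divided by N
    outer(rowSums(GW3), colSums(GW3)) / sum(GW3)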
As is clear from Figure 3.2, a whole spectrum of associations is possible given
the marginal distribution of gender and wealth. On the one hand it is possible that
none of the females are rich as in [GW ]′and at the other extreme (almost) all of the
rich people are female e.g. in [GW ]′′. The relationship [GW ]′′′ shows the case where
the two variables are independent, which can also be clearly seen on the mosaic plot,
where the cells align completely in such a case. These and many more relationships
between variables are all consistent with (i.e. add up to) the given marginal distribution.
This may seem like a trivial point to make, but its misunderstanding is in fact one of
the underlying mechanisms for the ecological fallacy and is the reason why Simpson’s
paradox is even seen as paradoxical, a topic we will explore in more depth in Section
9.3.
In a similar fashion, the same association between two variables can also be consis-
tent with different marginal distributions. We could, for example, know that the odds
of being rich are four times greater for males than they are for females. This measure of
association expressed as the odds ratio (sometimes called the cross product ratio) how-
ever, tells us nothing about the actual proportions of rich or poor, male or female, and
we can therefore again construct several possible crosstabulations where the association
between the variables holds constant, while the proportions of each group varies. Such
an example will be illustrated right after a brief description of the odds ratio measure
itself.
3.2.2 Odds Ratios as a Measure of Association
The odds ratio is denoted by the Greek letter theta, $\theta$, and is calculated as the ratio of two conditional odds, which we denote using the Greek letter omega, $\omega$. We first obtain the conditional odds. In our example these can be stated the following way: the odds of a person being rich given they are male is defined as the ratio of the probability of being rich and male and the probability of being poor and male:
$$\omega_{r|m} = \frac{P(W_r \mid G_m)}{P(W_p \mid G_m)} \qquad [3.1]$$
which we can calculate from the cell values as:
$$\omega_{r|m} = \frac{p_{r|m}}{p_{p|m}} = \frac{x_{rm}/x_{+m}}{x_{pm}/x_{+m}} = \frac{x_{rm}}{x_{pm}} \qquad [3.2]$$
The same can be done to find the odds of a person being rich given they are female:
$$\omega_{r|f} = \frac{p_{r|f}}{p_{p|f}} = \frac{x_{rf}/x_{+f}}{x_{pf}/x_{+f}} = \frac{x_{rf}}{x_{pf}} \qquad [3.3]$$
The odds ratio is then defined as the ratio of these two odds:
$$\theta_{(rp)(mf)} = \frac{\omega_{r|m}}{\omega_{r|f}} = \frac{x_{rm}/x_{pm}}{x_{rf}/x_{pf}} \qquad [3.4]$$
We can also write the odds ratio in the general form as:
$$\theta_{(ii')(jj')} = \frac{x_{ij} \cdot x_{i'j'}}{x_{ij'} \cdot x_{i'j}} \qquad [3.5]$$
Using the second example from Figure 3.2 ([GW ]′′) we can calculate the odds of being
rich for men:
$$\omega_{r|m} = 1/49 = 0.02$$
and the odds of being rich for women:
$$\omega_{r|f} = 29/21 = 1.38$$
And the odds ratio is therefore:
$$\theta_{(rp)(mf)} = \frac{1/49}{29/21} = 0.015$$
meaning that in our example the odds of a male being rich are 0.015 times the odds
of a female being rich. We could have also started with different conditional odds e.g.
the odds of being poor given one’s gender. It is easy to see that in this case the odds
ratio would have been simply the inverse:
$$\theta_{(pr)(mf)} = 1/\theta_{(rp)(mf)} = 67.67 \qquad [3.6]$$
indicating that the odds of being poor are 67.67 times higher for males than for females.
Regardless of the way the odds ratio is calculated, it contains the same information: it
fully describes the association between our two variables.2
Deconstructing the odds ratio in this way also allows us to make two important
points about this measure of association. First of all, it is easy to see what an odds
ratio value of one means: that there is no association between the variables and more
specifically it means the conditional odds are the same for both categories. A second
important point is that the odds ratio is not a symmetric measure of association. In
our numerical example, the odds ratio was 67.67 or 0.015, depending on the perspec-
tive. The values describe the same strength of association, and are inverse. This can
make it difficult to compare their magnitude directly. It is therefore sometimes more
convenient to take a logarithm of the odds ratio. This makes the measure symmetric
with independence at zero (ln 1 = 0) and values running from minus infinity to plus
infinity. Using natural logarithms, in our example the odds ratios map onto 4.2 and
−4.2 respectively, making it clear they are of equal strength and different direction.
Odds ratios and their logarithms each have their advantages and disadvantages: the
logarithms make it easier to compare their strengths and the regular values are easier
to interpret, so the choice depends on the application.
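The worked example can be reproduced in a few lines of R, using the [GW]′′ table from Figure 3.2:

    # Odds, odds ratio and log odds ratio for the [GW]'' example
    x <- matrix(c(1, 49, 29, 21), nrow = 2,
                dimnames = list(W = c("rich", "poor"), G = c("male", "female")))

    odds.male   <- x["rich", "male"]   / x["poor", "male"]     # 1/49  ~ 0.02
    odds.female <- x["rich", "female"] / x["poor", "female"]   # 29/21 ~ 1.38
    theta <- odds.male / odds.female                           # ~ 0.015
    1 / theta                                                  # ~ 67.67, the inverse perspective
    log(theta)                                                 # ~ -4.2, symmetric around zero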
We use fourfold displays (Figure 3.3) to graphically depict odds ratios. The asso-
ciation between the variables is expressed by the diagonal pairs of cells differing from
each other i.e. independent variables where the odds ratio equals one, would be rep-
resented by a fourfold display with four equal sized wedges. The quarter circle wedges
are proportional to the cell frequencies after being standardized to equalize the table
²This is only strictly true for a 2 × 2 table as larger tables have more odds ratios to completely define the association. For larger tables the indexing of the odds ratio is indispensable, however for 2 × 2 tables we can write simply $\theta$ as it is clear what it refers to.
    [Fourfold display of the 2 × 2 table (rich/poor by male/female) omitted; the annotated odds ratio is]
$$\theta_{(rp)(mf)} = \frac{x_{rm}/x_{pm}}{x_{rf}/x_{pf}} = \frac{x_{rm} \cdot x_{pf}}{x_{rf} \cdot x_{pm}}$$
Figure 3.3: The calculation of the odds ratio and its visualisation using a fourfold
display
margins³ (Friendly, 1995a). When used together with the previously introduced mosaic
plots, fourfold displays allow the visualization and comparison of contingency tables by
separately focusing on the composition of the population and the association of the
variables.
Figure 3.4 shows the tables and corresponding mosaic plots for three possible popu-
lation (or sample) compositions, which all exhibit the same association between gender
and wealth: the odds ratio is four. The mosaic plots show strikingly different pictures
– a very visual reminder of the effect of marginal totals – however all three sets of data
have come from the same hypothetical population where the odds of being rich are four
times greater for males than for females.
3.2.3 Combining Marginal and Joint Distributions
It is clear from the above examples that the marginal distribution and the odds ratio
are independent from each other, a property referred to as variation independence
(Rudas, 1998, p. 8 ff). Thus to uniquely identify a crosstabulation, both the marginal
distribution and the odds ratio have to be known, or to put it differently: a set of
marginal distributions and the relationship between the variables expressed as odds
ratios completely define a unique crosstabulation.
A brief point to be made about odds ratios is that multiplying the cell frequencies
by a constant will of course leave the odds ratio unchanged. The same applies if the
values of a single column or a single row are multiplied by a constant, because the
constant appears in both the numerator and denominator of the odds ratio and cancels
3 In fact the actual standardization of the frequencies required to produce a fourfold plot is done using IPF, preserving the odds ratio while adjusting the margins to a 50:50 ratio, a topic dealt with in depth in Section 3.4.
          male    female
rich       7.79    42.21     50
poor       2.21    47.79     50
          10       90       100
θ = (7.79 × 47.79) / (2.21 × 42.21) = 4

          male    female
rich       7.79     2.21     10
poor      42.21    47.79     90
          50       50       100
θ = (7.79 × 47.79) / (42.21 × 2.21) = 4

          male    female
rich      33.33    16.67     50
poor      16.67    33.33     50
          50       50       100
θ = (33.33 × 33.33) / (16.67 × 16.67) = 4

Figure 3.4: Tabular and graphical displays of three possible marginal distributions with the same odds ratio θ = 4.
out. For example, if we were to double the frequencies in the first row:
θ = (x11 × x22) / (x21 × x12); doubling the first row gives θ′ = (2·x11 × x22) / (x21 × 2·x12) = (x11 × x22) / (x21 × x12) = θ
we see the odds ratio remains unchanged. This is simply an aspect of the variational independence of the odds ratio as a measure of association between two variables: its value is unaffected by rescaling of rows or columns. Although this might seem a desirable property, many commonly used measures of association do not in fact conform to it, and this can often lead to confusing results (or rather misinterpretations of results; cf. Tan et al. (2004)).
If we take the example in the bottom panel of Figure 3.4 and, say, increase the number of men by 50% while halving the number of women, the association between the variables is preserved. Figure 3.5 shows this transformation, as well as the calculation of the odds ratio, which remains four. The association between the variables, measured as the odds ratio, is again the same in both cases, which can be thought of as two different samples of the population: one with equal numbers of males and females, and one heavily biased towards males. Not only have the column totals changed, but the row totals as well. This example of column adjustment epitomizes the underlying logic of the IPF procedure.
          male    female
rich      33.33    16.67     50
poor      16.67    33.33     50
          50       50       100

θ = (33.33 × 33.33) / (16.67 × 16.67)
  = (3/2·33.33 × 1/2·33.33) / (3/2·16.67 × 1/2·16.67)     (columns multiplied by 3/2 and 1/2)

          male    female
rich      50        8.33     58.33
poor      25       16.67     41.67
          75       25       100

  = (50 × 16.67) / (25 × 8.33) = 4

Figure 3.5: Multiplying each column by a constant preserves the odds ratio but changes the marginal distribution.
IPF proceeds through a sequence of steps just like the one described, successively adjusting column and row margins to a set of predefined values, all the while keeping the variable association, i.e. the odds ratio, intact. In each iteration of the procedure all of the dimensions are fitted sequentially (in the two-dimensional scenario shown here these are the column and the row margins respectively) until satisfactory convergence is achieved.
Figure 3.6 gives a schematic overview of the procedure for a two-dimensional 2 × 2 example. The starting point is a set of marginal configurations [A][B] without their joint distribution [AB], and a known odds ratio θ. The odds ratio can very simply be transformed into a tentative joint distribution [AB]′: essentially any distribution that gives θ is acceptable. In the example in Figure 3.6 the value of x22 is set to 4 and the remaining three frequencies to 1, but other combinations of values with a cross product of four would all yield the same result. In step 1.a, the first step of the first iteration, this starting joint distribution [AB]′ is adjusted along the columns to agree with the observed marginal distribution [B]. This means multiplying the frequencies in the first column by 50/2 = 25, and in the second by 50/5 = 10. The column marginal totals are now correct, and so is the odds ratio.
Known marginal distributions [A][B]:
        B1     B2
A1       ?      ?     30
A2       ?      ?     70
        50     50    100

Known odds ratio θ = 4, starting table [AB]′:
        B1     B2
A1       1      1      2
A2       1      4      5
         2      5      7

Step 1.a — columns:
        B1     B2
A1      25     10     35
A2      25     40     65
        50     50    100        (θ = 4)

Step 1.b — rows:
        B1      B2
A1      21.43    8.57    30
A2      26.92   43.08    70
        48.35   51.65   100     (θ = 4)

Step 2.a — columns:
        B1      B2
A1      22.16    8.3     30.46
A2      27.84   41.7     69.54
        50      50      100     (θ = 4)

...

Final step:
        B1      B2
A1      21.87    8.13    30
A2      28.13   41.87    70
        50      50      100     (θ = 4)

Figure 3.6: Schematic overview of IPF — the framed graphics represent the known or correct parameters.
The odds ratio remains unchanged throughout the procedure, but the row marginal totals (35 and 65) do not agree with the observed data [A]. We therefore proceed to Step 1.b and adjust the row marginals. With the row margins now correct, as is the odds ratio, the column totals have become misaligned. The second iteration again adjusts the column and row margins in successive steps, and the discrepancy reduces further.
The procedure is stopped when satisfactory convergence is achieved; in the above example the convergence criterion used is 0.01, meaning the marginal subtotals in the final panel are correct to within two decimal places. In this case only four iterations were required to meet the convergence criterion: the cell frequencies add up to the correct marginal totals, while the odds ratio is 4.0039. If more precision is required, the steps are simply repeated until the desired precision is reached.
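The worked example in Figure 3.6 can be reproduced in a few lines of code. The following Python sketch is illustrative only (the function name and structure are my own): it starts from a seed table with an odds ratio of 4, alternately rescales columns and rows to the margins [B] = (50, 50) and [A] = (30, 70), and stops once the column totals agree within the 0.01 criterion used above.

```python
import numpy as np

def ipf_2d(seed, row_targets, col_targets, tol=0.01, max_iter=100):
    """Fit a 2-D seed table to given row and column totals by IPF."""
    x = seed.astype(float).copy()
    for _ in range(max_iter):
        x *= col_targets / x.sum(axis=0)                 # step a: columns
        x *= (row_targets / x.sum(axis=1))[:, None]      # step b: rows
        if np.abs(x.sum(axis=0) - col_targets).max() < tol:
            break                                        # margins agree within tol
    return x

seed = np.array([[1.0, 1.0],
                 [1.0, 4.0]])        # any seed with a cross-product ratio of 4
fitted = ipf_2d(seed, row_targets=np.array([30.0, 70.0]),
                col_targets=np.array([50.0, 50.0]))

print(np.round(fitted, 2))           # approx. [[21.87, 8.13], [28.13, 41.87]]
print(fitted[0, 0] * fitted[1, 1] / (fitted[0, 1] * fitted[1, 0]))   # still (about) 4
```

Run as written, the loop stops after four iterations, matching the count reported above.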
3.3 Formal Statement of the IPF Algorithm
The following formal statement for the IPF algorithm is given for a 3-dimensional ta-
ble, but easily extends to higher dimensions by analogy (after Bishop et al. (1975)).
Estimation of [ABC] given the marginal constraints [AB], [AC] and [BC] (which automatically includes [A], [B] and [C]) begins with some initial estimates of the cell frequencies [ABC]′, denoted x^(0)_ijk, where the superscripted index refers to the iteration step; zero indicates the starting estimates before the first iteration. Depending on the application, these initial estimates can be taken from a sample or from historical data, they can be created directly from an odds ratio as in the previous example, or they can all be set to the same value (usually one) so that no association is present, which we will also call a uniform prior. Each step successively adjusts the estimates to one of the marginal configurations:
Step 1.a: adjusting to [AB]
x^(1)_ijk = x^(0)_ijk × n_ij+ / n^(0)_ij+

Step 1.b: adjusting to [BC]
x^(2)_ijk = x^(1)_ijk × n_+jk / n^(1)_+jk

Step 1.c: adjusting to [AC]
x^(3)_ijk = x^(2)_ijk × n_i+k / n^(2)_i+k
Each round of iterations r has as many steps as there are marginal configurations to adjust to — in this example three. The iterations stop when the difference between estimates of successive iterations becomes lower than a predefined small amount δ:
x^(3r)_ijk − x^(3r−3)_ijk ≤ δ
The final x̂_ijk then satisfy, with a δ margin of error,
x̂_ij+ ≈ n_ij+,  x̂_+jk ≈ n_+jk  and  x̂_i+k ≈ n_i+k
which is to say the estimated table cells sum up correctly to the given constraints, the starting marginal configurations [AB], [AC] and [BC].
For additional dimensions the algorithm is simply extended to include an extra step in every iteration cycle, so that the number of steps corresponds to the number of dimensions. Although this demonstration assumes all the second-highest level configurations are known (e.g. in a five-dimensional table to be estimated we know all five of the four-dimensional marginal configurations), IPF can still be used if they are not all known. In such a case IPF will automatically estimate the missing lower level configurations as it estimates the full table. So if only [AB] and [C] are known, and IPF is used to estimate the full table [ABC], the estimated margins [AC] and [BC] will have automatically been correctly calculated without the need to estimate them separately in an extra step.
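The three-dimensional statement of the algorithm translates almost directly into code. The following sketch is a minimal illustration under my own naming conventions: it adjusts a seed array to the two-way margins [AB], [BC] and [AC] in turn and stops when successive estimates differ by no more than δ. The margins here are derived from an assumed 2 × 2 × 2 "true" table purely so that they are mutually consistent.

```python
import numpy as np

def ipf_3d(seed, n_ab, n_bc, n_ac, delta=1e-6, max_iter=1000):
    """Fit a 3-D table to the two-way margins [AB], [BC] and [AC] by IPF."""
    x = seed.astype(float).copy()
    for _ in range(max_iter):
        x_old = x.copy()
        x *= (n_ab / x.sum(axis=2))[:, :, None]   # step a: adjust to [AB]
        x *= (n_bc / x.sum(axis=0))[None, :, :]   # step b: adjust to [BC]
        x *= (n_ac / x.sum(axis=1))[:, None, :]   # step c: adjust to [AC]
        if np.abs(x - x_old).max() <= delta:      # successive estimates agree
            break
    return x

# Margins taken from an assumed 2 x 2 x 2 "true" table, so they are consistent.
truth = np.array([[[10., 20.], [30., 40.]],
                  [[15., 25.], [35., 45.]]])
n_ab, n_bc, n_ac = truth.sum(axis=2), truth.sum(axis=0), truth.sum(axis=1)

fitted = ipf_3d(np.ones((2, 2, 2)), n_ab, n_bc, n_ac)   # uniform prior as the seed

print(np.allclose(fitted.sum(axis=2), n_ab),
      np.allclose(fitted.sum(axis=0), n_bc),
      np.allclose(fitted.sum(axis=1), n_ac))            # True True True
```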
3.4 IPF and Other Classical Applications
Deming and Stephan’s invention and use of IPF for adjusting a census sample has
already been described in the introduction. This section traces some of the other
early uses of IPF showing the variety of fields it was found useful in as well as the
terminological complications that have arisen through this early history. We first look
at a special case of square tables or matrices common in transportation research and
economics, then at a less limiting application of standardizing and finally at another
use of IPF for square tables in anthropology that foresaw much of the subsequent
development in log-linear modelling. The timeline for these references can be found in
Figure 1 on page 3.
3.4.1 IPF for Input-Output Matrices
It seems the classical application of IPF is particularly intuitive in the 2-dimensional
input-output or origin-destination matrix scenario. We can infer this from the fact
that the procedure was discovered on at least five separate occasions that involved such
problems. Kruithof’s telephone traffic problem has already been mentioned. Vehicular
traffic estimation led Fratar to develop his successive approximation procedure (1954), to estimate the traffic in an origin-destination table. The procedure was approved by
the U.S. Bureau of Public Roads and renamed the Cross-Fratar method. In spite of
Fratar’s work, transportation research saw another reinvention of the procedure by
Furness (Furness, 1965) to become known as the Furness (iteration) method. With
reference to a 1967 paper by L.M. Bregman, the procedure has also become known
as Bregman’s balancing method, again primarily in the field of transportation research
(Lamond & Stewart, 1981). This is particularly curious as Bregman himself speaks of it as Sheleikhovskii's method, and claims it was first proposed in the 1930s by the Leningrad architect for calculating passenger flow, while Bregman picks it up to prove the method's convergence.4
Around the same time the group of economists studying inter-industry transactions
for a large econometric research project called the Cambridge Growth Project was being
inconvenienced by the fact that full industry input-output tables only existed for years
when Censuses of Production were taken, which created the need to update these tables
using only partial information about current industry inputs and outputs. Again the
problem was solved using IPF, in this case known as the biproportional and later the RAS model5 (Stone & Brown, 1962). Of course all of the methods mentioned here have continued under their individual names in strands of publications within their fields, which tend not to mention each other or their equivalence6, nor to use the term IPF.
3.4.2 IPF for Table Standardization
The third ’classical’ application of IPF that has not yet been mentioned is table stan-
dardization. As opposed to combining data sources, standardizing tables using IPF
adjusts a known table [AB] to a set of standard, uniform margins [A] and [B] where
A_1 = A_2 = ... = A_I = N/I and B_1 = B_2 = ... = B_J = N/J. This application is
sometimes also referred to as mostellerization after Frederick Mosteller, who gave the
first published example of such an application (Mosteller, 1968). His article, originally
delivered as the Presidential Address to the American Statistical Association in 1967,
is more of a summary of existing methods for contingency tables and in it Mosteller
explicitly states that his research student Joel H. Levine used this technique in his
(unpublished) PhD thesis comparing British and Danish occupational mobility data in
1967, a reference of possible authorship that seems to have been lost in later reviews
4Unfortunately Bregman gives no reference in his paper and other authors have been similarly
unable to find the original work by G.V. Sheleikhovskii, which might well have preceded Kruithof’s.
In any case this potentially first invention of IPF took place during a period of strong anti-intellectual
sentiment in the Soviet Union, exemplified by an attack of professor Sheleikhovskii’s work at the First
Congress of Soviet Architects in 1937, where he was reprimanded for his use of “logarithms, integrals,
and other mathematical attributes, [which serve only] as a means of showing off, as a smoke screen,
to hide his false conceptualization” (quoted in Paperny, 2003). According to E. Naydina, a senior
bibliographer at the Russian National Library, Sheleikhovskii was in fact forbidden from publishing
any work that attempted a “mathematical description of human behavior”. He was however allowed
100 copies of his manuscripts to distribute among other academics. Despite his work being practically
inaccessible, he became one of the most widely cited authors, with most of the authors quoting him
not actually having the physical opportunity to see the original text (personal communication, May
2009). The earliest surviving copy of his transportation manuscripts is dated 1940; however, a later 1946 manuscript I have obtained explicitly refers to his original, 1936 work. This lost manuscript is
in all likelihood the one referred to by Bregman, making it the earliest known invention of IPF.
5 The name comes from the original notation, which used r to designate row multipliers, A to designate the base year coefficient and s to designate the column multiplier, hence rAs, or RAS. It does not, as Willekens claims (1994, p. 14), come from the name R. A. Stone, Stone's given name being John Richard Nicholas.
6There are of course exceptions to this rule, although these do not necessarily contribute to a
standardization across the fields. Lamond and Stewart for example note that the Kruithof and Furness
methods as well as several other methods used in gravity models (see Section 4.5) are essentially
equivalent, however they chose to group them under the term Bregman’s balancing method, which was
not a common term at the time—and never became one either.
of the method. Whether this method was used earlier is unclear although not entirely
unlikely.
To confuse matters further, the term mostellerizing is occasionally used as a syn-
onym to IPF in general. It appears Upton was the first to call this type of table
standardization after Mosteller (1978, p. 93), however other authors have insisted on
using the term mostellerization for the general IPF procedure, despite being clearly
aware of the original work by Deming and Stephan and the particularity of Mosteller’s
application (e.g. Fingleton, 1981b)7. Analysing British voting transitions, Särlvik and Crewe even claim IPF is “colloquially known as ‘Mostellerization’ after its begetter”,
seemingly unaware of any previous applications (1983, p. 360). In the standard refer-
ence book on categorical data analysis Alan Agresti adds to the confusion by claiming
table standardization is called raking (1983, p. 345), although that term had by that
time been pretty well established — especially in the official statistics community —
to refer to adjusting sample data to population margins.
What we will now refer to as table margin standardization is again an intuitively
appealing application of IPF for both descriptive and comparative crosstabulation anal-
ysis. By preserving the association between the variables i.e. the odds ratios, but by
setting uniform margins it allows an easier overview of the association between the
variables and/or the comparison of tables from different populations.8
Introductory statistical texts will invariably advocate percentaging as a means of
descriptive analysis of a crosstabulation. Because of different category sizes, cell fre-
quencies can obscure the data patterns, so the student is required to calculate row
or column percentages, the choice being made depending on which variable is deemed
dependent/independent. If no such distinction can be made, two percentage tables are
calculated, thereby splitting the analysis of the association between the variables into
two separate interactions. This allows us to inspect the effect of one variable keeping
the other constant. Table margin standardization using IPF allows us to keep both
variables constant i.e. to remove the effect of category sizes for both variables and
inspect the association independently of the margins — this is a practical effect of the
variational independence which was dealt with extensively, alongside odds ratios, in
Section 3.2.
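In code, table margin standardization is nothing more than IPF with uniform target margins of N/I and N/J. A brief sketch might look as follows; the table and the function name are made up for illustration, and a non-square 2 × 3 example is used deliberately so that the row and column targets differ.

```python
import numpy as np

def standardize_margins(table, tol=1e-8, max_iter=1000):
    """Rescale a table by IPF so every row sums to N/I and every column to N/J."""
    x = table.astype(float).copy()
    n = x.sum()
    row_target, col_target = n / x.shape[0], n / x.shape[1]
    for _ in range(max_iter):
        x *= col_target / x.sum(axis=0)                # equalize column totals
        x *= (row_target / x.sum(axis=1))[:, None]     # equalize row totals
        if np.abs(x.sum(axis=0) - col_target).max() < tol:
            break
    return x

# A made-up 2 x 3 crosstabulation, purely for illustration.
table = np.array([[40.,  5., 15.],
                  [10., 20., 10.]])
std = standardize_margins(table)

print(np.round(std, 2))
print(std.sum(axis=1), std.sum(axis=0))   # rows sum to 50, columns to 33.33...
```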
Levine’s application of margin standardization (as it is reproduced in Mosteller
7In a paper estimating the characteristics of Western radio audiences in the USSR, Parta et al. even
describe how they “put a methodological problem to Frederick Mosteller, a Harvard statistician. He
referred us to a technique reported by Deming and Stephan [...] we have called the resulting procedure
‘Mostellerization’.” (1982, p. 586). This same ‘mostellerization’ was used to estimate Cuban listeners
of Radio Liberty as recently as 1999 (Roberts), where it was described as a significant advance in the
study of ‘captive populations’ (Seligson, 1999).
8The first example of margin standardization to my knowledge, was published by Udny Yule (1912,
p. 590 ff) to compare vaccination and recovery incidences in three British hospitals, however in this
example the 2 ×2 tables were standardized algebraically and not iteratively.
Table 3.2: Occupational mobility father-son frequency counts for Great Britain (Glass,
1954) and Denmark (Svalastoga, 1959)
British occupational mobility
Father’s
status
Son’s status
(1) (2) (3) (4) (5)
(1) 50 45 8 18 8 129
(2) 28 174 84 154 55 495
(3) 11 78 110 223 96 518
(4) 14 150 185 714 447 1510
(5) 3 42 72 320 411 848
106 489 459 1429 1017 3500
Danish occupational mobility
Father’s
status
Son’s status
(1) (2) (3) (4) (5)
(1) 18 17 16 4 2 57
(2) 24 105 109 59 21 318
(3) 23 84 289 217 95 708
(4) 8 49 175 348 195 778
(5) 6 8 69 201 246 530
79 263 658 829 562 2391
(1968)) is an attempt at comparing two classic occupational mobility data sets for
Britain and Denmark (Glass, 1954; Svalastoga, 1959). Table 3.2 shows the two data
sets of father and son pairs distributed into five occupational categories. Not only do
the table totals differ (3500 and 2391), which could be adjusted relatively easily, but the
numbers of fathers and sons in each category are also not comparable between the
two countries (or rather samples), making them appear incompatible. This can be
visualised using the mosaic plots in Figure 3.7, where the rows refer to the fathers’
occupational categories, and the columns to the sons’ and the shading depends on the
level of intergenerational mobility.
In the British data we can see for example that a surprisingly large number of
fathers in categories (2) and (3) have sons in category (4). Does this indicate something
about the mobility patterns, or is it simply an artefact of the fact that category (4) is
particularly numerous? And how does intergenerational occupational mobility compare
between the two countries if the composition differs as well? Standardising the table
margins of both tables removes this effect and allows a direct comparison of the patterns.
Figure 3.7: Mosaic plots of British and Danish original occupational mobility data from Table 3.2
Table 3.3 shows the results of margin standardization of the original frequency
counts. The patterns in the two tables, visualised in the mosaic plots in Figure 3.8, are
now free of the effects of differently sized categories and are therefore directly comparable and, as can be seen, remarkably similar, a conclusion that could not easily have been made from the original frequency data (Mosteller, 1968, p. 8). The same
type of analysis could also be used to compare occupational mobility over time.
Table 3.3: Occupational mobility father-son standardized counts for Great Britain
(Glass, 1954) and Denmark (Svalastoga, 1959)
British occupational mobility
Father’s
status
Son’s status
(1) (2) (3) (4) (5)
(1) 68.5 20.9 4.6 3.7 2.3 100
(2) 17.8 37.5 22.5 14.7 7.5 100
(3) 8.0 19.2 33.7 24.3 14.9 100
(4) 4.1 14.7 22.6 31.1 27.6 100
(5) 1.6 7.8 16.6 26.2 47.8 100
100 100 100 100 100 500
Danish occupational mobility
Father’s
status
Son’s status
(1) (2) (3) (4) (5)
(1) 58.6 25.0 12.0 2.6 1.8 100
(2) 21.1 41.6 21.9 10.3 5.1 100
(3) 11.7 19.3 33.7 21.9 13.5 100
(4) 4.1 11.4 20.7 35.5 28.4 100
(5) 4.5 2.7 11.8 29.8 51.2 100
100 100 100 100 100 500
Standardizing the margins of a table therefore allows us to isolate the core pattern
of the association, what Mosteller calls its basic nucleus, as it is reasonable to think of
the association as independent of the relative size of the variable categories (1968, p.
4). In this sense Mosteller suggests interpreting the resulting numbers as transitional
or conditional probabilities (ibid. p. 8). To make this distinction clear, consider the
first row of the original British frequency counts in Table 3.2: fathers with the highest
social status (1) have 50 sons in the same category (1) and 45 sons in the second highest
category (2). Does this mean a father with social status (1) is almost equally likely to
have a son in either group (1) or group (2)? How do we take into account the fact that
there are almost five times as many sons in group (2) as in group (1)?
If a father's occupational status had no effect on the sons, i.e. their statuses were independent, the first two cells in the table would read approximately 4 (= 106 × 129/3500) and 18 (= 489 × 129/3500). Yet even if the statuses were independent, a father in category (1) would not be equally likely to have a son in group (1) and in group (2), simply because of the unequal group sizes. This sort of confusion that arises because of different group totals is of course absent in the standardized tables shown in Table 3.3. The values in this table
can now be interpreted as follows: if there were no relationship between fathers’ and
sons’ statuses all the entries would be 20 (a fifth - since we have 5 categories). However
an association clearly exists and the fathers in group (1) are almost 3 1/2 times more
likely to have sons in the same group (68.5/20), a pattern which in fact applies to all
5 categories as can be seen from the main diagonal cells being the largest, much as
would be expected from an intergenerational mobility table. This factor is in fact the
Figure 3.8: Mosaic plots of British and Danish standardized occupational mobility data from Table 3.3
multiplicative coefficient of association that is central to log-linear models, a topic
that is elaborated in the next section.9
As has already been noted the odds ratio is a measure of variable association that is
independent of the margins, and is therefore preserved during table margin standard-
ization. However not all common measures of association are invariant under marginal
transformations. One such measure is the phi-coefficient, the geometric mean of the
percentage differences across the rows and columns in a cross tabulation of two bi-
nary variables. Richard Liu (1980) noted that the effect of unequal margins makes
the φ-coefficient incomparable between tables with different margins and showed that
margin standardization using IPF would correct this problem. A similar problem with
Cohen’s kappa was resolved by Agresti et al. (1995) by introducing a smoothed version
of kappa, calculated after the table margins were standardized, of course using IPF.
In a comprehensive overview of 21 different measures of association or interestingness
of association Tan et al. showed these measures could give dramatically conflicting
information about the strength of association (2004). However, the authors then show
that standardizing the margins of the table using IPF makes all the 21 measures of
association become consistent.10
Apart from Mosteller’s presidential address to the American Statistical Association
9We can also note at this point the additional usefulness of the mosaic plots in this situation, where
we can interpret the shape of the individual cells in the following way: if the cell is approximately square,
the interaction between the variables has little effect in that particular combination, while wider or
narrower cells indicate that particular combination is more/less common than we would have expected
(relative to independence). This interpretation of course applies only to tables with standardized
margins as e.g. in Figure 3.8 and not e.g. in Figure 3.7!
10In fact, because the IPF standardization preserves the odds ratio, it is the remaining 20 measures
that become consistent with the odds ratio. The authors point out that other standardizations are
possible that would preserve one of the other measures and make the rest consistent with it. These
other standardization schemes are however not generally used, and can only be achieved algebraically.
(1968) several other key texts can be identified as introducing table standardization to
practising social scientists. One is Stephen Fienberg’s A Statistical Technique for His-
torians: Standardizing Tables of Counts (1971), another Kimball Romney’s Measuring
Endogamy (1971), both of which advocate the use of IPF for table margin standardiza-
tion to allow for descriptive analysis of patterns in crosstabulations. Romney’s work on
endogamy tables (crosstabulations of inter- and intra-group marriages) also represents
an important intermediary step in bringing crosstabulation analysis using IPF closer
to log-linear modelling and is therefore worthy of some consideration.
Endogamy tables again seem like an ideal ground for the development and concep-
tualisation of IPF. Indeed, Romney's work reads as an introduction to a new procedure, normalization by iteration, with no mention of Deming and Stephan or any other previous applications of IPF. Romney's subject matter allows a helpful conceptualisation of the procedure in the context of measuring marriage preferences. The marriage data is arranged in a square table, where the categories represent lineages, clans, social classes or whatever other grouping, and the rows represent the male groups while the columns represent the female groups. The cells along the main diagonal are frequency counts of endogamous marriages, the off-diagonal cells are exogamous11 (see Figure 3.9). Romney regards the marriage process as having two stages, distinguishing
between potential mates meeting and actually deciding to marry. The first meeting
stage is considered to be random, therefore the probability of meeting a potential part-
ner depends only on the relative sizes of the groups, while the second stage reflects
differential marital preferences. It is this latter stage that he is interested in as the
endogamy pattern, and in order to estimate it, he needs to eliminate the biases of the
first stage.
In the hypothetical community with two intermarrying groups A1 and A2 shown in the left-hand panel of Figure 3.9, there are 50 males in group A2, 26 of whom married females from their own group and 24 females from A1. But because the other group has 150 females, three times more than their own group, they were also three times more likely to meet them. In order to disentangle the actual marital preferences, the data must be normalized, i.e. the table margins standardized, to reveal the genuine marriage pattern shown in the right-hand panel of Figure 3.9. Simply put: 70% of the time people will prefer to marry within their own group, or, to put it differently, the odds of marrying within one's group are 2 1/3 to 1 in favour. Or, in the language of multiplicative coefficients: if group membership were irrelevant in marital choice, all four values in the right-hand panel of Figure 3.9 would equal 50; therefore only 60% (30/50) of the people who would have married outside their group if group membership did not matter actually did so, while 40% more people (70/50) married within their group.
11 In the visualisations of endogamy tables the diagonal cells corresponding to the within-group marriages are shaded and the cells corresponding to exogamous marriages are not.
                 Females
                 A1       A2      Total
Males   A1      126       24       150
        A2       24       26        50
      Total     150       50       200

                 Females
                 A1       A2      Total
Males   A1       70.46    29.54    100
        A2       29.54    70.46    100
      Total     100.00   100.00    200

θ = (126 · 26) / (24 · 24) = 70.46² / 29.54² = 5.69

Figure 3.9: Hypothetical endogamy data (left) and “normalized” data (right) (Romney, 1971)
The effect of standardization can also be visualised with the help of the mosaic plots below their respective tables, where the standardized values are much clearer in describing the relationship (the strength of endogamy) than the original data. Of course in both cases the odds ratio has remained a constant 5.69, meaning that the odds of marrying a female from group A1 are 5.69 times greater for a male from group A1 than for a male from group A2. This should not be confused with the odds of marrying within one's own group!
The idea of removing the effect of the margins before analysing a table is of course
not limited to square mobility or endogamy tables, but can be used on any type of
crosstabulated data. It should be kept in mind that if the table is not square, the standardized row and column totals cannot equal each other, because table margins can only be standardized to margins that add up to the correct total N.
3.4.3 Testing for Hypothetical Patterns with IPF
After standardizing to hypothetical (uniform) margins to discover hidden patterns in
the joint distributions it seems the next natural step is to try to standardize the patterns
themselves. And this was Romney’s next step: in addition to the descriptive analysis
just described, he also used IPF to compare endogamy tables with idealized patterns
in hypothetical tables he constructed himself. He could then compare the observed
counts with the fitted idealized model in the style of independence hypothesis testing.
This analytical use of IPF, as Roberts calls it (2002, p. 189), is particularly important as it allows an intermediate model for a two-dimensional table, one between complete independence and a fully saturated model, the only two options considered by log-linear modelling at the time.
This procedure can be illustrated with a brief example taken from Romney’s original
Original data:
                 Females
                 A1      A2      A3     Total
Males   A1       46       6       1       53
        A2        8      24       5       37
        A3        2      13       8       23
      Total      56      43      14      113

Hypothetical odds ratios (ratio table):
       3.77    1       1
       1       3.77    1
       1       1       3.77

Fitted data:
                 Females
                 A1       A2      A3     Total
Males   A1       40.35    9.50    3.15     53
        A2        7.97   26.68    2.34     37
        A3        7.68    6.81    8.51     23
      Total      56       43      14      113

Figure 3.10: Fitting a hypothetical marriage pattern on Aguacatenango data (Romney (1971) and author's own calculations)
paper using data from the Aguacatenango village in Chiapas, Mexico from 1964 (1971).
The data on intermarriage between the three barrios is presented at the top of Figure
3.10 alongside a mosaic plot of the original pattern. Romney calculated12 that in the
village as a whole the odds of marrying within one’s group were 3.77, all things being
equal. He therefore sets up what he calls a ratio table with 3.77 on the diagonal and
ones in the remaining cells. This seed table is equivalent to determining the odds ratios
for the table. Using IPF this table is then “iterated back to the original margins”
as is shown in the bottom of Figure 3.10. This fitted table gives the hypothetical
frequencies if the odds for marrying within one's group were 3.77, this tendency towards endogamy was equal for all three barrios, and exogamous marriages were equally likely to be with either one of the remaining groups. The fit of this model can now be evaluated using the χ² or some other statistic.
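Romney's analytical use of IPF is easy to reproduce: the hypothetical ratio table with 3.77 on the diagonal is iterated back to the observed margins and the result compared with the observed counts. The sketch below is my own illustrative reconstruction; the χ² value it reports (roughly 16.7) is my own calculation and is not taken from Romney or Strauss.

```python
import numpy as np

def ipf_2d(seed, row_targets, col_targets, tol=1e-8, max_iter=1000):
    x = seed.astype(float).copy()
    for _ in range(max_iter):
        x *= col_targets / x.sum(axis=0)
        x *= (row_targets / x.sum(axis=1))[:, None]
        if np.abs(x.sum(axis=0) - col_targets).max() < tol:
            break
    return x

# Observed Aguacatenango intermarriage counts (Figure 3.10).
observed = np.array([[46.,  6., 1.],
                     [ 8., 24., 5.],
                     [ 2., 13., 8.]])

# The hypothetical "ratio table": within-barrio odds of 3.77, ones elsewhere.
seed = np.where(np.eye(3, dtype=bool), 3.77, 1.0)

fitted = ipf_2d(seed, observed.sum(axis=1), observed.sum(axis=0))
print(np.round(fitted, 2))              # close to the fitted panel of Figure 3.10

chi2 = ((observed - fitted) ** 2 / fitted).sum()
print(round(chi2, 2))                   # roughly 16.7 (my own calculation)
```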
Romney’s original paper (1971) was critically evaluated by Strauss (1977) who,
despite acknowledging the pioneering work, was unforgiving about the technical diffi-
12 The exact nature of this calculation is unclear from the original publication and, as Strauss notes later (1977), Romney was inconsistent on this point. For the purpose at hand it would not matter if he had chosen it arbitrarily or arrived at it in some heuristic manner.
culties and general ad hoc nature of the proposed methodology. Both of them seem to
have solved their differences amicably enough to produce a revised paper in 1982, where
Romney’s ideas are restated in the language of log-linear models and given a proper
formal framework. We will return to these models and this example in particular again
in the context of log-linear parameters (Section 4.4.2).
Looking for interesting diagonal or other patterns as Romney did indicated the
direction of future development of log-linear modelling (Roberts, 2002). His use of IPF
in doing so seems natural, although it is not clear where the idea originated. Romney
was not aware of Deming and Stephan’s work at the time, but he was a member of the
Harvard faculty and worked with Joel Levine and co-taught with Frederick Mosteller
with Yvonne Bishop as a TA. On the other hand his work on the 1971 endogamy paper
started as early as 1957, when Mosteller had apparently not yet worked out IPF; nor
were log-linear models in the making. According to Romney himself he does not claim
priority, but can equally not recall any other influences, making this perhaps yet another independent invention of IPF (personal communication, February 2009).
3.5 Beyond Classical Applications of IPF
This section has introduced the basic principles of IPF with numerical examples as well
as a variety of early applications. Such an approach has allowed us to explore IPF in
a relatively non-technical manner while giving an impression of the method’s general
applicability and flexibility. Although it would be a gross oversimplification to deny
these examples' theoretical underpinnings, the numerous ‘re-inventions’ mentioned here
give a correct impression of IPF’s intuitiveness and simplicity, which has very often led
to its use without regard to statistical theory.
The following two sections deal with the more formal and rigorous statistical frame-
works that underlie IPF. We first take a detailed look at log-linear modelling as a general
tool for analysing and describing interactions in contingency tables before discussing
entropy and information discrimination measures as the principles of inference this
analysis is based on.
Chapter 4
IPF and Log-linear models
As was shown in the previous chapter, the idea of iteratively rescaling tables to achieve desired margins was “invented” on several occasions to solve practical problems. More often than not, however, these authors did not consider IPF as more than a tool to achieve their objectives, with no theoretical backing and without reference to the technical literature, which in many cases already existed. The main developments that led
to theoretical explanations of IPF and allowed it to move beyond its classical applica-
tions occurred in the early 1960s in connection with the development of contingency
table analysis and log-linear models in particular. Fienberg (1992) identifies the circumstances that were conducive to this development as (i) the increased availability of high-speed computers, (ii) the impetus that came from several large-scale medical and epidemiological studies carried out in the United States, most notably the National Halothane Study (ed. Bunker (1969)), with an unprecedented complexity and dimensionality, and (iii) several breakthrough articles that would finally extend the framework, up until then focused largely on two- and three-dimensional tables. In particular the
works of Darroch (1962), Birch (1963) and Good (1963) provided the basis for log-
linear representation of cell frequencies and their relationship to maximum likelihood
estimation.
Due to the importance of log-linear modelling for more complex applications of IPF
the basic principles are described below. The following is based partly on the works of
Bishop et al. (1975), Upton (1978), Knoke & Burke (1980), Wickens (1989) and Agresti
(1996), while trying to be as comprehensive and yet as non-technical as possible. After
a general definition of the main principles of log-linear models, the first part of this
section uses an ad hoc example to introduce the general log-linear model equation in
both its multiplicative and additive formulations. What follows is an account of the
model parameter types and constraints at a level of detail that is unfortunately absent
from the above mentioned literature. The model is then explained within the wider
framework of the generalized linear model family, relating it to other, perhaps more
familiar parametric models. Next the interpretation of parameters is discussed and
Romney’s endogamy tables are revisited once more. Finally the concept of maximum
entropy is introduced through another sidestep - gravity models of spatial interactions,
which can be seen as a special case of log-linear models as well.
4.1 Introduction to Log-linear models
The general idea of variational independence has already been discussed in the previous
section. Log-linear models are based on that same principle: in a two dimensional table
the cell frequencies can be broken down into separate ’effects’. We can define each cell
count as being determined by the independent effects of
(i.) the total size of the table - constant term;
(ii.) the marginal distribution of the variables - first order coefficients;
(iii.) the interaction between the variables - second order coefficients.
These three effects are explicitly defined as independent of each other. The table margin
standardization covered in the previous section is based on the fact that the interaction
between the variables (iii.) can be separated from the marginal distributions (ii.) and
in a similar way the marginal distributions (ii.) are seen in relative terms and therefore
independent of the total table size (i.) This allows us to represent the cell frequencies
in the following multiplicative form (for a two dimensional table):
xij =τ·τA
i·τB
j·τAB
ij [4.1]
where the τ(tau) coefficients correspond to the respective effects. Although non-
hierarchical models are possible, they are rare special cases and will not be considered
here. The hierarchy principle stipulates that if a higher order interaction term is as-
sumed by the model, the appropriate lower order terms of that interaction must also
be included. Thus e.g. if τAB
ij is in the model, both τA
iand τB
jmust also be included.
By the same principle, the simplification of a model must always proceed by removing
the highest order interactions, before the lower ones.
These principles can be demonstrated for a 2 ×2 table (Table 4.1). The following
calculations are given here for purely illustrative purposes — the actual formulas for
log-linear coefficients and their interpretations will be presented in the next sub-section.
The hierarchical model for a 2×2 table is built up step-wise: first a uniform distribution
of cell counts with only the constant term, then the model of independence, with the
addition of both main effects and finally the saturated model, where all four terms are
included in the model.
In the simplest example of a uniform distribution (left-hand column of Table 4.1)
we can describe the crosstabulation with just one parameter, namely the constant term
Table 4.1: Three hierarchical log-linear models in a 2 × 2 table

Uniform distribution (x_ij = τ):
          B1     B2
  A1      25     25     50
  A2      25     25     50
          50     50    100

Independence (x_ij = τ · τ^A_i · τ^B_j):
          B1     B2
  A1       8     32     40
  A2      12     48     60
          20     80    100

Interaction (x_ij = τ · τ^A_i · τ^B_j · τ^{AB}_ij):
          B1     B2
  A1      18     22     40
  A2       2     58     60
          20     80    100

Coefficients:                 Uniform      Independence    Interaction
  τ                           25           25              25
  τ^A_1, τ^A_2                1, 1         0.8, 1.2        0.8, 1.2
  τ^B_1, τ^B_2                1, 1         0.4, 1.6        0.4, 1.6
  τ^{AB}_11, τ^{AB}_12        1, 1         1, 1            2.25, 0.6875
  τ^{AB}_21, τ^{AB}_22        1, 1         1, 1            0.167, 1.208
τ. All the remaining coefficients have a value of 1 and thus have no effect:
x_ij = τ
The second example is that of independence (middle column). We have kept the constant term, which is related to the total size of the table, the same as in the first example, but now the cell sizes are also determined by the row and column margins, the so-called first order coefficients τ^A_i and τ^B_j. As each variable has two categories, we can calculate four first order coefficients. The cell values are then calculated using Equation [4.1] with τ^{AB}_ij set to one. For example the top left-hand cell is:
x_11 = τ · τ^A_1 · τ^B_1 = 25 · 0.8 · 0.4 = 8
The final example is that of a fully saturated model, which requires all three sets of coefficients to determine the cell entries1 (right-hand column of Table 4.1). To calculate the top left-hand cell value, and for the sake of this illustration keeping the lower level coefficients the same as in the previous models, we must multiply coefficients from all three sets of terms: the constant, the two main effect terms (the row term and the column term) and the interaction term:
x_11 = τ · τ^A_1 · τ^B_1 · τ^{AB}_11 = 25 · 0.8 · 0.4 · 2.25 = 18
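The decomposition in Table 4.1 can be checked by simply multiplying the coefficients back together. A minimal sketch, using the coefficients from the right-hand (interaction) column of Table 4.1 (variable names are mine):

```python
import numpy as np

tau    = 25.0
tau_a  = np.array([0.8, 1.2])            # row (main) effects
tau_b  = np.array([0.4, 1.6])            # column (main) effects
tau_ab = np.array([[2.25,  0.6875],      # interaction effects
                   [0.167, 1.208 ]])

# x_ij = tau * tau_a[i] * tau_b[j] * tau_ab[i, j]
cells = tau * np.outer(tau_a, tau_b) * tau_ab
print(np.round(cells, 1))                # approximately [[18, 22], [2, 58]]
```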
Equation [4.1] describes the so-called multiplicative formulation of the log-linear model, which extends naturally to variables with more categories as well as
1 It should be borne in mind that this calculation is an ad hoc one, used to exemplify the decomposition of the information in the table; however it is not the one actually used in log-linear models. The reasons for this are explained in the next section.
higher dimensions. By taking the (natural) logarithm of this equation, it can also be expressed as a linear equation:
ln x_ij = ln(τ) + ln(τ^A_i) + ln(τ^B_j) + ln(τ^{AB}_ij)
or
ln x_ij = λ + λ^A_i + λ^B_j + λ^{AB}_ij [4.2]
where λ (lambda) equals ln(τ). As before the size of the coefficient determines the
relative strength of the effect and because ln(1) = 0, the absence of an effect has no
impact on the log of the cell frequency. Both equations are completely equivalent,
and the choice of which one to use is often one of convenience and familiarity2. The
multiplicative form seems similar to classical independence testing in contingency ta-
bles, where the expected counts are calculated by multiplying the relevant marginal
totals and dividing them by the total table count. This can easily be reformulated as
a multiplicative log-linear equation:
x̂_ij = (x_i+ · x_+j) / x_++ = (N·p_i+ · N·p_+j) / N = N · p_i+ · p_+j
where the table total can be interpreted as the ‘constant term’ and the marginal probabilities as the row and column effects respectively. Thus multiplicative formulations
seem more convenient in situations where probabilities are common currency and are
particularly popular in applications using transition tables (transportation research,
econometrics), because similar multiplicative models have been used in these situations
(e.g. the gravity model, the RAS model). On the other hand additive formulations
are appealing because of their similarity to analysis of variance or linear regression, an
analogy which allows the parameters to be interpreted in a similar way: λ as the constant term (analogous to α), and the sizes of the remaining coefficients corresponding to the strength of each effect (analogous to the β coefficients). The important difference is that the log-linear lambda coefficients determine the logarithm of the cell frequency and not the cell frequency itself.
4.2 Choice of Log-linear Parameters
Although the basic principle of log-linear models – a set of independent parameters for
each effect determines the cell frequencies – is simple enough, in practice the calculation
2Although technically it is only the additive formulation that can rightfully be called log-linear, as
it uses the logarithmic function to relate the cell count to a sum of explanatory parameters, in practice
the product equation is also referred to as a formulation of the log-linear model (see e.g. Willekens
(1982); Strauss & Romney (1982); Knoke & Burke (1980)). This seems sensible as they both refer to
different formulations of the same model (log-linear), and this convention is followed in this text.
of parameters and their interpretation can be more confusing. The first reason is
the already mentioned distinction between the additive and multiplicative form. For
example, in the first formulation a non-existent effect has a value of 0, while in the second it has a value of 1. Furthermore, very small values (e.g. 0.001) have radically different meanings depending on whether they are multiplied or added, a distinction which can be difficult to keep track of. Further difficulties arise from the fact that the basic formulas as
stated in Equations [4.1] and [4.2] are in fact over-parametrized. This can be seen from
the fact that the four cells of the saturated model in the example in Table 4.1 require
nine parameters in our ad hoc calculation: one constant term, four first order effects
and four second order or interaction terms. In this form — using so many parameters
— there is no unique way to state the model, and the model is thus not identifiable. This means the set of parameters given in the above example is not the only way to express the model. In fact we could express it with infinitely many combinations of nine coefficients.
Furthermore, the parameters of the above decomposition of the table are not vari-
ationally independent. The interaction parameters are interpreted as how much more
or less likely an observation is compared to what it would be under independence,
and this makes it sound intuitively appealing, however these parameters unfortunately
depend on the marginals. The independent decomposition of the effects is the only way log-linear parameters can reasonably be comparable, and although they cannot be interpreted with reference to the distribution under independence, as in the ad hoc example above, we know that they will be interpretable using the odds ratio (see Figure 3.4 on page 22 on odds ratios3; also Rudas (1998) for an in-depth discussion). This point cannot be stressed enough, as the appealing nature of the original example has led at least one researcher to optimistically yet inappropriately adopt this new parametrization, calling it the total sum reference coding scheme, without, however, realising that the marginal dependence made the resulting parameters impossible to compare correctly (Raymer et al., 2006; Raymer, 2007, 2008).
In order to state the model uniquely and its parameters independently, constraints
must be added to the model formulation to remove redundant parameters. The mini-
mum number of independent parameters required to completely define a crosstabulation
is equal to the number of cells in the table. This can be verified by adding together the
degrees of freedom associated with each lambda or tau term in an I × J table. There
is one overall constant effect coefficient. For each of the main effects, we require I−1
and J−1 coefficients as the last one can be determined from the others. Similarly for
3 The relevant interaction parameters for the bottom two tables in that figure would be [1.56, 0.44; 0.94, 1.06] and [1.33, 0.66; 0.66, 1.33], although the odds ratio equals four in both cases. Without knowing the row and column effects these parameters cannot be interpreted correctly.
the interaction effects (I−1)·(J−1) coefficients are sufficient. If we add the number of parameters required for each term we get the number of cells I·J:
1 + (I−1) + (J−1) + (I−1)·(J−1) = I·J
where the terms correspond to the constant effect, the main effects and the interaction effects respectively.
Thus the above example of the saturated model should be identifiable with only four
parameters, the independence model with only three (1+(I−1)+(J−1) = I+J−1 = 3)
, and the uniform model with a single one. There are two main types of constraints
that are typically used to do this4:
(i.) Indicator contrast5 constraints remove one parameter in each term by setting one of the categories as a reference category, so the remaining coefficients are relative to that category (Plackett, 1983). In the multiplicative form Equation [4.1] is then supplemented by
τ^A_1 = τ^B_1 = 1 and τ^{AB}_ij = 1 whenever i = 1 or j = 1 [4.3]
and in the additive form Equation [4.2] requires
λ^A_1 = λ^B_1 = 0 and λ^{AB}_ij = 0 whenever i = 1 or j = 1 [4.4]
for a unique solution. In both cases the first categories and the first cell were taken as the reference categories, but this is of course arbitrary and the choice can depend on the variables. The remaining coefficients are then interpreted relative to these categories. The parameters for the reference categories being known means fewer parameters need to be estimated and all the redundant parameters are therefore eliminated, making the model uniquely identifiable.
(ii.) Deviation contrast6 constraints require all the coefficients across a category to multiply to one in the case of the multiplicative model (Goodman, 1970; Bishop et al., 1975):
∏_i τ^A_i = ∏_j τ^B_j = ∏_ij τ^{AB}_ij = 1 [4.5]
which is equivalent, in the case of the additive model, to all the coefficients summing to zero:
∑_i λ^A_i = ∑_j λ^B_j = ∑_ij λ^{AB}_ij = 0 [4.6]
4 Other types of constraints are of course also possible, but are less common, or rather are not available in statistical software (see e.g. Hendrickx (2005)).
5Also known as regression-like constraints (Long, 1984, p. 405) or cornered-effect coding (Raymer,
2007, p. 989).
6Also known as ANOVA-like constraints (Long, 1984, p. 405) or geometric mean coding (Raymer,
2007, p. 989).
In this case the coefficients are relative to each other. If we take the example of the main effect of a dichotomous variable A_i, i = {1, 2}, then Equation [4.5] leads to τ^A_1 = 1/τ^A_2, or equivalently Equation [4.6] to λ^A_1 = −λ^A_2, thus again making one of the parameters redundant in each term.
The choice of type of constraints determines the way in which the coefficients are
interpreted: relative to a reference category or relative to each other. The complications
that arise from this fact, and from the fact that various computer programmes use different constraints in calculating the model parameters, lead to difficulties in interpretation and are seen as a reason for the under-utilization of log-linear models, especially in the social sciences (Hendrickx, 2005; Holt, 1979).
Examples of both types of constraints, used in calculating the parameters for both the multiplicative and the additive form of the model for the previous example, are given in Table 4.2. All four sets of parameters are equally ‘correct’, and as they refer to a four-celled table, all can be calculated from four basic coefficients (I·J = 4). This fact is most obvious in the case of the indicator contrast: only four parameters are required, while the remaining five are set to 1 or 0, depending on the model formulation.
In the multiplicative formulation (first column in Table 4.2) this means7:
τ = 18,  τ^A_2 = 0.11,  τ^B_2 = 1.22,  τ^{AB}_22 = 23.72  and  τ^A_1 = τ^B_1 = τ^{AB}_11 = τ^{AB}_12 = τ^{AB}_21 = 1
So the value of the bottom left-hand cell, for example, where A = 2 and B = 1, can be calculated in the following manner:
x_21 = τ · τ^A_2 · τ^B_1 · τ^{AB}_21 = 18 · 0.11 · 1 · 1 = 2
In the additive formulation we can similarly take the parameters from the third column to calculate
ln(x_21) = λ + λ^A_2 + λ^B_1 + λ^{AB}_21 = 2.89 − 2.20 + 0 + 0 = 0.69
Of course, since we are using the additive formulation of the model, we have actually calculated ln(x_21), so in order to get the actual cell value we need to take the exponential:
x_21 = e^0.69 = 2
If we take the example of the multiplicative model using deviation contrast to constrain
the parameters (second column), we see that the nine coefficients, taking into account
the constraints in Equation [4.5] are defined in the following way:
τ = 14.64 (constant effect),
τ^A_1 = 1/τ^A_2 = 1.36 (main effect A),
τ^B_1 = 1/τ^B_2 = 0.41 (main effect B),
τ^{AB}_11 = τ^{AB}_22 = 1/τ^{AB}_12 = 1/τ^{AB}_21 = 2.21 (interaction effects)
7The numerical examples are all formatted to show the numbers rounded to two decimal places,
however the calculations were all done without rounding, hence the occasional discrepancies.
Table 4.2: Multiplicative and additive log-linear parameters for two styles of constraints fitted to the saturated model example in Table 4.1

Multiplicative models (x_ij):
                 Indicator contrast   Deviation contrast
  τ              18                   14.64
  τ^A_1          1                    1.36
  τ^A_2          0.11                 0.74
  τ^B_1          1                    0.41
  τ^B_2          1.22                 2.44
  τ^{AB}_11      1                    2.21
  τ^{AB}_12      1                    0.45
  τ^{AB}_21      1                    0.45
  τ^{AB}_22      23.72                2.21

Additive models (ln(x_ij)):
                 Indicator contrast   Deviation contrast
  λ              2.89                 2.68
  λ^A_1          0                    0.31
  λ^A_2          −2.2                 −0.31
  λ^B_1          0                    −0.89
  λ^B_2          0.20                 0.89
  λ^{AB}_11      0                    0.79
  λ^{AB}_12      0                    −0.79
  λ^{AB}_21      0                    −0.79
  λ^{AB}_22      3.17                 0.79
Using only these four parameter values we can calculate the upper right-hand cell, where A = 1 and B = 2:
x_12 = τ · τ^A_1 · τ^B_2 · τ^{AB}_12 = τ · τ^A_1 · (1/τ^B_1) · (1/τ^{AB}_11) = 14.64 · 1.36 · (1/0.41) · (1/2.21) = 22
And again, similarly, for the additive model using deviation contrasts only four parameter values are needed (final column):
λ = 2.68 (constant effect),
λ^A_1 = −λ^A_2 = 0.31 (main effect A),
λ^B_1 = −λ^B_2 = −0.89 (main effect B),
λ^{AB}_11 = λ^{AB}_22 = −λ^{AB}_12 = −λ^{AB}_21 = 0.79 (interaction effects)
Thus we can again calculate the value of the upper right-hand cell x_12:
ln(x_12) = λ + λ^A_1 + λ^B_2 + λ^{AB}_12 = λ + λ^A_1 − λ^B_1 − λ^{AB}_11 = 2.68 + 0.31 + 0.89 − 0.79 = 3.09
x_12 = e^3.09 = 22
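Both reconstructions are easy to verify numerically. The sketch below rebuilds all four cells of the saturated example from the four deviation-contrast parameters in the final column of Table 4.2, using the zero-sum constraints to fill in the remaining coefficients; the small discrepancy in the last cell is only due to the rounding of the printed parameters.

```python
import numpy as np

# Deviation-contrast parameters (additive form, Table 4.2, rounded).
lam, lam_a1, lam_b1, lam_ab11 = 2.68, 0.31, -0.89, 0.79

# The zero-sum constraints give the remaining coefficients for free.
lam_a  = np.array([lam_a1, -lam_a1])
lam_b  = np.array([lam_b1, -lam_b1])
lam_ab = np.array([[ lam_ab11, -lam_ab11],
                   [-lam_ab11,  lam_ab11]])

log_cells = lam + lam_a[:, None] + lam_b[None, :] + lam_ab
print(np.round(np.exp(log_cells), 1))   # close to [[18, 22], [2, 58]] (rounding only)
```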
To reiterate then, the choice of model formulation and parameter constraint contrast
are irrelevant for the actual model fit, but should be borne in mind when interpreting
the coefficients. Although we will focus on the multiplicative model using deviation
contrast constraints, the interpretation of which will be explained below, the additive
formulation allows us to take a step back and look at the wider family of generalized
linear models using the model formulation and contrast definitions just given.
4.3 Log-linear Models as Generalized Linear Models
Log-linear models fall into a wider class of models called generalized linear models. The
statement of this family of models by Nelder & Wedderburn (1972) usefully unifies
a whole set of probabilistic models for both continuous and discrete data, including
ordinary least squares regression, logistic regression, logit and probit models, and log-linear
models. The GLM framework allows a unified methodological approach to producing
parameter estimates for a whole family of models and using the additive model formu-
lation we will be able to see how log-linear models fit in. A generalized linear model
consists of three elements (ibid. p. 372):
(i.) Random component - a dependent variable y with a probability distribution from the exponential family (this includes the normal, gamma, beta, Poisson and multinomial distributions) and expected value µ,
(ii.) Systematic component - a linear relationship linking a set of explanatory variables to form the linear predictor, expressed in matrix notation as βX, and
(iii.) Link function - the relationship between the linear predictor and the expected value of the dependent variable, i.e. a link g(µ) between the random and systematic components.
In the case of ordinary linear regression, the random component y with the expected value µ is a continuous variable assumed to be normally distributed; the systematic component is the linear combination of independent variables and is written g(µ) = β0 + β1x1 + ··· + βkxk, and the link function is the identity: g(µ) = µ. This states the model in a very familiar way, the only distinction being that β0 is used instead of the more common α as the constant.
In the case of log-linear models the cell counts are considered the random component y with an expected value of π, and have a discrete random error with a Poisson distribution8. The systematic component is again a linear combination of independent variables, g(π) = β0 + β1x1 + ··· + βkxk, and the link function connecting both components is the natural log: g(π) = ln π. Thus
ln π = β0 + β1x1 + ··· + βkxk [4.7]
represents the GLM statement of the log-linear model. This is in fact equivalent to the additive expression of the model given in Equation [4.2] using indicator contrasts to constrain the parameters as in Equation [4.4]. To see why this is so, consider a
8 While x is used throughout this text to denote cell counts, the generalized nature of the GLM framework requires this convention to be suspended briefly and have x refer to the independent variables determining the cell count y.
crosstabulation of two dichotomous variables A and B, for which the GLM is:
ln π = β0 + β1x1 + β2x2 + β3x3
The x's then refer to membership in the individual variable categories. Dummy coding is used for the indicator variables (note that, with the first categories as the reference, the dummies mark membership in the second categories):
x1 = 1 if A = A2 and 0 if A = A1;   x2 = 1 if B = B2 and 0 if B = B1;   x3 = 1 if A = A2 ∧ B = B2 and 0 otherwise.
The β coefficients are then model parameters to be estimated and are equivalent to the lambda coefficients in the third column of Table 4.2:
• the GLM constant is equal to the constant term: β0 = λ
• the next two β coefficients are equivalent to the main effects: β1 = λ^A_2 and β2 = λ^B_2
• the last coefficient is the interaction term: β3 = λ^{AB}_22
Thus calculating the expected cell frequency in the upper right hand corner cell would
proceed as follows: since A=A1and B=B2that means that x1= 1 x2= 0 and
x3= 0, therefore:
ln π=β0+β1x1+β2x2+β3x3
=λ+λA
2·1 + λB
2·0 + λAB
22 ·0
= 2.89 −2.20
= 0.69
π=e0.69 = 2
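To make the equivalence concrete, the following R sketch, using purely illustrative counts (not the values behind Table 4.2), fits the saturated Poisson log-linear model for a 2 × 2 table under both indicator (treatment) and deviation (sum) contrasts; the coefficients differ, but the fitted cells are identical.

# Saturated log-linear model for a 2x2 table fitted as a Poisson GLM.
# The counts below are purely illustrative.
d <- data.frame(A = factor(c("A1", "A1", "A2", "A2")),
                B = factor(c("B1", "B2", "B1", "B2")),
                x = c(18, 2, 6, 10))

# Indicator (treatment) contrasts are R's default
m_ind <- glm(x ~ A * B, family = poisson(link = "log"), data = d)

# Deviation (sum) contrasts
m_dev <- glm(x ~ A * B, family = poisson(link = "log"), data = d,
             contrasts = list(A = "contr.sum", B = "contr.sum"))

coef(m_ind)     # beta coefficients under indicator contrasts
coef(m_dev)     # lambda coefficients under deviation contrasts
fitted(m_ind)   # the saturated model reproduces the observed counts ...
fitted(m_dev)   # ... under either contrast choice

Under the sum contrasts the intercept should equal the mean of the log cell counts, i.e. the constant term of the deviation-contrast formulation discussed below.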
Log-linear models therefore belong to the family of generalized linear models, the parametrisation of which, as we have seen, is equivalent to the additive formulation using indicator contrasts. Although the importance of the GLM framework has wider repercussions, e.g. common approaches to parameter calculation, it also highlights one of the possible reasons for choosing the mentioned formulations and constraints.
Nevertheless, the remainder of this section will focus exclusively on deviation contrasts. Since all the mentioned formulations are mathematically equivalent, and the choice between them is often the result of previous formal training and/or the type of application, the main reason for excluding indicator contrasts from the following exposition is that they unnecessarily complicate the analysis by forcing one to choose, and then defend, an often arbitrary reference category.
4.4 Parameter Interpretation and Cell Fitting
4.4.1 Deviation Contrast Type Constraints
In order to clarify the meaning of the deviation contrast effect parameters we continue with the 2 × 2 crosstabulation example. We can express the odds ratio using the terms of the multiplicative model:

$$\theta = \frac{x_{11} \cdot x_{22}}{x_{21} \cdot x_{12}} = \frac{(\tau \cdot \tau^A_1 \cdot \tau^B_1 \cdot \tau^{AB}_{11}) \cdot (\tau \cdot \tau^A_2 \cdot \tau^B_2 \cdot \tau^{AB}_{22})}{(\tau \cdot \tau^A_2 \cdot \tau^B_1 \cdot \tau^{AB}_{21}) \cdot (\tau \cdot \tau^A_1 \cdot \tau^B_2 \cdot \tau^{AB}_{12})} = \frac{\tau^{AB}_{11} \cdot \tau^{AB}_{22}}{\tau^{AB}_{21} \cdot \tau^{AB}_{12}}$$

Because we know from Equation [4.5] defining the deviation contrast that due to the constraints the following equalities hold:

$$\tau^{AB}_{11} = \tau^{AB}_{22} = 1/\tau^{AB}_{12} = 1/\tau^{AB}_{21}$$

it therefore follows that

$$\theta = (\tau^{AB}_{11})^4 \quad \text{or} \quad \tau^{AB}_{11} = \sqrt[4]{\theta} \qquad [4.8]$$

allowing us to interpret the interaction coefficient $\tau^{AB}_{11}$ as the fourth root of the odds
ratio. In order to relate the main effect terms to the cell values, we can define the average (geometric mean) of the odds of being in $A_1$ as opposed to $A_2$ given $B$:

$$\text{odds}(A_1|B_1) = \frac{x_{11}}{x_{21}} \quad \text{and} \quad \text{odds}(A_1|B_2) = \frac{x_{12}}{x_{22}} \quad \text{thus} \quad \text{average odds}(A_1|B) = \sqrt{\frac{x_{11} \cdot x_{12}}{x_{21} \cdot x_{22}}}$$

Inserting the multiplicative log-linear parameters we get

$$\text{average odds}(A_1|B) = \sqrt{\frac{\tau^A_1 \cdot \tau^A_1}{\tau^A_2 \cdot \tau^A_2}} = \frac{\tau^A_1}{\tau^A_2}$$

which is to say the average odds of being in category $A_1$ as opposed to $A_2$ given $B$ are equal to the ratio of the categories' tau coefficients. Because we know from Equation [4.5] that $\tau^A_1 = 1/\tau^A_2$ we can express the main effect parameter as the square root of the average odds:

$$\tau^A_1 = \sqrt{\text{average odds}(A_1|B)} = \sqrt[4]{\frac{x_{11} \cdot x_{12}}{x_{21} \cdot x_{22}}} \qquad [4.9]$$

The equivalent can also be done for the odds of being in $B_1$ as opposed to $B_2$ given $A$ to define

$$\tau^B_1 = \sqrt{\text{average odds}(B_1|A)} = \sqrt[4]{\frac{x_{11} \cdot x_{21}}{x_{12} \cdot x_{22}}} \qquad [4.10]$$

If we multiply all four cell values together and express them using the multiplicative Equation [4.1], all the terms cancel each other out because of the equalities expressed in [4.5], leaving only the constant term, also known as the grand mean:

$$\tau = \sqrt[4]{x_{11} \cdot x_{12} \cdot x_{21} \cdot x_{22}} \qquad [4.11]$$

Therefore the constant term in the multiplicative model is the geometric mean of the cell frequencies. The equivalent of Equations [4.8] to [4.11] can be calculated for the
additive log-linear formulation using the constraints from [4.6]; Table 4.3 lists the formulas for the lambdas alongside the ones just calculated above.

Table 4.3: Relationships between cell frequencies and multiplicative/additive coefficients

Constant term:            $\tau = \sqrt[4]{x_{11} \cdot x_{12} \cdot x_{21} \cdot x_{22}}$;   $\lambda = \tfrac{1}{4}(\ln x_{11} + \ln x_{12} + \ln x_{21} + \ln x_{22})$
Main effect of [A]:       $\tau^A_1 = \sqrt[4]{\frac{x_{11} \cdot x_{12}}{x_{21} \cdot x_{22}}}$;   $\lambda^A_1 = \tfrac{1}{4}(\ln x_{11} + \ln x_{12} - \ln x_{21} - \ln x_{22})$
Main effect of [B]:       $\tau^B_1 = \sqrt[4]{\frac{x_{11} \cdot x_{21}}{x_{12} \cdot x_{22}}}$;   $\lambda^B_1 = \tfrac{1}{4}(\ln x_{11} - \ln x_{12} + \ln x_{21} - \ln x_{22})$
Interaction effect [AB]:  $\tau^{AB}_{11} = \sqrt[4]{\frac{x_{11} \cdot x_{22}}{x_{12} \cdot x_{21}}}$;   $\lambda^{AB}_{11} = \tfrac{1}{4}(\ln x_{11} - \ln x_{12} - \ln x_{21} + \ln x_{22})$

The system of
four equations relating the four model parameters to the four cell counts can be used as shown to calculate the model parameters, or can be rearranged to calculate the cell counts if the model parameters are known.
For clarity this exposition was given for a two-dimensional table of dichotomous variables; however it extends to larger tables of higher dimensions. Thus, given constraints analogous to those used in Equations [4.5] and [4.6], a table with $k$ cells in however many dimensions can be fully specified with a log-linear or multiplicative model using $k$ terms.
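As a quick numerical check of Table 4.3, the R sketch below (again using arbitrary illustrative counts) computes the additive coefficients directly from the formulas in the table and then reassembles a cell count from the parameters.

# Illustrative 2x2 table of counts (values are arbitrary)
x <- matrix(c(18, 6, 2, 10), nrow = 2,
            dimnames = list(A = c("A1", "A2"), B = c("B1", "B2")))
l <- log(x)

# Additive (lambda) coefficients under deviation contrasts (Table 4.3)
lambda      <- mean(l)
lambda_A1   <- (l[1, 1] + l[1, 2] - l[2, 1] - l[2, 2]) / 4
lambda_B1   <- (l[1, 1] - l[1, 2] + l[2, 1] - l[2, 2]) / 4
lambda_AB11 <- (l[1, 1] - l[1, 2] - l[2, 1] + l[2, 2]) / 4

# Multiplicative (tau) coefficients are the exponentiated lambdas
exp(c(lambda, lambda_A1, lambda_B1, lambda_AB11))

# Rebuilding the upper-left cell from the parameters returns x[1,1] exactly
exp(lambda + lambda_A1 + lambda_B1 + lambda_AB11)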
4.4.2 Prescribed Interaction Models
The log-linear model description has so far focused on hierarchical models, which in two dimensions means either a uniform model, containing only the constant effect; the model of independence, which adds the main effects; or the fully saturated model, with an extra parameter for each cell in the table. The possibilities of course increase when more dimensions are added (see Section 9.3), but first we take a look at the variety of models that are still possible in two dimensions and return to Romney's endogamy tables.
As we have seen, log-linear models explicitly separate the interaction effects from the row and column effects. Thus the table of interaction parameters (or the corresponding odds ratios) is the most parsimonious way of expressing the interaction pattern. We saw at the end of Section 3.4 how Romney tested a hypothetical pattern on his endogamy data, and within the log-linear model framework this hypothetical pattern can now be explicitly formulated.
To reiterate, the general form of the multiplicative model using deviation contrasts in a two-dimensional table is

$$x_{ij} = \tau \cdot \tau^A_i \cdot \tau^B_j \cdot \tau^{AB}_{ij} \qquad [4.1]$$

In a saturated model, each combination of $i$ and $j$ is associated with its own interaction parameter $\tau^{AB}_{ij}$. We can however try to impose a pattern on these parameters. Table 4.4 shows four examples of idealized marriage patterns for a case of four intermarrying groups. The simplest of these models (barring independence) is the simple endogamy pattern, also referred to as homogeneous endogamy (Strauss & Romney, 1982), and essentially the same type of model as the one presented in Section 3.4 by Romney. We can now state this model formally:

$$x_{ij} = \tau \cdot \tau^A_i \cdot \tau^B_j \cdot \delta \quad \text{for } i = j$$
$$x_{ij} = \tau \cdot \tau^A_i \cdot \tau^B_j \quad \text{for } i \neq j$$
For a more complicated pattern, the symmetric decay model assumes a hierarchical ordering of the different intermarrying groups, and the tendency towards marriage depends on how far removed the spouse's group is:

$$x_{ij} = \tau \cdot \tau^A_i \cdot \tau^B_j \cdot \delta_1 \quad \text{for } i = j$$
$$x_{ij} = \tau \cdot \tau^A_i \cdot \tau^B_j \cdot \delta_2 \quad \text{for } |i - j| = 1$$
$$x_{ij} = \tau \cdot \tau^A_i \cdot \tau^B_j \cdot \delta_3 \quad \text{for } |i - j| = 2$$
$$x_{ij} = \tau \cdot \tau^A_i \cdot \tau^B_j \quad \text{for } |i - j| = 3$$
Of course many more patterns such as these can be hypothesized and compared with the observed data. The cells are then fitted using a slightly adjusted form of IPF. For the simple endogamy pattern this means adding one more step to the IPF algorithm: after adjusting the rows and the columns, a third step adjusts the diagonal cells so that they sum up correctly to $\sum_i x_{ii}$, as sketched below.
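A minimal R sketch of this adjusted IPF is given below; the row, column and diagonal targets are illustrative values, not Romney's data, and the seed is simply a matrix of ones.

# IPF with an extra diagonal-adjustment step (simple endogamy pattern).
# All target values are illustrative.
row_tot  <- c(40, 30, 20, 10)
col_tot  <- c(35, 30, 20, 15)
diag_tot <- 60

x <- matrix(1, 4, 4)   # uninformative seed

for (iter in 1:100) {
  x <- sweep(x, 1, row_tot / rowSums(x), "*")     # step 1: fit the rows
  x <- sweep(x, 2, col_tot / colSums(x), "*")     # step 2: fit the columns
  diag(x) <- diag(x) * diag_tot / sum(diag(x))    # step 3: fit the diagonal total
}

rowSums(x); colSums(x); sum(diag(x))   # should (approximately) match the targets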
Unfortunately we cannot avoid some further terminological confusion. Although we have stated the above models in log-linear formulations, the fact is that the usual constraints no longer apply. Furthermore, this type of model is particularly useful for ordinal variables, although this is not a necessary condition.

Table 4.4: Four Hypothetical intermarriage models used for testing endogamy patterns

Independence: $\begin{pmatrix} 1&1&1&1 \\ 1&1&1&1 \\ 1&1&1&1 \\ 1&1&1&1 \end{pmatrix}$   Simple Endogamy: $\begin{pmatrix} \delta&1&1&1 \\ 1&\delta&1&1 \\ 1&1&\delta&1 \\ 1&1&1&\delta \end{pmatrix}$   Quasi Independence: $\begin{pmatrix} \delta_1&1&1&1 \\ 1&\delta_2&1&1 \\ 1&1&\delta_3&1 \\ 1&1&1&\delta_4 \end{pmatrix}$   Symmetric Decay: $\begin{pmatrix} \delta_1&\delta_2&\delta_3&1 \\ \delta_2&\delta_1&\delta_2&\delta_3 \\ \delta_3&\delta_2&\delta_1&\delta_2 \\ 1&\delta_3&\delta_2&\delta_1 \end{pmatrix}$

The models were described as association models by Goodman in 1979 and as such are considered an extension of log-linear models (Frick & Axhausen, 2004). However, another school of thought considers log-linear models to be special cases of association models, but calls them generalized log-linear models instead - not to be confused with generalized linear models (Haberman, 1974). Still others prefer the term prescribed interaction models (Rudas, 1998, 1991).
The latter term seems most descriptive, and it should further be kept in mind that these prescribed interactions need not be neat symmetrical patterns like the ones in the above examples for endogamy. In fact the prescribed interaction can just as easily be the interaction from a sample, or from a historical or other comparable source, as we have seen in the classical applications of IPF. This is of course implicit in these applications, where the interaction from one source is grafted upon the margins from another. The interaction is nothing more than a set of odds ratios, and these are equivalent to the interaction terms of the log-linear model.
4.5 Gravity Models of Spatial Interaction as Log-Linear
Models
Transportation research has already been mentioned as a field where IPF-like procedures were "invented" several times. Despite their ostensible simplicity as descriptions of square, two-dimensional mobility matrices, the general family of models of spatial interaction plays an important part in the history of IPF, bridging theoretically distinct fields from entropy maximization and information minimization to log-linear models, and therefore deserves more careful consideration.
Sir Isaac Newton’s Theory of Universal Gravitation describes the strength of the
gravitational force between two bodies as proportional to the bodies’ masses and in-
versely proportional to their squared distance. The earliest known explicit application
of this principle in social sciences is ascribed to Carey, who wrote in 1858 of man as the
molecule of society and “the Great Law of Molecular Gravitation [as] the indispensable
condition of the existence of the being known as man [. . . ] the greater the number
collected in a given space the greater is the attractive force that is there exerted”
(quoted in Carrothers, 1956, p. 94). Other early formulations of the gravity principle
include Ravenstein in migration (1885), Lill for railway travel (1891), Young for migra-
tion (1924) and Reilly for retail (1931). All of these applications can be considered as
only partial expressions of the gravity principle, as they were either purely descriptive
in nature and/or refer only to the attraction forces governing population movement.
The first complete formulation of gravity as it applies to human spatial interactions is
therefore commonly attributed to John Q. Stewart, who is credited with the Laws of
Demographic Gravitation (1948). In direct analogy with Newton, the two masses were explicitly stated as the population sizes, while their interaction was inversely related to the square of their distance9. Ever since then, variations on this theme under the general heading of gravity models have dominated the modelling of spatial interaction flows (Willekens, 1999). Thus, in its simplest form, the gravity model can be stated as:
$$T_{ij} = k \cdot W_i \cdot W_j \cdot d_{ij}^{-2} \qquad [4.12]$$

where the interaction between $i$ and $j$, usually the number of trips between them $T_{ij}$, is proportional (constant $k$) to the number of people in the origin and destination, $W_i$ and $W_j$ respectively, and inversely proportional to the square of the distance $d_{ij}$.
Given the subject of the previous sections, the similarity with the multiplicative log-
linear model seems clear. However it was not until the 1980s that this connection was
formally made (Willekens, 1980, 1983). Until then however, there were several problems
with the equation as stated. For one thing, there was no theoretical justification why
the flows would decline with the square of the distance, and not some other power.
Thus the exponent became a parameter of the model to be estimated and would perhaps
vary from application to application10 (Anderson, 1955). Furthermore distance was not
necessarily seen as the only obstacle to population flows, so a generalized cost function
was proposed (also known as the deterrence or impedance function). This would include
various other inverse powers but also more complex exponential functions of distance,
cost, perceived costs and accessibility (Willekens, 1982; Wilson, 1970). Experimentation
with these various cost functions would eventually arrive at a unanimous conclusion:
the choice of distribution function can hardly improve the accuracy of estimates and
no general function can likely be identified (Openshaw, 1979; Willekens, 1983).
A more serious problem had to do with the specification of sending and attraction
factors. These could be defined simply as population totals, although other options, e.g. acreage of industrial and commercial land, could also be used as a proxy for the propensity to attract interactions11. Regardless of the measure used, and in transportation
9Stewart, although an astrophysicist, originally observed that“The number of undergraduates or
alumni of a given college who reside in a given area is directly proportional to the total population of
that area and inversely proportional to the distance from the college.” (1941, p. 89). In his later article
formalizing his previous observations he argues for a social physics noting that “Continued insistence
on examining people’s purposes and motives only blocks the way to a science of society as a whole”
(Stewart, 1948) before he goes on to formalize the Laws of Demographic Gravitation. In exchanging
Newtonian mass with the number of people in his equation, he assigns a standard weight to the “sort
of person considered” and “that of the ‘average American’ is taken as unity. Presumably [that] of an
Australian aborigine, for example, is on this scale much less than one.” (ibid., p. 34). Stated more
appropriately later models distinguished types of person based on car ownership and access to different
modes of transport and assigned them appropriate weights (Wilson, 1970).
10According to Anderson the exponent should be inversely related to the destination population size; Carrothers even went as far as to suggest the distance exponent should be inversely related to the distance itself (1956, p. 97). An additional problem was related to defining the 'distances' for the intra-zonal interactions, i.e. the values of $d_{ij}$ on the diagonal where $i = j$. This tended to be solved using quite arbitrary rules of thumb such as taking a quarter of the distance to the nearest zone.
11The problem of heterogeneous composition of populations was noted already by Stewart (see foot-
note 9 above) and was treated by adding “molecular weights” proportional to “individual’s capacities
for sociological interaction” and even raising the numerator of the different population elements (dif-
models it was usually the number of travellers originating ($O_i$) and ending ($D_j$) in particular zones, the classical model led to internal inconsistencies: the predicted flows would not add up correctly to the known origin and destination totals12. Furthermore, the model could also lead to clearly false predictions. Thus Wilson observed that the way the model is stated has an obvious deficiency: "if a particular $O_i$ and a particular $D_j$ are both doubled, then the number of trips between these zones would quadruple according to the equation [4.13], when it would be expected that they would double also" (1967, p. 253). Essentially such models could predict impossible solutions, with more (or fewer) trips originating than were actually possible. Wilson's solution was
more (or fewer) trips originating than were actually possible. Wilson’s solution was
to introduce a set of constraints to make sure the number of predicted trips summed
up correctly, which represented a complete reformulation of spatial interaction models
(Bennett & Haining, 1985, p. 3). The complete, doubly constrained model can now be
stated as
$$T_{ij} = A_i \cdot O_i \cdot B_j \cdot D_j \cdot F_{ij} \qquad [4.13]$$

where $A_i$ and $B_j$ are balancing factors13 that ensure the predicted flows ($T_{ij}$) sum up to the origin and destination totals:

$$\sum_j T_{ij} = O_i \quad \text{and} \quad \sum_i T_{ij} = D_j$$

while $F_{ij}$ refers to the chosen cost or impedance function14.
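Because $A_i$ and $B_j$ depend on each other (see footnote 13), in practice they are found by alternating between the two constraint sets, which is exactly the row and column scaling of IPF applied to $F_{ij}$. A small R sketch, with illustrative origin totals, destination totals and an inverse-square impedance, might look as follows.

# Doubly constrained spatial interaction model fitted by iterative scaling.
# Origin/destination totals and distances are illustrative values only.
O <- c(400, 300, 300)
D <- c(350, 450, 200)
d <- matrix(c(2, 5, 9,
              5, 3, 7,
              9, 7, 2), nrow = 3, byrow = TRUE)
F_ij <- d^-2            # simple inverse-square impedance function

T_ij <- F_ij            # start the scaling from the impedance matrix
for (iter in 1:100) {
  T_ij <- sweep(T_ij, 1, O / rowSums(T_ij), "*")   # enforce the origin totals
  T_ij <- sweep(T_ij, 2, D / colSums(T_ij), "*")   # enforce the destination totals
}

round(T_ij)                      # predicted flows
rowSums(T_ij); colSums(T_ij)     # reproduce O and D

The converged flows preserve the odds ratios of $F_{ij}$, a point taken up in Equation [4.14] below.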
Wilson further responded to a common criticism of the gravity model and ‘social
physics’ in general: not only was the model in its unconstrained form at best heuristic,
but the analogy with Newton’s gravitational law was seen as an inappropriate the-
oretical basis for describing social behaviour (Nijkamp, 1977, p. 13). His proposed
reformulation gave a new interpretation and theoretical base to the ‘gravity’ model
based on the principle of entropy. Without explicitly stating the mathematical deriva-
tion, this principle is briefly described here.
The distribution of trips that is sought (Tij) must of course be subject to the above
mentioned constraints. This distribution is seen as the macro state and is all we are
interested in. It is composed however of individual travellers, and this micro state
completely describes the system. Many different micro states can give rise to the same
ferentiated by sex, income, education etc.) to a power other than one (Carrothers, 1956, pp. 97-98).
12For a very detailed yet non-technical account of this internal inconsistency problem see Senior
(1979, p. 180-191).
13$A_i$ and $B_j$ are dependent on each other and are therefore calculated iteratively.
14The alternative singly constrained models were used in situations where only the sizes on one side of the flow were known: e.g. if the number of available jobs per zone was known, but not the number of workers in each zone, an attraction constrained model could be set up to make sure the correct number of workers was predicted to end up working in each zone ($D_j$). Similarly, if the number of people leaving from each zone was known, then a "production constrained model" would predict interactions that correctly sum up to the known $O_i$. For excellent worked examples demonstrating all the models in Wilson's family of spatial interaction models again see Senior (1979).
trip distribution. For example, if from a total of $N$ travellers only one travelled from $i$ to $j$, there are $N$ possibilities for who this person could be. If there were two trips made between these two origins and destinations, this could have happened in $N(N-1)$ ways. The basic principle of entropy maximization states that the most likely distribution of trips $T_{ij}$ is the one that could have occurred in the largest number of ways. That is to say, given no other information than the given constraints, the probability of a given trip distribution is proportional to the number of micro states that can give rise to that distribution15 (Wilson, 1970, pp. 1-4; Wilson, 1967, p. 256). Entropy can therefore
be conceptualized as a measure of uncertainty about the micro state of the system,
so maximising the entropy is equivalent to introducing the least possible amount of
information about the micro state, i.e. information we do not have.
Following the principle of maximum entropy, the problem of predicting population flows becomes a problem of maximising the entropy of the $T_{ij}$ distribution subject to the given constraints, i.e. finding the distribution $T_{ij}$ that can be the result of the largest number of distinct micro states. As it turns out, the resulting equation16 is equivalent to the doubly constrained gravity model given in Equation [4.13]! Thus the 'gravity' model is newly derived without any reference to Newtonian gravity, but is effectively given a new statistical theoretical basis17.
As opposed to explaining human spatial interaction with the attraction forces of
large populations, the entropy maximization principle allows the gravity model to be
interpreted by analogy to gas molecules18, which although difficult to reconcile with any
behavioural rationale, is still conceptually more appealing (Nijkamp, 1977, p. 15,18).
This analogy should be viewed as pedagogical, with the theoretical basis being simply
the solution that minimises the uncertainty associated with incomplete information.
With Wilson’s reformulation ‘gravity’ models were released from the arbitrariness
of their calibration and given a theoretical justification for producing the most likely
interaction patterns in line with known information. However the “known informa-
tion” still included the assumption that some function of distance and/or cost will best
describe the interaction effects. This assumption began to be increasingly challenged.
Openshaw, in an atmosphere of “increasing evidence of unsatisfactory model perfor-
15The corresponding situation in physics applies to ideal gases and is known as the Boltzmann
hypothesis stating that in equilibrium all gas particle states with the same level of total energy are
equally likely to occur (Sen & Smith, 1995, p. 117).
16Because of the constraints the maximisation of the entropy cannot be solved using standard differ-
ential calculus, but must be solved using Lagrangian multipliers, a method that is beyond the scope of
this text. The complete mathematical derivation is given in Wilson (1967, p. 256-8).
17According to Senior, geographers are to blame for the idea of gravity in spatial interaction living
on. The gravity model was originally developed in transportation research and regional science, with
geographers as late-comers largely unaware of the adaptations made to the Newtonian model and
furthermore reluctant to accept Wilson’s entropy maximising methodology, thereby giving gravity a
"prolonged lease of life despite its serious shortcomings".(1979, p. 175)
18The maximum entropy method does not in fact rest on this physical analogy, a point which is
elaborated upon in Section 5.
mances” conducted a detailed empirical evaluation of a series of nine singly and doubly
constrained models with different levels of complexity of the cost functions and other
parameters to vary the attractiveness of zones and found the results “cast severe doubts
about [the models’] empirical acceptability” (Openshaw, 1976, p. 40). A similar con-
clusion was arrived at by Snickars and Weibull, who also performed a comparative study of various gravity models and found them considerably less powerful than a model based on historical data on travel patterns (Snickars & Weibull, 1977, p. 156). Thus the traditional cost function $F_{ij}$ started to be replaced by distribution functions that were found to better describe the spatial effects, such as historical trip distributions or their averages, or samples of current distributions (Willekens, 1983)19.
In terms of the classical log-linear model the analogy is clear. The spatial interaction matrix containing the values $T_{ij}$ is a two-dimensional contingency table. The expected cell count ($x_{ij}$), or number of interactions between two zones ($T_{ij}$), is the product of origin and destination (main) effects and an interaction effect, which in classical gravity models is a function of distance. The balancing factors $A_i$ and $B_j$ play the same role as the parameter constraints (e.g. deviation or indicator contrasts) and determine the interpretation of the model's parameters. We can see from Equation [4.13] that the interaction in the model is the same as the interaction in the distribution function. Therefore it also holds that:

$$\frac{T_{ij} \cdot T_{kl}}{T_{il} \cdot T_{kj}} = \frac{F_{ij} \cdot F_{kl}}{F_{il} \cdot F_{kj}} \qquad [4.14]$$

which is to say the flows predicted by the model will have the same odds ratios as the distribution matrix (historical, distance, etc.). In the classical log-linear formulation this is equivalent to saying that the values of $\tau^{AB}_{ij}$ for both the $T_{ij}$ matrix and the $F_{ij}$ matrix are the same. Thus we can see the importance of the cost/distance distribution function in determining the accuracy of the gravity model prediction, which is also the reason for its general poor performance.
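Continuing the illustrative doubly constrained fit sketched earlier, Equation [4.14] can be checked directly: any two-by-two cross-product ratio of the fitted flows should equal the corresponding ratio in the impedance matrix.

# Odds ratios of fitted flows versus the impedance matrix (Equation [4.14]).
# T_ij and F_ij are taken from the doubly constrained sketch above.
odds_ratio <- function(m, i, k, j, l) (m[i, j] * m[k, l]) / (m[i, l] * m[k, j])

odds_ratio(T_ij, 1, 2, 1, 3)   # fitted flows: zones (1, 2) by destinations (1, 3)
odds_ratio(F_ij, 1, 2, 1, 3)   # impedance matrix: the same value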
Conceptualising gravity models (with their entropy maximising justification) as a special case of log-linear models clarifies the role of the individual parameters as well as linking these concepts with IPF. The whole evolution of the 'gravity' model just described effectively ends up as simply a case of the classical IPF applications introduced at the start, such as Kruithoff's telephone traffic estimation problem, or the Fratar and Furness methods for vehicular traffic. In these cases historical matrices were used for the $F_{ij}$ distribution function, and were these not available, a simpler distance function would have been a second best option. And finally, if none of the above are available or applicable, maximum entropy dictates the $F_{ij}$ terms equal unity, thereby producing the most probable and least biased estimate. This is equivalent to setting
19At this point the 'gravity' model loses any connection to the Newtonian paradigm, both functional and theoretical.
the (multiplicative) log-linear interaction terms to one, or using a uniform prior.
4.6 Maximum Entropy and Log-linear models
This brings us back to the start, to two of the seminal papers in the development of
log-linear models, namely M.W. Birch’s Maximum Likelihood in Three-Way Contin-
gency Tables (1963) and I.J. Good’s Maximum Entropy for Hypothesis Formulation,
Especially for Multidimensional Contingency Tables (1963). In the first paper, Birch
showed that there is a unique solution to estimating the cell frequencies to satisfy the
marginal totals of the highest order interaction and maximise the likelihood. This
was a crucial input to the work of other statisticians perfecting log-linear model meth-
ods, by providing the underpinnings of the general statistical theory involved in the
methodology (Fienberg, 1992, p. 453). Birch’s results mean that in estimating the
maximum likelihood cell frequencies, there is no need to calculate the actual log-linear
coefficients as a sort of intermediate step, but that these can be calculated directly
from the marginal totals — the sufficient configurations of the table. Then, with these
marginal constraints in place, there exists only one solution for the cell estimates. To give a more concrete example: in a three-way table with no second order interaction, the sufficient configurations are the three two-way marginal tables. There is only one set of frequencies that adds up correctly to the marginal tables and has no second order interaction. These frequencies represent the maximum likelihood solution (Bishop et al., 1975, pp. 64-70).
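In R this is what loglin() does when the three two-way margins are supplied as the sufficient configurations: it runs IPF and returns the unique maximum likelihood table. The 2 × 2 × 2 counts below are arbitrary illustrative values.

# Maximum likelihood fit of the no-three-way-interaction model via IPF.
x <- array(c(20, 10, 15, 5,
             12, 18, 9, 11), dim = c(2, 2, 2))

# Sufficient configurations: the three two-way margins
fit <- loglin(x, margin = list(c(1, 2), c(1, 3), c(2, 3)), fit = TRUE)

fit$fit                        # the unique ML (and maximum entropy) cell estimates
apply(fit$fit, c(1, 2), sum)   # each two-way margin matches the observed one
apply(x,       c(1, 2), sum)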
Good’s paper produced some similar results yet was less influential than the Birch
paper, apparently due to its difficult notation and unusual form of proofs (Fienberg,
1992, p. 459). His proof of a unique solution only applied to a more narrow class of
log-linear models, where the highest interactions in the model are all the same level, so
was not as general as Birch’s20. But another contribution of Good’s paper is equally
important: his proof that the maximum likelihood estimates are equivalent to the
maximum entropy estimates21. What he described as a “curious duality connecting
maximum likelihood and maximum entropy” (Good, 1963, p. 927) would thus in a 3-
dimensional example mean that the maximum likelihood estimates under the hypothesis
of no three-way interaction are equal to the maximum entropy of the cell frequencies if
the 3 marginal faces (two-dimensional) are known.
Good suggests IPF be used to maximize the entropy and refers to it as the method
proposed by Brown (Brown, 1959), calling it the iterative scaling procedure. Brown’s
is another independent invention of the procedure that has not yet been mentioned,
20Nor was Birch’s completely general either, as it assumed all cell entries were positive. It was
Yvonne Bishop who showed that this restriction was not necessary — working of course on the National
Halothane Study, renowned for the sparsity of the data tables analysed (Bishop, 1969, p. 275)
21This equivalence is mistakenly attributed to Batty & Mckie (1972) by Willekens (1982, 1994).
and was developed explicitly in the literature on information theory. Curiously, Good
was also aware of Deming and Stephan’s work and actually references their solution,
but fails to realise it is the same as the Brown one he is advocating. This is of course
due to the fact that although we now know the Deming-Stephan algorithm finds the
maximum likelihood estimate (and hence maximizes the entropy as Good showed), they
had thought it produces the least squares solution, which seems to be the reason Good
didn’t recognise the algorithm as correct (for solving his problem, not theirs). Following
from Birch’s work, it was Yvonne Bishop who formally connected the Deming-Stephan
algorithm to solve the likelihood equations22. According to Fienberg by the time Good’s
work had been recognized as important, “many of the foundations of the log-linear
model literature had already been set” (ibid. p. 459). His work was picked up by
the relatively separate strand of literature dealing with information theory (e.g. Ireland
& Kullback, 1968a). These developments were also quite separate from Wilson’s use
of entropy maximizing in spatial interaction models described in the previous section
which were only explicitly linked to log-linear models in the 1980s (Willekens, 1980,
1983). And even so, with the exception of what we could call the Dutch school (in
addition to Willekens see also work by Brouwer, Nijkamp and Sholten (1988)) the
links between IPF, log-linear modelling, generalized linear models, spatial interaction
modelling, maximum likelihood and maximum entropy that have been comprehensively
described here, have not generally been recognized in the relevant literature.
22This was apparently done in her PhD dissertation, which is unfortunately unpublished, but the
same work appears in Bishop (1969) and Bishop et al. (1975).
Chapter 5
IPF for Maximum Entropy or
Minimum Discrimination
Information?
In the last two sections of the previous chapter the concept of entropy from Wilson’s
spatial interaction models and Good’s maximization of entropy in contingency tables
were discussed in the context of log-linear models, yet entropy deserves further elabora-
tion in its own right. The purpose of this chapter is to clarify the concepts of entropy,
information and uncertainty that have already been touched upon in the preceding
text. With regard to IPF this clarification represents the final level in our conceptu-
alization and understanding of the technique and its implications. In addition to the formal specification of the IPF solution as a log-linear model, the entropy framework leads to a more precise understanding of the nature of the estimates IPF produces. This understanding consequently allows us to assess the quality of IPF estimates and investigate their limitations.
This chapter can by no means attempt a comprehensive review of the entropy liter-
ature. Rather we deal with some of the confusions arising from what have been called
different schools of entropy: both above mentioned contributions by Wilson and Good
can be said to employ the statistical, quantitative or information theoretical interpreta-
tion of entropy, as opposed to the entropy from statistical mechanics or thermodynam-
ics, which has been termed the descriptive school of thought. The concept of entropy
evokes philosophical debates as to its proper meaning as well as at times vitriolic at-
tacks about its inappropriate name or use. In this section we will try to sidestep the
philosophical issues as much as possible and still present the most important aspects
of entropy in an accessible manner.
It has already been mentioned that entropy is a concept originating in thermo-
dynamics or statistical mechanics and was developed by Ludwig Boltzmann (using $S$ to denote it1). When Claude E. Shannon developed a measure of uncertainty in the context of information theory (denoting it with $H$) he "recognized" its form as that of entropy as defined in statistical mechanics and decided to use the same name (Shannon & Weaver, 1949, pp. 50-1).

Table 5.1: Entropy of Possible Groupings of Six People

Macrostate   Number of microstates                Entropy
6 + 0        $\frac{6!}{0! \cdot 6!} = 1$         $H = -(\frac{6}{6}\log_2\frac{6}{6} + \frac{0}{6}\log_2\frac{0}{6}) = 0$
5 + 1        $\frac{6!}{1! \cdot 5!} = 6$         $H = -(\frac{5}{6}\log_2\frac{5}{6} + \frac{1}{6}\log_2\frac{1}{6}) = 0.65$
4 + 2        $\frac{6!}{2! \cdot 4!} = 15$        $H = -(\frac{4}{6}\log_2\frac{4}{6} + \frac{2}{6}\log_2\frac{2}{6}) = 0.92$
3 + 3        $\frac{6!}{3! \cdot 3!} = 20$        $H = -(\frac{3}{6}\log_2\frac{3}{6} + \frac{3}{6}\log_2\frac{3}{6}) = 1$

This "unfortunate terminology" has become well established
despite objections, and we will follow Jaynes in distinguishing between experimental entropy to refer to the $S$ concept describing the property of a state of a system, and information entropy to refer to the latter $H$, the entropy of a probability distribution (Jaynes, 2003, p. 351).
As we shall see, both $S$ and $H$ can be written in a mathematically equivalent way. However their derivations differ, as do the philosophical implications of their different conceptualizations. Although information entropy is typically seen as the more general of the two, experimental entropy is particularly attractive pedagogically and often remains the only level of explanation2. The $S$ entropy measure is based on the idea of describing a system's macroproperties without having to know the underlying microstates. This is particularly useful in the types of systems studied by physics, where the numbers of particles whose microstates contribute to the system's state are of the order of $10^{23}$. We can however give a more limited example that allows the configurations of microstates to be counted exhaustively.
If we have six people divided into two groups, we can count the number of possible
configurations and the associated microstates (first two columns of Table 5.1). The
macrostate where all six are in one category and the other group is empty can only
1Also sometimes called the Boltzmann-Planck entropy formula, it seems to have first been explicitly
formulated by Max Planck in (1901, p.556), citing Boltzmann (1877) as the originator of the theory.
2See for example Peter Gould’s in depth review of Wilson’s gravity models (Gould, 1972), which is
in fact a detailed and very helpful pedagogical review of the entropy concept, that however discusses
entropy purely from the experimental entropy perspective, despite the fact that Wilson explicitly sides
with the information approach.
occur in one way. We can further count the possible ways the six people can be divided
so that there is a single person alone in one of the groups and five in the other. This can
happen in six ways - six different microstates all conform to this situation. Four people
in one group and two in the other can occur in 15 possible ways, and finally three people in each group can occur in 20 different ways, the most of all the possibilities considered.
The maximum entropy principle states that barring any additional information the
most likely state of the system is the one with the most microstates associated with it
— in our case this is the state with three people in each group.
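Both columns of Table 5.1 are easily reproduced; the R sketch below counts the microstates with the binomial coefficient and evaluates the binary-log entropy for each macrostate, using the convention that $0 \log 0 = 0$.

# Reproduce Table 5.1: microstate counts and entropy for six people in two groups.
N  <- 6
g1 <- c(6, 5, 4, 3)          # size of the first group in each macrostate
W  <- choose(N, g1)          # microstate counts: 1, 6, 15, 20

H <- sapply(g1, function(k) {
  p <- c(k, N - k) / N
  p <- p[p > 0]              # drop zero probabilities (0 * log 0 = 0)
  -sum(p * log2(p))          # entropy in bits
})

data.frame(macrostate = paste(g1, N - g1, sep = " + "), W = W, H = round(H, 2))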
Our example only has $n = 2$ possible groups, so the numbers of microstates in Table 5.1 are calculated using the binomial coefficient. Where $x_1 + x_2 = N$ we therefore have:

$$W = \frac{N!}{x_1! \cdot x_2!} \qquad [5.1]$$

Generalized for $n$ possibilities, the number of possible microstates equals the multinomial coefficient:

$$W = \frac{N!}{\prod_{i=1}^{n} x_i!} \qquad [5.2]$$

However, maximizing $W$ for systems with many particles becomes computationally demanding; therefore the logarithm of $W$ can be maximized instead, since the logarithm function is monotonically increasing, meaning that both $W$ and $\log W$ have the same maximum. The experimental entropy can thus be stated as:

$$S = k \cdot \log W \qquad [5.3]$$

where $k$ is simply a constant that naturally does not affect the maximum of $S$ but merely determines the unit of the measure. Maximizing $S$ relies on what is known as Stirling's approximation3 and can be expressed as:

$$S = -k \cdot \sum_{i=1}^{n} p_i \log p_i \qquad [5.4]$$
3Stirling’s approximation allows the logarithm of the factorial to be stated as log xi! = xilog xi−xi
as nincreases. The full derivation is as follows:
S/k = log W
= log N!
Qxi!
= log N!−log Yxi! Stirling’s approx.: log xi! = xilog xi−xi
≈N·log N−N−X(xi·log xi−xi) cancelling out N=Xxi
=N·log N−Xxi·log xiusing xi=N·pi
=N·log N−XN·pi·log (N·pi)
=N·log N−NXpi·log N−NXpi·log picancelling out N·log N=NXpi·log N
S/k =−NXpi·log pi
This expression of experimental entropy relies on the Stirling approximation. The expression for information entropy returns the same result through a different derivation. Shannon (1949) starts out the other way round, by first setting out a set of properties that would be desirable in a measure of uncertainty. The three conditions are that the measure $H(p_1, p_2, \ldots, p_n)$ (i) should be a continuous function of $p_i$, so that a slight change in one of the probabilities cannot result in a dramatic change in the value of $H(p_i)$; (ii) if all the $p_i$ are equal then $H$ should be a monotonically increasing function of $n$, which is equivalent to saying $H(p_i)$ should measure more uncertainty if there are more choices to choose from; and (iii) the measure should be decomposable in the sense that it is additive for independent groupings of probabilities4. Given only these three conditions Shannon showed there was nevertheless only one function that satisfied them (Shannon & Weaver, 1949, pp. 116-8):

$$H = -k \sum_{i=1}^{n} p_i \log p_i \qquad [5.5]$$
Recognizing that this measure of uncertainty in the context of information theory was of the same form as Equation [5.4], Shannon was led by the analogy to call his measure entropy as well (Shannon & Weaver, 1949, pp. 50-1). Neither statistical mechanics nor information theory has primacy over entropy, however. This was demonstrated by Edwin T. Jaynes in 1957, who showed that the entropy of Equation [5.5] was in fact a much more primitive or basic concept: it is the only information measure that does not lead to inconsistencies: "in the problem of prediction, the maximisation of entropy is not an application of a law of physics, but merely a method of reasoning which ensures that no unconscious arbitrary assumptions have been introduced" (1957, p. 630). In
short, the principle of maximum entropy allows us to find a solution that agrees with
all of the information that we know and expresses maximum uncertainty with regard
to what we don’t know.
Returning to our example, we can see in the last column of Table 5.1 the values of the entropy measure $H$ calculated for each of the possible distributions. For the first option the value of $H$ is zero5, which makes sense: there is no uncertainty about the distribution, and no additional information can shed any more light on it. The remaining options have increasing values of entropy, with the largest for the last option of three people in each group. Following Shannon's information theory framework this calculation uses the binary logarithm, which results in the measurement units being bits. In practice, for maximization purposes, the choice of base is irrelevant.
4This composition law is illustrated by Shannon using the entropy of a simple distribution: $H(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6})$. By grouping $p_2$ and $p_3$ together we can split the choice into two steps: in the first step the choice is between $p_1$ and $p_2 + p_3 = \tfrac{1}{2}$, and in case the second option is selected there is another choice between $p_2|(p_2 + p_3) = \tfrac{2}{3}$ and $p_3|(p_2 + p_3) = \tfrac{1}{3}$. Either way both procedures involve the same amount of uncertainty, therefore we require that $H(\tfrac{1}{2}, \tfrac{1}{3}, \tfrac{1}{6}) = H(\tfrac{1}{2}, \tfrac{1}{2}) + \tfrac{1}{2}H(\tfrac{2}{3}, \tfrac{1}{3})$. For a more general statement of this condition see Jaynes (1957).
5In the limit, as $x$ approaches 0, $x \cdot \log x$ does as well: $\lim_{x \to 0}(x \cdot \log x) = 0$.
We have seen therefore that maximizing $W$ or $S$ is formally equivalent to maximizing $H$. Mathematically the equations are identical, despite the fact that one is derived using Stirling's approximation and the other determined directly by satisfying certain conditions required of it. The more consequential distinction lies in the two equations corresponding to different views of probability. The entropy of statistical mechanics (or experimental entropy), $S$, is associated with the objective view of probability, which according to Jaynes "regards the probability of an event as an objective property [always capable] of empirical measurement", while $H$ can be said to belong to the subjective school, which sees "probabilities as expressions of human ignorance" (Jaynes, 1957, p. 622). Be it for philosophical reasons or due to its being mathematically 'cleaner', the latter is seen by many authors as more flexible and useful (Wilson, 1970; Pooler, 1983; Webber, 1977).
Maximum entropy calculation thus involves maximizing Equation [5.5] while taking
into account any other known information about the distribution. This information is
included in the calculations in the form of constraints. If no prior information is known
about the probability distribution, there is still one constraint:
$$\sum_{i=1}^{n} p_i = 1 \qquad [5.6]$$
that is, we know all the probabilities in the maximum entropy solution must sum up
to one.
Thus entropy as a fundamental principle of reasoning expresses the amount of uncertainty we have about a particular problem. In this way the maximum entropy solution is not the one that is most likely to be true; it is the one that we may consider most likely given our limited knowledge. Or in Jaynes' words, it determines what is "most strongly indicated by our present information" (Jaynes, 2003, p. 370). To return to our example, the solution $p_1 = p_2 = \frac{3}{6}$ maximizes the entropy function, as one might guess intuitively. That is to say, without any other information, $H$ achieves its maximum value if all events have equal probability - the distribution is uniform. Of course that does not mean that three people in each group is the most likely distribution, for there might be some other constraints operating that we are not aware of. The six people might, for example, be bridge players, in which case a 3+3 grouping makes little sense and will never occur, but a 4+2 grouping will.
The application of maximum entropy can serve two purposes: (i) estimating the
probability distribution based on known information, or (ii) if we can compare our es-
timates with actual observations it can then be used for identifying which information
is truly relevant to determining the probability distribution. Again returning to our
example of six people in two groups: (i) estimating their distribution given only this
information will lead us to predict a uniform distribution but (ii) comparing this dis-
tribution with further observations if this is possible, we might find they never agree;
the six will never group in the way we predicted. In this case we can deduce that our
predictions are wrong because we have insufficient constraints — we have therefore es-
tablished that an important constraint is missing in our model, and we need to gather
additional information — the fact that groups are formed to play bridge and must be
multiples of four6.
To return again to our log-linear models we can see that the configurations included
in the model (i.e. their respective coefficients) operate in fact as the constraints of the
entropy maximization. Thus, of all the possible configurations that satisfy the constraints (i.e. sum up correctly to the known margins), we choose the one that maximizes the entropy of the probability distribution. This same solution is of course arrived at by IPF. To give a simple example, in a 2 × 2 table with both marginal totals known, we need to maximize the entropy while taking into account the constraint from [5.6] as well as the following two marginal constraints:

$$p_{11} + p_{12} = p_{1+} \quad (\text{because of [5.6], } p_{21} + p_{22} = p_{2+} = 1 - p_{1+})$$
$$p_{11} + p_{21} = p_{+1} \quad (\text{because of [5.6], } p_{12} + p_{22} = p_{+2} = 1 - p_{+1}) \qquad [5.7]$$

The analytical solution can be arrived at using Lagrangian multipliers, which we will not derive here. We know of course that for this simple example the solution is the model of independence, where $p_{ij} = p_{i+} \times p_{+j}$; however, this is not generally the case. The solution can nevertheless always be found using IPF. This is the only solution that satisfies the constraints and maximizes the entropy. It uses all the known information from the marginal totals, but does not assume anything more: for all we know a priori, all microstates are equally likely.
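This can be verified numerically. Starting from a uniform seed and fitting illustrative 2 × 2 margins by IPF returns the independence table, and any other table with the same margins has lower entropy; the R sketch below checks one perturbed alternative.

# Maximum entropy table for known 2x2 margins, found by IPF (illustrative margins).
row_m <- c(0.6, 0.4)
col_m <- c(0.3, 0.7)

p <- matrix(0.25, 2, 2)              # uniform (uninformative) starting table
for (iter in 1:50) {
  p <- sweep(p, 1, row_m / rowSums(p), "*")
  p <- sweep(p, 2, col_m / colSums(p), "*")
}

p                                    # IPF solution
outer(row_m, col_m)                  # independence solution: the same table

entropy <- function(p) -sum(p * log(p))
entropy(p)

q <- p + 0.05 * matrix(c(1, -1, -1, 1), 2, 2)   # same margins, different cells
rowSums(q); colSums(q)
entropy(q)                           # lower than entropy(p)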
In this example we used an uninformative prior — a prior distribution that does
not carry any information. However if in addition to the constraints, we also have some
estimate of what the distribution is, this information should be incorporated into our
estimate. This is done by choosing a distribution that satisfies the known constraints,
but is as similar as possible to the prior distribution, where similar is understood to mean "has the minimum of extra information". This principle is a generalization of the entropy maximization principle, and unfortunately has a similar terminological variability to IPF: Good called it minimum cross entropy and the principle of minimum discriminability (1963), Kullback used the term directed divergence (1959) and later minimum discrimination information (Ireland & Kullback, 1968b), and in his honour it is often referred to as either the Kullback Information (Kotz & Johnson, 1983) or
6This might seem a slightly contrived example, yet it is in complete analogy with an example
Jaynes describes regarding one of the most important applications of the entropy maximizing principle
in statistical mechanics by Gibbs: his entropy maximizing results failed to correctly predict certain
thermodynamic properties, so his conclusion was that “the laws of physics must involve some additional
constraint not contained in the laws of classical mechanics” and sure enough, decades after Gibbs’ death,
this constraint was found when it was discovered that energy values are discrete, which is the foundation
of quantum theory. (Jaynes, 2003, p.371)
the Kullback-Leibler distance (Hastie, 1987) as well as simply minimum information
(Snickars & Weibull, 1977), information gain (Renyi, 1970) or relative entropy (Müh-
lenbein & Höns, 2005). We will use the term minimum discrimination information
(MDI) for its descriptiveness as well as its being Kullback’s preferred term (1987). The
measure can be stated as:
$$I = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i} \qquad [5.8]$$

where $q_i$ represents the prior distribution. It is easy to see how [5.8] is a generalization of entropy (Equation [5.5]) if we consider that entropy is maximized using a uniform prior, which means all $q_i$ are the same:

$$\text{for } q_i = \frac{1}{n}, \qquad I = \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i} = \sum_{i=1}^{n} p_i \log (p_i \cdot n) = \sum_{i=1}^{n} p_i \log p_i + \log n \qquad [5.9]$$
Because $\log n$ is constant it does not affect the minimization of the function, therefore we are left with the familiar form from Equation [5.5], except with the opposite sign. So when we have an uninformative prior, the maximization of entropy will lead to the same distribution as the minimization of the discrimination information measure. However if we have an informative prior we can include it as $q_i$ and, subject to other constraints, minimize $I$ as a way of incorporating this prior information in the least biased way.
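The same margins used above can be combined with an informative prior: seeding the IPF scaling with a prior table q (illustrative values below) yields the table that satisfies the margins while minimizing the discrimination information of Equation [5.8] relative to q.

# IPF with an informative prior (seed), and the MDI measure of Equation [5.8].
row_m <- c(0.6, 0.4)
col_m <- c(0.3, 0.7)
q <- matrix(c(0.40, 0.10, 0.20, 0.30), 2, 2)   # illustrative prior distribution

p <- q                                 # start the scaling from the prior
for (iter in 1:100) {
  p <- sweep(p, 1, row_m / rowSums(p), "*")
  p <- sweep(p, 2, col_m / colSums(p), "*")
}

mdi <- function(p, q) sum(p * log(p / q))   # Equation [5.8]
mdi(p, q)

p2 <- outer(row_m, col_m)              # a different table with the same margins
mdi(p2, q)                             # larger than mdi(p, q)

With a uniform prior q the same scaling reduces to the maximum entropy fit of the previous sketch, in line with Equation [5.9].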
These same results have also been derived in a more geographical context by Batty (1974; 1976). In his spatial entropy the prior distribution is denoted by $a_i$ to refer explicitly to area size. This definition is less general than the MDI principle as defined here, since area sizes are only one kind of prior information one might wish to include in estimating a distribution.
So we can see that discrimination information minimizing is a general case of entropy maximizing: while the latter assumes an uninformative prior, the former can also incorporate an informative prior in addition to other constraints. This might seem like an excessively complicated approach to the sort of estimation we performed so intuitively using classical IPF. It is however an important point that regardless of what prior information, if any, is used in the form of cell entries, the underlying logic is the same: our estimates satisfy the known constraints while introducing a minimum amount of information, or a maximum amount of uncertainty, making them unbiased with regard to any prior information we might have.
The entropy framework allows us to understand the nature of the estimates of the
unknown cell values of a table: the estimates can only be as good as the constraints
we impose as our prior information. In the final chapter in Part I we take a closer look
at the issues and limitations that can be drawn from this. First however, we look at
a simple 3-d example to bring together the different theoretical frameworks mentioned
in the previous sections in a practical application.
Chapter 6
Estimating Cells in Three
Dimensions
Until now all the examples shown have used two-dimensional data sets. Whilst two-
dimensional tables are convenient for illustration purposes, their estimation can appear
too intuitively simple or straightforward to justify the whole theoretical framework
outlined over the preceding sections. In this section, therefore, the focus switches to the
estimation of cells in a three-dimensional table. An example is presented drawing upon
census data on gender and long term limiting illness in three UK countries. The dataset
is introduced using a novel three-dimensional graphical visualization to allow a clearer
understanding of the interactions involved. It is then shown how IPF can answer three
kinds of estimation problems which might more conventionally be described as log-linear
modelling, maximum entropy and minimum discrimination information. The worked
example establishes the statistical equivalence of IPF to the other techniques reviewed
and clearly demonstrates how the general theoretical framework extends naturally to
higher dimensions. Finally the example is used to illustrate how lower and higher-order
odds ratios are handled by IPF as an indication of future investigations of estimate
quality.
Three possible scenarios are described of how IPF might be used to solve questions
that might traditionally fall under the headings of log-linear modelling, maximum en-
tropy and minimum discrimination information.
6.1 Original Data
The illustration of estimating cell values is based on the following dataset taken from
the 2001 UK Census. Table 6.1 gives the crosstabulation of gender and limiting long-
term illness (LLTI) for the three countries of the UK. Using mnemonic notation we
have three variables denoted as follows:
• Country $C_i$ with 3 possible values: $i = \{e, w, s\}$ referring to England, Wales and Scotland respectively;
• Limiting long-term illness $L_j$ with 2 possible values: $j = \{y, n\}$ referring to Yes and No; and
• Gender $G_k$ with 2 possible values: $k = \{f, m\}$ denoting Females and Males respectively.

Table 6.1: LLTI by gender in countries of GB

[CLG]                     England        Wales      Scotland     Total [LG]
Male    LLTI            3,907,050      307,605       465,907      4,680,562
        No LLTI        19,603,209    1,079,400     1,966,587     22,649,196
Female  LLTI            4,462,124      342,463       561,965      5,366,552
        No LLTI        20,275,767    1,130,021     2,067,552     23,473,340
[CL]    LLTI            8,369,174      650,068     1,027,872     10,047,114 [L]
        No LLTI        39,878,976    2,209,421     4,034,139     46,122,536
[CG]    Male           23,510,259    1,387,005     2,432,494     27,329,758 [G]
        Female         24,737,891    1,472,484     2,629,517     28,839,892
[C]     Total          48,248,150    2,859,489     5,062,011     56,169,650 N
Following previously established notation, [CLG] refers to the complete table of values $x_{ijk}$, where for example $x_{wym}$ refers to the number of Welsh males with LLTI. Lower level configurations are referred to in a similar way, e.g. [CG] denotes the two-dimensional marginal configuration crosstabulating country and gender, that is the table of $x_{i+k}$, where the entry $x_{w+m}$ refers to Welsh men, both with and without LLTI. The full [CLG] as well as the three two-dimensional margins, the three one-dimensional margins and the grand total $N = x_{+++}$ are given in Table 6.1.
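For readers who wish to follow the example computationally, the R sketch below simply transcribes Table 6.1 into a three-dimensional array (ordered country × LLTI × gender; the object name clg is chosen here for convenience) and recovers the marginal configurations by summing.

# [CLG] table from the 2001 Census (Table 6.1), dimensions: country x LLTI x gender
clg <- array(c(3907050, 307605, 465907,      # males with LLTI (E, W, S)
               19603209, 1079400, 1966587,   # males without LLTI
               4462124, 342463, 561965,      # females with LLTI
               20275767, 1130021, 2067552),  # females without LLTI
             dim = c(3, 2, 2),
             dimnames = list(C = c("England", "Wales", "Scotland"),
                             L = c("LLTI", "No LLTI"),
                             G = c("Male", "Female")))

apply(clg, c(1, 2), sum)   # the [CL] margin
apply(clg, c(1, 3), sum)   # the [CG] margin
apply(clg, c(2, 3), sum)   # the [LG] margin
sum(clg)                   # the grand total N = 56,169,650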
In order to give a better overview of the data and the ensuing analysis, the table
is also presented graphically, using mosaic cubes. Three-dimensional mosaic cubes are
a novel way of representing three-dimensional data1. The volumes of the cubes —
or cuboids rather — are proportional to the number of people in the corresponding
categories. It should be noted that similarly to two-dimensional mosaic plots, where
there are always two ways to represent the plot, depending on which variable is used
first to split the data, there are in fact six possible ways of constructing a cube. Figure
1To the author’s knowledge there is no computer programme available for this type of visualization,
so the figures here were produced manually using coordinate data from the two-dimensional mosaic
plots produced by the vcd package for R (Meyer et al., 2010).
Figure 6.1: 3D visualization of gender by LLTI for the three countries of the UK (panels: England, Wales, Scotland)
6.1 shows the three country layers: the data is split first by the country variable, then according to the LLTI variable, and finally by gender. Figure 6.2 shows the complete
cube stacking together the three layers, along with the corresponding 2D mosaic plots
representing the two-dimensional margins.
It is important to note that the marginal plots do not correspond to the sides of
the cube. They represent the summing over the third category, which is not equivalent
to the sides of the cube, the latter representing only the crosstabulation in the first
layer, row or column. It is furthermore now possible to see that the independence of
two variables — which would show up as aligned mosaic tiles in the two-dimensional
view — does not mean that the variables are independent within the third category as
well2. Using this type of data presentation it also becomes quite graphically clear how
the size (weight) of a particular category can affect the marginal total.
6.2 Defining the Problem
There are three general research scenarios that may be explored using this data set.
• In the first scenario the complete crosstabulation is known. In this case the
researcher might wish to express the data using a more parsimonious model.
This can be done by removing the relevant higher terms from the saturated log-
linear model, calculating the expected cell values and comparing them to the
original table e.g. using the Pearson chi square statistic. Then if the difference
between the observed and expected cell counts is acceptable, one can conclude
the removed term is not significant and can be ignored.
• In the second scenario the complete crosstabulation is not known. One might for
2In fact, as we shall see in Section 9.3, the so called Simpson’s paradox refers to the even more
extreme case where the 2D margin exhibits the opposite relationship to the inside layers it is summed
up from.
Figure 6.2: Mosaic cube of LLTI by gender in countries of GB (layers: England, Wales, Scotland; axes: LLTI and gender)
example only have data from [CL], [CG] and [LG], the three two-dimensional tables, but not the values of the $x_{ijk}$ cells. In this case one could estimate the values of the cells based only on the available information. Of course in these circumstances there is no way of assessing how correct the estimated values are; it is only possible to say that they are the least biased estimates given our current knowledge.
• In the third scenario the complete crosstabulation is also not known; however, in addition to the two-dimensional margins as before, an additional sample may be available: $x'_{ijk}$, where $x'_{+++} \ll x_{+++}$. In this case one might wish to gain a better estimate of the $x_{ijk}$ cells than in the previous scenario, by including the information from the sample as well.
All the scenarios involve using IPF to generate the cell estimates; and all can also
be stated as unsaturated log-linear models or as problems of minimizing the discrim-
ination information. In general only a limited number of configurations exist where
IPF is not necessary and the cell estimates can be calculated directly. This is of course
the case in a two-dimensional table with no interaction between the variables, where the estimates are calculated from the marginal totals. When such a solution exists, its form is predictable and familiar from the classic two-dimensional independence test: the numerator multiplies the sufficient configurations3 and the denominator has con-
3The sufficient statistics can be determined using the following steps: (i) select the marginal totals
corresponding to the highest level interactions in the model; (ii) select the marginal totals corresponding
Table 6.2: Multiplicative coefficients, fully saturated model (deviation contrast)

$\tau^{CLG}_{ijk}$              England     Wales      Scotland    Total ($\tau^{LG}_{jk}$)
Male    LLTI                    1.0000      1.0095     0.9905      0.9755
        No LLTI                 0.9999      0.9906     1.0096      1.0252
Female  LLTI                    0.9999      0.9906     1.0096      1.0252
        No LLTI                 1.0000      1.0095     0.9905      0.9755
$\tau^{CL}_{ij}$  LLTI          0.9153      1.0843     1.0076      0.5000 ($\tau^L_j$)
        No LLTI                 1.0925      0.9223     0.9925      2.0001
$\tau^{CG}_{ik}$  Male          1.0048      1.0082     0.9871      0.9546 ($\tau^G_k$)
        Female                  0.9952      0.9919     1.0130      1.0475
$\tau^C_i$  Total               5.1536      0.3382     0.5738      1,770,363 ($\tau$)
figurations caused by overlapping of these. Such closed form solutions are however an exception, and in general the cell values must be estimated using IPF. Bishop et al. (1975, pp. 76-83) in fact give an extensive description of known rules for determining whether an analytical solution exists for a particular model, including a list of all possible configurations of four-dimensional models and how they should be solved. However these rules can be considered obsolete today given the relative ease of finding the solution using IPF, which is probably as fast as, if not faster than, solving a large system of equations. In fact, if direct estimates exist, IPF will converge to them in the first cycle (ibid. p. 83).
The complete [CLG] table can be expressed as a log-linear model in its saturated
form:
x_{ijk} = τ · τ^C_i · τ^L_j · τ^G_k · τ^{CL}_{ij} · τ^{CG}_{ik} · τ^{LG}_{jk} · τ^{CLG}_{ijk}    [6.1]
and the log-linear parameters for this model — fully saturated, multiplicative, using
deviation contrast type constraints — are given in Table 6.2.
6.3 Estimating Cells with no Second Order Interaction
With a log-linear specification of a table, it is possible to remove higher order interac-
tions by setting their coefficients to one (or zero depending on the form of the model),
while retaining lower ones, thus setting up an unsaturated model. The estimates can
be calculated by solving the resulting system of linear equations. While it is possible to
solve them analytically for some exceptional cases, for more complex models the equa-
tions are generally not closed form, meaning the solution can only be found numerically
— using IPF.
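To make the mechanics concrete, the following is a minimal illustrative sketch of such an IPF routine in R (the language of the bespoke implementation in Appendix D, though this is not that code); the function name ipf3 and its arguments are assumptions made for this example only.

# A minimal IPF sketch for estimating a [CLG] table from its three two-way
# margins (the unsaturated model [CL][CG][LG]); illustrative only, not the
# bespoke routine given in Appendix D.
ipf3 <- function(seed, m.cl, m.cg, m.lg, tol = 1e-8, maxit = 1000) {
  fit.margin <- function(a, margin, target) {
    r <- target / apply(a, margin, sum)        # scaling factors for this margin
    r[!is.finite(r)] <- 0                      # guard for empty margin cells
    sweep(a, margin, r, "*")
  }
  est <- seed                                  # the seed carries any prior CLG interaction
  for (it in seq_len(maxit)) {
    est <- fit.margin(est, c(1, 2), m.cl)      # [CL]: country by LLTI
    est <- fit.margin(est, c(1, 3), m.cg)      # [CG]: country by gender
    est <- fit.margin(est, c(2, 3), m.lg)      # [LG]: LLTI by gender
    if (max(abs(apply(est, c(1, 2), sum) - m.cl)) < tol) break
  }
  est
}

With a seed array of ones this corresponds to scenario two, where no second order interaction is assumed, while supplying a sample table as the seed gives the scenario three estimates discussed in Section 6.4.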
Scenarios one and two are both based on the same unsaturated log-linear model
with the τ^{CLG}_{ijk} term set to one. This is written as:

m_{ijk} = τ · τ^C_i · τ^L_j · τ^G_k · τ^{CL}_{ij} · τ^{LG}_{jk} · τ^{CG}_{ik}    [6.2]
We can see from the original τ terms of the saturated model in Table 6.2 that the
τ^{CLG}_{ijk} are all extremely close to one, so under scenario one this indicates that removing
this term will not create significant bias. Of course under scenario two we do not know the
actual τ terms of the saturated model, so we cannot know how much error we are
introducing by removing them.
Table 6.3 shows the cell estimates for this model as calculated using IPF. Compared
to Table 6.1 we can see that all the marginal tables are identical, however the x_{ijk} values
differ. The odds ratios are a convenient alternative way of describing the relationships
in the estimated table. We first define the odds ratio θ_{(i1 i2)(j1 j2)} in a three-dimensional
setting:

θ^{AB|C}_{(i1 i2)(j1 j2)|k} = (x_{i1 j1 k} · x_{i2 j2 k}) / (x_{i2 j1 k} · x_{i1 j2 k})    [6.3]
which is simply an expression of the odds ratio (see Equation [3.4] on page 19) for
categories i1 and i2 by j1 and j2 at level k of variable C. This is the conditional odds
ratio of variables A and B given that C = k. Thus for example θ^{LG|C}_{(yn)(mf)|w} is the
odds ratio of LLTI and gender for Wales (C is the conditioning variable and its value
in the subscript is “w”). In this case both the gender and LLTI variables have only
two categories, so we can drop their respective subscripts and instead write simply
θ^{LG|C=w}. We can calculate this conditional odds ratio for Wales using the estimated
cell values in Table 6.3:

θ^{LG|C=w} = (x_{ymw} · x_{nfw}) / (x_{yfw} · x_{nmw}) = (302,710.4 · 1,125,126) / (347,357.6 · 1,084,295) = 0.9043
This can be interpreted as follows: in Wales, the odds of being male are 0.9043
times as high for someone with LLTI as for someone without LLTI. In fact in the
estimated table the odds ratios are the same in England and in Scotland as well — which
is equivalent to saying there is no second order interaction. Using the odds ratios we
can see the estimates, which were calculated by setting the second-order interaction to
be non-existent, are in fact showing the relationship between gender and LLTI does
not vary by country. It is also important to note that while the conditional odds ratios
are the same across countries, they do not equal the marginal odds ratio. In this case the
overall odds ratio is:

θ^{LG} = (x_{ym+} · x_{nf+}) / (x_{yf+} · x_{nm+}) = (4,680,562 · 23,473,340) / (5,366,552 · 22,649,196) = 0.9039
The difference is only slight in this example because the original second order interaction
was almost non-existent (the τ^{CLG}_{ijk} in Table 6.2 are very close to one), yet we note this
difference here to point out that the relationship between the variables is different at
different levels of aggregation, a topic that we will return to in Section 9.3.
We can further evaluate the estimates by comparing the cell values with the orig-
inally observed ones. The extent of this difference, or rather the goodness-of-fit, can
be tested using one of the classical methods such as the Pearson statistic, which has
a value of 470.41 in this case, with two degrees of freedom4. Another useful measure
is the Total Absolute Error which, as the name implies, is calculated as the sum of all
cell differences:
TAE = Σ_{ijk} |x_{ijk} − x̂_{ijk}|    [6.4]
In this case the value of the TAE is 59,882 and therefore TAE/2=29,941 is the number
of persons that have been misclassified. This means that assuming the second order
interaction does not exist results in misclassifying almost 30,000 persons who, out of a
total population of 56,169,650, represent 0.05 %. Whether or not this is an acceptable
level of error is of course a question to be answered under scenario one. Under scenario
two none of these diagnostics can be calculated, as they rely on knowing the full table.
In these circumstances the merits of the estimates cannot be judged by how closely
they represent the real data, but by how well they utilize the available data and what, if
anything, they assume about the unknown configuration. As we saw in the previous
section, this is the best estimate that can be given in this case, incorporating all the
known information and not assuming anything more about the second order interaction.
6.4 Estimating Cells with a Borrowed Second Order Interaction
The third scenario is the case of estimating the full [CLG] table from three two way
margins: [CL], [CG] and [LG] by borrowing strength from a sample from the complete
crosstabulation, denoted here as [CLG]′ with cell values x′_{ijk}. Just as we expressed
the saturated model for the [CLG] table in Equation [6.1], we can do the same for the
sample table [CLG]′:

x′_{ijk} = τ′ · τ′^C_i · τ′^L_j · τ′^G_k · τ′^{CL}_{ij} · τ′^{CG}_{ik} · τ′^{LG}_{jk} · τ′^{CLG}_{ijk}    [6.5]
where the prime symbol indicates the coefficients refer to the sample and not the
population. The coefficients in Equation [6.5] are all known — they can be calculated
4These and other measures will be given in depth consideration in Section 8.4.
Table 6.3: Estimated full table with no CLG interaction

                              England        Wales       Scotland      Total
[CLG]  Male    LLTI         3,904,459.3    302,710.4     473,392.3    4,680,562   [LG]
               No LLTI      19,605,800     1,084,295     1,959,102   22,649,196
       Female  LLTI         4,464,714.7    347,357.6     554,479.7    5,366,552
               No LLTI      20,273,176     1,125,126     2,075,037   23,473,340
[CL]           LLTI          8,369,174       650,068     1,027,872   10,047,114   [L]
               No LLTI      39,878,976     2,209,421     4,034,139   46,122,536
[CG]           Male         23,510,259     1,387,005     2,432,494   27,329,758   [G]
               Female       24,737,891     1,472,484     2,629,517   28,839,892
[C]            Total        48,248,150     2,859,489     5,062,011   56,169,650   N
from the sample directly. For convenience we repeat the equation of the population
log-linear model here:
x̂_{ijk} = τ · τ^C_i · τ^L_j · τ^G_k · τ^{CL}_{ij} · τ^{LG}_{jk} · τ^{CG}_{ik} · τ^{CLG}_{ijk}    [[6.1]]

Because we don’t know the interaction coefficient for the whole population, we make
the assumption that it is the same as the interaction in the sample: τ^{CLG}_{ijk} = τ′^{CLG}_{ijk}.
Inserting the sample interaction coefficients from Equation [6.5] into [6.1] we get:
x̂_{ijk} = (τ/τ′) · (τ^C_i/τ′^C_i) · (τ^L_j/τ′^L_j) · (τ^G_k/τ′^G_k) · (τ^{CL}_{ij}/τ′^{CL}_{ij}) · (τ^{LG}_{jk}/τ′^{LG}_{jk}) · (τ^{CG}_{ik}/τ′^{CG}_{ik}) · x′_{ijk}    [6.6]
This equation is of course subject to the usual constraints associated with deviation
type multiplicative coefficients, as well as to the constraints of the known marginals:
x̂_{+jk} = x_{+jk},   x̂_{i+k} = x_{i+k}   and   x̂_{ij+} = x_{ij+}    [6.7]
As before, finding the estimated cell values for this model involves solving this
system of equations which can be done using Lagrange multipliers or, more conveniently,
using IPF. For the purposes of this illustration we take a simple random sample of size
x′_{+++} = 1000 from the total population x_{+++} = 56,169,650. The sample cells (x′_{ijk}) and
the cells estimated using IPF (x̂_{ijk}) are presented in Table 6.4. The margins are not
given explicitly, as it is clear from Equations [6.7] that all the margins for the estimated
table are identical to the ones presented in the original census table (Table 6.1).
Table 6.4: Random sample of 1000 people and estimated full table using the sample's CLG interaction

                       Sample [CLG]′          Estimated [CLG]
                       E     W     S          England         Wales          Scotland
Male    LLTI          60     7     9        3,829,903.6     415,948.6        434,709.7
        No LLTI      334    16    34       19,680,355.4     971,056.4      1,997,784.3
Female  LLTI          71     4    15        4,539,270.4     234,119.4        593,162.3
        No LLTI      380    23    47       20,198,621.0   1,238,365.0      2,036,355.0
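As an illustration only, the hypothetical ipf3() sketch from Section 6.3 could be driven with the Table 6.4 sample as its seed; the objects m.cl, m.cg and m.lg below are placeholders for the three two-way census margins of Table 6.1 and are not reproduced here.

# Scenario three with the illustrative ipf3() sketch: the 1000-person sample acts as
# the seed, so its CLG interaction is carried over while the census margins are enforced.
sample.clg <- array(c(60, 7, 9,  334, 16, 34,    # Male: LLTI by country, then No LLTI
                      71, 4, 15, 380, 23, 47),   # Female: LLTI by country, then No LLTI
                    dim = c(3, 2, 2),
                    dimnames = list(country = c("E", "W", "S"),
                                    llti = c("yes", "no"),
                                    sex = c("m", "f")))
# m.cl, m.cg and m.lg stand for the known two-way census tables (not shown here)
est <- ipf3(seed = sample.clg, m.cl = m.cl, m.cg = m.cg, m.lg = m.lg)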
As in the previous example we can again assess the quality of this estimate by com-
paring it to the known full table. Pearson’s Chi squared equals 108,233 with 2 degrees
of freedom, while the TAE is 866,749 meaning that 0.78 % of the total population were
misclassified. While this might seem like a relatively small number, the performance of
this estimation was much worse than the previous example. To see why this is so, we
need to have a closer look at the log-linear coefficients.
Instead of presenting the log-linear coefficients of the sample and resulting estimates
in tabular form, they are presented graphically, in order to allow direct comparisons.
Figure 6.3 plots three sets of coefficients: (i) terms from the saturated model of the
original data (also given in Table 6.2) are depicted using asterisks; (ii) terms from the
sample are shown using small circles and (iii) terms from the estimated table are
shown using large circles. The horizontal axis indicates which group the terms fall into.
For our purposes here, the most important are the second order interaction terms
on the right hand side (τCLG). These are the ones that were borrowed from the sample.
This is clear in the graph as the sample coefficients (small circles) align perfectly with
the estimated coefficients (large circles). They are not however aligned with the true
coefficients, indicated by the asterisk symbols. The actual second order coefficients are
all extremely close to one (see Table 6.2) which is indicated on the graph by the dotted
horizontal line. We can see therefore, that our sample poorly reflected the second order
interaction of the data and this interaction was of course included in the estimated
table.
In order to give a better understanding of this second-order interaction we may
again take a closer look at the second order coefficients and the associated odds ratios5.
We have seen this in the two dimensional case: borrowing the first order interaction
terms preserves the odds ratios. In three dimensions then, borrowing the second order
interaction terms preserves the ratios of odds ratios. In order to illustrate this concept
and its relationship to the borrowed log-linear coefficients, we can use Equation [6.3]
5The interpretation of log-linear parameters in terms of odds and odds ratios for the 2x2 example
was described in section 4.4.
Figure 6.3: Comparison of log-linear coefficients from the original data, the sample
data and the newly estimated data
to first calculate the conditional odds ratio for Wales from the sample data using the
data given in Table 6.4:
θ′^{LG|C=w} = (x′_{ymw} · x′_{nfw}) / (x′_{yfw} · x′_{nmw}) = (7 · 23) / (4 · 16) = 2.52
Which can be interpreted step by step as follows:
• in the sample, in Wales, the odds of someone with LLTI being male are 7 : 4
• in the sample, in Wales, the odds of someone without LLTI being male are 16 : 23
⇒ in the sample, in Wales, the odds of being male are 2.52 times as high for someone
with LLTI as for someone without LLTI.
We are now in a position to define the second order odds ratio or ratio of odds ratios:
θ^{AB|C}_{(i1 i2)(j1 j2)|k1 k2} = θ^{AB|C}_{(i1 i2)(j1 j2)|k1} / θ^{AB|C}_{(i1 i2)(j1 j2)|k2}    [6.8]
This expression can also be referred to as the second-order odds ratio (Rudas, 1998)
and represents the ratio of two conditional odds ratios. If we insert Equation [6.3] and
then express the x_{ijk} terms using the log-linear formulations (Equation [6.1]) all the
lower order terms cancel each other out and we are left with:
θ^{AB|C}_{(i1 i2)(j1 j2)|k1 k2} = (τ_{i1 j1 k1} · τ_{i2 j2 k1} · τ_{i1 j2 k2} · τ_{i2 j1 k2}) / (τ_{i1 j2 k1} · τ_{i2 j1 k1} · τ_{i1 j1 k2} · τ_{i2 j2 k2})    [6.9]
Table 6.5: Comparison of odds ratios of LLTI and Gender

                                                 Original     Sample     Estimate
                                                 (x_{ijk})    (x′_{ijk}) (x̂_{ijk})
Marginal Odds Ratio        θ^{LG}                   0.90        0.99        0.90
Conditional Odds Ratios    θ^{LG|C=e}               0.91        0.96        0.87
                           θ^{LG|C=w}               0.94        2.52        2.27
                           θ^{LG|C=s}               0.87        0.83        0.75
Ratios of Odds Ratios      θ^{LG|C}_{(ew)}          0.96        0.38        0.38
                           θ^{LG|C}_{(es)}          1.04        1.16        1.16
                           θ^{LG|C}_{(ws)}          1.08        3.03        3.03
Thus the ratio of the odds ratios is directly related to the second order log-linear coeffi-
cients and can equally be calculated from them as well as from the cell frequencies. To
return to our example, the ratio of the conditional odds ratios for Wales and Scotland,
calculated from the sample cell frequencies is:
θ′^{LG|C}_{(ws)} = θ′^{LG|C=w} / θ′^{LG|C=s} = [(x′_{ymw} · x′_{nfw}) / (x′_{yfw} · x′_{nmw})] / [(x′_{yms} · x′_{nfs}) / (x′_{yfs} · x′_{nms})] = 2.52 / 0.83 = 3.03
This ratio of the two odds ratios is interpreted as the extent to which the country
variable influences the interaction between the gender and LLTI variables. In this case
the effect is three times stronger in Wales than in Scotland. If on the other hand the
ratio of the odds ratios had equalled one, this would indicate that the country has no
effect on the relationship between gender and LLTI.
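The worked figures can be verified with a few lines of R; the helper cond.or() below is an illustrative assumption, applied to the sample cells of Table 6.4.

# Conditional odds ratio of LLTI and gender given country (Equation [6.3]),
# computed from the Table 6.4 sample cells as a check on the worked example.
cond.or <- function(x.ym, x.yf, x.nm, x.nf) (x.ym * x.nf) / (x.yf * x.nm)

or.wales    <- cond.or(7, 4, 16, 23)     # 2.52
or.scotland <- cond.or(9, 15, 34, 47)    # 0.83
or.wales / or.scotland                   # ratio of odds ratios, Equation [6.8]: 3.03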
We can now explore this second-order interaction in more detail by comparing the
values between the original data, the sample and the estimates. Table 6.5 presents
the odds ratios of LLTI and gender: first the unconditional or marginal odds ratios,
then the three conditional odds ratios for each of the three countries and finally three
second-order odds ratios for each pair of countries6. The bold numbers highlight the
fact that the estimate has the same marginal odds ratio (first row) as the original data
– which is the direct consequence of keeping the same marginal totals – and has the
same ratios of odds ratios as the sample – a consequence of using it as a prior for the
second order interaction.
6For the purposes of this illustration the odds ratios are calculated conditional on the country
variable, which makes for a meaningful interpretation. However all other combinations are equally
valid and have been omitted here only for clarity. For simplicity we also do not show all the derivative
(ratios of) odds ratios: there is an inverse version of each of the ratios mentioned and each has its
appropriate interpretation.
Comparing the ratios of odds ratios in Table 6.5 it becomes clear why borrowing the
sample’s second order interaction resulted in a much poorer fit than using a uniform
prior as in the previous subsection. The original data exhibits a relatively weak second
order interaction, which is indicated by its second order coefficients being very close
to one. As we know from the derivation in Equation [6.9] such terms mean the ratio
of odds ratios is very close to one as well, which is also confirmed in Table 6.5. Thus
assuming a uniform prior, which is equivalent to setting the second-order interaction
terms to one and is consequently equivalent to setting the ratios of the odds ratios
to one, was closer to the real data than the coarser second order coefficients from the
sample.
6.5 Lessons from 3D Estimation
The main feature of this section is the practical illustration of the preceding theoretical
framework using a real-world three-dimensional data set. The data analysis was con-
ducted under three hypothetical scenarios: the first a classical modelling exercise and
the second two data estimation problems. Of these two, the first could be termed an en-
tropy maximizing exercise and the second an information discrimination minimization
one. However all three have also been expressed as log-linear models and furthermore
all three were executed using IPF.
These examples demonstrated a common base for solving the various problems, as
well as showing these principles extend to higher dimensions with little extra effort.
This is at least true formally and technically, although we can see that interpreting
higher order ratios of ratios of odds ratios will become increasingly difficult.
The two estimation scenarios are exceptional in that, unlike real world applications,
we were able to compare the estimates with the observed data. Thus we found what
might seem counter-intuitive, namely that using additional prior information in the
shape of a sample resulted in weaker estimates than assuming a uniform prior. Based
on this example we cannot of course see this as a rule, but rather as a cautionary example. By
investigating its origins it also becomes obvious how one might attempt to avoid it. This
involves so-called subjective prior information and is based on the researcher’s knowl-
edge and experience. Although it can often not be quantified, in this particular case
an investigation of the sample’s second order coefficients and the related higher odds
ratios would have probably resulted in the researcher dismissing the sample prior in-
formation. Unless there is some reason to believe Wales has such dramatically different
gender-LLTI odds ratios, this can only be ascribed to sampling error, and should be
dismissed. Arguably many real-world situations will not have such obviously aberrant
coefficients to allow data to be rejected so easily.
Before dealing with these issues in more depth we first review some more geography-
specific literature to provide a basis for further investigation.
Chapter 7
IPF in Geography —
Applications and Limitations
IPF — be it in the context of log-linear models, entropy maximizing or any of the
classical applications — is a general data analysis methodology, applicable to any and
every type of categorical data; geographical data being no exception. Many of the
applications of IPF that have been mentioned throughout this exposition have in fact
come directly from the field of geography or can be linked to it indirectly simply by
having a spatial dimension.
To simplify only slightly, four separate strands of IPF related literature can be
distinguished in geographical disciplines to parallel the developments described in the
previous sections: (i) log-linear modelling of (geographical) contingency tables (ii) spa-
tial interaction modelling (iii) entropy-maximizing and (iv) classical IPF for estimating
cells in geographical contingency tables. Of course we have already noted that all four
approaches are equivalent, however in practice there seems to be a sharp divide between
modelling and estimation applications.
Modelling (known) cell values in geographical contingency tables is almost invari-
ably done within a log-linear framework, with IPF usually only mentioned in a technical
note, if at all, and with no mention of concepts such as odds ratios or entropy. Spatial
interaction modelling is either explicitly based on Wilson’s gravity model or implicitly
relies on the maximum entropy justification using the balancing methods equivalent to
IPF, and in these contexts log-linear formulations are rare (but see Willekens, 1980).
More general cell estimation applications also tend to lack log-linear formulations and
are either based on the maximum entropy formulation following Wilson (e.g. Johnston
& Hay, 1982, 1983) or are simply estimated using IPF without giving any more justifi-
cation than the fact that it produces consistent results. With a few notable exceptions (e.g.
Rogers et al., 2005), log-linear models are rarely mentioned in these contexts, and then
only as a technical solution for computing IPF in statistical software (e.g. Simpson &
Tranmer, 2005).
Using log-linear models to model cell values — that is to say using IPF to fit a model
and compare the estimated values to the known ones — is not our focus here. These
models were introduced to geographers fairly early (Wrigley, 1980; Upton & Fingleton,
1979; Fingleton, 1981a), and in such applications it is model selection that is the main
issue, with IPF being a purely mechanical means to this end, i.e. the procedure itself
does not introduce any uncertainty or error in such applications.
Spatial interaction modelling and similar applications have already been elaborated
upon in the section on gravity models. Although such applications represent an im-
portant section of the literature, they will not be dealt with more specifically seeing
as they are essentially special cases of the more general entropy maximization class of
applications. The former are restricted to two-dimensional square tables, whereas the
latter encompass all shapes and sizes of tables.
The last two strands of geographic IPF applications are of course distinguished in
terminology only, although in practice they also have slightly different focuses. The
classical IPF literature tends to be framed in a “combining census and survey data”
language, while entropy maximizing tends to focus more on using aggregate data to
analyse variations at lower geographies. These are of course simply two sides of the same
coin, and the differentiation is not entirely fair, as both literatures do usually engage in
both aspects. This was exemplified in an almost too predictable exchange that followed
David Wong’s publication of a paper demonstrating the reliability of IPF, where he
claimed that IPF “has not received much attention from geographers until recently”
and that with few exceptions “ most geographers [..] have not recognized the utility and
potential of IPF” (Wong, 1992, pp.340-1). This evoked a sharp reply from Johnston
and Pattie “introducing the readers to a wider literature than Wong references” of what
“following the innovative lead of Alan Wilson [they] refer to [..] as entropy-maximizing”
(Johnston & Pattie, 1993, p.317). We shall treat the two separately only for a while
longer, as this conveniently allows us to focus on two separate issues that arise in IPF
applications.
David Wong’s paper The Reliability of Using the Iterative Proportional Fitting Pro-
cedure is a classical example of the “combining survey and census data” literature,
exploring in depth the effect of sampling error on the estimates. His test case cannot
be said to be geographical as such, in the sense that he only deals with a two-
dimensional table crosstabulating the age and income of a single population. Thus the
analysis focuses on one of the possible sources of error in IPF estimation: sampling er-
ror, or to phrase it in the language of entropy maximizing, the quality of the prior data.
Although his computations are limited in power, he shows how increasing the accuracy
of the prior information, i.e. increasing the sample size, leads to better estimates; on
the other hand he also finds extreme examples where the prior information is so
poor that better estimates are produced if one assumes a uniform prior, i.e. ignores the
sample data. Of course Wong does not express his results in this manner, and some of
his results seem spurious due to a lack of convergence caused by table sparseness; however,
this paper represents a very precise formulation of one of the issues concerning practical
applications of IPF and deserves further investigation.
The entropy maximizing literature in geography can be attributed almost exclu-
sively to Ron Johnston who has, with a couple of co-authors, published dozens of
articles and a few books using this methodology from the early 80’s onwards (see bib-
liography for a full list of references). These applications almost exclusively deal with
voting geography: estimating either voter transitions or split ticket voting at lower
levels of geography. They are therefore all three-dimensional and geographic in the
sense that one of the dimensions considered is a spatial variable (e.g. regions or con-
stituencies). In some cases these applications also include sampling error of the type
Wong investigated: voter transitions for example are not recorded nationally, but must
be estimated from a survey1. In this case two of the three constraints are known to
be accurate, but the third might suffer low quality2. In other applications all three
constraints are known to be accurate (but for measurement error, which we will ig-
nore here). In this case the estimates do not suffer from any error associated with the
accuracy of the constraints, but might instead suffer error from insufficiency of con-
straints. Their estimates are thus consistent with all known information, however the
information that is unknown is implicitly assumed to be of little consequence. There
is of course no way of knowing whether a second order interaction exists between split
ticket voting and different constituencies, short of casting election ballots publicly.
This possible source of error is not always clearly explicated and in particular in
Johnston’s writing is often obfuscated by language along the lines of ‘the national trend
clearly does not apply locally (as others have claimed), rather regional variations devi-
ate from the national trend (which entropy maximization apparently shows)’. However
this fails to convey explicitly that in using entropy maximization the regional variations
were in fact estimated using the national trend alone, as no other data is available to
constrain the maximization or provide an informative prior. The question is there-
fore: to what extent can entropy maximizing that assumes no second order interaction
properly estimate geographical variation? Is an uninformative prior sufficient to pro-
vide realistic estimates and is it then legitimate to use these estimates for analysing
ecological variation?
It should be noted that neither of these two issues — error related to the accuracy
of constraints and the error arising from insufficient constraints — is intrinsically
geographical. However exploring them in a spatial context should help shed some light
on the issues of ecological inference and ecological fallacy, which still attract controversy
and misinterpretations.

1 Johnston and Pattie do recognize this in at least one paper (1991), where they evaluate the quality
of this constraint by comparing results obtained by using three different surveys, however this evaluation
is not nearly as comprehensive as the simulations performed by Wong.
2 One such example by Berg curiously embodies some of the terminological confusion noted
throughout this text by stating that the national level transition table is estimated from survey data
using the Deming-Stephan proportional fitting algorithm, while the total table is then estimated using the
iterative proportional scaling algorithm, both falling under Johnston's maximum entropy method. If
anything, the first should be referred to as information minimizing and the second entropy maximizing.
This concludes the historical and practical overview of the most important contribu-
tions to the development and application of IPF along with the description of the most
important theoretical frameworks within which it is formalized and applied. All of the
authors mentioned in the schematic chart in Figure 1 on page 3 have been covered,
and although this is the most comprehensive overview of the topic to date, it is quite
possible that there are other inventors that have been missed. Part I concludes with the
introduction of two possible limitations or issues with IPF in geography in particular,
which deserve further examination. This is done in Part III, dealing first with insuffi-
cient prior information and then with sampling error, or the accuracy of the prior information.
But first we must define the methodological aspects of such an analysis and give special
consideration to the geographical nature of the data we will use, which is the focus of
Part II.
Part II
Methodology and Data
Chapter 8
Measures and Methods
8.1 Introduction
In Part III we investigate the performance of IPF under several different scenarios.
These applications require a dataset, metrics for evaluation, and a software implemen-
tation. All of these components and the decisions associated with them are described
in the following sections. The first section describes the Small Area Microdata dataset
based on the 2001 UK Census and defines the subset of the data used in the IPF ap-
plications as well as its preparation. Issues relating to the geographical nature of the
data are dealt with separately in the following chapter.
The issue of metrics is split into two sections, the first dealing with measurement of
association strength, and the second with goodness-of-fit. Although to a certain extent
similar measures can be used for both tasks it is thought conceptually important to
explicitly separate the two tasks. The measures of association strength are applied in
Chapter 9 to investigate geographic variation of variable associations. The goodness-of-
fit measures on the other hand form the main evaluation tool in Part III. As will quickly
become clear it is also impossible to justify any single choice of measure for either task.
Still, both sections try to identify a selection of the most important measures, as well as
identifying some similarities and differences that can hopefully lead to a more informed
employment of them.
Finally, the last section discusses possible software choices and the decision ultimately
adopted, namely to write a function in R. The code for this bespoke programme is
given in Appendix D (page 279).
8.2 Data
The ideal requirement for the dataset to be used in this analysis is that it be a full
population of socio-economic microdata with high geographic resolution. The closest
available equivalent to this comes in the form of the 2001 UK Small Area Microdata
(SAM) (ONS, 2006). The SAM is one of five sets of Samples of Anonymised Records
taken from the 2001 UK Census, which vary by their accessibility, the number of
variables and variable categories, level of geographic detail and sample size. This 5
percent sample comprises 2.96 million individual records with 70 individual, family
and household level variables as well as geographical indicators for Local Authorities
for England and Wales, Council Areas for Scotland and Parliamentary Constituencies
for Northern Ireland. In order to be able to work with a single consistent geographic
hierarchy the dataset was reduced to include only England and Wales1. This also led to
the removal of several variables that were coded specifically for Scotland and Northern
Ireland. The final dataset consists of 2,621,560 individual cases and 57 variables, a list
of which is given in Table 8.1 while a full list of the variables along with their respective
categories and their univariate distributions is given in Appendix A (page 251).
Each entry on this microdata list also has a local authority (LA) identifier. In
the following three chapters one of the main aims is to examine the quality of the
IPF estimates in a distinctly geographical situation. At the same time we wish to
explore the whole range of data structures that might arise from census type data.
We therefore create a working dataset by crosstabulating all possible pairs of the 57
variables against each other with the third dimension being the geography. This creates
a set of (57 × 56)/2 = 1596 three-dimensional tables. The table sizes range from 4-
celled (2 × 2) to 238-celled (14 × 17). Each of these 1596 tables can be said to have
373 layers — the local authorities of England and Wales. A complete list of the LAs
with their county and regional hierarchy as well as their geodemographic classification
(ONS, 2007) are given in Appendix B, page 263.
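A minimal sketch of how this working dataset might be assembled in R follows; the data frame name sam is an assumption, and only four of the 57 variable names of Table 8.1 are listed for brevity. This is not the bespoke code of Appendix D.

# Illustrative construction of the working dataset: every pair of SAM variables is
# crosstabulated with the local authority code as the third dimension.
# 'sam' is assumed to hold the SAM microdata as a data frame.
vars  <- c("acctypa", "agea", "famtypa", "llti")      # in full: all 57 variables of Table 8.1
pairs <- combn(vars, 2, simplify = FALSE)             # 57*56/2 = 1596 pairs for the full list
tabs  <- lapply(pairs, function(p)
  table(sam[[p[1]]], sam[[p[2]]], sam$onscode))       # third dimension: the 373 LAs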
Table 8.1: SAM Variable List (ONS, 2006)
Name Variable Label
1. acctypa Accommodation Type
2. agea Age of Respondents
3. bathwc Use of Bath/Shower/Toilet
4. carsh Cars/Vans Owned or Available for Use
5. cemtyp Type of communal establishment
6. cenheat0 Central Heating
7. ceststat Status in Communal Establishment
8. cobirta Country of Birth
9. densitya No. of Residents per Room
1Due to an unfortunate coding mismatch, the dataset also does not include the merged LAs for the
Isles of Scilly and Penwith in Cornwall. The error stems from the ONS’ uncoordinated recoding of
the area’s ID code in two different datasets, giving it two different names. This meant that when the
SAM dataset and the LA geodemographic classification dataset (ONS, 2007) were merged, this LA got
dropped out. In no way does this invalidate any of the following analysis, and it should only serve as
a reminder that this is a sample dataset that is being used for demonstrative purposes, not to provide
any definitive analysis on the UK.
Table 8.1: (continued)
Name Variable Label
10. distmova Distance of Move for Migrants (km)
11. distwrka Distance to Work
12. econach Economic Activity (last week)
13. ethewa Ethnic Group for England and Wales
14. everwork Ever Worked
15. famtypa Family Type
16. fndepcha Dependent Children in Family
17. freconac Economic Position of Family Reference Person
18. frnssec8 Social-Economic Classification of Family Reference Person
19. frsex Sex of Family Reference Person
20. genind Generation Indicator
21. health General Heath Over the Last Twelve Months
22. hedind Household Education indicator
23. hempind Household Employment indicator
24. hhsgind Household housing indicator
25. hhtlhind Household health & disability indicator
26. hmptpuk Hhd headship (ODPM)
27. hncarers Number of Carers in the Household
28. hnearnra Number of Employed Adults in Household
29. hnllti Number in Household with Limiting Long-term Illness
30. hnprhlth Number of Household Members with Poor Health
31. hnresida Number of Usual Residents in Household
32. hourspwg Hours Worked Weekly
33. hrsocgrd Social Grade of Household Reference Person
34. lastwrka Year Last Worked
35. llti Limiting Long Term Illness
36. lowflora Lowest floor level of household living accommodation
37. marstata Marital Status
38. miginda Migration Indicator
39. migorgn Region of origin
40. occupncy Occupancy Rating of Household
41. profqual Professional Qualification (England and Wales)
42. provcare Number of Hours Care Provided per Week
43. qualvewn Level of Highest Qualifications (Aged 16-74, EWN)
44. relgew Religion (England and Wales)
45. reltohra Relationship to HRP
46. roomsnum Number of Rooms Occupied in Household Space
47. selfcont Accommodation Self-Contained
48. sex Sex
49. stahuka Household with Students Away During Term Time
Table 8.1: (continued)
Name Variable Label
50. student Schoolchild or Student in Full-Time Education
51. supervsr Supervisor/Foreman
52. tenurewa Tenure of Accommodation, England and Wales,
53. termtima Term time Address of Students or Schoolchildren
54. tranwrka Transport to Work, UK (Including to Study in Scotland)
55. workforc Size of Work Force
56. wrkplcea Workplace
57. nssec8 NS-SEC 8 Classes
onscode Local authority
county County
GOR Government Office Region
Subgroups Lowest level ONS Area Classification
Groups Middle level ONS Area Classification
Supergroups Highest level ONS Area Classification
The basic dataset used in this analysis therefore consists of 1596 three-dimensional
tables, each crosstabulating 2,621,560 people in 373 LAs by a different pair of variables.
Two issues arise with regard to this dataset that need to be addressed further: the
independence of the observations and the issue of missing values. Given that the SAM
is a sample of the census population, the question arises as to the independence of the
observations i.e. the probability of clustering of individuals from the same households.
There are in fact two possible answers to this. The first is that it does not actually
matter. Ideally the analysis that follows would use a whole population and not a
sample, but since this is not possible, the SAM is instead treated as if it were a full
population. No attempt is made to make statistical inferences on the population based
on the sample, so in this strict sense it does not matter if the individuals in the SAM
are not perfectly independent. The second answer to the question is that the level
of clustering cannot realistically be very high anyway with an estimated 95% of the
individuals in the sample coming from unique households.2
The second data related issue is the question of missing values. A quick glance
at the variable categories in Appendix A makes it clear that a great majority of the
variables have a not applicable category. There is therefore a temptation to remove
2The expected proportion of individuals that do not represent a unique household in the sample
depends on the precise distribution of the household sizes and the combinatorics of the calculations are
too complex to be attempted here. As a proxy calculation the 1991 Household SAR was taken and 5%
simple random samples were taken repeatedly and the proportion of unique households noted. This
is a sample of just over half a million individuals with an average household size of 2.51. On average
94.59% of the individuals in each sample were from unique households which can be assumed to be the
approximate proportion for SAM as well.
the offending categories and leave only true variable values. Although this might be
desirable if we were dealing with a single variable or variable pair, in this particular
dataset it is not even possible. The ‘-9’ values refer to a whole range of situations:
persons that are not usual residents, residents in communal establishments, students
living away, respondents out of a particular age range... In fact every individual in the
dataset has a minimum of four missing values and the average is almost 19. Not only
would it be impractical to remove these values, keeping them in the dataset actually
has an added advantage: once crosstabulated they often lead to structural zeros. As we
are interested in working on a dataset that has a whole range of realistic distributional
characteristics, these missing value categories – although perhaps not interesting from
a content point of view – are quite indispensable structurally.
8.3 Measuring the strength of bivariate associations
The first set of measures discussed in this section are metrics for the geographical vari-
ation of the bivariate associations. Since we are using a large census-based microdata
dataset it can be assumed that this real-life data will capture a wide spectrum of geo-
graphical variation in the bivariate associations that one might realistically encounter.
On the one hand there should be certain variable combinations that have a relatively
constant relationship across all local authorities, while at the other end there should
be pairs that vary significantly. These measures are used to analyse the geographic
variation of association strength in Section 9.2 in the next chapter.
There are several measures of association that describe the strength of association
between two categorical variables using a single summary value, and all of them have
their weaknesses. Many of them, including the odds ratio, only apply to 2x2 tables.
Other so-called asymmetrical measures require one of the variables to be treated as
dependent and are similarly unfit for our purpose. Three types of measures will be
considered here: (i) Chi-square based measures, (ii) Proportional reduction in error
(PRE) measures and (iii) measures based on the information-theoretic approach.
8.3.1 Chi-square based measures
Pearson’s Chi square statistic measures the strength of an association by comparing
the observed frequencies (x_{ij}) and those expected under independence (x̂_{ij}):

χ² = Σ (x_{ij} − x̂_{ij})² / x̂_{ij}    [8.1]
Because the value of χ² is directly proportional to the size of the population and the
table size, the values are not directly comparable for different tables. Several corrections
have been proposed, of which Cramer's V is perhaps the most popular, since
it scales to values between 0 and 1:
V = √( χ² / (N × min(r − 1, c − 1)) )    [8.2]
There are still issues with Cramer’s V and other Chi square based measures, namely
the fact that there is no intuitive interpretation of their meaning. In the words of
Goodman and Kruskal: “The fact that an excellent test of independence may be based
on χ2does not at all mean that [some function of] χ2is an appropriate measure of
degree of association.” (Goodman & Kruskal, 1954, p. 740). A further practical issue
is that splitting tables into smaller geographical units inevitably results in some tables
where a certain category is completely absent. This leads to the expected values being
zero and hence makes any Chi square based statistic impossible to calculate.
Figure 8.1 juxtaposes the geographic distribution of Cramer’s V for three selected
crosstabulations. They are all drawn on the same scale so as to be directly comparable.
The first panel summarizes the 373 crosstabulations of the variables Family type by
Cars/Vans Owned or Available for Use. The values for Cramer’s V range from about
0.4 to 0.6 across the LAs. The relationship is therefore rather strong and has some
geographical variation. The second crosstabulation exhibits even more geographic vari-
ation with a range from zero to about 0.55 in some LAs. The relationship between
Accommodation Type and Term time Address of Students and Schoolchildren as mea-
sured by Cramer’s V actually has one of the highest levels of geographical variation
in the dataset. The last panel exploring the association between Age and Region of
origin seems to exhibit a relatively weak relationship with less geographic variation.
However the problem with this histogram is that it describes the said relationship for
only 175 local authorities. The reason for this is the fact that it is a relatively large
table with 17 categories for region of origin and 13 age groups. This produces sparse
tables with over half of the local authorities having at least one row or column empty,
making Cramer’s V (or any Chi square based measure) inappropriate for assessing its
geographic variation.
Family type by Cars owned    Accommodation type by Term time address    Age by Region of origin
Figure 8.1: Cramer's V values for three crosstabulations across 373 LAs
An alternative statistic that is asymptotically distributed as χ2, but does not have
any issues with zero counts is the Freeman-Tukey chi square statistic (Bishop et al.,
1975, p.508 ff):
FT² = 4 × Σ (√x_{ij} − √x̂_{ij})²    [8.3]
No division by zero is possible thereby removing the main disadvantage of Pearson’s
Chi square and its derivatives. To produce a measure that is comparable across tables of
different sizes and totals, we can employ Cramer’s formula to get an adjusted Freeman-
Tukey statistic:
adj.FT² = √( FT² / (N × min(r − 1, c − 1)) )    [8.4]
Family type by Cars owned    Accommodation type by Term time address    Age by Region of origin
Figure 8.2: Adjusted Freeman Tukey values for three crosstabulations across 373 LAs
As with Cramer’s V there is an issue with interpretation, however the empty cells is-
sue is sidestepped and a valid measure can be calculated for all tables. Figure 8.2 shows
the distribution of the adjusted F T 2values for the same three bivariate combinations
as before. This time all three histograms have a total frequency of 373.
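For concreteness, a hedged sketch of both measures for a single two-way frequency table is given below; the function name assoc.strength is an illustrative assumption.

# Cramer's V (Equation [8.2]) and the adjusted Freeman-Tukey statistic
# (Equations [8.3] and [8.4]) for an r x c frequency table 'tab'.
assoc.strength <- function(tab) {
  N    <- sum(tab)
  expd <- outer(rowSums(tab), colSums(tab)) / N     # expected counts under independence
  k    <- min(nrow(tab), ncol(tab)) - 1
  chi2 <- sum((tab - expd)^2 / expd)                # NaN if any row or column is empty
  ft2  <- 4 * sum((sqrt(tab) - sqrt(expd))^2)       # no division, so empty cells are harmless
  c(cramers.v = sqrt(chi2 / (N * k)),
    adj.ft    = sqrt(ft2  / (N * k)))
}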
8.3.2 Proportional reduction in error (PRE) measures
Proportional reduction of error (PRE) type measures do not have this problem with
empty rows and columns and have the further attractive nature of allowing us to in-
terpret their values meaningfully. As their name implies, these are measures of the
amount of error in the prediction of one variable that is reduced by knowing the value
of the other variable. The basic lambda measure is asymmetrical — one variable is
dependent (B) and one independent (A):

λ_B = [P(error in B) − P(error in B|A)] / P(error in B)    [8.5]
Therefore λ_B measures the proportionate increase of accuracy in guessing the value
of B given the known value of A as opposed to simply guessing B with no knowledge of
A. If the variables are independent then knowledge of A is not at all helpful in guessing
B, in which case the corresponding value of lambda is 0. Lambda is operationalized3
by Goodman and Kruskal (1954) where the guessing proceeds in the following way:
in predicting B with no knowledge of A, we guess B to be the largest group. So for
example if the modal category of B contains 40 % of the cases, our probability of error
is 60 %, or to state it formally P(error in B) = 1 − p_{+max}. If we know the value of A,
then we predict B to be the group that is largest in that particular row of A. Stated
formally, we define p_{i max} as the proportion of cases in the largest cell in row i and
therefore P(error in B|A) = 1 − Σ_i p_{i max}. The formula can then be stated fully as:
λ_B = [(1 − p_{+max}) − (1 − Σ_i p_{i max})] / (1 − p_{+max})    [8.6]
This refers only to the prediction of B given A and there is an equivalent formulation
for predicting A given B. Importantly, λ_A and λ_B are not the same except in rare
circumstances and in order to make the lambda measure symmetrical i.e. to measure
the prediction in both directions at the same time, both formulas are pooled to give:
λ_AB = (Σ_i p_{i max} + Σ_j p_{max j} − p_{+max} − p_{max+}) / (2 − p_{+max} − p_{max+})    [8.7]
Family type by Cars owned    Accommodation type by Term time address    Age by Region of origin
Figure 8.3: Goodman and Kruskal's lambda values for three crosstabulations across 373 LAs
Again taking the three bivariate combinations from before, Figure 8.3 plots their
lambda values, this time truly including all 373 local authorities. This time the first
3According to the authors their development of the lambda measures is very similar to one given by
Guttman in 1941 (Goodman & Kruskal, 1954, p.742). They go further in developing the symmetrical
lambda as opposed to two directional ones and because of this the literature usually refers to the
measure as Goodman’s or Goodman and Kruskal’s lambda.
Table 8.2: Type of accommodation by Term time address in Bexley (Kent)

                      Living with    Not living        n/a not
                      parent(s)      with parent(s)    resident student    Total
Detached/semi            1236             39                4889            6164
Terraced house            652             21                2385            3058
Flat and other            175             36                1488            1699
n/a in comm. est.           0              0                  54              54
Total                    2063             96                8816           10975
panel shows the most geographic variation in the relationship, while most local author-
ities in the second panel have a lambda value of zero. To understand why this is we
can have a look at one of the local authorities. Table 8.2 tabulates the accommodation
type and term time address variables for Bexley, one of the LAs with a lambda value of
zero (the value of Cramer’s V is 0.08). In guessing the term time address value with no
knowledge of accommodation type one would have to choose “n/a not resident student”
as this is the modal value with over 80 % of respondents. Knowing the accommoda-
tion type does not change this however, as this category is also the modal category for
each row individually (these are in bold typeface). The same applies for guessing the
accommodation type. The modal category in the margin is “detached/semi-detached”,
but it is also the modal category for each column, so knowing the term time address
does not improve the prediction. For these two variables 210 of the 373 LAs faced this
situation where lambda was unable to differentiate between levels of association.
When the maximum cells in each row are all in the same column and vice versa,
knowing the values of one variable does not increase the accuracy of the guess. The
error of prediction is not reduced at all and lambda consequently has a value of zero.
The issue lies in what the measure’s baseline is. Chi square measures use independence
as the baseline, whereas lambda’s baseline is for the variables to be in accord.
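A short illustrative sketch of the symmetrical lambda of Equation [8.7] follows; the function name lambda.sym is an assumption.

# Goodman and Kruskal's symmetrical lambda (Equation [8.7]) from a frequency table.
lambda.sym <- function(tab) {
  p   <- tab / sum(tab)
  num <- sum(apply(p, 1, max)) + sum(apply(p, 2, max)) -
         max(colSums(p)) - max(rowSums(p))
  num / (2 - max(colSums(p)) - max(rowSums(p)))
}

Applied to the Bexley table above it returns zero, since the modal category is the same in every row and in every column.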
8.3.3 The information-theoretic approach
Because the lambda measure only takes into account the modal categories it ignores
information on the entire distribution. This is not an issue for measures based on the
information-theoretic approach, which explicitly use the whole known distribution. The
most popular of these is variously known as Theil’s U, the uncertainty coefficient (UC)
or the entropy coefficient, which is the terminology used here. Technically it is a PRE
measure as the basic framework of Equation [8.5] applies in principle; only instead of
a reduction in prediction error, the reduction of uncertainty as measured by entropy is
used. Again starting with the asymmetrical version of the measure, the prediction of
B depends on the entropy of B's marginal distribution:

H(B) = − Σ_j p_{+j} · log₂ p_{+j}    [8.8]

This uncertainty is reduced if the row value of A is known to be a_i:

H(B|a_i) = − Σ_j (p_{ij}/p_{i+}) · log₂ (p_{ij}/p_{i+})    [8.9]

Averaging across all values of A this gives the conditional entropy of B given A:

H(B|A) = Σ_i p_{i+} ( − Σ_j (p_{ij}/p_{i+}) · log₂ (p_{ij}/p_{i+}) )    [8.10]
This means that the value of H(B|A) will be smaller than that of H(B) unless each
and every row has the same entropy as the marginal total. This can of course only
happen under independence, in which case the uncertainty is not reduced as H(B|A) =
H(B). The reduction of uncertainty in predicting B based on A can therefore be written
as:
U_B = [H(B) − H(B|A)] / H(B)    [8.11]
Because the uncertainty about B given A is equal to the total uncertainty about A and
B reduced by the uncertainty about A — this is the chain rule of entropy:

H(B|A) = H(AB) − H(A)    [8.12]
we can rewrite Equation [8.11] to remove the conditional entropy altogether:
U_B = [H(B) − H(AB) + H(A)] / H(B)    [8.13]
Switching the variables around would give U_A as the reduction of uncertainty that
knowledge of B gives in predicting A. Pooling both measures together gives the sym-
metrical measure that takes into account both directionalities:

U_AB = 2(H(A) + H(B) − H(AB)) / (H(A) + H(B))    [8.14]
The entropy coefficient has none of the weaknesses of the previous measures consid-
ered. It is possible to calculate it for any table, no matter how sparse it is, and it takes
into account all of the marginal information. It is comparable across table and popu-
lation sizes and takes a value between 0 and 1 with zero equivalent to independence.
Its only weakness can be seen in the composite nature of the symmetrical measure4:
while each of the directional measures U_A and U_B has a meaningful interpretation, the
interpretation of U_AB is not so intuitive. In essence it is describing the reduction of
4The same of course applies to the lambda measure.
uncertainty when predicting A given B half the time and B given A the other half of
the time. Despite this shortcoming it has theoretical and practical advantages over the
other measures described.
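A compact illustrative sketch of the symmetrical entropy coefficient follows; the function name entropy.coef is an assumption.

# Symmetrical uncertainty (entropy) coefficient, Equations [8.8] to [8.14].
entropy.coef <- function(tab) {
  p  <- tab / sum(tab)
  H  <- function(q) { q <- q[q > 0]; -sum(q * log2(q)) }   # entropy, skipping empty cells
  ha <- H(rowSums(p)); hb <- H(colSums(p)); hab <- H(as.vector(p))
  2 * (ha + hb - hab) / (ha + hb)                           # U_AB
}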
Family type by Cars owned    Accommodation type by Term time address    Age by Region of origin
Figure 8.4: Entropy coefficient values for three crosstabulations across 373 LAs
The geographic variation of the entropy coefficient values for the same three variable
combinations as before is shown in Figure 8.4. According to these figures, accommoda-
tion type and term time address has the lowest level of association as well as the least
geographic variability across the local authorities. The highest levels of association are
found in the age by region of origin tabulation. The tabulations of accommodation
type and term time address variables tend to have a weaker association, but exhibit
the greatest geographical variation, as measured by the standard deviation.
8.3.4 Choice of association descriptors
Three measures of association have been described, all of which have the required
properties of being summary measures, of being symmetrical (not requiring a directional
relationship between the variables), being applicable to nominal variables and of being
computable for sparse tables. Although many other measures could have been included
it is felt that the adjusted Freeman-Tukey Chi square statistic, symmetrical lambda and
the uncertainty coefficient are representative of the main types of measures available.
On the other hand there is the temptation to settle for a single measure. This has
the obvious advantage of simplifying the analysis both with regard to calculation and
presentation.
Looking only at the histograms for the three variable pairs presented in the previous
section it is clear that the three measures do not always give consistent results. This
is summarized in Table 8.3. For these three tables all three measures give consistent
results about the average strength of the relationship. Looking at the geographical
variation of the strength of association all three measures rank Family type by Cars to
be highest, while the Freeman-Tukey measure is the odd one out in determining the
lowest.

Table 8.3: Ranking of three variable pairs by the three different measures

                                        Average strength            Standard deviation
                                        of association              across LAs
                                   adj.FT²   λ_AB    U_AB        adj.FT²   λ_AB    U_AB
Family type by Cars owned             ◦        ◦       ◦            +        +       +
Acc. type by Term-time address        –        –       –            ◦        –       –
Age by Region of origin               +        +       +            –        ◦       ◦
Instead of limiting the analysis to just three tables, we can look at the whole data
set. Calculating the three measures for all 1596 tables allows us to see how average
association strength varies depending on the measure used (Figure 8.5). The charts
are plotted by calculating the association strength of a pair of variables in each local
authority and then finding the average across all 373 LAs. Figure 8.5 gives an indication
of how disparate the different measures are — had they been measuring the same thing
one would expect the data points to line up along the 45° line.
In general it would seem that the lambda and the uncertainty coefficient are the
most closely related (bottom right panel) with a correlation of r = 0.96. This is
perhaps surprising considering the fact that the lambda is based on a much more
limited account of the variable distribution. The Freeman-Tukey measure stands out
with several conspicuous outliers and its correlations with the other two measures are
lower: r = 0.82 with the entropy coefficient and r = 0.79 with lambda. These graphs
confirm more generally what was found before: that the measures are not congruous
and that selecting just one of them could lead to loss of information.
As there is no theoretical or practical justification for picking any one of the mea-
sures, all three are used in the following analysis. In addition to their straightforward
function as measures of the strength of the bivariate associations, their standard
deviation across the LAs is particularly useful as a proxy for geographical variation
of associations. This means that regardless of the strength of association, a variable
pair that exhibits a large standard deviation of strength across the local authorities
is considered to have greater geographical variability than a pair where the standard
deviation is low. A more in depth analysis of the levels of geographical variation of
association strength as a means of describing the SAM dataset is presented in Section
9.2 in the next chapter.
Axes: Average value of Freeman Tukey statistic; Average value of lambda; Average value of entropy
Figure 8.5: Average strength of association for 1596 pairs measured by the three measures
8.4 Measuring goodness-of-fit
In assessing the goodness-of-fit of the IPF estimates the choice of measure presents sim-
ilar problems as seen in the previous section. Numerous researchers have inadvertently
or deliberately found themselves in situations where different measures of goodness-of-
fit have led to conflicting conclusions. Comparative reviews of various measures are
conclusive only in finding that different measures measure different aspects of fit and
that no single measure can be recommended as the best tool to assess the error for all
situations (see e.g. Knudsen & Fotheringham, 1986; Fotheringham & Knudsen, 1987;
Read & Cressie, 1988; Voas & Williamson, 2001).
Goodness-of-fit statistics provide measures of the discrepancy between the observed
frequencies and those expected under a model. In the case of IPF estimates this is
equivalent to comparing the actual data to the synthetic estimates produced
using IPF. Knudsen & Fotheringham (1986) further distinguish between two purposes
served by goodness-of-fit statistics. On the one hand they can be used for model com-
parison: juxtaposing two or more models to compare how well they replicate observed
data. In addition to this relative approach to goodness-of-fit, we can also speak of an
absolute approach where we are interested in the quality of a single model – or to phrase
it in classical terms: whether the discrepancy produced by the model is statistically
significant or not.
In the context of model comparison a common mistake is to assume that the values
of the statistic are proportional to the level of error. This means a goodness-of-fit
statistic that is twice as large for one model is assumed to mean the model is half as
accurate as the alternative. This is however often not the case and most goodness-of-fit
statistics do not exhibit such a linear relationship with the errors (ibid.).
In the context of hypothesis testing the usual requirement is that the underlying
sampling distribution of the test statistic is known. Thus statistics with unknown
sampling distributions are seen as unfit for significance testing, while statistics that
have e.g. an asymptotically chi-squared distribution are often seen as unproblematic in
this regard. In fact neither of these statements is true, as we shall see; rather, both are
relics of the past.
In addition to this there are many other issues to be taken into consideration when
choosing a goodness-of-fit statistic, many of which are subjective and rely on some
intuitive notion of good fit or are contingent on the application at hand and are linked
to specific types of errors one might wish to avoid. Voas & Williamson (2001) use
seven different criteria by which they compare alternative measures of fit, ranging from
the possibility of extracting cell based components of the measure and of comparing
different (sized) tables, to practical issues such as ease and speed of calculation, and
finally very subjective criteria such as familiarity and intuitiveness. And these are only
general criteria – with any particular application additional criteria such as handling
of sparse tables might come into play.
The following section does not attempt to identify a single ideal statistic, but rather
tries to cover a selection of representative measures that have been used in similar
analyses in the past. All the criteria mentioned above are taken into account except
perhaps the most subjective ones. We start with the most intuitive measures, such
as the Total Absolute Error and Z-scores, which are then linked to Pearson's chi-
squared statistic and finally to less familiar measures from the Power Divergence family
of statistics.
8.4.1 General distance based measures
General distance measures can be seen as ones that are simple functions of the cell
errors x_ij − x̂_ij. Here we consider the Total absolute error (TAE), the Root mean
squared error (RMSE) and their standardized versions, the proportion misclassified and
the standardized RMSE respectively.
Proportion misclassified

The first measure is the simplest and most straightforward: the total absolute error is
defined as:

TAE = Σ_ij |x_ij − x̂_ij|    [8.15]

which is a simple sum of the absolute differences between the cells of the estimates and
the original data. This value is of course affected by the population total N, making
it incomparable between tables.⁵ Taking half of the TAE and dividing it by the total
population gives the proportion misclassified (Cleave et al., 1995):

Δ = TAE / (2·N)    [8.16]
which has the advantage of being comparable across tables of various sizes as well as
being easy to interpret: it represents the smallest proportion of cases that have to be
moved to another cell in order to achieve a perfect fit. This statistic is also sometimes
referred to as the index of dissimilarity, based on the better-known measure proposed
by Corrado Gini for determining the segregation levels of two groups⁶ (Agresti 2002,
p.329-330; Kuha & Firth 2010). Following these authors we denote the statistic with
the Greek letter Delta, but will refer to it by the more descriptive percent misclassified
so as to not confuse it with any segregation indices. A value of zero means the fit is
perfect and a maximum value of Δ = 1 is theoretically possible, indicating that 100%
of cases are in the wrong cell(s).
Despite being a relatively straightforward measure, treating each misclassification
across the table as being equally serious and being standardised to the table and pop-
ulation sizes, Δ is actually not completely comparable across different tables. This is
because the maximum theoretically possible value of Δ is not always 100 percent.
⁵ Some authors use the term Standardized average error (SAE) (Voas & Williamson, 2001) or Relative
number of wrong predictions (RNWP) (Thorsen & Gitlesen, 1998) to refer to TAE/N. These measures
simply standardise TAE for the table size. Unfortunately it seems this approach can lead to some
issues with interpretation. RNWP or SAE can range from 0 to 2 (or from 0% to 200% if the values are
treated as percentages). This is because each person is counted twice: once as missing from the cell
they were supposed to be in and a second time as superfluous in a cell they do not belong in. Reporting
these values can implicitly or even explicitly give the impression that they refer to the proportion of
misclassified persons. Smith et al. (2009) in particular start out by stating that their threshold will be
SAE < 10% and then continue to refer to their results as 'percentage misclassified', which should quite
possibly be half of the values they are reporting.
⁶ The segregation index of dissimilarity is formally equivalent to the measure defined in Equation
[8.16], only traditionally expressed using proportions rather than counts:

D = Σ_i |p_Ai − p_Bi| / 2    [8.17]

where p_Ai is the proportion of group A that live in area i and p_Bi that of group B. If the proportions of
both groups are the same in all locations the error is zero. Analogously with percentage misclassified,
Gini's segregation index can be interpreted as the percent of people that would have to move to another
area in order to achieve a completely even – unsegregated – distribution of both groups.
In fact, in model testing of the kind proposed here it can never be 100%. Using a uniform
(no interaction) model as the worst possible model⁷ – any other model will by definition
have better fit – we can establish the theoretical maximum value of Δ in the following
way. In a uniform i.e. no interaction model, all cell estimates are equal:

x̂_ij = N / (I·J)    [8.18]
The maximum number of misclassifications will occur if all the cases are in a single
cell, so that x_11 = N and x_ij = 0 for all other cells. Then the proportion
misclassified is

Δ_max = (N − N/(I·J)) / (2·N) + ((I·J − 1)·N/(I·J)) / (2·N)    [8.19]

Δ_max = 1 − 1/(I·J)    [8.20]
where the first term refers to the error in the first cell and the second term to the errors
in the remaining I·J − 1 cells. So instead of 1, the maximum value of Δ is smaller
than one by the proportion of cases in one cell. Thus in a – not impossible – case of
a 2 × 2 table where all the observed cases are in one cell, Δ_max = 1 − 1/4 = 0.75. In
this example, the uniform model with 25% of cases in each cell can in the worst case
scenario lead to 75% of the cases being misclassified, but never more than that.
If we extend this logic to more complicated models we find that the worst possible
misclassification occurs if all the observed cases are in the cell with the smallest expected
frequency. We can therefore write the generalization of Equation [8.20] as:

Δ_max = 1 − min(x̂_ij) / N    [8.21]
To what extent this could present an issue is not clear. For large tables the propor-
tion of cases in the smallest cell will be relatively small, reducing the range of possible
Δ values by only a small amount. At the same time it should be kept in mind that
for any reasonably well fitting model the values of Δ obtained will be at the lower
end of the spectrum; so, to give an example, the difference between 2% misclassified
and 3% misclassified is still very telling, despite the fact that one might have a possi-
ble maximum of 95% and the other 98%.
⁷ The concept of worst possible model should be seen in the context of IPF: of course there are
always worse possible models where it is possible to achieve 100 percent misclassification, but such a model
would never occur in the IPF setting. The models here start out with no constraints (i.e. a uniform
or no interaction model) and any additional constraints will by definition improve the estimate. The
maximum applies to the overall misclassification of a table. In individual segments of the table (e.g.
cells, columns, layers) it is possible to achieve 100 percent misclassification, but overall the maximum
is always going to be lower.
Standardizing the Δ statistic to the full [0, 1] range would therefore achieve little with
regard to information content, but would sacrifice the elegant interpretation of Δ as
the percent of cases in the wrong cells.⁸
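As a minimal illustrative sketch (the function and object names here are invented and are not those of the bespoke code in Appendix D), Equations [8.15], [8.16] and [8.21] translate directly into R:

prop.misclassified <- function(obs, est) {
  tae <- sum(abs(obs - est))       # total absolute error, Equation [8.15]
  tae / (2 * sum(obs))             # proportion misclassified, Equation [8.16]
}
delta.max <- function(est) 1 - min(est) / sum(est)   # Equation [8.21]

# 2 x 2 example from the text: all 100 observed cases in one cell,
# uniform (no interaction) estimates of 25 per cell
obs <- matrix(c(100, 0, 0, 0), nrow = 2)
est <- matrix(25, nrow = 2, ncol = 2)
prop.misclassified(obs, est)       # 0.75
delta.max(est)                     # 1 - 25/100 = 0.75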
Standardized root mean squared error
Another commonly advocated approach is to use the Root mean squared error or
RMSE. This is another distance statistic, this time taking the square of the error
instead of the simple distance. Taking the square gives a disproportionately large weight
to larger errors compared to small errors. The measure is standardized by table size to
allow values to be compared between tables with different numbers of cells:

RMSE = √( Σ_ij (x_ij − x̂_ij)² / (I·J) )    [8.22]
Since RMSE is only standardized by I·J, its range will depend on the average
cell count. In order to be able to compare tables that not only have different numbers
of cells but different entry totals, the RMSE can be divided by the mean cell size to
account for this (Pitfield, 1978). This gives the standardized RMSE or SRMSE:

SRMSE = RMSE / (N/(I·J))    [8.23]
A complete fit will produce a SRMSE value of zero, while a value of one indicates
the average error is equal to the average cell size. Of course values larger than one are
possible for particularly poorly fitting tables, due to the fact that the errors are squared.
According to Knudsen and Fotheringham (1986), who examined the sensitivity of eight
types of goodness-of-fit statistics, SRMSE was found to be the most accurate, in the
sense that its relationship with the accuracy of the table is the most linear. Despite
this, there are issues with the SRMSE as well. For one, taking the square of the error
– despite having a long and venerable history – is quite arbitrary. There is no reason
why a different exponent might not be used, e.g. 1.5 for a less dramatic accentuation of
large errors.
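Again as a minimal sketch with illustrative names, Equations [8.22] and [8.23] can be computed along the same lines, assuming obs and est are arrays of equal dimensions:

srmse <- function(obs, est) {
  k    <- length(obs)                    # number of cells, I x J (x ...)
  rmse <- sqrt(sum((obs - est)^2) / k)   # Equation [8.22]
  rmse / (sum(obs) / k)                  # Equation [8.23]: RMSE over mean cell size
}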
Both SRMSE and Δ have the advantages of being simple to calculate and relatively
familiar and easy to understand. The individual contributions of each cell can easily
be established and the values are standardized so as to be comparable between different
sized tables – tables with different totals as well as different numbers of cells. They
are also insensitive to empty cells and can be calculated for sparse tables with no
adjustments. Their sampling distributions are unknown, which means their use for
significance testing is more complicated (computationally intensive but not impossible,
as we shall see).
⁸ It is worth noting that the above derivation of Δ_max applies only to the statistic being calculated
for the whole modelled table – regardless of the number of dimensions. If however the proportion
misclassified is being calculated only for one part of the table – e.g. for each local authority layer
separately – then Δ could theoretically be larger, although still not 1.
Their main disadvantage lies in the fact that errors of the same size are
treated the same way regardless of the actual value of a cell. Thus a single misclassified
person in a cell with an actual value of 4 (25%) is treated with the same weight as a
misclassification in a cell with 1000 people (0.1%).
8.4.2 Z-scores
Contrary to simple distance-based measures, Z-scores or standardised scores treat errors
relative to the cell sizes. Instead of taking a simple function of the difference between
observed and expected values, this difference is divided by the standard error of that
observation:

Z = (X − E(X)) / SE(X)    [8.24]
The larger the cell, the larger the standard error of that cell, which gives propor-
tionally less weight to the observed error than if it had occurred in a smaller cell. The
concept of standardised scores is again a familiar one, although its use as a goodness-
of-fit measure is less common (but see Birkin & Clarke, 1988; Williamson et al., 1998;
Voas & Williamson, 2001). The attractiveness of the Z-score lies in the fact that if we
assume the errors are normally distributed, then Z-scores have a unit normal distribu-
tion (with a mean of zero and a standard deviation of one) and therefore by definition
the sum of squared Z-scores has a chi-squared distribution. There are however a few
approximations involved along the way, which is why we derive the Z² statistic from
scratch.
We start out by considering the estimated table as a multinomial probability model:
each cell has a probability p̂_i attached to it (where i = 1, 2, .., k), and all the probabili-
ties sum up to 1. We have observed N individuals, who fall exclusively and exhaustively
into one of the k cells.⁹ The question is therefore how well the estimated probabil-
ities conform with the observed values.
As the multinomial distribution is simply an extension of the binomial distribution,
each individual cell has a binomial probability distribution. Each individual has the
probability p̂_i of being in cell i and (1 − p̂_i) of not being in that cell. The
expected value of a cell is then:

E(X_i) = x̂_i = N·p̂_i    [8.25]

The standard error, by definition, is the square root of the expected value of the
square of the errors. For one individual we can write:

SE(n_1) = √( E((n_1 − E(n_1))²) )    [8.26]
⁹ The dimensionality of the table is irrelevant here; we simply consider each cell to be a category,
regardless of the number of attributes represented by it.
Starting out with a single individual (n_1), this person has the probability p̂ that they will
fall into one specific cell, e.g. single with two cars. For this person p̂ is the expected
value, although the observed value can only be 1 or 0: they either are single with two
cars or they are not. If they fall into this category, the error will be (1 − p̂). On
the other hand the probability of them not falling into this category is (1 − p̂). If this
happens, we observe a value of 0 when the expectation was p̂, so the error is (0 − p̂).
In the first case the expected value of the square of the error is therefore p̂ × (1 − p̂)²
and in the second it is (1 − p̂) × (0 − p̂)², making the standard error as defined above:

SE(n_1) = √( p̂·(1 − p̂)² + (1 − p̂)·(0 − p̂)² )    [8.27]
        = √( p̂·(1 − p̂) )    [8.28]

Assuming all the N individuals are independent,¹⁰ the standard error is then simply
the square root of the sum of the squared errors of all N individuals:

SE(X_i) = √( N·p̂_i·(1 − p̂_i) )    [8.29]
In this manner we can calculate the expected values and their standard errors based
on a set of probabilities and the total number of people to be categorized in the table.
Table 8.4 shows a set of hypothetical expected probabilities (p̂_i) in the first column,
the expected frequencies based on N = 1000 in the second column and the standard
errors calculated using Equation [8.29] in the third. The fourth column gives a set of
observed values to be evaluated.
To evaluate the degree of the discrepancy between the observed and estimated values
we could calculate the exact binomial probability. If we take the example of the last cell,
where p̂_i = 0.005 and N = 1000: the expected frequency is therefore 5 but the observed
frequency is 2. The question is therefore: “how unlikely is it to observe a frequency of
2 or less?” The probability of observing exactly 2 people is:

P(x_1 = 2) = C(1000, 2) · 0.005² · 0.995⁹⁹⁸
           = [1000! / (2!·998!)] · 0.005² · 0.995⁹⁹⁸
           = 0.0839
Then the same has to be calculated for the probability of observing exactly 1 person,
¹⁰ This assumption underlying the binomial distribution can often be found to be violated, e.g. certain
characteristics of individuals will correlate within households; we can also expect geographical areas to
exhibit clustering. To what extent this is a problem is unclear, although it should be noted that for
the same reasons the exact same assumption applies to Pearson's X² as well, where it is generally not
seen as an issue.
Table 8.4: Step by step calculation of z²

  p̂_i     x̂_i   SE(x̂_i)    x_i      z_i   P(z_i) (%)  P_exact (%)    z²_i      χ²
  0.05     50     6.89      40     -1.45      7.34        8.06        2.11     2.00
  0.05     50     6.89      60      1.45      7.34        8.67        2.11     2.00
  0.1     100     9.49      90     -1.05     14.59       15.82        1.11     1.00
  0.1     100     9.49     110      1.05     14.59       15.83        1.11     1.00
  0.2     200    12.65     185     -1.22     11.09       12.54        1.49     1.12
  0.02     20     4.43      11     -2.03      2.10        2.04        4.13     4.05
  0.02     20     4.43      29      2.03      2.10        3.28        4.13     4.05
  0.25    250    13.69     270      1.09     13.66       14.49        1.20     0.90
  0.07     70     8.07      70      0.00    100.00       95.06        0.00     0.00
  0.135   135    10.97     138      0.27     39.07       40.44        0.08     0.07
  0.005     5     2.24       2     -1.34      8.93       12.40        1.81     1.80
  1.00   1000             1000                                       19.28    17.99
and observing none. The sum of these probabilities is then:

P(x_1 ≤ 2) = Σ_{a=0}^{2} C(1000, a) · 0.005^a · 0.995^(1000−a)

P(x_1 ≤ 2) = 0.1240
So, enumerating all the possible combinations and permutations of 1000 individuals,
assuming there is a 0.005 chance of being in the last cell, the chance of observing 2 or
fewer people in that cell is 12.40%. This can become a cumbersome and computationally
hungry task for larger sample sizes, which is why the normal approximation of the
binomial can be used instead (Rahman, 1968, pp. 328-32). This is done by calculating
the standardized score or z-score of an observed value, which transforms the observed
value into a normally distributed variable with a mean of 0 and a standard deviation of 1. The
z-score is calculated by subtracting the expected value and dividing by the standard
error:

Z = (X − E(X)) / SE(X)    [8.30]

Since we have calculated the expected values and standard errors of the estimated
multinomial distribution, we can calculate the standardized values for each individual
cell:

Z_i = (x_i − N·p̂_i) / √(N·p̂_i·(1 − p̂_i)) = (x_i − x̂_i) / √(x̂_i·(1 − x̂_i/N))    [8.31]
Figure 8.6: Binomial distribution for p = 0.005 and N = 1000 and normal approximation
The z-scores can be easily interpreted. The last cell, for example, has an expected
frequency of 5, but an observed frequency of 2:

Z_1 = (2 − 5) / √( 5·(1 − 0.005) )    [8.32]

Z_1 = −1.345    [8.33]
The discrepancy is divided by the standard error to give a z-score of −1.35, indicating
that the observed value is over one standard error away from the expected value. On the
unit normal distribution the probability of such a result is P(Z_1 ≤ −1.345) = 0.0893.
The normal approximation finds a probability of 8.93% which is to be compared with
the exact probability found using the binomial (12.40%). There are two reasons for
this discrepancy. One obvious one is the fact that it is an approximation. The second
reason is that a binomial distribution is a discrete probability distribution, whereas the
normal distribution is continuous. To take this into account we can apply a continuity
correction (Yates, 1934).
The logic behind this is best explained with reference to Figure 8.6. The vertical
lines represent the (discrete) binomial distribution and the sum of the three leftmost
values is the exact binomial probability of a frequency of 2 or less (12.40%). The
curve represents the normal approximation of the binomial and as such is continuous.
The value of 8.93% calculated above corresponds to the area shaded in red. But this
leaves the area between the values of 2 and 3 unaccounted for: if we had used the same
method to calculate P(x1≥3) both probabilities would sum up to less than one! We
can overcome this by moving the observed frequency towards the mean by 0.5 - this
103
is indicated in the Figure green shaded area. This “smoothing out” of the discrete
distribution then results in the following normal approximation where 0.5 is added if
the observed value is below the expected (xi<ˆxi) or subtracted if the observation is
larger than the expectation(xi>ˆxi):
Zi=xi−ˆxi±0.5
qˆxi·(1 −ˆxi
N)[8.34]
Using Equation [8.34] with our example we find the adjusted z-score to be −1.121 and
the corresponding probability P(Z_1 ≤ −1.121) = 0.1312. The normal approximation of
the binomial probability of the frequency being two or less is 13.12%, which is slightly
more than the exact probability. Again it is easy to see why from Figure 8.6: while
the normal distribution is perfectly symmetrical, the binomial is positively skewed.
Observations that are higher than the expected value – towards the right of the graph
– will therefore be underestimated, while observations lower than the expectation –
towards the left of the graph – will be overestimated, as our example showed.
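The worked example for this last cell can be reproduced with a few lines of R, using the exact binomial, the plain normal approximation and the continuity-corrected version; the sketch below assumes nothing beyond the values quoted above (12.40%, 8.93% and 13.12%).

N <- 1000; p <- 0.005; x <- 2
xhat <- N * p                           # expected frequency: 5
se   <- sqrt(xhat * (1 - xhat / N))     # standard error, Equation [8.29]

pbinom(x, N, p)                         # exact binomial:             0.1240
pnorm((x - xhat) / se)                  # z = -1.345, no correction:  0.0893
pnorm((x - xhat + 0.5) / se)            # z = -1.121, corrected:      0.1312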
The importance of the skew of the binomial distribution for relatively small proba-
bilities can be demonstrated with another example from Table 8.4. The first two cells
both have an expected frequency of 50 and they both have an absolute error |x_i − x̂_i| = 10. Con-
sequently they both have the same z-score and are judged as being equally (un)likely.
However their exact probabilities calculated using the binomial are not equal: the
chance of a frequency that is 10 or more lower than expected is 8.06%, while the chance
of a frequency that is at least 10 higher than expected is 8.67%. It is clear from the
third and fourth cells that as the expected probability becomes larger, the skew becomes
less and less noticeable.
The central limit theorem states that by increasing the number of ‘repetitions’
i.e. increasing N, the binomial becomes more and more smooth and as N → ∞ it
becomes normal. This occurs faster the larger the expected probability as this makes
the binomial distribution more symmetrical. However for very small probabilities this
approximation is much slower to converge which is why the usual rule of thumb is to
not use it for expected frequencies that are smaller than 5.
Assuming the conditions for normal approximation are reasonably met,¹¹ using z-
scores is particularly attractive because it allows a familiar and easily interpreted way of
assessing the degree of error in each individual cell. Furthermore, since we are dealing
with unit normal variables, the square of such a variable has the χ² probability distri-
bution with one degree of freedom and the sum of k squared z-scores is distributed
approximately χ² with k degrees of freedom. For a table we can therefore define the
sum of squared z-scores:
¹¹ In this case we assume the p̂_i·N are large enough for the binomial to be symmetrical and that N
is also large enough to make the smoothing to the continuous normal distribution negligible.
Z = Σ Z_i² = Σ [ (x_i − x̂_i) / √(x̂_i·(1 − x̂_i/N)) ]² = Σ (x_i − x̂_i)² / (x̂_i·(1 − x̂_i/N))    [8.35]
On closer observation, this formula is very familiar. It is in fact the same as Pear-
son's X² with an added (1 − p̂_i) in the denominator:

Z = Σ (x_i − x̂_i)² / (x̂_i·(1 − p̂_i))    [8.36]

so each summand of Z equals the corresponding summand of X² multiplied by 1/(1 − p̂_i).
The final two columns in Table 8.4 compare the z²_i and X² values for the individual cells
and the table as a whole. The numbers confirm what can be deduced from Equation
[8.36]: the Z²_i values are consistently larger than X². We can also see that the smaller
the expected probability, the less difference there is between the two statistics. For a
large table with many small probabilities, (1 − p̂_i) approaches unity and therefore Z
approaches Pearson's X².
8.4.3 Pearson's X² and the Power Divergence Family of Statistics
Despite being probably the most widely used goodness-of-fit statistic, the logic behind
Pearson's X² is rarely explained in an accessible manner. In this section we attempt
to correct this by following on from the above derivation of z-scores as the normal
approximation of the binomial, and the result that followed showing that the sum of
squared z-scores has a chi-squared limiting distribution with k degrees of freedom. This
allows a comprehensive derivation of Pearson's X², which will then be further expanded
by framing it within the Power Divergence family of statistics.
The systematic and elaborated exposition attempted here is in particularly stark
contrast to Pearson's original 1900 article which produced the now classic formula:¹²

X² = Σ (x_i − x̂_i)² / x̂_i    [8.37]
More than 80 years later Plackett (1983) makes a valiant attempt at 'translating'
Pearson's work – and to do so must stumble through allusions, archaic terminology,
omissions and errors that make the derivation of Pearson's chi-squared statistic so
inaccessible. Pearson for example never mentions multinomial probabilities, nor does
he give any explanation of the approximation through the normal distribution. He
talks of the 'probability of the observed system' in the case of an even number of
categories and of the system's 'improbability' when the number of categories is odd
(sic). Furthermore, at the time of his writing, the concept of degrees of freedom had
not yet been discovered, so the question of whether or not he got them right is moot
at best (Stigler, 1999, chapter 19).
¹² In order to distinguish between the statistic and the distribution we use the Latin X² for the former
and the Greek χ² for the latter.
In modern notation, his derivation is based on the property of certain quadratic
forms having chi-squared limiting distributions. In matrix notation Pearson (1900) defines
the X² statistic as:

X² = x′V⁻¹x    [8.38]

where x is a vector of errors and V is the covariance matrix of the supposedly normally
distributed variables. To paraphrase Plackett: three pages of algebra later Pearson
arrives at the classic textbook formula (1983, p.63). The full derivation is not given
here. Instead we give the simple example of a two-category case that extends by
analogy.
analogy.
We consider a single binomial random variable X1∼Bin(N, p1). Then let p2=
1−p1which defines X2=N−X1. In a “two cell” situation such as this one it is clear
that one only needs to assess one probability (p1) as the other follows automatically.
If for example X1is the event of a die toss coming up with 6 points – a probability we
would expect to be p1= 1/6 – then X2is the event that any other number of points
are thrown. Then any measure of the degree of discrepancy – is the die performing to
our expectations? – needs to only look at the observed occurrences of X1. Or to put it
in other words: there is only one degree of freedom.
Using the normal approximation (Equation [8.31]) we can define Z², which we know
has a χ² distribution with one degree of freedom:

Z² = (X_1 − N·p̂_1)² / (N·p̂_1·(1 − p̂_1))    [8.39]
This can then be rewritten by expanding the numerator:

Z² = [ (X_1 − N·p̂_1)²·(1 − p̂_1) + (X_1 − N·p̂_1)²·p̂_1 ] / (N·p̂_1·(1 − p̂_1))

Z² = [ (X_1 − N·p̂_1)²·(1 − p̂_1) + (N − X_2 − N·p̂_1)²·p̂_1 ] / (N·p̂_1·(1 − p̂_1))

Z² = (X_1 − N·p̂_1)²·(1 − p̂_1) / (N·p̂_1·(1 − p̂_1)) + (N − X_2 − N·(1 − p̂_2))²·p̂_1 / (N·p̂_1·(1 − p̂_1))

Z² = (X_1 − N·p̂_1)² / (N·p̂_1) + (N − X_2 − N + N·p̂_2)² / (N·(1 − p̂_1))

Z² = (X_1 − N·p̂_1)² / (N·p̂_1) + (X_2 − N·p̂_2)² / (N·p̂_2)    [8.40]
which is of course easily recognisable as Pearson's chi-squared statistic for k = 2 and
has, as noted above, one degree of freedom. This result can be generalised to all k > 1.
Regardless of the number of categories, the last, kth category is determined by the
preceding k − 1 categories and hence an analogous derivation to the above will result in
a test statistic with an approximately chi-squared distribution and k − 1 degrees of
freedom (Hogg & Craig, 1978, p.269-72).
Figure 8.7: Comparison of X² and Z² p-values (probability densities of the χ² distributions with k = 10 and k = 11 degrees of freedom; X² gives p = 0.055, Z² gives p = 0.056)
The difference compared to the above definition of the sum of Z² statistic (Equation
[8.35]) is in the removal of this one degree of freedom, which is, in the case of X²,
superfluous. Thus we can return to the example from Table 8.4 (page 102) and compare
the Z² and Pearson's X² statistics. There are 11 categories in this example so the Z²
is asymptotically χ² with k = 11 degrees of freedom, while Pearson's X² has k = 10.
The Z² value from the last row is 19.28 and is – as was shown in Equation [8.36] –
larger than the X² value of 17.99. Figure 8.7 plots the probability densities of both χ²
distributions as well as the locations of the respective values of the test statistics. The
shaded area represents the p-values: the probability of a configuration with an even
more extreme value of the test statistic, given that the expected probabilities are true.
In this case both statistics give similar results: the chance of getting this distribution
given the assumed theoretical distribution is 5.6% or 5.5%, depending on the statistic.¹³
The main reason for focusing on Z² in this context is the fact that it is conceptually
closely related to Pearson's X². It has already been mentioned, however, that Z² is not
a common goodness-of-fit measure, and historically the likelihood ratio test statistic
(G²) has probably been the main competitor to X² (Read & Cressie, 1988, p.133).
In addition to G², several other statistics have been proposed, such as the already
mentioned Freeman-Tukey statistic (Equation [8.3]) and the minimum discrimination
statistic, all of which have the same asymptotic chi-squared distribution. Cressie &
Read (1984) show that all of these statistics are in fact members of the power-divergence
family of goodness-of-fit statistics.
¹³ Although it follows from Equation [8.36] that Z² is always larger than X², it does not follow that
the p-value for Z² is also always smaller than the p-value for X² (as this example shows). Because
their asymptotic distributions have different degrees of freedom, the relative size of the statistics is
no indication of the relative size of the p-values and hence it cannot be concluded that Z² is more
conservative (cf. Voas & Williamson, 2001, p.186).
A substantial addition to the theory of goodness-of-fit statistics, the power-divergence
statistic is defined as:

2nI^λ = [ 2 / (λ(λ + 1)) ] · Σ_{i=1}^{k} x_i [ (x_i/x̂_i)^λ − 1 ]    [8.41]
where the λ parameter is a real number (−∞ < λ < ∞). In the cases of λ = 0 and
λ = −1 Equation [8.41] does not have a solution, so there the statistic is defined in the
limits (λ → 0 and λ → −1). It is easy to show that Pearson's X² and all the other
above-mentioned statistics are members of the power divergence family with specific λ
values (Read & Cressie, 1988, p.16):

λ = 1       Pearson's X²                        2nI¹ = Σ (x_i − x̂_i)² / x̂_i
λ = 0       loglikelihood ratio statistic G²    2nI⁰ = 2·Σ x_i·log(x_i/x̂_i)
λ = −1/2    Freeman-Tukey statistic FT²         2nI^(−1/2) = 4·Σ (√x_i − √x̂_i)²
λ = −1      minimum discrimination statistic I  2nI^(−1) = 2·Σ x̂_i·log(x̂_i/x_i)
λ = −2      Neyman modified X² statistic NM²    2nI^(−2) = Σ (x̂_i − x_i)² / x_i
In this way Equation [8.41] represents a convenient generalization of several common
measures.¹⁴
Furthermore, Cressie and Read (1984) prove that not only these five, but all power-
divergence statistics – regardless of the value of the λ parameter – converge to a χ²
distribution. This is an important point as it gives a new perspective to the
arguments about which measure is the correct one. In this sense the logic behind Pear-
son's X², whereby the statistic is based on the square of the difference, is rightly seen as
arbitrary. Instead of e.g. comparing Pearson's X² and the loglikelihood ratio G² as two
separate statistics, we can consider the whole spectrum of intermediate possibilities as
well.
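To illustrate the point, the whole family can be computed with one short R function; the sketch below (names illustrative, not the code used for the analysis) assumes strictly positive observed and expected counts and handles the two limiting cases separately.

power.divergence <- function(obs, est, lambda) {
  if (lambda == 0)  return(2 * sum(obs * log(obs / est)))   # G2, the lambda -> 0 limit
  if (lambda == -1) return(2 * sum(est * log(est / obs)))   # minimum discrimination, lambda -> -1
  2 / (lambda * (lambda + 1)) * sum(obs * ((obs / est)^lambda - 1))
}

# e.g. sweep a range of lambda values for a given pair of tables and
# compare the corresponding p-values on k - 1 degrees of freedom:
# stats <- sapply(seq(-2, 3, by = 0.5), power.divergence, obs = obs, est = est)
# pchisq(stats, df = length(obs) - 1, lower.tail = FALSE)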
Returning to the example from Table 8.4 (page 102) we can calculate the value of
the power-divergence statistic for a range of λ values. The results are shown on the left-
hand side of Figure 8.8. When λ = 1 the power-divergence statistic equals 17.99, which
is of course the previously obtained value of Pearson's X². The loglikelihood ratio
statistic G² (λ = 0) has a higher value of 18.87, while the Freeman-Tukey (λ = −1/2),
the minimum discrimination (λ = −1) and the Neyman modified X² (λ = −2) statistics
have higher values still. To the other side there are λ values that result in the statistic
having a lower value than Pearson's (the minimum is at approximately λ = 2.38) before
increasing again.
¹⁴ It should be noted though, that all these equivalences hold only for the actual measures themselves,
not for their individual cell components or summands. Thus the power divergence statistic where λ = 1
equals Pearson's X², but if we calculate each individual cell's contribution to the total, these will not
be the same. The Cressie-Read family of measures are so-called signed measures, which means that
individual components of the error may be negative, although their sum will always remain positive.
In order to make this distinction clear we start by expanding Pearson's X² as follows:

X² = Σ (x_i − x̂_i)² / x̂_i
   = N · Σ (p_i − p̂_i)² / p̂_i
   = N · Σ ( p_i²/p̂_i − 2·p_i + p̂_i )
   = N · ( Σ p_i²/p̂_i − Σ p_i − Σ p_i + Σ p̂_i )    [8.42]

Because the probabilities in a table will always sum up to one, the last two terms in Equation [8.42]
are therefore −Σ p_i + Σ p̂_i = −1 + 1 = 0 and can be removed. If the expression is then contracted
again:

X² = N · ( Σ p_i²/p̂_i − Σ p_i )    [8.43]
   = N · Σ ( p_i²/p̂_i − p_i )
   = N · Σ p_i ( p_i/p̂_i − 1 )
   = Σ x_i ( x_i/x̂_i − 1 )

we get the Cressie-Read statistic for λ = 1 (see Equation [8.41]). The two terms that were removed
from X² effectively complete the square, i.e. make all the Pearson X² summands positive. Without
them though, the Cressie-Read statistic summands can also be negative. So although Equations [8.42]
and [8.43] are equal, their summands, i.e. the individual contributions of each cell, are not.
   λ      2nI^λ
 -10.0   529.60
  -5.0    45.02
  -3.0    27.09
  -2.0    22.97
  -1.5    21.55
  -1.0    20.42
  -0.5    19.54
   0.0    18.87
   0.5    18.36
   1.0    17.99
   1.5    17.75
   2.0    17.62
   3.0    17.67
   5.0    18.87
  10.0    30.64

Figure 8.8: Power-divergence statistic for various λ values and corresponding p-values
All of these statistics are asymptotically χ² distributed with
10 degrees of freedom. If we were to use p = 0.05 as an acceptable significance level
then the critical value for the statistics is 18.31. So using Pearson's X² we would not
be able to reject the null hypothesis. In fact if we used any power-divergence statistic
with 0.56 < λ < 4.33 the p-value would be above 0.05 (see shaded area of graph in
Figure 8.8) and we would be unable to reject the null hypothesis. Statistics with λ
values outside this range, and this includes the other four statistics mentioned above,
would on the other hand lead us to reject the null hypothesis.¹⁵
So which statistic should one use? In order to examine how and why they behave
differently, we need to remove the element of approximation and look at the exact values
of the limiting distribution. Before we turn to that it is worth noting that Cressie and
Read recommend the choice of λ = 2/3 as generally the most preferable of options
(1984, p.463), defining what we will call the Cressie-Read statistic as:

CR² = 2nI^(2/3) = (9/5) · Σ_{i=1}^{k} x_i [ (x_i/x̂_i)^(2/3) − 1 ]    [8.44]
To the question of why the results from different power-divergence statistics are appar-
ently inconsistent, the answer is that they are measuring different things. In particular,
Read and Cressie find that choosing the right value for λ can guard against specific
types of extreme behaviour (1988, p. 80; Basu et al. 2002, p. 383):
¹⁵ It is worth noting that this situation, where Pearson's X² is the least conservative of the five, is
an artefact of this dataset and should in no way be seen as a rule.
• large values of λ are more sensitive to outliers – cells which have a higher frequency
than predicted, i.e. where the ratio of observed/expected is large;
• small values of λ are more sensitive to inliers – cells which have a lower frequency
than predicted, i.e. where the ratio of observed/expected is near-zero.
This also means that we can determine which type of cells are dominating the
power-divergence statistics by observing the speed at which the values of the statistics
grow (Read & Cressie, 1988, 88-97). In our example we can see from the table in Figure
8.8 that the statistic grows faster for negative values of λ than for positive ones. This
means the cells are dominated by an inlier – a cell with a small ratio of x_i/x̂_i. Looking
back at the original data we can see that the most dramatic ratio of observed/expected
is found in the last cell, with an expected value of 5 and an observed frequency of 2.
The choice of λ therefore depends on the importance we wish to attach to either large
or small ratios of observed to expected frequencies. The larger |λ|, the more weight will
be given to the most extreme outliers/inliers, which means the lack of fit of a single
cell will come to dominate the statistic.
The authors also find that using λ values larger than 5 or smaller than −5 is not
advisable as the increase in power becomes small while at the same time the χ² approx-
imation becomes worse. Lastly, empty cells in large sparse tables mean that power-
divergence statistics for λ ≤ −1 are undefined due to x_i = 0 being in the denominator,
so they advise against using values smaller than −1 in sparse tables.
8.4.4 Exact Permutation Distributions of Goodness-of-fit Statistics
Except for the general distance measures described at the beginning of this section
all the statistics that have been mentioned rely on the normal approximation of the
multinomial. As we have seen, even in the simple case of a binomial distribution
calculating the exact underlying probabilities is a computationally rather intensive
procedure, at least compared to a quick look-up of the normal probabilities. In the
case of the multinomial distribution, the procedure is even more complex and resource
hungry. This is of course the driving principle for statistics such as Pearson’s chi-
squared as they allow quick estimation of goodness-of-fit.
Because of the central limit theorem these approximations are accurate enough for
large values of N to outweigh any minimal benefits of extra precision that could be
achieved at the cost of computational resources. But with contemporary processing
power this is no longer true for small samples, where calculating the exact multinomial
probabilities can often give significantly different results than the approximations. Fur-
thermore, there is no other way to compare the behaviour of different measures except
by comparing them to the values they should actually be giving.
The exact binomial test does not scale up as simply as one might hope. This has in
fact been a source of much confusion and contention in the literature (Radlow & Alf,
1975; Read & Cressie, 1988, p. 136 ff.). In the binomial case above we followed a
simple principle: (i) find all the possible configurations that are even more discrepant
than the one observed and (ii) sum up their probabilities. The crucial point lies in the
fact that more discrepant does not necessarily mean less likely. The goal of all these
goodness-of-fit measures is to establish how unlikely the discrepancy is, and not how
likely it is that something even more unlikely were to happen. This is best demonstrated
with a simple example.
We consider a three-cell example with expected probabilities p̂_1 = 0.2, p̂_2 = 0.3
and p̂_3 = 0.5. In order to be able to enumerate all the possibilities we take the total
number of observations to be N = 4 only. This gives us 15 possible outcomes – there
are 15 different ways in which 4 people can be divided between 3 cells (all are given in
Table 8.5). Say we observe x_1 = 0, x_2 = 0 and x_3 = 4, i.e. all four people are in the last
category. We can calculate Pearson's X², which gives a value of 4 and, with two degrees
of freedom, a p-value of 0.1353. So given that the null hypothesis is true, Pearson's
X² is interpreted as there being a 13.53 percent probability of getting an even more
extreme result.¹⁶
But what is the exact probability of getting a more extreme result? We first need
to calculate the probabilities of each of the 15 possible outcomes, given that the null
hypothesis is true. We use the multinomial formula:

P(x_i | p̂_i) = N! · Π_i ( p̂_i^{x_i} / x_i! )    [8.45]

So if the p̂_i values are true, the probability of our observation is:

P(x_i | p̂_i) = 4! · (0.2⁰ · 0.3⁰ · 0.5⁴) / (0!·0!·4!) = 0.0625    [8.46]
After calculating all 15 probabilities, the next step is to calculate the corresponding
Pearson's X² statistic for each possibility and rank them from smallest to largest.
Table 8.5 provides this ordered result. The observation 0 0 4 is the eighth most extreme
result: the first seven possibilities all have lower values of Pearson's X². The remaining
seven outcomes below are all more extreme than our observation. It should be noted
that the ordering of the X² values is not the same as the ordering of the probabilities!
Our observation is in fact more likely to occur than outcome 7, but has a larger
discrepancy from the expectation as measured by Pearson's X². In order to get the
exact p-value we therefore have to sum up all the probabilities for cases 8 to 15 in the
table. This gives a value of 0.195, to be compared to 0.1353 from the χ²
approximation, indicating that Pearson's X² was overly conservative in this case.
¹⁶ A more extreme result means a result with a higher value of Pearson's X² statistic, not a less likely
result!
Table 8.5: Calculation of exact p-value (N = 4, k = 3, p̂_1 = 0.2, p̂_2 = 0.3)

        x_1   x_2   x_3   P(x_i | p̂_i)     X²
  1.     1     1     2       0.1800       0.08
  2.     1     2     1       0.1080       1.08
  3.     0     1     3       0.1500       1.33
  4.     0     2     2       0.1350       1.33
  5.     1     0     3       0.1000       1.75
  6.     2     1     1       0.0720       2.33
  7.     2     0     2       0.0600       3.00
  8.     0     0     4       0.0625       4.00
  9.     0     3     1       0.0540       4.00
 10.     2     2     0       0.0216       4.33
 11.     1     3     0       0.0216       4.75
 12.     3     0     1       0.0160       7.75
 13.     3     1     0       0.0096       8.08
 14.     0     4     0       0.0081       9.33
 15.     4     0     0       0.0016      16.00
The same calculation can be repeated for any goodness-of-fit statistic. For example,
using Cressie and Read's preferred λ = 2/3, our observation would have a p-value of
0.1207. However the exact p-value is 0.141. This happened because according to CR²
the ninth outcome (0 3 1) has a lower value than (0 0 4) and overtook our observation
in the ranking, reducing the p-value by 0.054. This makes sense as we know that lower
λ values give more weight to inliers and our observation has two cells with the most
extreme observed/expected ratio of zero.
The same calculation can also be repeated for Z², where the p-value from the χ²
approximation (3 degrees of freedom) is 0.0815 and the exact probability is also 0.141,
again making it an overly conservative estimate.
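The exact calculation for this example is easily reproduced in R by enumerating all 15 outcomes; the sketch below uses Pearson's X² as the ordering statistic and recovers the approximate (0.1353) and exact (0.195) p-values quoted above. The object names are illustrative only.

phat <- c(0.2, 0.3, 0.5); N <- 4
obs  <- c(0, 0, 4)

# all ways of distributing 4 people over 3 cells
outcomes <- expand.grid(x1 = 0:N, x2 = 0:N, x3 = 0:N)
outcomes <- outcomes[rowSums(outcomes) == N, ]             # 15 rows

pearson <- function(x) sum((x - N * phat)^2 / (N * phat))  # X2 for one outcome
probs   <- apply(outcomes, 1, dmultinom, prob = phat)      # multinomial probabilities
stats   <- apply(outcomes, 1, pearson)

pchisq(pearson(obs), df = 2, lower.tail = FALSE)   # chi-squared approximation: 0.1353
sum(probs[stats >= pearson(obs)])                  # exact p-value:             0.195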
8.4.5 The importance of significance
The choice of goodness-of-fit statistics for a particular application is notoriously diffi-
cult, not least due to the number of options available. A further consideration that is
specific to this application is the requirement that the statistics be comparable between
tables of different sizes, both in terms of numbers of cells (i.e. degrees of freedom) and
in terms of total population (N). This is not usually an issue since in most cases re-
searchers are comparing goodness-of-fit for a single table and various models or at most
for several populations but still for a single table.
In the case of the simple distance measures TAE and RMSE this is solved by
standardizing the measures to produce the proportion misclassified (Δ) and the stan-
dardized RMSE respectively.¹⁷ In the case of Pearson's chi-squared and the rest of the
power divergence family of statistics, all of which have the same asymptotic chi-squared
distribution, the standardization is achieved by calculating the p-value.
The p-value was defined in Section 8.4.3 as the probability of a configuration with
an even more extreme value of the test statistic, given that the expected probabilities
are true.¹⁸ It has been known for a long time that in practice, when dealing with large
data sets, p-values can end up being of little use. Over 70 years ago Joseph Berkson
made the following observation (1938, p.526):
I believe that an observant statistician who has had any considerable ex-
perience with applying the chi-square test repeatedly will agree with my
statement that, as a matter of observation, when the numbers in the data
are quite large, the P’s tend to come out small. Having observed this, and
on reflection, I make the following dogmatic statement, referring for illus-
tration to the normal curve: "If the normal curve is fitted to a body of data
representing any real observations whatever of quantities in the physical
world, then if the number of observations is extremely large – for instance, on
the order of 200,000 – the chi-square P will be small beyond any usual limit
of significance."
The fact is that the p-value is partly a function of the number of observations, which
means that for large samples the null hypothesis will generally be rejected, i.e. the p-
values will tend to be close to zero and below whichever significance level is chosen. But
what this also means is that, as a way of standardizing goodness-of-fit across tables, the
p-value will not have a lot of power of discrimination. In fact, in a preliminary test of
IPF estimates using the simplest (i.e. worst) model all the tables had p-values smaller
than 2.22 × 10⁻¹⁶, which is the smallest value that R reports accurately.¹⁹
For all practical intents and purposes then, the p-values of these tables are all zero.
This simply means that if the null model (the no interaction model in this case) were
correct, it would be virtually impossible to observe the values that have been observed,
as the probability of such an event happening is quite negligible. But the extremely
high significance of such a result is of little use as it was never expected for the model
to fit perfectly anyway – the question really is how poorly it fits.
¹⁷ Although see Section 8.4.1 for some caveats with regard to the actual comparability of Δ between
tables.
¹⁸ Many authors have pointed out the commonly held but erroneous beliefs that the p-value in any
way indicates the probability of the hypothesis being true, the improbability of the observed results
being due to error, the degree of faith that the results are real and many similar examples of wishful
thinking (see e.g. Bakan, 1966; Gigerenzer, 2004). Or to quote Bakan's view on the test of significance:
"a great deal of mischief has been associated with its use" (ibid. p. 423).
¹⁹ This is the so-called machine epsilon – the smallest number that when added to one gives something
different to one, i.e. the smallest difference between two numbers that the machine recognises (Press
et al., 1992, p.882).
Furthermore, the p-values are only telling us that all the tables fit extremely poorly, while offering no way
to discriminate between the levels of lack of fit.
Another way of standardizing the power divergence statistics is therefore needed.
Very little is to be found in the literature on solving this problem. In Discrete
Multivariate Analysis, the authoritative work on categorical data analysis, the authors
describe the situation where one wishes to compare chi-squared values for tables with large
but different values of N, in which case they suggest dividing the statistic by N itself
(Bishop et al., 1975, p. 330). This is however of no help when dealing with tables of
different sizes.
A more useful suggestion comes from the literature on modelling in psychology,
or rather psychometrics. In structural equation modelling the concept of relative or
normal chi-squared is often used to describe the ratio of Pearson's chi-squared statistic
and the degrees of freedom. Originally proposed by Wheaton et al. (1977, p.99), this
use of X²/df as a measure of fit introduces the distinction between statistical
measures such as X² and practical measures such as X²/df (Byrne, 1991). Since df
is the mean of the asymptotic chi-squared distribution, a ratio value of 1
would indicate average fit. Wheaton et al. (ibid.) describe its value as giving a rough
indication of "fit per degree of freedom". There is however no real consensus on the
interpretation of this rather informal measure and researchers may treat anything lower
than 2 and up to 5 as a plausible fit (Munro 2005, p.346; Mueller 1996, p.83-4).
Furthermore there is little theoretical backing for this practical measure, as its use
seems to stem simply from the data not conforming to proposed models under statistical
measures of fit.²⁰ To demonstrate just how rough this measure is, Figure 8.9 compares
the densities of a selection of χ² distributions (top panel) with the densities of those
same distributions divided by their respective degrees of freedom (bottom panel). It is
clear that χ²/df aligns all the distributions around a mean of one, but they still have
vastly different standard deviations, making their values quite incomparable. Thus
critical values for p = 0.01 range from 1.20 for k = 300 to 1.87 for k = 20 (indicated in
red) instead of being similar.
A second option is to use the central limit theorem, whereby χ², being a sum of k
random variables, becomes approximately normal as k increases – with a mean of k
and a standard deviation of √(2k). For small values of k the distributions are still quite
skewed, but for k > 50 the approximation is considered good (Box et al., 1978, p.118).
This means we can transform χ² values into a standard normal variable:
²⁰ It has been suggested that the logic behind using X²/df is to do with reducing the effect of sample
size on chi-squared (Carmines & McIver, 1981, p.80). This is in contrast with the original Wheaton et
al. article, where they judge a ratio of 5 as reasonable "[f]or our sample size", making it clear the new
measure only allows the relative (and rough) comparison of models of different complexity, but should
be interpreted differently depending on the sample size.
Figure 8.9: Selected χ² distributions and densities of χ²/df (for k = 20, 50, 100, 200, 300; critical values χ²_0.01/df indicated)
(χ² − df) / √(2·df) ≈ N(0, 1)    [8.47]
Other transformations have been suggested that are more precise and reduce the
skew of chi-squared distributions. The first is R.A. Fisher's approximation, originally
used to calculate critical values that were not available in published tables (Fisher,
1950(1925), p.81):

√(2χ²) − √(2·df − 1) ≈ N(0, 1)    [8.48]
An even closer approximation was offered by Wilson and Hilferty (1931), this time
taking the cubic root of χ²:

[ (χ²/df)^(1/3) − 1 + 2/(9·df) ] / √( 2/(9·df) ) ≈ N(0, 1)    [8.49]
Figure 8.10 compares the three normalizations for the same χ² distributions as
before (k = 20, 50, 100, 200, 300). The standard normal distribution is superimposed
in red and it is clear that the central limit asymptotic approximation (top) offers the
least correction of skew, with the Fisher normalization (middle) already significantly
better and the Wilson & Hilferty one (bottom) clearly best.
As before, the critical values for p = 0.01 are also shown for all 5 distributions, this
time enlarged for clarity (right side of Figure 8.10). Again visual inspection shows the
Wilson & Hilferty normalisation is the best as all five values are closest together – and
closest to the correct value shown in red.
Figure 8.10: Three normalizations of selected χ² distributions (left) and their respective
critical values for p = 0.01 (right)
Adopting the notation Z_{χ²|df} for Equation [8.49] allows us to express any value of a chi-squared variable with any number of degrees
of freedom as a standard unit normal or z-score. For example the critical values for
the selected degrees of freedom range from Z_{χ²_0.01|20} = 2.323731 to Z_{χ²_0.01|300} = 2.326056
(bottom right panel). So using a standard normal approximation of Z_{χ²|df}, these values
correspond to p = 0.01006996 and p = 0.01000778 respectively.²¹
The Z_{χ²|df} values – like any z-score values – can be transformed into p-values.
But the crucial point is that they do not have to be. A computer with a machine
epsilon of 2.22 × 10⁻¹⁶ cannot distinguish between the p-value of a z-score of 9 and a
z-score of 10: in both cases the p-value is smaller than 2.22 × 10⁻¹⁶, so the programme
cannot differentiate them. But we do not need to transform them, as both values are
already directly comparable as unit normal variables: it is clear that being 10 standard
deviations from the mean represents a worse fit than being 9 standard deviations away.
Using the Wilson-Hilferty normal approximation of chi-squared allows us to measure
the lack of fit with any of the power divergence statistics, regardless of how bad the fit
is.
²¹ Other, even more precise, normal approximations of chi-squared have been proposed (e.g. Peizer
& Pratt (1968); Canal (2005)). However the formulas for these are significantly more complex and it is
deemed the Wilson-Hilferty approximation offers sufficient accuracy while being relatively straightfor-
ward to compute, and thus appropriate for this application. Its accuracy (maximum absolute error) is
to 5 decimal places for degrees of freedom above 100 and this accuracy only increases for larger values
(Canal, 2005, p.806). The smallest number of degrees of freedom in our application is 372.
Bypassing p-values in this way means there is no confusion about the significance
of the fit when we are not in fact testing a hypothesis. At the same time normalization
also has the advantage that the metric – number of standard deviations from the mean
– is still a familiar one.
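The normalization itself is a one-line transformation. As a minimal R sketch (the function name is invented), with the critical values quoted above serving as a check:

z.chisq <- function(chisq, df) {    # Wilson-Hilferty, Equation [8.49]
  ((chisq / df)^(1/3) - 1 + 2 / (9 * df)) / sqrt(2 / (9 * df))
}

z.chisq(qchisq(0.99, df = 20),  df = 20)    # 2.3237..., cf. qnorm(0.99) = 2.3263
z.chisq(qchisq(0.99, df = 300), df = 300)   # 2.3261...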
8.5 Software Solutions
Any statistical software that has a function to calculate log-linear models automatically
includes some sort of IPF functionality. Its use with standard software is however
usually not straightforward and does not allow for the sort of flexibility anticipated
here. The issues and limitations of existing software solutions are discussed in this
section along with the final decision to write bespoke code in the R programming
language. The main special features of the function are described here, while the full
code with liberal commentary can be found in Appendix D.
If SPSS uses IPF to calculate the fit of log-linear models,²² why can it not be used
for IPF applications in general? It can, but not directly. When we say general IPF
applications we mean situations where the data is incomplete – only some of the margins
are available, or some may even be sampled. In log-linear modelling practice, on the
other hand, the whole table is always available and the modelling proceeds by a process
of elimination. Because of this, log-linear functions in standard statistical software will
normally require a full table as an input, in addition to selected constraints to be fitted.
But this means that if the same function is to be used for IPF when only some of the
margins are known, a fake table needs to be created first. This table has no effect
on the final result and serves almost as a decoy, tricking the function into running without
error. An implementation of this idea is described by Simpson & Tranmer (2005)
who give detailed instructions on how to create “any set of values that sum to the
marginal subtotals”, which is then used as an input in the SPSS commands GENLOG
or HILOGLINEAR (p.230).
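The same trick can be played in R with the base loglin() function, which fits hierarchical log-linear models by IPF. The sketch below is illustrative only (the margins and seed are invented): any array whose margins match the known ones serves as the decoy table, while the real seed is passed through the start argument.

rowA <- c(60, 40)                  # known margin [A]
colB <- c(30, 50, 20)              # known margin [B]
seed <- matrix(1:6, nrow = 2)      # seed carrying the interaction structure to preserve

fake <- outer(rowA, colB) / sum(rowA)        # any table consistent with both margins
fit  <- loglin(fake, margin = list(1, 2),    # constrain margins [A] and [B] ...
               start = seed, fit = TRUE,     # ... starting from the seed
               eps = 1e-8, iter = 50, print = FALSE)$fit
rowSums(fit); colSums(fit)                   # reproduce rowA and colB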
If the software is available this is indeed a clever trick to use and it can be scaled
up to as many dimensions as necessary. A similar solution could be applied to any
other log-linear function. But the above solution does involve an extra step, simply
because it uses a function not originally intended for the purpose. It was therefore felt
that a purpose-written function would be more appropriate. This furthermore allowed
the inclusion of extra functionality, in particular the option of combining margins from
different sources, i.e. defining a set of constraints as original and another set that are
either sampled or from another source. The function then first adjusts these two sets
of constraints among each other, thereby making the sampled ones consistent with the
original ones, and then adjusts the whole table to the complete set of constraints.
²² To be precise, IPF is used to calculate the maximum likelihood estimates in the HILOGLINEAR
command, while the GENLOG command uses the Newton-Raphson algorithm (SPSS, 1995-2010), but
regardless of the internal implementation the results are the same.
READ list of constraints and optional table seed
READ or USE DEFAULT maximum number of iterations and maximum error

WHILE error is too large AND iterations haven't reached maximum
    REPEAT for each constraint on the list
        Find ratio of given constraint and actual marginal total
        Multiply whole table with said ratio
    UNTIL you've gone through all the constraints
    Calculate error and add +1 to iteration count
END WHILE

WRITE final IPF-ed table consistent with seed and list of constraints
Figure 8.11: Pseudocode for IPF kernel
In addition to the above-mentioned misgivings, the experience of daily work with
several million records in an SPSS file on a standard desktop led to the decision to
overcome a dearth of previous programming experience and learn and use the R pro-
gramming language (R Development Core Team, 2011). The result was an IPF function
that can operate as a simple IPF function without the requirement to input a bogus
table, but can also automatically handle constraints from different sources.
The kernel of the code is of course the implementation of IPF itself. The framework
for this is based on a function written by Stabler & Gregor (2003), but was extended
significantly to make the procedure more general. The original code, used internally by
the Oregon Department of Transportation, only allowed for one-dimensional margins,
although curiously there were no limits on the dimensions of the table. The code was
rewritten and elaborated to allow not only for higher level margins or constraints, but
also for constraints to be of different levels.23 This places an extra requirement on the
user, who must, in addition to specifying a list of constraints, also supply a list naming
the constraints - a map of sorts. The pseudocode for this is given in Figure 8.11 which
corresponds to the R code in Appendix D, lines 135 to 165.
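The kernel can also be sketched in a few lines of R. The version below is a simplified illustration, not the Appendix D code itself: it assumes one-dimensional constraints supplied in the same order as the dimensions of the seed array, whereas the full function additionally handles higher-order and sampled constraints.

    ## Minimal IPF kernel corresponding to Figure 8.11 (a sketch only).
    ## 'constraints' is a list whose d-th element is the target margin for
    ## dimension d of the seed array.
    ipf.kernel <- function(seed, constraints, max.iter = 100, max.error = 1e-6) {
      tab <- seed
      for (iteration in seq_len(max.iter)) {
        for (d in seq_along(constraints)) {
          current <- apply(tab, d, sum)                      # actual marginal total
          ratio <- ifelse(current > 0, constraints[[d]] / current, 0)
          tab <- sweep(tab, d, ratio, "*")                   # multiply whole table by the ratio
        }
        error <- max(unlist(lapply(seq_along(constraints), function(d)
          abs(apply(tab, d, sum) - constraints[[d]]))))      # worst remaining discrepancy
        if (error < max.error) break
      }
      tab
    }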
The main enhancement to the code however was providing an option to allow for
sampled constraints to be input directly. This means the IPF function gains one level
of recursion: inside it the sampled constraint(s) become the seed and get adjusted to
the remaining non-sampled constraints. Returning to the top level all the constraints
– sampled and non-sampled – can be applied to the whole table. To give an exam-
ple: a four-dimensional table is sought that conforms with known margins [AG], [BG]
23That function also had the condition that no marginal values be zero, which might be a reasonable
requirement to make if using only one-dimensional margins although even then there must surely be
situations when the condition is not met. As soon as the dimensions of the constraints are allowed to
increase, zero values must also be allowed, which is the case in the code used here.
and [CG], but in addition to these three marginal tables a sample of [ABC] was also
available. The R function would first run IPF using [A], [B] and [C] as constraints and
[ABC] as the seed, thereby making the sample consistent with the known data and the
correct total size. Then this updated [ABC] constraint is used in a second round of IPF,
using all four margins [AG], [BG], [CG] and [ABC] to find the maximum likelihood
estimate of the table that is consistent with all the known data.
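A hedged sketch of that two-stage logic is given below, using the simplified kernel above for the first stage and a hypothetical ipf.general() (with a hypothetical constraint.map argument) to stand in for the Appendix D function in the second; ag, bg, cg and abc.sample are placeholder objects, not real SAM tables.

    ## Stage 1: adjust the sampled [ABC] table to the one-dimensional margins
    ## implied by the known tables [AG], [BG] and [CG].
    a.margin <- apply(ag, 1, sum)          # margin [A] from the A x G table
    b.margin <- apply(bg, 1, sum)          # margin [B]
    c.margin <- apply(cg, 1, sum)          # margin [C]
    abc <- ipf.kernel(seed = abc.sample,
                      constraints = list(a.margin, b.margin, c.margin))

    ## Stage 2: fit the full A x B x C x G table to all four constraints; this
    ## needs the more general Appendix D function, since the constraints are of
    ## different dimensionality.
    full <- ipf.general(seed = array(1, dim = c(dim(abc), ncol(ag))),
                        constraints = list(ag, bg, cg, abc),
                        constraint.map = list(c("A", "G"), c("B", "G"),
                                              c("C", "G"), c("A", "B", "C")))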
The function allows for more than one sampled constraint to be used at once, but
the condition is that they do not overlap. This limitation can be overcome if necessary,
by running the function on each of the samples separately and using the output of these
runs - the updated constraints - as the input in the final run of IPF. The code could
undoubtedly be more elegant and efficient, and have more foolproof error handling,
but it is effective, has been thoroughly tested and it provides a comprehensive and
flexible tool that is to our knowledge not available elsewhere. The fully documented
programme is presented in Appendix D.
8.6 Summary
This chapter described the dataset, metrics and software implementation for the IPF
applications that follow. Together with the next chapter, which presents a detailed
portrait of the dataset, its levels of geographic variability and potential issues associated
with the data, it represents an in-depth overview of the methodological considerations
taken in the course of this thesis.
The first section of this chapter introduced the data. Striking an acceptable bal-
ance between meeting the ideal requirements and actually being accessible is the 2001
UK Small Area Microdata, a 5% sample of the UK population as enumerated in the
2001 census. The dataset, or rather the subset of it that was used, and the 57 vari-
ables that had been selected were described, with a full list along with their univariate
distributions provided as an appendix (Appendix A, pages 251 ff.). Issues relating
to the independence of individual cases and of missing values in many variables were
considered, and in both cases the decision was made that they do not affect the analysis
or can even be seen as an asset. The section also briefly described the creation
of a comprehensive set of bivariate crosstabulations, 1596 in total, which form the
basic working dataset. Also considered was their local authority dimension, although
more detail is presented in the next chapter, which focuses on the levels of geographic
variation present in the data.
This variation can however only be meaningfully discussed once the strength of
bivariate relationships is operationalized in some way. This provided the methodological
focus of Section 8.3, which discussed a series of metrics for measuring the strength of
bivariate associations. As with the goodness-of-fit measures, which occupy the section
that follows, these two sections go into considerable depth in discussing the requirements
for the measures and their behaviour with the actual dataset. Three types of measures
were considered in turn: chi-square based ones, proportional reduction in error measures
and finally the information theoretic approach. Their characteristics were explored
using examples from the SAM, taking into account issues such as empty cells and
comparability between tables.
Three representative measures were chosen: an adjusted Freeman-Tukey chi-square
statistic, Goodman and Kruskal’s symmetrical lambda measure and the entropy coef-
ficient. All of the measures have the required properties, but give sufficiently disparate
results to make their contribution interesting. Beyond the strength of the bivariate
associations the next chapter gives particular attention to the geographic variation of
said strength and uses all three measures, or rather their standard deviations across
local authorities, as a way to assess it.
Although related, goodness-of-fit statistics merit their own investigation, which is
attempted in Section 8.4. No evaluation of IPF results is possible without a measure
of discrepancy, but the definition of what a discrepancy is and what metric should be
applied turns out to be just as slippery as the issue of association strength. The section’s
narrative starts with simple distance based measures such as TAE and SRMSE, which
are contrasted with measures where the relative size of the error is important such
as Z-scores. This allowed us to slowly build up the measures' complexity until we
derived Pearson’s chi-square statistic, which was then presented in the framework of
Cressie and Read’s power divergence family of statistics.
With full derivations, and where possible supplemented by graphical demonstration
of the concepts, this review is intended to be comprehensive and accessible in a way
that is often found lacking in basic statistical texts. This approach again reinforces
the idea that there is no one correct measure. Instead the decision was made to use
both percent misclassified (∆) and the Cressie-Read statistic (λ = 2/3) as the two main
metrics to evaluate IPF error, although the log-likelihood ratio statistic (λ = 0) re-
emerges as a useful measure later on because of its additivity. As with all the power
divergence statistics this however leaves the problem of making them comparable across
different(-sized) tables. Although using p-values is a common solution, the extreme
discrepancies anticipated here and computing limitations make p-values unsuitable for
standardizing. This does not seem to be a situation often encountered in the literature,
but a satisfactory solution was nonetheless found by using the Wilson-Hilferty normal
approximation of chi-square. This effectively (and efficiently) standardizes the value of
the chi-square based statistic and allows the Cressie-Read statistic to be used on tables
with different degrees of freedom while remaining directly comparable.
Last but not least is the technical aspect of our methodology: the programmatic
solution that was chosen instead of using standard software. Section 8.5 described
the alternatives and the rationale for the decision to take up R and write a function
specifically for the applications used in this thesis. Although not yet a standard in the
social sciences, R is an incredibly powerful, cross-platform and open source software
environment as well as being at the core of a vibrant and growing community. Despite
being purpose-written for this application, an IPF function that can handle sampled
constraints as this one can is general enough that others should find use for it as well.
This concludes the methodological section in the strict sense of the word, but before
we move on to the IPF applications the next chapter explores the SAM dataset in more
depth. The chapter is partly devoted to the question of the levels of geographic variation
of the variable relationships that can be found in the SAM using the measures described
above. It also provides an excellent playground for exploring some of the topics that
inevitably turn up when geographically aggregated data are analysed: the modifiable
areal unit problem, the ecological fallacy and Simpson’s paradox. Although the explicit
focus of this examination is spatial, many of the lessons learned apply equally to any
n-dimensional relationship found within the SAM.
Chapter 9
Small Area Microdata And
Geographic Variation
9.1 Introduction
A chapter dedicated to the description of the Small Area Microdata (SAM) and in par-
ticular the geographical variation of bivariate associations exhibited by the population
in question presents a unique opportunity to also describe and investigate some of the
statistical issues that are endemic to geographic data. This allows an overview of issues
such as the modifiable areal unit problem (MAUP), the ecological fallacy and Simpson's
paradox and their empirical demonstration using SAM.
One main distinguishing characteristic of the dataset of crosstabulations used here is
that we are dealing with bivariate associations between categorical variables. Issues to
do with MAUP, the ecological fallacy and Simpson’s paradox are traditionally explored
in the context of continuous variables and their correlation and regression coefficients
in particular. This is however not necessary. The MAUP in particular is not limited
in any sense to the type of variables or choice of statistics. The ecological fallacy
and Simpson’s paradox on the other hand cannot occur in categorical data unless the
variables are dichotomous, which is why in Section 9.3 the dataset is dichotomized to
allow their investigation.
The first section of this chapter is the most straightforward in its description of
the 1596 tabulations of the SAM data, the strengths of the associations produced as
measured by various measures and their level of geographic variation. This leads into
a demonstration of the MAUP, before the next section devoted to the interrelated
problems of the ecological fallacy and Simpson’s paradox. The chapter is concluded by
a systematic overview of all three issues and how they relate to each other.
9.2 Levels of Geographic Variation
The first part of the analysis is a descriptive overview of the strengths of associations
exhibited in the 1596 tables and the level of geographic variation across the 373 Local
Authorities. These 373 areas can be aggregated using two types of aggregation: one
geographic and one geo-demographic. The former is into 10 government office regions,
while the latter groups them into 7 Supergroups or clusters according to the ONS 2001
Area Classification1. As this analysis is performed on SAM data it gives an excellent
and comprehensive insight into the variation of relationships between variables, while
in addition exploring how this variation is expressed at various levels of geographical
aggregation as well as within different classification groups using an alternative non-
geographical aggregation.
Three different measures, discussed in Section 8.3 in the previous chapter are used
to measure the strength of the associations between the 1596 possible bivariate com-
binations: (i) the adjusted Freeman-Tukey statistic (adj.FT2) as an example of a chi-
squared measure, (ii) Goodman and Kruskal's lambda (λAB) as a proportional reduction
in error measure and (iii) the entropy coefficient (UAB) as an information-theoretic
measure. These three measures are first used to identify tabulations which exhibit high
degrees of geographic variation. Next we use them to explore how they are affected by
the modifiable areal unit problem.
9.2.1 Variation of Association Strength between Local Authorities
The three measures of association strength that were discussed in Section 8.3 were
shown to produce rather disparate results (e.g. Figure 8.5 on page 95). We will therefore
use all three in the following analysis summarizing the variation of association strength
across the 373 local authorities in our dataset. This is done by taking each of the
1596 variable pairs and calculating the association strength between them in each local
authority. The degree of geographic variation is then measured using the standard
deviation of the association strength across the 373 LAs. This is repeated for each of
the three measures.
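As an illustration, the calculation for a single variable pair might look like the sketch below, assuming a placeholder list la.tables holding its crosstabulation in each of the 373 local authorities and a placeholder function entropy.coef() implementing one of the Section 8.3 measures.

    ## Geographic variation of association strength for one variable pair:
    ## measure the association in every local authority, then take the
    ## standard deviation of those 373 values.
    strength.by.la <- sapply(la.tables, entropy.coef)   # one value per local authority
    sd(strength.by.la)                                  # the degree of geographic variation

Looping this over all 1596 variable pairs and the three measures produces the rankings summarized in Table 9.1.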
Table 9.1 summarises the results of this descriptive analysis of the geographic vari-
ation of association strength. The top part of the table lists the top ten variable
combinations that exhibit the largest variation in association strength. The great ma-
jority of the variables involved are household variables. Furthermore, except for a few
variables relating to communal establishments, most of them are in some way proxy
measures of quality of accommodation. Only one of the variables included on the list is
an individual level variable, namely generation. Comparing the three measures we can
1A full list of the local authorities and their corresponding counties, regions and ONS geodemo-
graphic classifications is presented in Appendix B.
Table 9.1: Highest and lowest levels of variation of association strength
Top 10 highest standard deviations
    Freeman-Tukey (St.Dev.) | Lambda (St.Dev.) | Entropy Coefficient (St.Dev.)
1.  central heating — housing indicator (0.118) | type of c. est. — self-contained (0.243) | type of c. est. — self-contained (0.205)
2.  housing indicator — occupancy rating (0.104) | type of c. est. — status in c.est. (0.230) | type of c. est. — status in c.est. (0.200)
3.  bath/shower/toilet — self-contained (0.086) | bath/shower/toilet — type of c. est. (0.227) | bath/shower/toilet — type of c. est. (0.189)
4.  central heating — self-contained (0.086) | central heating — self-contained (0.165) | central heating — self-contained (0.136)
5.  no. of rooms — self-contained (0.086) | central heating — status in c.est. (0.164) | central heating — status in c.est. (0.135)
6.  no. of residents — self-contained (0.085) | bath/shower/toilet — central heating (0.163) | bath/shower/toilet — central heating (0.133)
7.  lowest floor — self-contained (0.085) | central heating — housing indicator (0.158) | central heating — housing indicator (0.128)
8.  generation — self-contained (0.085) | lowest floor — self-contained (0.144) | lowest floor — self-contained (0.121)
9.  bath/shower/toilet — no. of residents (0.085) | bath/shower/toilet — lowest floor (0.142) | status in c.est. — lowest floor (0.120)
10. bath/shower/toilet — no. of rooms (0.085) | status in c.est. — lowest floor (0.141) | bath/shower/toilet — lowest floor (0.119)

Top 10 lowest standard deviations
    Freeman-Tukey (St.Dev.) | Lambda (St.Dev.) | Entropy Coefficient (St.Dev.)
1.  distance to work — year last worked (5.8·10^−3) | bath/shower/toilet — care provided (0) | gender — schoolchild or student (5.8·10^−4)
2.  year last worked — transport to work (5.8·10^−3) | type of c. est. — care provided (0) | lim. long-term illness — gender (7.7·10^−4)
3.  year last worked — workplace (5.8·10^−3) | central heating — care provided (0) | gender — term-time address (8.4·10^−4)
4.  hours weekly — year last worked (5.8·10^−3) | status in c.est. — care provided (0) | country of birth — gender (9.1·10^−4)
5.  hours weekly — transport to work (6.1·10^−3) | residents per room — care provided (0) | migration indicator — gender (9.5·10^−4)
6.  hours weekly — workplace (6.3·10^−3) | distance moved — hh. education ind. (0) | distance moved — gender (9.7·10^−4)
7.  distance to work — hours weekly (6.4·10^−3) | distance moved — lim. long-term illness (0) | age — gender (1.2·10^−3)
8.  distance to work — economic activity (6.5·10^−3) | distance moved — care provided (0) | socio-economic status — gender (1.2·10^−3)
9.  economic activity — workplace (6.5·10^−3) | distance to work — health (0) | care provided — gender (1.2·10^−3)
10. economic activity — hours weekly (6.5·10^−3) | distance to work — lim. long-term illness (0) | professional qualification — gender (1.3·10^−3)
see almost perfect agreement between the lambda and entropy coefficient measures,
where the top eight tables match up precisely and only the ninth and tenth switch
places. The Freeman-Tukey measure picks out three tables that are also picked up as
having high variation by the other two measures. Overall it can be said that all three
measures are pretty consistent at this end of the spectrum.
In order to further investigate the degree of geographical variation we select one
of the high ranking tables, namely the crosstabulation between central heating and
housing indicator, which ranks first as measured by the Freeman-Tukey measure and
seventh using both lambda and the entropy coefficient. This is a three by three table
where central heating has the possible values "Yes in some or all rooms", "No" or
"n/a (communal establishment)" and housing indicator is a derived variable indicating
the household characteristics as "Not overcrowded or lacking amenities", "Overcrowded/lacks
bath/shower, wc or heating" or "Not in household".
[Mosaic plots of Central heating (Yes / No / n/a) by Housing indicator (OK / Not OK / n/a) for Basildon (left) and Liverpool (right)]
Figure 9.1: Local authorities with weakest and strongest association when Freeman-
Tukey variation is largest
The values of the Freeman-Tukey statistic for this table range from 0.261 for Basil-
don in Essex to 0.827 for Liverpool. The same two local authorities rank similarly high
(16th) and low (15th ) using the lambda measure, which ranges from 0.127 to 0.851 in
Tower Hamlets and Bridgnorth respectively. The same local authorities also have the
maximum and minimum for the entropy coefficient measure.
Figure 9.1 allows us to directly compare the tables for Basildon and Liverpool. If
we ignore the ‘n/a’ answers, as they behave identically in both LAs, we can clearly
see from the mosaic plots that in Basildon a person is more likely to have central
heating regardless of whether or not the housing indicator is OK. It seems that not
having central heating is not the main reason people's households are classed as
lacking in some way. In Liverpool on the other hand, most of the people in sub-par
accommodation do not have central heating2. Simply put, in Basildon the majority
of households that are “Not OK” still have central heating, just as the majority of all
households do – weak association – whereas in Liverpool a household that is “Not OK”
is likely not to have central heating, contrary to most households – a strong association.
Interestingly, the strength of association seems to exhibit a degree of spatial corre-
lation in this example as can be seen from the maps in Figure 9.2. The left-hand map
shows the strength of the association as measured using the Freeman-Tukey statistic
and the right-hand using the lambda measure3. The overall patterns are similar for
both measures.
[Choropleth maps of association strength across local authorities, shaded in bands from 0.1 to 0.9: Freeman-Tukey statistic (left) and lambda (right)]
Figure 9.2: Association between central heating and housing indicator
On the other end of the spectrum we can see from the bottom half of Table 9.1
that the three measures are less consistent. In particular the lambda measure is
least useful here, as the 10 listed tables are part of a total of 23 which have a standard
deviation of zero - in each case all of the local authorities have a lambda value of zero
as well. This is an artefact of the lambda measure and was described in Section 8.3.2.
Looking at the remaining two measures the most obvious thing to note is that all the
variables involved are individual level variables as opposed to the household level ones
before. There is also no overlap between the two lists with Freeman-Tukey picking up
mainly tables with work related variables and the entropy coefficient finding the lowest
variability in tables where one of the variables is gender.
2This does not mean that it is the only reason their household was classed as "Not OK". A household
must be overcrowded, lack a bath/shower,wc or central heating to be classed as lacking amenities, but
may have more than one of these characteristics.
3Or the entropy coefficient - in this case both maps are identical.
Perhaps one of the more surprising examples of low geographic variation of asso-
ciation is the crosstabulation of limiting long-term illness and gender, which ranks as
second least variable according to the entropy coefficient. It ranks as 88th least variable
by the Freeman Tukey statistic (out of 1596) and 342nd least variable using lambda.
But returning to the entropy coefficient, the maximum value of 0.0438 is measured in
Mid Sussex and the minimum of 0.00000225 in Camden.
[Mosaic plots of Limiting long-term illness (Yes / No / n/a) by Gender (Male / Female) for Camden (left) and Mid Sussex (right)]
Figure 9.3: Local authorities with weakest and strongest association when entropy
variation is second smallest
Again we can compare the mosaic plots of the two extreme local authorities in
Figure 9.3. It is clear from the alignment of the tiles in the left-hand plot that the two
variables are almost completely independent. The likelihood of suffering from limiting
long-term illness is almost exactly the same for males and females. In fact the odds
ratio for the top four cells is 0.997, confirming this near independence. Looking at the
Mid Sussex table it is immediately apparent that the variables are not independent at
all but that females are significantly more likely to have a limiting long-term illness
than males. The odds ratio in this case is 1.53 so the odds of having a LLTI are over
50% larger for females than they are for males. Still, this is not picked up by the
entropy coefficient, nor can it be said to be really obvious from a Freeman-Tukey value of
0.072 or a lambda of 0.0215.
Given that gender features so prominently in the least variable tables according to
the entropy coefficient it is clear that this measure is particularly insensitive in relatively
small and uniform tables. In these cases a slight non-alignment is clearly visible using
the mosaic plot and can be confirmed using the odds ratio, but this is unfortunately
unworkable for any table larger than 2 ×2.
On the other hand for a table such as distance to work by year last worked, which has
the least geographic variation according to the Freeman-Tukey measure, odds ratios are
little help. Figure 9.4 plots these two variables for the local authority with the ‘weakest’
association – Merthyr Tydfil with adj.FT = 0.655 – and the 'strongest' one – Kingston
upon Thames with adj.FT = 0.685. In both cases it is clear that structural zeros
dramatically limit the possible association between the variables - in fact the margins
completely determine the association. This is much more clearly an example of little
geographic variation than the previous example in the sense that it conforms with our
intuitive understanding of variation of association strength.
[Mosaic plots of Distance to work (0-4km / 5-19km / 20+km / At home / Not fixed / Not in work) by Year last worked (In employment / 2000-2001 / 1996-1999 / Before 1996 / Never / Out of age range) for Merthyr Tydfil (left) and Kingston upon Thames (right)]
Figure 9.4: Local authorities with weakest and strongest association when Freeman-
Tukey variation is smallest
Based on these results we can say that there seems to be a significant amount of
agreement between the three measures at the high end of the spectrum of geographic
variation of association strength. In particular we again note that lambda and the
entropy coefficient are clearly congruent. On the other hand the measures are less
reconcilable at picking out associations which have a low level of geographic variability.
Lambda in particular, as has been noted before, has little distinguishing power for weak
associations, which consequently leads to lambda suggesting that these tables have
no geographic variation at all - even though that is clearly not the case. So far the focus has
been on geographic variation across local authorities. The next section continues in
the comparison of the three measures and the bivariate associations found in the SAM
data by applying two different hierarchical aggregations to the dataset.
9.2.2 Changes in association strength with geographic and geodemo-
graphic aggregation
Whenever geographic data are aggregated, either geographically or by any other group-
ing criteria, the issue of the modifiable areal unit problem or MAUP is unavoidable. The
MAUP is a statistical artefact of geographical data that was noticed at least as early as
1934, when Gehlke & Biehl published a short note describing “Certain effects of group-
ing upon the size of the correlation coefficient in census tract material”. The authors
noted that correlation coefficients between two variables changed quite dramatically
when they grouped census tracts into different contiguous or random groups. The issue
of MAUP was further investigated by many authors, most notably Stan Openshaw, who
distinguishes two aspects of MAUP: the scale and the aggregation or zoning problem
(Openshaw, 1977; Openshaw & Taylor, 1979; Amrhein, 1995). The scale effect refers to
different results occurring at different levels of the (geographical) hierarchy, while the
aggregation effect is the result of the choice of grouping at one particular level of the
hierarchy.
Effects of scale and zoning have been investigated on various statistics measured on
different types of variables. The basic wording of the problem is however simply that
results vary depending on the unit used. This can apply to univariate data where for
example the means or the standard deviations of variables might vary from one geogra-
phy to another (Amrhein, 1995). More traditionally however the issue is investigated on
bivariate continuous data, where correlation coefficients are generally found to increase
with the aggregation of units, but this is not found to be a strict rule (Flowerdew et al.,
2001; Blalock, 1964). More complex multivariate analysis results have been analysed
as well by investigating the changes in multiple regression coefficients across scales and
between alternate groupings, where clear patterns have not been found (Fotheringham
& Wong, 1991).
Given our dataset we can use the three measures of categorical association strength
described in Section 8.3 to investigate how both scale and zoning affect the bivariate
association strength on the SAM dataset. An example to operationalize the problem is
demonstrated in Figure 9.5. For this demonstration we use the entropy coefficient as the
measure and we chose the table with the largest standard deviation thereof: Communal
establishment type by Accommodation self-contained (see Table 9.1). This is pictured
in the top histogram, which plots all 373 association strengths ranging from 0.07 in
Westminster & City to 1.00 in North Warwickshire. When these local authorities are
aggregated into 10 regions, the range of values narrows considerably as can be seen from
the bottom left histogram. Alternatively, the local authorities can be aggregated into 7
Supergroups according to the ONS geodemographic area classification of LAs, as shown
in the bottom right histogram. The scale effect in both cases is one of a reduction of the
standard deviation of the strength of the association between the variables (as measured
by the entropy coefficient).
[Three histograms of the entropy coefficient: across the 373 local authorities, the 10 regions and the 7 Supergroups, each annotated with the maximum, mean ± standard deviation and minimum, illustrating the scale and 'zoning' effects]
Figure 9.5: Scale and zoning effects on entropy coefficients for the crosstabulation of
Communal establishment type by Accommodation self-contained
In the case of the geographic aggregation, the reduction
is quite dramatic whereas the geodemographic grouping preserves a significant portion
of the variation. The difference between these two groupings can tentatively be called
the zoning effect - although to be precise it would actually require that both have the
same number of units, which is not the case here.
The effect of the modifiable areal unit problem on the geographic variability of asso-
ciation strength in all of the SAM data can be analysed following the above example. In
it the geographic aggregation reduced the standard deviation of the entropy coefficient
by 65% but the geodemographic aggregation kept a lot more of the variation, reducing
the deviation by only 8%. When this analysis is repeated across all 1596 tables in the
dataset the changes can be plotted as histograms in Figure 9.6 with regional aggrega-
tion on the left and the geodemographic on the right. Each histogram describes the
proportional reduction of variation as measured by one of the measures and caused by
a particular aggregation. The red lines mark zero, with reductions of variation to the
left and increases of variation to the right.
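The scale-effect calculation behind these histograms can be sketched as follows for one variable pair, using the same placeholder objects as before (la.tables, entropy.coef()) plus a factor region assigning each local authority to one of the 10 regions.

    ## Scale effect for one variable pair: pool the LA-level crosstabulations
    ## within each region, re-measure the association, and compare the spread
    ## of the regional values with the spread of the LA-level values.
    sd.la <- sd(sapply(la.tables, entropy.coef))

    region.tables <- lapply(split(la.tables, region), function(tabs) Reduce(`+`, tabs))
    sd.region <- sd(sapply(region.tables, entropy.coef))

    (sd.region - sd.la) / sd.la    # proportional change of the kind plotted in Figure 9.6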
The difference between the two types of aggregation is clear immediately: while
the regional aggregation reduced the standard deviation of the association strength for
almost all tables, regardless of the measure used, this was not the case for the geode-
mographic aggregations. A considerable number of tables have in fact seen the standard
deviation of their association strength increase, in some cases quite substantially.
The differences between the three measures are less pronounced, with the most
notable being the spike in 100% reductions in the lambda measure (middle
two plots). These are again the cases where the value of lambda becomes zero for all
groups and hence its standard deviation also becomes zero.
The results suggest that geographical aggregation of units will tend to reduce the
amount of variation, although this cannot be seen as a rule. Generally, however, this
seems to be true regardless of the measure used. This is in line with the effect
normally observed with correlations and our intuitive understanding of it: namely that
geographical aggregation involves a smoothing effect that reduces variation (Fothering-
ham & Wong, 1991).
Regarding the zoning effect these results are too limited to allow any
conclusions, since the two sets of groups are not of the same size. Furthermore any
conclusions would need to be made using a random aggregation as a control - an analysis
that was not attempted here. We can conclude however that the Supergroup level of the
ONS area classification does seem to provide a significantly higher level of homogeneity
with regard to the strength of bivariate associations than a simple regional aggregation
does. This is of course despite the fact that the establishment of geodemographic
clusters only uses univariate data (ONS, n.d.).
9.3 Ecological Fallacy and Simpson’s Paradox
Research into the MAUP and in particular its scale effect aspect has contributed signifi-
cantly to the understanding that cross-level inference should never be attempted lightly.
Robinson’s 1950 article and the ensuing research on the ecological fallacy represents a
somewhat parallel strand of literature on a similar phenomenon, which espouses the
same warnings. Simpson’s paradox intuitively seems to follow a similar pattern of er-
roneous inference, yet it is rarely discussed in the same literature, and when it is the
relationship between the two is not explained in depth (Wakefield, 2004; Oakes, 2009).
For example its mention is conspicuously missing from Alker’s classic A typology of
ecological fallacies (1969).
This section discusses the logic and mathematics behind both issues in an attempt to
bring them together under the same roof. In the literature the ecological fallacy tends
to be explained using continuous data and covariance theorems, even though most
pedagogical examples tend to be categorical; Simpson’s paradox on the other hand is
usually described in texts on categorical data analysis - usually using probability logic -
even though empirical examples are just as likely to be found in continuous data. Here
it is demonstrated that framing both in the language of categorical data analysis and
using the concepts of variational independence helps to explain why they happen and
how to make the fallacy impossible to commit and the paradox less surprising. This
then allows us to return to the SAM dataset and investigate the magnitude of these
issues in the next section.
[Six histograms of the relative change in standard deviation of the entropy coefficient, lambda and the Freeman-Tukey statistic, for aggregation to 10 regions (left column, changes from −100% to +50%) and to 7 supergroups (right column, changes from −100% to +400%)]
Figure 9.6: Relative changes in standard deviations of measures of association strength
after two types of aggregation (N=1596)
9.3.1 Ecological Fallacy
The ecological fallacy was most famously demonstrated by William S. Robinson in his
1950 article on Ecological Correlations and the Behaviour of Individuals. His was not
the first demonstration of this effect nor did he use the actual term, which was first used
by Selvin several years later (1958)4. However, Robinson’s paper is widely considered to
be the most important and comprehensive contribution to this methodological issue and
it had a tremendous impact on social science as he himself predicted in his conclusion:
“this conclusion has serious consequences, and [..] its effect appears wholly negative
because it throws serious doubt upon the validity of a number of important studies
made in recent years”(ibid. p. 357). His demonstration as well as most subsequent
explanations of the principle are all framed in the language of linear regression and
correlation coefficients. It is felt however, that the framework of contingency table
4In fact the aforementioned Gehlke and Biehl also make an allusion to the possible existence of such
a fallacy when they say: “A relatively high association might conceivably occur by census tracts when
the traits so studied were completely dissociated in the individuals or families of those traits.”(1934,
p.170).
analysis that has been described in the preceding sections along with the use of mosaic
plots can offer a far more intuitive understanding of what the ecological fallacy is as well
as provide direction in the more general problem of ecological inference. The ecological
fallacy is first explained and then demonstrated using Robinson’s original example.
The ecological fallacy refers to the act of erroneously inferring individual-level rela-
tionships from ecological results. It refers to the logically incorrect leap from a state-
ment such as “Countries with large proportions of Protestants tend to have higher
suicide rates,” to saying “Protestants are more likely to commit suicide.” The lat-
ter statement is not necessarily untrue, but it does not logically follow from the first
statement5. The confusion seems to arise from the fact that a concept such as sui-
cide or religion can be operationalized at both the individual and the aggregate level,
which might make them seem interchangeable. Therefore it is impossible to commit
the ecological fallacy if we are dealing with a concept that cannot be an individual
characteristic—for example if we know that countries with high average temperatures
tend to have higher ice cream consumption, it is impossible to mistakenly rephrase this
statement at the individual level6.
The common formulation of the fallacy as transferring aggregate-level correlations
to the level of the individual is technically not really correct. It could only be correct
if the same variables were used at both levels, but the fact is that the variables at the
two different levels are not the same: on the individual level we are dealing with a
nominal (dichotomous) variable while at the aggregate level we are dealing with a rate
or proportion. Thus committing an ecological fallacy means calculating a correlation
between two continuous variables measured at the level of countries, regions, etc. and
mistakenly interpreting it as the correlation between two nominal variables that are
measured at the individual level. As these variables are usually namesakes, due to
imprecise coding of variables, the transferral is made all the easier. It can also be
speculated that a further reason why this transfer is made with such ease is due to the
fact that both relationships are measured using a Pearsonian coefficient of correlation.
The argument being that using the same type of measure and very similar notation for
two clearly different types of relationship makes the interchange seem reasonable. If
however the data is analysed using contingency tables and the associations measured
using the variationally independent odds ratio, such an error is all but impossible.
Robinson demonstrated the fallacy using data from the 1930 US Census describing
5This is a reference to Émile Durkheim’s seminal work Suicide, where he commits this fallacy
repeatedly and flagrantly. In a special issue of The American Journal of Sociology on the 100th
anniversary of Durkheim’s birth, these methodological problems were analysed by Hanan C. Selvin,
who thereby coined the term ecological fallacy (1958).
6Additionally, the fallacy is all but impossible to commit if no sensible hypothesis can be imagined
to link the two variables — a classical pedagogical example is that of an ecological correlation between
ice cream consumption and rape rates.
Table 9.2: Robinson’s Nativity and Illiteracy for US (Source: US Census (1931))
              Foreign born    Native born         Total
Illiterate       1,306,084      2,557,026     3,863,110
Literate        11,964,722     79,852,667    91,817,389
Total           13,270,806     82,409,693    95,680,499
the relationship between nativity and illiteracy. We recreate his example here7. The
complete dataset8 is essentially a 2 × 2 × 9 table: Literacy and Nativity are crosstabu-
lated for the nine divisions of the United States and the data is visualised as a mosaic
cube in Figure 9.8 (the full table can be found in Appendix C). Table 9.2 shows only
the top margin: Literacy by Nativity for the whole of the US. Robinson used these
four values to calculate the individual correlation. He expected the correlation to be
positive, reasoning that “educational standards are lower for the foreign born than for
the native born, and therefore there ought to be a positive correlation between foreign
birth and illiteracy". The correlation is in fact in agreement with this hypothesis, with
a value of 0.118 indicating a weak but positive relationship between foreign birth and
illiteracy.
The ecological correlation on the other hand is best described by plotting the nine
US divisions and their respective proportions of illiterate and foreign-born inhabitants
(Figure 9.7). The correlation coefficient calculated for these nine pairs of data points
equals −0.567, which is a quite strong negative correlation9, as is confirmed by the
negatively sloping regression line. These two ostensibly contrasting correlations, one
slightly positive, the other strongly negative, are the root of the ecological fallacy as
demonstrated by Robinson. In order to actually commit the fallacy one must claim that
the negative correlation between ‘nativity’ and ‘literacy’ depicted on the plot means
foreigners are less likely to be illiterate. The imprecision of using the terms ‘nativity’
7Robinson demonstrated the fallacy using two examples: the first one looked at the correlation
between race and literacy and the second between nativity and literacy. His first example proved
“there need be no correspondence between the individual and the ecological correlation”, by showing
dramatically different strengths of the correlation at each level, however both had the same (positive)
sign. To underscore the importance and potential danger of the fallacy, the second example was used,
where the direction of the correlation changed as well, and it is this latter example that is recreated
here.
8This is the data used in his analysis and summarized in Robinson’s Figures 3 and Table 3 (1950),
however as he did not provide the full crosstabulation, the data for our analysis had to be taken
directly from the 1930 US Census. Unfortunately, the counts retrieved from the census do not match
up exactly to the ones that can be found in Robinson, or the ones given in Subramanian et al.’s for that
matter, who recreated the analysis in 2009. It is unclear where this error occurs, but the differences are
small and do not affect the main point of the analysis, since the correlation coefficients are almost
identical to Robinson’s and demonstrate precisely the same effect he intended to show.
9The respective values calculated by Robinson are 0.118 for the individual correlation and −.526
for the aggregate one; so despite the disparity, the thrust of the argument remains the same.
[Scatter plot of percent foreign-born against percent illiterate for the nine US divisions (New England, Middle Atlantic, East North Central, West North Central, South Atlantic, East South Central, West South Central, Mountain, Pacific), with the fitted regression line]
Figure 9.7: Individual-level and division-level correlation of Nativity and Illiteracy
and ‘literacy’ instead of ‘percent foreign-born’ and ‘percent illiterate’ is quite obvious.
Alternatively one might start from the individual correlation and claim that since
foreigners are more likely to be illiterate, areas with large proportions of foreign born
inhabitants will tend to have higher levels of illiteracy. This is the converse of the
ecological fallacy and has also been dubbed the individualistic fallacy (Alker, 1969).
9.3.2 Simpson’s Paradox
So how does Simpson’s paradox relate to the ecological fallacy? The term refers to the
sometimes surprising observation that two variables can have one relationship across
the population at large and yet when it is split into subgroups that relationship may
disappear or even become reversed. It has been argued that Robinson’s 1950 paper
spawned work on the Simpson’s paradox and the similarities are easily apparent (Oakes,
2009). The paradox is sometimes mentioned in conjunction with the ecological fallacy
[Mosaic cube of Literacy (Literate / Illiterate) by Nativity (Native / Foreign born) across the nine US divisions]
Figure 9.8: 3-D visualization of the original Robinson data
as a ‘closely related’ phenomenon (Wakefield, 2004) or, in equally vague terms, as a
special case of cross-level inferences which involve making the “unwarranted assumption
or expectation that aggregates should evince the same relationships as the categories,
levels, classes, items or individuals over which the aggregate was formed” (Vokey, 1997,
p. 210). But Simpson’s paradox has a very precise relationship to the ecological fallacy,
which is explored below, after first demonstrating the paradox for both categorical and
continuous variables.
The term Simpson’s paradox is another example of Stigler’s law of eponymy10 as
it refers to a paradox that had been known and described several times before being
discussed by Edward H. Simpson (1951) after whom it got named by Colin Blyth
in 1972. Terminologically the situation gets more complicated by synonyms such as
amalgamation, aggregation or reversal paradox or even the Yule-Simpson effect being
used sometimes for versions of the paradox in categorical data only, sometimes for
continuous data and sometimes generically (Stigler 1999, p. 39; Good & Mittal
1987; Messick & van de Geer 1981). Most examples again tend to use categorical
data and furthermore also tend to be from the medical sciences11. Here we use the
10This law states that “No scientific law is named after its original discoverer” and is itself an example
of itself, since the law was first proposed by Robert Merton (Stigler, 1999, p. 277) .
11This is understandable given that medical statistics often operate with dichotomous variables such
as treatment/no treatment and outcomes such as cured/not cured, making 2x2 tables particularly
ubiquitous. It should be noted though, that Simpson’s paradox is symmetrical with regard to the two
main variables i.e. there is no need to treat one variable as dependent and the other as independent.
                  Local patients      Chicago patients        All patients
Treatment:         Old       New        Old        New         Old       New
Dead               950      9000       5000          5        5950      9005
Alive               50      1000       5000         95        5050      1095
Survival ratio      5%  <    10%        50%  <      95%         46%  >    11%
Figure 9.9: Simpson’s paradox for categorical variables (based on (Blyth, 1972)
numerical example given by Blyth, which is also a medical table, but happens to have
a geographical component as well.
The data presented in the tables at the top of Figure 9.9 are the results of a ‘clinical
trial’ crosstabulating patients by whether or not they received the old or the new drug
and what the results of the treatment was: death or recovery. The doctor in this case
applied the new drug both locally (first table) as well as for some of his Chicago patients
(second table). The same tables are represented as mosaic plots below with white areas
corresponding to patients that recovered and gray areas to patients that did not. For
both groups of patients, the survival ratio was much higher on the new drug compared
to the old one. If however one ignores the location of the patients and pools all the
data (third table), the relationship is dramatically reversed and the survival rate is
much worse for patients on the new drug. The paradox is of course easy to explain
away: “[the local] patients are much less likely to recover, and the new treatment was
given mostly to [them]; and of course a treatment will show a poor recovery rate if tried
mostly on the most seriously ill patients” (Blyth, 1972, p.364).
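The Blyth figures can be reproduced directly; the short R snippet below assembles them as a 2 × 2 × 2 array and shows how pooling over the conditioning variable (location) reverses the comparison.

    ## Blyth's (1972) numbers from Figure 9.9 as an outcome x treatment x location array.
    drug <- array(c(950, 50, 9000, 1000,     # Local:   Old (Dead, Alive), New (Dead, Alive)
                    5000, 5000, 5, 95),      # Chicago: Old (Dead, Alive), New (Dead, Alive)
                  dim = c(2, 2, 2),
                  dimnames = list(outcome = c("Dead", "Alive"),
                                  treatment = c("Old", "New"),
                                  location = c("Local", "Chicago")))

    ## Within each location the new drug has the higher survival rate ...
    prop.table(drug[, , "Local"], margin = 2)["Alive", ]     # 0.05 (Old) vs 0.10 (New)
    prop.table(drug[, , "Chicago"], margin = 2)["Alive", ]   # 0.50 (Old) vs 0.95 (New)

    ## ... but pooling over location reverses the comparison.
    prop.table(apply(drug, c(1, 2), sum), margin = 2)["Alive", ]  # 0.46 (Old) vs about 0.11 (New)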
The example given above is one of the more extreme ones as the aggregation actually
reverses the relationship. Gentler versions of the paradox are available as well: one of
the earliest examples in print is the one given by Yule (1903) of an attribute that is
not inherited through the female line nor through the male line (both layers exhibit
independence), but there is a significant - yet illusory - relationship when both are
added together. The example given by Simpson is also not one of complete reversal, but
similar to Yule’s: here the combined table shows no association, while each individual
group (dirty and clean playing cards) does show an association. And just as with the
ecological fallacy, there is no need for the relationship to reverse so dramatically: the
situation can still seem paradoxical if the relationship is e.g. strong for each subgroup
of the population, but only weak for the population as a whole.
Examples of Simpson’s paradox for continuous variables seem generally to be more
difficult to come by in the literature, although the earliest known mention of the paradox
is in fact continuous: Pearson et al. (1899) illustrate the fact that a “mixture of
heterogeneous groups, each of which exhibits in itself no organic correlation, will exhibit
a greater or less amount of correlation” (p. 278) using data on the length and width
of skulls found in the Paris Catacombs. These show that the correlation between both
measures was 0.0869 for the male skulls and -0.0424 for the female skulls. So a slight
positive and a slight negative relationship, both close enough to zero to lead one to
assume that overall the correlation between the length and breadth of French skulls is
zero. It is found however, that the correlation of the mixed or pooled data is in fact
0.1968 - much further from zero than either of the groups.
So while some authors seem to use the term Simpson’s paradox only for extreme
cases where the association overall is in the opposite direction to that of (all) the
associations at group level, it makes sense to follow others (including Simpson himself)
in widening the definition to include also examples such as the ones above. Thus we
can say Simpson’s paradox has occurred whenever the overall correlation between two
variables is stronger than the strongest or weaker than the weakest of the within-group
correlations.
An interesting way of connecting both concepts using a classical example arises if we
consider again the Robinson data presented in the previous section. As is clear from the
cube visualisation presented in Figure 9.8, the ecological fallacy results only from the
two-dimensional margins. This means we can remove the third-order interaction τijk,
i.e. the geographic variation of the association between illiteracy and nativity, using
IPF, and the ecological fallacy – or rather the possibility of committing it – remains
intact: the individual level correlation is still 0.12 and the ecological correlation is -0.57.
However now the within-region correlations range from 0.12 to 0.21, as shown by the
shaded bars in Figure 9.10. The original
within-region correlations are shown alongside, while the red horizontal line indicates
the individual level correlation. Thus the overall correlation is smaller than or equal to
all of the post-IPF regional level correlations.
Since we removed the third-order interaction this means the illiteracy-nativity re-
lationships are the same in all regions. However the correlation coefficient is not vari-
ationally independent and so does not reflect this fact clearly. If we instead used odds
ratios we would find that in the original data the regional odds ratios ranged from
0.68 to 18.82, while the national odds ratio equalled 3.41. This means that overall,
individuals are 3.41 times more likely to be illiterate if they are foreign born compared
to natives. But after we remove the three-way interaction the odds ratio in each region
becomes 8.28. It is the same across all regions, meaning individuals in each region are
8.28 times more likely to be illiterate if they are foreign born compared to natives.
Thus we have, either using odds ratios or correlation coefficients, another example of
Simpson's paradox - albeit not a complete reversal one.
[Bar chart of within-region correlation coefficients for the nine US divisions, shown for the original table and for the table with no three-way interaction, on a scale from 0.00 to 0.25, with the individual-level correlation marked by a horizontal line]
Figure 9.10: Within-region correlation coefficients for Robinson’s data
This slightly contrived example goes to show that both phenomena may be present
in a single table, but equally the presence of one gives no indication of the possibility of
the other. Both the ecological fallacy and Simpson’s paradox involve the relationship
between either two continuous variables or two dichotomous ones and a third condition-
ing variable. This latter one is a geographical one in the case of the ecological fallacy,
although it could easily be generalized to be any other type of grouping variable. In
the case of Simpson’s paradox the conditioning variable is not usually geographical,
although it can also be. To summarize then, while the ecological fallacy and Simpson’s
paradox can be seen as related phenomena, their distinction can be made with much
more rigour. In order to do so we must first be able to express them precisely using the
correlation coefficient, which is done in the next section.
9.3.3 Correlation coefficient
Pearson’s correlation coefficient is also called the product moment correlation, the name
referring to the process of calculating the product of the z-scores of two variables and
taking the average (the moment) of these products across all cases (Chen & Popovich,
2002). In standard notation the correlation between variables X and Y is then:
\[ r = \frac{\sum_{i=1}^{N} z_{X_i} \cdot z_{Y_i}}{N} \]
If this expression is expanded to include the means ($\bar{X}$, $\bar{Y}$) and standard deviations
($\sigma_X$, $\sigma_Y$) we get:
\[ r = \sum_{i=1}^{N} \frac{(X_i - \bar{X})}{\sigma_X} \cdot \frac{(Y_i - \bar{Y})}{\sigma_Y} \cdot \frac{1}{N} \qquad [9.1] \]
The numerator in Equation [9.1] can be expanded:
\[ r = \frac{\sum X_i \cdot Y_i - \sum \bar{X} \cdot Y_i - \sum X_i \cdot \bar{Y} + \sum \bar{X} \cdot \bar{Y}}{\sigma_X \cdot \sigma_Y \cdot N} \]
which simplifies to:
\[ r = \frac{\overline{X \cdot Y} - \bar{X} \cdot \bar{Y}}{\sigma_X \cdot \sigma_Y} \qquad [9.2] \]
This is the computational form of the correlation coefficient equation and we will
find it particularly useful for calculating its value for binary variables. A third way
of expressing $r$ is by rewriting Equation [9.1] to show $r$ as the covariance of $X$ and $Y$
divided by the product of their standard deviations:
\[ r = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y}) / N}{\sigma_X \cdot \sigma_Y} = \frac{\sigma_{XY}}{\sigma_X \cdot \sigma_Y} \qquad [9.3] \]
The covariance is itself a measure of correlation between the two variables, but its
unit depends on the units of X and Y so it has no upper or lower limits and cannot
be compared easily. Dividing it by the variables’ standard deviations is a way of
standardizing it: Pearson’s correlation coefficient can therefore also be seen as the
standardized covariance and falls between -1 and 1.
All three Equations ([9.1], [9.3] and [9.2]) are equivalent ways of expressing the
Pearsonian correlation. It is the latter and the associated covariance theorems that are
commonly used to explain the ecological fallacy (most famously by Alker (1969)). In
the most general terms we define the variables $X_{ij}$ and $Y_{ij}$ for $i = 1, 2, \ldots, n_j$ individuals
in $j = 1, 2, \ldots, R$ regions, with a total of $\sum_{j=1}^{R} n_j = N$. We then have regional averages:
\[ \bar{X}_j = \sum_{i=1}^{n_j} X_{ij} / n_j, \qquad \bar{Y}_j = \sum_{i=1}^{n_j} Y_{ij} / n_j \qquad [9.4] \]
and overall or total averages:
\[ \bar{X} = \sum_{j=1}^{R} \sum_{i=1}^{n_j} X_{ij} / N, \qquad \bar{Y} = \sum_{j=1}^{R} \sum_{i=1}^{n_j} Y_{ij} / N \qquad [9.5] \]
Following Alker, we can break down a deviation from the mean into the within region
deviation and the between region deviation (ibid., p.73):
\[ (X_{ij} - \bar{X}) = (X_{ij} - \bar{X}_j) + (\bar{X}_j - \bar{X}) \qquad [9.6] \]
If we also take the equivalent expression for $Y_{ij}$, multiply the two together and average
across $N$, the following expression ensues12:
\[ \sum_{j=1}^{R} \sum_{i=1}^{n_j} \frac{(X_{ij} - \bar{X})(Y_{ij} - \bar{Y})}{N} = \sum_{j=1}^{R} \sum_{i=1}^{n_j} \frac{(X_{ij} - \bar{X}_j)(Y_{ij} - \bar{Y}_j)}{N} + \sum_{j=1}^{R} n_j \frac{(\bar{X}_j - \bar{X})(\bar{Y}_j - \bar{Y})}{N} \]
\[ \underbrace{\sigma_{XY}}_{\text{universal covariance}} = \underbrace{\sigma^w_{XY}}_{\text{within region covariance}} + \underbrace{\sigma^e_{XY}}_{\text{between region covariance}} \qquad [9.7] \]
So the overall or rather the individual level covariance $\sigma_{XY}$ can be decomposed into
the within region covariance $\sigma^w_{XY}$ and the between region covariance. This latter one
is the ecological covariance of X and Y, hence we add the superscript $e$ to its notation:
$\sigma^e_{XY}$. The same relationship also holds for the equivalent variances of X and Y:
\[ \sigma_X = \sigma^w_X + \sigma^e_X, \qquad \sigma_Y = \sigma^w_Y + \sigma^e_Y \qquad [9.8] \]
which means that using the three covariances in Equation [9.7] we can now use the
variances in [9.8] to standardize them and express three different correlation coefficients:
\[ r = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}, \qquad r^w = \frac{\sigma^w_{XY}}{\sigma^w_X \sigma^w_Y} \qquad \text{and} \qquad r^e = \frac{\sigma^e_{XY}}{\sigma^e_X \sigma^e_Y} \qquad [9.9] \]
The first, $r$, represents the individual level correlation, the second, $r^w$, the within
region correlation and the third, $r^e$, the ecological correlation. The relationship between
all three correlation coefficients can be expressed using Equations [9.7], [9.8] and [9.9],
which results in13:
\[ r = r^w \cdot \sqrt{\sigma^w_X / \sigma_X} \cdot \sqrt{\sigma^w_Y / \sigma_Y} + r^e \cdot \sqrt{\sigma^e_X / \sigma_X} \cdot \sqrt{\sigma^e_Y / \sigma_Y} \qquad [9.10] \]
Confusing $r$ and $r^e$ is committing the ecological fallacy. Being perplexed when $r$
is not of the same sign and similar strength as the average $r^w$ is falling into the trap of
Simpson's paradox.
In order to express the Pearsonian correlation in the categorical notation of our
data cube we must first make sure variables $A$ and $B$ are dichotomous: $I = J = 2$. The
number of levels of $C$, the geographical variable, $K$, is then equivalent to $N$ in the above
equations. The individual correlation for the data cube is easier to derive using the
formulation of $r$ in Equation [9.2]. This time we are dealing with the total $N$ population
¹² The full derivation is given in Alker (1969, p. 75), with the caveat that the subscripting involved in the definition of the ecological covariances in Formulas (B6) and (B9) is slightly unorthodox, using summation over i when none of the elements are actually indexed with i, when what is intended is summation over all regions; and there is a typo in formula (D2) where the second term in the denominator should be subscripted YY instead of XY.
¹³ See Alker (1969) and Duncan et al. (1961) for the complete derivation.
of the table and X and Y are binary variables indicating whether individual x_n is in category A = 1 or B = 1 respectively:
$$ X \mapsto \begin{cases} 1 & \text{if } x_n \in x_{1++}, \\ 0 & \text{if } x_n \in x_{2++}, \end{cases} \qquad Y \mapsto \begin{cases} 1 & \text{if } x_n \in x_{+1+}, \\ 0 & \text{if } x_n \in x_{+2+}. \end{cases} \qquad [9.11] $$
The expected values are then simply the probabilities that a person falls into that category, which is equivalent to the proportion of observations in that category (see Figure 9.11 for a graphical representation):
$$ \bar{X} \mapsto p_{1++}, \qquad \bar{Y} \mapsto p_{+1+} \qquad [9.12] $$
We can further define the expected product of both variables as the proportion of individuals with both variables having a value of 1:
$$ \overline{X\cdot Y} \mapsto p_{11+} \qquad [9.13] $$
Since we are now dealing with binary variables, the standard deviations can also be simplified: because the only possible values for X or Y are 0 and 1, squaring them has no effect. Since X_i² = X_i we find:
$$ \begin{aligned}
\sigma_X &\mapsto \sqrt{\frac{\sum(X_i-\bar{X})^2}{N}} &\qquad \sigma_Y &\mapsto \sqrt{\frac{\sum(Y_i-\bar{Y})^2}{N}} \\
&\mapsto \sqrt{\frac{\sum X_i^2}{N}-\bar{X}^2} & &\mapsto \sqrt{\frac{\sum Y_i^2}{N}-\bar{Y}^2} \\
&\mapsto \sqrt{\bar{X}-\bar{X}^2} & &\mapsto \sqrt{\bar{Y}-\bar{Y}^2} \\
&\mapsto \sqrt{\bar{X}(1-\bar{X})} & &\mapsto \sqrt{\bar{Y}(1-\bar{Y})} \\
&\mapsto \sqrt{p_{1++}\cdot p_{2++}} & &\mapsto \sqrt{p_{+1+}\cdot p_{+2+}}
\end{aligned} \qquad [9.14] $$
Now we can substitute [9.12], [9.13] and [9.14] into [9.2] to get the individual corre-
lation:
$$ r = \frac{p_{11+} - p_{1++}\cdot p_{+1+}}{\sqrt{p_{1++}\cdot p_{2++}\cdot p_{+1+}\cdot p_{+2+}}} \qquad [9.15] $$
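As a small numerical illustration of Equation [9.15], the sketch below (Python/numpy assumed; the counts are invented for illustration, not SAM figures) computes φ for a single 2 by 2 table by converting the counts to the proportions used in the formula.

import numpy as np

counts = np.array([[250.0, 150.0],     # A = 1 row: cells (B = 1, B = 0); hypothetical counts
                   [300.0, 300.0]])    # A = 0 row
p = counts / counts.sum()              # joint proportions
p1 = p[0, :].sum()                     # p_{1++}
q1 = p[:, 0].sum()                     # p_{+1+}

phi = (p[0, 0] - p1 * q1) / np.sqrt(p1 * (1 - p1) * q1 * (1 - q1))   # Equation [9.15]
print(round(phi, 3))                   # about 0.123 for these invented counts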
The within-region correlations are the same as above, only without the summation over the third variable. So for region k = 1 we have:
$$ r^{w} = \frac{p_{111} - p_{1+1}\cdot p_{+11}}{\sqrt{p_{1+1}\cdot p_{2+1}\cdot p_{+11}\cdot p_{+21}}} \qquad [9.16] $$
In order to express the ecological correlation we use the variance-covariance formulation from Equation [9.3], being careful to use the between region deviations as we did in the last term of [9.7]. The correlation we are looking for is between the proportions of cases in each area (for each k) that are in the first category of variable A or variable B:
$$ \bar{X}_j \mapsto p_{1+k}, \qquad \bar{Y}_j \mapsto p_{+1k} \qquad [9.17] $$
Figure 9.11: Notation in three-dimensional example (the collapsed 2 by 2 table with marginal proportions p_1++, p_2++, p_+1+, p_+2+ and cell p_11+, and the layer proportions p_++k, p_1+k, p_+1k for levels C = 1, ..., K)
The expected values of these two variables are the proportions in each category across all levels of C, so:
$$ \bar{X} \mapsto p_{1++}, \qquad \bar{Y} \mapsto p_{+1+} \qquad [9.18] $$
Note that the expectation is the same as it was for the individual version of X and Y (Equation [9.12]), however the interpretation is crucially different: in the first case the variable is binary, so X̄ = 0.30 means "there is a 30% probability that individual x_n is in category A = 1", but the observed value can only be 1 or 0. In the second case the variable is continuous, so e.g. X̄ = 0.30 means "overall 30% are in category A = 1", however the observed value for a particular region may be 0.27 or 0.54 or anything in the range from 0 to 1.
Using the equalities [9.17] and [9.18] we can write both the covariance and the
variances and insert them into Equation [9.3] to express the correlation coefficient for
the ecological correlation as:
$$ r^{e} = \frac{\sum_{k=1}^{K}(p_{1+k}-p_{1++})(p_{+1k}-p_{+1+})}{\sqrt{\sum_{k=1}^{K}(p_{1+k}-p_{1++})^2}\;\sqrt{\sum_{k=1}^{K}(p_{+1k}-p_{+1+})^2}} \qquad [9.19] $$
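The ecological correlation in [9.19] can likewise be sketched directly from a 2 by 2 by K array of counts (Python/numpy assumed, invented counts): following the interpretation of [9.17] and [9.18] given above, the area shares in A = 1 and B = 1 are correlated, centring on the overall proportions as in the equation.

import numpy as np

rng = np.random.default_rng(1)
K = 373                                                      # e.g. local authorities
cube = rng.integers(50, 5000, size=(K, 2, 2)).astype(float)  # counts x_{ijk}, invented

area_tot = cube.sum(axis=(1, 2))                  # x_{++k}
pA = cube[:, 0, :].sum(axis=1) / area_tot         # share of area k in A = 1
pB = cube[:, :, 0].sum(axis=1) / area_tot         # share of area k in B = 1
pA_all = cube[:, 0, :].sum() / cube.sum()         # overall p_{1++}
pB_all = cube[:, :, 0].sum() / cube.sum()         # overall p_{+1+}

dx, dy = pA - pA_all, pB - pB_all
r_e = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())   # Equation [9.19]
print(round(r_e, 3))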
9.3.4 Simpson and Robinson combined?
Although Alker’s classic A typology of ecological fallacies (1969) is unique in its compre-
hensive derivation of the covariance theorems that underpin the ecological fallacy, any
mention of Simpson’s paradox is conspicuously absent. This is even more unusual as
Alker does discuss a whole range of possible fallacies arising from erroneous inferences
across levels, including what he calls the twin universal and selective fallacies. The for-
mer is committed by assuming a universally true relationship also holds in a subsample
of the population, and the latter by generalizing a relationship from a subgroup (not
one that is randomly selected). He gives an example of the former by noting there is a
universally positive relationship between land inequality and group violence, but this
does not hold for European nations. This hardly seems a fallacy, and there is certainly
nothing paradoxical about one subgroup behaving differently than the universe. Yet
this is exactly where Simpson’s paradox fits into the scheme otherwise so diligently set
up by Alker.
There are on the other hand several examples of authors not being rigorous enough
in distinguishing between the two phenomena. There seem to be quite a few reasons
why the ecological fallacy and Simpson’s paradox are often muddled together and/or
seen at least in part to be overlapping. One is the fact that one is conceptualized as a
fallacy and the other a paradox. Indeed both could be rephrased as one or the other.
Either can be seen as a paradox, because we expect the data to behave differently than
they actually do - of course we can only be surprised by their behaviour if they are
available. On the other hand, if the data are not available, mistaken expectations about
how they should behave can lead us to commit a fallacy.
Another source of confusion is the concept of cross-level inference. Both phenom-
ena can be framed in the language of cross-level inference, yet in very different ways:
the ecological fallacy is cross-level in the sense that correlations between individuals
are compared with correlations between area (or group) means. Simpson’s paradox
compares correlations between individuals with correlations between individuals, but
is cross-level in the sense that the individuals in the first case are at a higher (e.g.
national) level and the individuals in the second case are at a lower (e.g. regional)
level.
Another reason is the confusion caused by them sometimes being demonstrated
using continuous and sometimes using categorical variables. One extreme example is
given by Freedman (2002) , who calls the ecological fallacy “Simpson’s Paradox for the
correlation coefficient”. Another illustration can be found in Oakes (2009) who creates
a plot to summarize the analysis by Gelman et al. (2007) - a clear example of the
ecological fallacy - and calls it Simpson’s paradox.
Perhaps this is because the Gelman analysis is one of the rare examples of the
ecological fallacy for continuous data, and the graph is therefore so unfamiliar. Since
we have already demonstrated both phenomena using categorical data, we can now
compare them using continuous data as well. A version of the Gelman chart is reproduced on the left hand side of Figure 9.12. In the original analysis US voters in several states
Figure 9.12: Simpson's paradox or Ecological Fallacy? (both panels' axes run from low to high)
are analysed for their income and propensity to vote Republican. The red line is
the regression line for all individuals indicating that individuals with higher incomes
have a higher propensity to vote Republican. These individuals actually come from
three different states as indicated by the three different sets of points in the graph
- in each of which individuals with higher incomes also have a higher propensity to
vote Republican. If we take the means for each state however (indicated by the bold
symbols) the relationship is ‘reversed’: states with higher average incomes tend to have
lower levels of Republican votes. The ecological fallacy lies in the reversal of slopes
between the individual level overall correlation (red line) and the state level ecological
correlation (dotted line).
To show how this example for continuous data could become an example of Simp-
son’s paradox, we can turn to the right-hand panel of Figure 9.12. Again there is a
positive relationship within each of the three states, but now the overall individual level
correlation (red line) is negative. The paradox here lies in the fact that the red line has
the opposite direction to all three of the black lines. The dotted line indicating the eco-
logical correlation is irrelevant here, we are only interested in the full lines representing
the two sets of individual level correlations.
9.4 ‘Magnitude’ of Ecological Fallacy Effects and Occurrence of Simpson’s Paradox in SAM
Robinson’s example of the ecological fallacy is quite dramatic as the actual direction
of the correlation is different in each case, making the fallacy so much more obvious.
Very often however, both individual and aggregate level variables exhibit similar de-
grees of correlation at both levels. If, in addition to this, a sensible hypothesis also
exists for this relationship, the fallacy is much easier to commit, such as in Durkheim’s
suicide examples. In fact Openshaw goes as far as to say that “extreme results [such
as Robinson’s] may not be typical of the range of results likely to be produced from
the restricted sets of areal units commonly used to report census data.” (Openshaw,
1984, p. 18). His analysis further develops this thread of thought, indicating that the
smaller the level of aggregation — the more homogeneous the areas — the smaller the
discrepancy between the ecological and individual correlation. Although Openshaw’s
main underlying theme is a focus on the way areas are delimited in the first place, i.e.
the modifiable areal unit problem, his analysis is dangerously close to claiming that if
the resolution is large enough, the ecological fallacy problem might not be that severe
(ibid. p.30). This section uses his 1984 analysis as a framework to investigate the
magnitude of these issues in the SAM dataset.
Openshaw (1984) uses two datasets in his analysis: a set of 53 variables for Sunder-
land and 40 variables for Florence (Italy), both from the 1981 censuses. The types of
variables are not stated in the article, but some inferences can be made from the results
presented in Tables 1.-3. There are e.g. 780 individual correlations for the Italian data, which is equal to 40 × 39/2 and therefore leads to the conclusion that all the variables were dichotomous. There are 1431 correlations for the Sunderland data, which is actually equal to 54 × 53/2, which can either mean that all but one of the variables were dichotomous with the remaining variable having three categories, or there were in fact 54 and not 53 variables used.
The SAM dataset used here has 57 variables, but only two of them are dichotomous. In order to create a (manageable) set of 2 × 2 tables from these variables we use a dummy variable approach (a minimal sketch of the splitting is given after the footnotes below): each of the 1596 crosstabulations gets split into I × J sub-tables. For example in a three by four table the first variable can be transformed into three dummy variables and the second into four, meaning a total of 12 new 2 × 2 crosstabulations can be created from it¹⁴. The variable Accommodation type that originally has 4 categories now becomes 4 separate variables: Detached or semi detached, Terraced house, Flats etc. and Communal establishment, each of which has only two possible values: Yes and No. This creates 42,199 tables, but some of the variables have categories with no one in them, which effectively produces 2 × 1 tables instead, so these are removed, leaving a total of 41,629 crosstabulations of dichotomous variables to work with¹⁵.
¹⁴ This applies for all variables except the two that are dichotomous, in which case only one dummy variable is sufficient.
¹⁵ This is the simplest approach to dichotomizing the variables in a way that is unbiased and still computationally feasible. An attempt to recode the 57 variables into all possible dichotomous variables was abandoned due to the unworkable number of tables this would have produced, as the combinatorics of the task result in exponentially large numbers. For example a variable with 3 categories can only be recoded into 3 binary variables, and one with 4 categories can be recoded into 7 binary ones, but a variable with 10 categories can be recoded into 511 binary variables and the maximum in our dataset, Region of origin with 17 categories, can be recombined into a whopping 65,535 binary variables (of the type e.g. Region of origin is the North East, Wales or no usual address - Yes or No). All in all this would have resulted in 1,082,944,709 different and not always sensible 2 by 2 tables - and was not attempted.
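A minimal sketch of the dummy-variable splitting described above (Python/numpy assumed; the dimensions are invented): from an I × J × K count table it builds the I·J dummy 2 × 2 × K sub-tables of the form 'category i vs. not i' by 'category j vs. not j'. In the thesis, sub-tables with an empty margin (effectively 2 × 1 tables) are then dropped.

import numpy as np

def dichotomize(cube):
    """Split an I x J x K count table into I*J dummy 2 x 2 x K sub-tables."""
    I, J, K = cube.shape
    tables = []
    for i in range(I):
        for j in range(J):
            t = np.empty((2, 2, K))
            t[0, 0] = cube[i, j]                            # in category i and in category j
            t[0, 1] = cube[i].sum(axis=0) - cube[i, j]      # in i, not in j
            t[1, 0] = cube[:, j].sum(axis=0) - cube[i, j]   # not in i, in j
            t[1, 1] = cube.sum(axis=(0, 1)) - t[0, 0] - t[0, 1] - t[1, 0]   # in neither
            tables.append(t)
    return tables

cube = np.random.default_rng(2).integers(0, 500, size=(3, 4, 373)).astype(float)
print(len(dichotomize(cube)))          # 3 * 4 = 12 dummy tables, as in the example above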
Figure 9.13: Individual vs. ecological correlations at LA level (N=41,629) with examples of Simpson's paradox shown in red; the correlation between the two sets of correlations is r = 0.52
The starting dataset for analysing the potential for the ecological fallacy and the occurrence of Simpson's paradox is therefore 41,629 tables, each with a national 2 by 2 margin and 373 layers - one for each local authority. For each of these tables we can calculate the φ value, i.e. the national individual level correlation, and the ecological correlation based on the 373 pairs of proportions that are in the first of the variable categories. Figure 9.13 plots the results of this analysis. While both the individual and the ecological correlations range from -1 to 1, the individual ones are very much concentrated around zero, with fewer extreme values, while the ecological correlations are a lot more spread out (the density plots on the top and right hand side make this clearer). All the points that are on the 45° diagonal are tables where the individual
level correlation equals the ecological correlation. The slope of the linear regression line
(red dotted line) is 0.21. If we split the graph into quadrants, we can say all the points
in the top right and bottom left quadrants represent tables where both the individual
and the ecological correlations have the same sign - 62.83% of all the tables - the points
in the other two quadrants represent tables where the ecological and the individual
correlations have the opposite sign - 23.26% of the tables - and the remaining tables
are ones where the individual correlation equals exactly zero, about half of which have
a positive and half a negative ecological correlation.
The maximum discrepancy between the individual and ecological correlations is found in the table indicated by an arrow on the chart. We can take a closer look at that particular table. Because of the way the dummy variables were created the table is slightly more difficult to interpret: it comes from an original crosstabulation of the variables Number in household with LLTI and Number in household with poor health, each of which originally has four categories: (i) None, (ii) One, (iii) Two or more and (iv) na - not in household, etc. In the dichotomized version the first variable is Exactly one person in household with LLTI, the alternative being therefore either no people with LLTI or two or more, or na. In the second variable the category two or more is selected, the alternative being less than two in poor health or na. The data for both correlations is presented in Figure 9.14: the individual level correlation is slightly negative with a φ value of −0.03. This is the correlation for the two by two table for all 2,621,560 people in the table. As is clear from the scatterplot, the ecological correlation on the other hand is strongly positive with a coefficient of r = 0.89.
The individual correlation is represented by the mosaic plot on the right hand side
of the figure: living in a household with two or more members in poor health (right
column) makes a person less likely to also be living in a household with exactly one
member with LLTI (white mosaic tiles). This is to be expected given the two original
variables are quite similar and their crosstabulation has a very strong diagonal element.
Two or more people in poor health actually means it is most likely that the household
also has two or more people with LLTI, so having only one is in fact less likely, leading to
a slightly negative correlation. The positive ecological correlation is also not surprising:
again because poor health and LLTI measure such similar things it is not surprising
that areas with high levels of LLTI - and therefore high levels of households with one
LLTI member - will also tend to have higher levels of poor health - and consequently
also high levels of households with two or more members in poor health. The highest level, with almost 7.5% of households having two or more members in poor health, is found in Easington in County Durham, which also has 32.81% of households with one member with LLTI (both extreme points are marked in red in the plot). The individual level correlation in Easington is even more strongly negative than nationally, with φ = −0.15. At the other end of the spectrum only 0.85% of the households in Uttlesford in Essex
Figure 9.14: Largest discrepancy between ecological (N=373) and individual correlation (N=2,621,560); the scatterplot of the proportion living in households with exactly one member with LLTI against the proportion living in households with two or more members in poor health has r = 0.89
have two or more members in poor health and 19.50% with exactly one member with
a LLTI.
Of course this example does not portray a fallacy of any sort - it is simply the
table where a fallacy could be committed most dramatically with a huge difference in
correlations that even change signs. Looking at the table in more detail makes it clear
that the individual and ecological relationships are completely reasonable, although
they might be easier to analyse if the variables were less awkward.
Of the 41,629 tables depicted in Figure 9.13, five of them are highlighted in red
indicating the occurrence of Simpson’s paradox. We define it here in the light version
of the paradox where the overall individual correlation is larger than the largest, or
smaller than the smallest, local area individual correlation:
$$ \phi > \max(\phi_k) \;\;\big|\;\; \phi < \min(\phi_k) \qquad [9.20] $$
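The 'light' condition in [9.20] can be checked mechanically for any of the dichotomized 2 × 2 × K tables: compute the national φ from the pooled table and the layer-specific φ_k, and test whether the national value falls outside their range. The sketch below is again an illustrative Python/numpy version with invented counts; layers with an empty margin would have to be skipped, as in the thesis.

import numpy as np

def phi(t):
    """Phi coefficient of a 2 x 2 count table, via its proportions (cf. Equation [9.15])."""
    p = t / t.sum()
    p1, q1 = p[0].sum(), p[:, 0].sum()
    return (p[0, 0] - p1 * q1) / np.sqrt(p1 * (1 - p1) * q1 * (1 - q1))

def simpson_light(cube):
    """cube: K x 2 x 2 counts. True if the national phi lies outside the range of layer phis."""
    phi_national = phi(cube.sum(axis=0))                 # pooled national 2 x 2 table
    phi_k = np.array([phi(layer) for layer in cube])     # one phi per local authority
    return phi_national > phi_k.max() or phi_national < phi_k.min()

cube = np.random.default_rng(3).integers(20, 400, size=(373, 2, 2)).astype(float)
print(simpson_light(cube))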
For example, the left-most of the five red points in the graph represents the crosstabulation of Ethnic group - white British and Accommodation in flat, which has a slight negative correlation nationally (φ = −0.15). The correlations for each of the 373 local authorities range from min(φ_k) = −0.13 in Barking and Dagenham to max(φ_k) = 0.07 in Slough. The map in Figure 9.15 color codes the range of correlations with positive
ones shaded blue and negative ones ranging all the way to dark red. The mosaic plots
Figure 9.15: Simpson's paradox example found in SAM data; the map shades the local authority correlations from dark red (negative) to blue (positive), the mosaic plots show Slough (φ_k = 0.07), Barking and Dagenham (φ_k = −0.13) and the national SAM (φ = −0.15), and the scatterplot of the proportion of White British ethnic background against the proportion living in a flat has an ecological correlation of r = −0.78
for both of the extreme local authorities are shown on the top right with the national
one below. In the latter plot we can see that overall white British people (right col-
umn) are less likely to live in flats (white tiles) and a similar pattern can be seen in the
Barking and Dagenham plot just above it, with a slightly weaker negative correlation.
The pattern in Slough is quite the opposite with people who are not of a white British
background being less likely to live in flats. A look at the full crosstabulation (not
shown here) reveals this to be due mainly to the fact that the proportion of Slough’s
Indian and Pakistani population that live in detached houses is very strongly above the
national average.
This table represents a clear example of Simpson’s paradox, where the overall na-
tional relationship indicates a stronger relationship between the variables than can be
Figure 9.16: Individual vs. ecological correlations at GOR level (N=41,629); the correlation between the two sets of correlations is r = 0.32
found in any of the sub-groups, i.e. local authorities. At the same time we can see in the scatterplot that the ecological relationship is quite strongly negative, which is confirmed by a correlation coefficient of r = −0.78. There is nothing unusual about
the highlighted points in the plot representing the extreme local authorities of Slough
and Barking and Dagenham. They stand out when it comes to the individual level
relationship, not the ecological values.
The same investigation of occurrences of Simpson’s paradox and the danger of the
ecological fallacy can also be carried out at the regional level. This means repeating
the same analysis on the 41,629 tables as before, only instead of having 373 local
authority layers, these are now merged into 10 government office region layers. The
graph comparing their individual and ecological level correlations is plotted in Figure 9.16. The scatterplot is similar to the one shown in Figure 9.13, but with some important differences: the distribution of ecological correlations is a lot more uniform, which is particularly clear from the density curve plotted above the chart. This means there are comparatively more strong ecological correlations. Consequently - since the individual
Figure 9.17: Largest discrepancy between ecological (N=10) and individual correlation (N=2,621,560); the mosaic plot tabulates travelling between 5 and 19 km to work against using a car to travel to work, and the regional scatterplot of the two proportions (with the South West and Wales labelled) has an ecological correlation of r = −0.95
level correlations are the same as before - the slope of the regression line has also
flattened and is now 0.08 (red dotted line). Compared to the local authority tables, the
proportion of tables where both correlations are positive or both negative has fallen to
54.38%, while the proportion of tables where the correlations are of opposite signs has
increased to 31.75%. The former tables are the less dramatic ones, where the ecological
fallacy would not lead to the reversal of signs of the correlations, just an increase or
decrease of the correlation strength.
Again we take a closer look at the table where the difference between the individual
and ecological correlations is largest (indicated by an arrow on the plot). That point
represents the crosstabulation of Travel to work by car and Distance travelled to work
5-19 km. The overall national correlation is positive with φ = 0.45, while the ecological correlation of the ten regions is strongly negative with r = −0.95. The individual level correlation can be seen in the mosaic plot in Figure 9.17, where the great majority of people travelling between 5 and 19 kilometres to work use a car to do so, while the proportion using a car to travel to work is significantly lower for everyone else (this includes both people who travel shorter and longer distances, as well as people who are not in work). Looking at the ecological data however, we find that there is a negative
relationship between the proportion of inhabitants travelling to work by car and the
proportion travelling 5-19 km to work. In particular the South West stands out with
Figure 9.18: Ecological correlations reverse at GOR (and Supergroup) and LA levels; regression lines through the proportions travelling between 5 and 19 km to work against the proportions travelling to work by car give r = −0.95 at the regional level, r = −0.65 at the Supergroup level and r = 0.08 at the local authority level
only 16.50% travelling by car and the highest proportion travelling between 5 and 19 km: 20.53%. Wales, on the other extreme, where almost a third travel to work by car, has only 13.88% travelling within that distance range.
It is also important to note that this extremely strong ecological correlation was found at the regional level only; if we look at the same table at the local authority level, we find the correlation has in fact reversed, in a nice demonstration of the MAUP scale effect. Figure 9.18 demonstrates how this happened, plotting the same 10 points from Figure 9.17 alongside the 373 ones for all the LAs as well. While the red regression line corresponds to the strongly negative correlation observed at the regional level, the grey, very slightly positive one corresponds to the local authority level relationship, where the ecological correlation is r = 0.08. For comparison we also plot the geodemographic Supergroup level regression line (black points and regression line), which is also negative, albeit less strongly so (r = −0.65), an example of the MAUP zoning effect.
As before, we also look for occurrences of Simpson's paradox at the regional level; the red points plotted in Figure 9.16 represent the 224 such tables in the dataset. The largest difference between the national φ and the closest regional φ_k is found in the crosstabulation between Country of birth Wales and Region of origin Wales. The correlations and the mosaic plots for the most extreme cases are plotted in Figure 9.19. Nationally, if a person had moved from Wales (staying within Wales or moving elsewhere in England), they are much more likely to have also been born in Wales,
Figure 9.19: Example of Simpson's paradox at regional level; the mosaic plots tabulate Born in Wales against Moved from Wales for Wales (φ_k = −0.02), Yorkshire and the Humber (φ_k = 0.12) and the national SAM (φ = 0.20), and the map shades the ten government office regions by their φ_k values
a positive relationship that is exemplified by a correlation of 0.20. All regional level
correlations are lower, the highest being in Yorkshire and the Humber and the lowest,
negative in fact, in Wales itself. It is worth noting that interpretation of this table is
again slightly awkward given that the opposite of Moved from Wales includes people
who have moved from anywhere else as well as people who have not moved at all.
The same analysis as described for the LAs and regions was also performed on
the Supergroup level, and the summary results for all three levels are summarized in
Table 9.3. Both aggregations behave quite similarly: at regional and Supergroup level
the ecological correlations are more spread out, i.e. extreme values are more common than at local authority level. This leads to the regression slope flattening and the correlation between the ecological and individual correlations weakening; again, the levels are almost identical for regions and the geodemographic grouping. Because the
ecological correlations tend to be more extreme at higher levels of aggregation the
Table 9.3: Summary statistics on the ecological fallacy and Simpson's paradox at 4 different levels of aggregation (N=41,629)

Unit of aggregation                Local Authority   Government Office Region   Supergroup
Type of aggregation                geographic        geographic                 geodemographic
Number of groups                   373               10                         7
Std. dev. of r^e                   0.37              0.55                       0.57
Slope of regression                0.205             0.085                      0.082
Correlation of r^e and φ           0.521             0.316                      0.321
Max abs. r^e − φ                   0.92              1.40                       1.37
Same sign of r^e and φ             62.86%            54.38%                     55.47%
Diff. sign of r^e and φ            23.26%            31.75%                     30.65%
Occurrences of Simpson's paradox   5                 224                        362
maximum ‘error’ is also larger: such mistaking of the ecological for the individual
correlation at regional level could be as much as 1.4 off! Similarly the chance of the
sign changing is larger as well at about 30%. The only clear difference between the
regional and the geodemographic tabulations is the number of occurrences of Simpson’s
paradox (light), which is a lot more prevalent in the latter tables - but still occurs less
than 1 percent of the time.
9.5 Summary
By this point it should be clear that the ecological fallacy is not just a unique scale effect
along the lines of the MAUP (cf. Amrhein, 1995, p. 106) or conversely: the MAUP is
not a particular form of ecological fallacy (cf. Johnston, 2000, p. 518). Neither is the
ecological fallacy just a version of Simpson’s paradox for continuous data (Freedman,
2002). While all three issues are conceptually similar, which often seems to lead to
these sort of imprecise generalizations, they are in fact clearly distinct phenomena.
The Small Area Microdata with the three different types of aggregations - two
geographic ones: local authority and regional and one geodemographic one using Su-
pergroups - has been shown to exhibit or hold the potential for all three phenomena.
Experimenting with all of these issues and working through them at the different hi-
erarchical levels allowed for real life empirical examples which could be systematically
analysed, and at the same time gave an in depth and interactive way to get acquainted
with the dataset and the degree of variation inherent in the data.
In the first part of the analysis in Section 9.2 we found the three different measures of
association strength showing remarkably congruent results as to which pairs of variables
exhibit most geographic variation. These tended to be tables that include household
level variables, mainly ones related to accommodation type and quality. The entropy
coefficient and Goodman and Kruskal’s lambda showed almost identical results on this
side of the spectrum. At the other end - tables with the least geographic variation - the
picture is less clear. Lambda becomes all but useless here with so many zero values.
Entropy and Freeman-Tukey also pick out systematically different types of crosstabulations, the latter seeming to focus on tabulations with fewer cells, often involving gender, and the former on more complex ones with many structural zeros. Intuitively it is perhaps Freeman-Tukey's that feels more accurate.
The natural next step in the analysis was to investigate the degree to which the modifiable areal unit problem affects the results. Measuring how the geographic variation of
a bivariate association changes if aggregated geographically or geodemographically we
found the former to generally reduce variation, while the latter increased it in about 30%
of the tables. This is not surprising given a smoothing effect of geographic aggregation:
as regions become more heterogeneous the association strengths become less extreme,
whilst geodemographic aggregation is expected to create more homogeneous clusters,
which would be expected to retain more of the variation.
The second part of this chapter extended the analysis into a comprehensive review
of both the ecological fallacy and Simpson’s paradox as related phenomena. This
meant clearly distinguishing the situations - types of data and statistics - that allow for
each of these problems to occur and their association with MAUP. This is summarized
conveniently in Table 9.4 which comprehensively specifies the units of analysis, types of
data and statistics that are characteristic of each of the phenomena under discussion.
Both Simpson’s paradox and the ecological fallacy have very specific and clearly defined
data requirements and each can exist in two incarnations, which are described in the
table. The MAUP is much more general data-wise, and is therefore not restricted only
to the illustrative examples given in the table.
Conceptually the MAUP is also the broadest in that technically it only requires
two different sets of aggregations, whilst the results being compared between them can
be either individual level or unit level and the data can be uni-, bi- or multivariate.
Table 9.4: Variants of the ecological fallacy, Simpson's paradox and the MAUP
(columns: units of analysis | data type^a | example variables | statistic)

Ecological fallacy (K ≪ N; individual vs. group):
  N individuals                    | Dichotomous | Old/young & Rich/poor         | Pearson's correlation (φ)
  K areas                          | Continuous  | Percent old & Percent rich    | Pearson's correlation (r)
  N individuals                    | Continuous  | Age & Income                  | Pearson's correlation (r)
  K areas                          | Continuous  | Average age & Average income  | Pearson's correlation (r)

Simpson's paradox (K ≪ N; individual vs. individual):
  N individuals                    | Continuous  | Age & Income                  | Pearson's correlation (r)
  (K groups) of N_K individuals    | Continuous  | Age & Income                  | Pearson's correlations (r_k)
  N individuals                    | Dichotomous | Old/young & Rich/poor         | Pearson's correlation (φ)
  (K groups) of N_K individuals    | Dichotomous | Old/young & Rich/poor         | Pearson's correlations (φ_k)

MAUP - scale (K < L):
  Group vs. group:
    K larger areas                 | e.g. Continuous (univariate) | Chi square of crosstabulation between social class and income | Standard deviation (σ)
    L smaller areas                | e.g. Continuous (univariate) | Chi square of crosstabulation between social class and income | Standard deviation (σ)
  Individual vs. individual:
    (K groups) of N_K individuals  | e.g. Continuous (bivariate)  | Age & Income | Pearson's correlation (r)
    (L groups) of N_L individuals  | e.g. Continuous (bivariate)  | Age & Income | Pearson's correlation (r)

MAUP - zoning (K = L):
  Group vs. group:
    K areas                        | e.g. Continuous (multivariate) | Percent homeowners, percent employed, percent labour voters | Multiple reg. coeff. (β_0k, β_1k, β_2k, ...)
    L different areas              | e.g. Continuous (multivariate) | Percent homeowners, percent employed, percent labour voters | Multiple reg. coeff. (β_0l, β_1l, β_2l, ...)
  Individual vs. individual:
    (K groups) of N_K individuals  | e.g. Dichotomous (univariate)  | Labour voter (Yes/No) | Election result (e.g. FPTP)
    (L groups) of N_L individuals  | e.g. Dichotomous (univariate)  | Labour voter (Yes/No) | Election result (e.g. FPTP)

^a The ecological fallacy and Simpson's paradox data types are exhaustive - the only ones possible. The MAUP ones are examples only - as indicated by the e.g. prefix - and numerous other possibilities exist that are not explicitly enumerated.
The only real limitation is that there are several units, i.e. not just (a national) one. No attempt was made in this chapter to systematically investigate the MAUP, although the analysis of geographic variation in Section 9.2.2 does represent an original
example of MAUP-style analysis.
Both the ecological fallacy and Simpson’s paradox can however be stated very un-
ambiguously and their data requirements and the statistics that are applicable are very
restrictive. Section 9.3 is therefore devoted to an extensive overview of both of these
concepts and this includes a review of the original Robinson data, which are also shown
to hold potential for the Simpson’s paradox. This is followed by explicitly stating the
full derivation of the relevant correlation coefficients for both continuous and dichoto-
mous variables. Since the literature tends to focus on either one or the other of these
phenomena, the main contribution of this section is addressing both at the same time
and explicitly and exhaustively delimiting them. As the result of this examination,
Table 9.4 represents a unique classification that is all-inclusive and one that is in our
opinion sorely missing in the relevant literatures.
In the final part of the analysis in Section 9.4 our focus returns to the SAM data.
Because the ecological fallacy and Simpson's paradox require data to be either continuous or dichotomous, while the SAM contains categorical variables with generally more than two categories, our dataset first had to be restructured by dichotomizing the variables. This created a set of over 40,000 tables through which we were able to
investigate the ecological fallacy and Simpson’s paradox and compare their behaviour
on three different levels of aggregation, the results of which were summarized in Table
9.3. Perhaps most striking are the similarities between the results for the regional and
geodemographic Supergroup aggregations. From a strict MAUP point of view it must
of course be noted that due to the different number of groupings (ten vs. seven) this
comparison is not systematic. Taking this difference into account however, we can ten-
tatively infer that the geodemographic classification does in fact retain more bivariate
variation than the geographic one in that it retains the same amount in fewer groups.
Overall the depth of analysis of the ecological fallacy, Simpson’s paradox and general
MAUP issues presented here can be seen as a side-step. But it is felt it is a necessary
one in line with the comprehensive nature of this analysis, which furthermore allows a
unique prism for describing the dataset at hand. This in turn has allowed some further
light to be thrown on the complex nature of multivariate categorical interactions that
IPF estimates must attempt to capture. This chapter also resulted in a unique summary
of three phenomena that are ubiquitous in geographical data analysis in the form of
a concise yet comprehensive table clearly defining the conditions under which each of
them may occur.
Part III
Applications
Chapter 10
IPF and the Error of insufficient
constraints
In the first of the two applications chapters we investigate how IPF handles insufficient
constraints in the spatial context. Using the 1596 three-dimensional crosstabulations
defined in the previous section we apply both the measures of association strength and
goodness-of-fit to investigate the behaviour of four different models under IPF.
Figure 10.1: Hierarchy of all possible three-dimensional models (adapted from Wickens, 1989, p. 67). From the simplest to the fully saturated:
  Main effects (Independence): [A][B][C] - Model 1
  One two-factor effect: [AB][C], [AC][B], [BC][A] - Model 2
  Two two-factor effects: [AB][AC], [AB][BC], [AC][BC] - Model 3
  No three-factor effect: [AB][BC][AC] - Model 4
  Fully saturated model: [ABC]
Figure 10.1 displays all nine possible hierarchical models in a three-dimensional
table, highlighting the four that are investigated in this chapter. The simplest model
is the bottom one with only three main effects. Moving upwards, additional terms are
added at each level, until the top fully saturated model is reached. The arrows indicate
the nestedness of the models: in each pair of models that is connected, the lower one becomes the more complex higher one by the addition of a single term. Thus if the term [AB] is added to Model 1 it becomes Model 2. It is important to note that not all models have this hierarchical relationship; thus Model 2, while simpler than Model 3, is not nested within it. In fact, as we shall see, Model 3 performs much worse than the simpler Model 2. However within the hierarchy – i.e. moving up along the arrows – a higher model will always be at least as good as the lower one. This means Model 3 will by definition perform at least as well as Model 1.
Each model is considered in turn and the results are compared to the known
full population. This allows us to use the two measures of fit to establish which crosstabulations – i.e. combinations of two variables – perform extremely well and extremely poorly, thereby also shedding light on the measures themselves. For each model
the degree of fit is regressed against the measures of association strength before inves-
tigating the geographic variation of fit as well. Working through the model hierarchy
we explore the effects each marginal configuration has on the quality of the estimates
as well as how different variables and bivariate distributions affect these results.
10.1 Model 1: From $[A],[B],[C]$ to $[\widehat{ABC}]$
The first scenario assumes only the univariate distributions of the three variables are
known. The IPF procedure is run to produce the maximum entropy estimates of the
full tabulations given only the three ‘edge’ margins, or to phrase the model in log-linear
terminology, we estimate x̂_ijk:
$$ \hat{x}_{ijk} = \tau\cdot\tau^{A}_{i}\cdot\tau^{B}_{j}\cdot\tau^{C}_{k} \qquad [10.1] $$
Figure 10.2 gives a visual depiction of the model whereby the red mosaic plots represent the marginals that are to be used to estimate the cube. In this instance only simple univariate margins are used in what will predictably be a relatively bad estimation of x_ijk. No interactions are included in this model, meaning all three variables will be independent in the estimated table.
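For concreteness, a minimal sketch of the IPF step itself follows (Python/numpy assumed; this is a generic textbook implementation, not the code used in the thesis). The table is repeatedly rescaled to each fixed margin in turn until the margins are reproduced; for Model 1 the fixed margins are the three 'edge' margins, while Model 2 in Section 10.2 simply replaces the separate [A] and [B] constraints with the joint [AB] margin. The stopping rule mirrors the 0.001-type criterion mentioned later in the chapter.

import numpy as np

def ipf_3d(seed, margins, tol=0.001, max_iter=1000):
    """Fit a three-way table to a set of fixed margins by iterative proportional fitting.
    margins: list of (axes_kept, target) pairs, e.g. for Model 1
             [((0,), A), ((1,), B), ((2,), C)]; Model 2 would use [((0, 1), AB), ((2,), C)]."""
    est = seed.astype(float).copy()
    for _ in range(max_iter):
        max_diff = 0.0
        for axes_kept, target in margins:
            axes_summed = tuple(a for a in range(est.ndim) if a not in axes_kept)
            current = est.sum(axis=axes_summed)
            max_diff = max(max_diff, np.abs(current - target).max())
            ratio = np.divide(target, current, out=np.zeros_like(target), where=current > 0)
            est *= np.expand_dims(ratio, axes_summed)
        if max_diff < tol:
            break
    return est

# toy example: a 2 x 2 x 4 'population' supplies the Model 1 margins [A], [B], [C]
x = np.random.default_rng(4).integers(1, 100, size=(2, 2, 4)).astype(float)
margins = [((0,), x.sum(axis=(1, 2))), ((1,), x.sum(axis=(0, 2))), ((2,), x.sum(axis=(0, 1)))]
est = ipf_3d(np.ones_like(x), margins)
print(np.allclose(est.sum(axis=(1, 2)), x.sum(axis=(1, 2))))   # the [A] margin is reproduced

With a uniform seed and only the univariate margins this converges to the independence table of Equation [10.1]; supplying margins that carry interaction information is what distinguishes the later models.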
Figure 10.2: Model 1: [1],[2],[3] → [123], with equation x̂_ijk = τ · τ^A_i · τ^B_j · τ^C_k and degrees of freedom df = I·J·K − 1 − (I−1) − (J−1) − (K−1); the margins used are [A], [B] and [C]
Figure 10.3: Goodness-of-fit statistics for Model 1 (N = 1596). Proportion misclassified (∆): minimum 0.89, median 12.27, maximum 65.09. Cressie-Read Z-score (Z_CR²): minimum 67.72, median 638.79, maximum 2400.30.
Each coefficient in the log-linear model reduces the degrees of freedom by the number of values required to fully describe its effect. For example the effect of variable C with K categories requires (K−1) parameter values¹. This means our models will have 372 coefficients describing the τ^C_k term - the effect of the local authority. Using as an example the smallest tables in our dataset with 2 × 2 × 373 = 1492 cells, we therefore have 1117 degrees of freedom using this first model. This is equivalent to saying that in this model, 1117 coefficient values have been set to one. There is a single τ^AB_ij coefficient, 372 τ^AC_ik and another 372 τ^BC_jk coefficients and finally 372 three-way τ^ABC_ijk coefficients, which sums up to 1117. All of these are fixed to one and therefore have no effect on the cell estimates. The 375 degrees of freedom that are removed correspond to the 375 coefficients taken from the fixed margins.
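The bookkeeping above can be condensed into a short check (plain arithmetic; Model 2 is the model introduced in Section 10.2):

def df_model1(I, J, K):
    # cells minus the fitted coefficients: grand mean plus the three main effects
    return I * J * K - 1 - (I - 1) - (J - 1) - (K - 1)

def df_model2(I, J, K):
    # Model 1 minus the (I-1)(J-1) coefficients of the [AB] interaction
    return df_model1(I, J, K) - (I - 1) * (J - 1)

print(df_model1(2, 2, 373), df_model2(2, 2, 373))   # 1117 and 1116, as in the text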
The analysis of Model 1 involves using IPF to estimate 1596 full tables. At the local authority level these tables are of size I × J × 373 and these are then compared to the original SAM data. The measures used in this analysis are percentage misclassified (∆) and the Wilson-Hilferty normalization of the Cressie-Read chi-square statistic (Z_CR²) (see Equations [8.44] and [8.49]). Figure 10.3 summarises the results of this analysis, which produced a median misclassification of 12.27% and a median Z-score of 638.79. As expected, the model performs quite badly in general, but we note two interesting facts. One is the wide range of scores as measured by either statistic. While some tables were estimated with only a few percent misclassified (which is still a few tens of thousands of people) there are, at the other extreme, eight tables where more than 50% were misclassified.
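Equations [8.44] and [8.49] are defined in Chapter 8 and are not reproduced here; the sketch below uses the standard textbook forms (percent misclassified as half the summed absolute cell differences over the table total, and the Wilson-Hilferty cube-root normalization of the Cressie-Read power-divergence statistic with λ = 2/3), which may differ in detail from the thesis's exact definitions. Python/numpy and the illustrative arrays are assumptions.

import numpy as np

def pct_misclassified(obs, est):
    """Percent misclassified: half the summed absolute cell differences over the total."""
    return 100 * np.abs(obs - est).sum() / (2 * obs.sum())

def z_cressie_read(obs, est, df, lam=2 / 3):
    """Cressie-Read power divergence (lambda = 2/3), Wilson-Hilferty normalized to a z-score."""
    mask = obs > 0                      # cells with zero observed counts add nothing to the sum
    cr = (2 / (lam * (lam + 1))) * (obs[mask] * ((obs[mask] / est[mask]) ** lam - 1)).sum()
    return ((cr / df) ** (1 / 3) - (1 - 2 / (9 * df))) / np.sqrt(2 / (9 * df))

obs = np.random.default_rng(5).integers(1, 200, size=(2, 2, 373)).astype(float)
est = np.full_like(obs, obs.mean())     # a deliberately crude 'estimate' for illustration
print(pct_misclassified(obs, est), z_cressie_read(obs, est, df=1117))   # df as for Model 1 above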
The scatterplot also makes it clear that the two measures are quite disparate. Al-
¹ Although formally all three variables are equivalent we shall adopt a convention of designating the final variable (in this case variable C with levels k) as the spatial variable.
though there is a certain degree of linearity (r² = 0.47), the two definitions of good fit assessed by the two measures can be quite dramatically different. A similar conclusion can be drawn by comparing the rankings of the tables by both measures (not pictured). The average difference between the rankings by goodness-of-fit is 230 with the maximum value of 1097. This latter table has a ∆ value of 3.90% (ranked 66th) and a z-score of 863 (ranked 1163rd)². This finding gives a clearer perspective on the
problem of choosing a goodness-of-fit measure and reinforces the decision not to limit
the analysis to only one statistic. Although this can come at the expense of being able
to make absolute and definitive statements about the quality of some estimates it is
felt that it would be disingenuous for the analysis to maintain such a claim if it is not
possible.
According to ∆ the best fit was produced for the crosstabulation of Type of communal establishment and Sex, with only 0.89% misclassified, and this is also the best estimate according to its Z_CR² value of 67.72³. The best fit example becomes quite understandable given the variable association structure, which can be seen graphically in Figure 10.4. At the national level the overwhelming majority of the sample population (99.91%) do not live in communal establishments. For the small percentage that do, gender is in fact an important factor in determining the type of establishment, with the odds of women being in a non-NHS establishment twice as high as those of men. In Liverpool the odds ratio is even higher: (9 × 112)/(5 × 65) = 3.10. Because no interaction terms were included in Model 1 the variables are independent in the estimates (right hand panel) - with an odds ratio of one, there is no difference in communal establishment type between men and women. That same national relationship between the variables applies to all the 373 local authorities. Although only 0.25% of the Liverpool sample was misclassified, the Z_CR² value for this local authority is 2.75. So on the one hand a relatively small number of people have been misclassified, but important information can be said to have been lost. Unsurprisingly other tables that tabulate Sex by similarly skewed variables such as Status in communal establishment, Use of bath/shower/toilet or Accommodation self-contained all score extremely well under this model with only about one percent misclassified.
At the other end of the spectrum, judging which tables fit worst, the two measures are not in unison. With 65.09% misclassified the tabulation of Economic Activity (last week) and Year last worked is the worst according to ∆ (top of Figure 10.5). This table is also the 9th worst according to a Z_CR² value of 1582. But according to Z_CR² the
² In fact repeating the analysis using the other measures from Section 8.4 showed similar results - ∆ and Z_χ² have an r² = 0.24; ∆ and SRMSE have an r² = 0.67; Z_CR² and SRMSE have an r² = 0.53; Z_χ² and SRMSE have an r² = 0.32 and Z_χ² and Z_CR² have an r² = 0.93. An r² value of one would of course indicate that the two measures are measuring the same thing. Thus the closeness of the two z-scores is expected as the two power-divergence statistics are the most similar.
³ It should be kept in mind that these are z-scores meaning the metric is a standard deviation in a unit normal.
Figure 10.4: Type of communal establishment by Sex

                 National SAM             Liverpool SAM        Liverpool Model 1
                 Male        Female       Male      Female     Male        Female
NHS              849         911          9         5          7.20        7.60
LA/HA etc.       6,257       14,638       65        112        85.53       90.19
No code req.     1,268,875   1,330,030    10,608    11,248     10,638.11   11,218.36
worst performing table is Migration indicator by Region of origin with a value of 2400,
even though it has only 23.83% misclassified (bottom of Figure 10.5).
The first of these two tables exhibits a very strong association with many structural
zeros. Out of the 25 cells in the bivariate crosstabulation 16 are empty - these structural
zeros refer to impossible combinations such as for example ‘employed last week’ and
‘never worked’. Because there is no interaction data in the Model 1, the IPF estimates
show the two variables as independent (top right of Figure 10.5), leading to 1.7 million
misclassified persons out of a total of 2.6 million. The chi-square based measure also
shows an extreme lack of fit with a z-score value of 1582.
The second table also exhibits a very large proportion of structural zeros (74/102)
again because of impossible combinations - if you have migrated from the North West
then you cannot have the same address as last year. The main difference from before
is that a great majority of the population is concentrated in one cell - almost 87%
did not move and hence their region of origin value is ’same address as now’. Because
both of these categories are so large, even after IPF removes the relationship between
the variables, there are still over 75% in that cell. This leaves only about 25% to be
potentially misclassified in the remaining cells - and most of them in fact are - leading
to an overall rate of misclassification of only 23.83%.
By contrast the Cressie-Read statistic (as would Pearson's X²) treats the cells in a more balanced way so that their contribution to the total lack-of-fit is not proportional to their size. In fact we can use a mosaic plot to visualise the relative contributions of each cell to the total Z_CR² value⁴ (right panel of Figure 10.6). The three cells that
⁴ Note that the Cressie-Read statistics (and therefore also Z_CR²) are signed measures (see footnote 14 on page 109). This means the concept of cell contribution differs from that of positive measures such as ∆, so comparison of the two panels in Figure 10.6 should be regarded as only tentative.
Figure 10.5: The two worst performing tables under Model 1. Top: Economic Activity (last week) by Year last worked, National SAM vs. National Model 1 (∆ = 65.09%, rank 1596; Z_CR² = 1582, rank 1588). Bottom: Migration indicator by Region of origin (∆ = 23.83%, rank 1431; Z_CR² = 2400.30, rank 1596).
contribute most to the lack-of-fit are highlighted and they account for almost 30% of
the total error. All three are cells whose frequency was grossly underestimated by
the model. Compared to the relative cell contributions to the percentage misclassified
(shown in the left panel) it is clear that the cells contributing most to the error are not
the same as with Z_CR².
There is of course no reconciling the two measures, but we can learn from the above
example that an extremely skewed distribution with a single large cell will necessarily
limit the possible percent misclassified. In that sense one may say the estimate is rather
good, however Z_CR² will not be affected by such a cell distribution and according to it - in this case at least - the estimate is in fact the worst.
The final question is if there are any general rules that would help predict the
performance of the IPF estimates. There is no simple way of quantifying characteristics
such as ‘single large cell’ or ‘many structural zeros’ or even combinations thereof, but we
can use the measures of association strength (see Section 8.3) as single value descriptors
of the table structure. Table 10.1 shows the results of bivariate linear regression analysis
Figure 10.6: Individual cell contributions to lack-of-fit (left panel: percent misclassified; right panel: Cressie-Read statistic)
Table 10.1: Coefficients of determination (R²)

                                   FT²       λ_AB      U_AB
Percent misclassified (∆)          61.34%    45.18%    40.63%
Cressie-Read Z-score (Z_CR²)       34.79%    31.96%    40.62%
in which the dependent variable, either ∆ or Z_CR², is predicted by one of three measures: the Freeman-Tukey chi-square statistic (FT²), symmetrical lambda (λ_AB) and the uncertainty coefficient (U_AB).
The coefficients of determination (R²) show that (in Model 1) the Freeman-Tukey statistic is the best at predicting the percent misclassified, with an R² value of over 60 percent. This means a bivariate association that is strong in the sense measured by FT² will increase the misclassification rate of the IPF estimate. On the other hand, when predicting IPF performance as measured by Z_CR², none of the three measures offers such a good level of correlation, the best being U_AB, explaining just over 40 percent of the variation in error.
The strongest of these correlations is plotted in Figure 10.7, where the linear association between FT² and ∆ is made clear. The three tables analysed above are also highlighted in the plot: the two with the maximum and minimum percent misclassified are clearly visible at the top-right and bottom-left corners respectively, whereas the table with the highest Z_CR² does not stand out at all.
Geographic variation of model fit
So far we have examined the overall misclassification or error of the whole table while
focusing in particular on the AB margin. In this section we take a look instead at the
layers of the data cube: how accurate are the estimates of the relationship between
Figure 10.7: Correlation between FT² and goodness-of-fit under Model 1 (R² = 0.61)
Figure 10.8: Geographic variation of goodness-of-fit under Model 1 (left map: ∆ for Country of Birth by Accommodation type, shaded in 10% bands; right map: Z_CR² for Ethnic Group by Religion, shaded in bands of 50)
two variables in each LA and how does the level of accuracy vary across the geography. This means focusing on AB|C, or the accuracy of estimating the relationship between the two variables conditional on geography.
We first calculate the percent misclassified in each LA separately for all 1596 tables.
If we take the example of Economic Activity by Year last worked, which we saw above
had the highest proportion misclassified, we find that across the 373 LAs the value of ∆
ranged from 64.75% to 65.84%. This means Model 1 performed equally poorly across
all local authorities, with very little variation in goodness-of-fit.
We use the standard deviation to measure the variation of goodness-of-fit across
local authorities. The standard deviation is highest for the table Country of birth by
Accommodation Type, which ranges from 5.19% in Wyre to 85.88% in Merthyr Tydfil,
while the nationwide misclassification rate is 23%. The geographic distribution is shown
in the left-hand map in Figure 10.8 where it is clear that Country of birth (where one
of the options is ‘Wales’) is a predominant reason for misclassification. In fact the 56
tables where Country of birth is one of the two variables are also the 56 tables with the
largest geographic variability of percent misclassified. Furthermore, the top 100 tables
all have either Country of birth or Accommodation type as one of their variables.
The geographic variation of ∆ under Model 1 seems to be steered by single variables which exhibit strong geographic variation, with the ranking very clearly dominated by the two above-mentioned variables, followed by Lowest floor level of accommodation, Ethnic Group, Cars/Vans owned and Occupancy rating.
clear ‘blocks’ within the ranking, making it quite clear that the single variable variation
is more important than the variation of bi-variate associations.
Assessing the geographic variation of goodness-of-fit as measured by Z_CR² produces slightly different results. The maximum variation is found for the table Ethnic Group by Religion, which is pictured on the right-hand side of Figure 10.8. In addition to several London LAs, large errors are also shown in Birmingham, Leicester and Bradford. These areas therefore have Ethnic Group by Religion tables that are most different (as measured by Z_CR²) from the Model 1 estimates⁵.
The overall ranking by geographic variation according to Z_CR² is a lot less orderly than the ranking according to ∆. Some of the same variables are clearly present in the top, but so are many others, in particular the migration variables. The ranking also shows less grouping of individual variables, indicating that perhaps univariate distributions have less of an effect on fit measured by Z_CR² compared to it being measured by ∆.
Overall it is difficult to ascertain what quality of these variables makes them stand out. Using the measures described in Section 8.3 to try to describe these variables in terms of having higher entropy or being otherwise extreme in their distributions has produced unsatisfactory results. A few of the variables do stand out as having generally very un-uniform distributions (as measured by Pearson's chi-squared): Migration origin, Ethnicity and Religion. However other variables such as Accommodation type are not extreme in that sense, making it difficult to draw any conclusions.
⁵ This does not mean they differ most from the national average though; Model 2 will answer that question.
10.2 Model 2: From $[AB],[C]$ to $[\widehat{ABC}]$
Figure 10.9 gives a visual depiction of the second model, again with the red mosaic plots representing the marginals used in the estimate. Here the association between A and B is used, which automatically includes the individual effects of variables A and B along with the univariate geographic variable C. In addition to the coefficients used in Model 1 we therefore add the τ^AB_ij interaction term⁶:
$$ \hat{x}_{ijk} = \tau\cdot\tau^{A}_{i}\cdot\tau^{B}_{j}\cdot\tau^{C}_{k}\cdot\tau^{AB}_{ij} \qquad [10.2] $$
This means the two variables are no longer independent, but their relationship will be forced to be the same across all areas. The degrees of freedom are the same as in the first model, minus the additional (I−1)(J−1) coefficients required to describe the interaction between A and B. So the smallest table covering all 373 areas will now have 1116 degrees of freedom - only one fewer than with the previous model, but this single degree of freedom is lost in exchange for the τ^AB_ij coefficient, potentially a very informative addition to the model.
Figure 10.9: Model 2 with equation x̂_ijk = τ · τ^A_i · τ^B_j · τ^C_k · τ^AB_ij and degrees of freedom df = df[Model 1] − (I−1) × (J−1); the margins used are [AB] and [C]
As with Model 1 we can again summarize the goodness-of-fit statistics for Model 2 in Figure 10.10. This can be compared with the results for Model 1 presented in Figure 10.3 (page 165). Although perhaps not immediately obvious from the scatterplot, both measures are much more in agreement in measuring the errors of Model 2 estimates, with an r² value of 0.72. The fact that there is less discrepancy between the measures compared to Model 1 is also confirmed by the average difference in rankings by both statistics, which is now 123 (compared to 230 before) with a maximum of 630 (compared to 1097). The scatter plot also exhibits quite clear clusters of outliers. Upon further investigation it turns out these points refer to the tables which include
6 Note the difference between the notations: in bracket notation [AB] refers to the whole margin
and thereby by definition also silently includes both [A] and [B]. In the log-linear model, however, we
need the interaction term $\tau^{AB}_{ij}$ as well as the individual $\tau^A_i$ and $\tau^B_j$ terms to express [AB]. Using only
$\tau^{AB}_{ij}$ without the lower terms would indicate a non-hierarchical model – possible, but not used in this
application.
Proportion misclassified (∆): minimum 0.82, median 8.71, maximum 22.43
Cressie-Read Z-score (ZCR2): minimum 48.2, median 343.88, maximum 1251.29
Figure 10.10: Goodness-of-fit statistics for Model 2 (N = 1596); the scatter plot shows ZCR2 against ∆ for all tables
the following variables: Migration origin, Country of birth, Ethnicity, Religion, Lowest
Floor, Transport to work and Accommodation Type (from light red to dark red in that
order in Figure 10.10). Interestingly, these are the exact same tables that exhibited
the highest geographical variability of goodness-of-fit in Model 1. Again, most
of these tables do have variables that are in the top ranks of being un-uniform, but this
relationship only holds at the extreme end of the spectrum and not as a general rule.
By virtue of including the interaction term $\tau^{AB}_{ij}$, Model 2 performs better by both
measures, as can be seen from the summary statistics in Figure 10.10. While Model 1
misclassified an average of over 12%, with Model 2 the average misclassification rate was
8.71%. The best improvement in these terms was found in the table that showed the
worst misclassification under Model 1: Economic Activity by Year last worked went
from 65% to just over 5% misclassified. In a similar vein, the best improvement in
terms of ZCR2 was found in the tabulation of Migration Indicator by Region of Origin,
the table that performed worst under Model 1.7
Again the best fitting table as measured by both measures is Communal establishment
type by Sex, with ∆ = 0.82% and ZCR2 = 48.2. Since Model 2 'forces' the
national level relationship across all local authorities there is no point in inspecting
7 Interestingly, in the case of a few tables, their performance under Model 2 seems to have deteriorated
slightly. If we take a closer look at these tables, however, it would seem this deterioration is simply an
artefact of decimal rounding. For example the table Communal establishment type by Ethnicity under
Model 1 had ∆ = 11.288311% and under Model 2 ∆ = 11.288403%, which translates to a difference
of just under 5 individuals misclassified in a table with over 15,000 cells. Since the IPF algorithm has
a stopping criterion of 0.001, i.e. the procedure stops when the margins of the estimates are less than
0.001 different from the original margins, this would explain how it is possible that the estimated table
might still end up slightly worse than under Model 1. However, by definition, the higher model must
produce a fit at least as good as the lower nested model, so these results are not indicative of anything
other than imprecision.
[Figure 10.11 compares the national SAM/Model 2 margins with the worst fitting LA layers: Accommodation type by Country of birth (∆ = 22.43%, rank 1596; ZCR2 = 986.59, rank 1579), with Merthyr Tydfil as the worst fitting LA, and Country of birth by Region of origin (∆ = 19.74%, rank 1593; ZCR2 = 1251.29, rank 1596), with Cardiff as the worst fitting LA.]
Figure 10.11: The two worst performing tables under Model 2
the mosaic plots, as they all look the same for all LAs. If we recall from the previous
section, we saw that nationally the odds of women being in a non-NHS establishment
(as opposed to an NHS one) are more than twice as high as those of men. In the case of
Liverpool, the odds ratio was 3.10. Whereas Model 1 meant all LAs ended up with an
odds ratio of one, under Model 2 they now all have the same (national) odds ratio of just
over two (the actual odds ratios range from 0.22 to 19.20).
The worst performing table according to ∆ is Accommodation type by Country of
birth with 22.43% misclassified, and according to the Cressie-Read statistic it is Country
of birth by Region of origin with ZCR2 = 1251.29. In order to gain some insight into
where these errors are coming from we can look at which LAs have the largest error.8
In the case of the first table this is Merthyr Tydfil (again) with 85.89% misclassified
and in the second it is Cardiff with ZCR2 = 190. The original tabulations for both of
these extreme LAs are plotted in Figure 10.11 next to their respective national margins
– these of course correspond to the estimates made under Model 2 for these and all other
LAs. In the top two mosaic plots it is the third column that corresponds to 'Wales' as
8 This does not mean they necessarily contribute most of the error though, since that depends on
the size of the LA. But it does mean the particular LA was least well fitted by Model 2.
Table 10.2: Ranking by geographic variation of Percent misclassified (∆)
Crosstabulation Model 1 Model 2
Country of birth by Accommodation Type 1st 2nd
Country of birth by Lowest Floor 2nd 1st
Cars/Vans owned by Schoolchild or Student 3rd 4th
Country of birth by Sex 4th 3rd
Country of birth by Status in Communal Est. 5th 7th
the Country of birth, while in the bottom two plots this answer is represented by the
third row. It is clear from the rankings (given in the figure) that both tables perform
similarly badly by both measures.
Analysing Model 1 it was found that measures of association strength could be used
to predict the accuracy of the estimate to a certain degree. With Model 2, however, no
such relationship exists. This is to be expected, since Model 2 incorporates this association
rather than keeping the variables independent. Any measure of [AB] strength is
not likely to have any predictive power, since the errors under this model come mainly
from the [AC] and [BC] margins, i.e. the geographic variation, to which we turn next.
Geographic variation of model fit
In the analysis of Model 1 the geographical analysis of the model fit focused on the
[AB] relationship conditional on the geography, i.e. AB|C. The same analysis was
performed on the Model 2 estimates and produced almost identical results for the
tables that produced the largest amount of geographical variation of error. This can
be seen from Table 10.2, which lists the tables where the percent misclassified under
Model 1 had the highest geographic variation, alongside the rankings these tables got
under Model 2.
So instead of producing maps for the tables with the most geographic variation of
error under Model 2, we instead look at the table where the addition of the interaction
term led to the greatest change in geographic variation compared to Model 1. The
greatest decrease of geographic variation comes about in the crosstabulation of Hours
worked weekly and Transport to work, but this decrease is not very dramatic, as it
amounts to a range of ∆ of 15.71 percentage points under Model 1 reducing to a
range of 12.9 percentage points under Model 2.
The more interesting change is at the other extreme: the relationship between Year
last worked and Transport to work exhibits the greatest increase in geographic variation
after the interaction term is added in Model 2. Under Model 1 local authorities ranged
from 49.49% to 55.33% misclassification, as can be seen on the right hand panel of
[Figure 10.12 maps ∆ by local authority under Model 2 (left) and Model 1 (right), with a histogram of the ∆ values below the maps.]
Figure 10.12: Year last worked by Transport to work goodness-of-fit under the first two
models
Figure 10.12 as well as in the histogram below. Including the interaction term in Model
2 reduced the misclassification rate significantly, but also increased its geographical
variation, which now ranges from 3.59% to 32.10%. So while in Model 1 on the right
hand side the errors are an expression of how different the relationship in the local
authorities is from independence, the left-hand side of the figure describes, under Model
2, how different they are from the aggregate or national relationship.
By virtue of forcing the national [AB] relationship onto each geographic layer,
Model 2 is of course also forcing the national [A] and [B] margins onto each of the local
authorities. So despite the improvement that comes from including the top margin, the
errors in the side margins $[\widehat{AC}]$ and $[\widehat{BC}]$ will mean the overall fit of Model 2 is not
dramatically better than Model 1. This occurs because of the principle of variational
independence underlying these models: even if a particular LA's relationship between
the two variables is identical to that at the national level – e.g. one group is twice as
likely to have a certain characteristic as another – if the relative sizes of the groups
are different, the error need not be reduced at all.9
9 The error at the level of an individual LA can even be increased, although on the whole the error
cannot increase (see footnote 7).
Table 10.3: Highest ranking variables by Mean ∆ and standard deviation across LAs

Variable                      Mean ∆     Variable              Std. Dev. of ∆
Accommodation type            15.00      Country of birth      15.04
Country of birth              12.88      Accommodation type    10.25
Soc.Econ. class of FRP(a)     10.79      Ethnic Group           7.54
Cars/vans owned               10.56      Lowest floor           7.44
Ethnic group                  10.03      Cars/vans owned        6.30

(a) Family Reference Person
This means we can also separately analyse the errors found in the $[\widehat{AC}]$ or $[\widehat{BC}]$
side margins. In our particular scenario these two sets of side margins cannot be
distinguished from each other, because over the whole set of 1596 tables they are
actually identical – it does not matter if e.g. Accommodation type is designated as
variable A or variable B – its margin with variable C will behave the same way. It is
also worth noting that $[\widehat{AC}]$ and $[\widehat{BC}]$ are estimated the same under both Models 1 and
2 – in both cases IPF uses only the information contained in the edges [A] and [C] or
[B] and [C] to estimate them.
The analysis of Model 1 and Model 2 so far has shown the dramatic improvement
in the estimates that the inclusion of [AB] has caused, but it has also consistently flagged
up the same variables in the tables with the worst estimates under both models. It
therefore comes as no surprise that these same variables are also at the top of the list
when it comes to the side margins, i.e. how individual variables vary geographically.
They are listed in Table 10.3 along with their errors and the standard deviations thereof.
A graphical examination of one of the most extreme variables is presented in Figure
10.13, showing (part of) the side margin Country of birth by LA. At the top of the
graph the national [A] margin is shown – that is the one that is forced onto all $[\widehat{A|C}]$
layers, i.e. in the estimates for both Model 1 and Model 2 each LA has the same ratios
of people in each Country of origin category. Underneath we can see the actual [A|C]
distributions: the top five are for the local authorities where the distribution is closest
to the national and the error smallest; the bottom five are the LAs with the largest
percentage misclassified, i.e. the ones that deviate the most from the national aggregate.
These are all Welsh LAs that were noted before and in fact, if one were to map these
errors, the resulting pattern is nearly identical to the one observed in Figure 10.8 in
the previous section of the error in the tabulation with Accommodation type.
This particular example makes clear the difficulty of disentangling the effects on the
error that stem from the univariate distributions of individual variables and from the
geographic variation of the bivariate relationships. Therefore the next section looks at
a model where only the geographic variation of the univariate distributions is included,
before finally inspecting the model with all three side margins.
[Figure 10.13 shows the national SAM Country of birth margin (categories: England, Scotland, Wales, Northern Ireland, All other countries, Not usual resident) above the observed distributions for the five best fitting LAs (Bristol, Rushmoor, Vale of White Horse, Wokingham and Chester, with ∆ between 3.32% and 3.69%) and the five worst fitting LAs (Neath Port Talbot, Caerphilly, Rhondda Cynon Taff, Blaenau Gwent and Merthyr Tydfil, with ∆ between 83.37% and 85.74%).]
Figure 10.13: Country of birth – best and worst performing LAs under Model 2 and
Model 1
10.3 Model 3: From $[AC], [BC]$ to $[\widehat{ABC}]$
Figure 10.14 gives a visual depiction of the third model, again with the red mosaic plots
representing the marginals used in the estimate. This time the association between A
and B is excluded from the model, while both side faces of the cube – margins [AC]
and [BC] – are used. In addition to the coefficients used in Model 1 we are therefore
adding the $\tau^{AC}_{ik}$ and $\tau^{BC}_{jk}$ interaction terms:
$$\hat{x}_{ijk} = \tau \cdot \tau^A_i \cdot \tau^B_j \cdot \tau^C_k \cdot \tau^{AC}_{ik} \cdot \tau^{BC}_{jk} \qquad [10.3]$$
As in Model 1 then, the IPF estimate will force the association between A and B in
every LA ($[\widehat{AB|C}]$) to be independent. Contrary to Model 1, however, each LA will now
have the correct distribution of A and B. This means that the national estimate ($[\widehat{AB}]$)
will not be independent as it was in Model 1. Again, an example of the calculation of
degrees of freedom for the smallest table with 2 × 2 × 373 = 1492 cells: the
1117 degrees of freedom from the first model are now reduced by another 372 for each
side margin, leaving 373 degrees of freedom.
This is in stark contrast to Model 2, which had 1116 degrees of freedom. So in the
simplest of tables from our collection, the top face of the cube [AB] that was included in
Model 2 reduced the df by only one, while the side faces of the cube included in Model
3, [AC] and [BC], have reduced the df by 372 each! Under different circumstances
$$\hat{x}_{ijk} = \underbrace{\tau \cdot \tau^A_i \cdot \tau^B_j \cdot \tau^C_k}_{\text{Model 1}} \cdot \tau^{AC}_{ik} \cdot \tau^{BC}_{jk}$$
with df = df[Model 1] − (I−1)(K−1) − (J−1)(K−1)
Figure 10.14: Model 3 with equation and degrees of freedom
Proportion misclassified (∆): minimum 0.12, median 4.71, maximum 64.62
Cressie-Read Z-score (ZCR2): minimum 3.43, median 365.62, maximum 1699.31
Figure 10.15: Goodness-of-fit statistics for Model 3 (N = 1596); the scatter plot shows ZCR2 against ∆ for all tables
one would be inclined to expect that this addition of 744 log-linear parameters into the
model would produce a dramatically better fit, but since this is a geographical table and
these added coefficients all refer to geographic variation, this expectation is perhaps less
warranted.
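The degrees-of-freedom bookkeeping behind this comparison can be written out explicitly; the following short R sketch (variable names are illustrative) reproduces the figures quoted above for the smallest, 2 × 2 × 373, table.

# Degrees of freedom for the smallest table; a sketch with illustrative names.
I <- 2; J <- 2; K <- 373
df_model1 <- I * J * K - 1 - (I - 1) - (J - 1) - (K - 1)        # 1117
df_model2 <- df_model1 - (I - 1) * (J - 1)                      # 1116
df_model3 <- df_model1 - (I - 1) * (K - 1) - (J - 1) * (K - 1)  # 373
df_model4 <- df_model3 - (I - 1) * (J - 1)                      # 372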
Again we can summarize the goodness-of-fit statistics in Figure 10.15 along with
the scatter plot of both errors. It is worth noting that the variable clusters that stood
out in Model 2 are not present in Model 3 – i.e. the clustering that can be seen in
the scatter plot is not systematic and cannot be linked with any individual variables.
The error distribution of Model 3 can be compared to those from the previous models.
While the median percent misclassified is now 4.71% compared to 8.71% under
Model 2, the median standardized Cressie and Read score is in fact slightly higher: 366
compared to 343. The mean errors are larger for both measures, which is consistent
with the fact that the maximum errors for Model 3 are significantly larger than in
Model 2 and are in fact almost as large as in Model 1.
The differences are much easier to observe if we compare the whole distributions of
errors for the three models, as is done in Figure 10.16. In the case of percent misclassified
(left panel) the Model 3 errors (red line) seem almost an amalgamation of the
Model 1 and Model 2 distributions10 – shifting to the left, but keeping
almost the whole range of the lowest model as well as its shape. The Cressie-Read
statistic (right panel) similarly shifts to the left, but keeps the characteristic bimodal
distribution found in Model 1. This sort of aggregate behaviour can be expected given
that Model 3 is hierarchically above Model 1 and not directly related to Model 2.
[Figure 10.16 overlays the density curves of ∆ (left panel) and ZCR2 (right panel) for Models 1, 2 and 3.]
Figure 10.16: Goodness-of-fit statistics distribution for Models 1, 2 and 3 (N = 1596)
According to ∆ the best fit was produced for the table of Type of communal establishment
and Country of birth, with only 0.12% misclassified (just over 3000 people),
and this table also ranked third according to its ZCR2 value of 9.59. It was expected that
Country of birth, which was one of the worst performing variables under the previous
models, would perform significantly better here, since it was assumed that the main
cause of error is its geographic variation, which is now included in the model. It is
perhaps surprising to find it in the best performing table, but overall the average rank
of tables that have Country of birth as one of their variables has gone from 1159 under
Model 1 and 1506 under Model 2 to 392 under this model, making it the 7th highest
ranking variable – Type of communal establishment ranking the best.
The variable association structure for this table can be seen in Figure 10.17. The
first two panels are the national associations, i.e. the top faces of the cube. It can be
seen that although there is no [AB] association present in the model, the $[\widehat{AB}]$ estimate
is visibly different from independence. An example of one of the layers is given in the
second two panels, which show the original data for Oswestry and the estimate under
10 The distributions are drawn using density lines – the area underneath each of them equals one.
The density estimation function as implemented in R results in smoothed lines, which do not however
respect the bounds of the data. This leads to the impression that there are tables where ∆ is less than
zero, which is not true and is solely an artefact of the graphing function.
Model 3, where the variables are now independent. In this table Oswestry had the
worst estimate of all the local authorities, with a misclassification rate of 0.6%, which
translates to 11 people. This can be compared to 209 people, or 11% misclassified, under
both Model 1 and Model 2.
[Figure 10.17 shows four mosaic plots: the national SAM margin [AB], the national Model 3 estimate $[\widehat{AB}]$, the Oswestry SAM layer $[AB|C_{318}]$ and the Oswestry Model 3 estimate $[\widehat{AB|C_{318}}]$.]
Figure 10.17: Type of communal establishment by Country of birth
The largest error of estimation under Model 3 according to ∆ is for Economic Activity
(last week) by Year last worked with 64.62% misclassified, just half a percentage point
less than when it scored the worst fit under Model 1. According to ZCR2 the worst table is
Distance to work by Workplace with a value of 1699, which also has 56.65% misclassified.
This table was also amongst the worst tables under Model 1. The top margins for both
are shown in Figure 10.18. At first glance the Model 3 estimates seem similar to
the graphs shown in the analysis of Model 1 (Figure 10.5), where the variables
are independent. The outlines of this independent model are overlain here in red.
This makes it easier to see that the Model 3 estimates on the right hand side are
not completely independent, as the tiles are slightly out of alignment with the estimate
produced by Model 1, where the top margin did exhibit independence.11 All four of these
variables have only moderate geographic variation (relative to the extremes observed
in the previous section) and this, combined with very extreme associations with many
structural zeros, means that including the geographical side margins [AC] and [BC]
made very little impact on the overall fit of these tables.
Both tables in Figure 10.18 are examples of tables where the [AB] margin included
in Model 2 made a great impact, while including only the geographic margins [AC] and
[BC] in Model 3 had little effect. However, this need not be the case. At the other extreme
we can look at a table where including the [AB] margin hardly improved the fit at all, while
adding the geographic margins in Model 3 made a significant difference. This happened
most dramatically in the table Accommodation type by Country of birth, where model
fit improved by 21 percentage points, from ∆ = 22.88% under Model 1 to ∆ = 1.83% under
Model 3, while Model 2 showed little improvement (∆ = 22.43%). This is
11 This counter-intuitive behaviour is due to the so-called Simpson's paradox, which is demonstrated
more dramatically in the next example as well as being explored in more depth in Chapter 9.3.
[Figure 10.18 compares the national SAM margin [AB] with the national Model 3 estimate $[\widehat{AB}]$ (with the Model 1 outline overlain) for the two worst performing tables: Economic Activity (last week) by Year last worked (∆ = 64.62%, rank 1596; ZCR2 = 1489.37, rank 1592) and Distance to work by Workplace (∆ = 56.64%, rank 1595; ZCR2 = 1699.30, rank 1596).]
Figure 10.18: The two worst performing tables under Model 3
unsurprising as the same table stood out as producing the most geographic variation of
error under both previous models (see pages 170 and 175), and both individual variables
were at the top of the list of univariate geographic variation (see Table 10.3 on page
177).
This table and its goodness-of-fit under the three models studied so far are elaborately
visualised in Figure 10.19. On the left half of the figure the national top margin
[AB] is compared with the estimates under all three models. On the right the
same is done for one of the layers – the local authority of Cambridge $[AB|C_{144}]$, which
happened to have the worst estimate of all LAs in Model 3. So for each model the right
gives an example of the local layers that add up to the top national margins on the left.
At the top, under Model 1, we see the national margin with A and B independent, just
as they are in the individual layers. In the second row the national Model 2 top margin
is completely correct, but it is also repeated in the individual layers, thereby making
the error remain large. Both bottom panels show the Model 3 estimates, where we can
see in the Cambridge layer that the variables are independent; however the [A] and
[B] margins are now correct, making the error fall considerably. More dramatically,
all 373 layers such as this one, with independent [A] and [B] margins, add up to the
[Figure 10.19 compares, for each model, the national top margin with the Cambridge layer: under Model 1 the national estimate $[\widehat{AB}]$ has ∆ = 22.88% and the Cambridge estimate $[\widehat{AB|C_{144}}]$ has ∆ = 21.04%; under Model 2 the values are 22.42% and 22.02%; under Model 3 they are 1.83% and 5.27%. The observed [AB] and $[AB|C_{144}]$ tables are shown alongside.]
Figure 10.19: Accommodation type by Country of birth fit under three models
national top margin (bottom left), which shows a rather strong association structure
almost identical to the actual one.
This last Model 3 estimate is a nice example of Simpson's paradox. Locally the
country of birth has no effect on accommodation type; however, due to the geographic
variation of the individual variables, in the aggregate the layers sum up to a national
result where country of birth does affect accommodation type. This occurs because the
[AC] and [BC] margins exhibit a strong association. In contrast, in Model 1, where
these two margins exhibited independence, the national [AB] margin summed up to be
just as independent as the individual LA layers.
As with Model 1 we can check again whether the measures of association strength discussed
in Section 8.3 are helpful in predicting the degree of fit of this model.12 The results
of the six linear regressions are shown in Table 10.4 and again the Freeman-Tukey
statistic is the best at predicting the percent misclassified, with an R² value of almost
12 A similar analysis for Model 2 was not included as the results showed no correlation.
Table 10.4: Coefficients of determination (R²)

                                FT²       λAB       UAB
Percent misclassified (∆)       73.94%    54.52%    48.96%
Cressie-Read Z-score (ZCR2)     68.40%    51.45%    63.612%
0.74. This makes sense considering the importance of the [AB] relationship: it is
missing in both Models 1 and 3, hence the error can be predicted from the strength
of the [AB] association. When it is included in a model, such as in Model 2 (and, as
we shall see, in Model 4), the measures of association strength are no longer helpful as
predictors of fit.
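The six regressions behind Table 10.4 are straightforward to reproduce; the following is only a hedged R sketch, assuming a hypothetical data frame results with one row per table holding the fit measures (delta, zcr2) and the association measures (FT2, lambdaAB, UAB) – all of these names are illustrative and not taken from the thesis code.

# A sketch only: 'results' is a hypothetical per-table data frame.
r2 <- function(formula) summary(lm(formula, data = results))$r.squared
sapply(list(delta ~ FT2, delta ~ lambdaAB, delta ~ UAB,
            zcr2  ~ FT2, zcr2  ~ lambdaAB, zcr2  ~ UAB), r2)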
Geographic variation of model fit
Investigating the geographical variation of fit under Model 3 means investigating the
variation of the $[\widehat{AB|C}]$ fit where both $[\widehat{A|C}]$ and $[\widehat{B|C}]$ are correct. Recall that under
Model 1, where $[\widehat{A|C}] = [A]$ and $[\widehat{B|C}] = [B]$, the largest variations were due to the
geographic variation of univariate variable distributions, hence Country of birth was one
of the covariates in all the top 56 tables. This can no longer be a factor under Model
3, so the geographic variation depends exclusively on the local relationship between
variables – and how different it is from independence.
Table 10.5: Ranking by geographic variation of Percent misclassified (∆)

Crosstabulation                              Model 3    Model 1
Ethnic group by Religion                     1st        175th
Country of birth by Ethnic group             2nd        56th
Housing indicator by Occupancy rating        3rd        295th
Housing indicator by Central heating         4th        998th
Accommodation type by Lowest floor           5th        106th
Table 10.5 lists the top five tables with the largest geographic variation as measured
by the standard deviation of ∆ across the LAs. The largest geographic variation of
the error, ranging from 1.64% in Easington to 43.98% in Tower Hamlets, is found
in the table Ethnic group by Religion. This table also showed the largest geographic
variation of the ZCR2 error. Figure 10.20 maps these errors under Model 3 and compares
them with the ones under Model 1. The overall pattern has stayed roughly the same,
indicating that the geographic variation is largely due to the variations in the variable
[Figure 10.20 maps ∆ by local authority under Model 3 (left) and Model 1 (right).]
Figure 10.20: Ethnic group by Religion goodness-of-fit under Model 3 and Model 1
relationship, and that the inclusion of the [AC] and [BC] margins had a generally
uniform effect across all LAs.
This is, however, not necessarily the case. The table with the second largest geographic
variability of fit shows a dramatically different picture. The map of errors
for Country of Birth by Ethnic group in Figure 10.21 clearly shows how under Model 1 the
largest errors stemmed from the [AC] and [BC] side margins, making Wales in particular
stand out in addition to a few London boroughs. With the inclusion of these
coefficients into Model 3 this cause of geographic variation disappeared, leaving only
the variation of error due to the difference in the bivariate relationships between the
two variables. This error ranges from 1.91% in Easington to 37.93% in Kensington and
Chelsea. These two examples make it clear that geographic variation of errors can stem
from both the univariate and the bivariate variable distributions, and usually of course
from a combination of both.
10.4 Model 4: From $[AB], [AC], [BC]$ to $[\widehat{ABC}]$
Figure 10.22 gives a visual depiction of the fourth and final model. This time, as the
red mosaic plots indicate, all three two-way margins are used in the estimate. This
model is hierarchically above the three previous models – all three are nested within it –
and is the most complete model possible short of the saturated model. Thus the only
terms missing from the equation are the three-factor interactions $\tau^{ABC}_{ijk}$:
$$\hat{x}_{ijk} = \tau \cdot \tau^A_i \cdot \tau^B_j \cdot \tau^C_k \cdot \tau^{AB}_{ij} \cdot \tau^{AC}_{ik} \cdot \tau^{BC}_{jk} \qquad [10.4]$$
In addition to each LA having the correct univariate distributions of A and B, as we
had in Model 3, it also includes the variable interaction term as in Model 2. Of course
[Figure 10.21 maps ∆ by local authority under Model 3 (left) and Model 1 (right).]
Figure 10.21: Country of Birth by Ethnic group goodness-of-fit under Model 3 and
Model 1
A and B interactions vary locally as well, but this information would be included in the
$\tau^{ABC}_{ijk}$ term, which would completely and perfectly estimate the data. The degrees of
freedom are the same as in Model 3 minus the additional (I−1)(J−1) coefficients required
to describe the interaction between A and B. As when we went from Model 1
to Model 2, in the simplest table this means only one less degree of freedom, in exchange
for the $\tau^{AB}_{ij}$ coefficient. So if there were 373 df in the simplest table under Model 3,
there are now 372 under Model 4.
$$\hat{x}_{ijk} = \underbrace{\tau \cdot \tau^A_i \cdot \tau^B_j \cdot \tau^C_k \cdot \tau^{AC}_{ik} \cdot \tau^{BC}_{jk}}_{\text{Model 3}} \cdot \tau^{AB}_{ij}$$
with df = df[Model 3] − (I−1)(J−1)
Figure 10.22: Model 4 with equation and degrees of freedom
The model summary in Figure 10.23 shows the dramatically better fit produced by
Model 4, with a maximum ∆ of just under 6 percent and a minimum value of zero, i.e.
no error at all.13 In fact there are a total of 77 tables that have no misclassifications
13 Due to the vagaries of floating point arithmetic, the error is in fact not calculated as exactly zero. R
uses the IEEE Standard for Floating-Point Arithmetic (IEEE-754) for representing real numbers, which
means numbers are rounded to 53 binary digits, equating to about 16 decimal digits of precision.
The maximum error due to this is also known as the machine epsilon. This means two numbers —
in our case the original SAM observations and the IPF estimates — "will not reliably be equal unless
they have been computed by the same algorithm, and not always even then" (Hornik, 2011). This is
in fact what happens, so the tables with no error actually reported ∆ values of 4 × 10⁻¹³ or similar.
However these are numerical errors, not errors per se, and are unavoidable due to the computer's finite
representation of numbers (Burns, 2009), and can therefore be safely ignored.
Proportion misclassified (∆): minimum 0.0, median 1.30, maximum 5.96
Cressie-Read Z-score (ZCR2): minimum −258.76, median 54.94, maximum 382.65
Figure 10.23: Goodness-of-fit statistics for Model 4 (N = 1596); the scatter plot shows ZCR2 against ∆ for all tables
Figure 10.23: Goodness-of-fit statistics for Model 4 (N= 1596)
at all. As we shall see below, this is due to their internal structure being composed
of the right pattern of structural zeros making the three-way interaction effectively
superfluous. The smallest error of the tables that do have a three-way interaction had
an error of ∆= 9.25 ×10−6which is equivalent to about 24 misclassified persons14.
The Z-scores of the Cressie Reed statistic for Model 4 have a maximum value of
382 and a minimum of -259. Out of the 1596 tables 260 have a negative ZCR2value.
Negative Z-scores might seem unorthodox, but in this context they are completely
expected. Since Z-scores are standardized units of the Cressie Reed statistic, a negative
value is simply equivalent to a p-value of more than 0.5. Figure 10.24 illustrates what
is happening by plotting Z-scores against the more familiar p-values. The black dotted
line marks a Z-score of 0. A Z-score of zero means a table’s Cressie and Read statistic
is the same value as the table’s degrees of freedom i.e. its expected (mean) value. To
give an example: a 2×2×373 table has 372 degrees of freedom under Model 4. A
Cressie and Read statistic of 372 would give it a ZC R2value of 0 and that is equivalent
to a p-value of 0.5.
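As a rough illustration of this correspondence, the following R sketch converts a chi-squared-distributed statistic and its degrees of freedom into a Z-score and back into a p-value, assuming the Wilson-Hilferty normal approximation referred to later in the chapter (footnote 19) is the standardization used; the function name z_wh is illustrative only.

# A minimal sketch, assuming the Wilson-Hilferty approximation; names illustrative.
z_wh <- function(stat, df) {
  ((stat / df)^(1/3) - (1 - 2 / (9 * df))) / sqrt(2 / (9 * df))
}
z <- z_wh(372, 372)                 # a statistic equal to its df gives z of about 0.02
p <- pnorm(z, lower.tail = FALSE)   # about 0.49, i.e. a p-value of roughly 0.5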
14 In practice, telling the difference between the tables with no errors and tables with minimal
actual errors is still tricky, even after accounting for the numerical error due to the computational
limitations mentioned before. An additional source of error is the IPF procedure itself, which stops
when a minimal divergence is reached. The smaller this divergence is set to be, the longer the procedure
will run and the more precise the estimate. But a level of precision must be decided upon in order
to keep the iterations within reasonable limits. In this case the maximum allowed deviation in each
margin was set to be 0.01. Over a table with hundreds of marginal cells this can in some cases add
up to a misclassification of a few people. This error can be disregarded as the result of the
IPF algorithm parameters, but only after visual inspection of the table, which shows the errors did not
arise from a third order interaction.
[Figure 10.24 plots Z-scores (roughly −4 to 4) against the corresponding p-values (0 to 1).]
Figure 10.24: Relationship between Z-scores and p-values
The shaded area on the right indicates the region outside the traditional 95%
(one-tailed) confidence interval. From a significance testing perspective one might be
interested in the tables with a p-value of more than 0.05, which is equivalent to them
having a ZCR2 value of less than 1.64 (there are 271 such tables). For all of these
tables the conclusion could then be that there is no third order interaction – or, to be
more precise: that given this data and using the (arbitrary) significance level of 0.05,
we cannot reject the null hypothesis that there is no third order interaction.15 The
extreme negative values of the Z-scores correspond to tables with perfect fit and hence
a Cressie and Read statistic of zero. Depending on the degrees of freedom of the table,
this zero can then be standardized e.g. as ZCR2 = −259 for a table with df = 14,880 or
as ZCR2 = −100 for a table with df = 2,232. Either way this is equivalent to a p-value
of 1.
We can therefore have a look at two types of tables with minimal errors. The first
is tables that effectively have no error, since there is no third order interaction. One
such table is Year last worked by Workplace and its plots are shown in Figure 10.25, for
both the national margins and the Liverpool LA.16
The mosaic plots make it immediately clear why there is no IPF error: the table
categories are mutually exclusive. Either the respondent is currently in employment
(top row) and therefore can answer the workplace question (first four columns), or he
is not in employment (the remaining five rows) and therefore has no workplace (last
15 Of these 271 tables, 197 actually have a p-value of 1. If this were a sampling exercise such values
would in fact be cause for alarm, as they would mean our data is so perfect that any other random
sample would produce a greater error. Since it is not, a p-value of 1, which corresponds to a Z-score of
about 8, means just about perfect fit. Of course the p-values are not exactly one (just as they cannot
be exactly zero, see Section 8.4.5) but as soon as they are larger than 9 × 10⁻¹⁶ they are effectively
treated as one by the processor.
16 This is the table with the lowest ZCR2 value of −259. As has been explained above, this does not
mean it is the best fitting table, only that it has the highest df of the 77 tables with perfect fit.
[Figure 10.25 shows four mosaic plots: the national SAM margin [AB], the national Model 4 estimate $[\widehat{AB}]$, the Liverpool SAM layer $[AB|C_{44}]$ and the Liverpool Model 4 estimate $[\widehat{AB|C_{44}}]$.]
Figure 10.25: Year last worked by Workplace
column). Since all the respondents are either in the top row or in the last column and
cannot be in both at the same time, the [A|C] and [B|C] margins are enough to
perfectly determine the distribution. Thus the values of the $\tau^{ABC}_{ijk}$ coefficients, were
we to calculate them, would all equal one and therefore have no effect on the cell
values. So not only are the two plots on the left identical – the marginal plots stay the
same after IPF by definition – but the Liverpool layer (the two plots on the right) is also
perfectly estimated. There are an additional 76 such tables, where the structural zeros
are distributed in a way that allows for only one way to fill the tables under Model 4,
and hence all have no error.
The second type of minimal error we can look at is for the table that actually
has non-trivial $\tau^{ABC}_{ijk}$ coefficients and performs best under Model 4. This was the
crosstabulation of Housing indicator by Accommodation self-contained, the national
margin of which is shown in Table 10.6. The offending cell is the one with the 6 people
who live in accommodation that is not self-contained and yet do not have an answer to
the housing indicator question. Because of these 6 people the total table ends up with
2.4 people misclassified in what would otherwise be a perfect estimate.
Table 10.6: National SAM table of Housing indicator by Accommodation self-contained

                                  Self-contained    Not self-contained    NA
Not overcrowded or lacking        2,189,593         0                     0
Overcrowded or lacking            379,621           4,935                 0
NA                                214               6                     47,191
Looking at the worst performing tables we again have two examples: the worst table
according to ∆, which is Age by Socio-Economic Classification (NS-SEC) of the Family
Reference Person, and the worst by ZCR2, which is Distance moved by Migration
origin, both of which are represented in Figure 10.26. The top two panels represent
the table with almost 6% misclassified, with Oxford on the right as it had the highest
error of all the LAs (over 15%). The bottom two panels are the worst by ZCR2, with
[Figure 10.26 compares the national SAM margin [AB] with the worst LA's Model 4 estimate (and observed SAM layer) for the two worst performing tables: Age by Reference NS-SEC (∆ = 5.96%, rank 1596; ZCR2 = 106.99, rank 1464) and Distance moved by Migration origin (∆ = 2.81%, rank 1452; ZCR2 = 382.65, rank 1596).]
Figure 10.26: The two worst performing tables under Model 4
the national margin on the left and the worst performing layer (Birmingham) on the
right. Interestingly, we can see that a large part of this table has the correct pattern
of structural zeros to produce no errors, due to the variable categories being mutually
exclusive (if you have not moved then you do not have a migration origin). The small
proportion of the table that includes actual migrants is the one that is causing the
errors. Given the variables this is to be expected: the distance of a region of origin by
definition changes across the country, therefore this table would be expected to have a
strong [AB|C] relationship and hence a large error under Model 4. But because only
about 10 percent of the table population have moved, this error is not picked up by ∆,
only by ZCR2.
Out of the four models, Model 1 and Model 3 are the ones that do not include the
[AB] margin, and in both cases we found that the strength of the association could
provide an indication of the quality of fit. Model 4, like Model 2 before it, does include
this margin and, just like with Model 2, the strength of the [AB] association turns out
to have no relationship with the goodness-of-fit.
[Figure 10.27 maps ∆ by local authority (0–25% in 5% bands) alongside the mosaic plots of the Model 4 estimates and SAM layers for Wolverhampton and Tower Hamlets.]
Figure 10.27: Family type by Number of employed adults in the household
Geographic variation of model fit
The geographic variation of model fit under Model 4 is the pure three-way interaction.
With all the two-way margins fitted correctly, the level of geographical variation
investigated in this section is a measure of the local relationship between the variables
and how it differs from the national one.
The table with the highest level of variation of fit, as measured by the standard
deviation of the percent misclassified, is Family type by Number of employed adults in
the household. The ∆ values for the individual layers in this table ranged from 0.84%
in Wolverhampton to 22.19% in Tower Hamlets. The map is shown in Figure 10.27
alongside the mosaic plots for the two most extreme local authorities.
In order to investigate these tables in more depth, we need to revisit odds ratios, or
more precisely, the ratios of odds ratios (see Section 6.4). Recall that a missing third
order interaction ($\tau^{ABC}_{ijk} = 1$) is equivalent to the second order odds ratios being equal
to one. This means the first order odds ratios – the odds ratios in each layer – are the
same across all layers of the cube. Since each layer in this cube is a 5 × 4 table there
are consequently 4 × 3 odds ratios, but we will concentrate on just one of them here.
The four cells in question are highlighted in the mosaic plots in Figure 10.27 and
their values in the various tables are given in Figure 10.28. They refer to married and
cohabiting persons, with or without children, that have either no employed adults in
the household or one earner in the household. The top table in the figure describes this
National SAM                         0 earners    1 earner
Married/cohab., no children          232,857      113,354
Married/cohab., with children        79,681       305,596       θAB = 7.88

Wolverhampton SAM                    0 earners    1 earner
Married/cohab., no children          1,121        434
Married/cohab., with children        532          1,457          θAB|C=63 = 7.07

Wolverhampton Model 4                0 earners    1 earner
Married/cohab., no children          1,135        423
Married/cohab., with children        496          1,496          estimated θAB|C=63 = 8.08

Tower Hamlets SAM                    0 earners    1 earner
Married/cohab., no children          333          292
Married/cohab., with children        1,246        1,507          θAB|C=29 = 1.38

Tower Hamlets Model 4                0 earners    1 earner
Married/cohab., no children          730          320
Married/cohab., with children        388          1,377          estimated θAB|C=29 = 8.08

Figure 10.28: Odds ratio analysis under high geographic variation
relationship at the national level. The odds ratio for these four cells can be calculated
as follows:17

$$\theta^{AB} = \frac{x_{22+}}{x_{21+}} \Big/ \frac{x_{12+}}{x_{11+}} = \frac{305{,}596}{79{,}681} \Big/ \frac{113{,}354}{232{,}857} = 3.84 / 0.49 = 7.88$$
This equation can be broken down step by step to mean: (i) a couple with children
are almost four times more likely to have one employed adult in the household than no
earners (odds of 3.84:1) and (ii) a couple without children are only about half as likely
to have one earner as none (odds of 0.49:1), which is equivalent to saying they are
17 The odds ratio can of course be interpreted in a few ways, depending on what we choose as the
reference categories. The option chosen here is deemed most reasonable given the data, but should of
course not be interpreted as imparting any causal directionality on the data.
about twice as likely not to have an earner. Therefore the odds of a couple having
one earner in the household are 7.88 times larger if they have children than if they do
not. Or to put it another way: a couple with children is almost eight times less likely
to have no earners in the household compared to a childless couple.
We can see from Figure 10.28 that a very similar relationship is true in Wolverhampton,
with θAB|C=63 = 7.07. This means that regardless of the proportions of Wulfrunian
couples that have children or don't, or the relative numbers of earners, the pure relationship
between the two variables is pretty similar to the national one: couples with
children are more than 7 times more likely to have one earner in the household compared
to couples without children. Tower Hamlets on the other hand shows a dramatically
different picture. The conditional odds ratio there is:
$$\theta^{AB|C=29} = \frac{1{,}507}{1{,}246} \Big/ \frac{292}{333} = 1.21 / 0.88 = 1.38$$
This means that in Tower Hamlets a couple with children is slightly more likely to have one
earner (1.21:1) and a couple without children is slightly less likely to have one earner
(0.88:1). The overall odds of having one employed adult in the household are then 1.38
times higher if the couple has children compared to childless couples.
A model with no third order interaction, such as Model 4, means that by definition
the second order odds ratios, or the ratios of odds ratios, are equal to one. This means
the conditional odds ratios θAB|C=k are all equal.18 So while the observed data exhibit
a second-order odds ratio of 5.13, the second-order odds ratio of the IPF estimate is 1:
$$\theta^{AB|C}_{63,29} = \frac{\theta^{AB|C=63}}{\theta^{AB|C=29}} = \frac{7.07}{1.38} = 5.13 \qquad\qquad \hat{\theta}^{AB|C}_{63,29} = \frac{\hat{\theta}^{AB|C=63}}{\hat{\theta}^{AB|C=29}} = \frac{8.08}{8.08} = 1$$
because the estimated odds ratios in both LAs are the same, at 8.08. Note that the
conditional odds ratios are all equal amongst themselves, but are not necessarily equal
to the national odds ratio (see the section on Simpson's paradox), although they are quite
close in this case. With the proviso that we only looked closely at one of the odds ratios,
we can still see clearly how the [AB|C] relationship in Wolverhampton is much closer
to the estimated $[\widehat{AB|C}]$ than that in Tower Hamlets. Since all the lower level interactions are
fixed, i.e. all the 2-dimensional and 1-dimensional margins are correct, this is the only
source of the error.
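These calculations are easy to reproduce; the following R sketch uses the 2 × 2 cell counts from Figure 10.28, but the helper name odds_ratio and the matrix objects are illustrative only and not taken from the thesis code.

# A minimal sketch using the 2 x 2 sub-tables from Figure 10.28.
odds_ratio <- function(m) (m[2, 2] / m[2, 1]) / (m[1, 2] / m[1, 1])

wolverhampton <- matrix(c(1121, 532, 434, 1457), nrow = 2)   # SAM counts
tower_hamlets <- matrix(c(333, 1246, 292, 1507), nrow = 2)   # SAM counts

or_w <- odds_ratio(wolverhampton)   # about 7.07
or_t <- odds_ratio(tower_hamlets)   # about 1.38
or_w / or_t                         # second-order odds ratio, about 5.13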
Since Model 4 is hierarchically just above Model 3, the only difference being the addition
of the [AB] margin, we can also take a look at the table where this addition most
dramatically decreased the geographic variation of the errors. This is the crosstabulation
of Housing indicator by Central heating. Under Model 3 this table ranked as
18 This applies also to the other conditional odds ratios θBC|A=i and θAC|B=j. For ease of interpretation
we only focus on geographical layers, i.e. odds ratios conditional on C.
                           Not overcrowded      Overcrowded or       Not in
                           or lacking           lacking              household
                           amenities            amenities

National SAM
  Cent. heating            2,189,593            199,917              194
  No cent. heating         0                    184,639              26
  NA - comm. est.          0                    0                    47,191

Liverpool SAM
  Cent. heating            14,595               1,502                2
  No cent. heating         0                    5,283                0
  NA - comm. est.          0                    0                    665

Model 3
  Cent. heating            10,657               4,955                487
  No cent. heating         3,497                1,626                160
  NA - comm. est.          440                  205                  20

Model 4
  Cent. heating            14,595               1,502.7              1.3
  No cent. heating         0                    5,282.3              0.7
  NA - comm. est.          0                    0                    665

Figure 10.29: Housing indicator by Central heating best improvement of fit
having the fourth largest geographic variation. Under Model 4, however, it ranks as having
the 80th lowest level of geographic variation of error. In fact, if we discount the 77
tables which are perfectly correct due to their internal structure and consequently have
no geographic variation, this table has the third lowest level of geographic variation of
error.
Under Model 3 this table's ∆ values ranged from 3.41% in Harlow to 37.37% in
Liverpool. Under Model 4 the range is from 0.000017% to 0.042%, which translates to
a few misclassified people at most. As can be seen from Figure 10.29, the dramatic
improvement in fit is due to the presence of structural zeros. Model 3, with A and
B independent, could not account for the impossible combinations of variable values
and led to almost 40% misclassified in the Liverpool table. With the three structural
zeros in Model 4 corrected, the Liverpool table only misclassifies 1.4 people. In fact it
seems the two cells in the final column might be the result of imputation and should
also be structural zeros. If that were the case Model 4 would have estimated the table
perfectly.
This second example reinforces the observation from the previous section that the
greatest improvements in fit, as well as specifically the greatest reductions in geographic
variation of error, were the result of a very particular class of table: one where the
pattern of structural zeros allows a single unique estimate that is invariably correct.
Similarly, near-perfect measures of fit are achieved in tables that almost have this
structure but for a few small cells. These may be judged as very good fits; however, that
depends on the nature of the small cells: they may be simply the result of an error or
imputation, in which case the error could be dismissed as trivial. They could also be
important despite accounting for only a small proportion of the total table, as we saw
in the Migration origin by Distance moved table, and result in a small yet non-trivial
error.
10.5 Summary of results
One aspect of reviewing all of the models presented in the previous sections would
be to attempt to summarize which factors are the most important to include in the
models. Is it generally the [AB] interaction that is most crucial, or is it one of the side
margins, e.g. [AC]? Alas, there can be no straightforward answer to this question. This
is in fact a difficult question even within one table, and in order to see why we need to
look at the hierarchy of the log-linear models again.
There are several issues with determining the importance of a single factor: (i)
a factor's effect varies depending on the model it is in, (ii) a factor's effect varies
depending on the metric used, and (iii) this is normally attempted within the framework
of model selection, where the significance of each factor's contribution is usually the
threshold determining its importance. Each of these issues is considered in turn. Figure
10.30 reproduces the schematic overview of the models presented at the beginning of
the chapter for easy reference. Models 1 and 2 and Models 3 and 4 differ by the
inclusion of the [AB] interaction. This does not, however, mean that the improvement
caused by this factor is the same in both cases. A quick example to prove the point
is given in Table 10.7, which summarizes the goodness-of-fit statistics for the table
crosstabulating Accommodation type by Age under all four models. The differences
M1 − M2 and M3 − M4 correspond to the addition of the [AB] margin to the model
and it is clear that although there is improvement in both cases, its size is not the same
when added to Model 1 as it is when added to Model 3.
19 Because we are using the Z-scores of the Cressie and Read statistics, we cannot subtract the values
directly to compare two models. Instead we subtract the values of the Cressie and Read statistics
and then find the Z-score of the difference, using the appropriate degrees of freedom. For example
CR2(M1) = 940,810 with 3 × 12 × 372 = 13,392 degrees of freedom and CR2(M2) = 698,681 with
3 × 12 × 372 − 3 × 12 = 13,356 df. The difference between Models 1 and 2 is therefore CR2(M1 − M2) =
242,128 with 36 degrees of freedom, which is standardized to ZCR2(M1 − M2) = 227.60 using the Wilson-Hilferty
normal approximation. These are the correct values reported in the table, instead of the simplistic
(and wrong) ZCR2(M1 − M2) = ZCR2(M1) − ZCR2(M2) = 191.42. Note, though, that the differences
reported here are differences in goodness-of-fit between models, which is technically not the same as
the goodness-of-fit of the differences between the models. This would require additivity, which is
discussed further on.
Fully saturated model:         [ABC]
No three-factor effect:        [AB][BC][AC]  (Model 4)
Two two-factor effects:        [AB][AC]   [AB][BC]   [AC][BC]  (Model 3 = [AC][BC])
One two-factor effect:         [AB][C]   [AC][B]   [BC][A]  (Model 2 = [AB][C])
Main effects (independence):   [A][B][C]  (Model 1)

Figure 10.30: Hierarchy of all possible three-dimensional models (adapted from Wickens, 1989, p. 67)
Table 10.7: Goodness-of-fit for Accommodation type by Age under all four models19

        Model 1    Model 2    M1 − M2    Model 3    Model 4    M3 − M4
∆       18.52      16.51      2.01       7.66       3.93       3.73
ZCR2    781.35     679.93     227.60     392.53     137.48     206.71
This means the effect of a factor changes depending on the other factors included in
the model, which means there is no single value (regardless of the measure) associated
with the factor's effect. One possibility is to calculate all the models which differ by
a specific effect. In our example, if we wanted to see the effect of [AB] we would need
to also look at the difference in fit between models [AC][B] and [AB][AC], as well as
[BC][A] and [AB][BC], in addition to the M1 − M2 and M3 − M4 differences already
considered. All four pairs of models differ only by [AB], but for larger tables this
number would quickly rise and become unmanageable.
Brown (1976) suggests instead that only two effects be calculated for each factor, what
he terms the marginal and the partial association, and these two values then act as
an ad hoc proxy for the bounds of the change in the goodness-of-fit introduced by a
factor. According to these definitions, the marginal test for the [AB] factor compares
[A][B][C] with [AB][C] (the targeted factor is the only higher order factor in the model)
and the partial test for [AB] compares [AB][BC][AC] with [BC][AC] (the targeted
factor is the only higher order factor missing from the model). These two comparisons
then give a reasonable estimate of the effect [AB] has on average, a process Brown calls
screening (cf. Upton, 1978, p. 88-90).
Before we give an example two other issues need to be addressed, the first being the
choice of metric. Throughout this chapter we have used the proportion misclassified
(∆) and the Z-score of the Cressie-Read statistic (ZCR2) as parallel and complementary
measures of goodness-of-fit. In tandem they have proved useful in highlighting
different aspects of model fit across tables of different sizes and structural properties.
However, when comparing the effects of factors within one table it is useful to use a
measure that has the property of additivity. The likelihood ratio statistic, or G2, is
equivalent to the power divergence statistic with λ = 0 (see Equation [8.41], page 108)
and is also distributed asymptotically as χ2 with the appropriate df:
$$G^2 = 2 \sum_i x_i \log\!\left(\frac{x_i}{\hat{x}_i}\right) \qquad [10.5]$$
The likelihood ratio statistic has the property of additivity. This means that if two
models are hierarchically related so that M1 is nested within M2 (the margins included
in M1 are a subset of the margins in M2) then the following relationship holds:

$$G^2(M_1) = G^2(M_1 - M_2) + G^2(M_2) \qquad [10.6]$$
which means the goodness-of-fit of M1, i.e. its deviance from the fully saturated model,
can be broken down into the deviance between M1 and M2 plus the deviance of M2
from the fully saturated model. This allows the precise partitioning of models in order,
e.g., to investigate the importance of G2(M1 − M2), which would in our example be
the [AB] margin. Crucially, because of the additivity property, G2(M1 − M2) has the
expected degrees of freedom: df(M1) − df(M2) (Bishop et al., 1975; Ku & Kullback,
1974).
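This partitioning can be illustrated with a short R sketch; the array name tab and the helper G2 are illustrative only, and base R's loglin() (which also returns the statistic as its lrt component) is used here to obtain the fitted values of two nested models.

# A minimal sketch of the additivity in [10.6], assuming a hypothetical
# I x J x K array of counts called 'tab'; all names are illustrative.
G2 <- function(obs, fit) 2 * sum(obs * log(obs / fit), na.rm = TRUE)

fit_m1 <- loglin(tab, margin = list(1, 2, 3),    fit = TRUE, print = FALSE)$fit  # [A][B][C]
fit_m2 <- loglin(tab, margin = list(c(1, 2), 3), fit = TRUE, print = FALSE)$fit  # [AB][C]

g2_m1   <- G2(tab, fit_m1)        # deviance of Model 1 from the saturated model
g2_m2   <- G2(tab, fit_m2)        # deviance of Model 2 from the saturated model
g2_diff <- G2(fit_m2, fit_m1)     # deviance between the two nested models
# additivity: g2_m1 should equal g2_diff + g2_m2 (up to IPF convergence error)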
An example of this procedure is shown in Table 10.8, again for the Accommodation
type by Age table. The first column lists the hierarchically nested models (following the
left-most path in Figure 10.30) and the next two columns give each model's G2 and its
respective degrees of freedom. The final three columns display the differences between
the models: which factor they differ by, what that factor's G2 is and the correct df. We
can see, e.g., that the margin [AC], which has 1,116 degrees of freedom (3 × 372), reduced
the G2 value by 521,391 when we move from model [AB][C] to model [AB][AC].20
20 As was noted above, this is only the strength of the [AC] effect in this particular context. If we
instead followed the model hierarchy along the right-most path of Figure 10.30 we would find the [AC]
margin as the difference between models [BC][A] and [AC][BC], where its G2 value would be
506,690 (with the same number of df).
Table 10.8: A set of hierarchical models and their G2 values

Model                  G2         df        Factor    ∆G2        df
5. [ABC]               0          0         [ABC]     51,465     13,392
4. [AB][BC][AC]        51,465     13,392    [BC]      51,442     4,464
3. [AB][AC]            102,907    17,856    [AC]      521,391    1,116
2. [AB][C]             624,298    18,972    [AB]      157,338    36
1. [A][B][C]           781,636    19,008
This brings us to the third issue with attempting to estimate the strength of individual
factors: we are dealing with census data and significance testing is quite useless
here. The p-value for the [AC] margin mentioned above is 0 (or rather it is smaller than
2.22 × 10⁻¹⁶). So are the p-values for all the other factors. In a classical significance
testing application we would be able to compute the p-values for each of the factors
and deliberate on their relative importance in choosing whether or not to include them
in our model. In our situation, however, p-values are useless, so we must again resort
to Z-scores.
We can now apply all of this to the Accommodation type by Age table: first we find
the model pairs for the marginal and partial association for each of the factors; then
we calculate the G2 for each of the models and, using the additivity property, calculate
the ∆G2 for each of the factors; and finally, using the Wilson-Hilferty approximation, we
normalize the G2 values with the correct degrees of freedom to get a measure of
the relative importance of each of the factors. The results are plotted in Figure 10.31.
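A rough R sketch of this screening procedure for the [AB] factor is given below; the helper names G2, fit_of and z_wh and the array tab are illustrative only and not taken from the thesis code, and the marginal and partial model pairs follow the definitions given above.

# A minimal sketch of Brown's screening for the [AB] factor; 'tab' is a
# hypothetical I x J x K array of counts and all names are illustrative.
G2     <- function(obs, fit) 2 * sum(obs * log(obs / fit), na.rm = TRUE)
fit_of <- function(margins) loglin(tab, margin = margins, fit = TRUE, print = FALSE)$fit
z_wh   <- function(stat, df) ((stat / df)^(1/3) - (1 - 2 / (9 * df))) / sqrt(2 / (9 * df))

# marginal test for [AB]: compare [A][B][C] with [AB][C]
marg_AB <- G2(tab, fit_of(list(1, 2, 3))) - G2(tab, fit_of(list(c(1, 2), 3)))
# partial test for [AB]: compare [AC][BC] with [AB][AC][BC]
part_AB <- G2(tab, fit_of(list(c(1, 3), c(2, 3)))) -
           G2(tab, fit_of(list(c(1, 2), c(1, 3), c(2, 3))))

df_AB <- (dim(tab)[1] - 1) * (dim(tab)[2] - 1)
c(marginal = z_wh(marg_AB, df_AB), partial = z_wh(part_AB, df_AB))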
Note first of all that the marginal and partial associations only give different results
for the two-way margins.21 Furthermore, the differences are not large, so in order to
gauge the relative magnitude of a factor, taking the average of both values is just as
informative.22 Of the three single 'edge' margins the geographical one is the most important
([C]), followed by Accommodation type ([A]) and lastly Age ([B]). Of the three 'face'
margins it is Accommodation type by local authority ([AC]) that stands out as the
most important, whereas the geographic variation of the age structure ([BC]) is about as
important as the aggregate (national) relationship between accommodation type and
21 This is true by definition: for all main effects and for the highest effect the partial and marginal
tests are the same (Brown, 1976, p.40). In the three dimensional case this means only the two-way
interactions differ.
22 This is also in keeping with the original intent of the screening procedure: using both marginal and
partial tests on a factor is meant as a safeguard in case a factor were not significant under one model,
which might lead the researcher to accidentally discard it (Brown, 1976, p.42). Taking the average of
both values is exactly that: the average strength of a factor, which is what we want.
[Figure 10.31 is a bar chart of the relative effect of each factor: [A], [B], [C], [AB], [AC], [BC] and [ABC].]
Figure 10.31: Relative factor effects for Accommodation type by Age (grey bars – partial
association, white bars – marginal association)
[Figure 10.32 overlays the corresponding bar charts for all 56 tables, with factors [A], [B], [C], [AB], [AC], [BC] and [ABC] on the horizontal axis.]
Figure 10.32: Relative factor effects in 56 tables where one of the variables is A =
Accommodation type24
age. Finally the geographic variation of the relationship between accommodation type
and age ([ABC ]) is overall the least important of the seven factors, although it is still
quite high.
While the screening method for measuring the relative magnitudes of the factors in
a table works well on a single table, it is more difficult to summarize the results of all
1596 tables in our dataset. We can however try to plot a similar graph for a sub-group
of those tables, namely all the tables where the A variable is Accommodation type.
Since there are 56 other variables in the dataset, this provides a useful summary of 56
tables, all of which share the [A], [C] and [AC] margins and differ by the remaining
margins that all involve a different B. These results are all plotted in Figure 10.32 with
the original crosstabulation with Age highlighted in red for easy reference.
24 This chart consciously diverges from the standard practice of using line charts for continuous data. Due to the large number of data series (56) a bar chart, which would have been appropriate for this type of data, is unfortunately unworkable. The lines connecting the data points are therefore not meant to imply any continuity; they merely reflect the fact that a particular set of points belongs to the same table.
As was noted all these tables have Accommodation type and the local authority in
common, hence the magnitude of these effects is the same for all 56 tables. As far
as the single margins are concerned it looks like over half of the other variables ([B])
are more important than Accommodation type, although the geography ([C]) is almost
always more important than either. Looking at the two-way margins we observe that
in every single case the [AB] margin is less important than the geographical variation
of accommodation type ([AC]). In a few cases the geographical variation of the other
variable is even more important, but generally it seems [AC] is the most important of
the two-factor effects. Finally the three-factor effect is generally (although not always)
the smallest and we can see the Accommodation type by Age by LA was in fact the
second largest one (red line).25 Overall it is the geography ([C]) and the geographical
distribution ([AC]) of Accommodation type which are the (equally) most important
factors, except in the case of a small number of B variables, which have even higher
values (see below).
This figure does a good job summarizing the relative effect of a single variable -
Accommodation type - and its crosstabulations with each of the 56 other variables in
the dataset. In order to give a succinct description of all 1596 tables in the dataset this
information needs to be reduced even further, rather than plotting another 56 graphs
– the full results of the following analysis are reproduced in Appendix E. First of all
we can look at the relative importance of the single variable margin ([A]). In the case
of Accommodation type above, its single factor effect is 318 (this and all other factor
effects are expressed as Z-scores of the likelihood ratio statistic). This can be compared
to an average single factor effect of 353 and the effect of the geography which is 494.
Table 10.9 lists the top variables with the strongest single factor effects along with the
variables with the weakest single factor effects.
Note that these represent single factor effects i.e. the ranking reflects only the
importance of the univariate distribution of the variables and is completely independent
of their relationship with other variables or with their geographic variability. The results
are not surprising: the strongest effects come from variables with many categories and
very unequal distributions. The dramatically weakest effect is produced by Sex with
only two categories and almost perfect uniformity. In a similar vein the other lowest
ranked variables have fewer categories, all of which have significant numbers of people
in them i.e. there are no particularly small values in their margins. As has already been
mentioned, the geography itself has an effect of 494, which is quite strong: in addition
to the five variables mentioned above, only Migration indicator has a stronger effect,
while the remaining 51 variables are weaker.

25 In five cases the magnitude of the [ABC] factor is negative - an artefact of the normalization procedure which effectively means the factor has zero effect.

Table 10.9: Top and bottom five variables by strength of single factor effect

       Strongest effect                           Weakest effect
   Variable                ZG2              Variable                   ZG2
1. Region of origin       741.33        57. Sex                        24.38
2. Ethnic group           695.43        56. Dependent children        154.36
3. Religion               569.23        55. Supervisor/foreman        156.99
4. Distance moved         519.83        54. Marital status            181.19
5. Workplace              517.94        53. Social grade of HRP (a)   196.17
   (a) Household representative person

Table 10.10: Top and bottom seven variables ranked by number of tables where [AB] > [AC]

       [AB] weaker than [AC]                   [AB] stronger than [AC]
   Variable                      No.       Variable                       No.
1= Accommodation type              0       57. Household headship          54
1= Cars/Vans owned                 0       56. Sex                         51
1= Country of birth                0       55. Comm. est. type             48
1= Ethnic group                    0       54. Relationship to HRP (a)     45
1= Lowest floor                    0       53. Generation indicator        45
1= Region of origin                0       52. Age                         43
1= Tenure of accommodation         0       51. Care provided hpw           42
   (a) Household Reference Person
Summarizing the strength of the [AB] effect across all variables is even trickier. One
practical way of doing this is comparing it with the [AC] effect. In the example above
(Figure 10.32) we can therefore say that the geographic distribution of Accommodation
type is more important than any of the associations Accommodation type can be in -
[AC] is larger than all 56 values of [AB]. There are in fact seven such variables: their
geographic variation invariably has a stronger effect than whatever [AB] crosstabulation
they appear in. These seven variables are listed in the left column of Table 10.10.

At the other extreme there is no variable in our dataset that would always crosstabulate
to a stronger effect than its geographical variability. Household headship is the
top of this list with 54 of its [AB] interactions stronger than its [AC] association. Table
10.10 lists the other top ranking variables on the right. But we can also see the tables
where Sex is one of the variables predominantly have [AB] effects stronger than [AC]
effects, which is not surprising as we know gender ratios will not vary significantly across
the country. To a certain degree then, these results are an artefact of the strength of
the variable's geographic variation: the larger [AC] is, the less likely an [AB] will be
stronger, and vice versa.

This is easily confirmed by ranking the variables by the strength of their [AC] effect
– their level of geographical variation – which is done in Table 10.11. The tables where
the [AC] effect is strongest are well known by now. As expected the weakest geographic
effect is found with the variable Sex, followed by Communal establishment type and
Schoolchild/student. Not surprisingly, this ranking correlates quite strongly with the
ranking presented above in Table 10.10 (r2 = 0.60).

Table 10.11: Top and bottom five variables by strength of geographic variation ([AC])

       Strongest effect                           Weakest effect
   Variable                ZG2              Variable                     ZG2
1. Region of origin       688.57        57. Sex                          10.38
2. Country of birth       608.16        56. Comm. est. type              49.76
3. Ethnic group           580.13        55. Schoolchild/student          78.30
4. Accommodation type     479.03        54. Care provided hpw            88.04
5. Religion               450.60        53. Status in comm. est.        108.91
Finally we can summarise the relative effect of the [AB] factors by ranking the
variables by their mean [AB] effect. However this ranking is a lot less informative than
one might think, simply because the ranges tend to be so large. Instead of producing a
table of the top and bottom ranking variables by the mean [AB], these same ten
variables are plotted in Figure 10.33, each of them with the respective set of 56 [AB]
effects. This gives a better feel for how relevant these differences actually are.
Although the differences in the means can be quite dramatic, the large ranges of
values make the mean a less useful indicator. For example Distance moved, with the
third smallest mean ZG2 for its [AB] effect (120), has most of its [AB] effects quite
concentrated, but for two extreme outliers when it is crosstabulated with Marital
status (ZG2 = 464) and Migration indicator (ZG2 = 536).
It should also be noted that this [AB] effect ranking is completely independent from
the [AC] effect ranking above. This is particularly obvious if we remember Sex had the
lowest geographic variation, while Country of birth had the second highest one (Table
10.11), yet both have some of the weakest [AB] effects in the dataset. This is another
reminder of the principle of variational independence that underlies the log-linear
partitioning of these tables into component factor effects.
Figure 10.33: All 56 [AB] effects for the top and bottom five variables as ranked by
their mean [AB] (x-axis: ZG2; variables plotted: Sex, Comm. est. type, Distance moved,
Country of birth, Care provided, Last worked, Generation ind., FRP NS-SEC,
Relationship to HRP, Age)
Chapter 11
IPF and the Error of Inaccurate Constraints
David W. S. Wong’s 1992 article on The Reliability of Using the Iterative Proportional
Fitting Procedure published in the Professional Geographer is an exemplary piece
of analysis, systematically evaluating the performance of IPF under various sampling
scenarios. This chapter follows in that tradition but aims to extend that analysis in
two respects: by framing the question of IPF error stemming from sampled data into
a geographical setting, and addressing the issue of sampling zeros – both issues not
covered by Wong.
This chapter proceeds naturally from our analysis in the previous chapter, more
specifically of Model 4 in Section 10.4 which is the best standard hierarchical model
short of the fully saturated model. In this chapter we use samples of the second-order
interaction to improve these estimates further. In order to analyse the effectiveness
of this sampling we first need to deal with the issue of sampling zeros. Only then
can we assess the quality of the IPF estimates when supplemented by various sample
sizes. The final section of this chapter compares these results with what is potentially
a more likely information source: samples from higher aggregations i.e. regional and
geodemographic. This allows us to compare the two aggregation types directly as well
as compare their performance against the benchmark of Model 4.
11.1 Sampling Zeros
This section addresses the issue of sampling zeros - cells in the constraints that are
empty because of the sampling. IPF will, by definition, keep empty cells empty i.e. a
zero value in a cell or in any margin will remain zero in the final estimate. This has two
important consequences. The direct consequence is that under certain circumstances,
the number of empty cells and their configuration may prevent convergence of the
algorithm. The indirect consequence is that even if convergence is achieved we end up
with empty cells when we know (or suspect) that there should be some, if only a few,
cases in those cells.
The solution that is normally proposed is to add a small number to the empty cells
to make sure convergence occurs but at the same time keep any distortion to the data
to a minimum. Simpson & Tranmer (2005) for example suggest adding 0.001 to empty
cells.1 The flip side of the sampling zero problem is the issue of structural zeros. These
are cells that have to stay empty - variable combinations that are impossible such as
married and aged 0-4. The problem is that IPF is indifferent to whether a cell is
structurally zero or is empty purely by chance resulting from the sampling procedure.
Regardless of their origin, IPF will force zeros to remain zeros. If they are all structural,
that is not a problem and there will be no convergence issues. Sampling zeros however
can make it impossible for the algorithm to converge, since they can create a formation
that makes the table logically inconsistent. The simplest example is a sample that has
all cells in one category empty while that category has some positive value in the
margin. Both constraints are incompatible: empty cells cannot add up to the desired
margin.2 This becomes more and more likely the more zeros there are: the smaller the
sample size, the larger the number of table cells and/or the more unevenly distributed
the variables are.
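To make the mechanics concrete, the sketch below implements a plain three-dimensional IPF in Python; the function, the 2×2×2 figures and the margins are invented for illustration and are not taken from the SARs. It shows how a sampling zero that conflicts with a positive margin stops the margins from ever being matched, and how replacing the zeros with a small positive value restores convergence.

```python
import numpy as np

def ipf_3d(seed, ab, ac, bc, tol=1e-8, max_iter=1000):
    """Fit a three-way table to the two-way margins [AB], [AC] and [BC].
    Cells that are zero in the seed stay zero throughout."""
    x = seed.astype(float).copy()
    for _ in range(max_iter):
        for axes, target in (((0, 1), ab), ((0, 2), ac), ((1, 2), bc)):
            sum_axis = ({0, 1, 2} - set(axes)).pop()
            current = x.sum(axis=sum_axis)
            # scale each slice towards its target margin; leave all-zero slices alone
            ratio = np.divide(target, current, out=np.zeros_like(current), where=current > 0)
            x *= np.expand_dims(ratio, axis=sum_axis)
        err = max(np.abs(x.sum(axis=2) - ab).max(),
                  np.abs(x.sum(axis=1) - ac).max(),
                  np.abs(x.sum(axis=0) - bc).max())
        if err < tol:
            return x, True          # all three margins matched
    return x, False                 # margins could not be matched within max_iter

# Invented 2 x 2 x 2 table standing in for a full [ABC] cross-tabulation
true = np.array([[[20., 10.], [15.,  5.]],
                 [[30., 25.], [10., 40.]]])
ab, ac, bc = true.sum(axis=2), true.sum(axis=1), true.sum(axis=0)

sample = true.copy()
sample[0, :, 1] = 0   # a sampling zero: category A=0 entirely missed in area k=1
print(ipf_3d(sample, ab, ac, bc)[1])                               # False: zeros clash with a positive margin
print(ipf_3d(np.where(sample == 0, 1.0, sample), ab, ac, bc)[1])   # True once the empty cells get a small value
```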
In David Wong’s analysis the issue of sampling zeros is not addressed, with pre-
dictable consequences: all tables converge when the variable distribution is uniform
(equal-size categorization scheme), when the variables are skewed however (equal in-
terval categorization) over 40 percent of the tables do not converge (Wong, 1992, Table
3, p. 345). These are the tables with the smallest sample size and the ones with the
largest number of cells.3 But if a small value is to be added to each cell to ensure
convergence, the question that first needs to be answered is whether or not it matters
what this value is and if so, what is the most appropriate value to add?
We take as our starting point one of the simpler tables in our dataset: Gender by
Student status, the only 2 by 2 table it contains, and sample from the 373 local
authorities. This means a table with 2×2×373 = 1,492 cells, from which we take
a sample of 10,000 people. On average our samples have just under 100 empty cells.
Repeatedly sampling from the full table we then use IPF to adjust the sample to the
correct 2-dimensional constraints. This is similar to Model 4 from the previous chapter,
with the addition of the three-dimensional sample4: from [AB], [AC], [BC] and [ABC]′
to $[\widehat{ABC}]$. If we ignore the zeros, the following happens: whenever the sample does
not include any students in a particular LA the procedure does not converge. The
same happens whenever the sampling misses any non-students, males or females in
any LA. Because the [AB] and [AC] margins have positive sums for numbers of males,
females, students and non-students in each LA, the empty cells for those entries cannot
be reconciled with the margins. At this sample size, this is in fact a near certainty i.e.
the procedure never converges if we keep the zeros untouched.

1 They also suggest setting empty cells in the margins to 0.1, however the reason for this is a peculiarity of the GENLOG command in SPSS, so technically this is a different issue. In their example they have to add 0.1 to these marginal zeros even if they are structural. Our algorithm (as well as HILOGLINEAR in SPSS) handles zeros in the margins correctly as structural, so this is not a problem.
2 Other more complex ways of tables failing to converge are possible and are discussed in Bishop et al. (1975).
3 The goodness-of-fit statistics for these tables are reported nevertheless, although their relevance is questionable given that they result from tables that are logically inconsistent.
4 We use the prime symbol to denote the configuration that is sampled.

Figure 11.1: Goodness-of-fit after substitution of zeros in [ABC]′ sample N=10,000 in Gender by Student by LA table (the y axis, the constant added, is logarithmic and runs from 0.00005 to 1000; the x axes show ∆ and ZCR2)
Replacing the zeros instead with some positive value we can then repeatedly run the
IPF procedure and assess how various values affect fit. This was done using a range of
values from 0.00005 to 1000 and the results are plotted in Figure 11.1 for both percent
misclassified (∆) and the Cressie and Read statistic (ZCR2). The results are based on
100 samples; each one had its zeros replaced with the values plotted on the y-axis
and IPF then inflated the sample back to its original population size of 2,621,560. It
is clear from the graphs that the best results, regardless of measure, were reached when
1 was added to the empty cells. This resulted on average (the red line indicates the
median values) in 5.4 percent misclassified or a ZCR2 value of just over 194. The value
suggested by Simpson and Tranmer (2005) fares considerably worse: when 0.001 is
added (third series from the bottom) this results on average in 7.2 percent misclassified
and a ZCR2 value of 260. That is a considerable loss of accuracy due to the value of the
constant added to the empty cells. In fact adding 0.001 turns out to be just as bad as
adding 1000, the results for which are shown in the top series in both graphs.
We can take a closer look at why this happens: it seems counter-intuitive that
adding a minuscule amount such as 0.001 produces the same level of error as adding
1000 people. The explanation involves the use of ratios of odds ratios or second order
odds ratios that were defined in Section 6.4, the equation for which is repeated here
for convenience:

$$ \theta^{AB|C}_{(i_1i_2)(j_1j_2)|k_1k_2} = \frac{\theta^{AB|C}_{(i_1i_2)(j_1j_2)|k_1}}{\theta^{AB|C}_{(i_1i_2)(j_1j_2)|k_2}} \qquad [6.8] $$
As we saw, using IPF with the second-order interaction - in this case coming from
a sample - means the estimate preserves the ratios of odds ratios that are contained
in the sample. This is true regardless of which variable is given which subscript in the
equation - but we continue with our practice of using geography as the third (k) variable,
if only because it makes interpretations more comfortable. This means that the sample
odds ratio in one LA will not be kept after IPF, rather it is the ratio of the odds ratios
of two LAs that will be preserved (remembering that the prime symbol indicates the
sample):

$$ \frac{\theta'^{AB|C=k_1}}{\theta'^{AB|C=k_2}} = \frac{\theta^{AB|C=k_1}}{\theta^{AB|C=k_2}} \qquad [11.1] $$
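As a quick numerical illustration of equation [11.1], the snippet below reuses the ipf_3d sketch and the invented 2×2×2 table from Section 11.1, plus an arbitrary distorted seed. The layer odds ratios of the IPF estimate differ from those of the seed, but their ratio across the two layers is carried over exactly.

```python
import numpy as np

def odds_ratio(layer):
    """Odds ratio of a 2 x 2 layer of the table."""
    return (layer[0, 0] * layer[1, 1]) / (layer[0, 1] * layer[1, 0])

# same invented 2 x 2 x 2 'true' table as before; 'seed' plays the role of a distorted sample
true = np.array([[[20., 10.], [15.,  5.]],
                 [[30., 25.], [10., 40.]]])
seed = np.array([[[ 4.,  1.], [ 2.,  2.]],
                 [[ 5.,  6.], [ 3.,  4.]]])
ab, ac, bc = true.sum(axis=2), true.sum(axis=1), true.sum(axis=0)

est, _ = ipf_3d(seed, ab, ac, bc)   # ipf_3d as sketched in Section 11.1

ratio_seed = odds_ratio(seed[:, :, 0]) / odds_ratio(seed[:, :, 1])
ratio_est = odds_ratio(est[:, :, 0]) / odds_ratio(est[:, :, 1])
print(round(ratio_seed, 4), round(ratio_est, 4))   # both 3.6: the second order odds ratio survives IPF
```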
To get a clearer idea of what is going on and why this is important when it comes
to adding a 'small' constant to sampling zeros, we take a closer look at one of the local
authorities in one of the samples plotted in Figure 11.1, one that contained a sampling
zero. The Isle of Anglesey is an example of such an LA, one where the sample did not
capture any male students. The original SAM data and the sample are both plotted
at the top of the left hand side of Figure 11.2. For any second order odds ratio we
need to select another LA to compare the odds ratios with - the choice is immaterial,
although it is convenient if it does not also have any sampling zeros.5 We therefore
choose Liverpool for this example, and the mosaic plots for its SAM data and the
sample are plotted on the right-hand side of the figure. As is clear even from a cursory
visual inspection, the odds ratios for both LAs are very close to unity: you're pretty
much just as likely to be a student or schoolchild if you are male as if you are female,
and this is true in both LAs. The odds ratios are given just below their respective
plots and come to 1.056 for the Isle of Anglesey and 1.047 for Liverpool. Thus the
second order odds ratio will also be close to one: θ1/θ2 = 1.009, meaning that in this case
the interaction between the gender and student variables is hardly influenced by the
geography.
5 All the second order odds ratios that this one empty cell is involved with will be preserved after IPF - here we are using the ratio of two LA layer odds ratios simply because they are easiest to describe and interpret. But we could use any of 1,116 other pairs of odds ratios that this 'no male students on the Isle of Anglesey' cell is involved in, and the same principle would hold: the sample θ′1/θ′2 is the same as the θ1/θ2 in the IPF estimate.

Figure 11.2: Effect of different constant added to Isle of Anglesey sample (mosaic plots of the SAM data and the sample for the Isle of Anglesey and Liverpool, and of the IPF estimates after adding 0.001, 1 and 1000 to the empty cell, annotated with the corresponding odds ratios, ratios of odds ratios, ∆ and ZCR2 values)

We can see the Liverpool sample is already pretty inaccurate: its odds ratio is half
what it should be: θ′2 = 0.489. But the bigger problem is the empty cell in the Isle
of Anglesey sample. Figure 11.2 displays three possible solutions: first adding 0.001,
then adding 1 and finally adding 1000 to the empty cell in question (all three constants
are marked in red). In the first case the addition of the minimal constant has meant
the sample odds ratio is 0.00028. It is the ratio of the odds ratios that gets preserved
though, θ1/θ2 therefore becomes 0.00057, although the odds ratios themselves also
remain close to the sample ones. This is an incredibly dramatic value: it means that
the odds ratio of being male if you are a student compared to not being one is 0.00057
times lower in the Isle of Anglesey than in Liverpool (or conversely that it is 1,745
times higher in Liverpool). Either way this is not of the magnitude we might reasonably
expect a geographical effect to operate at. The result is that while the Liverpool LA is
estimated with only 6.34% misclassified, Isle of Anglesey has 21.78%.
The effect is similarly extreme at the other end (bottom of the figure), where we
add 1000 as the constant to the empty cell. This makes the sample’s odds ratio swell to
280, and since the Liverpool sample is still the same, the ratio of the odds ratios is now
573. Again the effect of changing local authorities is that the odds ratio increases (or
decreases, depending on your perspective) by a factor of 573, a set up that will clearly
also lead to a lot of error. And indeed again over 20 percent are misclassified in the
Isle of Anglesey and 7.42% in Liverpool. It should be noted that in all of these cases
the numbers of males/females and of students/non-students are all correct after IPF in
all local authorities. The errors only stem from the interaction between the variables
being wrong. And the interaction is very wrong by virtue of the two odds ratios having
to have a ratio of 573 in this case.6
It is clear now how a more moderate constant might have a more desirable effect:
and it turns out that adding 1 to the empty cell, as is done in the middle example,
produces considerably less error. Now the sample odds ratio for Isle of Anglesey is 0.28
- still four times less than it should be - and the ratio of the odds ratios is therefore
0.573. This is finally of the same magnitude as the correct value of 1.009. The Isle of
Anglesey table ends up with only 10.43% and Liverpool with 6.94% misclassified.7
The fact that IPF preserves the odds ratios - or rather the ratios of odds ratios in
this three-dimensional example - which are multiplicative in nature explains how the
addition of a minimal constant can have consequences directly contrary to what was
intended. Of course the example presented takes a very limited view of the effects of
the constant as it focuses on only one empty cell and compares that local authority
with only one other LA. In actual fact even this one empty cell means 1,116 second
order odds ratios get distorted, and the same applies to every other sampling zero that
might occur in such a table.

6 Again, this applies across all pairs of odds ratios, not just the one we are looking at specifically here.
7 Figure 11.2 also presents the errors as measured using the Cressie-Read statistic but these should be regarded cautiously given that ZCR2 is a signed measure (see footnote 14 on page 109). The errors here were calculated for each LA individually.
11.1.1 Selection of constant to add to empty cells
Having made it clear that adding a small constant can have a dramatically large effect,
and that adding a trivial constant can significantly distort the estimate, the question
remains then what is an appropriate constant to add? In the example above in Fig-
ure 11.1 we found the goodness-of-fit statistics were best when the constant was one,
although a more detailed analysis (not shown) found that the results were even better
when the value of 2 was added. But this constant is specific to that particular table: its
size and distribution of empty cells as well as the size of the sample, and it therefore cannot
be directly generalized to the rest of the tables in our dataset. Other suggestions
found in the literature are to use 0.5 as a constant - sometimes only for the empty cells,
sometimes for all of them (e.g. Bishop et al., 1975; Goodman, 1970). Other suggestions
include adding '1/r, where r is the number of categories in the response variable' (Grizzle
et al., 1969), but this makes little sense in symmetrical situations such as ours. Bishop
et al. recommend a pseudo-Bayesian procedure that they find to be superior to adding
0.5 to the cells (1975, ch.12). Without performing an extensive investigation into this
issue, is it possible to find a more general rule of thumb to decide on what constant to
add?
In order to find out we first selected four tabulations that are extreme in some re-
spect: two that have the minimum and maximum numbers of cells (and hence largest
and smallest average cell frequencies respectively) and another two that have the small-
est and the largest standard deviation of cell size (and hence have cell distributions that
are most and least uniform). Table 11.1 lists the four tables used here and the propor-
tion of their cells that are structurally zero. The constants being added do not affect
these cells, but only the cells that become empty as the result of sampling.8 For each
of these tables repeated samples (50) are taken using a series of seven sample sizes
ranging from 5% to 0.005%, which corresponds to a range of just over 131,000 people
in the largest samples to just 131 people in the smallest sample. In each of them the
sampling zeros are then replaced with one of 13 possible constants ranging from 1000
to 0.005. The averages were then taken of the repeated samples in order to smooth out
sampling noise. This results in what is equivalent to the red lines in Figure 11.1.
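In outline, the experiment just described is a triple loop over sample sizes, constants and repeated draws. The Python sketch below is schematic rather than the thesis's actual code: the sample fractions and constants are stand-ins for the real series, percent_misclassified is one common definition of ∆ (half the total absolute error as a share of the population) rather than necessarily the thesis's, and it reuses the toy array true and the ipf_3d sketch from earlier in this section.

```python
import numpy as np

rng = np.random.default_rng(1)

def percent_misclassified(est, ref):
    """One common definition of Delta: half the total absolute error as a share of the population."""
    return 100 * np.abs(est - ref).sum() / (2 * ref.sum())

# 'true' and 'ipf_3d' as in the earlier sketch; fractions and constants are illustrative stand-ins
ab, ac, bc = true.sum(axis=2), true.sum(axis=1), true.sum(axis=0)
sample_fractions = [0.05, 0.01, 0.005]
constants = [10, 1, 0.1, 0.001]
n_repeats = 50

p = (true / true.sum()).ravel()
p = p / p.sum()
results = {}
for frac in sample_fractions:
    n = max(1, int(round(frac * true.sum())))
    for const in constants:
        deltas = []
        for _ in range(n_repeats):
            counts = rng.multinomial(n, p).reshape(true.shape).astype(float)
            seed = np.where(counts == 0, const, counts)   # replace every sampling zero with the constant
            est, _ = ipf_3d(seed, ab, ac, bc)
            deltas.append(percent_misclassified(est, true))
        results[(frac, const)] = np.mean(deltas)          # average over the repeated samples

best = min(results, key=results.get)
print(best, round(results[best], 2))   # the (sample size, constant) pair with the lowest average Delta
```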
The procedure described above was repeated on both the regional and the local
authority tables. Figures 11.3 and 11.4 respectively summarize the results. In each
figure four pairs of line graphs represent the goodness-of-fit results for each of the four
tables that have been selected: the top ones display the ∆ and the bottom the ZCR2
values. Each of the seven lines represents a sample size, the lightest are the largest and
the black is for the smallest samples. The color coding is consistent with the points in
the scatter plots at the top, which give an indication of how many sampling zeros are
on average present at each sample size.9 On each line a red point indicates where the
minimum is reached: which constant being added to the sampling zeros produced the
best fit.10

8 This is because the structural zeros are cells which fall into categories which have empty marginal totals. Therefore if $x_{ij+}$ is empty, all cells in that combination of i and j, from $x_{ij1}$ to $x_{ijK}$, will also be empty by virtue of IPF. This scenario does not anticipate any other structural zeros that would not be identifiable from the margins. This means we can add the constant simply to every empty cell in the sample and after IPF the structural zeros will end up empty and the sampling zeros will not.

Table 11.1: Four tables used to investigate effects of adding different constants

Sex by Student/schoolchild              min no. of cells       GOR: 0% structural zeros
                                                               LA: 0% structural zeros
Ethnicity by Migration origin           max no. of cells       GOR: 26.2% structural zeros
                                                               LA: 75.0% structural zeros
Age by NS-SEC of FRP                    min st.dev of cells    GOR: 0.7% structural zeros
                                                               LA: 12.4% structural zeros
Com.est.type by Status in comm.est.     max st.dev of cells    GOR: 22.2% structural zeros
                                                               LA: 35.74% structural zeros
Ideally we would expect to deduce from the results some indication of how the op-
timum choice of constant varies systematically with regard to sample size, average cell
size, or average proportion of sampling zeros. In fact the graphs show little such system-
atic behaviour but rather display some surprising and even counter-intuitive patterns.
The first table in Figure 11.3 (left-hand side) is perhaps closest to our expectations.
This is the smallest of the GOR tables with 40 cells: gender by student status by 10
regions. The first five samples have no sampling zeros and their percent misclassified
is constant regardless of the constant added, ranging from 0.2% for the 5% sample
(lowest light gray line) to 2.6% for the 0.05% sample. The two smallest samples
(0.01% and 0.005%) have 3.25% and 15.25% sampling zeros respectively (this can be
seen from the top scatter plot). As one would expect, the smaller sample performs
worse with higher ∆ values across the board. The same general pattern holds if we
look at the Cressie-Read statistic below. In both cases adding 2 to the empty cells
produces the best fit, although it is reasonable to expect the result would be different
had we tried more constants.
Compared to this first table however, the other tables seem to behave in a rather
unexpected manner. Most dramatically, the sample size seems to have an almost oppo-
site effect to what one would expect. The same table (i.e. Gender by Student status),
this time across 373 local authorities, is one such case, shown on the left of Figure 11.4.
9 These are proportions of pure sampling zeros i.e. the denominator is the number of cells that are not structurally empty.
10 Where there were no sampling zeros and the performance was the same regardless of the constant - these tables are represented by straight horizontal lines - there is of course no such indicator.
Figure 11.3: Effect of constant added to empty cells for four different GOR tables (shades of grey correspond to sample sizes). The four panels are Sex by Student/schoolchild (I×J×K = 40, mean cell count 65,539, s.d. 45,008), Com.est. type by Status in com.est. (I×J×K = 90, mean 29,128, s.d. 85,767), Age by NS-SEC of FRP (I×J×K = 1,430, mean 1,883, s.d. 2,135) and Ethnicity by Migration origin (I×J×K = 2,380, mean 1,101, s.d. 13,686); each plots percent misclassified and the Z-score of Cressie-Read against the constant added, for sample sizes from 5% down to 0.005%, with a scatter plot of the average proportion of sampling zeros at each sample size above.
Figure 11.4: Effect of constant added to empty cells for four different LA tables (shades of grey correspond to sample sizes). The four panels are Sex by Student/schoolchild (I×J×K = 1,492, mean cell count 1,757, s.d. 1,632), Com.est. type by Status in com.est. (I×J×K = 3,357, mean 781, s.d. 2,637), Age by NS-SEC of FRP (I×J×K = 53,339, mean 49, s.d. 73) and Ethnicity by Migration origin (I×J×K = 88,774, mean 29, s.d. 409); the layout is the same as in Figure 11.3.
Here the two smallest samples (two blackest lines) actually dip down below the largest
sample (light gray line). This means that these two sample sizes of 0.01% and 0.005%
are actually producing a better fit than the 5% sample! But how is this possible? The
answer lies in the sampling zeros: these two samples are so small, that they have 83%
and 91% empty cells respectively (see scatter plot above the line chart). These cells
then get the same constant added, which produces a situation similar to having no
three-way interaction: a uniform prior. And this uninformative seed is actually less
wrong than the 5% sample!
This result is quite dramatic if we consider that Gender by Student status is the
table with the fewest cells: the average cell in the 5% sample has a frequency of 87.85,
and there were no sampling zeros. And yet the information about the relationship
between the two variables and how it varies across the LAs contained in this sample is
so wrong, that it is actually better to use a sample that is almost all zeros such that after
the addition of the constant it effectively becomes a uniform, no three-way interaction
model. Of course the fact that we know these two variables are quite independent and
that there is little geographical variation in them as well goes a long way in explaining
why a no three-way interaction model works better than even a reasonably good sample.
But other tables in our selection exhibit similar behaviour, even when the variables are
far from independent.
This is already an indication of the patterns we can expect to see in the next
section: under a certain sample size it is better to discard the sample and use a no-
interaction model instead - or a sample so small that adding the constant makes it
practically uniform. The underlying logic as to which constant is the best to add is
not resolved conclusively with Figures 11.3 and 11.4. The examples used are clearly
not comprehensive enough to provide a general rule, nor has the expected systematic
behaviour been observed, mainly due to the samples being so extremely sparse. However
the examples do give some very extreme cases in which it seems the minima (red dots)
are disproportionally often found around the value of 1. This is true for both the GOR
and the LA tables, even though their cell numbers and sizes differ by a factor of 37!
While acknowledging the limitations of this analysis it is felt nonetheless that the
constant 1 seems particularly robust in the types of scenarios investigated here and
is therefore the one used in the remainder of this chapter. This value should not
necessarily be seen as a recommendation in other scenarios, but other important lessons
can be drawn from this analysis nevertheless. Most dramatically it is the knowledge
that a constant that is too small will cause as much distortion as a constant that is too
large. The second important point stems from the slopes of the lines in the last two
figures: they are an indication of just how dramatically the constants can affect the
goodness of fit. This would indicate that it is worth giving their selection some careful
thought, rather than just accepting some traditionally used value.
11.2 Model 4A: From [AB], [AC], [BC] and [ABC]′ to $[\widehat{ABC}]$
A natural extension to the Wong analysis is to explore the sampling issues involved in
a three-dimensional table. This is done in this section, where we use Model 4 from the
previous chapter as a starting point, and add to it the additional information that is
contained in a sample taken from the full table: [ABC]′. The samples are then used as
the seed or prior used to estimate the full table, with all the two-dimensional margins
correct:
$$ \hat{x}_{ijk} = \tau \cdot \tau^A_i \cdot \tau^B_j \cdot \tau^C_k \cdot \tau^{AB}_{ij} \cdot \tau^{AC}_{ik} \cdot \tau^{BC}_{jk} \cdot \tau^{ABC'}_{ijk} \qquad [11.2] $$
Contrary to Model 4 from the previous chapter, the [AB] relationships will now
vary across the levels of C (the local authorities). Of course the [AC] relationships
will also vary across B and the [BC] across A, but we continue in our interpretations
to give priority to the geographic variation. The degree of accuracy of this model is
naturally a function of sample size, and a sample of 100% would also lead to perfect
accuracy as the fully saturated model.
Figure 11.5: Model 4A
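In code, Model 4A amounts to drawing a multinomial sample of the full three-way table, replacing its zeros with the chosen constant, and using the result as the IPF seed while the three true two-way margins are fitted. The sketch below (reusing ipf_3d, percent_misclassified and the toy array true from the Section 11.1 sketches; all names are illustrative, not the thesis's code) captures that procedure:

```python
import numpy as np

rng = np.random.default_rng(2)

def model_4a(full_table, fraction, constant=1.0):
    """Model 4A sketch: seed IPF with a sample of [ABC]' (zeros replaced by a constant)
    and fit the three true two-way margins. Structural zeros, if any, are forced back
    to zero by their empty marginal totals."""
    ab, ac, bc = full_table.sum(axis=2), full_table.sum(axis=1), full_table.sum(axis=0)
    p = (full_table / full_table.sum()).ravel()
    p = p / p.sum()
    n = max(1, int(round(fraction * full_table.sum())))
    sample = rng.multinomial(n, p).reshape(full_table.shape).astype(float)
    seed = np.where(sample == 0, constant, sample)
    return ipf_3d(seed, ab, ac, bc)            # ipf_3d as sketched in Section 11.1

# 'true' and 'percent_misclassified' as defined in the earlier sketches
for fraction in (0.5, 0.1, 0.01):
    est, ok = model_4a(true, fraction)
    print(fraction, ok, round(percent_misclassified(est, true), 2))
```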
The results from the previous section already indicated that the range of possible
sample sizes needed to be extended by adding 10% and 50% samples, thus making
it a total of nine different sample sizes. For each of the nine sample sizes we take
five repeated samples to smooth out sampling noise and then report the average of the
goodness-of-fit statistics. These are summarized in Figure 11.6 using boxplots for all the
sample sizes. The results of Model 4 are plotted alongside (in red) for reference. Given
the results in the previous section on sampling zeros the results are not surprising. On
average only the largest sample size of 50% produces better results than Model 4. The
grey dashed line represents its reference median value and on average all remaining 8
samples lie above it. The same pattern is true regardless of the metric used to measure
the goodness-of-fit.
Figure 11.6: Summary goodness-of-fit results for nine different sizes of the [ABC]′ sample with Model 4 in red for comparison (N=1596); the two panels show ∆ (%) and ZCR2 against sample size, from 50% down to 0.005%.

Based on this summary, the inclusion of the $\tau^{ABC'}_{ijk}$ term into the model looks only
advantageous if the sample size is 50%. Most other sample sizes produce significantly
worse results and the final two smallest sample sizes are again close to the level of error
found in Model 4. This is due to the fact that the number of sampling zeros in those
samples, which are all replaced by ones, means they become practically uniform and
thereby Model 4A becomes almost identical to Model 4.
The largest range in goodness-of-fit across the different sample sizes is found in the
Size of workforce by Socio-economic classification of family reference person. It ranges
from 3.51% misclassified with the largest sample size of 50% to a maximum of 14.96%,
which is reached at the 1% sample size. The smallest sample on the other hand - this
is a negligible 131 people in a table with over 20,000 cells - produces an error of 4.08%.
Under Model 4, 4.06% were misclassified.
Again we look at mosaic plots to get a more detailed picture. We select Sutton
as the local authority where the difference in maximum and minimum goodness of
fit was largest between the different sample sizes. Figure 11.7 presents these mosaic
plots along with the chart describing the same upside-down-U shape seen before: the
best fit is found at the largest and smallest sample sizes. At one extreme we have
the 50% sample (right hand mosaic plots) where in Sutton there were no sampling
zeros (otherwise marked red) and this resulted in only 3.70% misclassified in that local
authority. The middle two plots show the results when a 1% sample was taken, which
meant 18 category combinations in Sutton were empty and therefore had one person
added to each of them. This resulted in 23.37% getting misclassified in that LA,
although the fit was better for the table as a whole. Finally the smallest sample (left-
hand side) was so sparse that only two people from Sutton were in the sample. That
meant all the remaining 53 cells were artificially filled, to allow convergence of IPF. Yet
the result was almost as accurate as with the largest sample: only 3.99% misclassified.
This is of course almost identical to the Model 4 result, which had Sutton at 3.98%
misclassified. The slight difference is due to the fact this particular sample actually
had 2 people in one cell, which meant that after the addition of the constants the prior
was not perfectly uniform.
Figure 11.7: Size of workforce by NS-SEC of FRP: Sutton plots for three sample sizes before and after IPF (Sutton SAM data alongside the 0.005%, 1% and 50% samples, with ∆k = 3.99%, 23.37% and 3.70% respectively for Sutton and ∆ = 4.07%, 14.94% and 3.54% for the table as a whole; a scatter plot above shows ∆ against sample size).
The scatterplot at the top of Figure 11.7 also shows that, similarly to the averages
plotted in Figure 11.6, only the largest sample provided better fit than Model 4. This
is not true as a rule though. There are in fact 155 tables where Model 4 is better even
than Model 4A with a 50% sample size, although the differences are not large. The
largest difference is in fact found in the largest table: Ethnicity by Migration origin
has 0.17% misclassified with the 50% sample and 0.12% under Model 4. The same is
true for 117 tables using the Cressie-Read statistic, but this time the largest difference
is for the table NS-SEC of FRP by Hours of care provided with ZCR2 = 11.52 when
a uniform prior is used (Model 4) and ZCR2 = 57.17 with a 50% sample under Model
4A.
Only two out of the 1596 tables exhibit the sort of behaviour one might expect were
it not for the sampling zeros and the effect of the constants. These two tables have
a monotonically downward slope of percent misclassified as the sample size increases.
One is Migration origin by Migration indicator, the other Migration origin by Distance
moved. The latter has the higher errors across all sample sizes as shown by the black
points in Figure 11.8.11 Both tables are characterized by high numbers of structural
zeros (78% and 82% respectively). Furthermore the majority of cases (almost 89%) are
in the large 'did not move' cell, which is uniquely defined and therefore has no error,
leaving only a small proportion of cases that also have limited options possible, regardless
of the sample size. There are another 77 tables where this pattern of structural
zeros is even more rigorous, meaning there is no error at all, not under Model 4, nor
under any of the Model 4A sample sizes. There is no second-order interaction in these
tables, so the samples are completely irrelevant as they get smothered by the marginal
structural zeros.

11 This was also the worst performing table under Model 4, see Figure 10.26 on page 190.

Figure 11.8: Only two tables where ∆ falls monotonically with increased sample size (Migration origin by Migration indicator and Migration origin by Distance moved; ∆ (%) plotted against sample size)
Apart from these extreme cases however, it can be concluded based on this analysis
that there is no feasible sampling scheme that would produce acceptable results in a
three-dimensional scenario at this degree of geographic resolution. Almost ten percent
of the time we found that even a 50% sample was not sufficient to produce a reliable
estimate; that the sampled ratios of odds ratios are too extreme and propagated more
errors than assuming there is no second-order interaction to begin with.
Due to the extreme range of table sizes there is a danger in generalizing: a 50%
sample is of course not equally accurate across all tables. Figure 11.9 plots the percent
of sampling zeros12 against the change in goodness-of-fit statistics between the uniform
Model 4 and the 50% sample in Model 4A. The red regression lines show the relationship is
positive for both percent misclassified and the Cressie-Read statistic: the more sampling
zeros, the less improvement is achieved by adding the sample compared to using a
uniform prior. In fact there are tables where the fit becomes worse i.e. the change
is positive. But this relationship is extremely weak, making it clear that it is not
the sparsity of the sample that is the main issue. The relationship is similarly weak
with all of the measures of association strength described in Section 8.3. Given the
simplistic nature of those measures which provide a single summary value to describe
a whole table this might be expected. As we saw in the analysis of sampling zeros, the
multiplicative nature of the odds ratios can propagate errors quite dramatically, but
we have not found a straightforward way to measure or describe a table in such a way.

12 This is the net proportion of sampling zeros, where the number of structurally empty cells is subtracted from both the numerator and the denominator. Only this way can this percentage truly measure what we are interested in since the structural zeros cannot contribute to the errors.

Figure 11.9: Improvement under 50% sample in Model 4A over Model 4 and proportion of empty cells (net % of sampling zeros against the change in ∆ (%), R2 = 0.078, and against the change in ZCR2, R2 = 0.036)
Despite the large sample size it seems the sensitivity of IPF to even small errors
in odds ratios and consequently the ratios of odds ratios means these errors often
outweigh the benefits of introducing the second-order interaction. Samples of this size
are therefore impractical and as it turns out also inefficient. The following section
investigates a more realistic prospect however: that of using a sample at a higher scale,
either geographic or geodemographic.
11.3 Model 4B: From [AB], [AC], [BC] and [ABC_GOR]′ or [ABC_SG]′ to $[\widehat{ABC}]$
Following the analysis in the previous section, which found sampling at the lowest
geographic scale to be grossly inefficient in practically all cases, the question remains
whether it is possible to borrow strength from sources other than an impractically
large sample of the table we are trying to estimate. One natural place to look is at the
regional level: could the information from a sample at the regional level provide accurate
enough information to improve on assuming there is no second-order interaction? And
reminding ourselves of one of the findings in chapter 9, this sparks a second question:
could a sample at the Supergroup level of the ONS area classification provide an even
better source of information on the three-way interaction than the regionally aggregated
one? This section compares these two strategies.
In order to borrow strength from the regional or Supergroup data we need to add an
extra step to the model compared to the procedure so far. Figure 11.10 demonstrates
this two-step procedure. First we take a sample from the [ABC_GOR] table and use IPF
to inflate it back to the correct population size:

$$ \hat{x}_{ijk} = \tau \cdot \tau^A_i \cdot \tau^B_j \cdot \tau^C_k \cdot \tau^{AB}_{ij} \cdot \tau^{AC_{GOR}}_{ik} \cdot \tau^{BC_{GOR}}_{jk} \cdot \tau^{ABC'_{GOR}}_{ijk} \qquad [11.3] $$

Figure 11.10: Model 4 with 3D sample taken at regional/Supergroup level
From this we can then take the ten [AB|C_GOR = k] margins that have resulted,
all of which correctly add up to the national [AB]. Now we split the full table of 373
local authorities into ten segments, one for each region. Using the new [AB|C_GOR = k]
margins we can then use IPF to correctly adjust to the [AC_GOR = k] and [BC_GOR = k]
margins, and to an [AB|C_GOR = k] margin that comes from a sample of the regions.
When these ten segments are stacked back together after IPF the correct [AB] margin
is also ensured (the exact same process applies to using the Supergroup aggregation
with seven instead of ten segments):
$$ \hat{x}_{ijk} = \tau \cdot \tau^A_i \cdot \tau^B_j \cdot \tau^C_k \cdot \tau^{AB|C=GOR}_{ij} \cdot \tau^{AC}_{ik} \cdot \tau^{BC}_{jk}, \quad k \in (GOR = 1, 2 \ldots 10) \qquad [11.4] $$
This means that within each region/Supergroup there is no second-order interaction:
all the odds ratios are the same for all the local authorities in any particular
region/Supergroup. They add up to a sample-based [AB|C = k] margin for that region
(k ∈ (GOR = 1, 2...10)) or for that Supergroup (k ∈ (SG = 1, 2...7)). Depending on
the sample sizes, which are the same as in the previous section, these margins are more
or less accurate, but either way it is ensured that all the regions stacked together sum up to
the correct national [AB] margin.
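A compact way to express this two-step procedure is sketched below, again as illustrative Python rather than the thesis's code: group_of is an assumed vector mapping each LA to its region or Supergroup, the constant 1 is used for sampling zeros, and ipf_3d is the function sketched in Section 11.1. Step one inflates a sample of the group-aggregated table; step two fits each group's LAs to their true [AC] and [BC] margins with a uniform seed, so that there is no second-order interaction within a group.

```python
import numpy as np

def model_4b(full_table, group_of, fraction, rng=np.random.default_rng(3)):
    """Two-step Model 4B sketch: sample the group-aggregated table (regions or
    Supergroups), inflate it with IPF, then fit each group's LAs to their true
    [AC] and [BC] margins with the sampled group-level [AB] margin and a uniform
    seed (i.e. no second-order interaction within the group)."""
    groups = np.unique(group_of)

    # step 1: aggregate the LAs within each group, sample, replace zeros, inflate with IPF
    grouped = np.stack([full_table[:, :, group_of == g].sum(axis=2) for g in groups], axis=2)
    p = (grouped / grouped.sum()).ravel()
    p = p / p.sum()
    n = max(1, int(round(fraction * grouped.sum())))
    sample = rng.multinomial(n, p).reshape(grouped.shape).astype(float)
    seed = np.where(sample == 0, 1.0, sample)
    fitted_groups, _ = ipf_3d(seed, grouped.sum(axis=2), grouped.sum(axis=1), grouped.sum(axis=0))

    # step 2: within each group, a uniform seed fitted to the true [AC]/[BC] margins
    # and the sampled group-level [AB] margin; stacking the groups restores national [AB]
    est = np.zeros_like(full_table, dtype=float)
    for gi, g in enumerate(groups):
        idx = np.where(group_of == g)[0]
        block = full_table[:, :, idx]
        block_est, _ = ipf_3d(np.ones_like(block, dtype=float), fitted_groups[:, :, gi],
                              block.sum(axis=1), block.sum(axis=0))
        est[:, :, idx] = block_est
    return est
```

With group_of holding the region codes this corresponds to Model 4B-GOR; with the Supergroup codes, to Model 4B-SG.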
To demonstrate the results obtained we take a look again at the tabulation that had
the largest range of percent misclassified in the previous section: size of workforce by
NS-SEC of FRP. Figure 11.11 plots both sets of statistics comparing Model 4A (grey
line) with the results from sampling at regional level (black line) and at Supergroup
level (red line). For reference purposes the result of the uniform prior (Model 4) is also
shown as the dashed grey line. Two important findings stand out from these two plots.
Firstly the reduction of goodness-of-fit that results from reducing sample size is much
slower for the Model 4B tables than for Model 4. This is of course to be expected to a
certain degree, since the samples taken this time around are taken from tables that are
about 37 (or 53) times smaller than the full LA tables, making the average cell sizes
considerably larger.

Figure 11.11: Goodness-of-fit for Size of workforce by NS-SEC of FRP under all three sampling models (∆ (%) and ZCR2 against sample size; legend: Model 4A, Model 4B-GOR, Model 4B-SG, Model 4)

Figure 11.12: Goodness-of-fit for Size of workforce by NS-SEC of FRP under all three sampling models relative to average cell size (same legend, with average sample cell size on the x-axis)
The second clear pattern that can be seen both using percent misclassified and
Cressie-Read statistic is that aggregating the LAs according to Supergroup membership
and fitting the appropriate [AB|C_SG = k] margins produces better fit than
using the [AB|C_GOR = k] margins. Although we should not dismiss the fact that there
are only seven Supergroups compared to ten regions, the difference in performance is
larger than could be attributed to the better sample quality of the Supergroup data.
This is demonstrated if we re-plot the data in Figure 11.11 to change the x-axis from
sample size to the average cell size, when we see that the red line remains below the
black one. This confirms that the Supergroup aggregation is more informative than
the geographical one after taking into account the sampling noise. So not only is the
geodemographic aggregation outperforming the geographical one, it is doing so despite
having fewer groups. But this analysis is based on only one particular table, so we now
look at whether these findings hold more generally.

Figure 11.13: Summary goodness-of-fit results comparing all three sampled models (N=1596); box plots of ∆ (%) and ZCR2 at each sample size (legend: Model 4A, Model 4B-GOR, Model 4B-SG, Model 4)
Although using the average sample cell size on the x-axis allowed us to directly
compare the errors for the above table, we must return to using relative sample size
in order to summarily present the results across the 1596 tables. In analogy to the
summary figure in the previous section, Figure 11.13 shows the results for both measures
under Models 4B-GOR and 4B-SG (∆ (%) in the top chart and ZCR2 in the bottom one).
The black and red pairs of box plots show the regional and Supergroup results side by
side at each sample size. For reference the results from the previous section for Model 4
are lightly printed in the background (this is the same data that was presented in Figure
11.6 on page 217). These summary statistics seem to confirm the general patterns found
for the single table above. On average the tables using the Supergroup samples (red)
outperform the ones using regional samples (black). Compared to the LA samples
(light grey) both the regional and the Supergroup samples perform better for the five
largest sample sizes, then for sample sizes of 0.1% or less the LA samples perform
better as a result of the number of sampling zeros being replaced with a constant value
(1). Finally these results also show that the two largest regional samples and the three
largest Supergroup samples all perform better on average than the no second-order
interaction Model 4, which is plotted on the left hand side of both plots and whose
median is also shown as a dashed line running across all the sample sizes.

Table 11.2: Proportion of tables where regional sampling outperforms the geodemographic one - top and bottom five variables (proportions are shown as shaded areas for each of the nine sample sizes, 0.005% to 50%). Top five: Country of birth, Accommodation type, Migration origin, Central heating, Sex. Bottom five: NS-SEC of FRP, Family type, Status in com.est., Age, Marital status.
These patterns are based on the summary values presented in Figure 11.13, so they will
not hold for all tables. Of the 1596 × 9 = 14,364 tables the Supergroup is more reliable
than the regional sample in all but 1,371 table-sample combinations. There are only
five tables where the regional sample is more reliable than the Supergroup one for all
sample sizes if percent misclassified is used as the measure, and 15 of them using the
Cressie-Read statistic. All of these tables, without exception, contain either Migration
origin or Country of birth as one of the variables. But for over 60% of the tables the
Supergroup sample was consistently better than the regional one across all nine sample
sizes. In fact most of the cases where the regional sample outperforms the Supergroup
one happen at the smallest of sample sizes (as we saw in the example of the table
described in Figure 11.11), where the sampled tables become incredibly sparse even at
this scale. If we disregard these two sample sizes then 85% of the time the Supergroup
samples provide better fit than the regional ones. So although not invariably true, there
is a pretty clear pattern indicating that the geodemographic grouping provides more
accurate estimates of relationships between variables than geographic proximity.
Although Supergroup aggregation provides more accurate estimates on the whole,
it is worth noting the tables or variables where the geography is more successful at
capturing the variable associations. Migration origin and Country of birth have
already been mentioned as two such variables. Table 11.2 lists the top and bottom
five variables ranked according to how often Model 4B-GOR provided better estimates
than Model 4B-SG. For each sample size the shaded area represents the proportion of
tables where that was true (each variable occurs in 56 different tables). Thus of the
56 crosstabulations of Country of birth, 44 performed better using a 50% sample from
the geographic aggregation than from the geodemographic one. These results were
calculated using ∆ (%) as the measure, but the ranking remains the same using Cressie-
Read, although the proportions are not exactly the same. The variables are ranked by
the results for the largest sample. A complete table for all 57 variables and nine sample
sizes is provided as Appendix F.

Table 11.3: Proportion of tables where Model 4 outperforms regional (darker grey) and Supergroup sampling (lighter grey) - top and bottom five variables (proportions shown as shaded areas for sample sizes 0.1%, 0.5%, 1%, 5%, 10% and 50%). Top five: Comm.est. type, Status in comm. est., Bath and WC, Hours of care, Students away. Bottom five: Tenure, Sex of FRP, Economic act. of FRP, Marital status, Dependent children.
Finally we take a closer look at the comparison with the no second-order interaction
Model 4. We found with Model 4A that only a 50% sample could outperform the uniform
prior and even then there were over a hundred tables where that was not true. At the
10% sample the overwhelming majority performed worse than Model 4. So how much
better do the regional and Supergroup samples fare in this respect? Table 11.3 lists the
top and bottom ranked variables according to how often their crosstabulations were
better off using the uniform prior of Model 4, compared to either the geographically
or geodemographically based sampling. The three smallest sample sizes are not shown
as they are all without exception better under Model 4. We can then see for example
that tables where Communal establishment type is one of the variables are generally
better off with a uniform prior - even using a 50% sample Model 4 is better than Model
4B-GOR in 38 out of 56 tables and better than Model 4B-SG in 33 out of 56. However
beyond the two communal establishment variables the 50% samples do perform well, as
do most of the 10% samples and for Supergroups the 5% samples also. The full table
for all 57 variables is given in Appendix G, where it is clear that for the great majority
of variables the 5% samples using the Supergroup aggregation are preferable to using
a no second-order interaction model, whilst at smaller sample sizes the uniform prior
is more often the safer choice.
11.4 Summary
This chapter addressed a number of practical issues and limitations encountered when
using IPF in a multidimensional census data scenario, where some of the constraints are
inaccurate due to sampling. The issue of sampling zeros, which can quickly prevent the
convergence of the algorithm, was tackled first via the addition of a constant to each
starting zero cell count. This analysis established a framework for making a better
informed decision on what constant to add to these empty cells. This framework was
then applied in the next section, sampling from the fully tabulated population at local
authority level. When even impractically large samples produced poor results, the next
section instead looked at borrowing strength from geographically or geodemographically
aggregated local authorities and was able to compare the two to each other as well.
The issue of sampling zeros is often resolved by simply adding a small constant to
ensure IPF converges, without much additional thought as to how this might
affect the result. A common-sense approach would have one add as small a value as
possible in order to minimise whatever negative impact adding a constant might have.
In fact we show that this approach can cause dramatic distortion of the estimates - the
same kind of distortion that adding an incredibly large constant would cause. This is the
result of the multiplicative nature of the odds ratios and is a crucial element in understanding
the propagation of error in IPF. Our analysis, intended to find a rule for choosing the best
constant to add, unfortunately did not produce a systematic result. One was chosen as
the most neutral constant based on a limited analysis of the most extreme tables.
may not generally be the optimum value to add, and this understanding of the quite
serious errors that can be generated through what might be thought of as only a trivial
decision is perhaps the most important message from this section.
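To make the mechanism concrete, the following minimal sketch (hypothetical 2x2x2 counts
and base R's loglin function, not the programme developed in this thesis) runs IPF with the
three two-way margins of a small "population" table as constraints, seeding it with a sparse
sample whose zero cells are replaced by a constant. Because the fixed margins force the fitted
table to retain the seed's three-way interaction, changing that constant from 0.001 to 1 visibly
changes the estimates.

# Illustrative sketch only: IPF via stats::loglin on hypothetical data,
# showing how the constant chosen for sampling zeros propagates into the fit.
population <- array(c(40, 25, 30, 35,
                      20, 45, 25, 60), dim = c(2, 2, 2))
sample3way <- array(c(4, 0, 3, 2,
                      1, 5, 0, 6), dim = c(2, 2, 2))   # contains sampling zeros

ipf_with_constant <- function(k) {
  seed <- sample3way
  seed[seed == 0] <- k                       # the constant under scrutiny
  loglin(population, margin = list(c(1, 2), c(1, 3), c(2, 3)),
         start = seed, fit = TRUE, iter = 50, print = FALSE)$fit
}

round(ipf_with_constant(0.001), 1)   # a "tiny" constant
round(ipf_with_constant(1), 1)       # a constant of one: noticeably different estimates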
Section 10.4 tests the assumption that adding more information to the model im-
proves its fit. Sampling from the full [ABC] table in addition to the three side margins
already included in Model 4 in fact produced some quite startling results. It was already
anticipated that very sparse samples would paradoxically perform quite well, because
with the constants added they effectively behave almost like uniform priors.
At the other end of the spectrum of sample sizes the patterns were less surprising:
as a rule the larger the sample the better the fit. However, comparing the fit to the
benchmark Model 4 with a uniform prior found that generally only the largest sample
of 50% could reasonably compete. This is not an encouraging result as this size of
sample is only really hypothetical. Sample sizes that might conceivably be encountered
in the real world almost invariably produced results that were dramatically worse than
assuming no second order interaction. Again it would seem the multiplicative nature
of the odds ratio means sampling noise can cause so much distortion as to counteract
the extra information gained.
The final section looks at another source of extra information: samples with higher
aggregations than local authorities. Our dataset allows us to aggregate the data into
regions and geodemographic Supergroups, from which we can sample more accurately.
The idea is that the odds ratios will be more homogeneous within geographic and/or
geodemographic groups so that this information might provide more accuracy than
the previous model. This does in fact prove correct, although again the smallest
adequate sample sizes are generally 5% or 10%. The results also show
that geodemographic aggregation is more successful in capturing the variation of the
bivariate relationships than regional aggregation, despite the fact that the Supergroup
aggregation used has fewer groups.
The results of this chapter do seem to make a strong case against using any but
the largest sample sizes for second- (or higher) order interactions. There is at least one
qualification to that conclusion, however: we are in fact already operating
with a 5% sample of the true population as our base population. It should therefore
be kept in mind that it is the extreme and noisy ratios of odds ratios that cause the
error of sampled estimates, and that our investigation actually involves sampling from
a sample, thereby possibly increasing the chances of such extreme scenarios. Another
caveat is that our analysis assumes the maximum availability of the marginal data,
i.e. we are using Model 4 as our starting position instead of one of the lower, less
constrained models. Without the availability of these correct margins the estimates
would suffer accordingly.
Part IV
Evaluation
Chapter 12
Conclusion
This thesis presents a comprehensive review and analysis of the theoretical background
and practical applications and limitations of the iterative proportional fitting (IPF)
algorithm. Through an in-depth historical and practical examination it provides an ex-
tensive and yet accessible overview of a method that is all too commonly used without
its logic or limits being fully understood. In the social sciences the statistical and technical
aspects of this and other algorithms too often distance them from users, thereby pre-
venting them from being used to their full potential. The applied sections of this thesis
in particular provide several surprising results that we hope are convincing enough to
show that a black box approach to IPF (and other methods) is potentially dangerous.
In Part I of the thesis we propose three levels of understanding IPF: the classical
utilitarian approach, which we argue is intuitive but insufficient; then the log-linear
approach, which we believe is crucial for formalizing the procedure systematically –
something that is indispensable as soon as we move beyond the simplest two-dimensional
tables – and finally the maximum entropy conceptualisation, which explains the elegant
logic underlying the procedure. This uniquely comprehensive and interdisciplinary
approach is in stark contrast with partial and field-specific accounts that dominate the
literature.
Early applications are intuitively accessible and their description quickly makes it
clear how the algorithm could have been discovered independently on several occasions.
Adopting a history of statistics narrative style makes the technical aspects more easily
accessible. From Dutch telephone exchanges to Mexican marriage patterns, Chapter 3
on early inventions and applications provides a wide-ranging overview of IPF’s early
history and gives extra insight into the terminological complications that ensued. The
intuitiveness and simplicity of these examples detract from the theoretical underpin-
ning of IPF, and this is an issue that many of the original authors faced and that many
contemporary users still have not resolved.
Log-linear models provide the language that allows us to express even the most
complex applications of IPF in a clear and unambiguous way. This formal framework
is extensive, with what might be seen as redundant options: additive or multiplicative
formulations, indicator or deviation contrast parameters: all alternative ways
of expressing the same model. More than anything else Chapter 4 represents a detailed
breakdown of all these types of formulations and parameter contrast choices that we
unfortunately find lacking in the current technical literature. The argument has been
made that log-linear models are underutilized in the social sciences despite the pre-
ponderance of categorical data, and we feel this chapter could serve as a contribution
towards reversing that trend. Most importantly for understanding IPF, the log-linear
framework reinforces the concept of variational independence, the idea that different
table constraints can be independent from each other and yet consistently and uniquely
combined.
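As a minimal illustration of that equivalence (using a hypothetical 2x2 table and base R
rather than an example from the thesis), the same saturated log-linear model can be fitted
under indicator (treatment) or deviation (sum) contrasts: the parameter estimates differ,
but the fitted counts are identical.

# Hypothetical 2x2 table of counts, arranged as a data frame for glm()
tab <- data.frame(A = gl(2, 2), B = gl(2, 1, 4), n = c(30, 10, 20, 40))

# Saturated log-linear model as a Poisson GLM, under two contrast conventions
m_ind <- glm(n ~ A * B, family = poisson, data = tab,
             contrasts = list(A = "contr.treatment", B = "contr.treatment"))
m_dev <- glm(n ~ A * B, family = poisson, data = tab,
             contrasts = list(A = "contr.sum", B = "contr.sum"))

coef(m_ind)                                # indicator-contrast parameters
coef(m_dev)                                # deviation-contrast parameters
all.equal(fitted(m_ind), fitted(m_dev))    # the fitted table is the same either way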
Bridging the theoretically separate fields of log-linear models and entropy maximiz-
ing are gravity models, models that have a special place in geography and are given
extra consideration in this context. Wilson’s reformulation of these spatial interaction
models gave them a theoretical justification based on minimizing the uncertainty asso-
ciated with missing information. Wilson’s entropy models serve as a link to the third
level of understanding of IPF and the logic of its estimates: entropy as a measure of
uncertainty of a probability distribution. Chapter 5 goes into considerable theoretical
depth elucidating the different concepts of entropy. Albeit perhaps tangential to our
main aim of understanding the logic of IPF estimates, this foray into philosophical
waters is crucial in completing a comprehensive review of the procedure. For it is by
understanding the maximum entropy solution as the one that agrees with all of the
known information, while being minimally committal with respect to the unknown,
that we establish IPF as based on a fundamental principle of reasoning.
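In symbols, and restating only the standard formulation rather than any result specific to
this thesis, the maximum entropy estimate is the distribution that satisfies the known
constraints while remaining maximally non-committal about everything else:

\[
\hat{p} \;=\; \arg\max_{p}\; \Bigl( -\sum_{i} p_i \log p_i \Bigr)
\quad \text{subject to} \quad
\sum_{i} a_{ki}\, p_i = c_k \;\; (k = 1, \dots, K), \qquad \sum_{i} p_i = 1,
\]

where the constraint totals \(c_k\) play the role of the marginal totals imposed by IPF.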
As geographers we have a special interest in the relevance of the spatial dimension
and this is true even when investigating a data technique that is in essence aspatial.
While IPF treats geography as just another variable, this thesis pays special attention
to how it estimates geographic variability. This means singling out the geographic
dimension of the dataset and investigating along it: how well relationships between
variables are reproduced after IPF and to what extent the IPF estimates are geographically
realistic. This first requires a definition of what we mean by a relationship between
variables, which is the first of the metric problems tackled in the thesis. The decision
on what is a strong and what is a weak association between two variables cannot
be a straightforward one and to a certain extent there will always be an element of
arbitrariness involved. The same applies to the choice of goodness-of-fit statistic and in
both cases our philosophy is that this awareness is more important than the convenience
of settling for a single measure. We therefore do not make that choice, but rather use
this opportunity to gain a better understanding of three types of association descriptors
and their behaviour in describing the variability of our data.
Perhaps even more important is the question of goodness-of-fit. This metric is
fundamental to our definition of what we consider a quality estimate. It is therefore
not just a tool applied to the problem, but is implicated in the definition of the problem as well.
Consequently our methods chapter (Chapter 8) goes into considerable detail elucidating
the logic and theory behind various measures and how they might be practical for our
applications. The specifics of our estimation scenarios also called for novel solutions
to using classical measures. Because we needed to compare the quality of
estimates across different sized tables, measures that are asymptotically
chi-square are not directly useful, as they cannot be compared across different degrees
of freedom. This is an uncommon problem and we believe we set a precedent by solving
it using a normalization of chi-squared to standardize the measures and thereby make
them directly comparable.
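One normalization of this kind, shown here purely as an illustration (the specific
normalization developed in Chapter 8 may differ), is the Wilson-Hilferty cube-root
transformation, which converts a chi-square statistic with any degrees of freedom into an
approximately standard normal score, putting statistics from tables of different sizes on a
common scale.

# Illustrative sketch: Wilson-Hilferty (1931) normalization of chi-square,
# mapping statistics with different degrees of freedom to approximate z-scores.
normalize_chisq <- function(x2, df) {
  ((x2 / df)^(1/3) - (1 - 2 / (9 * df))) / sqrt(2 / (9 * df))
}

# Two tables of very different size become directly comparable
normalize_chisq(x2 = 130, df = 100)   # roughly 2.0
normalize_chisq(x2 = 16,  df = 9)     # roughly 1.5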
Another technical yet original aspect of the research presented here is the program-
ming solution adopted to carry out the analysis. The R code provides a dedicated
open-source solution that is more straightforward to use than alternative (commercial)
options, and can even handle sampled data directly. The programme is provided as an
Appendix, but the intention is to make it available electronically as well (after the in-
clusion of some additional error handling). R also provides great graphing capabilities
and the vcd package is responsible for most of the mosaic displays used throughout
the thesis. Although this is a more pedagogic side note, we feel this sort of visual
demonstration of tables contributes significantly to the understanding of the structures
and processes taking place. The mosaic cubes used here are furthermore an original
attempt to extend this principle to three dimensions.
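For readers unfamiliar with the package, a minimal example of such a display, using a
built-in R dataset rather than the census tables analysed in the thesis, is:

library(vcd)

# HairEyeColor is a built-in three-way contingency table (Hair x Eye x Sex);
# shading colours the tiles by Pearson residuals from an independence model.
mosaic(~ Hair + Eye + Sex, data = HairEyeColor,
       shade = TRUE, legend = TRUE)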
The section of the thesis describing the dataset is unconventional in that it uses
the description of the geographic variation found in the UK Census Small Area Micro-
data (SAM) as an opportunity to comprehensively review three important statistical
and geographical issues. The modifiable areal unit problem, the ecological fallacy and
Simpson’s paradox are all topics that have received considerable attention in the lit-
erature – as separate issues. Our analysis however brings the three issues together
while rigorously delimiting them in a comprehensive manner that has so far been lack-
ing. The SAM dataset is introduced and explored by investigating the magnitude of
each of these issues. This analysis concludes in one of the important contributions of
this thesis: a summary table that conveniently and clearly defines and distinguishes
the various manifestations of the ecological fallacy, Simpson’s paradox and the modi-
fiable areal unit problem, thereby hopefully reducing misinterpretations and confusion
between them in the future.
The two practical applications of IPF to the dataset of 1596 three-dimensional
crosstabulations of UK census data provide some expected and some surprising results.
On the one hand most of the variables that stood out as being particularly prone to error
were not surprising. Variables such as region of origin, country of birth and ethnic group
consistently showed high levels of geographic variation in their interactions with other
variables. At other times bivariate combinations were more important than univariate
distributions. This was particularly true for variable combinations which conspire to
create large numbers of structural zeros, which can effectively provide enough of a
constraint on the solution to make sure the estimates are perfect. A comprehensive
ranking of these variables, which is provided as Appendix E, offers a useful reference
as to which variables researchers should take particular care with, i.e. in which cases
the geographical margins are more important than the bivariate ones and vice versa.
These results alone justify singling out the spatial dimension in IPF.
The two sampling applications described in Chapter 11 bring with them even more
important lessons for users of IPF. The first finding concerns the question of sampling
zeros - often resolved without much thought by adding a minimal constant to ensure
convergence. We show the dangers inherent in such an approach as it can lead to
dramatic distortions of the estimates. Given an understanding of variational independence
and the multiplicative nature of the odds ratio, this should of course come
as no surprise. We have unfortunately not found a robust means of deciding what
constant to add, other than to urge caution against going to either extreme.
Investigating the properties of IPF estimates under various sampling schemes found
even more dramatic results: at the local authority resolution our results show that only
50% samples have enough discrimination power to generally outperform the use of
a uniform prior, i.e. assuming no second-order interaction. Again this is the result
of the multiplicative nature of the models, which means that sampling noise in three
dimensions propagates much more quickly, leading to unreliable estimates.
As an alternative and more realistic option we finally investigated the quality of esti-
mates if the sample was taken at either regional or Supergroup level. As expected these
estimates produced better results – relative to sample size – than the previous samples,
but perhaps the most important contribution of this analysis is the relationship between
the geographic and geodemographic results. Except for a small proportion of tables the
geodemographic aggregations consistently outperformed the geographical ones, indicat-
ing that despite the cluster design being based on univariate variable distributions, they
are also good at capturing variation in bivariate associations. These results also suggest
that future publications of census and other survey microdata would gain considerable
information content by including geodemographic information, while not increasing the
risk of identification which is associated with releasing geographic information at too
fine a level.
It is important to highlight that despite our best efforts, this work is far from com-
prehensive and that there are important issues that have been sidestepped or omitted
completely due to lack of time and space. One important practical issue that sooner
or later becomes relevant in any IPF application of the type investigated here is that
of rounding. Our analysis stopped short of that and investigated the errors of the pure
IPF estimates; however, many applications would require the IPF estimate to be con-
verted back into microdata, which requires integer cell values. This is a topic to be
tackled in the future.
There is also the scope to expand on the sampling models used here by considering
a further level of geographic aggregation – the use of a national prior – and by revisiting
the analysis using the lower, less constrained IPF models, such as the first three con-
sidered in Chapter 10. Following previous research into geodemographic classification
the main inspiration for follow-up work beyond the thesis lies in the results from the
second sampling exercise. Those results not only give extra support to geodemographic
classification, but also provide a possible way of evaluating the quality of classifications
by their ability to capture bivariate variation. To what extent this knowledge could
be used to inform and improve the classification process is a topic awaiting further
research.
Through this story of IPF we have brought together a wide spectrum of statistical
as well as classically geographical topics: from generalized linear models and goodness-
of-fit statistics on the one hand to spatial interaction models and the modifiable areal
unit problem on the other. It is not common that the in-depth study of one specialized
topic can also lead to this sort of breadth of disciplinary engagement. In this way we
believe this thesis also represents an interdisciplinary contribution to the history of
science.
Bibliography
Abadir, K. M., & Magnus, J. R. (2002). Notation in econometrics: a proposal for a
standard. Econometrics Journal,5(1), 76–90.
Agresti, A. (1996). An Introduction to Categorical Data Analysis. Wiley Series on
Probability and Statistics. New York, Chichester: John Wiley & Sons, Inc.
Agresti, A. (2002). Categorical Data Analysis. Wiley series in probability and statistics.
Hoboken, NJ: Wiley Interscience, 2nd ed.
Agresti, A., Ghosh, A., & Bini, M. (1995). Raking Kappa: Describing Potential Impact
of Marginal Distributions on Measures of Agreement. Biometrical Journal,37 (7),
811–820.
Alker, H. R. J. (1969). A Typology of Ecological Fallacies. In M. Dogan, & S. Rokkan
(Eds.) Social Ecology, (pp. 69–86). Cambridge: MIT Press.
Amrhein, C. (1995). Searching for the elusive aggregation effect: evidence from statis-
tical simulations. Environment and Planning A,27 (1), 105–119.
Anderson, T. R. (1955). Intermetropolitan migration: a comparison of the hypothesis
of Zipf and Stouffer. American Sociological Review,20 (3), 287–291.
Bakan, D. (1966). The test of significance in psychological research. Psychological
Bulletin,66 (6), 423–437.
Basu, A., Ray, S., Park, C., & Basu, S. (2002). Improved power in multinomial
goodness-of-fit tests. Journal of the Royal Statistical Society. Series D (The Statis-
tician),51 (3), 381–393.
URL http://www.jstor.org/stable/3650281
Batty, M. (1974). Spatial Entropy. Geographical Analysis,6(1), 1–31.
Batty, M. (1976). Entropy in Spatial Aggregation. Geographical Analysis,8(1), 1–21.
Batty, M., & Mckie, S. (1972). The Calibration of Gravity, Entropy and Related Models
of Spatial Interaction. Environment and Planning A,4(2), 205–233.
Bennett, R. J., & Haining, R. P. (1985). Spatial Structure and Spatial Interaction:
Modelling Approaches to the Statistical Analysis of Geographical Data. Journal of
the Royal Statistical Society. Series A (General),148 (1), 1–36.
Berkson, J. (1938). Some difficulties of interpretation encountered in the application of
the chi-square test. Journal of the American Statistical Association,33 (203), 526–
536.
URL http://www.jstor.org/stable/2279690
Birch, M. (1963). Maximum Likelihood in Three-Way Contingency Tables. Journal of
the Royal Statistical Society Series B (Methodological),25 (1), 220–233.
Birkin, M., & Clarke, M. (1988). SYNTHESIS - a synthetic spatial information system
for urban and regional analysis: methods and examples. Environment and Planning
A,20 (12), 1645–1671.
Bishop, Y. M. M. (1969). Calculating smoothed contingency tables. In J. P. Bunker
et al. (Eds.) The National Halothane Study: A Study of the Possible Association
Between Halothane Anesthesia and Postoperative Hepatic Necrosis. Bethesda, MD:
National Institutes of General Medical Sciences.
Bishop, Y. M. M., Fienberg, S. E., & Holland, P. W. (1975). Discrete Multivariate
Analysis. Cambridge MA and London UK: MIT Press.
Blalock, H. M. J. (1964). Causal Inference in Nonexperimental Research. Chapel Hill:
University of North Carolina Press.
Blyth, C. R. (1972). On Simpson’s Paradox and the Sure-Thing Principle. Journal of
the American Statistical Association,67 (338), 364–366.
Boltzmann, L. (1877). Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen
Wärmetheorie und der Wahrscheinlichkeitsrechnung respektive den Sätzen über das
Wärmegleichgewicht. Wiener Berichte, 76, 373–435.
Box, G. E., Hunter, W. G., & Hunter, J. S. (1978). Statistics for Experimenters: An
Introduction to Design, Data Analysis, and Model Building. New York: Wiley.
Bregman, L. M. (1967). Proof of the convergence of Sheleikhovskii’s method for a
problem with transportation constraints. USSR Computational Mathematics and
Mathematical Physics,7(1), 191–204.
Brouwer, F., Nijkamp, P., & Scholten, H. (1988). HYBRID LOG-LINEAR MODELS
FOR SPATIAL INTERACTION AND STABILITY ANALYSIS(∗). Metroeconom-
ica,39 (1), 43–65.
Brown, D. T. (1959). A note on approximations to discrete probability distributions.
Information and Control,2(4), 386–392.
Brown, M. B. (1976). Screening effects in multidimensional contingency tables. Applied
Statistics,25 (1), 37–46.
Burns, P. (2009). The R Inferno. London: Burns Statistics.
Byrne, B. M. (1991). The maslach burnout inventory: Validating factorial structure
and invariance across intermediate, secondary, and university educators. Multivariate
Behavioral Research,26 (4), 583.
URL http://search.ebscohost.com.ezproxy.liv.ac.uk/login.aspx?direct=
true&db=buh&AN=6377630&site=ehost-live&scope=site
Canal, L. (2005). A normal approximation for the chi-square distribution. Computa-
tional Statistics & Data Analysis,48 (4), 803 – 808.
URL http://www.sciencedirect.com/science/article/B6V8V-4CBVKT2-1/2/
96509bcb6e66566e64fed043090bb093
Carmines, E., & McIver, J. (1981). Social Measurement: Current Issues, chap. Ana-
lyzing models with unobservable variables, (pp. 65–115). Sage.
Carrothers, G. A. (1956). An Historical Review of the Gravity and Potential Concepts
of Human Interaction. Journal of the American Institute of Planners,22 (2), 94–102.
Chen, P. Y., & Popovich, P. M. (2002). Correlation: Parametric and Nonparametric
Measures. No. 07-139 in Sage University Press Series on Quantitative Applications
in the Social Sciences. Thousand Oaks Ca: Sage.
Cleave, N., Brown, P., & Payne, C. (1995). Evaluation of Methods for Ecological
Inference. Journal of the Royal Statistical Society series A,158(1), 55–72.
Cressie, N., & Read, T. R. C. (1984). Multinomial goodness-of-fit tests. Journal of the
Royal Statistical Society. Series B (Methodological),46 (3), 440–464.
URL http://www.jstor.org/stable/2345686
Darroch, J. N. (1962). Interactions in Multi-Factor Contingency Tables. Journal of the
Royal Statistical Society. Series B (Methodological),24 (1), 251–263.
Deming, W. E., & Stephan, F. F. (1940). On a Least Squares Adjustment of a Sampled
Frequency Table When the Expected Marginal Totals are Known. The Annals of
Mathematical Statistics,11 (4), 427–444.
Duncan, O. D., Cuzzort, R. P., & Duncan, B. (1961). Statistical Geography: Problems
in Analysing Areal Data. Glencoe, Illinois: The Free Press.
Fienberg, S. E. (1971). A Statistical Technique for Historians: Standardizing Tables of
Counts. Journal of Interdisciplinary History,1(2), 305–315.
Fienberg, S. E. (1992). Introduction to Birch (1963) Maximum Likelihood in Three-
Way Contingency Tables. In S. Kotz, & N. L. Johnson (Eds.) Breakthroughs in
Statistics, (pp. 453–461). Springer.
Fienberg, S. E., & Rinaldo, A. (2007). Three centuries of categorical data analysis: Log-
linear models and maximum likelihood estimation. Journal of Statistical Planning
and Inference,137 , 3430–3445.
Fingleton, B. (1981a). Log-linear modelling of geographical contingency tables. Envi-
ronment and Planning A,13 (12), 1539.
Fingleton, B. (1981b). Log-linear models, mostellerizing and forecasting. Area,13 (2),
123–129.
Fisher, R. A. (1950(1925)). Statistical Methods for Research Workers. Edinburgh:
Oliver and Boyd, 11th ed.
Flowerdew, R., Geddes, A., & Mick, G. (2001). Behaviour of regression models under
random aggregation. In N. J. Tate, & P. M. Atkinson (Eds.) Modelling Scale in
Geographical Information Science. Chichester: John Wiley.
Fotheringham, A. S., & Wong, D. W. S. (1991). The modifiable areal unit problem in
multivariate statistical analysis. Environment and Planning A,23 (7), 1025–1044.
Fotheringham, S. A., & Knudsen, D. C. (1987). Goodness-of-fit Statistics. Norwich:
Geo Books.
Fratar, T. J. (1954). FORECASTING DISTRIBUTION OF INTERZONAL VEHIC-
ULAR TRIPS BY SUCCESSIVE APPROXIMATIONS . Highway Research Board
Proceedings,33 , 376–384.
Freedman, D. A. (2002). The ecological fallacy. online.
URL http://www.stat.berkeley.edu/~census/ecofall.txt
Frick, M., & Axhausen, K. W. (2004). Generating Synthetic Populations using IPF
and Monte Carlo Techniques: Some New Results.
Friendly, M. (1994). Mosaic Displays for Multi-Way Contingency Tables. Journal of
the American Statistical Association,89 (425), 190–200.
Friendly, M. (1995a). A Fourfold Display for 2 by 2 by k Tables. Tech. rep., Psychology
Department, York University.
Friendly, M. (1995b). Conceptual and Visual Models for Categorical Data. The Amer-
ican Statistician,49 (2), 153–160.
Friendly, M. (1998). Extending Mosaic Displays: Marginal, Partial and Conditional
Views of Categorical Data.
Furness, K. (1965). Time Function Iteration. Traffic Engineering and Control,7(7),
458–460.
Gehlke, C., & Biehl, K. (1934). Certain effects of grouping upon the size of the corre-
lation coefficient in census tract material. Journal of the American Statistical Asso-
ciation,29 (185), 169–170.
Gelman, A., Shor, B., Bafumi, J., & Park, D. (2007). Rich state, poor state, red state,
blue state: What’s the matter with connecticut? Quarterly Journal of Political
Science,2(4), 345–367.
Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics,33 (5), 587–606.
Glass, D. V. (1954). Social Mobility in Britain. London: Routledge & Kegan Paul.
Good, I., & Mittal, Y. (1987). The amalgamation and geometry of two-by-two contin-
gency tables. The Annals of Statistics,15 (2), 694–711.
Good, I. J. (1963). Maximum Entropy for Hypothesis Formulation, Especially for
Multidimensional Contingency Tables. The Annals of Mathematical Statistics,34 (3),
911–934.
Goodman, L. A. (1970). The Multivariate analysis of qualitative data. Journal of the
American Statistical Association,65 (329), 226–256.
Goodman, L. A., & Kruskal, W. H. (1954). Measures of Association for Cross Classi-
fication. Journal of the American Statistical Association,49 (268), 732–764.
Gould, P. (1972). Pedagogic Review (Wilson, Entropy in Urban and Regional Mod-
elling). Annals of the Association of American Geographers,62 (4), 689–700.
Grizzle, J. E., Starmer, C. F., & Koch, G. G. (1969). Analysis of categorical data by
linear models. Biometrics,25 (3), pp. 489–504.
URL http://www.jstor.org/stable/2528901
Hartigan, J., & Kleiner, B. (1984). A Mosaic of Television Ratings. The American
Statistician,38 (1), 32–35.
Hastie, T. (1987). A Closer Look at the Deviance. The American Statistician,41 (1),
16–20.
Hendrickx, J. (2005). Using standardised tables for interpreting Loglinear models.
Quality and Quantity,38 (5), 603–620.
Hogg, R. V., & Craig, A. T. (1978). Introduction to Mathematical Statistics. New York:
Macmillan Publishing.
Holt, D. (1979). Log-linear models for contingency table analysis: On the interpretation
of coefficients. Sociological Methods & Research,7(3), 330–336.
Hornik, K. (2011). The R FAQ. ISBN 3-900051-08-9.
URL http://CRAN.R-project.org/doc/FAQ/R-FAQ.html
Ireland, C., & Kullback, S. (1968a). Contingency tables with given marginals.
Biometrika,55 (1), 179–188.
Ireland, C. T., & Kullback, S. (1968b). Minimum Discrimination Information Estima-
tion. Biometrics,24 (3), 707–713.
Jaynes, E. T. (1957). Information Theory and Statistical Mechanics. Physical Review,
106 (4), 620–630.
Jaynes, E. T. (2003). Probability Theory: The Logic of Science. Cambridge: Cambridge
University Press.
Johnston, R., & Pattie, C. (1993). Entropy-Maximising and the Iterative Proportional
Fitting Procedure. Professional Geographer,45 (3), 317–322.
Johnston, R. J. (2000). Modifiable areal unit problem. In R. J. Johnston, D. Gregory,
G. Pratt, & M. Watts (Eds.) The dictionary of human geography. Oxford,
Malden, MA: Blackwell Publishing, 4th ed.
Johnston, R. J., & Hay, A. M. (1982). On the parameters of uniform swing in single-
member constituency electoral systems. Environment and Planning A,14 (1), 61–74.
Johnston, R. J., & Hay, A. M. (1983). Voter Transition Probability Estimates: An
Entropy-Maximizing Approach∗. European Journal of Political Research,11 (1), 93–
98.
Johnston, R. J., & Pattie, C. (1991). Evaluating the use of entropy-maximizing pro-
cedures in the study of voting patterns: 1. Sampling and measurement error in the
flow-of-the-vote matrix and the robustness of estimates. Environment and Planning
A,23 , 411–420.
Knoke, D., & Burke, P. J. (1980). Log-Linear Models. Sage University Paper series on
Quantitative Applications in the Social Sciences, 07-020. Beverly Hills and London:
SAGE Publications Ltd, 1. ed.
Knudsen, D. C., & Fotheringham, S. A. (1986). Matrix Comparison, Goodness-of-Fit
and Spatial Interaction Modeling. International Regional Science Review,10 (2),
127–147.
Kotz, S., & Johnson, N. L. (Eds.) (1983). Encyclopedia of statistical sciences, vol. 4.
New York: Wiley.
Kruithof, J. (1937). Telefonverkeersrekening. De Ingenieur,52 (8), e15–e25.
Ku, H. H., & Kullback, S. (1974). Models in contingency table analysis. The American
Statistician,28 (4), 115–122.
Kuha, J., & Firth, D. (2010). On the index of dissimilarity for lack of fit in loglinear
and log-multiplicative models. Computational Statistics & Data Analysis,In Press,
Corrected Proof , –.
URL http://www.sciencedirect.com/science/article/B6V8V-504123R-1/2/
776828011974f423234c67ae14a41921
Kullback, S. (1959). Information Theory and Statistics. New York: Wiley.
Kullback, S. (1987). The Kullback-Leibler Distance (Letters to the Editor). The Amer-
ican Statistician,41 (4), 340–431.
Lamond, B., & Stewart, N. (1981). Bregman’s Balancing Method. Transportation
Research B (Methodological),15 (4), 239–248.
Lancaster, H. O. (1951). Complex Contingency Tables Treated by the Partition of
$\chi^2$. Journal of the Royal Statistical Society. Series B (Methodological),13 (2),
242–249.
Lill, E. (1891). Das Reisegesetz und seine Anwendung auf den Eisenbahnverkehr(The
Trip Law and its Use for Railway Traffic). Vienna: Spielhagen & Schurich.
Liu, R. (1980). A note on phi-coefficient comparison. Research in Higher Education,
13 (1), 3–8.
Long, J. S. (1984). Estimable Functions in Log-Linear Models. Sociological Methods
Research,12 (4), 399–432.
Messick, D. M., & van de Geer, J. P. (1981). A reversal paradox. Psychological Bulletin,
90 (3), 582–593.
Meyer, D., Zeileis, A., & Hornik, K. (2010). vcd: Visualizing Categorical Data. R
package version 1.2-9.
Mosteller, F. (1968). Association and Estimation in Contingency Tables. Journal of
the American Statistical Association,63 (321), 1–28.
Mueller, R. O. (1996). Basic principles of structural equation modelling: an introduction
to LISREL and EQS. New York: Springer.
Mühlenbein, H., & Höns, R. (2005). The Estimation of Distributions and the Minimum
Relative Entropy Principle. Evolutionary Computation,13 (1), 1–27.
Munro, B. H. (2005). Statistical methods for health care research. Philadephia: Lip-
pincott.
Nelder, J. A., & Wedderburn, R. W. M. (1972). Generalized Linear Models. Journal
of the Royal Statistical Society. Series A (General),135 (3), 370–384.
Nijkamp, P. (1977). Gravity and Entropy Models - the State of the Art. Research
memorandum nr 1977-2. Amsterdam: Vrije Universiteit, Economische Fakulteit,.
Oakes, M. J. (2009). Commentary: Individual, ecological and multilevel fallacies. In-
ternational Journal of Epidemiology,28 , 361–368.
ONS (2006). 2001 United Kingdom Small Area Microdata Licensed File [computer file]
distributed by the Cathie Marsh Centre for Census and Survey Research, University
of Manchester.
ONS (2007). National Statistics 2001 Area Classification of Local Authorities, 2007
Amendment, Office for National Statistics website: http://www.statistics.gov.
uk/about/methodology_by_theme/area_classification/, Accessed on 12.11.2009.
ONS (n.d.). Methods for national statistics 2001 area classification for local authori-
ties.
URL http://www.statistics.gov.uk/about/methodology_by_theme/area_
classification/la/downloads/Methods.pdf
Openshaw, S. (1976). An Empirical study of some spatial interaction models. Envi-
ronment and Planning A,8(1), 23–41.
Openshaw, S. (1977). A geographical solution to scale and aggregation problems in
region-building, partitioning and spatial modelling. Transactions of the British In-
stitute of British Geographers, New Series,2(4), 459–472.
Openshaw, S. (1979). Alternative Methods of Estimating Spatial Interaction Models
and their Performance in Short-term Forecasting. In C. P. Bartels, & R. H. Ketel-
lapper (Eds.) Exploratory and Explanatory Statistical Analysis of Spatial Data, (pp.
201–225). Boston: Martinus Nijhoff Publishing.
Openshaw, S. (1984). Ecological fallacies and the analysis of areal census data. Envi-
ronment and Planning A,16 (1), 17–31.
Openshaw, S., & Taylor, P. (1979). A million or so correlation coefficients: three exper-
iments on the modifiable areal unit problem. In N. Wrigley (Ed.) Statistical Applications
in the Spatial Sciences, (pp. 127–144). London: Pion.
Paperny, V. (2003). Moscow in 1937: Faith, Truth and Reality .
Parta, E. R., Klensin, J. C., & de Sola Pool, I. (1982). The Shortwave Audience in
the USSR: Methods for Improving the Estimates. Communication Research,9(4),
581–606.
Pearson, K. (1900). On the Criterion that a given System of Deviations from the
Probable in the Case of Correlated Systems of Variables is such that it can be rea-
sonably supposed to have arisen from Random Sampling. Philosophical Magazine,
50 , 157–175.
Pearson, K., Lee, A., & Bramley-Moore, L. (1899). Mathematical contributions to
the theory of evolution vi. genetic (reproductive) selection: Inheritance of fertility
in man, and of fecundity in thoroughbread racehorses. Philosophical Transactions
of the Royal Society of London. Series A, Containing Papers of a Mathematical or
Physical Character,192 , 257–330.
Peizer, D. B., & Pratt, J. W. (1968). A normal approximation for binomial, f, beta,
and other common, related tail probabilities, i. Journal of the American Statistical
Association,63 (324), pp. 1416–1456.
URL http://www.jstor.org/stable/2285895
Pitfield, D. (1978). Sub-optimality in Freight Distribution. Transportation Research,
12 (6), 403–409.
Plackett, R. L. (1983). Karl Pearson and the Chi-Squared Test. International Statistical
Review / Revue Internationale de Statistique,51 (1), 59–72.
Planck, M. (1901). Über das Gesetz der Energieverteilung im Normalspectrum. Annalen
der Physik,309 , 553–563.
Pooler, J. (1983). Information theoretic methods of spatial model building: A guide to
the unbiased estimation of the form of probability distributions+. Socio-Economic
Planning Sciences,17 (4), 153–164.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (1992). Numerical
Recipes in Fortran 77: The Art of Scientific Computing. Cambridge: Cambridge
University Press, 2nd ed.
R Development Core Team (2011). R: A Language and Environment for Statistical
Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-
900051-07-0.
URL http://www.R-project.org/
Radlow, R., & Alf, E. F. (1975). An alternate multinomial assessment of the accuracy
of the χ2 test of goodness-of-fit. Journal of the American Statistical Association,
70 (352), 811–813.
Rahman, N. (1968). A Course in Theoretical Statistics. London: Griffin.
Ravenstein, E. G. (1885). The Laws of Migration. Journal of the Statistical Society
of London,48 (2), 167–235.
Raymer, J. (2007). The estimation of international migration flows: a general technique
focused on the origin - destination association structure. Environment and Planning
A,39 (4), 985–995.
Raymer, J. (2008). Obtaining an Overall Picture of Population Movement in the Euro-
pean Union. In J. Raymer, & F. Willekens (Eds.) International Migration in Europe:
Data, Models and Estimates, (pp. 209–234). Chichester: Wiley.
Raymer, J., Bonaguidi, A., & Valentini, A. (2006). Describing and projecting the age
and spatial structures of interregional migration in Italy. Population, Space and
Place,12 (5), 371–388.
Read, T. R., & Cressie, N. A. (1988). Goodness-of-fit Statistics for Discrete Multivariate
Data. Springer Verlag.
Reilly, W. J. (1931). The Law of Retail Gravitation. New York: Knickerbocker Press.
Roberts, C. (1999). Measuring Cuban Public Opinion: Methodology. In Papers and
Proceedings of the Ninth Annual Meeting of the Association for the Study of the
Cuban Economy (ASCE). Coral Gables, Florida.
Roberts, J. M. (2002). Connections between A.K. Romney’s analyses of endogamy
and other developments in log-linear models and network analysis. Social Networks,
24 (3), 185–199.
Robinson, W. S. (1950). Ecological Correlations and the Behavior of Individuals. Amer-
ican Sociological Review,15 (9), 351–357.
246
Rogers, A., Willekens, F., & Raymer, J. (2005). Imposing Age and Spatial Structures
on Inadequate Migration-Flow Datasets. Professional Geographer,55 (1), 56–68.
Romney, K. A. (1971). Measuring Endogamy. In P. Kay (Ed.) Explorations in Mathe-
matical Anthropology, (pp. 191–213). Cambridge MA: MIT Press.
Romney, K. A. (2009). Personal Communication.
Rudas, T. (1998). Odds Ratios in the Analysis of Contingency Tables. Sage University
Papers Series on Quantitative Applications in the Social Sciences 07-119. Thousand
Oaks, CA: Sage Publications.
Särlvik, B., & Crewe, I. (1983). Decade of Dealignment: The Conservative Victory of
1979 and Electoral Trends in the 1970s. Cambridge: Cambridge University Press.
Seligson, M. A. (1999). COMMENTS ON “Measuring Cuban Public Opinion: Method-
ology” by Roberts. In Papers and Proceedings of the Ninth Annual Meeting of the
Association for the Study of the Cuban Economy (ASCE). Coral Gables, Florida.
Selvin, H. C. (1958). Durkheim’s Suicide and Problems of Empirical Research. The
American Journal of Sociology,63 (6), 607–619.
Sen, A., & Smith, T. E. (1995). Gravity Models of Spatial Interaction Behavior. Ad-
vances in Spatial and Network Economics. Berlin: Springer-Verlag.
Senior, M. L. (1979). From gravity modelling to entropy maximizing: a pedagogic
guide. Progress in Human Geography,3, 179–210.
Shannon, C. E., & Weaver, W. (1949). The Mathematical Theory of Communication.
Urbana: University of Illinois Press, 6th ed.
Simpson, E. H. (1951). The Interpretation of Interaction in Contingency Tables. Jour-
nal of the Royal Statistical Society Series B (Methodological),13(2), 238–241.
Simpson, L., & Tranmer, M. (2005). Combining Sample and Census Data in Small Area
Estimates: Iterative Proportional Fitting with Standard Software. The Professional
Geographer ,57 (2), 222–234.
Smith, D. M., Clarke, G. P., & Harland, K. (2009). Improving the synthetic data
generation process in spatial microsimulation models. Environment and Planning A,
41 , 1251–1268.
Snickars, F., & Weibull, J. A. W. (1977). A minimum information principle : Theory
and practice. Regional Science and Urban Economics,7(1-2), 137–168.
SPSS (1995 - 2010). SPSS Statistics 17.0 Algorithms.
URL http://support.spss.com/ProductsExt/SPSS/ESD/17/Download/User%
20Manuals/English/SPSS%20Statistics%2017.0%20Algorithms.pdf
Stabler, B., & Gregor, B. (2003). ipf.r.
URL http://tolstoy.newcastle.edu.au/R/e4/help/08/03/5789.html
Stephan, F. F. (1942). An Iterative Method of Adjusting Sample Frequency Tables
When Expected Marginal Totals are Known. The Annals of Mathematical Statistics,
13 (2), 166–178.
Stephan, F. F., Deming, W. E., & Hansen, M. H. (1940). The Sampling Procedure
of the 1940 Population Census. Journal of the American Statistical Association,
35 (212), 615–630.
Stewart, J. Q. (1941). An Inverse Distance Variation for Certain Social Influences.
Science,93 (2404), 89–90.
Stewart, J. Q. (1948). Demographic Gravitation: Evidence and Applications. Sociom-
etry,11 (1/2), 31–58.
Stigler, S. (1999). Statistics on the Table: The History of Statistical Concepts and
Methods. Cambridge Massachusetts: Harvard University Press.
Stigler, S. (2002). The Missing Early History of Contingency Tables. Annales de la
Faculte des Sciences de Toulouse,XI (4), 563–573.
Stone, J. R. N., & Brown, A. (1962). A computable model of economic growth. London:
Chapman & Hall.
Strauss, D. J. (1977). Measuring Endogamy. Social Science Research,6(3), 225–245.
Strauss, D. J., & Romney, K. A. (1982). Log-Linear Multiplicative Models for the
Analysis of Endogamy. Ethnology,21 (1), 79–99.
Svalastoga, K. (1959). Prestige, Class and Mobility. London: William Heinemann.
Tan, P.-N., Kumar, V., & Srivastava, J. (2004). Selecting the Right objective measure
for association analysis. Information Systems,29 (4), 293–313.
Thorsen, I., & Gitlesen, J. P. (1998). Empirical Evaluation of Alternative Model Speci-
fications to Predict Commuting Flows. Journal of Regional Science,38 (2), 273–292.
Upton, G. J. (1978). The Analysis of Cross-Tabulated Data. Chichester: Wiley.
Upton, G. J., & Fingleton, B. (1979). Log-Linear Models in Geography. Transactions
of the British Institute of British Geographers, New Series,4(1), 103–115.
U.S., C. B. (1931). Statistical Abstract of the United States 1931. Washington: United
States Government Printing Office.
Voas, D., & Williamson, P. (2001). Evaluating Goodness-of-Fit Measures for Synthetic
Microdata. Geographical & Environmental Modelling,5(2), 177–200.
Vokey, J. R. (1997). Collapsing Multiway Contingency Tables: Simpson’s Paradox
and homogenization. Behavior Research Methods, Instruments & Computers,29 (2),
210–215.
Wakefield, J. (2004). Ecological inference for 2×2 tables. Journal of the Royal Statistical
Society: Series A,167 (3), 385–445.
Webber, M. (1977). Pedagogy Again: What is Entropy? Annals of the Association of
American Geographers,67 (2), 254–266.
Wheaton, B., Muthen, B., Alwin, D. F., & Summers, G. F. (1977). Assessing reliability
and stability in panel models. Sociological Methodology,8, 84–136.
URL http://www.jstor.org/stable/270754
Wickens, T. D. (1989). Multiway Contingency Tables Analysis for the Social Sciences.
Hillsdale NJ: Lawrence Erlbaum Associates.
Willekens, F. (1980). Entropy, multiproportional adjustment and analysis of contin-
gency tables. Sistemi Urbani,2, 171–201.
Willekens, F. (1982). Multidimensional population analysis with incomplete data. In
A. Rogers, & K. C. Land (Eds.) Multidimensional mathematical demography, (pp.
43–111). New York: Academic Press.
Willekens, F. (1983). Log-linear Modelling of Spatial Interaction. Papers of the regional
Science Association,52 (1), 187–205.
Willekens, F. (1994). Monitoring International Migration Flows in Europe: Towards a
Statistical Data Base Combining Data From Different Sources. European Journal of
Population,10 , 1–42.
Willekens, F. (1999). Modeling Approaches to the Indirect Estimation of Migration
Flows: From Entropy to EM. Mathematical Population Studies,7(3), 239–278.
Williamson, P., Birkin, M., & Rees, P. H. (1998). The estimation of population mi-
crodata by using data from small area statistics and samples of anonymised records.
Environment and Planning A,30 (5), 785–816.
Wilson, A. G. (1967). A Statistical Theory of Spatial Distribution Models. Transporta-
tion Research A,1, 253–269.
Wilson, A. G. (1970). Entropy in Urban and Regional Modelling. London: Pion Limited.
Wilson, E., & Hilferty, M. (1931). The distribution of chi-square. Proceedings of the
National Academy of Sciences,17 , 684–688.
Wong, D. W. (1992). The Reliability of Using the Iterative Proportional Fitting Pro-
cedure. Professional Geographer,44 (3), 340–348.
Wrigley, N. (1980). Log-Linear Models in Geography: Comments on the Recent Article
by Upton and Fingleton. Transactions of the British Institute of British Geographers,
New Series,5(1), 113–117.
Yates, F. (1934). Contingency tables involving small numbers and the χ2 test. Supple-
ment to the Journal of the Royal Statistical Society,1(2), 217–235.
Young, E. C. (1924). The movement of the farm population (Bulletin 426). Ithaca:
New York Agricultural Experiment Station.
Yule, G. U. (1912). On the Methods of Measuring Association Between Two Attributes.
Journal of the Royal Statistical Society,75 (6), 579–652.
Yule, U. (1903). Notes on the theory of association of attributes in statistics.
Biometrika,2(2), 121–134.
Appendix A
SAM Variable list
The following table lists the categories and the proportional distribution for the 57
variables selected from the Samples of Anonymised Records - Small Area Microdata
(ONS, 2006). All the values are calculated from the reduced data set used (only England
and Wales) so the percentages apply to N = 2,621,560 (and therefore do not coincide
with the values published on the ONS website).
Variable — Categories Percentage
Accommodation type
Detached or semi-detached 58.81 %
Terraced house 26.31 %
Purpose-built flats, Flat-converted or shared house (including bedsits), Flat, Maisonette in commerc. 13.08 %
na - communal establishment 1.8 %
Age of respondents
0-4 5.88 %
5-9 6.3 %
10-15 7.86 %
16-19 5.17 %
20-24 6.51 %
25-29 6.54 %
30-39 15.36 %
40-49 13.23 %
50-59 12.48 %
60-64 4.83 %
65-74 8.31 %
75-84 5.59 %
85+ 1.93 %
Use of bath/shower/toilet
Sole use 97.88 %
Shared use/none 0.32 %
na - communal establishment 1.8 %
Cars/vans owned or available
No car 18.93 %
1 car 41.19 %
2 or more cars 38.08 %
na - communal establishment 1.8 %
Communal establishment type
NHS 0.07 %
LA/HA/Vol./Private Co. etc 0.8 %
No code required 99.14 %
Central heating
Yes in some or all rooms 91.16 %
No 7.04 %
na - communal establishment 1.8 %
Status in communal establishment
Staff or relative of staff 0.14 %
Resident non-staff 1.66 %
na - not in communal establishment 98.2 %
Country of birth
England 82.86 %
Scotland 1.57 %
Wales 5.35 %
Northern Ireland / UK part not specified 1.34 %
All other countries of birth 7.91 %
Not usual resident 0.97 %
No of residents per room
Up to and including 0.75 77.5 %
Over 0.75 and up to 1 16.69 %
Over 1 4.01 %
Not in household 1.81 %
Distance of Move for Migrants (km)
0-4 km 5.63 %
5-19 km 2.22 %
20-99 km 1.32 %
100 + km 1.39 %
Outside UK 0.7 %
na - not a migrant 88.75 %
Distance to work
0-4 km 18.01 %
5-19 km 15.04 %
20 + km 5.8 %
At home 4.14 %
No fixed place 1.98 %
na - not in work, not usual resident 55.04 %
Economic activity (last week)
In employment (employee or self-employed) 44.96 %
Unemployed 2.65 %
Student not economically active 3.37 %
Other economically inactive 20.57 %
Not applicable 28.44 %
Ethnic Group
White British 86.64 %
White Irish 1.22 %
Other White 2.57 %
Mixed: White & Black Caribbean, Mixed: White & Black African, Black other 0.78 %
Mixed: White & Asian, Other mixed 0.66 %
Indian (Asian/Asian British) 1.97 %
Pakistani (Asian/Asian British) 1.37 %
Bangladeshi(Asian/Asian British) 0.53 %
Other Asian (Asian/Asian British) 0.46 %
Caribbean (Black/Black British) 1.07 %
African (Black/Black British) 0.91 %
Chinese 0.43 %
Other 0.42 %
na - not resident 0.97 %
Ever worked
Yes 22.76 %
No 3.84 %
na - in work, out of range, not resident 73.41 %
Family type
Lone parent 11.61 %
Married/cohabiting couple - no children 23.25 %
Married/cohabiting couple - children 45.41 %
Ungrouped individual (not in a family) 16.97 %
not usual resident or in com. est. 2.77 %
Dependent children in family
No dependent children 34.1 %
Dependent children 46.17 %
na - not in family, not resident, in com.est. 19.73 %
Economic position of Family reference person
In employment 60.58 %
Unemployed 1.81 %
Economically inactive 18.72 %
na - not resident in a family 18.89 %
NS-SEC socio-economic classification of Family reference person
1. Higher managerial and professional occupations 10.28 %
2. Lower managerial and professional occupations 17.22 %
3. Intermediate occupations 5.33 %
4. Small employers and own account workers 8.9 %
5. Lower supervisory and technical occupations 8.12 %
6. Semi-routine occupations 8.4 %
7. Routine occupations 8.42 %
8. Never worked or long-term unemployed 2.13 %
Full-time students 0.49 %
Other (Not classified) 11.81 %
na - ungrouped individual or no occupation recorded 18.89 %
Sex of Family reference person
Male 57.81 %
Female 23.3 %
na - ungrouped individual 18.89 %
Generation indicator
Ungrouped individual 17.93 %
Upper generation - member of couple or lone parent 51.36 %
Lower generation - child in family 28.91 %
na - in communal establishment 1.8 %
General health over last 12 months
Good 67.89 %
Fairly good 22.02 %
Not good 9.12 %
na - Not usually resident 0.97 %
Household education indicator
Household has educational attainment 57.94 %
Level 2 or eq. not achieved by any 16-64 year old 40.25 %
na - Not in household 1.81 %
Household employment indicator
No non-student aged 16-74 sick or unemployed 84.05 %
Non-student aged 16-74 sick or unemployed 14.14 %
na - not in household 1.81 %
Household housing indicator
Not overcrowded or lacking amenities 83.52 %
Overcrowded / lacks bath, shower, wc or heating 14.67 %
na - not in household 1.81 %
Household health and disability indicator
No one in household has a LLTI or poor health 64.21 %
Household member has LLTI or poor health 33.98 %
na - not in household 1.81 %
Household headship (ODPM)
Household representative 41.2 %
Concealed household representative 0.27 %
Other 55.76 %
na - not usual resident or in comm. est. 2.77 %
Number of carers in the household
None 78.12 %
One 12.98 %
Two or more 7.08 %
na - not in household with residents 1.81 %
Number of employed adults in the household
No earners 25.34 %
One earner 26.95 %
Two ore more earners 45.91 %
na - not in household / away from empty hh 1.81 %
Number in household with LLTI
None 66.77 %
One 23.48 %
Two or more 7.94 %
na - not in household / away from empty hh 1.81 %
Number in household with poor health
None 80.75 %
One 14.66 %
Two or more 2.79 %
na - not in household / away from empty hh 1.8 %
Number of usual residents in household
None to one 12.45 %
Two to four 70.23 %
Five or more 15.52 %
na - in communal establishment 1.8 %
Hours worked weekly
1-15 3.75 %
16-30 7.37 %
31-37 8.19 %
38-48 18.38 %
49 + 7.27 %
na - no employment record or comm. est. 55.04 %
Social grade of Household reference person
A professional and B middle manager 22.59 %
C1 all other non-manual workers 25.03 %
C2 All skilled manual workers 16.1 %
D All semi-skilled and unskilled manual workers 18.4 %
E On benefit/unemployed 9.59 %
na - not in household or no employment record 8.29 %
Year last worked
In employment 44.96 %
Never worked 3.84 %
2000 to 2001 5.46 %
1996 to 1999 5.74 %
Before 1996 11.55 %
na out of age range, not usual resident 28.44 %
Limiting long-term illness
Yes 18.08 %
No 80.95 %
na - not usual resident 0.97 %
Lowest floor level of household living accommodation
Basement 2.88 %
Ground floor 86.97 %
First floor 5.47 %
Above first floor 2.87 %
na - communal establishment 1.8 %
Marital status
Single (never married) 44.76 %
Married / re-married 40.23 %
Separated (but still legally married) / Divorced / widowed 15.01 %
Migration indicator
Same address 86.97 %
No usual address a year ago 0.81 %
Move within LAD area 6.17 %
Move from outside LAD area 4.38 %
Move from outside the UK 0.7 %
na 0.97 %
Region of origin
Same address as now 86.97 %
No usual address 0.81 %
Migrant from outside UK 0.7 %
A North East 0.48 %
B North West 1.28 %
D Yorkshire and the Humber 1.02 %
E East Midlands 0.83 %
F West Midlands 0.97 %
G East of England 1.02 %
J South East 1.67 %
K South West 1.02 %
I Inner London 0.73 %
O Outer London 0.87 %
S Scotland 0.09 %
W Wales 0.55 %
N Northern Ireland 0.02 %
na - student living away 0.97 %
Occupancy rating household
At least two rooms more than required 46.73 %
One room more than required 24.43 %
Number of rooms equals number required 18.54 %
Number of rooms is lower than those required 8.49 %
not in hhd with resident 1.81 %
Professional qualifications
Does not have professional qualifications 55.35 %
Has professional qualification 16.21 %
na 28.44 %
Number of hours care provided per week
Provides no care 89.1 %
Provides 1-19 hours care 6.78 %
Provides 20-49 hours care 1.09 %
Provides 50 or more hours care 2.06 %
na - not usual resident 0.97 %
Level of Highest Qualifications (Aged 16 to 74)
No qualifications 20.81 %
Level 1 11.85 %
Level 2 13.87 %
Level 3 5.93 %
Level 4/5 14.13 %
Other qualifications or level unknown 4.97 %
n/a: out of age range or student living away 28.44 %
Religion
Christian 71.08 %
Buddhist 0.27 %
Hindu 1.06 %
Jewish 0.5 %
Muslim 2.96 %
Sikh 0.62 %
Other 1.12 %
No religion 13.82 %
Religion not stated 7.6 %
na - not resident 0.97 %
Relationship to Household reference person
Household reference person 41.17 %
Husband or wife 19.48 %
Partner 3.72 %
Son or daughter/ Step-child 29.01 %
Other related 1.96 %
Unrelated 2.47 %
Not known 0.38 %
na - -9 Not in household with usual residents 1.81 %
Number of rooms occupied in household space
One to two 2 %
Three to four 20.69 %
Five to six 50.12 %
Seven or more 25.39 %
na - in communal establishment 1.8 %
Accommodation self-contained
Yes 98.01 %
No 0.19 %
na 1.8 %
Sex
Male 48.67 %
Female 51.33 %
Household with students away during term time
None away 94.92 %
One or more away 3.28 %
na - not in household 1.8 %
Schoolchild or student in full-time education
Yes 21.64 %
No 78.36 %
Supervisor / Foreman
Yes 20.14 %
No 47.58 %
na - never worked, not resident, out of age range 32.28 %
Tenure of accommodation
Owns outright/with mortgage/shared ownership 70.29 %
Rents Housing Association Co-op etc. 17.37 %
Private rented or lives rent free 10.54 %
na - not usual resident or in communal establishment 1.8 %
Term time address of students or schoolchildren
Living with parents 18.29 %
not living with parents 2.38 %
na - not resident student 79.33 %
Transport to work
Work mainly at or from home 4.14 %
Train, inc. underground, metro, light rail, tram etc. 3.17 %
Bus, minibus, coach 3.32 %
Motor cycle, scooter or moped / taxi 0.73 %
Car 27.64 %
Bicycle 1.25 %
On foot / other 4.71 %
na - not in work 55.04 %
Size of workforce
1 to 9 18.55 %
10-24 10.34 %
25-499 25.75 %
500 or more 13.08 %
na - not resident, out of age range, never worked 32.28 %
Workplace
No fixed place 1.98 %
Work (or study) mainly at home 4.14 %
Inside LAD area of residence 21.04 %
Outside LAD area but inside GB 17.65 %
Northern Ireland 0 %
Outside LGD, but within NI (NI only) 0 %
Outside NI, but within UK (NI only) 0 %
Outside UK 0.15 %
na - no employment record 55.04 %
NS-SEC 8 classes
1.1 Large employers and higher managerial occupations 2.46 %
1.2 Higher professional occupations 3.61 %
2 Lower managerial and professional occupations 13.26 %
3 Intermediate occupations 6.73 %
4 Small employers and own account workers 5 %
5 Lower supervisory and technical occupations 5.12 %
6 Semi-routine 8.35 %
7 Routine occupations 6.49 %
8 Never worked and long-term unemployed 2.68 %
na - not defined 46.3 %
Appendix B
List of Local authorities and their geographic and
geodemographic associations
ID Local Authority County Region Supergroups Groups Subgroups
00AB Barking and Dagenham Greater London London 1 Cities and Services 1.2 Centres with Industry 1.2.3
00AC Barnet Greater London London 2 London Suburbs 2.4 London Suburbs 2.4.6
00AD Bexley Greater London London 5 Prospering UK 5.8 New and Growing Towns 5.8.15
00AE Brent Greater London London 4 London Cosmopolitan 4.6 London Cosmopolitan 4.6.11
00AF Bromley Greater London London 1 Cities and Services 1.3 Thriving London Periphery 1.3.5
00AG Camden Greater London London 3 London Centre 3.5 London Centre 3.5.8
00AH Croydon Greater London London 2 London Suburbs 2.4 London Suburbs 2.4.7
00AJ Ealing Greater London London 2 London Suburbs 2.4 London Suburbs 2.4.6
00AK Enfield Greater London London 2 London Suburbs 2.4 London Suburbs 2.4.7
00AL Greenwich Greater London London 2 London Suburbs 2.4 London Suburbs 2.4.7
00AM Hackney Greater London London 4 London Cosmopolitan 4.6 London Cosmopolitan 4.6.10
00AN Hammersmith and Fulham Greater London London 3 London Centre 3.5 London Centre 3.5.8
00AP Haringey Greater London London 4 London Cosmopolitan 4.6 London Cosmopolitan 4.6.10
00AQ Harrow Greater London London 2 London Suburbs 2.4 London Suburbs 2.4.6
00AR Havering Greater London London 5 Prospering UK 5.8 New and Growing Towns 5.8.15
00AS Hillingdon Greater London London 1 Cities and Services 1.3 Thriving London Periphery 1.3.5
00AT Hounslow Greater London London 2 London Suburbs 2.4 London Suburbs 2.4.6
00AU Islington Greater London London 3 London Centre 3.5 London Centre 3.5.8
00AW Kensington and Chelsea Greater London London 3 London Centre 3.5 London Centre 3.5.8
00AX Kingston upon Thames Greater London London 1 Cities and Services 1.3 Thriving London Periphery 1.3.5
00AY Lambeth Greater London London 4 London Cosmopolitan 4.6 London Cosmopolitan 4.6.10
00AZ Lewisham Greater London London 4 London Cosmopolitan 4.6 London Cosmopolitan 4.6.10
00BA Merton Greater London London 2 London Suburbs 2.4 London Suburbs 2.4.6
00BB Newham Greater London London 4 London Cosmopolitan 4.6 London Cosmopolitan 4.6.11
00BC Redbridge Greater London London 2 London Suburbs 2.4 London Suburbs 2.4.6
00BD Richmond upon Thames Greater London London 1 Cities and Services 1.3 Thriving London Periphery 1.3.5
00BE Southwark Greater London London 4 London Cosmopolitan 4.6 London Cosmopolitan 4.6.10
00BF Sutton Greater London London 1 Cities and Services 1.3 Thriving London Periphery 1.3.5
00BG Tower Hamlets Greater London London 3 London Centre 3.5 London Centre 3.5.9
00BH Waltham Forest Greater London London 2 London Suburbs 2.4 London Suburbs 2.4.7
00BJ Wandsworth Greater London London 3 London Centre 3.5 London Centre 3.5.8
00BK Westminster Greater London London 3 London Centre 3.5 London Centre 3.5.8
00BL Bolton Greater Manchester North West 1 Cities and Services 1.2 Centres with Industry 1.2.2
00BM Bury Greater Manchester North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
00BN Manchester Greater Manchester North West 1 Cities and Services 1.2 Centres with Industry 1.2.3
00BP Oldham Greater Manchester North West 1 Cities and Services 1.2 Centres with Industry 1.2.2
00BQ Rochdale Greater Manchester North West 1 Cities and Services 1.2 Centres with Industry 1.2.2
00BR Salford Greater Manchester North West 1 Cities and Services 1.1 Regional Centres 1.1.1
00BS Stockport Greater Manchester North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
00BT Tameside Greater Manchester North West 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00BU Trafford Greater Manchester North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
00BW Wigan Greater Manchester North West 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00BX Knowsley Merseyside North West 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00BY Liverpool Merseyside North West 1 Cities and Services 1.1 Regional Centres 1.1.1
00BZ St.Helens Merseyside North West 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00CA Sefton Merseyside North West 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00CB Wirral Merseyside North West 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00CC Barnsley South Yorkshire Yorkshire and The Humber 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00CE Doncaster South Yorkshire Yorkshire and The Humber 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00CF Rotherham South Yorkshire Yorkshire and The Humber 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00CG Sheffield South Yorkshire Yorkshire and The Humber 1 Cities and Services 1.1 Regional Centres 1.1.1
00CH Gateshead Tyne and Wear North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00CJ Newcastle upon Tyne Tyne and Wear North East 1 Cities and Services 1.1 Regional Centres 1.1.1
00CK North Tyneside Tyne and Wear North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.21
00CL South Tyneside Tyne and Wear North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00CM Sunderland Tyne and Wear North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00CN Birmingham West Midlands West Midlands 1 Cities and Services 1.2 Centres with Industry 1.2.3
00CQ Coventry West Midlands West Midlands 1 Cities and Services 1.2 Centres with Industry 1.2.2
00CR Dudley West Midlands West Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00CS Sandwell West Midlands West Midlands 1 Cities and Services 1.2 Centres with Industry 1.2.3
00CT Solihull West Midlands West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
00CU Walsall West Midlands West Midlands 1 Cities and Services 1.2 Centres with Industry 1.2.2
00CW Wolverhampton West Midlands West Midlands 1 Cities and Services 1.2 Centres with Industry 1.2.3
00CX Bradford West Yorkshire Yorkshire and The Humber 1 Cities and Services 1.2 Centres with Industry 1.2.2
00CY Calderdale West Yorkshire Yorkshire and The Humber 1 Cities and Services 1.2 Centres with Industry 1.2.2
00CZ Kirklees West Yorkshire Yorkshire and The Humber 1 Cities and Services 1.2 Centres with Industry 1.2.2
00DA Leeds West Yorkshire Yorkshire and The Humber 1 Cities and Services 1.1 Regional Centres 1.1.1
00DB Wakefield West Yorkshire Yorkshire and The Humber 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00EB Hartlepool Durham North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00EC Middlesbrough North Yorkshire North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00EE Redcar and Cleveland North Yorkshire North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00EF Stockton-on-Tees Durham North East 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00EH Darlington Durham North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00ET Halton Cheshire North West 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00EU Warrington Cheshire North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
00EX Blackburn with Darwen Lancashire North West 1 Cities and Services 1.2 Centres with Industry 1.2.2
00EY Blackpool Lancashire North West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
00FA Kingston upon Hull, City of East Yorkshire Yorkshire and The Humber 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00FB East Riding of Yorkshire East Yorkshire Yorkshire and The Humber 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
00FC North East Lincolnshire Lincolnshire Yorkshire and The Humber 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00FD North Lincolnshire Lincolnshire Yorkshire and The Humber 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00FF York North Yorkshire Yorkshire and The Humber 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
00FK Derby Derbyshire East Midlands 1 Cities and Services 1.2 Centres with Industry 1.2.2
00FN Leicester Leicestershire East Midlands 1 Cities and Services 1.2 Centres with Industry 1.2.3
00FP Rutland Leicestershire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
00FY Nottingham Nottinghamshire East Midlands 1 Cities and Services 1.2 Centres with Industry 1.2.3
00GA Herefordshire, County of Worcestershire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
00GF Telford and Wrekin Shropshire West Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00GL Stoke-on-Trent Staffordshire West Midlands 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00HA Bath and North East Somerset Somerset South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
00HB Bristol, City of Bristol South West 1 Cities and Services 1.1 Regional Centres 1.1.1
00HC North Somerset Somerset South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
00HD South Gloucestershire Gloucestershire South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
00HG Plymouth Devon South West 1 Cities and Services 1.1 Regional Centres 1.1.1
00HH Torbay Devon South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
00HN Bournemouth Dorset South West 1 Cities and Services 1.1 Regional Centres 1.1.1
00HP Poole Dorset South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
00HX Swindon Wiltshire South West 5 Prospering UK 5.8 New and Growing Towns 5.8.15
00JA Peterborough Cambridgeshire East of England 5 Prospering UK 5.8 New and Growing Towns 5.8.15
00KA Luton Bedfordshire East of England 2 London Suburbs 2.4 London Suburbs 2.4.6
00KF Southend-on-Sea Essex East of England 1 Cities and Services 1.1 Regional Centres 1.1.1
00KG Thurrock Essex East of England 5 Prospering UK 5.8 New and Growing Towns 5.8.15
00LC Medway Kent South East 5 Prospering UK 5.8 New and Growing Towns 5.8.15
00MA Bracknell Forest Berkshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
00MB West Berkshire Berkshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
00MC Reading Berkshire South East 1 Cities and Services 1.3 Thriving London Periphery 1.3.5
00MD Slough Berkshire South East 2 London Suburbs 2.4 London Suburbs 2.4.6
00ME Windsor and Maidenhead Berkshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
00MF Wokingham Berkshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
00MG Milton Keynes Buckinghamshire South East 5 Prospering UK 5.8 New and Growing Towns 5.8.15
00ML Brighton and Hove East Sussex South East 1 Cities and Services 1.1 Regional Centres 1.1.1
00MR Portsmouth Hampshire South East 1 Cities and Services 1.1 Regional Centres 1.1.1
00MS Southampton Hampshire South East 1 Cities and Services 1.1 Regional Centres 1.1.1
00MW Isle of Wight Isle of Wight South East 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
00NA Isle of Anglesey Wales Wales 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
00NC Gwynedd Wales Wales 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
00NE Conwy Wales Wales 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
00NG Denbighshire Wales Wales 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
00NJ Flintshire Wales Wales 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00NL Wrexham Wales Wales 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00NN Powys Wales Wales 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
00NQ Ceredigion Wales Wales 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
00NS Pembrokeshire Wales Wales 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
00NU Carmarthenshire Wales Wales 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
00NX Swansea Wales Wales 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00NZ Neath Port Talbot Wales Wales 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00PB Bridgend Wales Wales 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
00PD The Vale of Glamorgan Wales Wales 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
00PF Rhondda Cynon Taff Wales Wales 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00PH Merthyr Tydfil Wales Wales 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00PK Caerphilly Wales Wales 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00PL Blaenau Gwent Wales Wales 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00PM Torfaen Wales Wales 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00PP Monmouthshire Wales Wales 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
00PR Newport Wales Wales 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
00PT Cardiff Wales Wales 1 Cities and Services 1.1 Regional Centres 1.1.1
09UC Mid Bedfordshire Bedfordshire East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
09UD Bedford Bedfordshire East of England 5 Prospering UK 5.8 New and Growing Towns 5.8.15
09UE South Bedfordshire Bedfordshire East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
11UB Aylesbury Vale Buckinghamshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
11UC Chiltern Buckinghamshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
11UE South Bucks Buckinghamshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
11UF Wycombe Buckinghamshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
12UB Cambridge Cambridgeshire East of England 1 Cities and Services 1.3 Thriving London Periphery 1.3.4
12UC East Cambridgeshire Cambridgeshire East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
12UD Fenland Cambridgeshire East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
12UE Huntingdonshire Cambridgeshire East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
12UG South Cambridgeshire Cambridgeshire East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
13UB Chester Cheshire North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
13UC Congleton Cheshire North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
13UD Crewe and Nantwich Cheshire North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
13UE Ellesmere Port & Neston Cheshire North West 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
13UG Macclesfield Cheshire North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
13UH Vale Royal Cheshire North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
15UB Caradon Cornwall and Isles of Scilly South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
15UC Carrick Cornwall and Isles of Scilly South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
15UD Kerrier Cornwall and Isles of Scilly South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
15UE North Cornwall Cornwall and Isles of Scilly South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
15UG Restormel Cornwall and Isles of Scilly South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
16UB Allerdale Cumbria North West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
16UC Barrow-in-Furness Cumbria North West 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
16UD Carlisle Cumbria North West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
16UE Copeland Cumbria North West 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
16UF Eden Cumbria North West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
16UG South Lakeland Cumbria North West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
17UB Amber Valley Derbyshire East Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
17UC Bolsover Derbyshire East Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
17UD Chesterfield Derbyshire East Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
17UF Derbyshire Dales Derbyshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
17UG Erewash Derbyshire East Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
17UH High Peak Derbyshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
17UJ North East Derbyshire Derbyshire East Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
17UK South Derbyshire Derbyshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
18UB East Devon Devon South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
18UC Exeter Devon South West 1 Cities and Services 1.1 Regional Centres 1.1.1
18UD Mid Devon Devon South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
18UE North Devon Devon South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
18UG South Hams Devon South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
18UH Teignbridge Devon South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
18UK Torridge Devon South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
18UL West Devon Devon South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
19UC Christchurch Dorset South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
19UD East Dorset Dorset South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
19UE North Dorset Dorset South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
19UG Purbeck Dorset South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
19UH West Dorset Dorset South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
19UJ Weymouth and Portland Dorset South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
20UB Chester-le-Street Durham North East 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
20UD Derwentside Durham North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
20UE Durham Durham North East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
20UF Easington Durham North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
20UG Sedgefield Durham North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
20UH Teesdale Durham North East 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
20UJ Wear Valley Durham North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
21UC Eastbourne East Sussex South East 1 Cities and Services 1.1 Regional Centres 1.1.1
21UD Hastings East Sussex South East 1 Cities and Services 1.1 Regional Centres 1.1.1
21UF Lewes East Sussex South East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
21UG Rother East Sussex South East 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
21UH Wealden East Sussex South East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
22UB Basildon Essex East of England 5 Prospering UK 5.8 New and Growing Towns 5.8.15
22UC Braintree Essex East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
22UD Brentwood Essex East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
22UE Castle Point Essex East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
22UF Chelmsford Essex East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
22UG Colchester Essex East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
22UH Epping Forest Essex East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
22UJ Harlow Essex East of England 5 Prospering UK 5.8 New and Growing Towns 5.8.15
22UK Maldon Essex East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
22UL Rochford Essex East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
22UN Tendring Essex East of England 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
22UQ Uttlesford Essex East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
23UB Cheltenham Gloucestershire South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
23UC Cotswold Gloucestershire South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
23UD Forest of Dean Gloucestershire South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
23UE Gloucester Gloucestershire South West 5 Prospering UK 5.8 New and Growing Towns 5.8.15
23UF Stroud Gloucestershire South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
23UG Tewkesbury Gloucestershire South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
24UB Basingstoke and Deane Hampshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
24UC East Hampshire Hampshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
24UD Eastleigh Hampshire South East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
24UE Fareham Hampshire South East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
24UF Gosport Hampshire South East 5 Prospering UK 5.8 New and Growing Towns 5.8.15
24UG Hart Hampshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
24UH Havant Hampshire South East 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
24UJ New Forest Hampshire South East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
24UL Rushmoor Hampshire South East 5 Prospering UK 5.8 New and Growing Towns 5.8.15
24UN Test Valley Hampshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
24UP Winchester Hampshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
26UB Broxbourne Hertfordshire East of England 5 Prospering UK 5.8 New and Growing Towns 5.8.15
26UC Dacorum Hertfordshire East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
26UD East Hertfordshire Hertfordshire East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
26UE Hertsmere Hertfordshire East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
26UF North Hertfordshire Hertfordshire East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
26UG St Albans Hertfordshire East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
26UH Stevenage Hertfordshire East of England 5 Prospering UK 5.8 New and Growing Towns 5.8.15
26UJ Three Rivers Hertfordshire East of England 5 Prospering UK 5.9 Prospering Southern England 5.9.16
26UK Watford Hertfordshire East of England 1 Cities and Services 1.3 Thriving London Periphery 1.3.5
26UL Welwyn Hatfield Hertfordshire East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
29UB Ashford Kent South East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
29UC Canterbury Kent South East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
29UD Dartford Kent South East 5 Prospering UK 5.8 New and Growing Towns 5.8.15
29UE Dover Kent South East 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
29UG Gravesham Kent South East 5 Prospering UK 5.8 New and Growing Towns 5.8.15
29UH Maidstone Kent South East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
29UK Sevenoaks Kent South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
29UL Shepway Kent South East 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
29UM Swale Kent South East 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
29UN Thanet Kent South East 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
29UP Tonbridge and Malling Kent South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
29UQ Tunbridge Wells Kent South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
30UD Burnley Lancashire North West 1 Cities and Services 1.2 Centres with Industry 1.2.2
30UE Chorley Lancashire North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
30UF Fylde Lancashire North West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
30UG Hyndburn Lancashire North West 1 Cities and Services 1.2 Centres with Industry 1.2.2
30UH Lancaster Lancashire North West 1 Cities and Services 1.1 Regional Centres 1.1.1
30UJ Pendle Lancashire North West 1 Cities and Services 1.2 Centres with Industry 1.2.2
30UK Preston Lancashire North West 1 Cities and Services 1.2 Centres with Industry 1.2.2
30UL Ribble Valley Lancashire North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
30UM Rossendale Lancashire North West 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
30UN South Ribble Lancashire North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
30UP West Lancashire Lancashire North West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
30UQ Wyre Lancashire North West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
31UB Blaby Leicestershire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
31UC Charnwood Leicestershire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
31UD Harborough Leicestershire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
31UE Hinckley and Bosworth Leicestershire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
31UG Melton Leicestershire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
31UH North West Leicestershire Leicestershire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
31UJ Oadby and Wigston Leicestershire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
32UB Boston Lincolnshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
32UC East Lindsey Lincolnshire East Midlands 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
32UD Lincoln Lincolnshire East Midlands 1 Cities and Services 1.1 Regional Centres 1.1.1
32UE North Kesteven Lincolnshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
32UF South Holland Lincolnshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
32UG South Kesteven Lincolnshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
32UH West Lindsey Lincolnshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
33UB Breckland Norfolk East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
33UC Broadland Norfolk East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
33UD Great Yarmouth Norfolk East of England 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
33UE King’s Lynn and West Norfolk Norfolk East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
33UF North Norfolk Norfolk East of England 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
33UG Norwich Norfolk East of England 1 Cities and Services 1.1 Regional Centres 1.1.1
33UH South Norfolk Norfolk East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
34UB Corby Northamptonshire East Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
34UC Daventry Northamptonshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
34UD East Northamptonshire Northamptonshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
34UE Kettering Northamptonshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
34UF Northampton Northamptonshire East Midlands 5 Prospering UK 5.8 New and Growing Towns 5.8.15
34UG South Northamptonshire Northamptonshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
34UH Wellingborough Northamptonshire East Midlands 5 Prospering UK 5.8 New and Growing Towns 5.8.15
35UB Alnwick Northumberland North East 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
35UC Berwick-upon-Tweed Northumberland North East 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
35UD Blyth Valley Northumberland North East 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
35UE Castle Morpeth Northumberland North East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
35UF Tynedale Northumberland North East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
35UG Wansbeck Northumberland North East 7 Mining and Manufacturing 7.11 Industrial Hinterlands 7.11.20
36UB Craven North Yorkshire Yorkshire and The Humber 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
36UC Hambleton North Yorkshire Yorkshire and The Humber 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
36UD Harrogate North Yorkshire Yorkshire and The Humber 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
36UE Richmondshire North Yorkshire Yorkshire and The Humber 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
36UF Ryedale North Yorkshire Yorkshire and The Humber 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
36UG Scarborough North Yorkshire Yorkshire and The Humber 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
36UH Selby North Yorkshire Yorkshire and The Humber 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
37UB Ashfield Nottinghamshire East Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
37UC Bassetlaw Nottinghamshire East Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
37UD Broxtowe Nottinghamshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
37UE Gedling Nottinghamshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
37UF Mansfield Nottinghamshire East Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
37UG Newark and Sherwood Nottinghamshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
37UJ Rushcliffe Nottinghamshire East Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
38UB Cherwell Oxfordshire South East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
38UC Oxford Oxfordshire South East 1 Cities and Services 1.3 Thriving London Periphery 1.3.4
38UD South Oxfordshire Oxfordshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
38UE Vale of White Horse Oxfordshire South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
38UF West Oxfordshire Oxfordshire South East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
39UB Bridgnorth Shropshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
39UC North Shropshire Shropshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
39UD Oswestry Shropshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
39UE Shrewsbury and Atcham Shropshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
39UF South Shropshire Shropshire West Midlands 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
40UB Mendip Somerset South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
40UC Sedgemoor Somerset South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
40UD South Somerset Somerset South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
40UE Taunton Deane Somerset South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
40UF West Somerset Somerset South West 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
41UB Cannock Chase Staffordshire West Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
41UC East Staffordshire Staffordshire West Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
41UD Lichfield Staffordshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
41UE Newcastle-under-Lyme Staffordshire West Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
41UF South Staffordshire Staffordshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
41UG Stafford Staffordshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
41UH Staffordshire Moorlands Staffordshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
41UK Tamworth Staffordshire West Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
42UB Babergh Suffolk East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
42UC Forest Heath Suffolk East of England 5 Prospering UK 5.8 New and Growing Towns 5.8.15
42UD Ipswich Suffolk East of England 5 Prospering UK 5.8 New and Growing Towns 5.8.15
42UE Mid Suffolk Suffolk East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
42UF St Edmundsbury Suffolk East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
42UG Suffolk Coastal Suffolk East of England 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
42UH Waveney Suffolk East of England 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.18
43UB Elmbridge Surrey South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
43UC Epsom and Ewell Surrey South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
43UD Guildford Surrey South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
43UE Mole Valley Surrey South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
43UF Reigate and Banstead Surrey South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
43UG Runnymede Surrey South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
43UH Spelthorne Surrey South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
43UJ Surrey Heath Surrey South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
43UK Tandridge Surrey South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
43UL Waverley Surrey South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
43UM Woking Surrey South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
44UB North Warwickshire Warwickshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
44UC Nuneaton and Bedworth Warwickshire West Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
44UD Rugby Warwickshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
44UE Stratford-on-Avon Warwickshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
44UF Warwick Warwickshire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
45UB Adur West Sussex South East 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.12
45UC Arun West Sussex South East 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
45UD Chichester West Sussex South East 6 Coastal and Countryside 6.10. Coastal and Countryside 6.10.17
45UE Crawley West Sussex South East 5 Prospering UK 5.8 New and Growing Towns 5.8.15
45UF Horsham West Sussex South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
45UG Mid Sussex West Sussex South East 5 Prospering UK 5.9 Prospering Southern England 5.9.16
45UH Worthing West Sussex South East 1 Cities and Services 1.1 Regional Centres 1.1.1
46UB Kennet Wiltshire South West 5 Prospering UK 5.9 Prospering Southern England 5.9.16
46UC North Wiltshire Wiltshire South West 5 Prospering UK 5.9 Prospering Southern England 5.9.16
46UD Salisbury Wiltshire South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
46UF West Wiltshire Wiltshire South West 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
47UB Bromsgrove Worcestershire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
47UC Malvern Hills Worcestershire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
47UD Redditch Worcestershire West Midlands 7 Mining and Manufacturing 7.12 Manufacturing Towns 7.12.22
47UE Worcester Worcestershire West Midlands 5 Prospering UK 5.8 New and Growing Towns 5.8.15
47UF Wychavon Worcestershire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.13
47UG Wyre Forest Worcestershire West Midlands 5 Prospering UK 5.7 Prospering Smaller Towns 5.7.14
Appendix C
Robinson’s data on Nativity and
Illiteracy
Nativity and illiteracy in the nine divisions of the US (Source: US Census (1931)), as used in the Robinson analysis. For unknown reasons these data are not completely identical to the parts reproduced in Robinson (1950). A worked computation of the φ coefficient and odds ratio follows the table.
Foreign born    Native born    Total    φ coefficient    Odds ratio
New England
Illiterate 210,046 34,316 244,362
Literate 1,600,695 4,681,192 6,281,888 0.26 17.90
Total 1,810,741 4,715,508 6,526,250
Middle Atlantic
Illiterate 638,479 114,966 753,445
Literate 4,594,955 15,569,526 20,164,481 0.27 18.82
Total 5,233,434 15,684,492 20,917,926
East North Central
Illiterate 281,645 146,738 428,383
Literate 2,918,866 17,111,998 20,030,864 0.20 11.25
Total 3,200,511 17,258,736 20,459,247
West North Central
Illiterate 51,982 90,008 141,990
Literate 1,008,875 9,960,445 10,969,320 0.10 5.70
Total 1,060,857 10,050,453 11,111,310
South Atlantic
Illiterate 31,328 976,638 1,007,966
Literate 269,903 10,867,784 11,137,687 0.01 1.29
Total 301,231 11,844,422 12,145,653
East South Central
Illiterate 4,238 662,212 666,450
Literate 53,032 5,645,324 5,698,356 −0.01 0.68
Total 57,270 6,307,536 6,364,806
West South Central
Illiterate 15,958 484,747 500,705
Literate 153,808 8,183,384 8,337,192 0.02 1.75
Total 169,766 8,668,131 8,837,897
Mountain
Illiterate 15,962 30,116 46,078
Literate 269,074 2,802,481 3,071,555 0.11 5.52
Total 285,036 2,832,597 3,117,633
Pacific
Illiterate 56,446 17,285 73,731
Literate 1,095,513 5,030,532 6,126,045 0.16 15.00
Total 1,151,959 5,047,817 6,199,776
Continental US
Illiterate 1,306,084 2,557,026 3,863,110
Literate 11,964,722 79,852,667 91,817,389 0.12 3.41
Total 13,270,806 82,409,693 95,680,499
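As a check, the φ coefficient and odds ratio reported in the table can be recomputed from any division's 2×2 counts. The following is a minimal R sketch using the New England figures from the table above; the object names n11 to n22 are introduced here purely for illustration.

# Recompute the phi coefficient and odds ratio for one division
# (New England) from its 2x2 table of counts:
# rows = illiterate/literate, columns = foreign born/native born.
n11 <- 210046    # illiterate, foreign born
n12 <- 34316     # illiterate, native born
n21 <- 1600695   # literate, foreign born
n22 <- 4681192   # literate, native born
odds.ratio <- (n11 * n22) / (n12 * n21)    # approx. 17.90
phi <- (n11 * n22 - n12 * n21) /
  sqrt((n11 + n12) * (n21 + n22) * (n11 + n21) * (n12 + n22))    # approx. 0.26
round(c(phi = phi, odds.ratio = odds.ratio), 2)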
Appendix D
R IPF code
FunIpf <- function(constraints, con.list, seed,
                   max.iter = 100, closure = 0.001,
                   sampled = NULL, con.list.sampled = NULL,
                   updated.constraints = FALSE) {
  # IPF function on constraints
  # with optional addition of sampled constraints
  #
  # Arguments:
  #
  # constraints: list of margins. Whatever is missing will become
  #     uniform - that is if the dimensions are known either from
  #     other constraints or seed.
  # con.list: list of margin indexes
  # seed: array of full dimensions - if missing a uniform one
  #     is calculated with minimum dimensions consistent with
  #     constraints
  # max.iter: maximum number of iterations if closure is not reached.
  #     Default is 100.
  # closure: acceptable error between given constraints and result.
  #
  # If one or more of the constraints are sampled a recursive call to
  # FunIpf is made from within the function based on:
  #
  # sampled: vector indicating which of the constraints are sampled.
  #     There can be more than one as long as they do not
  #     overlap. Default is NULL.
  # con.list.sampled: list of length(sampled) lists, each with length
  #     (constraints) elements, indexing which dimensions of
  #     which constraints should be used as margins to IPF the
  #     sampled constraints. Default is NULL. See example for
  #     detailed example.
  # updated.constraints: if sampled constraints were updated they can be
  #     returned. Default is FALSE.
  #
  # Returns:
  #
  # An array of the same dimensions as seed
  # consistent with all the constraints.
  #
  # Or, if sampled constraints are used and updated.constraints = TRUE, a list
  # where the array is the first element and the updated constraints
  # are the second: list(IPF = result, updated.constraints = constraints)
  #
  # Error handling
  # class checks:
  if (!is.list(constraints) | !is.list(con.list)) {
    stop("Constraints must be given as lists!")
  }
  # constraint consistency (simple totals)
  if (!is.null(sampled)) {
    if (diff(range(lapply(constraints[-sampled], sum))) > closure) {
      stop("Inconsistent constraint totals!")
    }
  } else {
    if (diff(range(lapply(constraints, sum))) > closure) {
      stop("Inconsistent constraint totals!")
    }
  }
  # constraint mapping (list) not complete:
  if (length(constraints) != length(con.list)) {
    stop("Inconsistent constraint list!")
  }
  # if seed is given
  if (!missing(seed)) {
    # it must be large enough to accommodate the constraints
    if (length(dim(seed)) < length(table(unlist(con.list)))) {
      stop("Too few seed dimensions!")
    }
  }
  # if "sampled" is given then "con.list.sampled" must be supplied as well
  # and must be of the same length
  if (length(sampled) != length(con.list.sampled)) {
    stop("Cannot run this if sampled and con.list.sampled are not in sync!")
  }
  # more margins can be sampled as long as they don't overlap:
  if (length(unlist(con.list[sampled])) >
      length(unique(unlist(con.list[sampled])))) {
    stop("The sampled margins overlap! Sorry, don't know how to deal
         with that!")
  }
  # End of non-comprehensive error handling
  #
  # Calculate seed if none is given:
  if (missing(seed)) {
    # but for this the margins do have to be complete!
    if (length(table(unlist(con.list))) != max(unlist(con.list))) {
      stop("Constraints are not sufficient to calculate seed!")
    }
    dimensions <- length(table(unlist(con.list)))    # number of dim of seed
    pozn <- vector("numeric", length = dimensions)   # first mention of each dim
    for (i in 1:dimensions) {
      pozn[i] <- match(i, unlist(con.list))
    }
    dim.lengths <- unlist(lapply(constraints, function(x) if (is.vector(x)) {
      length(x) } else dim(x)))[pozn]                # find dim lengths
    seed <- array(data = c(1), dim = dim.lengths)    # create uniform array
    warning("Uniform ", dimensions, "-dimensional seed was calculated")
  }
  #
  # If sampled constraints are given, these are now addressed first
  #
  if (!is.null(sampled)) {
    for (i in 1:length(sampled)) {
      available <- sapply(con.list, function(x) length(x))
      need <- sapply(con.list.sampled[[i]], function(x) length(x))
      # first keep the ones that are whole i.e. exact
      s.con.list <- con.list[which(available == need)]
      s.constraints <- constraints[which(available == need)]
      # now find the constraints that need to be collapsed!
      need.subset <- which(need < available & need > 0)
      for (j in 1:length(need.subset)) {
        xx <- con.list.sampled[[i]][[need.subset[j]]]
        s.constraints[[length(s.constraints) + 1]] <-
          margin.table(constraints[[need.subset[j]]], xx)
        s.con.list[[length(s.con.list) + 1]] <- con.list[[need.subset[j]]][xx]
      }
      # which is all fine, except that s.con.list is indexed with the constraint
      # numbers from con.list, but here there can't be any missing margins:
      y <- 1:length(con.list[[sampled[i]]])
      s.con.list <- lapply(s.con.list, function(x) y[match(x,
        unique(unlist(s.con.list)))])
      s.seed <- constraints[[sampled[i]]]
      # Now calculate the constraint from the sample and replace it in constraints!
      constraints[[sampled[i]]] <-
        FunIpf(constraints = s.constraints, con.list = s.con.list, seed = s.seed)
    }
  }
  #
  # Now all the constraints are there.
  # TODO: there could be some streamlining here to remove sub-settable
  # constraints, but as it is it works fine, just redundantly cycles
  # through subsets.
  #
  # Now for the actual IPF
  # Set initial values
  result <- seed
  iteration <- 0
  error <- rep(1, length(con.list))
  constraint.number <- seq(length(con.list))
  #
  # IPF
  while ((any(error > closure)) & (iteration < max.iter)) {
    for (step in constraint.number) {
      marginTotal <- apply(result, con.list[[step]], sum)
      marginCoeff <- constraints[[step]] / marginTotal
      marginCoeff[is.infinite(marginCoeff)] <- 1
      marginCoeff[is.nan(marginCoeff)] <- 1
      result <- sweep(result, con.list[[step]], marginCoeff, "*")
      error[step] <- max(abs(1 - marginCoeff))
    }
    iteration <- iteration + 1
  }
  # If IPF stopped due to number of iterations then output such info
  if (iteration == max.iter) {
    warning("Reached maximum iterations. Remaining errors are printed above.")
    print(error)
  }
  # Results
  if (updated.constraints == TRUE) {   # return array and new constraints
    return(list(IPF = result, updated.constraints = constraints))
  } else {                             # otherwise just the array
    return(result)
  }
}
#
##################################################################
#
# Two examples:
# 1. Simple example with no sampling:
#
# User must make sure the constraints are consistent! There are only
# very basic checks that the totals are equal, so it is up
# to the user to make sure. For this example we start with a
# full table to make sure the constraints are consistent:
full.table <- array(c(1:17), dim = c(2, 5, 4, 3))
# then select the "known margins"
known.margins.map <- list(c(1, 2), c(3, 4), c(1, 2, 3), c(1, 2, 4))
known.margins <- vector("list")
for (i in 1:length(known.margins.map)) {
  known.margins[[i]] <- apply(full.table, known.margins.map[[i]], sum)
}
# The map and the margins are all that is needed to run FunIpf:
result <- FunIpf(known.margins, known.margins.map)
# Because we didn't input the seed a uniform one was
# used automatically.
# Check that the margins of result are correct:
for (i in 1:length(known.margins.map)) {
  print(all.equal(known.margins[[i]], apply(result, known.margins.map[[i]],
                                            sum)))
}
#
# 2. Example with one of the margins sampled. Using the same data
#    as above, but the last constraint will be sampled and then
#    IPF-ed before the whole table is IPF-ed.
#
# First we need a sampling function:
FunSample <- function(Full, n) {
  Frame <- as.data.frame(lapply(expand.grid(lapply(dim(Full), seq)), factor))
  table(Frame[sample(1:nrow(Frame), n, prob = Full, replace = TRUE), ])
}
# Using the same data as above we replace the last constraint
# c(1,2,4) with a sample taken from it.
known.margins[[4]] <- FunSample(known.margins[[4]], 100)
sampled <- c(4)
# Now we need to also choose which whole constraints get used to
# adjust the sample. In our case that is the first one c(1,2) and
# c(4) from the second one c(3,4):
sample.constraints <- list(list(c(1, 2), c(2), NULL, NULL))
# Each element in the list corresponds to the indices of the margins that
# are needed: both from the first constraint, only the second one in the
# second and nothing from the last two. Then running IPF:
result <- FunIpf(known.margins, known.margins.map,
                 sampled = sampled, con.list.sampled = sample.constraints,
                 updated.constraints = TRUE)
# Again the difference between the IPF margins and the
# original margins can be tested:
for (i in 1:length(known.margins.map[-sampled])) {
  print(all.equal(known.margins[-sampled][[i]], apply(result$IPF,
    known.margins.map[-sampled][[i]], sum)))
}
# Depending on the degree of accuracy required the errors
# can be minimised by adjusting the closure parameter.
#
# More than one sample can also be included directly as long
# as they do not overlap. If they do, then FunIpf must be used twice:
# first reconciling the overlapping samples into one
# constraint, which is then fed into a second run of FunIpf.
#
##################################################################
Appendix E
Summary statistics for the analysis in Chapter 10
[A] [AB]>[AC] [AC] [AB]
Variable Name ZG2 Rank No. Rank ZG2 Rank mean(ZG2) Rank sd(ZG2) Rank
Accommodation type 318.41 (22.) 0 (1.) 479.03 (54.) 180.41 (24.) 77.85 (24.)
Age 259.23 (11.) 43 (52.) 206.39 (34.) 329.95 (57.) 150.90 (51.)
Bath/toilet 409.20 (42.) 31 (36.) 116.76 (8.) 144.01 (9.) 71.01 (15.)
Cars/vans owned 281.21 (13.) 0 (1.) 351.18 (48.) 205.17 (39.) 69.98 (12.)
Communal establishment type 417.32 (46.) 48 (55.) 49.76 (2.) 113.85 (2.) 47.24 (2.)
Central heating 374.00 (36.) 2 (10.) 241.63 (37.) 149.08 (10.) 71.02 (16.)
Status in communal establishment 411.46 (44.) 32 (38.) 108.91 (5.) 143.17 (7.) 70.80 (13.)
Country of birth 494.51 (51.) 0 (1.) 608.17 (56.) 131.18 (4.) 74.80 (22.)
Residents per room 387.39 (39.) 8 (14.) 253.37 (39.) 189.51 (31.) 68.67 (11.)
Distance moved 519.83 (54.) 3 (12.) 182.93 (23.) 119.77 (3.) 81.40 (27.)
Distance to work 376.65 (37.) 14 (19.) 270.40 (41.) 219.93 (42.) 154.54 (52.)
Economic activity (last week) 325.23 (23.) 29 (34.) 194.64 (29.) 257.04 (51.) 157.39 (55.)
Ethnic Group 695.43 (56.) 0 (1.) 580.14 (55.) 168.33 (20.) 83.61 (29.)
Ever worked 304.55 (18.) 21 (24.) 155.87 (15.) 162.11 (15.) 99.92 (43.)
Family type 292.95 (15.) 22 (26.) 253.20 (38.) 258.06 (52.) 97.25 (41.)
Dependent children in family 154.36 (2.) 25 (29.) 183.20 (25.) 204.77 (38.) 91.29 (37.)
Economic position of FRP 317.58 (21.) 21 (24.) 231.27 (36.) 221.29 (43.) 106.37 (44.)
NS-SEC position of FRP 306.51 (19.) 11 (16.) 369.65 (50.) 279.95 (55.) 121.86 (45.)
Sex of FRP 205.60 (9.) 14 (19.) 203.25 (32.) 169.08 (21.) 97.62 (42.)
Generation indicator 294.45 (16.) 45 (53.) 201.60 (30.) 269.08 (54.) 95.56 (39.)
General health 354.06 (33.) 37 (47.) 132.58 (12.) 163.90 (16.) 66.68 (10.)
Household education indicator 287.04 (14.) 32 (38.) 193.31 (27.) 191.67 (32.) 61.92 (5.)
Household employment indicator 346.88 (28.) 23 (28.) 204.38 (33.) 167.89 (18.) 61.37 (4.)
Household housing indicator 345.08 (27.) 3 (12.) 272.71 (42.) 160.93 (14.) 72.43 (19.)
Household health indicator 295.58 (17.) 32 (38.) 176.73 (20.) 181.90 (25.) 72.30 (18.)
Household headship 365.69 (35.) 54 (57.) 112.85 (6.) 221.65 (44.) 54.32 (3.)
No. of carers in household 383.53 (38.) 32 (38.) 137.74 (14.) 168.04 (19.) 74.19 (21.)
No. of employed adults in household 277.29 (12.) 39 (49.) 201.68 (31.) 233.38 (46.) 88.22 (34.)
No. in household with LLTI 348.49 (29.) 32 (38.) 189.09 (26.) 196.77 (34.) 74.14 (20.)
No in household with poor health 400.49 (41.) 30 (35.) 176.07 (19.) 182.77 (28.) 71.59 (17.)
No. of usual residents in household 351.87 (31.) 32 (38.) 182.95 (24.) 214.08 (40.) 88.21 (33.)
Hours worked weekly 362.23 (34.) 28 (33.) 163.63 (17.) 219.29 (41.) 149.13 (50.)
Social grade of HRP 196.17 (5.) 11 (16.) 284.99 (44.) 229.04 (45.) 90.72 (36.)
Year last worked 343.73 (26.) 32 (38.) 193.41 (28.) 265.27 (53.) 165.76 (57.)
Limiting long-term illness 340.99 (25.) 35 (45.) 128.82 (11.) 160.71 (13.) 62.47 (7.)
Lowest floor level of accommodation 468.30 (48.) 0 (1.) 402.51 (52.) 174.66 (23.) 79.00 (25.)
Marital status 181.19 (4.) 26 (31.) 170.09 (18.) 182.26 (26.) 87.09 (31.)
Migration indicator 513.60 (52.) 15 (21.) 180.56 (21.) 157.45 (12.) 87.19 (32.)
Region of origin 741.33 (57.) 0 (1.) 688.58 (57.) 182.84 (29.) 95.65 (40.)
Occupancy rating of household 312.00 (20.) 2 (10.) 352.82 (49.) 197.79 (37.) 80.61 (26.)
Professional Qualification 200.59 (7.) 27 (32.) 126.43 (10.) 174.05 (22.) 131.33 (47.)
Hours of care provided per week 482.42 (50.) 42 (51.) 88.04 (4.) 137.41 (5.) 63.93 (9.)
Level of highest qualification 253.97 (10.) 18 (22.) 269.29 (40.) 238.21 (48.) 144.66 (48.)
Religion 569.23 (55.) 1 (8.) 450.60 (53.) 155.01 (11.) 75.97 (23.)
Relationship to HRP 436.49 (47.) 45 (53.) 227.34 (35.) 297.85 (56.) 87.04 (30.)
Number of rooms occupied 348.76 (30.) 1 (8.) 337.55 (47.) 197.63 (36.) 82.10 (28.)
Accommodation self-contained 410.27 (43.) 31 (36.) 122.90 (9.) 143.94 (8.) 70.99 (14.)
Sex 24.38 (1.) 51 (56.) 10.38 (1.) 51.82 (1.) 45.95 (1.)
Students away during termtime 391.44 (40.) 39 (49.) 134.46 (13.) 165.29 (17.) 62.15 (6.)
Schoolchild or student in FT education 202.84 (8.) 37 (47.) 78.30 (3.) 142.21 (6.) 93.79 (38.)
Supervisor/foreman 156.99 (3.) 35 (45.) 113.91 (7.) 182.49 (27.) 129.37 (46.)
Tenure of accommodation 353.63 (32.) 0 (1.) 323.68 (46.) 194.96 (33.) 63.92 (8.)
Term time address of students 328.98 (24.) 22 (26.) 181.35 (22.) 186.22 (30.) 88.59 (35.)
Transport to work 468.97 (49.) 9 (15.) 392.67 (51.) 238.64 (49.) 154.56 (53.)
Size of workforce 197.60 (6.) 25 (29.) 159.60 (16.) 197.20 (35.) 145.25 (49.)
Workplace 517.94 (53.) 13 (18.) 320.85 (45.) 233.96 (47.) 165.04 (56.)
NS-SEC 8 classes 416.16 (45.) 18 (22.) 275.95 (43.) 253.88 (50.) 155.87 (54.)
Appendix F
Sampling results: Model
4B-GOR vs. Model 4B-SG
The following table graphically depicts, for each variable, the proportion of tables where the geographical model (Model 4B-GOR) outperformed the geodemographic model (Model 4B-SG), as measured using proportion misclassified. The variables are sorted by the last column, i.e. the proportion of tables where a 50% regional-level sample performs better than the Supergroup one. A sketch of the proportion misclassified measure follows the table.
Sample sizes
Variable 0.005% 0.01% 0.05% 0.1% 0.5% 1% 5% 10% 50%
Country of birth
Accommodation type
Migration origin
Central heating
Sex
Occupancy rating
LLTI
Cars/vans
Transport to work
Lowest floor
Housing indicator
Bath/WC
Workplace
Number of rooms
Hours per week
Residents per room
Household LLTI
Number of carers
H. employment indicator
Religion
Hours of care
No. of residents
No. w. poor health
No w. LLTI
Ethnic group
Distance to work
NS-SEC 8 Classes
Tenure
Social grade of HRP
No. employed adults
Headship
Ever worked
Supervisor
Self-contained acc.
Prof. qualifications
Economic activity
Term time address
Relationship to HRP
Migration indicator
Last worked
General health
Sex of FRP
Economic position of FRP
Dependent children
Distance moved
Comm. est. type
Workforce size
Student/schoolchild
Students away
Highest qualifications
Household education
Generation indicator
NS-SEC of FRP
Family type
Status in com.est.
Age
Marital status
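Both this comparison and the one in Appendix G rank the models by proportion misclassified. The sketch below shows one common way this measure can be computed in R, assuming the estimated and benchmark tables are arrays of counts with identical dimensions; the helper name FunPropMisclassified is introduced here for illustration only and is not part of the Appendix D code.

# Proportion misclassified: the share of individuals that would have
# to be moved between cells for the estimated table to match the
# benchmark table. Assumes both arguments are arrays of equal dimensions.
FunPropMisclassified <- function(estimate, benchmark) {
  sum(abs(estimate - benchmark)) / (2 * sum(benchmark))
}
# Toy illustration with a 2 x 2 table:
benchmark <- matrix(c(40, 10, 20, 30), nrow = 2)
estimate  <- matrix(c(35, 15, 25, 25), nrow = 2)
FunPropMisclassified(estimate, benchmark)    # 0.1, i.e. 10% misclassified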
Appendix G
Sampling results: Model
4B-GOR and Model 4B-SG vs.
Model 4
The following table graphically depicts, for each variable, the proportion of tables where the uniform prior (Model 4) outperformed the geographic model (Model 4B-GOR, dark grey bars) and the geodemographic model (Model 4B-SG, light grey bars), as measured using proportion misclassified.
Sample sizes
Variable 0.1% 0.5% 1% 5% 10% 50%
Comm.est. type
Status in comm. est.
Bath/WC
Hours of care
Students away
Number of carers
Self-contained acc.
Sex
Ethnic group
Religion
No. w. poor health
Hh. employment ind.
Age
Migration indicator
NS-SEC 8 Classes
Workforce size
Migration origin
Transport to work
No w. LLTI
General health
Social grade of HRP
Distance moved
No. w. HHLTI
Hours worked
NS-SEC of FRP
Residents per room
Workplace
Central heating
Relationship to HRP
Headship
Housing indicator
Distance to work
Occupancy rating
LLTI
Lowest floor
Last worked
Highest qualification
Term-time address
Prof. qualifications
Hh. education indicator
Number of rooms
Ever worked
Supervisor
Cars/vans
Economic activity
Usual residents
Family type
Country of birth
Employed adults in hh.
Generation indicator
Student/schoolchild
Accommodation type
Tenure
Sex of FRP
Economic position of FRP
Marital status
Dependent children