
Some methods for measuring and correcting for spatial autocorrelation

Authors:
  • Ulster Institute for Social Research

Abstract

Two new approaches to spatial autocorrelation (SAC) were examined for their ability to measure SAC: correlation of distance scores and k nearest spatial neighbor regression. Furthermore, the methods were compared to a more traditional measure of SAC, Moran’s I. All methods were found to give similar results as measured by their intercorrelations. Three approaches, correlation of distance scores, k nearest spatial neighbor regression and spatial local regression, were used on simulated datasets to see if they could successfully distinguish between a true and a spurious cause. The correlation of distances approach was found to be lacking, while the other two methods clearly distinguished between the spurious and true causes and gave similar results.
The Winnower, a free to publish, post-publication review mega-journal.
Published 2015/10/20.
Emil O. W. Kirkegaard
Key words: spatial autocorrelation, Moran's I, correlation of distances, k nearest spatial neighbor
regression, KNSNR, spatial local regression, SLR
1. Introduction
Much research analyzes data from countries, states, weather stations or other units that have a location.
Spatial autocorrelation (SAC) is when there are spatial patterns in the dataset (Hassall & Sherratt, 2011;
Radil, 2011). SAC can be positive (nearby cases are similar), neutral (neighbor cases have no
particular relationship, i.e. absence of SAC), or negative (nearby cases are dissimilar). Figure 1
illustrates this.
The presence of SAC in a dataset means that the cases are not independent which means that the
degrees of freedom are overestimated. There are methods for correcting the degrees of freedom, but
these are not the focus of this paper (Gelade, 2008; Hassall & Sherratt, 2011). Instead, the purpose of
this article is to explore ways to measure and correct for SAC with respect to effect sizes.
2. Correlation of distances method
A conceptually simple method for examining SAC is the correlation of distance scores method (CD).
CD was used by Davide Piffer (Piffer, 2015) but is so simple that it would be surprising if it had not
been thought of before. The first step is to calculate distance scores for all cases for all variables of
interest. Note that the kind of distance score depends on the type of variable. Piffer analyzed, among
other things, phylogenetic distances (genetic distance between two organisms or classes of organisms)
and used Fst values. However, if one's data concern e.g. countries, one would use spherical geometric
distances. If one is analyzing data from flatland (or any hyperspace with flat planes), one could use
euclidean distances. For single metric type variables, one could use the absolute difference.
Alternatively, one could combine several non-spatial variables and employ various well-researched
distance measures such as complete linkage (James, Witten, Hastie, & Tibshirani, 2013, p. 401).
The distance scores result in a new dataset which has N(N-1)/2 cases (choose 2 of N, order irrelevant),
where N is the number of cases in the original dataset.
The idea behind the method is that given two variables that are linearly related, pairs of cases that are
far from each other on one variable should also be far from each other on another variable. To better
understand this, it is worth looking at some example data. Figure 2 shows the spatial distribution of
some datapoints in flatland. An outcome variable also exists and is shown by color coding.
There are clear spatial patterns in the data: 1) we see 5 fairly distinct clusters, 2) neighbors tend to have
similar outcomes, and 3) points with higher x and y values tend to have higher outcome values.
Using the distance scores, one can examine (2) with a standard Pearson correlation, as shown in Figure
3.
As expected we see a positive correlation, i.e. cases closer to each other do have more similar
outcomes. We also see substantial non-linearity in the plot. This is because the cases are not uniformly
spatially distributed and because distances between all pairs of cases are used.
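The construction of distance scores and the CD correlation can be sketched as follows. This is a minimal illustration and not the implementation used for the results in this paper: it assumes flatland (euclidean) coordinates and a single metric outcome, and the toy data and the names `distance_scores` and `pearson` are hypothetical.

```python
import itertools
import math


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def distance_scores(coords, values):
    """Return spatial distances and absolute differences for all N(N-1)/2 pairs."""
    spatial, differences = [], []
    for i, j in itertools.combinations(range(len(values)), 2):
        spatial.append(math.dist(coords[i], coords[j]))  # euclidean distance (flatland)
        differences.append(abs(values[i] - values[j]))   # single metric variable
    return spatial, differences


# Hypothetical toy data: two tight clusters with different outcome levels.
coords = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
outcome = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8]

spatial, differences = distance_scores(coords, outcome)
cd = pearson(spatial, differences)  # CD measure of SAC in the outcome
print(len(spatial), round(cd, 3))
```

For spherical data one would swap `math.dist` for a great-circle distance; the rest of the procedure stays the same.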
If one wants to control for SAC, one can use standard statistical methods such as multiple regression or
semi-partial correlations. Suppose one has a predictor that is spatially distributed as shown in Figure 4.
We see that there is some similarity to Figure 2, but it is not immediately obvious whether the
relationship owes to a pattern at the cluster-level or whether the pattern is general. In the original data
the correlation between predictor and outcome is .66 and using CD it is .36.
The semi-partial correlation between predictor and outcome is .13, a fair bit lower than .36, which
gives us some evidence that the relationship owes much of its size to the cluster pattern. If one instead
uses multiple regression the standardized betas are .16 and .36 for predictor and spatial distance,
respectively. In fact, the true correlation of predictor and outcome once SAC is taken into account is
near 0. The relationship between them is purely due to their common association with the clusters we
see (common cause scenario).
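For readers who want to reproduce the semi-partial step, the standard formula needs only the three pairwise correlations among the distance scores. The sketch below assumes that the outcome is the variable being corrected for spatial distance; the function name and the example values are hypothetical and are not taken from the datasets above.

```python
import math


def semipartial_r(r_xy, r_xz, r_yz):
    """Correlation of x with the part of y that is not linearly predictable from z.

    x = predictor distances, y = outcome distances, z = spatial distances.
    """
    return (r_xy - r_xz * r_yz) / math.sqrt(1 - r_yz ** 2)


# Hypothetical values: the x-y correlation is fully accounted for by z.
print(semipartial_r(r_xy=0.3, r_xz=0.6, r_yz=0.5))
```

When the predictor and outcome are related only through their shared relation to spatial distance, the numerator vanishes and the semi-partial correlation is zero.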
2.1. A second dataset
What about a similar situation where the relationship between predictor and outcome is not spurious?
Consider the data in Figures 5 and 6.
We see the same overall pattern as before, but it is subtly different. The correlation in the original
dataset between predictor and outcome is nearly the same, .68 vs. .66 before. However, using the
distance scores, the correlation between outcome and spatial distance is only .17 vs. .45 before. The
distance-based correlation between predictor and outcome is .43 vs. .36 before.
The semi-partial correlation for predictor and outcome correcting the latter for spatial distance is .39 vs.
.13 before. A fairly large difference. Using multiple regression the betas are .42 and .04 for predictor
and spatial distance, respectively.
Clearly, the distance scores approach shows that the association between predictor and outcome is
different in the two datasets.
2.2. The relationship between correlations of distances and correlations in the
original data
Before we go further, it is worth examining the relationship between correlations of distance data and
correlations of the original data. From my experimentation, correlations of distance scores seem to be
an r2-type statistic (Hunter & Schmidt, 2004, p. 189). This can be seen if one simulates normally
distributed data with a known correlation and then correlates their distance scores. The results are
shown in Figure 7.
As can be seen, the correlation of distance scores is almost exactly the same as the square of the
correlation, or conversely, the square root of the correlation of distance scores is almost the same as the
original correlation score.
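A simulation of this kind can be sketched as follows. It is a rough illustration under stated assumptions (standard normal variables, absolute differences as distance scores, an arbitrary sample size and seed), not the code behind Figure 7; the function name `simulate_cd` is hypothetical.

```python
import itertools
import math
import random


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def simulate_cd(rho, n=400, seed=1):
    """Return (r in the original data, correlation of the distance scores)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    # y is built to have a population correlation of rho with x
    y = [rho * xi + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for xi in x]
    pairs = list(itertools.combinations(range(n), 2))
    dx = [abs(x[i] - x[j]) for i, j in pairs]
    dy = [abs(y[i] - y[j]) for i, j in pairs]
    return pearson(x, y), pearson(dx, dy)


r, cd = simulate_cd(0.8)
print(round(r, 3), round(r ** 2, 3), round(cd, 3))  # compare r, r squared and CD
```

Running this for a range of `rho` values lets one compare CD with r and r2 directly.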
Going back to the results in the previous section, this means that we need to take the square root of the
semi-partial correlations and standardized betas to get an r-type measure. These values are shown in
Table 1.
Relationship                                              Dataset 1   Dataset 2
r (predictor x outcome), original data                    .66         .68
sqrt. r (predictor x outcome), distance data              .60         .66
sqrt. r (spatial x outcome), distance data                .67         .41
sqrt. rsp (predictor x outcome, spatial), distance data   .36         .62
sqrt. beta (predictor x outcome), distance data           .40         .65
sqrt. beta (spatial x outcome), distance data             .60         .20
Table 1: Table of numbers from datasets 1 and 2.
There are several things worth noting. 1) The square root conversion seems to work because the
distance-based correlations are similar to the ones based on the original data. 2) The distance-based
correlation between spatial distance and outcome was higher for the first dataset, as expected. 3)
Semi-partial correlations of distance data showed that the predictor had a substantial effect size (.36)
in the first dataset despite this relationship being entirely spurious. However, they did correctly
indicate a stronger effect in the second dataset (.62). 4) Similarly, multiple regression showed that the
predictor had substantial validity in the first dataset and, conversely, that spatial distance had some
validity in the second dataset. Neither should be the case. On the positive side, in both cases the true
causes had larger effect sizes (.60>.40; .65>.20).
3. k nearest spatial neighbors regression
Given the problems of the CD method, I developed a new method. It is a variant of the general
approach of using the neighboring points to predict values, called k nearest neighbors (KNN) (James et
al., 2013). The implementations of KNN I could find for R did not support spatial data, so I wrote my
own implementation which I dub k nearest spatial neighbors regression (KNSNR). The approach is
fairly simple:
1. For each case, calculate the distance (spherical, euclidean or otherwise) to every other
case.
2. For each case, find the k nearest neighbor cases.
3. For each case, calculate the mean value of the neighbors' scores and use that as the prediction.
One can alter k in (2) to tune the method. This value should probably be found thru cross-validation
(James et al., 2013).
(3) can be altered to make more advanced versions such as taking into account the relative distances
among the nearest neighbors (closer neighbors are more important) or the case weights (if using
aggregated data or uncertain data).
The function I wrote will output one of four things: 1) predicted values, 2) correlation of actual values
with predicted values, 3) residuals, and 4) the correlation of a predictor with the residuals of the
outcome. Each has its use.
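The three steps and the four outputs can be sketched as follows. This is a minimal illustration rather than the R function described above: it assumes euclidean distances, unweighted neighbor means, and residuals defined as actual minus predicted values. All function names and the toy data are hypothetical.

```python
import math


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def k_nearest(coords, i, k):
    """Indices of the k cases closest to case i, excluding i itself (steps 1-2)."""
    others = [j for j in range(len(coords)) if j != i]
    others.sort(key=lambda j: math.dist(coords[i], coords[j]))
    return others[:k]


def knsn_predict(coords, values, k):
    """Predict each case as the mean of its k nearest spatial neighbors (step 3)."""
    return [
        sum(values[j] for j in k_nearest(coords, i, k)) / k
        for i in range(len(values))
    ]


def knsn_outputs(coords, outcome, predictor, k):
    """Collect the four outputs described in the text."""
    predicted = knsn_predict(coords, outcome, k)
    # Assumption: residual = actual value minus neighbor-based prediction.
    residuals = [o - p for o, p in zip(outcome, predicted)]
    return {
        "predicted": predicted,
        "sac": pearson(outcome, predicted),
        "residuals": residuals,
        "predictor_x_residuals": pearson(predictor, residuals),
    }


# Hypothetical toy data: six cases on a line, in two groups.
coords = [(0,), (1,), (2,), (10,), (11,), (12,)]
outcome = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
predictor = [0.5, 1.5, 1.0, 3.0, 4.0, 3.5]

result = knsn_outputs(coords, outcome, predictor, k=2)
print(result["predicted"], round(result["sac"], 3))
```

Ties in distance are broken here by position in the list, which is an arbitrary choice.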
The correlation of actual values with predicted values is a measure of the SAC in a given variable. It
corresponds to the correlation of a given variable and spatial distance when using distance data. Which
k is the right one to use? It depends. What k does here is change the zoom level. The value at each k tells
you how much SAC there is in the dataset given a particular level of zooming in on every case. Figure
8 shows the amount of SAC in the outcome variable according to the first 100 values of k for multiple
datasets.
In general, k should be larger than 1. k=3 often works well. The SAC estimates reached plateaus for
datasets 1-2 around k=10. For these datasets k=50 makes more sense because there are 50 cases in each
cluster. At all levels, however, we see that SAC is stronger for the odd numbered datasets, just as was
found using CD.
The predicted scores from KNSNR can be used in a multiple regression model where they can compete
with other variables. The residuals can be correlated with a predictor to easily perform semi-partial
correlation with SAC controlled for in the chosen variable. The semi-partial correlation between
predictor and outcome controlling outcome for SAC in the first dataset is .01 while it is .42 in the
second dataset (using k=50). In other words, the method tells us that without SAC in the outcome
the predictor doesn't work in the first dataset, but it does in the second. The semi-partial correlations of
spatial prediction x outcome controlled for predictor are .34 and .01, for the first and second datasets
respectively. In other words, in the first dataset spatial proximity retains predictive value once the
predictor has been controlled, showing that there is an additional unmeasured cause with strong SAC,
while in the second dataset it does not. All four results are good because they are in line with how the
data were generated.
4. A different kind of spatial autocorrelation
The previous examples were easy in the sense that the data pretty clearly contained clusters of points.
One could have easily classified the cases into the regional/island clusters and used a more traditional
approach to the question: multi-level analysis. In simple terms, one would check whether the relationship
found in the pooled data holds when analyzing the data within each cluster. Table 2 shows these results.
Cluster Dataset 1 Dataset 2
1 -0.176 0.511
2 -0.094 0.615
3 -0.037 0.5
4 0.117 0.511
5 0.162 0.617
Table 2: Correlations inside clusters in datasets 1-2.
Clearly, the datasets are markedly different in that there is a near-zero relationship inside the clusters in
the first, but there are fairly strong relationships in the second.
When analyses give discrepant results across levels, it is termed Simpson's paradox. Not understanding
this concept has led to bogus claims of sex discrimination at least twice (Albers, 2015; Kievit,
Frankenhuis, Waldorp, & Borsboom, 2013). An interactive visualization of the phenomenon can be
found at: http://emilkirkegaard.dk/understanding_statistics/?app=Simpson_paradox
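The contrast between pooled and within-cluster correlations can be shown with a small constructed example. The numbers below are hypothetical and chosen only to make the reversal obvious; they are not drawn from datasets 1-2.

```python
import math


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Two hypothetical clusters: the relationship is negative inside each cluster,
# but the cluster means differ so much that the pooled relationship is positive.
clusters = {
    "A": ([1, 2, 3], [3, 2, 1]),
    "B": ([11, 12, 13], [13, 12, 11]),
}

within = {name: pearson(x, y) for name, (x, y) in clusters.items()}
pooled_x = [v for x, _ in clusters.values() for v in x]
pooled_y = [v for _, y in clusters.values() for v in y]
pooled = pearson(pooled_x, pooled_y)

print(within, round(pooled, 3))
```

A multi-level analysis reports the within-cluster values, while a naive analysis of the pooled data reports only the last number.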
Clearly we did not need to use SAC to spot the difference between datasets 1-2. However, the datasets
are unusual in that they permit very easy clustering of the cases. Real life datasets are often more
difficult to deal with. Consider datasets 3-4 shown in Figures 9-12.
These datasets are variations of datasets 1-2, yet now the clusters are not easy to spot. One would not
be able to use a simple clustering algorithm on the spatial data to sort them into 5 clusters correctly. If
one cannot do that, one cannot use the standard multi-level approach because it requires discrete levels.
The correlations between the predictors and outcomes are in fact identical to before: .66 and .68
respectively. These correlations are generated in the same way as before, meaning that the first is a
spurious, SAC induced correlate, while the second is a true cause. The datasets still contain a sizable
amount of SAC as was seen in Figure 8, so one might still wonder how this affects the results. Using
the semi-partial SAC control approach, the results are .15 and .46 respectively. Should the number .15
be closer to 0? Not exactly.
To understand why, I need to be more explicit about how the data were simulated. In the odd numbered
datasets, the outcome and predictor variables are both a function of the clusters, which serve as a SAC
variable (SACV). In the even numbered datasets, the predictor is a function of the clusters + noise and
the outcome is a function of the predictor + noise. Figure 13 shows the path models.
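The two path structures can be sketched in a few lines. This illustration covers only the causal structure, so the SACV here is plain normal noise without any spatial layout, and the noise standard deviations (1, 0.4 and 1.6) are assumptions picked so that both scenarios give an uncorrected correlation near .66. The names `simulate` and `residualize` are hypothetical.

```python
import math
import random


def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def residualize(y, z):
    """Residuals of y after a least-squares regression of y on z."""
    n = len(y)
    mean_z, mean_y = sum(z) / n, sum(y) / n
    slope = sum((a - mean_z) * (b - mean_y) for a, b in zip(z, y)) / sum(
        (a - mean_z) ** 2 for a in z
    )
    return [b - mean_y - slope * (a - mean_z) for a, b in zip(z, y)]


def simulate(true_cause, n=2000, seed=0):
    """Return (sacv, predictor, outcome) for the spurious or true cause model."""
    rng = random.Random(seed)
    sacv = [rng.gauss(0, 1) for _ in range(n)]
    predictor = [s + rng.gauss(0, 1) for s in sacv]
    if true_cause:
        # even numbered datasets: SACV -> predictor -> outcome
        outcome = [p + rng.gauss(0, 1.6) for p in predictor]
    else:
        # odd numbered datasets: SACV -> predictor and SACV -> outcome
        outcome = [s + rng.gauss(0, 0.4) for s in sacv]
    return sacv, predictor, outcome


for true_cause in (False, True):
    sacv, predictor, outcome = simulate(true_cause)
    raw = pearson(predictor, outcome)
    controlled = pearson(predictor, residualize(outcome, sacv))
    print(true_cause, round(raw, 2), round(controlled, 2))
```

Controlling the outcome for the SACV directly, as done here, is only possible in a simulation where the SACV is observed.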
The difference between datasets 1-2 and 3-4 is solely that the within cluster spatial variation was
increased from standard deviation=6 to 15. This weakens the SAC in the data and thus the effect of
controlling for SAC. The results are thus in line with expectations.
Still, we would like paired examples that appear similar, i.e. have a similar r (predictor x outcome), but
where the first effect is spurious and the second is real, and we would like our methods to identify this
correctly. To do this, I created two more datasets. This time there are no a priori clusters.
The generation proceeded like this:
1. For each case:
a) Sample two values from the uniform distribution of numbers between 1 and 100 and use
these as x and y coordinates.
b) Sample a number from a standard normal distribution and use it as the score for the SACV.
2. Induce SAC in the SACV using a KNSN-type approach.
After this, I proceeded similarly to before, namely by creating two variables, predictor and outcome,
according to the path models shown in Figure 13. Noise was added to hit a target uncorrected
correlation of .66, similar to datasets 1-4. Because the SACV has strong SAC, controlling for
SAC decreases the correlation in the spurious cause scenario because the path goes thru the SACV, but
not in the true cause scenario where causation is directly from predictor to outcome.
It is worth going into more detail about how SAC was induced. The algorithm works as follows:
1. For each iteration:
a) For each case:
1. Find the k nearest spatial neighbors.
2. Find the mean value of the chosen variable for these neighbors.
b) For each case:
1. Change the value of the datapoint to (1-w) times its current value and w times the mean
value of its neighbors.
(a) and (b) must proceed in this fashion, otherwise the order of the cases would affect the algorithm.
The algorithm requires values for three parameters: i, k and w. i determines the number of iterations
the algorithm goes thru. k controls the number of neighbors taken into account and w controls the
relative weight given to the value from the neighbors. Smaller k means the SAC pattern will be more
local, while larger i and w both make the SAC stronger but in different ways. The parameters used for
datasets 5 and 6 were i=20, k=10, w=1/3.
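The generation and induction steps can be sketched as follows. This is an illustration under assumptions (euclidean distances, neighbor sets computed once from the fixed coordinates), not the code used to create datasets 5 and 6; the function names, the number of cases and the seed are hypothetical.

```python
import math
import random


def k_nearest(coords, i, k):
    """Indices of the k cases closest to case i, excluding i itself."""
    others = [j for j in range(len(coords)) if j != i]
    others.sort(key=lambda j: math.dist(coords[i], coords[j]))
    return others[:k]


def induce_sac(coords, values, i=20, k=10, w=1 / 3):
    """Pull every value towards the mean of its k nearest spatial neighbors, i times."""
    values = list(values)
    neighbors = [k_nearest(coords, c, k) for c in range(len(coords))]
    for _ in range(i):
        # (a) compute all neighbor means from the current values first ...
        means = [sum(values[j] for j in nb) / k for nb in neighbors]
        # (b) ... then update every case, so the order of the cases cannot matter
        values = [(1 - w) * v + w * m for v, m in zip(values, means)]
    return values


# Generation as described: uniform coordinates and a standard normal SACV.
rng = random.Random(0)
n_cases = 200  # hypothetical number of cases
coords = [(rng.uniform(1, 100), rng.uniform(1, 100)) for _ in range(n_cases)]
sacv = [rng.gauss(0, 1) for _ in range(n_cases)]
sacv_with_sac = induce_sac(coords, sacv, i=20, k=10, w=1 / 3)
```

Updating values in place inside a single loop over cases would instead let early cases influence later ones within the same iteration.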
Figures 14 to 17 show the spatial layout as well as predictor and outcome values for datasets 5-6.
Figure 1: Illustrations of spatial autocorrelation. From (Radil, 2011).
Figure 2: Dataset 1. Flatland and outcome.
Figure 3: Scatter plot of distance scores for spatial and outcome.
Figure 4: Dataset 1. Flatland and predictor.
Figure 5: Dataset 2. Flatland and outcome.
Figure 6: Dataset 2. Flatland and predictor.
Figure 7: Comparison of r, CD and r2.
Figure 8: The amount of SAC in datasets 1-6 according to KNSNR at a k 1-100.
Figure 9: Dataset 3. Flatland and outcome.
Figure 10: Dataset 3. Flatland and predictor.
Figure 11: Dataset 4. Flatland and outcome.
Figure 12: Dataset 4. Flatland and predictor.
Figure 13: Path models for the example datasets.
Figure 14: Example 5. Flatland and outcome.
Figure 15: Example 5. Flatland and predictor.
Figure 16: Example 6. Flatland and outcome.
From just inspecting the spatial plots, it is not easy to see a marked difference between datasets 5 and 6.
Both datasets clearly show some degree of SAC. Figures 18 and 19 show the regressions between
predictor and outcome.
Altho not identical, there do not seem to be marked differences between the datasets. Thus, one could
not tell the situations apart by simple inspection. There are also no obvious clusters one could classify
cases into and use multi-level analysis.
However, if we examine the strength of SAC in the datasets, they are fairly different as shown in Table
3.
Variable                   Dataset 5   Dataset 6
SACV                       0.998       0.998
x                          0.999       0.999
y                          0.998       0.998
predictor                  0.758       0.632
outcome                    0.794       0.365
outcome_predictor_resids   0.28        -0.035
Table 3: SAC in datasets 5 and 6. SAC calculated using k=10 because data were generated using
k=10.
Unsurprisingly, the coordinate variables and the SACV show extreme levels of SAC. The predictors
diverge a bit in SAC, while the difference in SAC for the outcome variables is fairly large. This is as
expected given the way the data were generated. In dataset 6, the outcome variable is 2 steps away
from SACV, while it is only 1 step away in dataset 5. Also worth noting is the SAC in the model
residuals in dataset 5, which was discussed by Hassall and Sherratt (2011) as indicating a problem. This
indicates that one or more causes not included in the model are spatially autocorrelated.
The correlations with the predictor after controlling the outcome for SAC are .08 and .49 for datasets 5
and 6, respectively, showing that the method can successfully detect which association is due to an
uncontrolled common cause that has SAC.
4.1. Control outcome, predictor or both for SAC?
So far I have only presented results where the outcome was controlled for SAC. However, one might
also control the predictor or both. It is not immediately clear to me which one is the best. For this
reason, I tried all three side by side. Table 4 shows the results.
Dataset k3_o k3_p k3_b k10_o k10_p k10_b k50_o k50_p k50_b
1 0.17 0.17 0.07 0.04 0.04 -0.01 0.01 0.01 0.02
2 0.57 0.57 0.58 0.45 0.45 0.53 0.42 0.42 0.54
3 0.3 0.3 0.27 0.19 0.19 0.24 0.15 0.15 0.24
4 0.6 0.6 0.61 0.52 0.52 0.58 0.46 0.46 0.56
5 0.18 0.18 0.07 0.08 0.08 0.03 0.27 0.27 0.36
6 0.57 0.57 0.57 0.49 0.49 0.55 0.51 0.51 0.59
Table 4: KNSN regression results, controlling either or both variables with three different k values. The
number is the value of k used, the letter indicates which variable was controlled for SAC: p=predictor,
o=outcome, b=both.
We see some interesting patterns. When k is 3 or 10, controlling both variables tends to reduce the
correlation further in the datasets with spurious causes, but not in the datasets with real causes. When
using k=50, controlling both actually tends to make the results stronger for both true and spurious
causes.
The most important comparison is that between datasets 5 and 6 because these are the most realistic
datasets. When using an appropriate k (3 or 10), we see that controlling both reduces the spurious
correlations (.18 to .07; .08 to .03) but either leaves unchanged or increases the true cause correlation
(.57 to .57; .49 to .55). For this reason, it would seem best to control both predictor and outcome
variables provided one has a reasonable idea about the correct k to use.
4.1.1. Multiple regression
There is another way to control for a variable, namely multiple regression. However, because the
predicted values from KNSNR were often very similar (r>.7) to the predictor, this was not very useful.
Therefore, if one does use the MR approach one should pay attention to multicollinearity in the models,
e.g. by means of the variance inflation factor (Field, 2013; Gordon, 2015).
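As a quick illustration of the scale involved: in a model with only two predictors, the variance inflation factor of either one depends solely on their correlation r, VIF = 1/(1 - r^2). The helper below is a hypothetical name for this special case, not a general VIF routine.

```python
def vif_two_predictors(r):
    """Variance inflation factor when the model has exactly two predictors."""
    return 1 / (1 - r ** 2)


for r in (0.7, 0.8, 0.9):
    print(r, round(vif_two_predictors(r), 2))
```

With more than two predictors, r^2 is replaced by the R^2 from regressing the predictor in question on all the others.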
5. Spatial local regression
Local regression is a method related to the k nearest neighbor regression. The idea of local regression is
that instead of fitting a model to all the predictor data, one fits a model for every case and weighs the
nearby cases more than the distant cases. Afterwards one combines the models. The result of this is a
smooth fit to the data. Figure 20 shows an illustration of this.
In the above case, the distance measure for the cases is their absolute difference on the predictor
variable. However, one could also use spatial distance, a method one might call spatial local regression
(SLR). In this way, one would fit a regression for each case and use only the cases that are spatially
nearby and finally aggregate/meta-analyze the results. The rationale here is that SAC will have only a
weak influence when one examines cases near each other.
For the regression betas, one could use a weighted mean with unit-weights or inverse distance weights.
The latter is theoretically superior, but I implemented both options. The implementation can also output
the vector of beta values in case someone wants to experiment using other central tendency measures.
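One possible reading of SLR is sketched below. Several choices here are assumptions and not details confirmed by the text: the focal case is excluded from its own local fit, the inverse distance weights are applied inside each local least-squares fit, the slopes are unstandardized, and the local slopes are combined by a plain mean. Function names are hypothetical.

```python
import math
import random


def k_nearest(coords, i, k):
    """Indices of the k cases closest to case i, excluding i itself."""
    others = [j for j in range(len(coords)) if j != i]
    others.sort(key=lambda j: math.dist(coords[i], coords[j]))
    return others[:k]


def weighted_slope(x, y, weights):
    """Slope of a weighted least-squares regression of y on x."""
    total = sum(weights)
    mean_x = sum(w * a for w, a in zip(weights, x)) / total
    mean_y = sum(w * b for w, b in zip(weights, y)) / total
    sxy = sum(w * (a - mean_x) * (b - mean_y) for w, a, b in zip(weights, x, y))
    sxx = sum(w * (a - mean_x) ** 2 for w, a in zip(weights, x))
    return sxy / sxx


def spatial_local_regression(coords, predictor, outcome, k, inverse_weights=True):
    """Return (mean of local slopes, list of local slopes)."""
    slopes = []
    for i in range(len(coords)):
        nb = k_nearest(coords, i, k)
        if inverse_weights:
            # Assumes no two cases share a location (distance 0 would divide by zero).
            weights = [1 / math.dist(coords[i], coords[j]) for j in nb]
        else:
            weights = [1.0] * k  # unit weights
        slopes.append(
            weighted_slope([predictor[j] for j in nb], [outcome[j] for j in nb], weights)
        )
    return sum(slopes) / len(slopes), slopes
```

Returning the full list of local slopes makes it easy to try a median or another central tendency measure instead of the mean.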
The results are surprisingly similar to those from KNSNR. Table 5 shows results for the predictor x
outcome relationship in datasets 1-6 using SLR.
Dataset slr_3_none slr_10_none slr_50_none slr_3_inverse slr_10_inverse slr_50_inverse
1 -0.017 -0.033 0.047 -0.008 -0.021 0.03
2 0.442 0.493 0.551 0.436 0.493 0.551
3 0.102 0.156 0.316 0.1 0.159 0.278
4 0.457 0.533 0.583 0.456 0.535 0.588
5 0.141 0.225 0.494 0.129 0.194 0.452
6 0.51 0.555 0.623 0.501 0.55 0.625
Table 5: Spatial local regression results for the predictor x outcome relationship in datasets 1-6. The
number refers to the chosen k value. The word refers to the use of weights.
Recall that datasets with odd numbers are spurious causes while those with even numbers are true
causes. For k=3 and 10, we see strong contrasts between the results from the even and odd numbered
datasets, but not for k=50. It appears that using k=50 is sufficient to induce a strong spurious effect.
The use of weights does not seem to have much effect, which is somewhat surprising. It does have a
small effect on reducing the size of the spurious cause correlations.
To further investigate this, I calculated the same numbers as shown in Table 5 but for k = 3 to 50 using
only inverse weighting. Results are shown in Figure 21.
We see that at lower k values, all correlations decrease, but the spurious ones decrease more. As k
increases towards N of the sample, the spurious and true cause correlations converge.
6. Comparison of measures of SAC
It is worth comparing the two measures of SAC examined in this paper to the more standard methods in
the field. Two methods are widely used to measure the amount of SAC in a dataset: Moran's I and
Geary's C (Gelade, 2008; Hassall & Sherratt, 2011; Radil, 2011). These two measures are described as
approximately inversely related and often only the first is used. I also used only the first because I was
unable to understand the implementations of the latter in R (e.g. in spdep package). Both Moran's I and
Geary's C are global measures of SAC, altho local variations exist (Radil, 2011).
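For reference, global Moran's I can be computed from a spatial weights matrix in a few lines. The sketch below uses the standard definition; the choice of weights (here a hand-built binary adjacency matrix for four cases on a line) is an assumption for illustration and is not the weighting used for the results in this paper.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square spatial weights matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    pairs = [(i, j) for i in range(n) for j in range(n) if i != j]
    cross = sum(weights[i][j] * dev[i] * dev[j] for i, j in pairs)
    total_weight = sum(weights[i][j] for i, j in pairs)
    return (n / total_weight) * cross / sum(d * d for d in dev)


# Hypothetical example: four cases on a line, adjacent cases are neighbors.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

print(morans_i([1, 1, 2, 2], adjacency))  # similar neighbors: positive
print(morans_i([1, 2, 1, 2], adjacency))  # alternating neighbors: negative
```

Inverse distance weights or k nearest neighbor weights can be passed in the same way by filling the matrix accordingly.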
Because KNSNR tends to hit a ceiling quickly (e.g. in Table 3), the analysis here is focused on
variables where this doesn't happen. For this purpose, the outcome, predictor and residuals from
predicting the first with the latter are examined with all three methods across all datasets. Results
from different datasets with the 'same' variable were appended. Because KNSNR requires a tuning
parameter, 3 values were chosen for this (3, 10, and 50). A square root version of CD was also added to
see if the non-linearity inherent in the CD correlations would substantially change the relationship to
the other measures.
To increase sample size and diversity of datasets, more datasets were created. Dataset 7 is a variant of
dataset 6, but where SAC is also induced into the outcome and predictor variables. Dataset 8 uses the
distance to the center as the basis for the predictor values and the outcome values are a noisy version of
the predictor. Datasets 9 and 10 are variants of dataset 8 but where the distance to the points (75;75)
and (100;100) are used, respectively. Dataset 11 had no SAC induced and functions as a null dataset.
Dataset 12 has datapoints perfectly distributed in a grid pattern, perfect negative SAC for the predictor
and uneven SAC for the outcome. The supplementary materials contain scatter plots for these
additional datasets. Table 6 shows the SAC values according to the 6 measures and Table 7 shows their
intercorrelations.
Dataset and variable Morans_I cd cd_sqrt knsn_3 knsn_10 knsn_50
ex1_outcome 0.403 0.452 0.673 0.741 0.789 0.798
ex1_predictor 0.432 0.547 0.739 0.764 0.812 0.819
ex1_outcome_predictor_resids 0.074 0.062 0.249 0.227 0.283 0.325
ex2_outcome 0.146 0.172 0.415 0.3 0.453 0.468
ex2_predictor 0.298 0.309 0.556 0.598 0.671 0.685
ex2_outcome_predictor_resids -0.002 0.02 0.141 -0.02 -0.021 -0.107
ex3_outcome 0.248 0.378 0.615 0.646 0.716 0.735
ex3_predictor 0.278 0.468 0.684 0.698 0.758 0.756
ex3_outcome_predictor_resids 0.038 0.054 0.233 0.16 0.271 0.303
ex4_outcome 0.09 0.176 0.419 0.268 0.386 0.445
ex4_predictor 0.18 0.263 0.512 0.507 0.601 0.635
ex4_outcome_predictor_resids -0.004 0.04 0.199 -0.018 0.014 -0.059
ex5_outcome 0.136 0.066 0.258 0.737 0.794 0.671
ex5_predictor 0.14 0.055 0.234 0.693 0.758 0.662
ex5_outcome_predictor_resids 0.021 0.026 0.162 0.14 0.28 0.243
ex6_outcome 0.031 0.041 0.202 0.246 0.365 0.323
ex6_predictor 0.105 0.042 0.204 0.54 0.632 0.565
ex6_outcome_predictor_resids -0.013 0.023 0.152 -0.092 -0.035 -0.101
ex7_outcome 0.176 0.065 0.256 0.98 0.994 0.729
ex7_predictor 0.219 0.101 0.317 0.987 0.997 0.831
ex7_outcome_predictor_resids 0.174 0.088 0.296 0.976 0.99 0.676
ex8_outcome 0.126 0.017 0.131 0.599 0.651 0.642
ex8_predictor 0.27 0.028 0.167 0.994 0.993 0.972
ex8_outcome_predictor_resids -0.007 0.002 0.04 0.018 -0.01 -0.171
ex9_outcome 0.188 0.201 0.448 0.595 0.645 0.65
ex9_predictor 0.395 0.483 0.695 0.997 0.997 0.985
ex9_outcome_predictor_resids -0.007 0 0.015 -0.02 -0.199
ex10_outcome 0.2 0.25 0.5 0.6 0.652 0.66
ex10_predictor 0.414 0.565 0.752 0.998 0.998 0.993
ex10_outcome_predictor_resids -0.007 0 0.015 0.016 -0.016 -0.176
ex11_outcome -0.002 -0.002 -0.027 -0.023 -0.04
ex11_predictor -0.006 0.001 0.037 0.019 -0.004 -0.126
ex11_outcome_predictor_resids -0.002 -0.002 -0.027 -0.023 -0.04
ex12_outcome 0.004 0.001 0.032 0.722 0.702 -0.609
ex12_predictor -0.006 0 -1 0.978 -0.969
ex12_outcome_predictor_resids 0.004 0.001 0.032 0.722 0.702 -0.609
Table 6: SAC measures of three variables in 12 datasets.
          Morans_I  cd     cd_sqrt  knsn_3  knsn_10  knsn_50
Morans_I  -         0.89   0.884    0.724   0.718    0.824
cd        0.882     -      0.968    0.501   0.493    0.639
cd_sqrt   0.876     1      -        0.487   0.533    0.736
knsn_3    0.822     0.652  0.523    -       0.689    0.761
knsn_10   0.767     0.591  0.587    0.879   -        0.547
knsn_50   0.931     0.831  0.809    0.793   0.721    -
Table 7: Intercorrelations among 6 SAC measures in 12 datasets. Pearson correlations above diagonal.
Spearman's rank-order below.1
All correlations between the measures were sizable, as one would expect. The CD measure showed
downward-biasing non-linearity, as indicated by its rank-order correlations with the KNSNR measures
being noticeably stronger than the corresponding Pearson correlations.
Behind the strong intercorrelations lie some surprising inconsistencies. Consider ex1_outcome vs.
ex7_outcome_predictor_resids. CD gives 0.452 and 0.088 while knsn_3 gives 0.741 and 0.976. What
this seems to mean is that there is a strong local SAC (which knsn_3 measures) in the residuals but a
weak global SAC (which CD measures).
The KNSNR results for dataset 12 need special attention. Because the points are distributed in an exact
grid pattern, many pairs of points are equidistant. This poses a problem for KNN-type approaches
because they have to pick the k nearest neighbors. In this case, the computer chooses one based on the
position in the distance matrix which is an arbitrary choice. If one is examining data distributed in a
perfect grid pattern, extra attention should be given to the choice of k. Notice also that cases close to
the edges have no neighbors in that direction (this isn't Pacman world), so the algorithm must choose
neighbors further away. This can give strange results.
7. Comparison of correction methods
In the previous sections, three different methods for controlling for SAC were discussed. It is useful to
do a straight up comparison with the best choice of parameters. Table 8 shows the uncontrolled and
SAC-controlled correlation in datasets 1-6 using correlation of distances (using semi-partial, partial and
multiple regression sub-methods), KNSNR (using all three options with regard to which
variable(s) to control for SAC, k=10) and SLR (using inverse weights, k=3).
Dataset Uncontrolled CD_d_sqrt CD_p_sqrt CD_b_sqrt CD_mr_sqrt KNSNR_d_k10 KNSNR_p_k10 KNSNR_b_k10 SLR_k3
1 0.664 0.356 0.367 0.356 0.402 0.042 0.046 -0.012 -0.008
2 0.684 0.622 0.633 0.622 0.65 0.447 0.481 0.532 0.436
3 0.664 0.445 0.456 0.445 0.485 0.194 0.199 0.235 0.1
4 0.684 0.628 0.635 0.628 0.646 0.518 0.533 0.583 0.456
5 0.661 0.611 0.61 0.611 0.611 0.08 0.089 0.034 0.129
6 0.662 0.626 0.626 0.626 0.626 0.487 0.514 0.548 0.501
Table 8: SAC corrected results across methods.
We see that for datasets 1-4, the CD approaches fail to correct for the spurious correlation since the
correlations after control are still sizable. For the same datasets we see that KNSNR and SLR perform
ok to well. For the most important datasets, 5-6, the CD approaches barely make a difference, but the
other two methods make a very large difference.
8. Discussion and conclusion
Table 9 shows an overview of SAC methods and their uses.
Approach Can induce SAC? Can measure SAC? Can control for SAC?
Correlation of distances ? Yes Yes
k nearest spatial neighbor Yes Yes Yes
Spatial local regression No No Yes
Moran's I and similar No Yes No
Table 9: Overview of SAC methods and their uses.
Correlation of distances (Piffer, 2015) is a clever idea, and a potentially powerful approach because it
can be used to both measure and control for SAC and possibly induce SAC too. However, the version
of it examined in this paper (“simple CD”) seems to be fairly useless in the sense that it failed to correct
for spurious causation in simulated datasets.
It is worth examining whether limiting CD to the nearest k neighbors or neighbors within a radius of a
given number of kilometers makes it a better measure. Another option is to weight the datapoints so that
their influence diminishes as the distance increases (e.g. inverse-distance weights). Both suggested
modifications attempt to deal with the problem that results from the global nature of the simple CD.
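The first suggestion, restricting CD to the nearest k neighbors, can be sketched as follows. This is a minimal Python illustration, not the implementation from the paper's R code: it assumes that a distance score for a pair of cases is the absolute difference in their values, and that spatial distance is Euclidean. The function name correlation_of_distances and its parameter k are hypothetical. The inverse-distance weighting variant is not shown.

```python
import numpy as np


def correlation_of_distances(coords, values, k=None):
    """Correlate pairwise spatial distance with pairwise absolute difference.

    Assumption: a 'distance score' for a variable is |value_i - value_j|.
    If k is given, only pairs in which one case is among the k nearest
    spatial neighbors of the other are used (a local variant).
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    n = len(values)
    # n x n matrix of Euclidean distances between cases.
    spatial = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    differences = np.abs(values[:, None] - values[None, :])
    if k is None:
        keep = np.ones((n, n), dtype=bool)  # simple (global) CD: all pairs
    else:
        ranked = spatial.copy()
        np.fill_diagonal(ranked, np.inf)  # a case is not its own neighbor
        nearest = np.argsort(ranked, axis=1)[:, :k]
        keep = np.zeros((n, n), dtype=bool)
        keep[np.repeat(np.arange(n), k), nearest.ravel()] = True
        keep = keep | keep.T  # keep a pair if either case is near the other
    upper = np.triu(np.ones((n, n), dtype=bool), k=1)  # count each pair once
    mask = keep & upper
    return np.corrcoef(spatial[mask], differences[mask])[0, 1]
```

Calling the function with k=None gives the global version over all pairs; passing, say, k=10 keeps only local pairs, so distant pairs no longer dominate the correlation.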
A second problem with CD approaches is that because distance scores are based on two datapoints, any
measurement error in these scores will have a compound effect on the difference scores, but the matter
is complicated (Trafimow, 2015). I'm not sure if this has an effect on the conversion back to an r-type
measure as described in this paper, but it might.
A third problem with CD approaches is that the number of distances quickly becomes very large. For example,
for a dataset with 10,000 cases, the number of distance scores is 49,995,000. This makes for fairly
resource-intensive calculations. Random sampling of datapoints could probably be used to offset this
problem at the price of some sampling error.
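The count follows from the number of unordered pairs, n(n-1)/2. A minimal Python sketch of the arithmetic, together with one possible way of drawing a random subset of pairs, is given below. The function names are hypothetical, and sampling pairs (rather than sampling cases and using all of their pairs) is only one reading of the suggestion.

```python
import random
from math import comb


def n_distance_scores(n_cases):
    """Number of unordered pairs of cases, n(n-1)/2."""
    return comb(n_cases, 2)


def sample_pairs(n_cases, n_samples, seed=0):
    """Draw n_samples distinct unordered pairs (i, j) with i < j."""
    if n_samples > n_distance_scores(n_cases):
        raise ValueError("more pairs requested than exist")
    rng = random.Random(seed)
    pairs = set()
    while len(pairs) < n_samples:
        i, j = rng.sample(range(n_cases), 2)  # two distinct cases
        pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


print(n_distance_scores(10_000))  # 49995000
```

The distance scores would then be computed only for the sampled pairs, which keeps memory use proportional to the sample size rather than to n squared.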
KNSNR approaches are powerful since they can be used to all three ends. The main weakness seems to
be that they require a tuning parameter. A suitable value could probably be found through re-sampling
methods similar to those used to find tuning parameters for e.g. lasso regression (James et al., 2013).
One problem with KNSNR is that it tends to fairly quickly reach a ceiling.
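One way such a search for k might look is sketched below in Python. It assumes that KNSNR predicts each case's value as the unweighted mean of its k nearest spatial neighbors and picks the k with the lowest mean squared prediction error. Because a case's own value never enters its own prediction, this error is already of the leave-one-out kind. The function names and the candidate grid are hypothetical, and this is not the procedure from the paper's R code.

```python
import numpy as np


def knsn_predictions(coords, values, k):
    """Predict each case as the mean value of its k nearest spatial neighbors."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # exclude the case itself
    neighbors = np.argsort(dist, axis=1)[:, :k]
    return values[neighbors].mean(axis=1)


def choose_k(coords, values, candidates):
    """Return the candidate k with the lowest mean squared prediction error."""
    values = np.asarray(values, dtype=float)
    errors = {
        k: float(np.mean((values - knsn_predictions(coords, values, k)) ** 2))
        for k in candidates
    }
    best_k = min(errors, key=errors.get)
    return best_k, errors
```

A call such as choose_k(coords, values, [1, 3, 5, 10, 20]) returns the selected k together with the error for every candidate, so the flatness of the error curve can be inspected as well.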
SLR approaches seem more limited because they can only be used to correct for SAC. They do,
however, seem to give similar results to KNSNR approaches, so they could probably be used alongside
KNSNR in a multi-method approach to controlling for SAC. Like KNSNR approaches, SLR requires a
tuning parameter, but it seems to always be better to use k=3. The smaller the k, the smaller the effect
of SAC in the data on the estimate, but the greater the sampling error.
Still, the present findings call for further research into all three approaches.
8.1. Limitations
Only 12 test datasets were examined. Testing the methods against each other on a larger and more
diverse set of datasets might change conclusions somewhat.
Of the traditional measures of SAC, only Moran's I was used in the comparisons. Further research
should compare the proposed measures with other already developed measures such as Geary's C, Gi,
Gi* and local Moran's I (Radil, 2011, p. 19).
Another method for correcting for SAC, spatial eigenvector mapping, used by Hassall and Sherratt (2011),
could not be used because I could not find a suitable R implementation.
Supplementary material and acknowledgments
The R source code for this paper is available at https://osf.io/f2pnr/files/
Thanks to Davide Piffer for helpful comments.
References
Albers, C. (2015). NWO, Gender bias and Simpson’s paradox. Retrieved from
http://blog.casperalbers.nl/science/nwo-gender-bias-and-simpsons-paradox/
Field, A. P. (2013). Discovering statistics using IBM SPSS statistics (4th ed.). Los Angeles: Sage.
Gelade, G. A. (2008). The geography of IQ. Intelligence, 36(6), 495–501.
http://doi.org/10.1016/j.intell.2008.01.004
Gordon, R. A. (2015). Regression analysis for the social sciences (2nd ed.). New York:
Routledge, Taylor & Francis Group.
Hassall, C., & Sherratt, T. N. (2011). Statistical inference and spatial patterns in correlates of IQ.
Intelligence, 39(5), 303–310. http://doi.org/10.1016/j.intell.2011.05.001
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research
findings. Thousand Oaks, Calif.: Sage. Retrieved from http://site.ebrary.com/id/10387875
James, G., Witten, D., Hastie, T., & Tibshirani, R. (Eds.). (2013). An introduction to statistical
learning: with applications in R. New York: Springer.
Kievit, R., Frankenhuis, W. E., Waldorp, L., & Borsboom, D. (2013). Simpson’s paradox in
psychological science: a practical guide. Quantitative Psychology and Measurement, 4, 513.
http://doi.org/10.3389/fpsyg.2013.00513
Piffer, D. (2015). A review of intelligence GWAS hits: Their relationship to country IQ and the issue of
spatial autocorrelation. Intelligence, 53, 43–50. http://doi.org/10.1016/j.intell.2015.08.008
Radil, S. M. (2011). Spatializing social networks: making space for theory in spatial analysis.
University of Illinois at Urbana-Champaign. Retrieved from
https://www.ideals.illinois.edu/handle/2142/26222
Trafimow, D. (2015). A defense against the alleged unreliability of difference scores. Cogent
Mathematics, 2(1), 1064626. http://doi.org/10.1080/23311835.2015.1064626
1 The reader with a sense for detail might wonder how the rank-order correlations for CD and CD_sqrt differ. Taking the
square root of a vector of numbers does not change their order, so how can the rank-order correlations be different? The
explanation is that sometimes the square root is undefined because the original number was negative. The correlations
are based on pair-wise, not case-wise complete data, which results in a slightly different set of cases used for the
correlations for CD and CD_sqrt.
Figure 17: Example 6. Flatland and predictor.
Figure 18: Scatter plot for predictor and outcome in dataset 5.
Figure 19: Scatter plot for predictor and outcome in dataset 6.
Figure 20: Local regression. At each point, a model is fit. Only nearby cases are
used to fit and they are weighted by their distance such that the more distant cases
weigh less. Copied from (James et al., 2013, p. 285).
Figure 21: Spatial local regression results for datasets 1-6 for varying
values of k. Inverse distance used as weight.