Confidence in Correlation
doi:10.13140/RG.2.2.23673.49769
Gunnar Taraldsen
Department of Mathematical Sciences
Norwegian University of Science and Technology
May 18, 2021
Abstract
In 1895 Karl Pearson published his definition of the empirical correlation coefficient, but the idea of statistical correlation was anticipated substantially before this. Linear regression, and the associated correlation, is the principal statistical methodology in many applied sciences. We derive an explicit formula for the exact confidence density of the correlation. This can be used to replace the approximations currently in use.

Keywords: confidence distributions; fiducial inference; correlation coefficient; binormal distribution; Gaussian law
1 Introduction
The result of an experiment is given by four points with (x, y) coordinates (773, 727), (777, 735), (284, 286), and (519, 573). There are reasons a priori for assuming a linear relationship. This is further supported by Figure 1, and a high value for the coefficient of determination R² = 97.00%. The R² equals the square of the empirical correlation r = 98.49%. An approximate 95% one-sided confidence interval for the correlation ρ based on the Fisher (1921) z-transformation is [66.08, 100]%. Linear interpolation in the table presented by Fisher (1930, p.534) gives an exact 95% confidence interval [67.42, 100]%.¹ Our Theorem 1, without linear interpolation, gives the true exact 95% confidence interval [67.39, 100]%.

Figure 1: A sample of size 4 with a regression line.

The previous analysis is probably familiar to many readers with the possible exception of the exact solution. Unfortunately, the exact solution by Fisher (1930) seems to be essentially forgotten. It can, and should, be implemented in standard statistical software and practice. The purpose of this paper is to explain the necessary theory, and to expand on the analysis given by Fisher (1930). The main result is an explicit formula for the exact confidence density for the correlation. This can be seen as adding an important example to the theory of confidence distributions as presented by Schweder and Hjort (2016). It can also be seen as an important example of fiducial inference as formulated by Hannig et al. (2016) and others.

¹ 66.4037 + (71.6298 − 66.4037)(98.4893 − 98.4298)/(98.7371 − 98.4298) = 67.4156
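The numbers in the introductory example are easy to reproduce. The following sketch (an illustration of the calculation, not part of the derivation; the constant z95 is the standard normal 95% quantile) computes r, R², and the one-sided z-transform bound:

```python
import math

# The four (x, y) points of the introductory experiment.
pts = [(773, 727), (777, 735), (284, 286), (519, 573)]

def pearson_r(pairs):
    """Empirical (Pearson) correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(pts)   # about 0.9849
r2 = r * r           # coefficient of determination, about 0.9700

# One-sided 95% lower bound via the Fisher z-transformation:
# tanh(atanh(r) - z_0.95 / sqrt(n - 3)).
z95 = 1.6448536269514722  # standard normal 95% quantile
lower = math.tanh(math.atanh(r) - z95 / math.sqrt(len(pts) - 3))  # about 0.6608
```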
Much has been written on the correlation coefficient. A major source of inspiration for the presented proof is the above mentioned references and the work of Hotelling (1953). Instead of giving a more thorough introduction we refer to Rodgers and Nicewander (1988) and Rovine and von Eye (1997), which give further references and several different interpretations of the correlation.
2 Theory
The correlation $\rho$ between two random variables $X$ and $Y$ equals the cosine of the angle $\beta$ between $X - \mu_X$ and $Y - \mu_Y$ in the Hilbert space of finite variance random variables. It is given by
$$\rho = \cos(\beta) = \frac{\langle X - \mu_X,\, Y - \mu_Y\rangle}{\sigma_X \sigma_Y} \qquad (1)$$
exactly as for the calculus definition for vectors in $\mathbb{R}^2$. The inner product $\langle X, Y\rangle = \mathrm{E}(XY) = \int X(\omega) Y(\omega)\, \mathrm{P}(d\omega)$ defines $\|X\|^2 = \langle X, X\rangle$ and orthogonality $X \perp Y$ by $\langle X, Y\rangle = 0$. The mean $\mu_X$ and standard deviation $\sigma_X$ of the random variable $X$ equal the projection $\mu_X = \langle 1, X\rangle$ and the norm $\sigma_X = \|X - \mu_X\|$.
The reader may feel that the Hilbert space approach is unnecessarily abstract. It is, in fact, rather useful. The problem has now been reformulated into the problem of making inference regarding an angle between two vectors based on a random sample. The empirical correlation $r$ for a random sample of size $n$ is then, naturally, given by the cosine of the angle between the vectors $(x_i - \bar{x})$, $(y_i - \bar{y})$ in $\mathbb{R}^n$. This was the key geometrical idea when Fisher (1915) derived his explicit formula for the probability density of the empirical correlation. The Hilbert space approach is also very well described and motivated by Brockwell and Davis (1991) in their text on time series, where the correlation function of a process is the main tool.
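The geometric picture can be made concrete with the sample from the introduction: the empirical correlation is the cosine of the angle between the centered coordinate vectors. A minimal sketch (an illustration; the function name is ours):

```python
import math

def cosine(a, b):
    """Cosine of the angle between two vectors, <a, b> / (||a|| ||b||)."""
    dot = sum(ai * bi for ai, bi in zip(a, b))
    return dot / math.sqrt(sum(ai * ai for ai in a) * sum(bi * bi for bi in b))

x = [773, 777, 284, 519]
y = [727, 735, 286, 573]
mx = sum(x) / len(x)
my = sum(y) / len(y)

# r is the cosine of the angle between (x_i - xbar) and (y_i - ybar) in R^n.
r = cosine([xi - mx for xi in x], [yi - my for yi in y])  # about 0.9849
beta = math.acos(r)  # the empirical angle, in radians
```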
The best linear predictor $\hat{Y}$ of $Y$ given $X$ is the projection
$$\hat{Y} = \mu_Y + \rho\,\sigma_Y\,\frac{X - \mu_X}{\sigma_X} \qquad (2)$$
of $Y$ onto the subspace spanned by the orthonormal basis $\{1, (X - \mu_X)/\sigma_X\}$. Alternatively, equation (2) can be seen to correspond to the elementary definition of the cosine from a triangle in a plane. The angle in the triangle is given by the angle $\beta$ between the vectors $X - \mu_X$ and $Y - \mu_Y$ spanning a two dimensional subspace. Equation (2) gives that the correlation can be interpreted as the slope of the best predictor line for standardized variables. This is possibly the most direct natural interpretation in applications. Furthermore, it can be generalized to give a similar interpretation for partial correlation.
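The slope interpretation is easy to verify numerically: after standardizing both variables, the least-squares slope equals $r$. A sketch (an illustration, using the sample from the introduction):

```python
import math
import statistics

x = [773, 777, 284, 519]
y = [727, 735, 286, 573]

def standardize(v):
    """Center and scale by the population standard deviation."""
    m, s = statistics.fmean(v), statistics.pstdev(v)
    return [(vi - m) / s for vi in v]

xs, ys = standardize(x), standardize(y)

# Least-squares slope through the origin for standardized variables:
# sum(xs*ys) / sum(xs^2), which equals the empirical correlation r.
slope = sum(a * b for a, b in zip(xs, ys)) / sum(a * a for a in xs)  # about 0.9849
```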
If $X$ and $Y$ are jointly Gaussian, then $\hat{Y} = \mathrm{E}(Y\,|\,X)$, and the conditional law of $Y$ given $X = x$ is Gaussian. This gives the link between equation (2) and ordinary regression
$$y_i = a + b x_i + \sigma v_i \qquad (3)$$
where $v_1, \ldots, v_n$ is a random sample from the standard normal law. Comparison of equation (3) with equation (2) gives the constant term $a = \mu_Y - \rho\,\mu_X\,\sigma_Y/\sigma_X$, the slope $b = \rho\,\sigma_Y/\sigma_X$, and the conditional variance $\sigma^2 = (1 - \rho^2)\,\sigma_Y^2$. The binormal is hence parameterized alternatively by $(\mu_X, \sigma_X, a, b, \sigma^2)$. A random sample $((x_1, y_1), \ldots, (x_n, y_n))$ of size $n$ from the binormal can be generated by equation (3) where $x_i = \mu_X + \sigma_X u_i$ and $u_1, \ldots, u_n$ is a random sample from the standard normal law. Combining gives
$$\begin{pmatrix} x_i \\ y_i \end{pmatrix} = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix} + \begin{pmatrix} \sigma_X & 0 \\ \rho\,\sigma_Y & \sqrt{1 - \rho^2}\,\sigma_Y \end{pmatrix} \begin{pmatrix} u_i \\ v_i \end{pmatrix} \qquad (4)$$
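Equation (4) is a data generating algorithm. A minimal sketch (an illustration; parameter names are ours) generates binormal pairs and confirms that the empirical correlation is close to the true $\rho$ for a large sample:

```python
import math
import random

random.seed(1)

def binormal_sample(n, mu_x, mu_y, sd_x, sd_y, rho):
    """Generate n pairs from the binormal via the data generating equation (4):
    x = mu_x + sd_x*u,  y = mu_y + rho*sd_y*u + sqrt(1 - rho^2)*sd_y*v."""
    out = []
    for _ in range(n):
        u, v = random.gauss(0, 1), random.gauss(0, 1)
        x = mu_x + sd_x * u
        y = mu_y + rho * sd_y * u + math.sqrt(1 - rho ** 2) * sd_y * v
        out.append((x, y))
    return out

def pearson_r(pairs):
    """Empirical (Pearson) correlation of a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    syy = sum((y - my) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy)

sample = binormal_sample(100_000, mu_x=0.0, mu_y=0.0, sd_x=1.0, sd_y=2.0, rho=0.7)
r = pearson_r(sample)  # close to the true rho = 0.7
```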
The data generating equation (4), and multivariate generalizations beyond Gaussian, is treated by Fraser (1968, 1979). This involves group actions, maximal invariants, and leads to optimal inference methods as demonstrated by Taraldsen and Lindqvist (2013). A particular consequence of equation (4), proved by Fraser (1964, p.853), is
$$\sqrt{u}\,\frac{\rho}{\sqrt{1 - \rho^2}} - \sqrt{v}\,\frac{r}{\sqrt{1 - r^2}} = z \qquad (5)$$
where $u \sim \chi^2(\nu)$, $v \sim \chi^2(\nu - 1)$, $z \sim \mathrm{N}(0, 1)$ are independent.

Equation (5) gives the law of $\rho$ when $r$ is known. The degrees of freedom $\nu = n - 1$ for sample size $n$. With known means, $\nu = n$ and $r$ is the cosine of the angle between the vectors $(x_i - \mu_X)$, $(y_i - \mu_Y)$ in $\mathbb{R}^n$. The following result holds for any real $\nu > 1$ as a consequence of equation (5).
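Equation (5) can be read directly as a sampler for the confidence distribution: solve for $s = \rho/\sqrt{1-\rho^2}$ given $r$. A Monte Carlo sketch (an illustration; function names are ours) approximates the exact 95% lower bound of the introductory example, which should land near 0.674:

```python
import math
import random

random.seed(2)

def rho_draws(r, nu, n_draws):
    """Draw from the confidence distribution of rho given r by solving
    equation (5) for s = rho/sqrt(1 - rho^2):
        s = (sqrt(v)*t + z) / sqrt(u),  t = r/sqrt(1 - r^2),
    with u ~ chi2(nu), v ~ chi2(nu - 1), z ~ N(0, 1) independent."""
    t = r / math.sqrt(1 - r * r)
    draws = []
    for _ in range(n_draws):
        u = sum(random.gauss(0, 1) ** 2 for _ in range(nu))
        v = sum(random.gauss(0, 1) ** 2 for _ in range(nu - 1))
        z = random.gauss(0, 1)
        s = (math.sqrt(v) * t + z) / math.sqrt(u)
        draws.append(s / math.sqrt(1 + s * s))
    return draws

# Sample of size 4 from the introduction: r = 0.9849, nu = n - 1 = 3.
draws = sorted(rho_draws(0.9849, 3, 100_000))
lower95 = draws[int(0.05 * len(draws))]  # near the exact bound 0.674
```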
Theorem 1. Let $r$ be the empirical correlation of a random sample of size $n$ from the binormal. The confidence density for the correlation $\rho$ is
$$\pi(\rho\,|\,r, \nu) = \frac{\nu(\nu - 1)\,\Gamma(\nu - 1)}{\sqrt{2\pi}\,\Gamma(\nu + \tfrac12)}\,(1 - r^2)^{\frac{\nu - 1}{2}} \cdot (1 - \rho^2)^{\frac{\nu - 2}{2}} \cdot (1 - \rho r)^{\frac{1 - 2\nu}{2}}\, F\!\left(\tfrac32, -\tfrac12;\, \nu + \tfrac12;\, \tfrac{1 + \rho r}{2}\right)$$
where $F$ is the Gaussian hypergeometric function and $\nu = n - 1 > 1$.
Proof. The idea is that equation (5) gives the conditional densities, and hence the marginal densities after integration over $u, v$. This integration is done by a change of variables resulting in a gamma integral and the above density. The details are as follows.

Let $s = \rho/\sqrt{1 - \rho^2}$ and $t = r/\sqrt{1 - r^2}$. The conditional density of $s$ given $u, v$ is normal by equation (5) with $(s\,|\,u, v) \sim \mathrm{N}(\sqrt{v/u}\,t,\, 1/u)$. Using this, the law of $u, v$, and $ds/d\rho = (1 - \rho^2)^{-3/2}$ give the joint density of $\rho, u, v$ as
$$(1 - \rho^2)^{-3/2} \cdot \frac{u^{\frac{\nu}{2} - 1} e^{-\frac{u}{2}}}{2^{\frac{\nu}{2}}\,\Gamma(\frac{\nu}{2})} \cdot \frac{v^{\frac{\nu - 1}{2} - 1} e^{-\frac{v}{2}}}{2^{\frac{\nu - 1}{2}}\,\Gamma(\frac{\nu - 1}{2})} \cdot \sqrt{\frac{u}{2\pi}}\, e^{-\frac{u}{2}\left(s - \sqrt{\frac{v}{u}}\,t\right)^2} \qquad (6)$$

The terms in the exponential are
$$-\frac{1}{2}\left[\frac{u}{1 - \rho^2} - \frac{2\sqrt{uv}\,\rho r}{\sqrt{(1 - \rho^2)(1 - r^2)}} + \frac{v}{1 - r^2}\right] = -\frac{\nu(s_1^2 - 2 s_1 s_2\,\rho r + s_2^2)}{2(1 - r^2)} \qquad (7)$$
using new coordinates $(s_1, s_2)$ defined by $\nu s_1^2 = u(1 - r^2)/(1 - \rho^2)$ and $\nu s_2^2 = v$.

Let $s_1 = \sqrt{\alpha}\,\exp(\beta/2)$ and $s_2 = \sqrt{\alpha}\,\exp(-\beta/2)$. The density for $\rho, \alpha, \beta$ from equation (6) is
$$\frac{2^{1 - \nu}\,\nu^{\nu}}{\sqrt{\pi}\,\Gamma(\frac{\nu}{2})\,\Gamma(\frac{\nu - 1}{2})}\,(1 - r^2)^{-\frac{\nu + 1}{2}}\,(1 - \rho^2)^{\frac{\nu - 2}{2}}\, e^{\beta}\,\alpha^{\nu - 1}\, e^{-\frac{\nu\alpha(\cosh(\beta) - \rho r)}{1 - r^2}} \qquad (8)$$

Integration over $\alpha$ gives $\pi(\rho\,|\,r, \nu)$ using the identity $\sqrt{\pi}\,(\nu - 2)! = 2^{\nu - 2}\,\Gamma(\frac{\nu}{2})\,\Gamma(\frac{\nu - 1}{2})$ and adjusting an integral representation of $F$ (Olver et al., 2010, 14.3.9, 14.12.4).
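Theorem 1 can be checked numerically. The sketch below (an illustration; the power-series evaluation of $F$ is our own, adequate here since the argument $(1+\rho r)/2$ lies in $[0,1)$ and $c-a-b=\nu-\tfrac12>0$) evaluates the confidence density and verifies that it integrates to one for the introductory example with $r = 0.9849$ and $\nu = 3$:

```python
import math

def hyp2f1(a, b, c, x, terms=400):
    """Gauss hypergeometric series F(a, b; c; x) by direct summation."""
    total, term = 1.0, 1.0
    for k in range(terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * x
        total += term
        if abs(term) < 1e-15 * abs(total):
            break
    return total

def confidence_density(rho, r, nu):
    """Exact confidence density of the correlation from Theorem 1."""
    const = (nu * (nu - 1) * math.gamma(nu - 1)
             / (math.sqrt(2 * math.pi) * math.gamma(nu + 0.5)))
    return (const
            * (1 - r * r) ** ((nu - 1) / 2)
            * (1 - rho * rho) ** ((nu - 2) / 2)
            * (1 - rho * r) ** ((1 - 2 * nu) / 2)
            * hyp2f1(1.5, -0.5, nu + 0.5, (1 + rho * r) / 2))

# Trapezoidal check that the density integrates to one over (-1, 1);
# the endpoints are omitted since the density vanishes there for nu = 3.
m = 8000
h = 2.0 / m
total = sum(confidence_density(-1.0 + i * h, 0.9849, 3) for i in range(1, m)) * h
```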
The Fisher (1921) z-transformation argument implies
$$\frac{1}{2}\ln\!\left(\frac{1 + \rho}{1 - \rho}\right) - \frac{1}{2}\ln\!\left(\frac{1 + r}{1 - r}\right) \approx z/\sqrt{\nu - 2} \qquad (9)$$
Replacing equation (5) with this gives the z-transform approximate confidence density
$$\tilde{\pi}(\rho\,|\,r, \nu) = \sqrt{\frac{\nu - 2}{2\pi}}\,(1 - \rho^2)^{-1}\, e^{-\frac{\nu - 2}{8}\left[\ln\left(\frac{(1 + \rho)(1 - r)}{(1 - \rho)(1 + r)}\right)\right]^2} \qquad (10)$$
for $\nu > 2$.
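Since equation (10) is an exact change of variables of a normal density, it must integrate to one over $(-1, 1)$, which gives a quick numerical check. A sketch (an illustration, using the values $r = 0.7232$, $\nu = 19$ from the cd4 example below):

```python
import math

def z_density(rho, r, nu):
    """Approximate confidence density from the z-transformation, equation (10)."""
    log_term = math.log((1 + rho) * (1 - r) / ((1 - rho) * (1 + r)))
    return (math.sqrt((nu - 2) / (2 * math.pi)) / (1 - rho * rho)
            * math.exp(-(nu - 2) / 8 * log_term ** 2))

# Trapezoidal integration over (-1, 1); the endpoints are excluded
# since the integrand tends to zero there.
m = 20_000
h = 2.0 / m
total = sum(z_density(-1.0 + i * h, 0.7232, 19) for i in range(1, m)) * h
```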
The very last formula in the classical book by Fisher (1973) gives
$$\pi(\rho\,|\,r, \nu) = \frac{(1 - r^2)^{\frac{\nu - 1}{2}} \cdot (1 - \rho^2)^{\frac{\nu - 2}{2}}}{\pi\,(\nu - 2)!}\,\frac{\partial^{\nu - 2}}{\partial(\rho r)^{\nu - 2}}\left\{\frac{\theta - \frac{1}{2}\sin 2\theta}{\sin^3\theta}\right\} \qquad (11)$$
where $\cos\theta = -\rho r$ and $0 < \theta < \pi$. This formula was derived by C. R. Rao.
3 Examples
Fisher (1930, p.534) considers the case with an observed correlation r = 99% from a sample of size n = 4. Fisher, relying on calculations of Miss F. E. Allan, states that the corresponding 5% ρ is equal to about 76.5%. Using equation (5) on these ρ and r values confirms this.

Consider next the data presented in Figure 1. Theorem 1 applied on these data confirms the calculations giving the stated confidence intervals. More complete information is given by the confidence densities shown in Figure 2. The empirical correlation is r = 0.9849. The exact confidence density in Figure 2 illustrates the uncertainty corresponding to all possible confidence intervals with all possible confidence levels.
Figure 2: The confidence density and the z-transform density.
Figure 3: The cd4 data of DiCiccio and Efron (1996, Table 1).
Figure 4: The confidence density and the z-transform density.
Figure 3 shows the cd4 counts for 20 HIV-positive subjects (Efron, 1998, p.101). The x-axis gives the baseline count and the y-axis gives the count after one year of treatment with an experimental antiviral drug. The empirical correlation is r = 0.7232, and the equitail z-transform 90% approximate confidence interval is [47.41, 86.51]%. Figure 4 shows the closeness of the confidence density and the z-transform density. The exact equitail 90% confidence interval from Theorem 1 is [46.54, 85.74]%. It is shifted to the left, as can also be inferred from Figure 4.
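The z-transform interval quoted above can be reproduced directly. A sketch (an illustration; z95 denotes the standard normal 95% quantile, so the equitail interval has 90% coverage):

```python
import math

# Equitail z-transform 90% interval for the cd4 data: r = 0.7232, n = 20.
r, n = 0.7232, 20
z95 = 1.6448536269514722  # standard normal 95% quantile
half_width = z95 / math.sqrt(n - 3)
lo = math.tanh(math.atanh(r) - half_width)  # about 0.4741
hi = math.tanh(math.atanh(r) + half_width)  # about 0.8651
```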
As a final example, consider certain pairs of measurements with r = 0.534 taken on n = 8 children at age 24 months in connection with a study at a university hospital in Hong Kong. Figure 5 shows again the closeness of the confidence density and the z-transform density. Schweder and Hjort (2016, p.227, Figure 7.8) discuss this example in much more detail, including different bootstrap approaches. Using the method of Fisher (1930) they arrive at the same plot of the confidence density using the exact distribution for the empirical correlation. This provides additional verification of the exact result in Theorem 1.
Figure 5: The confidence density and the z-transform density.
4 Conclusion
The z-transform has been historically convenient since confidence intervals can be
calculated directly from tables of the standard normal distribution. Today, there
seems to be little reason for using this in stead of the exact result in Theorem 1
since the hypergeometric function is implemented in standard numerical libraries.
References
Brockwell, P. J. and R. A. Davis (1991). Time Series: Theory and Methods.
Springer Series in Statistics. New York: Springer-Verlag.
DiCiccio, T. J. and B. Efron (1996). Bootstrap Confidence Intervals. Statistical
Science 11 (3), 189–212.
Efron, B. (1998). R. A. Fisher in the 21st century (Invited paper presented at the
1996 R. A. Fisher Lecture). Statistical Science 13 (2), 95–122.
Fisher, R. A. (1915). Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika 10, 507–21.
Fisher, R. A. (1921). On the ’probable error’ of a coefficient of correlation deduced
from a small sample. Metron 1 (4), 1–32.
Fisher, R. A. (1930). Inverse probability. Proc. Camb. Phil. Soc. 26, 528–535.
Fisher, R. A. (1973). Statistical Methods and Scientific Inference. Hafner Press.
Fraser, D. A. S. (1964). On the definition of fiducial probability. Bull. Int. Statist. Inst. 40, 842–856.
Fraser, D. A. S. (1968). The Structure of Inference. John Wiley.
Fraser, D. A. S. (1979). Inference and Linear Models. McGraw-Hill.
Hannig, J., H. Iyer, R. C. S. Lai, and T. C. M. Lee (2016). Generalized Fiducial
Inference: A Review and New Results. Journal of the American Statistical
Association 111 (515), 1346–1361.
Hotelling, H. (1953). New Light on the Correlation Coefficient and its Transforms.
Journal of the Royal Statistical Society. Series B (Methodological) 15 (2), 193–
232.
Olver, F. W. J., D. W. Lozier, R. F. Boisvert, and C. W. Clark (Eds.) (2010).
NIST Handbook of Mathematical Functions. Cambridge University Press.
Rodgers, J. L. and W. A. Nicewander (1988). Thirteen Ways to Look at the
Correlation Coefficient. The American Statistician 42 (1), 59–66.
Rovine, M. J. and A. von Eye (1997). A 14th Way to Look at a Correlation
Coefficient: Correlation as the Proportion of Matches. The American Statisti-
cian 51 (1), 42–46.
Schweder, T. and N. L. Hjort (2016). Confidence, Likelihood, Probability: Statis-
tical Inference with Confidence Distributions. Cambridge University Press.
Taraldsen, G. and B. H. Lindqvist (2013). Fiducial theory and optimal inference.
Annals of Statistics 41 (1), 323–341.