
# Methods and Algorithms for Correlation Analysis in R


## Abstract

Correlation tests are arguably among the most commonly used statistical procedures and serve as a basis for many applications, such as exploratory data analysis, structural modelling, and data engineering. In this context, we present *correlation*, a toolbox for the R language and part of the *easystats* collection, focused on correlation analysis. Its goal is to be lightweight and easy to use, while allowing for the computation of many different kinds of correlations.
Dominique Makowski¹, Mattan S. Ben-Shachar², Indrajeet Patil³, and Daniel Lüdecke⁴

¹ Nanyang Technological University, Singapore
² Ben-Gurion University of the Negev, Israel
³ Max Planck Institute for Human Development, Germany
⁴ University Medical Center Hamburg-Eppendorf, Germany
DOI: 10.21105/joss.02306
Editor: Mikkel Meyer Andersen
Reviewers: @markhwhiteii, @mmrabe
Submitted: 21 May 2020
Published: 16 July 2020

Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
## Introduction

Correlation tests are arguably among the most commonly used statistical procedures and serve as a basis for many applications, such as exploratory data analysis, structural modelling, and data engineering. In this context, we present *correlation*, a toolbox for the R language (R Core Team, 2019) and part of the *easystats* collection, focused on correlation analysis. Its goal is to be lightweight and easy to use, while allowing for the computation of many different kinds of correlations, such as:
- **Pearson's correlation**: This is the most common correlation method. It corresponds to the covariance of the two variables normalized (i.e., divided) by the product of their standard deviations.

  $$r_{xy} = \frac{\operatorname{cov}(x, y)}{SD_x \times SD_y}$$

- **Spearman's rank correlation**: A non-parametric measure of correlation, the Spearman correlation between two variables is equal to the Pearson correlation between the rank scores of those two variables; while Pearson's correlation assesses linear relationships, Spearman's correlation assesses monotonic relationships (whether linear or not). Confidence Intervals (CI) for Spearman's correlations are computed using the Fieller, Hartley, & Pearson (1957) correction (see Bishara & Hittner, 2017).

  $$r_{s_{xy}} = \frac{\operatorname{cov}(rank_x, rank_y)}{SD(rank_x) \times SD(rank_y)}$$

- **Kendall's rank correlation**: In the normal case, the Kendall correlation is preferred to the Spearman correlation because of a smaller gross error sensitivity (GES) and a smaller asymptotic variance (AV), making it more robust and more efficient. However, the interpretation of Kendall's tau is less direct than that of Spearman's rho, in the sense that it quantifies the difference between the percentage of concordant and discordant pairs among all possible pairwise events. Confidence Intervals (CI) for Kendall's correlations are computed using the Fieller et al. (1957) correction (see Bishara & Hittner, 2017). For each pair of observations (i, j) of two variables (x, y), it is defined as follows:

  $$\tau_{xy} = \frac{2}{n(n-1)} \sum_{i<j} \operatorname{sign}(x_i - x_j) \times \operatorname{sign}(y_i - y_j)$$
Makowski et al. (2020). Methods and Algorithms for Correlation Analysis in R. *Journal of Open Source Software*, 5(51), 2306. https://doi.org/10.21105/joss.02306
- **Biweight midcorrelation**: A measure of similarity that is median-based, instead of the traditional mean-based, thus being less sensitive to outliers. It can be used as a robust alternative to other similarity metrics, such as the Pearson correlation (Langfelder & Horvath, 2012).

- **Distance correlation**: Distance correlation measures both linear and non-linear association between two random variables or random vectors. This is in contrast to Pearson's correlation, which can only detect linear association between two random variables.

- **Percentage bend correlation**: Introduced by Wilcox (1994), it is based on a down-weighting of a specified percentage of marginal observations deviating from the median (by default, 20 percent).

- **Shepherd's Pi correlation**: Equivalent to a Spearman's rank correlation after outlier removal (by means of bootstrapped Mahalanobis distance).

- **Point-Biserial and Biserial correlation**: Correlation coefficient used when one variable is continuous and the other is dichotomous (binary). Point-Biserial is equivalent to a Pearson's correlation, while Biserial should be used when the binary variable is assumed to have an underlying continuity. For example, anxiety level can be measured on a continuous scale but classified dichotomously as high/low.

- **Polychoric correlation**: Correlation between two theorised normally distributed continuous latent variables, from two observed ordinal variables.

- **Tetrachoric correlation**: Special case of the polychoric correlation applicable when both observed variables are dichotomous.

- **Partial correlation**: Correlation between two variables after adjusting for the (linear) effect of one or more other variables. The correlation test is here run after having partialized the dataset, independently from it. In other words, it considers partialization as an independent step generating a different dataset, rather than belonging to the same model. This is why some discrepancies are to be expected for the t- and p-values (but not the correlation coefficient) compared to other implementations such as *ppcor*. Let $e_{x.z}$ be the residuals from the linear prediction of $x$ by $z$ (note that this can be expanded to a multivariate $z$):

  $$r_{xy.z} = r_{e_{x.z},\, e_{y.z}}$$

- **Multilevel correlation**: Multilevel correlations are a special case of partial correlations where the variable to be adjusted for is a factor and is included as a random effect in a mixed model.

These methods allow for different ways of quantifying the link between two variables (see Figure 1).
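Two of the definitions above can be checked with a short base-R sketch (no extra packages assumed): Spearman's rho as the Pearson correlation of the rank scores, and the partial correlation as the correlation of the residuals after regressing each variable on the adjustment variable.

```r
set.seed(123)
n <- 200
z <- rnorm(n)
x <- z + rnorm(n)  # x and y both depend on z
y <- z + rnorm(n)

# Spearman's rho equals the Pearson correlation of the rank scores
r_spearman <- cor(x, y, method = "spearman")
r_ranks    <- cor(rank(x), rank(y), method = "pearson")

# Partial correlation r_xy.z: correlate the residuals of x and y
# after removing the linear effect of z
e_xz <- residuals(lm(x ~ z))
e_yz <- residuals(lm(y ~ z))
r_partial <- cor(e_xz, e_yz)
```

Here `r_spearman` and `r_ranks` coincide up to floating-point error, and `r_partial` should be close to zero, since the association between `x` and `y` in this simulation comes entirely from their shared dependence on `z`.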
Figure 1: Illustration of the different correlation estimates (a measure of association, represented by the height of the bars) obtained via different methods for the same data (the scatter plot).
## Design

The package relies on one main function, correlation(), which outputs a dataframe containing each pairwise correlation per row. This long format is convenient for further data analysis, but less so for obtaining a summary, which is usually presented as a correlation matrix. To address this, we added standard methods, such as summary() and as.matrix(), to automatically transform the long output to a matrix. Moreover, *correlation* also includes plotting capabilities via the *see* package (Lüdecke et al., 2019a).

An overview of the features is available on the GitHub page (https://github.com/easystats/correlation). The typical core workflow is as follows:
```r
results <- correlation(iris)
results
# Parameter1   | Parameter2   | r     | 95% CI         | t     | df  | p      | Method  | n_Obs
# ---------------------------------------------------------------------------------------------
# Sepal.Length | Sepal.Width  | -0.12 | [-0.27,  0.04] | -1.44 | 148 | 0.152  | Pearson | 150
# Sepal.Length | Petal.Length |  0.87 | [ 0.83,  0.91] | 21.65 | 148 | < .001 | Pearson | 150
# Sepal.Length | Petal.Width  |  0.82 | [ 0.76,  0.86] | 17.30 | 148 | < .001 | Pearson | 150
# Sepal.Width  | Petal.Length | -0.43 | [-0.55, -0.29] | -5.77 | 148 | < .001 | Pearson | 150
# Sepal.Width  | Petal.Width  | -0.37 | [-0.50, -0.22] | -4.79 | 148 | < .001 | Pearson | 150
# Petal.Length | Petal.Width  |  0.96 | [ 0.95,  0.97] | 43.39 | 148 | < .001 | Pearson | 150
```
The output is not a square matrix, but a (tidy) dataframe with one correlation test per row. One can also obtain a matrix using:
```r
summary(results)
# Parameter    | Petal.Width | Petal.Length | Sepal.Width
# -------------------------------------------------------
# Sepal.Length |     0.82*** |      0.87*** |       -0.12
# Sepal.Width  |    -0.37*** |     -0.43*** |
# Petal.Length |     0.96*** |              |
```
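The long-to-matrix reshaping that summary() and as.matrix() perform can be sketched with base R alone; the toy table below mimics the Parameter1/Parameter2/r columns of the output shown above.

```r
# A toy long-format table of pairwise correlations, mimicking the
# structure of the correlation() output
long <- data.frame(
  Parameter1 = c("Sepal.Length", "Sepal.Length", "Sepal.Width"),
  Parameter2 = c("Sepal.Width", "Petal.Length", "Petal.Length"),
  r          = c(-0.12, 0.87, -0.43)
)

# xtabs() pivots the long rows into a wide, matrix-like layout
wide <- xtabs(r ~ Parameter1 + Parameter2, data = long)
```

This is only an illustration of the reshaping idea; the package's own methods additionally carry significance stars and handle the triangular layout.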
## Availability

The *correlation* package is distributed under the GNU General Public License (v3.0), with all its source code stored on GitHub (https://github.com/easystats/correlation) and a corresponding issue tracker for bug reporting and feature enhancements. In the spirit of honest and open science, we encourage requests/tips for fixes, feature updates, as well as general questions and concerns via direct interaction with contributors and developers.
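Assuming the released version is available on CRAN (and the development version on GitHub), a typical installation looks like:

```r
# Released version, assuming availability on CRAN
install.packages("correlation")

# Development version from the GitHub repository
# (assumes the 'remotes' helper package is installed)
# remotes::install_github("easystats/correlation")

library(correlation)
```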
## Acknowledgments

*correlation* is part of the *easystats* ecosystem (relying on *insight*; Lüdecke et al., 2019b, and *bayestestR*; Makowski, Ben-Shachar, & Lüdecke, 2019), a collaborative project created to facilitate the usage of R. Thus, we would like to thank the council of masters of easystats, all other padawan contributors, as well as the users.
## References

Bishara, A. J., & Hittner, J. B. (2017). Confidence intervals for correlations when data are not normal. *Behavior Research Methods*, 49(1), 294–309. doi:10.3758/s13428-016-0702-8

Fieller, E. C., Hartley, H. O., & Pearson, E. S. (1957). Tests for rank correlation coefficients. I. *Biometrika*, 44(3/4), 470–481. doi:10.1093/biomet/48.1-2.29

Langfelder, P., & Horvath, S. (2012). Fast R functions for robust correlations and hierarchical clustering. *Journal of Statistical Software*, 46(11). doi:10.18637/jss.v046.i11

Lüdecke, D., Waggoner, P., Ben-Shachar, M. S., & Makowski, D. (2019a). see: Visualisation toolbox for 'easystats' and extra geoms, themes and color palettes for 'ggplot2'. Retrieved from https://easystats.github.io/see/

Lüdecke, D., Waggoner, P., & Makowski, D. (2019b). insight: A unified interface to access information from model objects in R. *Journal of Open Source Software*, 4(38), 1412. doi:10.21105/joss.01412

Makowski, D., Ben-Shachar, M., & Lüdecke, D. (2019). bayestestR: Describing Effects and their Uncertainty, Existence and Significance within the Bayesian Framework. *Journal of Open Source Software*, 4(40), 1541. doi:10.21105/joss.01541

R Core Team. (2019). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/