Methods and Algorithms for Correlation Analysis in R

Dominique Makowski¹, Mattan S. Ben-Shachar², Indrajeet Patil³, and Daniel Lüdecke⁴

¹ Nanyang Technological University, Singapore
² Ben-Gurion University of the Negev, Israel
³ Max Planck Institute for Human Development, Germany
⁴ University Medical Center Hamburg-Eppendorf, Germany

DOI: 10.21105/joss.02306

Software
• Review
• Repository
• Archive

Editor: Mikkel Meyer Andersen

Reviewers:
• @markhwhiteii
• @mmrabe

Submitted: 21 May 2020
Published: 16 July 2020

License
Authors of papers retain copyright and release the work under a Creative Commons
Attribution 4.0 International License (CC BY 4.0).

Introduction

Correlation tests are arguably among the most commonly used statistical procedures, and serve
as a basis for many applications such as exploratory data analysis, structural modelling, and
data engineering. In this context, we present correlation, a toolbox for the R language
(R Core Team, 2019) and part of the easystats collection, focused on correlation analysis.
Its goal is to be lightweight and easy to use, and to allow for the computation of many
different kinds of correlations, such as:

• Pearson’s correlation: This is the most common correlation method. It corresponds
to the covariance of the two variables normalized (i.e., divided) by the product of their
standard deviations.

$$r_{xy} = \frac{\mathrm{cov}(x, y)}{SD_x \times SD_y}$$

• Spearman’s rank correlation: A non-parametric measure of correlation, the Spearman
correlation between two variables is equal to the Pearson correlation between the rank
scores of those two variables; while Pearson’s correlation assesses linear relationships,
Spearman’s correlation assesses monotonic relationships (whether linear or not).
Confidence Intervals (CI) for Spearman’s correlations are computed using the Fieller, Hartley,
& Pearson (1957) correction (see Bishara & Hittner, 2017).

$$r_{s_{xy}} = \frac{\mathrm{cov}(rank_x, rank_y)}{SD(rank_x) \times SD(rank_y)}$$

• Kendall’s rank correlation: In the normal case, the Kendall correlation is preferred
to the Spearman correlation because of a smaller gross error sensitivity (GES) and a
smaller asymptotic variance (AV), making it more robust and more efficient. However,
the interpretation of Kendall’s tau is less direct compared to that of the Spearman’s rho,
in the sense that it quantifies the difference between the % of concordant and discordant
pairs among all possible pairwise events. Confidence Intervals (CI) for Kendall’s
correlations are computed using the Fieller et al. (1957) correction (see Bishara & Hittner,
2017). For each pair of observations (i, j) of two variables (x, y), it is defined as follows:

$$\tau_{xy} = \frac{2}{n(n-1)} \sum_{i<j} \mathrm{sign}(x_i - x_j) \times \mathrm{sign}(y_i - y_j)$$

Makowski et al., (2020). Methods and Algorithms for Correlation Analysis in R. Journal of Open Source Software, 5(51), 2306. https://doi.org/10.21105/joss.02306

• Biweight midcorrelation: A measure of similarity that is median-based, instead of
the traditional mean-based, thus being less sensitive to outliers. It can be used as a
robust alternative to other similarity metrics, such as Pearson correlation (Langfelder &
Horvath, 2012).

• Distance correlation: Distance correlation measures both linear and non-linear
association between two random variables or random vectors. This is in contrast to Pearson’s
correlation, which can only detect linear association between two random variables.

• Percentage bend correlation: Introduced by Wilcox (1994), it is based on a
down-weight of a specified percentage of marginal observations deviating from the median (by
default, 20 percent).

• Shepherd’s Pi correlation: Equivalent to a Spearman’s rank correlation after outlier
removal (by means of bootstrapped Mahalanobis distance).

• Point-Biserial and biserial correlation: Correlation coefficient used when one variable
is continuous and the other is dichotomous (binary). Point-Biserial is equivalent to a
Pearson’s correlation, while Biserial should be used when the binary variable is assumed
to have an underlying continuity. For example, anxiety level can be measured on a
continuous scale, but can be classified dichotomously as high/low.

• Polychoric correlation: Correlation between two theorised normally distributed
continuous latent variables, from two observed ordinal variables.

• Tetrachoric correlation: Special case of the polychoric correlation applicable when
both observed variables are dichotomous.

• Partial correlation: Correlation between two variables after adjusting for the (linear)
effect of one or more variables. The correlation test is here run after having partialized
the dataset, independently from it. In other words, it considers partialization as an
independent step generating a different dataset, rather than belonging to the same
model. This is why some discrepancies are to be expected for the t- and the p-values
(but not the correlation coefficient) compared to other implementations such as ppcor.
Let $e_{x.z}$ be the residuals from the linear prediction of x by z (note that this can be
expanded to a multivariate z):

$$r_{xy.z} = r_{e_{x.z}, e_{y.z}}$$

• Multilevel correlation: Multilevel correlations are a special case of partial correlations
where the variable to be adjusted for is a factor and is included as a random effect in a
mixed model.

These methods allow for different ways of quantifying the link between two variables (see
Figure 1).
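The Pearson, Spearman, Kendall, and partial correlation definitions given above can be checked numerically. The following is a minimal base-R sketch, not the package's internal implementation: the simulated variables x, y, and z, the seed, and the sample size are illustrative assumptions, and the Kendall sum corresponds to the formula above only in the absence of ties (which holds for continuous simulated data).

```r
# Base-R sketch of four of the definitions above (not the package internals).
set.seed(42)                      # arbitrary seed, for reproducibility only
n <- 200                          # illustrative sample size
z <- rnorm(n)                     # hypothetical covariate to adjust for
x <- z + rnorm(n)                 # x and y both depend on z
y <- 0.5 * z + rnorm(n)

# Pearson: covariance divided by the product of the standard deviations
r_pearson <- cov(x, y) / (sd(x) * sd(y))

# Spearman: Pearson correlation computed on the rank scores
r_spearman <- cor(rank(x), rank(y))

# Kendall: scaled sum of sign products over all pairs i < j (no ties here)
idx <- combn(n, 2)                # each column is one pair (i, j) with i < j
tau <- 2 / (n * (n - 1)) *
  sum(sign(x[idx[1, ]] - x[idx[2, ]]) * sign(y[idx[1, ]] - y[idx[2, ]]))

# Partial: Pearson correlation between the residuals of x ~ z and y ~ z
r_partial <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# The hand-computed values agree with R's built-in estimators
stopifnot(
  isTRUE(all.equal(r_pearson, cor(x, y))),
  isTRUE(all.equal(r_spearman, cor(x, y, method = "spearman"))),
  isTRUE(all.equal(tau, cor(x, y, method = "kendall")))
)
```

Because x and y share the influence of z in this simulation, the partial coefficient is expected to be closer to zero than the raw Pearson coefficient.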

Figure 1: Illustration of the different correlation estimates (a measure of association, represented by
the height of the bars) obtained via different methods for the same data (the scatter plot).

Design

The package relies on one main function, correlation(), which outputs a dataframe containing one
pairwise correlation per row. This long format is convenient for further data analysis, but less
so for getting an overview, which is usually obtained via a correlation matrix. To address this, we
added standard methods, such as summary() and as.matrix(), to automatically transform
the long output to a matrix. Moreover, correlation also includes plotting capabilities via the
see package (Lüdecke et al., 2019a).

An overview of the features is available on the GitHub page
(https://github.com/easystats/correlation). The typical core workflow is as follows:

results <- correlation(iris)
results
# Parameter1   | Parameter2   |     r | 95% CI         |     t |  df |      p | Method  | n_Obs
# ----------------------------------------------------------------------------------------------
# Sepal.Length | Sepal.Width  | -0.12 | [-0.27,  0.04] | -1.44 | 148 |  0.152 | Pearson |   150
# Sepal.Length | Petal.Length |  0.87 | [ 0.83,  0.91] | 21.65 | 148 | < .001 | Pearson |   150
# Sepal.Length | Petal.Width  |  0.82 | [ 0.76,  0.86] | 17.30 | 148 | < .001 | Pearson |   150
# Sepal.Width  | Petal.Length | -0.43 | [-0.55, -0.29] | -5.77 | 148 | < .001 | Pearson |   150
# Sepal.Width  | Petal.Width  | -0.37 | [-0.50, -0.22] | -4.79 | 148 | < .001 | Pearson |   150
# Petal.Length | Petal.Width  |  0.96 | [ 0.95,  0.97] | 43.39 | 148 | < .001 | Pearson |   150

The output is not a square matrix, but a (tidy) dataframe with one correlation test per row.
One can also obtain a matrix using:

summary(results)
# Parameter    | Petal.Width | Petal.Length | Sepal.Width
# -------------------------------------------------------
# Sepal.Length |     0.82*** |      0.87*** |       -0.12
# Sepal.Width  |    -0.37*** |     -0.43*** |
# Petal.Length |     0.96*** |              |
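The same workflow should extend to the other methods listed in the Introduction. The sketch below assumes that the correlation package is installed and that the method is selected through a `method` argument; that argument name is not stated in the text above and is taken here from the package documentation, so it should be treated as an assumption. The object names results_spearman and r_matrix are illustrative. No output is shown because it may vary across package versions.

```r
# Hedged sketch: requires the correlation package; skipped when it is absent.
if (requireNamespace("correlation", quietly = TRUE)) {
  library(correlation)

  # Assumed argument: `method` selects the type of correlation
  results_spearman <- correlation(iris, method = "spearman")

  # Long (tidy) output: one row per pair of numeric variables
  print(results_spearman)

  # as.matrix() is one of the standard methods mentioned in the Design section
  r_matrix <- as.matrix(results_spearman)
  print(r_matrix)
}
```

Keeping the long dataframe as the primary output and deriving the matrix on demand means that filtering or sorting pairs can be done with ordinary dataframe tools before any summary is produced.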

Availability

The correlation package can be downloaded and installed from CRAN. It is licensed under
the GNU General Public License (v3.0), with all its source code stored at GitHub, and with
a corresponding issue tracker for bug reporting and feature enhancements. In the spirit of
honest and open science, we encourage requests/tips for fixes, feature updates, as well as
general questions and concerns via direct interaction with contributors and developers.

Acknowledgments

correlation is part of the easystats ecosystem (relying on insight; Lüdecke et al., 2019b and

bayestestR; Makowski, Ben-Shachar, & Lüdecke, 2019), a collaborative project created to

facilitate the usage of R. Thus, we would like to thank the council of masters of easystats, all

other padawan contributors, as well as the users.

References

Bishara, A. J., & Hittner, J. B. (2017). Confidence intervals for correlations when data are not normal. Behavior research methods, 49(1), 294–309. doi:10.3758/s13428-016-0702-8

Fieller, E. C., Hartley, H. O., & Pearson, E. S. (1957). Tests for rank correlation coefficients. I. Biometrika, 44(3/4), 470–481. doi:10.1093/biomet/48.1-2.29

Langfelder, P., & Horvath, S. (2012). Fast R functions for robust correlations and hierarchical clustering. Journal of statistical software, 46(11). doi:10.18637/jss.v046.i11

Lüdecke, D., Waggoner, P., Ben-Shachar, M. S., & Makowski, D. (2019a). See: Visualisation toolbox for ’easystats’ and extra geoms, themes and color palettes for ’ggplot2’. Retrieved from https://easystats.github.io/see/

Lüdecke, D., Waggoner, P., & Makowski, D. (2019b). Insight: A unified interface to access information from model objects in R. Journal of Open Source Software, 4(38), 1412. doi:10.21105/joss.01412

Makowski, D., Ben-Shachar, M., & Lüdecke, D. (2019). bayestestR: Describing Effects and their Uncertainty, Existence and Significance within the Bayesian Framework. Journal of Open Source Software, 4(40), 1541. doi:10.21105/joss.01541

R Core Team. (2019). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. Retrieved from https://www.R-project.org/
