datawizard: An R Package for Easy Data Preparation and Statistical Transformations

Indrajeet Patil 1, Dominique Makowski 2, Mattan S. Ben-Shachar 3, Brenton M. Wiernik 4, Etienne Bacher 5, and Daniel Lüdecke 6

1 cynkra Analytics GmbH, Germany
2 Nanyang Technological University, Singapore
3 Ben-Gurion University of the Negev, Israel
4 Independent Researcher
5 Luxembourg Institute of Socio-Economic Research (LISER), Luxembourg
6 University Medical Center Hamburg-Eppendorf, Germany
DOI: 10.21105/joss.04684
Editor: Øystein Sørensen
Reviewers: @tomfaulkenberry, @garretrc
Submitted: 07 August 2022
Published: 09 October 2022
License: Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
Summary
The {datawizard} package for the R programming language (R Core Team, 2021) provides a lightweight toolbox to assist in key steps involved in any data analysis workflow: (1) wrangling the raw data to get it in the needed form, (2) applying preprocessing steps and statistical transformations, and (3) computing statistical summaries of data properties and distributions. Therefore, it can be a valuable tool for R users and developers looking for a lightweight option for data preparation.
Statement of Need
The {datawizard} package is part of {easystats}, a collection of R packages designed to make statistical analysis easier (Ben-Shachar et al., 2020; Lüdecke et al., 2020; Lüdecke, Ben-Shachar, et al., 2021; Lüdecke, Patil, et al., 2021; Lüdecke et al., 2019; Makowski et al., 2019; Makowski et al., 2020). As this ecosystem follows a “0-external-hard-dependency” policy, a data manipulation package that relies only on base R needed to be created. In effect, {datawizard} provides a data processing backend for this entire ecosystem. In addition to its usefulness to the {easystats} ecosystem, it also provides an option for R users and package developers who wish to keep their (recursive) dependency weight to a minimum (for other options, see Dowle & Srinivasan, 2021; Eastwood, 2021).
Because {datawizard} is also meant to be used and adopted easily by a wide range of users, its workflow and syntax are designed to be similar to those of the {tidyverse} (Wickham et al., 2019), a widely used ecosystem of R packages. Thus, users familiar with the {tidyverse} can easily translate their knowledge and make full use of {datawizard}.
In addition to being a lightweight solution to clean messy data, {datawizard} also provides helpers for the other important step of data analysis: applying statistical transformations to the cleaned data while setting up statistical models. This includes various types of data standardization, normalization, rank-transformation, and adjustment. These transformations, although widely used, are not currently collectively implemented in a package in the R ecosystem, so {datawizard} can help new R users in finding the transformation they need.
Lastly, {datawizard} also provides a toolbox to create detailed summaries of data properties and distributions (e.g., tables of descriptive statistics for each variable). This is a common step in data analysis, but it is not available in base R or many modeling packages, so its inclusion makes {datawizard} a one-stop shop for data preparation tasks.
Brenton Wiernik is currently an independent researcher and Research Scientist at Meta, Demography and
Survey Science. The current work was done in an independent capacity.
Features
Data Preparation
Raw data is rarely in a state in which it can be directly fed into a statistical model. It often needs to be modified in various ways: columns may need to be renamed or reshaped, certain portions of the data may need to be filtered out, data scattered across multiple tables may need to be joined, and so on.
{datawizard} provides various functions for cleaning and preparing data (see Table 1).
Table 1: The table below lists a few key functions offered by {datawizard} for data wrangling. For the full list, see the package website: https://easystats.github.io/datawizard/
Function Operation
data_filter() to select only certain observations
data_select() to select only certain variables
data_extract() to extract a single variable
data_rename() to rename variables
data_to_long() to convert data from wide to long
data_to_wide() to convert data from long to wide
data_join() to join two data frames
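As a brief illustration of how these wrangling helpers compose, consider the following minimal sketch (not taken from the paper; the column choices are arbitrary and the calls follow the interfaces documented on the package website):

# keep only three columns of the built-in mtcars data
dat <- data_select(mtcars, select = c("mpg", "cyl", "am"))
# then keep only the manual-transmission cars (am == 1)
data_filter(dat, am == 1)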
We will look at one example function that converts data in wide format to tidy/long format:
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:4,
  X = rnorm(5, 0, 1),
  Y = rnorm(5, 0, 2)
)
stocks
#> time X Y
#> 1 2009-01-01 -0.91474184 -0.5654808
#> 2 2009-01-02 1.00124785 -1.5270177
#> 3 2009-01-03 -0.05642291 -1.3700199
#> 4 2009-01-04 0.29664516 0.7341479
#> 5 2009-01-05 -2.79147086 0.3659937
data_to_long(
  stocks,
  select = -c("time"),
  names_to = "stock",
  values_to = "price"
)
#> time stock price
#> 1 2009-01-01 X -0.91474184
#> 2 2009-01-01 Y -0.56548082
#> 3 2009-01-02 X 1.00124785
#> 4 2009-01-02 Y -1.52701766
#> 5 2009-01-03 X -0.05642291
#> 6 2009-01-03 Y -1.37001987
#> 7 2009-01-04 X 0.29664516
#> 8 2009-01-04 Y 0.73414790
#> 9 2009-01-05 X -2.79147086
#> 10 2009-01-05 Y 0.36599370
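The complementary data_to_wide() function reverses this reshaping. The following is a minimal sketch, not shown in the paper; the tidyr-style argument names names_from and values_from are assumed here (they match recent versions of the package — see ?data_to_wide for the exact interface of the installed version):

long_stocks <- data_to_long(
  stocks,
  select = -c("time"),
  names_to = "stock",
  values_to = "price"
)
# spread the long data back out: one row per date, one column per stock;
# this should recover the shape of the original `stocks` data frame
data_to_wide(long_stocks, names_from = "stock", values_from = "price")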
Statistical Transformations
Even after getting the raw data in the needed format, we may need to transform certain
variables further to meet requirements imposed by a statistical test.
{datawizard} provides a rich collection of such functions for transforming variables (see Table 2).
Table 2: The table below lists a few key functions offered by {datawizard} for data transformations. For the full list, see the package website: https://easystats.github.io/datawizard/
Function Operation
standardize() to center and scale data
normalize() to scale variables to 0-1 range
adjust() to adjust data for the effect of other variables
slide() to shift numeric value range
ranktransform() to convert numeric values to integer ranks
We will look at one example function that standardizes (i.e., centers and scales) data so that it can be expressed in terms of standard deviations:
d <- data.frame(
  a = c(-2, -1, 0, 1, 2),
  b = c(3, 4, 5, 6, 7)
)

standardize(d, center = c(3, 4), scale = c(2, 4))
#> a b
#> 1 -2.5 -0.25
#> 2 -2.0 0.00
#> 3 -1.5 0.25
#> 4 -1.0 0.50
#> 5 -0.5 0.75
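For comparison, normalize() rescales variables to the 0-1 range. Below is a minimal sketch (not from the paper), applied to the same data frame d and assuming the default settings:

# each column is rescaled by its own minimum and maximum, so with the
# defaults column `a` should become 0.00, 0.25, 0.50, 0.75, 1.00
normalize(d)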
Summaries of Data Properties and Distributions
The workhorse function to get a comprehensive summary of data properties is describe_distribution(), which combines a set of indices (e.g., measures of centrality, dispersion, range, skewness, and kurtosis) computed by other functions in {datawizard}.
describe_distribution(mtcars)
Variable Mean SD IQR Min Max Skewness Kurtosis n n_Missing
mpg 20.09 6.03 7.53 10.4 33.9 0.67 -0.02 32 0
cyl 6.19 1.79 4.00 4.0 8.0 -0.19 -1.76 32 0
disp 230.72 123.94 221.52 71.1 472.0 0.42 -1.07 32 0
hp 146.69 68.56 84.50 52.0 335.0 0.80 0.28 32 0
drat 3.60 0.53 0.84 2.8 4.9 0.29 -0.45 32 0
wt 3.22 0.98 1.19 1.5 5.4 0.47 0.42 32 0
qsec 17.85 1.79 2.02 14.5 22.9 0.41 0.86 32 0
vs 0.44 0.50 1.00 0.0 1.0 0.26 -2.06 32 0
am 0.41 0.50 1.00 0.0 1.0 0.40 -1.97 32 0
gear 3.69 0.74 1.00 3.0 5.0 0.58 -0.90 32 0
carb 2.81 1.62 2.00 1.0 8.0 1.16 2.02 32 0
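The same function also accepts a single vector, which is convenient for quick interactive checks (a minimal sketch, not shown in the paper):

# summarize one variable instead of a whole data frame
describe_distribution(mtcars$mpg)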
Licensing and Availability
{datawizard} is licensed under the GNU General Public License (v3.0), with all source code openly developed and stored on GitHub (https://github.com/easystats/datawizard), along with a corresponding issue tracker for bug reporting and feature enhancements. In the spirit of honest and open science, we encourage requests, tips for fixes, feature updates, as well as general questions and concerns via direct interaction with contributors and developers.
Acknowledgments
{datawizard} is part of the collaborative easystats ecosystem. Thus, we thank the members of easystats as well as the users.
References
Ben-Shachar, M. S., Lüdecke, D., & Makowski, D. (2020). effectsize: Estimation of effect size indices and standardized parameters. Journal of Open Source Software, 5(56), 2815. https://doi.org/10.21105/joss.02815

Dowle, M., & Srinivasan, A. (2021). Data.table: Extension of 'data.frame'. https://CRAN.R-project.org/package=data.table

Eastwood, N. (2021). Poorman: A poor man's dependency free recreation of 'dplyr'. https://CRAN.R-project.org/package=poorman

Lüdecke, D., Ben-Shachar, M. S., Patil, I., & Makowski, D. (2020). Extracting, computing and exploring the parameters of statistical models using R. Journal of Open Source Software, 5(53), 2445. https://doi.org/10.21105/joss.02445

Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., & Makowski, D. (2021). performance: An R package for assessment, comparison and testing of statistical models. Journal of Open Source Software, 6(60), 3139. https://doi.org/10.21105/joss.03139

Lüdecke, D., Patil, I., Ben-Shachar, M. S., Wiernik, B. M., Waggoner, P., & Makowski, D. (2021). see: An R package for visualizing statistical models. Journal of Open Source Software, 6(64), 3393. https://doi.org/10.21105/joss.03393

Lüdecke, D., Waggoner, P., & Makowski, D. (2019). insight: A unified interface to access information from model objects in R. Journal of Open Source Software, 4(38), 1412. https://doi.org/10.21105/joss.01412
Makowski, D., Ben-Shachar, M. S., & Lüdecke, D. (2019). bayestestR: Describing effects and their uncertainty, existence and significance within the Bayesian framework. Journal of Open Source Software, 4(40), 1541. https://doi.org/10.21105/joss.01541

Makowski, D., Ben-Shachar, M. S., Patil, I., & Lüdecke, D. (2020). Methods and algorithms for correlation analysis in R. Journal of Open Source Software, 5(51), 2306. https://doi.org/10.21105/joss.02306

R Core Team. (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686