datawizard: An R Package for Easy Data Preparation
and Statistical Transformations
Indrajeet Patil 1, Dominique Makowski 2, Mattan S. Ben-Shachar 3,
Brenton M. Wiernik ∗4, Etienne Bacher 5, and Daniel Lüdecke 6
1 cynkra Analytics GmbH, Germany
2 Nanyang Technological University, Singapore
3 Ben-Gurion University of the Negev, Israel
4 Independent Researcher
5 Luxembourg Institute of Socio-Economic Research (LISER), Luxembourg
6 University Medical Center Hamburg-Eppendorf, Germany
DOI: 10.21105/joss.04684
Editor: Øystein Sørensen
Reviewers:
•@tomfaulkenberry
•@garretrc
Submitted: 07 August 2022
Published: 09 October 2022
License
Authors of papers retain copyright
and release the work under a
Creative Commons Attribution 4.0
International License (CC BY 4.0).
Summary
The {datawizard} package for the R programming language (R Core Team, 2021) provides a
lightweight toolbox to assist in key steps involved in any data analysis workflow: (1) wrangling
the raw data to get it in the needed form, (2) applying preprocessing steps and statistical
transformations, and (3) computing statistical summaries of data properties and distributions.
It can therefore be a valuable tool for R users and developers looking for a lightweight option
for data preparation.
Statement of Need
The {datawizard} package is part of {easystats}, a collection of R packages designed to
make statistical analysis easier (Ben-Shachar et al., 2020; Lüdecke et al., 2019, 2020;
Lüdecke, Ben-Shachar, et al., 2021; Lüdecke, Patil, et al., 2021; Makowski et al., 2019,
2020). As this ecosystem follows a “0-external-hard-dependency” policy, a data manipulation
package that relies only on base R needed to be created. In effect, {datawizard} provides a
data processing backend for this entire ecosystem. In addition to its usefulness to the
{easystats} ecosystem, it also provides an option for R users and package developers who
wish to keep their (recursive) dependency weight to a minimum (for other options, see
Dowle & Srinivasan, 2021; Eastwood, 2021).
Because {datawizard} is also meant to be used and adopted easily by a wide range of users,
its workflow and syntax are designed to be similar to {tidyverse} (Wickham et al., 2019), a
widely used ecosystem of R packages. Thus, users familiar with the {tidyverse} can easily
translate their knowledge and make full use of {datawizard}.
In addition to being a lightweight solution for cleaning messy data, {datawizard} also provides
helpers for the other important step of data analysis: applying statistical transformations
to the cleaned data while setting up statistical models. This includes various types of data
standardization, normalization, rank-transformation, and adjustment. These transformations,
although widely used, are not currently implemented together in a single package in the R
ecosystem, so {datawizard} can help new R users find the transformation they need.
Lastly, {datawizard} also provides a toolbox to create detailed summaries of data properties
and distributions (e.g., tables of descriptive statistics for each variable). This is a common step
in data analysis, but it is not available in base R or many modeling packages, so its inclusion
makes {datawizard} a one-stop-shop for data preparation tasks.
∗ Brenton Wiernik is currently an independent researcher and Research Scientist at Meta, Demography and
Survey Science. The current work was done in an independent capacity.
Patil et al. (2022). datawizard: An R Package for Easy Data Preparation and Statistical Transformations. Journal of Open Source Software, 7(78),
4684. https://doi.org/10.21105/joss.04684
Features
Data Preparation
Raw data is rarely in a state where it can be fed directly into a statistical model. It often
needs to be modified in various ways. For example, columns need to be renamed or reshaped,
certain portions of the data need to be filtered out, data scattered across multiple tables needs
to be joined, etc.
{datawizard} provides various functions for cleaning and preparing data (see Table 1).
Table 1: A few key functions offered by {datawizard} for data wrangling. For the full list,
see the package website: https://easystats.github.io/datawizard/
Function          Operation
data_filter()     to select only certain observations
data_select()     to select only certain variables
data_extract()    to extract a single variable
data_rename()     to rename variables
data_to_long()    to convert data from wide to long
data_to_wide()    to convert data from long to wide
data_join()       to join two data frames
…                 …
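Since the package relies only on base R, the operations in Table 1 can be pictured through rough base R analogues. The sketch below is illustrative only: it does not call {datawizard}, it is not the package's implementation, and the object names (cars4, two_cols, lookup, and so on) are made up for this example.

```r
# Rough base R analogues of the operations in Table 1 (illustrative only).
df <- mtcars[, c("mpg", "cyl", "hp")]
df$model <- rownames(mtcars)
rownames(df) <- NULL

# Select only certain observations (rows)
cars4 <- df[df$cyl == 4, ]

# Select only certain variables (columns)
two_cols <- df[, c("model", "mpg")]

# Extract a single variable as a vector
mpg <- df[["mpg"]]

# Rename a variable
renamed <- df
names(renamed)[names(renamed) == "hp"] <- "horsepower"

# Join two data frames on a shared key column
lookup <- data.frame(cyl = c(4, 6, 8), size = c("small", "medium", "large"))
joined <- merge(df, lookup, by = "cyl")
```

The package functions wrap this kind of indexing behind a consistent, tidyverse-like interface, so the bracket syntax above does not have to be written by hand.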
We will look at one example function that converts data in wide format to tidy/long format:
# rnorm() draws random values, so exact numbers differ between runs
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:4,
  X = rnorm(5, 0, 1),
  Y = rnorm(5, 0, 2)
)
stocks
#>         time           X          Y
#> 1 2009-01-01 -0.91474184 -0.5654808
#> 2 2009-01-02  1.00124785 -1.5270177
#> 3 2009-01-03 -0.05642291 -1.3700199
#> 4 2009-01-04  0.29664516  0.7341479
#> 5 2009-01-05 -2.79147086  0.3659937
# reshape every column except `time` into name/value pairs
data_to_long(
  stocks,
  select = -c("time"),
  names_to = "stock",
  values_to = "price"
)
#>          time stock       price
#> 1  2009-01-01     X -0.91474184
#> 2  2009-01-01     Y -0.56548082
#> 3  2009-01-02     X  1.00124785
#> 4  2009-01-02     Y -1.52701766
#> 5  2009-01-03     X -0.05642291
#> 6  2009-01-03     Y -1.37001987
#> 7  2009-01-04     X  0.29664516
#> 8  2009-01-04     Y  0.73414790
#> 9  2009-01-05     X -2.79147086
#> 10 2009-01-05     Y  0.36599370
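To make the reshaping step concrete, the same wide-to-long conversion can be written out by hand in base R. This is a minimal sketch under stated assumptions: the data values are chosen by hand so the result is easy to check, the variable names are hypothetical, and the code is not how {datawizard} implements the operation.

```r
# Small hand-made example (fixed values instead of random draws).
stocks <- data.frame(
  time = as.Date("2009-01-01") + 0:2,
  X = c(1, 2, 3),
  Y = c(10, 20, 30)
)

value_cols <- c("X", "Y")

# One output row per (time, stock) pair, ordered by time and then by stock.
long <- data.frame(
  time  = rep(stocks$time, each = length(value_cols)),
  stock = rep(value_cols, times = nrow(stocks)),
  # Transposing puts the stocks in rows, so flattening column by column
  # yields X1, Y1, X2, Y2, ...
  price = as.vector(t(as.matrix(stocks[value_cols])))
)
long
```

Each row of the wide table becomes one row per value column in the long table, which is the same row ordering as in the output shown above.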
Statistical Transformations
Even after getting the raw data in the needed format, we may need to transform certain
variables further to meet requirements imposed by a statistical test.
{datawizard} provides a rich collection of such functions for transforming variables (see Table 2).
Table 2: A few key functions offered by {datawizard} for data transformations. For the full
list, see the package website: https://easystats.github.io/datawizard/
Function          Operation
standardize()     to center and scale data
normalize()       to scale variables to 0-1 range
adjust()          to adjust data for effect of other variables
slide()           to shift numeric value range
ranktransform()   to convert numeric values to integer ranks
…                 …
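For orientation, two of the entries in Table 2 correspond to textbook operations: min-max rescaling, (x - min(x)) / (max(x) - min(x)), and rank transformation. The base R sketch below illustrates both. The helper rescale01() is a hypothetical name introduced here, and the package functions may offer further options and treat edge cases differently.

```r
x <- c(3, 7, 5, 11)

# Min-max rescaling to the 0-1 range (hypothetical helper, not a package function)
rescale01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}
rescale01(x)
#> [1] 0.00 0.50 0.25 1.00

# Rank transformation with base R's rank(); ties get the average rank by default
rank(x)
#> [1] 1 3 2 4
```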
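We will look at one example function that standardizes (i.e., centers and scales) data, typically
so that values are expressed in units of standard deviation. In the example below, the center
and scale of each column are supplied explicitly:
d <- data.frame(
  a = c(-2, -1, 0, 1, 2),
  b = c(3, 4, 5, 6, 7)
)
# column a: center 3, scale 2; column b: center 4, scale 4
standardize(d, center = c(3, 4), scale = c(2, 4))
#>      a     b
#> 1 -2.5 -0.25
#> 2 -2.0  0.00
#> 3 -1.5  0.25
#> 4 -1.0  0.50
#> 5 -0.5  0.75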
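The values in this output follow from the usual formula (x - center) / scale applied column by column. As a check, the arithmetic can be reproduced in base R. This is a sketch of the formula only, not of the package internals, and the names centers, scales, and manual are introduced here for illustration.

```r
d <- data.frame(
  a = c(-2, -1, 0, 1, 2),
  b = c(3, 4, 5, 6, 7)
)

centers <- c(a = 3, b = 4)
scales  <- c(a = 2, b = 4)

# Apply (x - center) / scale to each column with its own center and scale
manual <- as.data.frame(
  Map(function(x, m, s) (x - m) / s, d, centers, scales)
)
manual
```

For instance, the first value of column a is (-2 - 3) / 2 = -2.5 and the first value of column b is (3 - 4) / 4 = -0.25, matching the first row of the output above.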
Summaries of Data Properties and Distributions
The workhorse function to get a comprehensive summary of data properties is
describe_distribution(), which combines a set of indices (e.g., measures of centrality,
dispersion, range, skewness, kurtosis, etc.) computed by other functions in {datawizard}.
describe_distribution(mtcars)
Variable    Mean      SD     IQR   Min    Max  Skewness  Kurtosis   n  n_Missing
mpg        20.09    6.03    7.53  10.4   33.9      0.67     -0.02  32          0
cyl         6.19    1.79    4.00   4.0    8.0     -0.19     -1.76  32          0
disp      230.72  123.94  221.52  71.1  472.0      0.42     -1.07  32          0
hp        146.69   68.56   84.50  52.0  335.0      0.80      0.28  32          0
drat        3.60    0.53    0.84   2.8    4.9      0.29     -0.45  32          0
wt          3.22    0.98    1.19   1.5    5.4      0.47      0.42  32          0
qsec       17.85    1.79    2.02  14.5   22.9      0.41      0.86  32          0
vs          0.44    0.50    1.00   0.0    1.0      0.26     -2.06  32          0
am          0.41    0.50    1.00   0.0    1.0      0.40     -1.97  32          0
gear        3.69    0.74    1.00   3.0    5.0      0.58     -0.90  32          0
carb        2.81    1.62    2.00   1.0    8.0      1.16      2.02  32          0
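Several columns of this table can be cross-checked with base R. The sketch below does so for the mpg row; it is a sanity check rather than the package's code. One detail is an inference from the numbers rather than something stated in the text: the IQR of 7.53 is consistent with quantile type 6 (which gives 7.525), whereas R's default type 7 gives 7.375.

```r
x <- mtcars$mpg

# Base R counterparts of some columns in the mpg row
summary_mpg <- c(
  Mean      = mean(x),
  SD        = sd(x),
  IQR       = IQR(x, type = 6),  # type 6 is an assumption inferred from the table
  Min       = min(x),
  Max       = max(x),
  n         = length(x),
  n_Missing = sum(is.na(x))
)
summary_mpg

# The default quantile type gives a different IQR
IQR(x)
```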
Licensing and Availability
{datawizard} is licensed under the GNU General Public License (v3.0), with all source code
openly developed and stored on GitHub (https://github.com/easystats/datawizard), along
with a corresponding issue tracker for bug reporting and feature enhancements. In the spirit
of honest and open science, we encourage requests, tips for fixes, feature updates, as well as
general questions and concerns via direct interaction with contributors and developers.
Acknowledgments
{datawizard} is part of the collaborative easystats ecosystem. Thus, we thank the members
of easystats as well as the users.
References
Ben-Shachar, M. S., Lüdecke, D., & Makowski, D. (2020). effectsize: Estimation of effect
size indices and standardized parameters. Journal of Open Source Software, 5(56), 2815.
https://doi.org/10.21105/joss.02815
Dowle, M., & Srinivasan, A. (2021). data.table: Extension of 'data.frame'.
https://CRAN.R-project.org/package=data.table
Eastwood, N. (2021). poorman: A poor man's dependency free recreation of 'dplyr'.
https://CRAN.R-project.org/package=poorman
Lüdecke, D., Ben-Shachar, M. S., Patil, I., & Makowski, D. (2020). Extracting, computing and
exploring the parameters of statistical models using R. Journal of Open Source Software,
5(53), 2445. https://doi.org/10.21105/joss.02445
Lüdecke, D., Ben-Shachar, M. S., Patil, I., Waggoner, P., & Makowski, D. (2021). performance:
An R package for assessment, comparison and testing of statistical models. Journal of
Open Source Software, 6(60), 3139. https://doi.org/10.21105/joss.03139
Lüdecke, D., Patil, I., Ben-Shachar, M. S., Wiernik, B. M., Waggoner, P., & Makowski, D.
(2021). see: An R package for visualizing statistical models. Journal of Open Source
Software, 6(64), 3393. https://doi.org/10.21105/joss.03393
Lüdecke, D., Waggoner, P., & Makowski, D. (2019). insight: A unified interface to access
information from model objects in R. Journal of Open Source Software, 4(38), 1412.
https://doi.org/10.21105/joss.01412
Makowski, D., Ben-Shachar, M. S., & Lüdecke, D. (2019). bayestestR: Describing effects and
their uncertainty, existence and significance within the Bayesian framework. Journal of
Open Source Software, 4(40), 1541. https://doi.org/10.21105/joss.01541
Makowski, D., Ben-Shachar, M. S., Patil, I., & Lüdecke, D. (2020). Methods and algorithms
for correlation analysis in R. Journal of Open Source Software, 5(51), 2306.
https://doi.org/10.21105/joss.02306
R Core Team. (2021). R: A language and environment for statistical computing. R Foundation
for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund,
G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M.,
Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome
to the tidyverse. Journal of Open Source Software, 4(43), 1686.
https://doi.org/10.21105/joss.01686