Teaching modeling in introductory statistics:
A comparison of formula and tidyverse
Amelia McNamara ∗
Department of Computer & Information Sciences, University of St Thomas
February 1, 2022
This paper reports on an experiment run in a pair of introductory statistics labs,
attempting to determine which of two R syntaxes was better for introductory teach-
ing and learning: formula or tidyverse. One lab was conducted fully in the formula
syntax, the other in tidyverse. Analysis of incidental data from YouTube and RStudio
Cloud shows interesting distinctions. The formula section appeared to watch a larger
proportion of pre-lab YouTube videos, but spend less time computing on RStudio
Cloud. Conversely, the tidyverse section watched a smaller proportion of the videos
and spent more time on RStudio Cloud. Analysis of lab materials showed that tidy-
verse labs tended to be slightly longer (in terms of lines in the provided RMarkdown
materials, as well as minutes of the associated YouTube videos), and the tidyverse
labs exposed students to more distinct R functions. However, both labs relied on a
quite small vocabulary of consistent functions. Analysis of pre- and post-survey data
shows no differences between the two labs, so students appeared to have a positive
experience regardless of section. This work provides additional evidence for instructors
looking to choose between syntaxes for introductory statistics teaching.
Keywords: R language, instruction, data science, statistical computing
arXiv:2201.12960v1 [stat.CO] 31 Jan 2022
When teaching statistics and data science, it is crucial for students to engage authentically
with data. The revised Guidelines for Assessment and Instruction in Statistics Education
(GAISE) College Report provides recommendations for instruction, including “Integrate
real data with a context and purpose” and “Use technology to explore concepts and analyze
data” (GAISE College Report ASA Revision Committee 2016). Many instructors have
students engage with data using technology through in-class experiences or separate lab
An important pedagogical decision when choosing to teach data analysis is the choice
of tool. There has long been a divide between ‘tools for learning’ and ‘tools for doing’ data
analysis (McNamara 2015). Tools for learning include applets, and standalone software like
TinkerPlots, Fathom, or their next-generation counterpart CODAP (Konold & Miller 2001,
Finzer 2002, The Concord Consortium 2020). Tools for doing are used by professionals,
and include software packages like SAS as well as programming languages like Julia, R,
Many tools for learning were inspired by Rolf Biehler’s 1997 paper, “Software for Learn-
ing and for Doing Statistics” (Biehler 1997). In it, Biehler called for more attention to the
design of tools used for teaching. In particular, he was concerned with on-ramps for stu-
dents (ensuring the tool was not too complex), as well as oﬀ-ramps (using one tool through
an entire class, which could also extend further) (Biehler 1997). At the time he wrote the
paper it was quite diﬃcult to teach using an authentic tool for doing, because these tools
lacked technological or pedagogical on-ramps.
However, recent developments in Integrated Development Environments (IDEs) and
pedagogical advances have opened space for a movement to teach even novices statistics
and data science using programming. In particular, curricula using Python and R have
become popular. In these curricula, educators make pedagogical decisions about what code
to show students, and how to scaﬀold it. In both the Python and R communities, there
have been movements to simplify syntax for students.
For example, the UC Berkeley Data 8 course uses Python, including elements of the
commonly-used matplotlib and numpy libraries as well as a specialized library written to
accompany the curriculum called datascience (Adhikari et al. 2021, DeNero et al. 2020).
The datascience library was designed to reduce complexity in the code. At the K-12 level,
the language Pyret has been developed as a simpliﬁed version of Python to accompany the
Bootstrap Data Science curriculum (Krishnamurthi et al. 2020).
In R, the development of less-complex code for students has been under consideration
for even longer. R oﬀers non-standard evaluation, which allows package authors to create
new ‘syntax’ for their packages (Morandat et al. 2012). In human language, syntax is the
set of rules for how words and sentences should be structured. If you use the wrong syntax
in human language, people will probably still understand you, but they will be able to
hear there is something wrong with how you structured your speech or writing. Syntax in
programming languages is even more formal: it governs what code will execute, run, or
compile correctly. Using the wrong syntax means getting an error from the language.
Typically, programming languages have only one valid syntax. For example, an aphorism
about the language Python is “There should be one– and preferably only one –obvious way
to do it” (Peters 2004). But, non-standard evaluation in R has allowed there to be many
obvious ways to do the same task. There is some disagreement over whether syntax is a
precise term for these diﬀerences. Other terms suggested for these variations in valid R
code are ‘dialects,’ ‘interfaces,’ and ‘domain speciﬁc languages.’ Throughout this paper, we
use the term syntax as a shorthand for these concepts. At present, there are three primary
syntaxes used: base, formula, and tidyverse (McNamara 2018).
The base syntax is used by the base R language (R Core Team 2020), and is characterized
by the use of dollar signs and square brackets. The formula syntax uses the tilde to separate
response and explanatory variable(s) (Pruim et al. 2017). The tidyverse syntax uses a
data-ﬁrst approach, and the pipe to move data between steps (Wickham et al. 2019).
A comparison of using the three syntaxes for univariate statistics and displays can be seen
in Code Block 1.1. This example code, like the rest in this paper, uses the palmerpenguins
data (Horst et al. 2020). All three pieces of code accomplish the same tasks, and all three
use the R language. But, the syntax varies considerably.
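# base syntax
hist(penguins$bill_length_mm)
mean(penguins$bill_length_mm)
# formula syntax
gf_histogram(~bill_length_mm, data = penguins)
mean(~bill_length_mm, data = penguins)
# tidyverse syntax
ggplot(penguins) +
  geom_histogram(aes(x = bill_length_mm))
penguins %>%
  summarize(mean(bill_length_mm))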
Code Block 1.1: Making a histogram of bill length from the penguins dataset, then
taking the mean, using three different R syntaxes. Base syntax is characterized by
the dollar sign, formula by the tilde, and tidyverse is dataframe-first. In order for
this code to run as-is, missing (NA) values need to be dropped before the code is run.
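The caption's note about missing values can be illustrated with a small base R sketch. The vector below is made-up data standing in for a column with one missing measurement; it is an assumption for illustration, not taken from palmerpenguins.

```r
# Hypothetical measurements with one missing value (not real penguin data)
bill_length <- c(39.1, 39.5, NA, 40.3)

# Without dropping the NA, the mean is itself NA
mean_with_na <- mean(bill_length)

# Option 1: ask mean() to ignore missing values
mean_na_rm <- mean(bill_length, na.rm = TRUE)

# Option 2: drop missing values first, then summarize
mean_dropped <- mean(bill_length[!is.na(bill_length)])
```

Both options give the same number; which one is shown to students is a pedagogical choice rather than a technical one.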
There is some agreement about pedagogical decisions for teaching R. In particular, most
educators agree that in order to reduce cognitive load, instructors should teach only one
syntax and be as consistent as possible about that syntax (McNamara et al. 2021a).
There is also some agreement base R syntax is not the appropriate choice for introduc-
tory statistics, but there is widespread disagreement on whether the formula syntax or
tidyverse syntax is better for novices.
While there are strongly-held opinions on which syntax should be taught (Pruim et al.
2017, Çetinkaya-Rundel et al. 2021), there is relatively little empirical evidence to support
these opinions. In the realm of computer science, empirical studies by Andreas Stefik
et al. have shown significant differences in the intuitiveness of languages, as well as error
rates, based on language design choices (Stefik et al. 2011, Stefik & Siebert 2013). Thus, it
seems likely there are language choices that could make data science programming easier
(or harder) for users, particularly novices.
Steﬁk’s team is working to add data science functionality to their evidence-based pro-
gramming language. As a ﬁrst step toward understanding which elements of existing lan-
guages might be best to emulate, they ran an experiment comparing the three main R
syntaxes (Rafalski et al. 2019). The study showed no statistically signiﬁcant diﬀerence
between any of the three syntaxes with regard to time to completion or number of er-
rors. However, there were signiﬁcant interaction eﬀects between syntax and task, which
suggested some syntaxes might be more appropriate for certain tasks (Rafalski et al. 2019).
Beyond this, examining the results from the study with an eye toward data science ped-
agogy showed common errors made by students related to their conceptions of dataframes
and variables. For example, one of the ﬁgures from Rafalski et al. (2019) shows real stu-
dent code with errors. In the ﬁrst line of code, the student gets everything correct using
formula syntax, with the exception of the name of the dataframe. When that code does
not work, they try again using base R syntax, but again get the dataframe name wrong.
After both those failures, they appear to fall back on computer science knowledge and try
syntax quite diﬀerent from R. This is consistent with other studies of novice behavior in R
(Roberts 2015). It is not clear if this type of error was dependent on the syntax participants
were asked to use.
The other missing element in this study was instruction. The study was a quick inter-
vention showing students examples of a particular syntax, then asking them to duplicate
that syntax in a new situation. But without any instruction about data science concepts
like dataframes, it would be diﬃcult to troubleshoot the syntax error mentioned above.
The work served as the inspiration for the longer comparison of multiple R syntaxes in the
classroom context described in this paper.
The remainder of this paper is organized into three sections. Section 2 describes the
setup of the study, the participants (2.1) and their experience (2.2), and the content of
the course under investigation (2.3). Section 3 contains results of the analysis, including a
comparison of material lengths between the sections (3.2), the number of unique functions
shown in each section (3.3), results from the pre- and post-survey (3.4), and analysis of
YouTube (3.5) and RStudio Cloud (3.6) data. Finally, Section 4 discusses the results and
opportunities for future study.
All materials used for this study are available on GitHub and are Creative Commons
licensed, so they can be used or remixed by anyone who wants to use them. All code
and anonymized data from this paper is also available on GitHub, for reproducibility. Data
analysis was performed in R, and the paper is written in RMarkdown. The categorical color
palette was chosen using Colorgorical (Gramazio et al. 2017), and colors for the Likert scale
plot are from ColorBrewer (Harrower & Brewer 2003). Example data used throughout the
paper is from palmerpenguins (Horst et al. 2020). Packages used for the formula section
were mosaic and ggformula (now loaded automatically with mosaic); the tidyverse
section used the tidyverse and infer packages (Pruim et al. 2017, Kaplan & Pruim 2020,
Wickham et al. 2019, Bray et al. 2021).
The author ran a pilot study in her introductory statistics labs. This study was run twice,
once in the Spring 2020 semester and once in the Fall 2020 semester. The disruption of
COVID-19 to the Spring 2020 semester made the resulting data unusable, so this paper
focuses on just Fall 2020 data.
Data was collected from YouTube analytics for watch times, from RStudio Cloud for
aggregated compute time, and from pre- and post-surveys of students. Participants for
the pre- and post-survey were recruited from this pool after Institutional Research Board
ethics review.
Participants in the study were students enrolled in an introductory statistics course at a
mid-sized private university in the upper Midwest. At this university, statistics students
enroll in a lecture (approximately 60-90 students per section), which is broken into several
smaller lab sections for hands-on work in statistical software. Lecture and lab sections are
taught by diﬀerent instructors, and the lab sections associated with a particular lecture
often use diﬀerent software. For example, one lab may use Minitab while the other two use
Excel. However, every lab section (no matter what lecture it is associated with, or what
software is used) does the same set of standardized assignments. This structure provides a
consistent basis for comparison.
Prior programming experience    formula    tidyverse
No                              10         9
Yes, but not with R             2          4
Table 1: Responses from pre-survey about prior programming experience. The
majority of students in both sections had no prior programming experience.
In Fall 2020, the author taught two labs associated with the same lecture section, so all
students saw the same lecture content. (A third lab was associated with the same lecture,
using a diﬀerent software, and was not considered.) Using random assignment (coin ﬂip),
the author selected one lab section to be instructed using formula syntax, and one to be
instructed using tidyverse syntax. The goal was to compare syntaxes head-to-head.
Because the lab took place during the coronavirus pandemic, the instructor recorded
YouTube videos of herself working through the pre-lab documents for each lab, and posted
them in advance. Students watched the videos and worked through the associated pre-lab
RMarkdown document on their own time, then came to synchronous class to ask questions
and get help starting on the real lab assignment. Students used R through the online
platform RStudio Cloud (RStudio PBC 2021).
The two labs were of the same size (n = 21 in both sections) and reasonably similar
in terms of student composition. In both sections, approximately half of students were
Business majors, with the other half a mix of other majors.
Participants for the pre- and post-survey were recruited from this pool after Institutional
Research Board ethics review. For the pre-survey, n = 12 students in the formula section
and n = 13 in the tidyverse section consented to participate, and in the post-survey
n = 8 and n = 13 responded, respectively. So, for paired analysis we
have n = 8 for the formula section, and n = 13 for the tidyverse section. These sample
sizes are very small and, because students could opt in, may suffer from response bias.
However, because we have additional usage data from non-respondents, some elements of
the data analysis include the full class sample sizes of n = 21.
2.2 Prior programming experience
To verify both groups of students had similar backgrounds, we compared the prior
programming experience of the two groups of students. Table 1 shows results from the pre-survey.
While two additional students in the tidyverse section had prior programming experience,
the overall pattern was the same. The majority of students in both sections had no prior
programming experience.
For the students who had programmed before, none had prior experience with R. Three
experience with other languages, including C++ and Python.
Each week, the lab instructor prepared a “pre-lab” document in RMarkdown. The pre-
lab covered the topics necessary to complete the standardized lab assignment done by all
students across lab sections. Pre-lab documents included text explanations of statistical
and R programming concepts, sample code, and blanks (both in the code and the text)
for students to ﬁll in as they worked. The instructor recorded YouTube videos of herself
working through the pre-lab documents for each lab, and posted them in advance. Students
were told to watch the pre-lab video and work through the RMarkdown document on their
own time, then come to synchronous class to ask questions and get help starting on the
real lab assignment.
The topics covered in Fall 2020 were as follows:
1. [No lab, short week]
2. Describing data: determining the number of observations and variables in a dataset,
3. Categorical variables: exploratory data analysis for one or two categorical variables.
Frequency tables, relative frequency tables, bar charts, two-way tables, and side-by-
side bar charts.
4. Quantitative variables: exploratory data analysis for one quantitative variable. His-
tograms, dot plots, density plots, and summary statistics like mean, median, and
5. Correlation and regression: exploratory data analysis for two quantitative variables.
Correlation, scatterplot, simple linear regression as a descriptive technique.
6. Bootstrap intervals: the use of the bootstrap to construct non-parametric confidence
intervals.
7. Randomization tests: the use of randomization to perform non-parametric hypothesis
tests.
8. Inference for a single proportion: use of the normal distribution to construct conﬁ-
dence intervals and perform hypothesis tests for a single proportion.
9. Inference for a single mean: use of the t-distribution to construct conﬁdence intervals
and perform hypothesis tests for a single mean.
10. Inference for two samples: use of distributional approximations (normal or t) to
perform inference for a diﬀerence of proportions or a diﬀerence of means.
11. [No lab, assessment]
12. [No lab, Thanksgiving]
13. ANOVA: inference for more than two means, using the F distribution.
14. Chi-square: inference for more than two counts, using the χ² distribution.
15. Inference for Regression: inference for the slope coeﬃcient in simple linear regression,
prediction and conﬁdence intervals. Multiple regression.
Although this was a 15-week semester, there are only 12 lab topics. Labs were not held
during the ﬁrst week of classes or during Thanksgiving week. Additionally, there were two
“lab assessments” to gauge student understanding of concepts within the context of their
lab software. One took place during ﬁnals week, the other was scheduled in week 11.
3.1 Summative assessments
One obvious question arising when considering the comparison of the two syntaxes is
whether students performed better in one section or another. The IRB for this study
did not cover examining student work (an obvious place for improved further research),
so we cannot look at student outcomes on a per-assignment basis. However, running a
randomization test for a difference in overall mean lab grades showed no significant
difference between the two sections. While there may have been interesting differences in grades
depending on the topic of the lab, we at least know these diﬀerences averaged out in the
Similarly, it would be interesting to know whether student attitudes about the instructor
differed between sections, as reflected in the summative student evaluations completed by
all students at the end of the semester. These evaluations are anonymous, and the interface
only provides summary statistics. Again, a test for a difference in means showed no
difference in mean evaluation score on the questions “Overall, I rate this instructor an
excellent teacher.” and “Overall, I rate this course as excellent.”
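The randomization test for a difference in mean lab grades mentioned earlier in this section can be sketched in base R. The grades below are simulated and purely hypothetical, and the helper name diff_in_means() is introduced only for illustration; the paper does not state the exact procedure or software used for its test.

```r
set.seed(2020)  # arbitrary seed so the sketch is reproducible

# Hypothetical lab grades for two sections of 21 students each
formula_grades   <- round(runif(21, min = 70, max = 100))
tidyverse_grades <- round(runif(21, min = 70, max = 100))

grades  <- c(formula_grades, tidyverse_grades)
section <- rep(c("formula", "tidyverse"), each = 21)

# Difference in mean grade (formula minus tidyverse) for a given labeling
diff_in_means <- function(values, groups) {
  mean(values[groups == "formula"]) - mean(values[groups == "tidyverse"])
}
observed <- diff_in_means(grades, section)

# Shuffle the section labels many times to build the null distribution
null_diffs <- replicate(5000, diff_in_means(grades, sample(section)))

# Two-sided p-value: share of shuffles at least as extreme as the observed gap
p_value <- mean(abs(null_diffs) >= abs(observed))
```

Shuffling labels rather than resampling grades keeps the group sizes fixed at 21 and 21, which matches the design described in Section 2.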
3.2 Lab lengths
The ﬁrst question we seek to answer is whether the materials presented to students were
of approximately the same length. We can assess this based on the length of the pre-lab
documents (in lines) and of the pre-lab videos (in minutes).
The length of the pre-lab RMarkdown documents can be measured using lines. Figure
1 shows the number of lines of code for each section’s pre-lab document, per week.
It indicates RMarkdown documents for the tidyverse section tended to be longer. We
can compute a difference in lab lengths for each week, and compute the mean difference,
which is 19 lines. Because we only have 12 labs worth of data, we used a bootstrap procedure
to generate a confidence interval for the mean of the differences. The 95% interval is (9,
29), which indicates labs for the tidyverse section were longer, but only by a few lines.
[Figure 1 plot omitted. x-axis: Week of semester; y-axis: Length (in lines) of pre-lab document]
Figure 1: Length of pre-lab RMarkdown documents each week, in lines. Data has
been adjusted for the formula section in weeks 8 and 9, because an instructor error
led this section to have only one document combining both weeks’ work.
A slightly longer length for these labs makes sense, because tidyverse code is charac-
terized by multiple short lines strung together into a pipeline with %>%, while the formula
syntax typically has single function calls, sometimes with more arguments.
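The bootstrap interval for the mean of the weekly differences can be sketched as follows. The line counts are invented for illustration and are not the study's data, and the percentile method used here is an assumption, since the text does not say which bootstrap interval was computed.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical line counts for 12 weekly pre-labs (not the study's data)
formula_lines   <- c(120, 134, 150, 110, 140, 160, 130, 125, 145, 155, 135, 128)
tidyverse_lines <- c(135, 180, 165, 125, 150, 185, 150, 140, 170, 175, 160, 150)

# Paired differences, one per week (tidyverse minus formula)
differences <- tidyverse_lines - formula_lines
mean_difference <- mean(differences)

# Resample the 12 differences with replacement and record each mean
boot_means <- replicate(10000, mean(sample(differences, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap means
ci_95 <- quantile(boot_means, probs = c(0.025, 0.975))
```

Working with the paired differences, rather than bootstrapping each section separately, respects the fact that both sections covered the same topic each week.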
The question then becomes whether the longer documents led to longer
pre-lab videos. Figure 2 shows the video lengths, which appear more consistent between
sections. Effort was made to keep the maximum video length at approximately 20
minutes, and some weeks had multiple videos.
Again, we can compute a pairwise difference in total video length (adding together
multiple videos in weeks that had them), and compute the mean of that difference. That
difference is 2 minutes (tidyverse videos being longer). We can then compute a 95%
bootstrap confidence interval for the difference, (0.16, 4). Again, it appears tidyverse
videos are longer, although just by a few minutes.
[Figure 2 plot omitted. x-axis: Week of semester; y-axis: Total length of pre-lab videos]
Figure 2: Length of pre-lab videos each week. Outlines help delineate multiple videos
for a single week.
3.2.1 Divergent labs
One place where the labs are of particularly diﬀerent lengths is in week 3, when the topic
was exploratory data analysis for one and two categorical variables. For the formula section
the RMarkdown document was 134 lines long, and the two videos totaled 28 minutes. The
RMarkdown document for the tidyverse section was 180 lines long, and the videos totaled
35 minutes. There is a clear reason why.
In the formula section, students found frequency tables and relative frequency tables
with code as in Code Block 3.1 and Code Block 3.2.
tally(~island, data = penguins)
tally(~island, data = penguins, format = "percent")
tally(species ~ island, data = penguins)
Code Block 3.1: Making tables of one and two categorical variables using the formula
syntax and mosaic::tally().
tally(species ~ island, data = penguins, format = "percent")
species        Biscoe     Dream Torgersen
  Adelie     26.19048  45.16129 100.00000
  Chinstrap   0.00000  54.83871   0.00000
  Gentoo     73.80952   0.00000   0.00000
Code Block 3.2: Making a table of two categorical variables using the formula
syntax and the mosaic::tally() function, along with the percent option.
The mosaic::tally() function produces a familiar-looking two-way table, which took
very little explanation, other than to show how reversing the variables in the formula led
to diﬀerent percentages, as is seen in Code Block 3.3. Compare Code Block 3.2 and Code
Block 3.3 to see the eﬀect of swapping the order of variables.
tally(island ~ species, data = penguins, format = "percent")
island         Adelie Chinstrap    Gentoo
  Biscoe     28.94737   0.00000 100.00000
  Dream      36.84211 100.00000   0.00000
  Torgersen  34.21053   0.00000   0.00000
Code Block 3.3: Making a table of two categorical variables using the formula
syntax and mosaic::tally() function, with variables swapped.
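Why swapping the variables changes the percentages can be seen by computing both sets of conditional percentages from one table of counts. The sketch below uses base R's prop.table() as a stand-in (it is not the function used in either lab section), with counts taken from the n column of the tibble output in Code Block 3.5.

```r
# Counts of penguins by species (rows) and island (columns)
counts <- matrix(
  c(44,  0, 124,   # Biscoe:    Adelie, Chinstrap, Gentoo
    56, 68,   0,   # Dream:     Adelie, Chinstrap, Gentoo
    52,  0,   0),  # Torgersen: Adelie, Chinstrap, Gentoo
  nrow = 3,
  dimnames = list(
    species = c("Adelie", "Chinstrap", "Gentoo"),
    island  = c("Biscoe", "Dream", "Torgersen")
  )
)

# Percent of each island's penguins by species: columns sum to 100
by_island <- 100 * prop.table(counts, margin = 2)

# Percent of each species found on each island: rows sum to 100
by_species <- 100 * prop.table(counts, margin = 1)
```

The by_island values correspond to Code Block 3.2 and the by_species values to Code Block 3.3 (transposed), so the order of variables in the formula decides which total serves as the denominator.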
However, in the tidyverse section, both the code and output took longer to explain.
Initial summary statistics for categorical variables are computed in Code Block 3.4, while
the tidy version of a relative frequency table is shown in Code Block 3.5.
penguins %>%
  group_by(island) %>%
  summarize(n = n())
penguins %>%
  group_by(island) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n))
penguins %>%
  group_by(island, species) %>%
  summarize(n = n())
Code Block 3.4: Computing summary statistics for one and two categorical variables
in the tidyverse syntax.
penguins %>%
  group_by(island, species) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n))
# A tibble: 5 x 4
# Groups:   island
  island    species       n  prop
  <fct>     <fct>     <int> <dbl>
1 Biscoe    Adelie       44 0.262
2 Biscoe    Gentoo      124 0.738
3 Dream     Adelie       56 0.452
4 Dream     Chinstrap    68 0.548
5 Torgersen Adelie       52 1
Code Block 3.5: Computing summary statistics for two categorical variables in the
tidyverse syntax.
Again, reversing the order of the variables (this time, inside dplyr::group_by())
changed the percentages, but it was more difficult to determine how the percentages added up,
because the data was in long format, rather than wide format. Compare Code Block 3.5
and Code Block 3.6 to see the effect of swapping the order of variables.
penguins %>%
  group_by(species, island) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n))
# A tibble: 5 x 4
# Groups:   species
  species   island        n  prop
  <fct>     <fct>     <int> <dbl>
1 Adelie    Biscoe       44 0.289
2 Adelie    Dream        56 0.368
3 Adelie    Torgersen    52 0.342
4 Chinstrap Dream        68 1
5 Gentoo    Biscoe      124 1
Code Block 3.6: Computing summary statistics for two categorical variables in the
tidyverse syntax, with variables swapped.
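One way to see how the long-format proportions add up is to spread them into a wide layout. The sketch below does this with base R's xtabs(), which is an assumption made to keep the example dependency-free and was not part of either lab; the values are copied from the output in Code Block 3.5.

```r
# Long-format proportions copied from Code Block 3.5
long <- data.frame(
  island  = c("Biscoe", "Biscoe", "Dream", "Dream", "Torgersen"),
  species = c("Adelie", "Gentoo", "Adelie", "Chinstrap", "Adelie"),
  prop    = c(0.262, 0.738, 0.452, 0.548, 1)
)

# Spread species across columns; combinations that never occur become 0
wide <- xtabs(prop ~ island + species, data = long)

# In the wide layout each row (island) visibly sums to 1
row_totals <- rowSums(wide)
```

In the wide table the grouping variable (island) defines the rows, so the proportions within each row are the ones that add to 1, a fact that is harder to read off the five-row tibble.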
A similar discrepancy can be seen in week 10, where the formula section’s RMark-
down document was pull(filter(lablines, week == 10, section == "formula"),
lines) lines long, and the videos totaled 19 minutes. That same week the
tidyverse RMarkdown document was pull(filter(lablines, week == 10, section
== "tidyverse"), lines) lines long, and the videos totaled 27 minutes.
The explanation for the varying time is similar, as well. Week 10 focused on inference
for two samples; that is, inference for a diﬀerence of proportions or a diﬀerence of means.
While a diﬀerence of means makes it fairly easy to know which variable should go where
(the quantitative variable is the response variable to take the mean of, and the categorical
variable is the explanatory variable splitting it), with a diﬀerence of two proportions the
concept comes back to thinking about two-way tables. Again, the tidyverse presentation
of a “two-way table” made this more diﬃcult to conceptualize.
In the formula section, students saw code like that in Code Block 3.7.
tally(island ~ sex, data = penguins, format = "proportion")
prop.test(island ~ sex, data = penguins, success = "Biscoe")
Code Block 3.7: Making a two-way table and performing inference for a difference
of proportions using the formula syntax. In order for this code to run as-is, the
Torgersen island has to be removed so there are just two categories in that variable.
The code for ﬁnding the point estimate using mosaic::tally() is quite similar to the
code for performing inference using prop.test().
In the tidyverse, the code is not as consistent. Students in this section saw code like
that shown in Code Block 3.8.
penguins %>%
  group_by(sex, island) %>%
  summarize(n = n()) %>%
  mutate(prop = n / sum(n))
penguins %>%
  prop_test(
    response = island,
    explanatory = sex,
    alternative = "two-sided",
    order = c("female", "male")
  )
Code Block 3.8: Making a ‘two-way table’ and performing inference for a difference
of proportions using the tidyverse syntax. Again, the Torgersen island data has been
removed.
In tidyverse syntax the code for finding the point estimate (dplyr’s group_by(),
summarize() and then mutate()) is quite different from the code performing the inference
(the infer::prop_test() function). And, the output from the inferential prop_test()
function makes it harder to determine whether the code was correct. In the prop.test()
output, sample estimates are provided, which allows you to check your work against a point
estimate computed earlier.
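The check described above, comparing a hand-computed point estimate with the sample estimates reported by prop.test(), can be sketched with the base stats::prop.test() function and its count-based interface. The counts are hypothetical and do not come from the penguins data; the labs used the formula interface shown in Code Block 3.7 rather than this one.

```r
# Hypothetical counts of penguins on Biscoe, by sex (not the real data)
biscoe_counts <- c(female = 80, male = 83)
group_totals  <- c(female = 141, male = 145)

# Point estimates computed by hand: proportion on Biscoe within each sex
point_estimates <- biscoe_counts / group_totals

# stats::prop.test() reports the same sample proportions in $estimate
result <- prop.test(x = biscoe_counts, n = group_totals)
reported <- unname(result$estimate)
```

If reported and point_estimates disagree, the most likely cause is that the response and explanatory variables were swapped, which is the mistake this check is meant to catch.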
These discrepancies made it take longer to explain code in the tidyverse section. Com-
parisons of RMarkdown document length and YouTube video length, as well as the corre-
sponding reasons for those discrepancies are the ﬁrst hint of the computing time results to
come in Section 3.6.
3.3 Number of functions
Since both sections relied on the use of RMarkdown documents, there is a wealth of text
data to be explored. The instructor prepared the pre-lab documents with blanks, but also
saved a ‘ﬁlled-in’ copy after recording the accompanying video. She also completed each
lab assignment in an RMarkdown document to generate a key.
Students in each section were also given an “All the R you need for intro stats”
cheatsheet at the beginning of the semester. These cheatsheets (one for formula and one for
tidyverse) were modeled on the cheatsheet of a similar name accompanying the mosaic
package (Pruim et al. 2017). The cheatsheets aimed to include all code necessary for the
entire semester, but were generated a priori.
These varied documents allow us to use automated methods to analyze the number
of unique functions shown in each section, using the getParseData() function from the
built-in utils package.
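A minimal sketch of this kind of analysis is shown below. It parses a short string of example code (standing in for the code chunks of an RMarkdown document, an assumption made for illustration) and extracts the names of the functions called. The paper's actual analysis scripts may differ.

```r
# A short string of example code standing in for an RMarkdown chunk
example_code <- "
library(mosaic)
mean(~bill_length_mm, data = penguins)
tally(~island, data = penguins)
mean(~body_mass_g, data = penguins)
"

# Parse without evaluating, keeping source references for getParseData()
parsed <- parse(text = example_code, keep.source = TRUE)
tokens <- utils::getParseData(parsed)

# Function calls are tagged with the token type SYMBOL_FUNCTION_CALL
calls <- tokens$text[tokens$token == "SYMBOL_FUNCTION_CALL"]

# Distinct functions used in the example
unique_functions <- sort(unique(calls))
```

Because the code is only parsed and never run, the packages it mentions do not need to be installed for the count to work.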
The cheatsheets given to students at the beginning of the semester contained 34 functions
for the formula section and 42 functions for the tidyverse section. There was an overlap
of 18 functions between the two cheatsheets.
Of course, while teaching a real class, an instructor often has to ad-lib at least a little.
So, it is also interesting to consider the number of functions actually shown throughout
the course of the semester. To do this, we can consider the functions shown in the filled-in
version of pre-lab documents the instructor ended up with after recording the associated
videos.
Considering this data, the formula section saw a total of 37 functions and the tidyverse
section saw 50, again with an overlap of 18 functions between the two sections. These
numbers make it appear as if in the formula section the instructor showed all functions
from the cheatsheet, and then a few additional functions. However, there were actually
several functions in the cheatsheet that were never shown in the actual class, and many
more functions that appeared in the class that did not make it onto the cheatsheet. For a
list of the functions used in both sections, see Appendix A.
In the tidyverse section, there were 9 functions shown in class that did not appear on
the cheatsheet, and only 1 function on the cheatsheet that was not discussed in class. In
the formula section, however, there were 10 functions shown in class that did not appear
on the cheatsheet, as well as 7 functions on the cheatsheet that were not discussed in class.
In both classes the majority of functions shown in class were on the cheatsheet.
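The comparisons between the cheatsheet and the functions shown in class are set operations, which can be sketched in base R. The two vectors below are small hypothetical lists used only for illustration, not the study's actual function lists (those are in Appendix A).

```r
# Hypothetical function lists (a tiny stand-in for the real data)
cheatsheet     <- c("library", "mean", "sd", "tally", "favstats", "gf_histogram")
shown_in_class <- c("library", "mean", "sd", "tally", "gf_histogram",
                    "gf_boxplot", "cor")

# On the cheatsheet but never shown in class
cheatsheet_only <- setdiff(cheatsheet, shown_in_class)

# Shown in class but missing from the cheatsheet
class_only <- setdiff(shown_in_class, cheatsheet)

# Appearing in both places
in_both <- intersect(cheatsheet, shown_in_class)

# Bookkeeping check: class total = cheatsheet - cheatsheet_only + class_only
totals_agree <- length(cheatsheet) - length(cheatsheet_only) +
  length(class_only) == length(shown_in_class)
```

The counts reported in the text satisfy the same identity: 34 - 7 + 10 = 37 for the formula section and 42 - 1 + 9 = 50 for the tidyverse section.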
Interestingly, there was quite a bit of overlap in the functions students saw in both
sections. Considering functions actually used in class, the two sections had 18 functions in
common.
The functions both sections of students saw included helper functions like library(),
set.seed(), and set() (a function in the knitr options included in the top of each RMarkdown
document), statistics like mean(), sd(), and cor(), and modeling-related functions
like aov(), lm(), summary() and predict().
Students in the formula section saw 19 functions unique from the set both sections saw,
while the tidyverse section saw 32 unique functions. It makes sense the number of unique
functions in the tidyverse section would be slightly larger. One reason is the ggplot2
helper functions ggplot() and aes().
Students in both sections saw how to make a barchart, boxplot, histogram, and
scatterplot, but in the formula section they used standalone functions like gf_boxplot() whereas
in the tidyverse section they needed to start with ggplot() and add on a geom function
like geom_boxplot(), while specifying the aesthetic values somewhere.
Similarly, both sections saw several common summary statistics, but in the formula
section they used the function (e.g. mean()) on its own, whereas in the tidyverse section
summary functions needed to be wrapped within summarize(). Students in the tidyverse
section also saw slightly more summary statistic functions, because one lab called for the
ﬁve number summary.
In the formula lab, students found the ﬁve number summary as shown in Code Block 3.9.
favstats(~bill_length_mm, data = penguins)
Code Block 3.9: The mosaic::favstats() function provides many common summary
statistics for one quantitative variable. The favstats() function automatically drops
missing values.
This approach is particularly attractive because it deals with missing values as part of
the standard output.
In the tidyverse section, the instructor chose to show two approaches. (Probably a
bad pedagogical decision.) Both approaches are in Code Block 3.10, and both needed to
include drop_na() to deal with missing values. Past those similarities, the approaches are
penguins %>%
  drop_na(bill_length_mm) %>%
  summarize(
    min = min(bill_length_mm),
    lower_hinge = quantile(bill_length_mm, .25),
    median = median(bill_length_mm),
    upper_hinge = quantile(bill_length_mm, .75),
    max = max(bill_length_mm)
  )
Code Block 3.10: Two approaches for doing summary statistics of one quantitative
variable in tidyverse syntax. The ﬁrst is quite verbose, the second is more compact
but introduces a function never seen again.
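A minimal sketch of what the more compact second approach could look like, assuming it pipes the column through pull() into base R's fivenum(). The fivenum() call and the small stand-in penguins data frame are assumptions made for illustration; the labs themselves used the palmerpenguins data.

```r
library(dplyr)  # %>% and pull()
library(tidyr)  # drop_na()

# Hypothetical stand-in for the penguins data, with one missing value
penguins <- data.frame(bill_length_mm = c(39.1, 39.5, NA, 40.3, 36.7, 46.5))

five_num <- penguins %>%
  drop_na(bill_length_mm) %>%
  pull(bill_length_mm) %>%   # extract the column as a plain numeric vector
  fivenum()                  # min, lower hinge, median, upper hinge, max

five_num
```

Note that fivenum() reports Tukey's hinges, which for some sample sizes differ slightly from quantile(x, .25) and quantile(x, .75).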
The instructor should have chosen a single solution to present to students, but was faced
with a dilemma. The first tidyverse approach is very verbose, but it follows nicely from
other summary statistics students had already seen, just adding a few more functions like
min(), max(), and quantile(). The second solution is more concise, but it introduces the
pull() function, which was never used again in the course.
This brings up an important consideration when teaching coding: how many times
students will see the same function. Because there is some cognitive load associated with
learning a new function, and repetition helps move information from working memory to
long-term memory, it is ideal for students to see each function at least twice (McNamara
et al. 2021b). When analyzing the number of functions shown in each section, we found
there were 7 functions shown only one time in the formula section, and 6 functions only
shown once in the tidyverse section.
The practice of analyzing the number of functions shown over the course of the semester
was eye-opening. It will provide valuable information for the instructor the next time she
teaches the course, as she can attempt to remove functions only shown once, and ensure
the cheatsheets better match what is actually shown throughout the semester.
3.4 Pre- and post-survey
As discussed in Section 2.1, the number of students who completed both the pre- and
post-surveys was low, so there is limited generalizability of the paired analysis.
The majority of the survey was modeled on a pre- and post-survey used by The Carpentries,
a global nonprofit teaching coding skills (Carpentries 2021). Questions ask respondents
to use a 5-point Likert scale, from 1 (strongly disagree) to 5 (strongly agree), to rate
their agreement with the following statements:
• I am confident in my ability to make use of programming software to work with data
• Having access to the original, raw data is important to be able to repeat an analysis
• Using a programming language (like R) can make me more efficient at working with data
• While working on a programming project, if I get stuck, I can find ways of overcoming
  the problem
• Using a programming language (like R) can make my analysis easier to reproduce
• I know how to search for answers to my technical questions online
In Figure 3, you can see a visualization of these Likert-scale questions, split by section.
It is diﬃcult to gather much of a conclusion from this ﬁgure. Many categories appear to
have made an improvement, while others seem to show a decrease in agreement from the
pre- to the post-survey. Additionally, the ﬁgure shows overall trends in the sections, and
does not utilize the potential for matching pre- and post-responses from the same student
to measure change at the individual level.
To consider this individual-level change, we can compute the difference between a
student’s response on the pre- and post-survey. We compute post score − pre score, such that
positive differences mean the student’s attitude on the item improved from the beginning
of the class to the end, and negative differences mean they worsened.
Figure 3: Pre and post responses to Likert-scale questions, on a scale from 1 (strongly
disagree) through 3 (neutral) to 5 (strongly agree). Most questions show
some level of improvement, such as the first question, ‘I am confident in my ability
to make use of programming software to work with data,’ but others show no change
or even a decline in agreement.
Figure 4: Distribution of paired differences for student responses to questions (x-axis:
difference in Likert rating between pre- and post-surveys). A
score of 0 means the student responded the same way in the pre- and post-surveys,
whereas a negative score means their agreement was lower at the end of the course,
and a positive score means their agreement was higher. The boxes cross 0 for all
except those for ‘I am confident in my ability to make use of programming software
to work with data’, and boxes appear similar between sections.
Because the questions were on Likert scales, it is not appropriate to compute an arithmetic
mean of the differences, but median scores can be computed. To provide a broader
picture of the distribution of responses, we also compute the 25th and 75th percentiles
for each section and score. This information is most easily displayed as a boxplot. The
boxplots in question can be seen in Figure 4.
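The paired-difference computation described above can be sketched in base R. The survey data frame below is made up purely for illustration, and the column names section, pre, and post are hypothetical.

```r
# Hypothetical paired responses to one Likert item (1-5), one row per student
survey <- data.frame(
  section = c("formula", "formula", "formula", "tidyverse", "tidyverse", "tidyverse"),
  pre     = c(2, 3, 4, 3, 2, 4),
  post    = c(3, 4, 4, 4, 4, 4)
)

# Positive values mean agreement increased from the pre- to the post-survey
survey$diff <- survey$post - survey$pre

# Median and quartiles of the differences within each section
medians   <- tapply(survey$diff, survey$section, median)
quartiles <- tapply(survey$diff, survey$section, quantile, probs = c(0.25, 0.75))

medians
quartiles
```

Calling boxplot(diff ~ section, data = survey) on the same data frame would draw the corresponding side-by-side boxplots.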
Because the sample sizes are so small, we will not attempt to use inferential statistics,
but it is worth noting almost all boxes are centered at 0 (meaning the median response did
not change over the course of the semester).
The one question that is an exception to this rule is “I am conﬁdent in my ability to make
use of programming software to work with data.” The boxes for both sections are centered
at a median of 1, meaning the median student answered one level up on the question at
the end of the course. Both boxes (the middle 50% of the data) are fully positive, although
the lower whisker (minimum value) for both includes zero.
It is somewhat heartening to know students improved their conﬁdence in programming
over the course of the semester, but there is no clear diﬀerence between the sections, so
this does not provide any strong evidence for one syntax or the other.
Likely, the questions used by The Carpentries were not well suited to this setting, and
a different set of survey questions would have been more appropriate for this group. For
example, this class did not include any explicit instruction on searching for answers online.
This was an intentional choice, because novices typically struggle to identify which search
results are relevant to their queries and get overwhelmed by the multitude of syntactic
options they run across. Instead, students with questions were referred to the “all the R you
need” cheatsheet they had been given at the beginning of the semester, which attempted to
summarize every function they would encounter. Likely, students still attempted to Google
questions, which may be why the responses to this question got more negative over the
course of the semester.
In addition to the six questions asked on both the pre- and post-survey, the two surveys
also had some unique questions.
The pre-survey also asked students to share what they were most looking forward to,
and most nervous about. Both sections had similar responses. Students wrote they looked
forward to “learning how to code!” and “Gaining a better understanding of how to analyze
data.” Beyond worries related to the pandemic, they expressed apprehension about “getting
stuck,” “using R,” and “Figuring out how to do the programming and typing everything
On the post-survey, students were asked to report which syntax they had learned, with
an option to respond “I don’t know.” All students in both sections correctly identified the
syntax associated with their lab. Then, they were asked if they would have preferred to
learn the other syntax. We hypothesized many students would say ‘yes,’ thinking the other
syntax would have been easier or lacked some feature they found frustrating. Surprisingly,
though, the majority of students in both sections said ‘no,’ they preferred to learn the
syntax they had been shown. Responses to this question are shown in Table 2.
However, part of the explanation is likely that the students did not know what the
other syntax looked like. Throughout the semester, the instructor was careful to only
expose students to the syntax for the particular section. Several students asked to see the
alternate syntax during oﬃce hours, but this was the exception and not the norm.
An optional follow-up question asked students why they had responded the way they
did. Responses to this question are shown in Table 3. Several students suggested a cross-over
design for the experiment would have allowed them to better compare, which is both
a good direction for further work and a possible indication the students were listening
during the chapter on experimental design.
Section    Answer  n   Proportion
formula    No      6   0.86
formula    Yes     1   0.14
tidyverse  No      10  0.91
tidyverse  Yes     1   0.09
Table 2: Responses to the question, ‘Would you have preferred to learn the other
syntax?’
Figure 5: Responses to the question, “How was the experience of learning to program
in R?” (response categories: About what I expected, in a good way; Not what I expected,
in a good way; About what I expected, in a bad way; Not what I expected, in a bad way;
x-axis: percentage of responses).
Another question on the post-survey asked students “How was the experience of learning
to program in R?” Overall, students seem to have positive sentiment toward learning R,
whether in the formula or the tidyverse section. As seen in Figure 5, most students said
either the experience was “not what I expected – in a good way” or “About what I expected
– in a good way.”
Nothing from the survey responses seems to indicate a difference between the two sections.
Section    Response
formula    I’ve heard that formula was more straightforward
formula    I thought the syntax that I learned worked well
formula    Because I am not familiar with it
formula    I have no idea what the differences are, so I don’t really know how to answer this
formula    Do not really know what the difference is, but also Prof. M was a very good teacher.
tidyverse  I’m not sure I wish we got to experience both so we could compare, maybe learn one
           for one half of the semester and the other for the other half?
tidyverse  As per my plan to study data Science in graduate school, I would have preferred
           learning both syntaxes
tidyverse  I really enjoyed tidyverse, it was super easy to learn, and I liked the simplicity of
tidyverse  Tidy, is well tidy. When looking online the other syntax seemed more
tidyverse  Im not sure what the benefit is.
tidyverse  I’m not sure of the difference and I had 0 experience of coding or using anything like
           r so I didn’t have a preference as to which one I learned.
tidyverse  I really enjoyed this class and have learned a lot.
Table 3: Reasons stated by students for their preference of which syntax to learn.
While the pre- and post-survey results do not reveal notable patterns, the incidental data
from YouTube and RStudio Cloud provided some insights.
3.5 YouTube analytics
Because of the format of the class, which was ﬂipped such that students watched videos
of pre-recorded content, we can study overall patterns of YouTube watch time. YouTube
oﬀers a data portal which allows for date targeting. We deﬁned each week of the semester
as running from Sunday to Saturday, which covered the time when videos were released
through to the time ﬁnished labs needed to be submitted (Fridays at 11:59 pm). For each
week, we downloaded YouTube analytics data for the channel, and ﬁltered the data to focus
only on the videos related to the introductory statistics labs.
Analytics data includes number of watches for each video, number of unique viewers,
and total watch time. We joined this data with data recording the length of the relevant
videos, which allowed us to calculate the approximate proportion of the videos watched by
Data from YouTube is aggregated, and since videos were posted publicly, could contain
viewers who were not enrolled in the class. However, when we checked view counts of lab
videos on subsequent weeks (e.g., looking at views for the “describing data” lab in weeks
3-15) there were rarely more than two views accumulated per section per week. While
the public nature of the videos means we do need to view these results with a level of
skepticism, we can be reasonably sure the majority of viewers were students. Studying the
data displays some interesting trends.
First, we can look at the number of unique watchers per video, seen in Figure 6. Inter-
estingly, at the start of the semester there are more unique viewers than enrolled students
in the class, but as time goes on, the number of unique viewers levels out at slightly less
than the number of enrolled students (n= 21 for both sections). The lower numbers later
on make sense because some students were likely unengaged, or found it possible to do
their lab work without watching the video. However, the high numbers at the start of the
semester are puzzling. Perhaps students were viewing the videos from a variety of devices
(phone, laptop, computer at school, etc) when the semester began.
Figure 6: Average number of unique viewers per video (y-axis: number of unique viewers).
Horizontal line represents the 21 students enrolled in each of the sections, a baseline
for comparison.
If we assume all viewers were actually students (some students being counted as sepa-
rate viewers because of diﬀerent devices or cookie settings), we can ﬁnd an approximate
proportion of video content watched, per student. This is shown in Figure 7. It appears the
proportion of video content watched is larger for the formula videos than for the tidyverse
videos. This can be conﬁrmed by a 95% bootstrap interval, which suggests the formula
section watched between 0 and 0 percentage points more of the videos each week.
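As a rough sketch of this calculation: the proportion watched is total watch time divided by the number of students times the video length, and a percentile bootstrap can then be applied to the weekly differences between sections. All numbers below are invented, the column names are hypothetical, and resampling weekly differences is only one plausible way to form the interval, since the exact bootstrap procedure is not specified here.

```r
set.seed(1)

# Hypothetical weekly totals (minutes) for four weeks in each section
weekly <- data.frame(
  week          = rep(1:4, times = 2),
  section       = rep(c("formula", "tidyverse"), each = 4),
  watch_minutes = c(300, 280, 260, 250, 310, 270, 240, 230),
  video_minutes = c(15, 16, 14, 15, 19, 20, 18, 19)
)

n_students <- 21  # enrolled students per section

# Approximate proportion of video content watched, per student
weekly$prop_watched <- weekly$watch_minutes / (n_students * weekly$video_minutes)

# Weekly difference in proportion watched (formula minus tidyverse)
week_diff <- weekly$prop_watched[weekly$section == "formula"] -
  weekly$prop_watched[weekly$section == "tidyverse"]

# Percentile bootstrap interval for the mean weekly difference
boot_means <- replicate(2000, mean(sample(week_diff, replace = TRUE)))
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```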
The discrepancy in watch proportions could be explained by the fact that videos for
the tidyverse section tended to be longer, as discussed in Section 3.2. Literature about
flipped classrooms suggests shorter videos are better, so perhaps the videos for the
tidyverse section were just too long. There is no consensus about the ideal length for
videos, however, with suggestions ranging from 5 to 20 minutes as a maximum length for a
video (Zuber 2016, Beatty et al. 2019, Guo et al. 2014). Most weeks the total number of
minutes of video content was below 20, and almost every week had video content split into
multiple shorter videos.
No matter the explanation, this trend is particularly interesting when considered in
conjunction with the RStudio Cloud usage patterns in the following section.
Figure 7: Estimated proportion of YouTube video content watched, per student (x-axis:
week of semester; y-axis: approximate proportion of video content watched, per student).
This data came from dividing the total amount of time watched by the number of
students in each section and the total length of the video(s) for the section that
week.
3.6 RStudio Cloud usage
The other source of unexpected data came from RStudio Cloud usage logs. RStudio Cloud
provides summary data per user in a project, aggregated by calendar month. This data
includes all students enrolled in the class.
Since the instructor set up separate projects for each section, it is easy to compare data
between sections. In Figure 8 we can see the amount of compute time used by each student
in each section. Lines connect data from a particular student, to allow the reader to trace
over time. For a monthly overview, see Figure 9.
Note that the month of November is missing for the tidyverse section because of an
oversight on the part of the author.
Figure 8: Hours of compute time per student over the course of the semester (x-axis:
month, September through December; y-axis: hours of compute time on RStudio Cloud).
Figure 9: Hours of compute time on RStudio Cloud, per month of the semester.
Students in the tidyverse section appear to be spending more time on RStudio
Cloud, particularly in the months of October and December.
Section    September   October      November  December
formula    10.4 (3.3)  13.9 (10.3)  9.4 (6)   7.7 (6)
tidyverse  7.7 (4.7)   17.1 (8.6)   missing   11.5 (7.2)
Table 4: Mean student compute time on RStudio Cloud per month in hours (standard
deviation in parentheses), broken down by section. Note different months had
different numbers of assignments, although the number of assignments was consistent
between sections.
While the tidyverse section seemed to watch less of the provided videos each week (as
discussed in Section 3.5), they appear to spend more time on RStudio Cloud per month.
All the distributions are right-skewed, with several students spending many more hours
of compute time than the majority. It is also important to note these numbers are likely
inﬂated based on the way RStudio Cloud counts usage time. The spaces for both sections
were allocated 1 GB of RAM and 1 CPU, so one hour of clock time on the space counted as
one project hour (spaces with more RAM or CPU may consume more than one project hour
per clock hour), but student usage often includes a fair amount of idle time. RStudio Cloud
will put a project to sleep after 15 minutes without interaction, and based on observation
of student habits it is likely almost every session ends with a 15 minute idle time before
the project sleeps. In a month with four labs, this can add up to at least an hour of project
time that does not correspond to students actually using R.
Nevertheless, because the numbers would be inﬂated in the same way in both sections,
we can persist in comparing them. Using data over the entire semester, students in the
tidyverse section had a mean of 13.5 compute hours per month and students
in the formula section had a mean of 11.5 hours.
We can also study these numbers per month, as seen in Table 4. The mean compute
time for both sections increases from September to October, likely because of the increased
number of labs that month (two labs were due in September, ﬁve in October). Compute
time then drops down again for the formula section, and continues downward. November
data is missing for the tidyverse section, but time also appears to decrease in this section
as months progress, although not to the same degree as in the formula section.
Whereas in the pre- and post-surveys we have quite small sample sizes, the RStudio
Cloud data includes all students enrolled in the class. This means we perhaps have a large
enough sample to perform inferential statistics.
effect    group     term                            estimate   std.error  statistic
fixed     NA        (Intercept)                     11.381885  1.556911   7.3105558
fixed     NA        sectiontidyverse                -1.976604  2.175435   -0.9086018
fixed     NA        monthOctober                    4.359535   1.653232   2.6369779
fixed     NA        monthNovember                   -1.715090  1.653232   -1.0374167
fixed     NA        monthDecember                   -2.300425  1.653232   -1.3914717
fixed     NA        sectiontidyverse:monthOctober   4.899422   2.310021   2.1209425
fixed     NA        sectiontidyverse:monthDecember  5.200658   2.310021   2.2513466
ran_pars  ID        sd__(Intercept)                 4.598662   NA         NA
ran_pars  Residual  sd__Observation                 5.227977   NA         NA
Table 5: Linear mixed-effects model, using month as a categorical variable.
Data was collected at the student level over time, so it is necessary to use a mixed eﬀects
model to account for clustering within students. We also need to take into account the
longitudinal nature of the data, so we included month as a predictor. We use the lme4
package to ﬁt the linear mixed eﬀect models (Bates et al. 2015).
Initially, we ﬁt an unconditional means model, to determine how much variability in
compute time was due to diﬀerences between students, without considering diﬀerences over
time or between section. Based on the intraclass correlation coeﬃcient, we can conclude
30% of the total variation in compute time is attributable to diﬀerences between students.
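For reference, the intraclass correlation coefficient from an unconditional means model is the between-student variance divided by the total variance. A minimal sketch with lme4 follows, using simulated data; the data frame usage, its columns id and hours, and the variances used in the simulation are assumptions for illustration, not the study's data.

```r
library(lme4)

set.seed(2022)
n_students <- 42
n_months   <- 4

# Simulated monthly compute hours: a student-level effect plus monthly noise
student_effect <- rnorm(n_students, mean = 0, sd = 4)
usage <- data.frame(
  id    = factor(rep(seq_len(n_students), each = n_months)),
  hours = 12 + rep(student_effect, each = n_months) +
    rnorm(n_students * n_months, mean = 0, sd = 5)
)

# Unconditional means model: intercept only, random intercept per student
null_model <- lmer(hours ~ 1 + (1 | id), data = usage)

# Variance components: between-student and residual
vc <- as.data.frame(VarCorr(null_model))
var_between  <- vc$vcov[vc$grp == "id"]
var_residual <- vc$vcov[vc$grp == "Residual"]

# Share of total variance attributable to differences between students
icc <- var_between / (var_between + var_residual)
icc
```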
After iterating through several candidate models, we arrived at a ﬁnal model which pre-
dicts compute time per month (in hours) using section and month as ﬁxed eﬀect predictors,
as well as an interaction eﬀect between section and month. Student identiﬁer was used as a
random eﬀect. This ﬁnal model has the lowest AIC and BIC values of all candidate models.
Results from the model can be seen in Table 5.
The predicted values for each section/month combination match the means computed
in Table 4.
The lme4 package does not provide p-values for model coeﬃcients, but it does provide
a method for conﬁdence intervals. The conﬁdence intervals for each of the coeﬃcients are
shown in Table 6.
                                2.5 %       97.5 %
.sig01                          3.2512430   6.0590086
.sigma                          4.4708436   5.8874342
(Intercept)                     8.3756022   14.3881678
sectiontidyverse                -6.1772116  2.2240035
monthOctober                    1.1696135   7.5494564
monthNovember                   -4.9050115  1.4748314
monthDecember                   -5.4903465  0.8894964
sectiontidyverse:monthOctober   0.4422206   9.3566237
sectiontidyverse:monthDecember  0.7434568   9.6578598
Table 6: Confidence intervals for coefficient estimates.
The conﬁdence interval on the sectiontidyverse coeﬃcient crosses zero, which sug-
gests the diﬀerence in number of hours of compute time between the sections in September
was not statistically signiﬁcant. The conﬁdence interval on monthOctober does not cross
zero, suggesting students in the formula section spent longer on RStudio Cloud that month
compared to September. But, the intervals for the formula section in November and De-
cember cross zero, which means the number of compute hours is not signiﬁcantly diﬀerent
from the number of hours in September for that section. For the tidyverse section it
is a little harder to assess. The intervals for sectiontidyverse:monthOctober and
sectiontidyverse:monthDecember do not cross zero, but if combined with the
intervals on monthOctober and monthDecember, they would.
As a model assessment strategy, we can use a likelihood ratio test to compare the
unconditional means model with our more complex model. A drop-in-deviance test suggests
the more complex model signiﬁcantly outperforms the unconditional means model.
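The modeling steps in this section can be sketched with lme4 as follows. The data are simulated, the column names (id, section, month, hours) are hypothetical, and unlike the real data no section/month combination is missing, so the output will not reproduce Tables 5 and 6.

```r
library(lme4)

set.seed(2022)
months <- c("September", "October", "November", "December")
n_per_section <- 21

# One row per student per month; the first 21 ids form the formula section
usage <- expand.grid(
  month = factor(months, levels = months),
  id    = factor(seq_len(2 * n_per_section))
)
usage$section <- factor(ifelse(as.integer(usage$id) <= n_per_section,
                               "formula", "tidyverse"))

# Simulated hours: student effect + month effect + a section-by-month bump
student_effect <- rnorm(2 * n_per_section, sd = 4)
month_effect   <- c(September = 0, October = 4, November = -2, December = -2)
usage$hours <- 11 +
  student_effect[as.integer(usage$id)] +
  unname(month_effect[as.character(usage$month)]) +
  ifelse(usage$section == "tidyverse" & usage$month != "September", 3, 0) +
  rnorm(nrow(usage), sd = 5)

# Unconditional means model and the model with section, month, and interaction
null_model <- lmer(hours ~ 1 + (1 | id), data = usage)
full_model <- lmer(hours ~ section * month + (1 | id), data = usage)

# Profile confidence intervals (rows .sig01, .sigma, then fixed effects)
ci <- confint(full_model)

# Likelihood ratio (drop-in-deviance) test; anova() refits both models with ML
lrt <- anova(null_model, full_model)
p_value <- lrt[["Pr(>Chisq)"]][2]

ci
lrt
```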
Based on the signiﬁcance of the drop-in-deviance test, and the number of conﬁdence
intervals in the model that did not cross zero, it seems both month and section have some
predictive power for the number of compute hours students used on RStudio Cloud.
It appears students in the tidyverse section spent more time on RStudio Cloud. We
can concoct several diﬀerent scenarios to explain this diﬀerence. In one, students in the
tidyverse section were more engaged with their work, so spent more time playing with
code in R. In another, students in the tidyverse section struggled to complete their work,
so spent more time in R trying to get their lab material to work. Because the usage data
was collected incidentally after the fact, we have no information about which story is closer
to the truth. A follow-up study might conduct semi-structured interviews with students
after the completion of the class, to determine more about student experiences and work
It would also be interesting to know if students who spent more time on RStudio Cloud
received higher or lower grades on their assignments, but as discussed in Section 3.1, the
IRB for this study did not cover graded student work in that way. We do know the two
sections did not have an overall diﬀerence in mean grade.
Since these results are from a pilot study, they should not be used without caveats.
However, they do indicate that if instructors are worried about the amount of time assign-
ments take to complete, they may want to consider using the formula syntax rather than
the tidyverse syntax.
Another follow-up study that would be interesting to complete would look at student
success in subsequent courses. Because tidyverse syntax is frequently used for higher-
level courses, students who were in the tidyverse section may have an easier time in
those later courses. However, many students in this study will not go on to take further
statistics courses. So the takeaways about syntax choice may vary depending on the student
population to which they will be applied.
This pilot study provides a semester-long comparison of two sections of introductory statis-
tics labs using two popular R coding styles, the formula syntax and the tidyverse syntax.
Pre- and post-survey analysis showed limited diﬀerences between the two sections, but
analysis of other incidental data, including pre-lab document lengths and YouTube and
RStudio Cloud data presented interesting distinctions.
Materials for the tidyverse section tended to be longer, both in lines of code (likely
because of the convention of linebreaks after %>%) as well as the length of the associated
YouTube videos. Students in the tidyverse section watched a smaller proportion of the
weekly pre-lab videos than students in the formula section, but spent more time computing
on RStudio Cloud. Conversely, students in the formula section watched a larger proportion
of the pre-lab videos each week, but spent less time computing each month.
These two insights are slightly contradictory: perhaps the formula section students found
the concepts more complex as they were watching the videos, but then had an easier time
applying them as they worked on the real lab.
There is much more interesting further work that could be considered. As students
suggested, a cross-over design where students saw one syntax for the ﬁrst half of the semester
and the other for the second half would allow for better comparisons. However, there are
a few caveats here.
First, anecdotal evidence from many instructors suggests it is best for students to see
only one consistent syntax over the course of the semester. The other challenge is that the
formula syntax tends to seep (albeit only slightly) into the tidyverse section. For example,
when doing linear regression both sections saw the lm(y~x, data = data) formula syntax.
If a cross-over design used the existing materials from this study, just swapping the ﬁnal
few weeks, students in the formula section would likely see more that was familiar to them
than students in the tidyverse section.
By this consideration, the tidyverse students almost did have a cross-over design. This
may be why the number of hours of compute time for the tidyverse section remained
consistent from November to December (even though there were fewer instructional weeks
in December) while the formula section’s hours of compute time decreased.
Another interesting insight from this pilot is the number of unique functions needed to
cover a semester of introductory statistics in R. The tidyverse section saw more unique
functions, but both sections were limited to a small vocabulary of functions for the semester.
We recommend instructors follow this approach regardless of syntax. Instructors should
attempt to reduce the number of functions they expose students to over the course of a
semester, particularly in an introductory class. This will help reduce cognitive load.
One criticism of the tidyverse is how many functions the associated packages contain.
However, while the tidyverse section exposed students to 32 section-specific functions,
compared to the 19 shown in the formula section, both labs focused on a relatively small
number of functions. Because there were 12 labs in the semester, this averages out to
approximately 3 functions per lab for the tidyverse section compared to an average of 2
functions shown in the formula section.
The exercise of counting R functions in existing materials, using the getParseData()
function, is one we recommend all instructors attempt, particularly before re-teaching a
course. It can be eye-opening to discover how many functions you show students, and
which functions are only used once.
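A minimal sketch of this kind of function count is below. It parses a short string of R code rather than real lab files; for RMarkdown materials, the code chunks would first need to be extracted (for example with knitr::purl()), which is an assumption about workflow rather than a description of the exact procedure used in this study.

```r
# Hypothetical stand-in for the code from a set of labs
lab_code <- "
x <- c(1, 2, 3, NA)
mean(x, na.rm = TRUE)
sd(x, na.rm = TRUE)
mean(x[1:3])
"

# keep.source = TRUE is needed so that the parse data is retained
parsed <- parse(text = lab_code, keep.source = TRUE)
parse_data <- utils::getParseData(parsed)

# Function calls are tagged with the token SYMBOL_FUNCTION_CALL
calls <- parse_data$text[parse_data$token == "SYMBOL_FUNCTION_CALL"]
call_counts <- sort(table(calls), decreasing = TRUE)
call_counts

# Functions that appear only once are candidates for removal
used_once <- names(call_counts)[call_counts == 1]
used_once
```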
We hope this pilot helps answer some initial questions about the impact of R syntax on
teaching introductory statistics, while also raising further questions for future study. While
some aspects of the analysis in this study suggest the formula syntax is simpler for students
to learn and use, there are still many course scenarios for which we believe the tidyverse
syntax is the most appropriate choice. While formula syntax can be used throughout an
entire semester of introductory statistics, it does not oﬀer functionality for tasks like data
wrangling. This means students who will go on to additional statistics or data science
classes may be better served by an early introduction to tidyverse. However, in order to
determine this conclusively, additional study would be needed.
No matter which syntax an instructor chooses, it appears possible to limit the number
of functions shown in a semester, and provide students with a positive learning experience.
Thanks to Sean Kross for his guidance about parsing R function data, and Nick Horton
for his useful comments.
A Functions used
(a) Used in both sections
(b) Used only in formula
• get_p_value
(c) Used only in tidyverse
Table 7: Lists of functions, and which section(s) they were used in.
Adhikari, A., DeNero, J. & Jordan, M. I. (2021), ‘Interleaving Computational and Inferen-
tial Thinking: Data Science for Undergraduates at Berkeley’, arXiv:2102.09391 [cs] .
Bates, D., Mächler, M., Bolker, B. & Walker, S. (2015), ‘Fitting Linear Mixed-Effects
Models Using lme4’, Journal of Statistical Software 67(1).
Beatty, B. J., Merchant, Z. & Albert, M. (2019), ‘Analysis of Student Use of Video in a
Flipped Classroom’, TechTrends 63(4), 376–385.
Biehler, R. (1997), ‘Software for Learning and for Doing Statistics’, International Statistical
Review 65(2), 167–189.
Bray, A., Ismay, C., Chasnovski, E., Baumer, B. & Cetinkaya-Rundel, M. (2021), Infer:
Tidy Statistical Inference.
Carpentries, T. (2021), ‘The Carpentries Survey Archives’.
Çetinkaya-Rundel, M., Hardin, J., Baumer, B. S., McNamara, A., Horton, N. J. & Rundel,
C. (2021), ‘An educator’s perspective of the tidyverse’, arXiv:2108.03510 [stat] .
DeNero, J., Culler, D., Wan, A. & Lau, S. (2020), ‘datascience 0.15.7’.
Finzer, W. (2002), ‘Fathom: Dynamic Data Software (version 2.1)’, Key Curriculum Press.
GAISE College Report ASA Revision Committee (2016), Guidelines for Assessment and
Instruction in Statistics Education College Report 2016, American Statistical Association.
Gramazio, C. C., Laidlaw, D. H. & Schloss, K. B. (2017), ‘Colorgorical: Creating discrim-
inable and preferable color palettes for information visualization’, IEEE Transactions on
Visualization and Computer Graphics 23(1), 521–530.
Guo, P. J., Kim, J. & Rubin, R. (2014), How video production aﬀects student engagement:
An empirical study of MOOC videos, in ‘Proceedings of the First ACM Conference on
Learning @ Scale Conference’, ACM, Atlanta Georgia USA, pp. 41–50.
Harrower, M. & Brewer, C. A. (2003), ‘ColorBrewer.org: An Online Tool for Selecting
Colour Schemes for Maps’, The Cartographic Journal 40(1), 27–37.
Horst, A. M., Hill, A. P. & Gorman, K. B. (2020), ‘Palmerpenguins: Palmer Archipelago
(Antarctica) penguin data. R package version 0.1.0’, Zenodo.
Kaplan, D. & Pruim, R. (2020), Ggformula: Formula Interface to the Grammar of Graphics.
Konold, C. & Miller, C. D. (2001), ‘TinkerPlots (version 0.23). Data Analysis Software.’.
Krishnamurthi, S., Schanzer, E., Politz, J. G., Lerner, B. S., Fisler, K. & Dooman, S. (2020),
‘Data Science as a Route to AI for Middle- and High-School Students’, arXiv:2005.01794
McNamara, A. (2015), Bridging the Gap Between Tools for Learning and for Doing Statis-
tics, PhD thesis, University of California, Los Angeles.
McNamara, A. (2018), ‘R Syntax Comparison Cheatsheet’.
McNamara, A., Zieﬄer, A., Beckman, M., Legacy, C., Butler Basner, E., delMas, R. C. &
Rao, V. V. (2021a), Computing in the Statistics Curriculum: Lessons Learned from the
Educational Sciences, in ‘USCOTS 2021’.
McNamara, A., Zieffler, A., Beckman, M., Legacy, C., Butler Basner, E., delMas, R. &
Rao, V. V. (2021b), ‘Computing in the Statistics Curriculum: Lessons Learned from the
Educational Sciences’.
Morandat, F., Hill, B., Osvald, L. & Vitek, J. (2012), Evaluating the Design of the R
Language: Objects and Functions For Data Analysis, in ‘ECOOP’12 Proceedings of the
26th European Conference on Object-Oriented Programming’.
Peters, T. (2004), ‘PEP 20 – The Zen of Python’.
Pruim, R., Kaplan, D. & Horton, N. J. (2017), ‘The mosaic package: Helping students
‘think with data’ using R’, The R Journal 9(1).
R Core Team (2020), R: A Language and Environment for Statistical Computing, R Foun-
dation for Statistical Computing, Vienna, Austria.
Rafalski, T., Uesbeck, P. M., Panks-Meloney, C., Daleiden, P., Allee, W., McNamara, A.
& Steﬁk, A. (2019), A Randomized Controlled Trial on the Wild Wild West of Scientiﬁc
Computing with Student Learners, in ‘Proceedings of the 2019 ACM Conference on
International Computing Education Research’, pp. 239–247.
Roberts, S. (2015), Measuring Formative Learning Behaviors of Introductory Statistical
Programming in R via Content Clustering, PhD thesis, University of California, Los
Angeles.
RStudio PBC (2021), ‘RStudio Cloud - Do, Share, Teach, and Learn Data Science’.
Steﬁk, A. & Siebert, S. (2013), ‘An Empirical Investigation into Programming Language
Syntax’, ACM Transactions on Computing Education 13(4).
Steﬁk, A., Siebert, S., Steﬁk, M. & Slattery, K. (2011), An Empirical Comparison of
the Accuracy Rates of Novices using the Quorum, Perl and Randomo Programming
Languages, in ‘PLATAEU 2011’.
The Concord Consortium (2020), ‘CODAP - Common Online Data Analysis Platform’.
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund,
G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache,
S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., Takahashi, K.,
Vaughan, D., Wilke, C., Woo, K. & Yutani, H. (2019), ‘Welcome to the Tidyverse’,
Journal of Open Source Software 4(43), 1686.
Zuber, W. J. (2016), ‘The ﬂipped classroom, a review of the literature’, Industrial and
Commercial Training 48(2), 97–103.