Introductory data science across
disciplines, using Python, case studies
and industry consulting projects
Jana Lasser1,2,3, Debsankha Manik2, Alexander
Silbersdorff3,4, Benjamin Säfken3,4, Thomas Kneib3,4
1 Complexity Science Hub Vienna, Josefstädterstrasse 39, 1080 Vienna, Austria
2 Max Planck Institute for Dynamics and Self-Organization, Am Fassberg 17, 37077 Göttingen, Germany
3 Centre for Statistics, Georg August University Göttingen, Humboldtallee 3, 37073 Göttingen, Germany
4 Campus-Institut Data Science, Georg August University Göttingen, Goldschmidtstraße 1, 37077 Göttingen, Germany
Data science and its applications are increasingly ubiquitous in a rapidly digitizing world. Consequently, students across disciplines face growing demand [30, 27] for the skills and awareness needed in all sectors to collect, manage, evaluate, apply and extract knowledge from data, and to critically reflect upon the derived insights. Against this backdrop, the competence to carry out essential data analyses independently, together with a basic understanding of the more advanced processes and procedures used by data scientists, sufficient to collaborate with them in a specific field of work, are deemed desirable if not outright necessary in the future.
As part of a joint initiative of several German universities and German businesses,
the authors of this paper have developed a “service-course” that aims to teach fundamen-
tal data competencies to students from all disciplines at the University of Göttingen. See
https://www.stifterverband.org/data-literacy-education for further information on this col-
laboration. The course especially addresses those students from outside STEM-subjects
(science, technology, engineering, and mathematics) who generally have no prior expe-
rience with statistics or programming from their school education and highlights the im-
portance of data competencies in prospective occupational ﬁelds for those students at the
outset of their studies.
We aim to provide all participating students with a fundamental understanding of the
concepts and procedure of data science and motivate a fair share of them to pursue further
courses geared to evolve their competencies in that regard. Moreover, the course aims to
convey not only competencies from the domains of statistics and computer science but
equally aims to develop the soft skills associated with data analysis, such as communi-
cating the results in the context of tasks both inside and outside of university. The course
builds on several role models, most prominently data8 (see http://data8.org/), the data science course developed at the University of California, Berkeley. Other similar courses have been described in the literature. However, a number of factors and components of this course bring together both well-established and current pedagogies and practices of interest and value for current and future learning in introductory data science across disciplines.
Despite a general agreement regarding the importance of statistics and programming
competencies in most ministries, the complex federal structure of school education in
Germany and the general problems of overhauling established school curricula yield the
given status quo with no systematic programming, data science or statistics training prior
to university for most students. Hence this course cannot assume even the basic statistics
background that is commonly taught from middle school onward in many other countries today. In
addition, students choosing to do the course are not only in areas which are traditional
foci of non-STEM statistics courses (such as Business), but also in Humanities programs,
including Linguistics, Archaeology and History. Such diversity of interests is catered
for in split tutorials which also use contemporary, complex and non-traditional data sets.
What computer programming to teach, and how much, are significant considerations in designing introductory data science courses. Assuming no prior programming experience left us free to choose a programming language. We chose Python despite the increasing use of R and R-based products in statistics courses and new data science courses.
In the ﬁnal phase of the course, a mandatory project is carried out in authentic
workplace-linked data investigations involving all aspects of data science at an introduc-
tory level. The importance of authentic experiential learning of the full statistical data
investigation process has long been recognised, but facilitating this at the introductory
level has proved challenging. Data science adds more challenges to this through require-
ments in data wrangling, cleaning and visualisation. Therefore the course also builds on, and extends to contemporary needs in learning from data, the objectives of courses with a particular focus on first-hand, real investigative experiences, such as [16, 32, 4, 12, 35, 2].
In this paper, we describe some details of the course, including the learning pro-
gression phases, tutorial work, projects, and openly available online teaching resources,
consisting of slides, videos, exercises and solutions. Because the course is so new, only
some initial evaluations are available; these are included in the discussion.
2 Course outline
Within the teaching landscape of our university, this course is intended as a service course
to introduce students from all disciplines to data science. The structure of the course
is geared towards motivating students from the outset while gently introducing them to
important core skills required for working with data or cooperating with data scientists.
In addition, the interdisciplinary nature of the course and the time constraints imposed by
the students’ curricula require an approach that evades any discipline-speciﬁc overload,
while also giving the students from the participating disciplines ﬁrst insights into the
nature of the applications of the competencies conveyed to them in the course.
Regarding the required core competencies for all these students from different dis-
ciplines, the target was to develop basic competencies allowing the students to indepen-
dently ﬁnd, read and clean data for further analysis. Considerable time was then devoted
to developing the students' abilities in exploratory and visual data analysis. Finally,
we aimed for a fundamental understanding regarding inference and prediction on the basis
of statistical models. While all of these desired processes can be taught via (often field-specific) point-and-click interfaces – and previously mostly were in non-STEM disciplines – this course specifically opted to pursue the introduction on the basis of a programming language, following other recent initiatives to clear this methodological hurdle at the outset of statistical education [5, 34].
The key argument for starting directly with a programming language (and thereby against spreadsheet programmes like Excel or OpenOffice Calc and mostly interface-based data analytics programs such as SPSS or Gretl) was that students are directly nudged towards thinking algorithmically rather than looking for one single comprehensive process, thereby training a fundamental skill of analytical processes at large. In addition, the use of a programming language has the obvious advantage that students are introduced to the concept of programming. This is paramount for subsequent, more advanced courses, which will hopefully be attended by a substantial share of the course's participants.
Concerning the programming language, we opted for Python. There are several
reasons for choosing Python over other programming languages such as R, Java, SAS,
STATA or MATLAB: Python is a syntactically simple programming language, which fa-
cilitates the learning of basic programming concepts [19, 1, 15]. Additionally, Python is
open source, eliminating the need for the acquisition of costly licenses. Python is highly
prevalent in industry [24, 31] and has a rich and thriving ecosystem of libraries for scien-
tific computing, enabling students to directly translate their competencies to potential
applications in their later jobs in industry or research. Lastly, Python is a general purpose
programming language, allowing for easy extension of the learned skills into different
application areas such as image analysis, natural language processing, data mining or
machine learning in subsequent lectures that build on our introductory course.
The course consists of a weekly lecture for all students. The lecture is accompanied
by weekly tutorials in which students are split into groups with each group working on
one of currently four domain-speciﬁc case studies (see Appendix A.1). Content-wise, the
course is split into three phases of about equal length and importance: (I) teaching of basic
programming skills in Python (II) teaching of data analysis methods and application in a
case study (III) outlook and work on a project, which is used for student assessment. The
time line of the lecture and accompanying tutorial on a week-to-week basis are shown in
table 1. The skills taught throughout the three course phases can be mapped to the skills
making up the definition of "data acumen", which we extend by a dedicated sixth
skill for "data visualization":
(a) Combine many existing programs or codes into a “workﬂow” that will accomplish
some important task;
(b) “Ingest,” “clean,” and then “wrangle” data into reliable and useful forms;
(c) Visualize data;
(d) Think about how a data processing workﬂow might be affected by data issues;
(e) Question the formulation and establishment of sound analytical methods; and
(f) Communicate effectively about properties of computer codes, task workﬂows,
databases, and data issues.
Here, the lecture focuses predominantly on skills (d) and (e), whereas the tutorials focus
on skills (a) and (b). Visualization (c) is used and taught continuously in both the lecture
and tutorial to illustrate the application of other skills and motivate students. Skill (f)
is predominantly acquired by the students during their work on projects and subsequent
presentation of results. We note that we cover skill area (a) by introducing the students to
a programming language (Python), which – together with existing third-party libraries –
incorporates all necessary functionality to establish a data science workﬂow. In table 1,
mappings of lecture and tutorial content to these six skill areas are indicated by their
respective letters from the list above.
The lecture is conceptualised as a blended learning exercise entailing classical input
by a lecturer as well as live coding sessions and on-demand videos for key concepts that
students can watch at home. In phases (I) and (II), the content of the lecture serves two
goals: (1) to quickly introduce students to the programming and statistics concepts nec-
essary to work on the case study in the tutorials and (2) to present motivating examples of
data science applications. In phase (III), the lecture content is designed to address current
topics in data science and provide an outlook to domain speciﬁc applications that serve as
a connection to further lectures that build on this introductory course.
During the tutorials, the concepts introduced in the lecture are applied to domain-
speciﬁc case studies constructed around current and domain speciﬁc data sets. Every case
study is designed as a series of exercises and solutions that build on each other and iterate
the general phases of data acquisition, data cleaning, data exploration and visualization
and the subsequent answering of research questions using inference. All case studies con-
taining the individual tutorial exercises and solutions are freely available. Learning
in the tutorials follows a student-centered approach: Students are actively encouraged to
look for solutions to coding problems on their own and help each other before they ap-
proach tutors with questions. Tutors are instructed to coach students on how to interpret
error messages and use online resources such as library documentation or Stack Overflow (https://stackoverflow.com/) to solve problems they encounter while working on the
exercises. This is deliberately done to mimic the iterative process of failing and learning
from failure which is inherent to programming. Consequently, students develop commu-
nication strategies and problem solving skills by learning together.
To accommodate the different skill levels of the heterogeneous students, exercises
are divided into optional and non-optional parts. Non-optional exercises are designed
in a way that any student–including those with no prior knowledge–should be able to
complete them during the course of the weekly, two-hour tutorial. Optional exercises are
more challenging or provide additional domain-speciﬁc context and are aimed at more apt
students and those interested in a deeper understanding.
The course assessment is speciﬁcally designed to convey substantial practical expe-
rience to the students, which is known to be in high demand for their later vocational
development. The assessment consists of work on a project in small teams of 2-3 students
each. Projects are designed in collaboration with industry and research partners and based
on real-life data sets (see Appendix A.2 for a list of the projects used). Project outcomes
are then presented during a talk by all team members. The presentation as well as the
Jupyter Notebooks created by the students during the work on their projects are used to
assess students' success at the end of the lecture.
In the following we will describe the content of both the lecture and the tutorial as
well as the criteria for the final assessment in more detail. To this end we follow an
exemplary case study in which a large corpus of tweets is explored.
phase  week  lecture                               tutorials
       0     motivation and organisation
I      1     causality, correlation (e)            tech-check
I      2     Python, Jupyter Notebooks (a)         using Python as a calculator (a)
I      3     data types (a,b)                      text, numbers, lists, tables (a,b)
I      4     tables, incomplete data (c)           logic, conditionals (b)
I      5     mean, median (e)                      functions, data types (b)
I      6     histograms, frequency tables (c)      data acquisition & cleaning (b)
II     7     scatter plots, time series (c)        histogram, descriptive stats (c)
II     9     clustering algorithms (e)             time series (c)
III    11    project selection                     inference, bias (a,b,e)
III    12    open topic: data ethics (e)           work on projects (f)
III    13    open topic: data protection (d,e)     work on projects (f)
III    14    open topic: machine learning (e)      work on projects (f)
Table 1: Time line of the semester-long lecture and accompanying tutorials.
Skills covered in the respective lecture and tutorial units, corresponding to
the "data acumen" definition plus "data visualization", are indicated by the
letters in parentheses (see list above). Note that the project presentations
take place during the semester break, several weeks after the end of the
lecture, as students require additional time to work on their projects
between the last lecture and the project presentations.
3 Phase I: learning how to program
Learning how to code fulfills two goals in the context of this course: (1) students learn how to use the tools a programming language provides to analyze data; and (2) learning how to program introduces students to algorithmic thinking, a core skill students should acquire through participation in this course.
The ﬁrst phase of the lecture takes six weeks. The goal is to enable students with
no background in programming to use Python for data analysis applications. Therefore
we limit the content of the lecture to core features of a programming language that are
useful for data analysis. More advanced programming paradigms such as object oriented
programming or algorithm complexity are consciously left out. Phase (I) is accompanied
by a series of learning videos for every core programming skill.
At the outset, we offer an obligatory lecture prior to the ofﬁcial start of the course that
informs the students about the requirements and organisational aspects of the course as
well as aiming to draw them into the course by highlighting the importance of data analysis.
Starting out with the latter, we initiate the lecture by using the electronic voting system
mVote to survey students' opinions regarding the importance of data analysis in our society. Subsequently, we ask one to three of the students indicating the highest level, as well as of those indicating the lowest level (if any), why they chose that rating. On the one hand this usually foreshadows our arguments about the importance of data analysis, while on the other hand it aims to instill an interactive atmosphere in
the lecture as well as later in the tutorials. Subsequently we use three illustrative examples,
relating data analysis with money making (via predictive advertising), health services (via
case number analysis of cardiac arrests during the football World Cup in Germany) and
love/sexuality (via illustrating the underlying logic of the Tinder match-algorithm). Last
but not least, we turn towards organisational issues of the course (schedule, examination, and so on).
In the first week, students receive an introduction to Jupyter Notebooks, the pro-
gramming environment they are going to use throughout the course. Access to Jupyter
Notebooks is supplied centrally via a Jupyter Hub (https://jupyterhub.readthedocs.io/en/
stable/), therefore students do not have to undergo a lengthy software installation process
and can learn hands-on right from the start (see also supplement S1 for a more detailed
description of the programming environment and technical implementation).
In the second week, students learn to use Python as a calculator. Different data types
such as integers, ﬂoats and strings are introduced implicitly, along with the concept of
variables. Students also have first contact with a function, print(), that allows them to inspect variables.
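A minimal first session in this spirit might look as follows (the variable names and values here are invented for illustration):

```python
# Python as a calculator: data types are introduced implicitly via variables.
kilometers = 42.195          # a float
runners = 250                # an integer
city = "Goettingen"          # a string

total_distance = kilometers * runners  # arithmetic on variables
print(city, total_distance)  # print() lets students inspect variables
```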
Subsequently, students are introduced to lists and loops. Loops are directly applied to
inspect the content of a list of elements. The pandas DataFrame is introduced as
an extension of a list and basic container to store numerical data. This way, students are
already introduced to the concept of programming libraries (such as pandas) that provide
additional functionality. Data access via index and column name is practiced and basic
statistics such as the sum and mean of elements are calculated.
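The step from lists and loops to a DataFrame can be sketched as follows (the column name and values are invented examples, not course material):

```python
# From a plain list, via a loop, to a pandas DataFrame.
import pandas as pd

temperatures = [21.5, 19.0, 23.2]   # a plain Python list
for t in temperatures:              # a loop to inspect each element
    print(t)

# A DataFrame as an extension of a list: labelled rows and columns.
df = pd.DataFrame({"temperature": temperatures})
print(df["temperature"].sum())      # basic statistics on a column
print(df["temperature"].mean())
print(df.loc[0, "temperature"])     # access via index and column name
```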
In the fourth week, students learn simple logic operations such as testing whether a
speciﬁc element is contained in a list. They are then introduced to conditionals (if,
else and elif). Using conditionals, they learn to create filters to access only a selected portion of the data at a time. Students also create their first data visualizations.
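A sketch of such a conditional filter might look like this (the list of tweet lengths and the thresholds are invented for illustration):

```python
# Conditionals (if/elif) used to build a simple filter over a list.
tweet_lengths = [120, 45, 280, 7, 140]

long_tweets = []
for length in tweet_lengths:
    if length >= 140:                     # keep only long tweets
        long_tweets.append(length)
    elif length < 10:
        print("very short tweet:", length)

print(280 in long_tweets)                 # membership test on the result
```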
In the last week of phase one, students are formally introduced to functions in Python.
Additionally, they learn how to build their own simple functions to automate repetitive
steps and how different types of function arguments work in the context of functions from
programming libraries. Additionally, students are introduced to non-numeric data types,
namely images and text: they learn how to load, display and manipulate images and how
to load and clean text.
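A small self-written function of the kind practiced here could look as follows (the function name and its cleaning rules are invented for this example, not the course's own exercises):

```python
# A user-defined function automating a repetitive text-cleaning step,
# illustrating positional and keyword arguments.
def clean_text(text, to_lower=True):
    """Strip surrounding whitespace and, optionally, lowercase a string."""
    text = text.strip()
    if to_lower:
        text = text.lower()
    return text

print(clean_text("  Hello World \n"))            # positional argument only
print(clean_text("KEEP CASE", to_lower=False))   # keyword argument
```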
Exercises for the tutorials re-iterate the concepts taught in the lecture and ask students
to put them into practice. All exercises are provided as Jupyter Notebooks in both English and German.
4 Phase II: core methods for data analysis
The second phase of the lecture takes four weeks. The goal is to teach the students core
data science skills such as data acquisition, data cleaning, data exploration, data visual-
ization and data analysis as well as introducing core concepts like probabilistic thinking
by means of a data-driven learning process. In the tutorials, students apply the skills
conveyed in the lecture by working on a discipline-speciﬁc case study built around a con-
temporary data set. To this end, students are subdivided into smaller groups with similar
domain backgrounds and work on different case studies – ideally touching upon their ﬁeld
of study. During the first run of our course we supplied three different case studies: an economics-related case study based on GDP and strike data, an archaeology-based case study based on data of pottery fragments found in the Mediterranean, and a general case study based on Twitter data. All case studies follow a similar design and
teach the same skills, except for a few domain speciﬁc applications such as visualization
of data on a map (in the archaeology-based case study). In the following, we will illustrate
the design of the case studies using the example of the Twitter case study which was used
as a general-purpose case study.
In the lecture, the students are taught the underlying rationale and theoretical aspects of descriptive statistics, focusing in particular on measures of location such as the arithmetic mean, and on frequency tables. Using different historical examples, the lecture also highlights the scope for skewing statistics one way or another by using different subsets or different presentations of results, ultimately pointing to the relevance of reflecting on the origins of the data that is to be analysed.
In the general purpose tutorial, the sources of the data are discussed. The Twitter
data is compiled from three sources, featuring tweets of Russian trolls, Donald J. Trump and regular Twitter users. The data sets contain the tweet text, user
account, tweet language and a timestamp. The ﬁnal goal of the case study is to ﬁnd out if
and how tweets from these three users or user groups are quantitatively different. Students
are encouraged to inform themselves about the data sources, the authors and the context
of the data collection. After the data is acquired, students explore the data sets by looking
at the column names and data types and exemplary tweets. They then proceed to clean the
data by identifying broken or unwanted entries (for example tweets in the Trump data set
that were not tweeted by the account of Trump) in the data set and removing these entries
using ﬁlters. They also learn how to save a data set to disk, after they have ﬁnished their
work on it.
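This cleaning step can be sketched as follows; note that the column names ("account", "text") and the example rows are assumptions for illustration, not the actual schema of the course's data sets:

```python
# Removing unwanted entries with a boolean filter and saving the cleaned data.
import pandas as pd

tweets = pd.DataFrame({
    "account": ["realDonaldTrump", "PressSecBot", "realDonaldTrump"],
    "text": ["tweet one", "automated repost", "tweet two"],
})

# Keep only rows actually tweeted by the account of interest.
trump_only = tweets[tweets["account"] == "realDonaldTrump"]

trump_only.to_csv("trump_clean.csv", index=False)  # save the cleaned set to disk
print(len(trump_only))
```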
During the second case study tutorial and the lecture, students learn how to visualize
data. They are introduced to histograms to visualize statistics such as the tweet length
and the number of words. To calculate the tweet length or number of words, they have
to apply text processing techniques that were introduced in the lecture before. Students
arrive at their first interesting finding: the distribution of tweet lengths is very different between regular users and Russian trolls, and pushed close to the maximum tweet length for Donald Trump. Students are also introduced
to line plots which are useful to visualize time series, using the time stamp information
contained in the data sets. While students try out different visualizations, they are intro-
duced to styling and annotating plots using axis labels, titles and colours. Additionally
students are asked to explore how information can be mis-represented by visualizations,
for example by curtailing the axis ranges or choosing different bin-sizes for histograms.
They are encouraged to look out for these kinds of mis-representations in public displays
of quantitative information, such as newspapers.
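A labelled, annotated histogram of this kind might be built as follows (the tweet lengths here are randomly generated, not the course's Twitter corpus):

```python
# Histogram of tweet lengths with axis labels, a title and a chosen bin count.
import numpy as np
import matplotlib
matplotlib.use("Agg")            # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
tweet_lengths = rng.integers(10, 280, size=500)  # synthetic data

fig, ax = plt.subplots()
ax.hist(tweet_lengths, bins=27)  # changing the bin count visibly changes the picture
ax.set_xlabel("tweet length (characters)")
ax.set_ylabel("number of tweets")
ax.set_title("Distribution of tweet lengths")
fig.savefig("tweet_lengths.png")
```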
In week 8, students are given deeper insights into descriptive statistics. They learn how
to calculate the mean, median and standard deviation of numerical data and what these
measures mean. They combine this knowledge with a more domain speciﬁc exploration
of the data set, such as a search for the number of hashtags or links used in every tweet,
or the differences between tweet languages. They are also introduced to the idea of using
these descriptive statistics to detect possible inconsistencies or outliers, such as a tweet
with a very large number of characters, in the data set.
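A sketch of this use of descriptive statistics for outlier detection follows; the numbers are invented, with 9999 playing the role of a corrupted entry, and the mean-plus-two-standard-deviations threshold is just one possible convention:

```python
# Mean, median and standard deviation, then a simple outlier check.
import pandas as pd

lengths = pd.Series([120, 95, 140, 133, 128, 101, 9999])

mean, median, std = lengths.mean(), lengths.median(), lengths.std()
print(mean, median, std)

# Entries far above mean + 2*std are flagged as possible inconsistencies.
outliers = lengths[lengths > mean + 2 * std]
print(outliers.tolist())
```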
During the last week of the case study, students look closer into domain speciﬁc analysis
of the data. Following an infamous YouTube video in which Donald Trump claims
he "has the best words", they try to ﬁnd out how the words used by him are different from
the words used by Russian trolls and regular Twitter users. To this end, students install a
third-party library locally that allows them to check whether a word is a proper English
word. They then proceed to ﬁlter the words and count the number of unique words used
in the different data sets. Additionally, students are introduced to scatter plots to visualize
the relation between tweet length and average word length. Finally, students learn how
to perform a linear regression and quantify the relation between two variables (in this
example tweet length and number of words used). Optionally, a short introduction to a
Twitter data scraping tool is given to allow students to compile their own Twitter data sets
and analyze them in the future.
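The word counting and regression steps can be sketched as follows; the tweets are invented, a simple whitespace split stands in for the third-party dictionary check, and scipy's linregress is one possible choice for the regression:

```python
# Unique-word counting, a scatter plot, and a linear regression between
# tweet length and word count.
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
from scipy.stats import linregress

tweets = [
    "we have the best words believe me",
    "great great great deal",
    "covfefe",
    "the economy is doing great believe me",
]

words = [w for t in tweets for w in t.split()]
unique_words = set(words)
print(len(unique_words))                 # vocabulary size

lengths = [len(t) for t in tweets]       # tweet length in characters
word_counts = [len(t.split()) for t in tweets]

fig, ax = plt.subplots()
ax.scatter(lengths, word_counts)         # visualize the relation
fit = linregress(lengths, word_counts)   # quantify it
print(fit.slope, fit.intercept)
```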
5 Phase III: outlook and project phase
Given the importance of data analysis in today’s work inside and outside academia, phase
III moves towards the practical application of those fundamentals by the students in form
of a project. Work on the project is intended to yield an assessment for the participating
students. Additionally, project work is intended to aid the students in
their own contextualisation of the scope, limits and meaning of their grasp of the course’s
content as well as giving a last, extensive example for application of the presented ideas
that can help students to grasp them despite their often abstract nature.
Although the projects are the centrepiece of the third phase, we additionally provide two optional lectures on regulatory aspects and further advanced methodologies of data analysis.
Regarding the projects, the students are aided by the tutors at the outset of the project to ensure that they initiate the work in time and are given the necessary aid to clear the first decisive obstacles before they are asked to complete their project during the semester break.
Ideas and data sets for the examination projects are compiled in collaboration with lo-
cal companies and research groups. The main aim is to provide students with a realistic
insight into data driven applications in industry and research that is based on real-life
data sets. All projects require the students to apply skills from four main areas to answer
the project research question: (1) data cleaning and management, (2) visualization of in-
formation contained in the data and (3) calculation of descriptive statistics (4) inference
using linear regression and correlation coefﬁcients. These areas correspond to the previ-
ously given definition of "data acumen": ingesting data and thinking about how data
processing might be affected by data issues (skills (b) and (d)) correspond to the practical
task of data cleaning and management (1). During the calculation of descriptive statistics
(3) and the answering of the research question using inference (4), students have to re-
ﬂect on the analytical methods they employ (skill (e)). To complete the project, students
have to combine code snippets and different programming libraries (skill (a)). For the
presentation of the results, students will have to communicate about their workﬂow, data
properties and issues and the approaches they employed (skill (f)). Choosing appropriate
visualizations of data and results (area (2) and skill (c)) is also a dedicated part of the
work of the project.
Some projects also include data acquisition as a subtask of (1). Many projects include
additional optional questions which do not enter the ﬁnal project assessment but guide stu-
dents if they are willing to put in more work to follow their interests. The number of tasks
involved in each of the four components differs between projects but is distributed in a
way such that students can pass the class if they successfully complete all tasks from
(1) and (2) and at least one task from (3). Additional completion of tasks from (3) and (4) awards better grades, and the project assessment uniquely determines the final grade students receive for the lecture. Work on the projects is done in groups of two to three students, in line with recommendations from the literature. The work
is supposed to take approximately 80 hours per student and is partly completed during the
semester break. The results of the project are presented to the lecturers and collaboration
partners in the form of a short talk by all project group members and all group members
receive the same mark. Materials needed to complete the projects, including data sets
and code in the form of Jupyter Notebooks are also handed in at the end of the project.
Projects undergo summative assessment based on the number of tasks students were able
to complete and the quality of their presentation. In case of doubt about the originality of a student's work, the student's Jupyter Notebooks are consulted.
An exemplary project provided to us by a local organic farm was the analysis of the farm’s
energy consumption versus the energy produced by the farm’s solar plant over the course
of a year. The description of the project and task list as given to the students is included in
supplement (S2). The data was provided by the farmer in the form of several .csv ﬁles for
the energy production of the plant and the energy consumption of the farm. The ﬁles were
found to be disorganized, featuring corrupted and missing entries, as well as different
time resolutions for different parameters. Therefore data management and cleaning was
paramount. Regarding the analysis, the farmer was particularly interested in the economic
viability of upgrading the solar plant by adding a storage facility. Accordingly, students
were asked to characterize the solar plant’s energy surplus during different seasons of the
year and compare it to the farm’s consumption and to evaluate the need for a storage fa-
cility. Students ﬁrst cleaned and aggregated the provided data such that they had a single
data table containing the timeline of all observables with the same time resolution. They
then visualized timelines, calculated the energy surplus and interpreted the results (posi-
tive energy balance in summer and during days vs. negative energy balance in winter and
during nights). Using information on the cost of additional storage capacity in the form
of batteries, they calculated the cost of a sufﬁciently large energy storage module to store
enough energy during days/summers to supply energy during nights/winter. The students
were able to show that the acquisition of additional storage capacity is not economically
viable and therefore answer the main research question of the farmer. As an optional
task the students were asked to prepare their analysis scripts in a user-friendly way for
the farmer. Students were very interested in this question and were able to prototype a
dashboard solution. Building on this student case study, a new research project has been initiated which explores extending the dashboard by integrating an analysis of current weather data by means of deep learning algorithms to advise the farmer on the use of energy-intensive equipment in the upcoming 24 hours.

FIGURE 1: A map as a possible outcome of the student project on malnutrition in Zambia. The map shows the z-scores of the height of children up to the age of six for the different districts in Zambia.
A second exemplary project description is outlined in supplement S3. This project re-
quired students to assess the weather dependent energy consumption of a seed drying
facility. Unfortunately, because of data protection reasons, we cannot supply the original
data for these projects.
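Although the original files are not public, the core wrangling step of such energy projects – aligning production and consumption series to a common time resolution and computing the surplus – can be sketched with synthetic data (all values and the sinusoidal production profile are invented):

```python
# Aligning two energy time series to one resolution and computing the surplus.
import numpy as np
import pandas as pd

# Synthetic production at 15-minute resolution over one day (kWh per interval).
idx_15min = pd.date_range("2019-06-01", periods=96, freq="15min")
production = pd.Series(np.sin(np.linspace(0, np.pi, 96)) * 10, index=idx_15min)

# Synthetic consumption at hourly resolution (constant 5 kWh per hour).
idx_hourly = pd.date_range("2019-06-01", periods=24, freq="60min")
consumption = pd.Series(5.0, index=idx_hourly)

# Resample production to hourly means so both series share one resolution.
production_hourly = production.resample("60min").mean()
surplus = production_hourly - consumption  # positive by day, negative by night
print(surplus.head())
```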
An example for a project from the research ﬁeld is about analysing childhood malnutri-
tion in Zambia. In this project the students for instance used maps for visualisations as in
The lecture during Phase III has a more open format with regard to topics, and
contributions from other lecturers in the form of guest talks are welcome. Possible
topic extensions are data ethics and algorithmic bias, data protection, machine learning
applications and big data.
Both the projects and the additional lectures are thus geared towards using data analysis
in practice, with a particular focus on data cleaning, data visualisation and description
by metrics, as well as on communicating and presenting the results to the collaboration
partner. On the basis of Phase III, the students are guided into an active problem-tackling
and problem-solving environment. Thereby, they are not only instructed with theoretical
knowledge from lectures and custom-made exercises but experience first hand the difficulty
of dealing with real-life data, the challenge of molding the available data into the right
format to address the commercial or research endeavours put before them, and finally the
task of conveying the results of their analyses.
FIGURE 2: Self-assessment of skills by course participants before the start
of the lecture. Skill level was assessed on a five-point Likert scale, ranging
from 0 (no skills) to 4 (expert) for programming skills, maths skills and
problem structuring skills. Note: not all students participating in the
lecture also participated in the self-assessment.
6 Discussion and limitations
The first implementation of our course attracted 30 students; this comparatively small
number is due to the fact that the course was offered as a purely elective module in its
first year and was not yet fully integrated into the curricula of most faculty programmes.
The majority (80%) of the participants were undergraduates, of which two thirds were in
their 4th semester or below. Male and female participants were nearly balanced, with
40% female and 60% male. Participants had very diverse backgrounds, ranging from
Scandinavian studies and linguistics to history of economics. As illustrated in Fig. 2,
most participants had very little previous programming knowledge. Maths and problem
structuring skills were rated as medium to high in the self-assessment administered
before the first lecture.
Using a course design that teaches common skills to the plenum while splitting the
participants into smaller, domain-specific groups worked out very well and allowed
students to use their domain-specific intuition while applying newly learned data
analysis skills. In the course evaluation, students reported very high levels of interest
(6.2 out of 7) and stated that they learned a lot (5.8 out of 7). They rated the workload
as neither too low nor too high (4.5 out of 7) and gave a very good overall evaluation of
the course (6.2 out of 7). In personal conversations, students were particularly positive
about the possibility to get in contact with companies and research groups during their
exam projects. Equally, we received very positive feedback from the participating
companies, who pointed to their need for both practically minded and data-apt students,
mirroring accounts on practically minded statistics courses elsewhere. The digital
teaching toolset and platform we chose, namely Jupyter Notebooks and a centrally
administered JupyterHub instance, worked very well (see supplement S1 for a detailed
description). This led to all our students and tutors having access to a uniform and
well-functioning computing environment without requiring substantial personnel resources.
In addition, these tools are capable of scaling up to hundreds of students with minimal effort.
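For readers who want to replicate the infrastructure, a single-machine JupyterHub can be configured with only a few options. The following is a hypothetical minimal sketch, not our actual deployment (which is described in supplement S1); user names and resource limits are placeholders:

```python
# jupyterhub_config.py: minimal sketch of a single-machine JupyterHub setup.
c = get_config()  # provided by JupyterHub when it loads this file

c.JupyterHub.bind_url = "http://0.0.0.0:8000"   # public entry point
c.Spawner.default_url = "/lab"                  # drop students into JupyterLab
c.Spawner.mem_limit = "1G"                      # per-user memory cap (spawner-dependent)
c.Authenticator.allowed_users = {"tutor01", "student01", "student02"}
```

The memory cap is only enforced by spawners that support it; for a classroom-sized deployment, the default authenticator and spawner on a single server were sufficient for our purposes.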
A limitation of our approach is certainly the rather high number of tutors needed to
supervise the students in the tutorials. Every tutorial had two tutors, of which at least
one had advanced programming and statistics knowledge. For the second tutor position we
preferentially selected tutors with domain knowledge matching the specific tutorial.
Initially we planned for an average of 15 students per tutorial. Given the initial
turnout of 30 students, the number of students per tutorial ranged between 4 and 10,
yielding an uneconomically high tutor-to-student ratio.
Another limitation of the course design is the high amount of work required to design
several parallel, domain-specific case studies. This was possible because we received
a grant to develop teaching materials. We hope that by making our teaching materials
openly accessible, we lower the barrier to teaching a similarly styled course at other
colleges and universities.
There is a broad consensus, ranging from human resources departments in Germany to the
international literature on teaching statistics, regarding not only the eminent
importance of data competencies but also the need to provide students with practical,
hands-on experience in applying basic statistical tool sets [10, 12, 16].
Given the growing need for such competencies among the full breadth of students
graduating from German universities, new teaching formats such as the one documented here
are needed – as is an intensified discussion of the statistical content conveyed, the
datasets provided and the didactic methods employed, in order to improve these
much-needed courses further.
7 Acknowledgements
The authors thank the Stifterverband and the Heinz-Nixdorf foundation for providing the
funding for this work.
8 Conﬂict of interest
The authors declare no conﬂict of interest.
References
[1] Muhammad Ateeq, Hina Habib, Adnan Umer, and Muzammil Ul Rehman, C++ or Python? Which one to begin with: A learner's perspective, 2014 International Conference on Teaching and Learning in Computing and Engineering, IEEE, 2014.
[2] Dilhari Attygalle and Asoka Ramanayake, Statistics in practice: making of professional statisticians in a classroom, Proceedings of the 10th International Conference on Teaching Statistics, 2018.
[3] Ben Baumer, A data science course for undergraduates: Thinking with data, The American Statistician 69 (2015), no. 4, 334–342.
[4] Theodore Chadjipadelis and Ioannis Andreadis, Use of projects for teaching social statistics: case study, Proceedings of the 8th International Conference on Teaching Statistics, 2006.
[5] Bruno de Sousa and Dulce Gomes, Teaching statistics using R at a college or a university level: it can be possible?, Proceedings of the 10th International Conference on Teaching Statistics, 2018.
[6] FiveThirtyEight, Tweets from Russian trolls, https://github.com/fivethirtyeight/russian-troll-tweets/, 2018, Accessed: 2020-03-05.
[7] Gerald Futschek, Algorithmic thinking: the key for understanding computer science, International Conference on Informatics in Secondary Schools – Evolution and Perspectives, Springer, 2006, pp. 159–168.
[8] Iddo Gal, Lynda Ginsburg, and Candace Schau, Monitoring attitudes and beliefs in statistics education, The Assessment Challenge in Statistics Education 12 (1997).
[9] Joan Garfield, How students learn statistics, International Statistical Review/Revue Internationale de Statistique (1995), 25–34.
[10] KS Gibbons and Helen MacGillivray, Education for a workplace statistician, Topics from Australian Conferences on Teaching Statistics, Springer, 2014, pp. 267–293.
[11] Alec Go, Richa Bhayani, and Lei Huang, Tweets from regular Twitter users, http://help.sentiment140.com/for-students/, 2018, Accessed: 2020-03-05.
[12] Katherine Taylor Halvorsen, Formulating statistical questions and implementing statistics projects in an introductory applied statistics course, Proceedings of the 8th International Conference on Teaching Statistics, 2010.
[13] J. Hardin, R. Hoerl, Nicholas J. Horton, D. Nolan, B. Baumer, O. Hall-Holt, P. Murrell, R. Peng, P. Roback, D. Temple Lang, and M. D. Ward, Data science in statistics curricula: Preparing students to "think with data", The American Statistician 69 (2015), no. 4, 343–353.
[14] John D. Hunter, Matplotlib: A 2D graphics environment, Computing in Science & Engineering 9 (2007), no. 3, 90–95.
[15] Ambikesh Jayal, Stasha Lauria, Allan Tucker, and Stephen Swift, Python for teaching introductory programming: A quantitative evaluation, Innovation in Teaching and Learning in Information and Computer Sciences 10 (2011), no. 1, 86–90.
[16] Brian Jersky, Statistical consulting with undergraduates – a community outreach approach, Proceedings of the 6th International Conference on Teaching Statistics, International Association for Statistical Education, International Statistical Institute.
[17] Michael W. Kearny, Tweets from Donald J. Trump, https://github.com/mkearney/trumptweets, 2018, Accessed: 2020-03-05.
[18] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, and Carol Willing, Jupyter notebooks – a publishing format for reproducible computational workflows, Positioning and Power in Academic Publishing: Players, Agents and Agendas (F. Loizides and B. Schmidt, eds.), IOS Press, 2016, pp. 87–90.
[19] Theodora Koulouri, Stanislao Lauria, and Robert D Macredie, Teaching introductory programming: A quantitative evaluation of different approaches, ACM Transactions on Computing Education (TOCE) 14 (2014), no. 4, 1–28.
[20] Jana Lasser and Debsankha Manik, Archaeology case study, https: study-archaeologie, 2020, Accessed: 2020-03-05.
[21] Jana Lasser and Debsankha Manik, Economy case study, https://github.com/Daten-Lesen-Lernen/daten-lesen-lernen-lecture/tree/master/case-study-wirtschaftswissenschaften, 2020, Accessed:
[22] Jana Lasser and Debsankha Manik, General purpose case study, https://github.com/Daten-Lesen-Lernen/daten-lesen-lernen-lecture/tree/master/case-study-allgemein, 2020, Accessed:
[23] Jana Lasser and Debsankha Manik, Lecture "Daten Lesen Lernen", https://github.com/Daten-Lesen-Lernen/daten-lesen-lernen-lecture, 2020, Accessed: 2020-03-05.
[24] Shanhong Liu, Most used languages among software developers globally 2020, 06
[25] Wes McKinney, Data structures for statistical computing in Python, Proceedings of the 9th Python in Science Conference, 2010, pp. 51–56.
[26] MSNBC, Donald J. Trump says he has the "best" words, 2018, Accessed: 2020-03-
[27] National Academies of Sciences, Engineering, and Medicine and others, Data science for undergraduates: Opportunities and options, National Academies Press.
[28] Fernando Perez, Brian E Granger, and John D Hunter, Python: an ecosystem for scientific computing, Computing in Science & Engineering 13 (2010), no. 2, 13–21.
[29] Almut Reiners, Sebastian Hobert, and Matthias Schumann, Lernen mit Smartphones an der Georgia-Augusta – eine Zwischenbilanz, DeLFI Workshops, 2014, pp. 180–
[30] Chantel Ridsdale, James Rothwell, Michael Smit, Hossam Ali-Hassan, Michael Bliemel, Dean Irvine, Daniel Kelley, Stan Matwin, and Bradley Wuetherick, Strategies and best practices for data literacy education: Knowledge synthesis report, Tech. report, 2015.
[31] David Robinson, The Incredible Growth of Python | Stack Overflow, September
[32] Rob Root and Trisha Thorme, Community-based projects in applied statistics, The American Statistician 55 (2001), no. 4, 326–331.
[33] Jon Singer, Ronald W Marx, Joseph Krajcik, and Juanita Clay Chambers, Constructing extended inquiry projects: Curriculum materials for science education reform, Educational Psychologist 35 (2000), no. 3, 165–178.
[34] Charles C Taylor, Using R to Teach Statistics, Proceedings of the 10th International Conference on Teaching Statistics, 2018.
[35] Ian Westbrooke and Maheswaran Rohan, Statistical training in the workplace, Topics from Australian Conferences on Teaching Statistics, Springer, 2014, pp. 311–
A List of subject-specific case studies and projects
A.1 List of subject-speciﬁc case studies
The three subject-speciﬁc tutorial threads offered in the Summer semester 2019 were:
• A thread aimed at students from linguistic fields, using data sets from Twitter. This
thread also served as the default thread for students from fields not addressed by the
other subject-specific threads.
• A thread aimed at economics students and students from (modern) history-related
ﬁelds considering data regarding domestic production and industrial action in sev-
eral countries in the past.
• A thread considering pottery origins, aimed at students from archaeology and (an-
cient) history-related fields.
The students were free to choose which thread to attend, but generally speaking the
allocation of students to threads matched our expectations.
A.2 List of subject-speciﬁc projects
The twelve subject-speciﬁc projects offered in the Summer semester 2019 were:
• A project from an IT-company enquiring about insolvency rates in different indus-
tries to evaluate credit risk
• A second project from the same company requiring a match of customer ﬁles with
public offense registers
• A project from a retail company regarding customised advertising
• A project from an ecological farm regarding a solar power plant
• A project from a company focused on plant-breeding regarding agricultural pro-
duction in different countries
• A second project from the same company considering the drying process in a production facility
• A third project from the same company considering measurements regarding the
breeding of plant-hybrids
• A project from an education start-up to evaluate cooperation potential between the
local universities and regional companies
• A project from an engineering company regarding predictive maintenance
• A project from a regional infrastructure initiative regarding the organisation of bus
• A project from a research-group regarding ancient pottery
• A project from another research-group regarding malnutrition among children