When didactics meet data science: process data analysis in large-scale mathematics assessment in France
Franck Salles*, Reinaldo Dos Santos and Saskia Keskpaik
Abstract
During this digital era, France, like many other countries, is undergoing a transition
from paper-based assessments to digital assessments in education. There is a ris-
ing interest in technology-enhanced items which offer innovative ways to assess
traditional competencies, as well as addressing problem solving skills, specifically
in mathematics. The rich log data captured by these items allows insight into how
students approach the problem and their process strategies. Educational data mining
is an emerging discipline developing methods suited for exploring the unique and
increasingly large-scale data that come from such settings. Data-driven methods can
be helpful when trying to make sense of process data. However, studies have shown
that didactically meaningful findings are most likely generated when data mining
techniques are guided by theoretical principles on subjects’ skills. In this study, theoreti-
cal didactical grounding has been essential for developing and describing interactive
mathematical tasks as well as defining and identifying strategic behaviors from the log
data. Interactive instruments from France’s national large-scale assessment in math-
ematics have been pilot tested in May 2017. Feature engineering and classical machine
learning analysis were then applied to the process data of one specific technology-
enhanced item. Supervised learning was implemented to determine the model's power to predict students' achievement and to estimate the weight of the variables in
the prediction. Unsupervised learning aimed at clustering the samples. The obtained
clusters are interpreted by the mean values of the important features. Both the
analytical model and the clusters enable us to identify among students two concep-
tual approaches that can be interpreted in theoretically meaningful ways. While there are limitations to relying on log data analysis to determine learning profiles, one of them being that this information remains partial when it comes to describing the complete cognitive activity at play, the potential of technology-enriched problem solving situations in large-scale assessments is nevertheless obvious. The type of findings this study produced is actionable from a teacher's perspective in order to address students' specific needs.
Keywords: Large-scale assessment, Mathematics, Machine learning, Data science,
Theoretical framework, Technology, Didactics, Process data
Open Access
© The Author(s) 2020. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and
the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material
in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material
is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the
permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
RESEARCH
Sallesetal. Large-scale Assess Educ (2020) 8:7
https://doi.org/10.1186/s40536‑020‑00085‑y
*Correspondence:
franck.salles@education.gouv.fr
Department of Evaluation
(DEPP), Ministry of Education,
65 rue Dutot, Paris, France
Introduction
During this digital era, France, like many other countries, is undergoing a transition
from paper-based assessments to digital assessments in order to measure student per-
formance in education. New opportunities are emerging (cost reduction, innovative
items, adaptive testing, real-time feedback into learning, etc.) which themselves give rise
to new challenges (usability, security, equipment, digital divide, etc.).
There is a rising interest in France in technology-enhanced items which offer innovative ways to assess traditional competencies, as well as to address 21st century skills and to link assessment feedback to learning. These technology-enhanced items can be extremely valuable when measuring problem solving skills. Compared to traditional assessments, they not only provide scoring information (whether the response is correct or not) but also allow rich data to be collected that make it possible to determine how students arrived at their answers (Greiff et al. 2015).
These complex technology-enhanced items can be used to reflect how students interact in a given situation to analyze and solve a problem. The exercises and questions included in the interactive items engage the students on multiple levels, and capture not just their responses, but their thought process as well. The rich log data captured by these items, such as the time at which students start and stop their work, mouse movements, the use of different onscreen tools, idle time, and a screenshot of the last actions, allow insights to be gained into how students approach the problem, and to identify areas that might require additional focus.
Despite the potential gain in knowledge about student performance, the studies on
log data from educational assessments remain relatively scarce (Greiff et al. 2015). One
of the reasons for the scarcity of studies is the technicality that these analyses entail.
Although attempts are made in order to standardize the logs and develop specific data
analysis tools (Hao etal. 2016), logs are often messy, unstructured, and full of “noise”–all
of which leads traditional data analysis tools and techniques to work less well with these
data.
Process data, recorded as sequences of actions, can be likened to textual data and ana-
lyzed by making use of methodologies of natural language processing and text mining.
Hao etal. (2015) transform process data into a string of characters, encoding each action
name contained in the logs as a single character. e authors use the Levenshtein dis-
tance, defined as the minimum number of single-character edits needed to convert one
character string to another, in order to compare how far the students’ activities in game/
scenario-based tasks are from the best performance. In the same vein, He and von Davier
(2015), considering the similar structure between action sequences in process data and
word sequences in natural language, make use of N-grams–a contiguous sequence of n
items from a given sample of text–in order to discern action sequence patterns that are
associated with success or failure in complex problem solving tasks.
Educational data mining is an emerging discipline that is concerned with develop-
ing methods specifically suited for exploring the unique and increasingly large-scale
data that come from educational settings.1 Qiao and Jiao (2018) showcase various data
1 http://educationaldatamining.org/.
mining techniques in analyzing process data and demonstrate how both supervised and unsupervised learning methods can help reveal specific problem solving strategies and distinguish between different performance profiles.
Data-driven methods can be very helpful when trying to make sense of huge amounts
of data and discover hidden patterns and relationships in these data. However, studies
have shown that didactically meaningful findings are most likely yielded when data mining techniques are guided by theoretical principles that describe subjects' skills (Gobert et al. 2013).
In the context of complex problem solving, studies have demonstrated how certain
student behaviors yield better performances than others (Greiff et al. 2015, 2016). Theoretical grounding has been essential for defining and identifying these strategic behav-
iors from the log data and verifying their implementation among different samples of
students.
CEDRE (Subject-related sample-based assessment cycle) is a sample-based large-scale
assessment aiming to measure students' abilities in mathematics at the end of grade 9 every 5 years in France. Constructed and designed by the Department for Evaluation,
Prospective and Performance (DEPP) at the French ministry of education, its framework
is based on the national French curriculum in mathematics. First administered in 2008
and 2014, CEDRE was administered again in May 2019. This new cycle is computer-based for the first time. As trends must be secured so that the comparability with the
previous cycles is guaranteed, a large part of the test instruments are similar to formerly
paper-based items. However, the DEPP developed technology-enriched items, very dif-
ferent from more classical item formats, in order to profit fully from the potentialities
of assessing mathematics with digital tools (Stacey and Wiliam 2013). Offering students
the possibility to use digital tools during the assessment may outsource basic proce-
dural work to such tools (Drijvers 2019). Therefore, opportunities are given to students to engage more fully with higher order skills such as problem solving, devising a strategy and carrying out mathematical thinking, which CEDRE aims to capture.
Problem solving strategies in complex mathematical tasks are targeted by the CEDRE
framework since the first cycle (MEN, DEPP 2017). Nevertheless, in the paper-based assessment cycles, the way to capture students' strategies and processes used to be based on students' explanations of their answers and written arguments or sketches in response boxes. Since communicating a mathematical answer is a mathematical competency in itself (OECD 2013), such a response does not directly indicate the actual strategy used by
students to solve the task but only the way they were able to express it. Logging and
analyzing process data can potentially lead to drawing a complete picture of how digital
tools and interactions have been activated during the problem solving process. A theo-
retical framework was designed for the purpose of describing in detail the mathematical
tasks in such items, with hypotheses about potential processes or strategic thinking to
which these items give rise, and eventually identifying variables of interest. In support
of this process, we convened several structuring concepts from research in mathematics
didactics and more generally from educational research (notion of “conceptions”, task
analysis, content analysis, semiotic analysis, assessment, etc.). This framework is based on findings mainly from French didactic research regarding mathematics teaching activities and technology-enriched environments. Its main and seminal references are the Theory
of Didactical Situations (Brousseau 2006), the Activity Theory applied to mathematics education (Robert 1998) and the Instrumental Approach (Rabardel 2002). According to
this framework, CEDRE’s item designers developed new interactive items and identified
which data need to be logged for future process analysis.
Research questions
Based on this preliminary work, this study aims to answer two main research questions:
To what extent can process data analysis provide information about students’ math-
ematics performance in large-scale assessments and explain achievement?
To what extent can process data be used to categorize students’ mathematical strate-
gic behaviors and procedures, allowing didactical interpretation and profiling?
Theoretical framework
Determining what type of mathematical knowledge and skills are involved in items is a
preliminary necessity for assessment task analysis. Beyond listing them, we have to identify and describe the way they must or could be operated and what adaptations of these operations are necessary to resolve the tasks, given the underlying mathematical conceptions at stake. This level of analysis can be a first step towards determining conceptions involved,
choices students have to make, the number of steps required, types of errors, etc. Based
on seminal work of Robert and Douady, Roditi and Salles (2015) first applied a didactical
framework to PISA 2012 mathematical assessment task analysis: the so called mathe-
matical knowledge operation levels (MKOL). On one hand, MKOLs make it possible to distinguish between the object and tool characters of mathematical knowledge (Douady 1991).
Some assessment questions focus on mathematical content where students must dem-
onstrate an understanding of the concept without having to implement it, which some
authors refer to as a conceptual understanding (Kilpatrick et al. 2001, p. 115–135); these
questions address the object character of this content. Other questions assess the tool
nature of knowledge; the student must then put mathematical knowledge into opera-
tion in order to solve a problem in the context indicated. On the other hand, MKOLs
take into account the variety of ways mathematical knowledge can be implemented to
solve an item. e model identifies levels of knowledge implementation ranging from
direct operation and operation adaptations to introduction of intermediate steps (Rob-
ert 1998). However, this first framework was initially devised to tackle paper-based
assessment instruments. In a technology-enhanced environment, specifically when digi-
tal tools are available to students, “technology can impact the way students operate and
reason when working on tasks, for example while using CAS to solve equations or while
exploring a dynamic construction with a geometry package to develop a conjecture that
may be proved" (Drijvers et al. 2016, p. 12). Therefore, the initial didactical framework was extended to additionally take this impact into account, distinguishing tools' utilizations and instrumentations as well as student/machine interactions. In summary,
the didactical framework implemented in this study was structured around three main
questions: How does mathematical knowledge need to be adapted in order to resolve an
interactive mathematical task? Which tool utilizations are necessary to solve such prob-
lems? How can student/machine interactions influence the mathematical activity?
The way students may use a digital tool in the service of solving a primary task can be described in reference to the instrumental approach (Rabardel 2002). This approach distinguishes a tool from an instrument. The tool gets the status of an instrument when it is used as a means to solve the problem. In this case, its use is decomposed into "cognitive schemes containing conceptual understanding and techniques for using a tool for a specific type of task" (Doorman et al. 2012, p. 1248). The analysis will then focus on iden-
tifying and describing utilization schemes potentially involved in the task. Rabardel dis-
tinguishes two types of utilization schemes: “usage schemes, related to “secondary tasks”
(…) and instrument-mediated action schemes, (…) related to “primary tasks” (…) [which]
incorporate usage schemes as constituents." (Rabardel 2002, p. 83). In order to illustrate
these two levels of schemes, Rabardel gives the example of an experienced driver over-
taking a vehicle: “An instrument-mediated action scheme underlies the invariable aspects
of such an overtaking situation. This scheme incorporates as components usage schemes
subordinate to its general organization, such as those necessary to manage a change of
gears or a change of trajectory” (Rabardel 2002, p. 83).
Interactions and feedback are important features of problem-solving situations, especially in an assessment context. Being immersed in a digital environment leads to specific kinds of human/machine interactions. Most of them are intentional and planned by developers when designing the assessment environment, others are not, but all somehow carry information students can grasp to proceed in the problem-solving process. Laborde (2018) distinguishes two types of feedback with respect to technologically enriched situations: one issued from task-specific digital tools, the other being the "teacher's voice". This last type of feedback is meant to help students catch, adapt and retain
information given in the environment. In a summative assessment context, this type
of feedback should be very limited as it could interfere with the objective of measuring
students’ ability. Nonetheless, it can be considered paramount in a formative approach.
Even if a summative assessment platform such as the one used for the DEPP’s assess-
ments can be considered as a non-didactical environment where the teacher’s voice is
supposed to be absent, we can still consider this environment as potentially providing
student/machine interactions. In his theory of didactical situations, Brousseau (2006)
separates three levels of feedback depending on the nature of the environment’s math-
ematical reaction: the feedback can either reflect on actions, formulations or validations.
This last model has been used to help describe item/student interactions in DEPP's
interactive items.
Task analysis
For the purpose of this study, a specific item (Fig. 1) has been chosen among CEDRE's
new interactive items piloted in May 2017. In this task, two nonlinear functions model
two different tree growths. Both are given in linked numerical and graphical representa-
tions (Stacey and Wiliam 2013). Students act in the numerical representation (a table
of values), entering the age of the trees in months. A calculation tool returns the cor-
responding tree heights and a graphing tool spots the points in the graph. Both actions
are realized when one presses the “Calculate and graph” button. By default, values for
300, 500 and 600months are given. Additional tools can be used: a pencil, an eraser (not
allowing to erase given information), a ruler, a compass, a non-scientific calculator. e
question can be translated as: “At what age (other than 0months) do both trees have the
same height?”
From a grade 9 students’ point of view, this task requires conceptual understanding
of functions and their representations (table of values, graph). The conception of functions has two different characters, as Sfard (1991) and Doorman et al. (2012) showed: "In lower secondary grades, functions mainly have an operational character and are seen as an input–output 'machine' that process input values into output values. In higher grades, functions have a more structural character with various properties (Sfard 1991). They become mathematical objects that are represented in different ways, are ordered into different types according to their properties, and are submitted to higher-order processes such as differentiation and integration. We argue that the transition from functions as calculation operations to functions as objects is fundamental for conceptual understanding in this domain." (Doorman et al. 2012, p. 1243). Students
can adapt the problem by adding intermediate information using the calculation and
graphing tool. On one hand, they can then opt for a "trial and error" method. This would consist in entering a number of months, comparing the results returned either in the numerical or graphical representation, and deciding to enter another number of months until the solution (390) is found. Alternating tries around the target value or aiming at it from below or above could improve the trial and error process. These students show essentially a good understanding of the concept of functions in their operational character. This method could imply a relatively large number of tries. On the other hand, students understanding that both functions are increasing, notably from studying them in the graphical representation, can quickly aim at the target number of months. The pencil can for example be used to draw lines and introduce a continuous representation of the functions. The inversion of tree heights between 300 and 500 months can also be noticed. These students understand functions as objects with properties, in their structural character.
Fig. 1 Interactive item "Tree growth" (partly translated into English for the purpose of this publication. Hauteur = Height; Âge (en mois) = Age (in months); Chêne = Oak; Sapin de Douglas = Douglas fir; Calculer et tracer = Calculate and graph)
The following digital tools are at students' disposal within the item:
• A keyboard (with or without number pad) and mouse.
• A "calculation and graph" tool, specific to the item.
• A pencil (common to any item on the platform). Usage: click the starting point, move the mouse to trace, click to stop writing.
• An eraser, only allowing erasing of pencil traces or measurement tool traces. Usage: clicking erases all pencil traces together.
• A compass.
• A calculator.
The "calculation and graph" tool does not require complex usage schemes. Two usage
schemes are identified for this tool: enter a number of months within the domain [0; 600]
via an input box and a popup number pad, and understand that the tool returns unique
heights (outputs) for both trees (numerical and graphical representations) when the but-
ton “calculate and graph” is clicked. No tutorial or tool training is proposed to students.
Usage schemes are close to those of relatively usual tools such as a currency converter. Nevertheless, we can imagine some students feeling the need to appropriate the tool by first using it for testing purposes, for example entering extreme values or values not directly connected to the primary task. Using this tool is compulsory to succeed on the item. In refer-
ence to the instrumental approach, we describe next how the tool can be instrumented
in this situation, assuming that students will build an instrument from the "calcula-
tion and graph” tool, within the item environment, in order to solve the task (Trouche
2003). Instrumented action schemes are organized around the core elements that follow:
1. Knowing the difference between input and output in a contextual use of a function as
a model.
2. Understanding that the tool returns unique heights (outputs) for both trees (numeri-
cal and graphical representations) when choosing a number of months (inputs).
3. Entering a number of months within the domain [0; 600].
4. Comparing outputs either in the numerical or graphical representation. Validating by
linking to the real life situation.
5. Deciding on the next number of months to enter considering the comparison with the
previous one.
6. Iterating the process.
This type of instrumentation is linked to an operational approach to the concept of functions.
As mentioned earlier, the pencil can also be used in order to get a continuous model
on the domain or part of it. Students can link points together using it. This is an
intermediate step towards the primary task. The following core elements contribute to instrumented action schemes using the pencil as well as the "calculate and graph"
tool and are characteristic of a structural understanding of the concept of a function.
1. Understanding that the growth phenomenon is a continuous one. Hence, the func-
tions modelling it are continuous.
2. Assuming both functions will be strictly increasing.
3. Using the pencil to link consecutive points together.
4. Deciding on the next number of months to enter considering line intersection.
5. Going back to numerical values to aim at accuracy.
Of course, one might operate composite instrumented action schemes, mixing the use of both principal tools. For example, one can use the pencil to sketch continuous graphs (potentially after trying 100 and/or 200 months to get a more complete view of the graphs' shapes), or rely on points' colours and choose 400 months in the first tries and then use a trial and error strategy to aim precisely at the target with the "calculate and graph" tool.
Interactions at stake within the item principally address the two different repre-
sentations of the functions: numerical and graphical. When students use the “calculation
and graph” tool, the feedback is given in both representations as new numbers in the
table of values and two points on the graph. Students are consequently relieved from
having to convert one representation into another. e colours of numbers and points
related to the same function match so students can more easily interpret the feedback.
is important interaction participates in students’ reflections towards formulating the
problem in both representations and then comparing results (for example using colors’
inversions in the graphical display) to either conclude or decide on other attempts to
make. Besides it can also participate in invalidating attempts that are outside the domain
or very far from target.
Data andmethods
CEDRE’s interactive items, the “Tree growth” item among them, have been used in a
pilot test in May 2017 with a sample of 3000 grade 9 students per item. Students’ digital
traces have been recorded in log data files. As log data files contain a very large amount
of data and in order to aim at interpretable results as well as to avoid noisy signals, vari-
ables of interest were defined, as a result of the a priori didactical analysis. They could
potentially lead to building a model able to explain success or failure in the task consid-
ering either the operational or structural character of functions used by students. Prob-
lem solving strategies based on a procedural understanding of functions imply using the
“calculation and graph” tool more often, potentially through a dichotomous strategy,
hence testing many numbers of months and spending more time on the item. Features
such as the month list length, the number of alternations within this list, or the time spent on the item could then help bring such strategies to light in the data. Symmetrically, a strategy based on a structural conception could lead to optimizing both the
number of tries and the time spent, aiming at the target interval from the very first tests,
perhaps using the assistance of the pencil tool. Features such as the standard deviation
of the values in the month list and the distance between the first inputs and the target are of specific interest. Accordingly, the main features used in the analytical models, and their nature, are the following (a sketch of this feature engineering is given after the list):
• Month list length: integer variable, present in the log data
• First input between 200 and 600: boolean, constructed through feature engineering
• Number of alternations within the month list: integer variable, constructed through feature engineering
• Time spent on the item, in seconds: continuous variable, present in the log data
• Distance between the first input and the target value: integer variable, constructed through feature engineering
• Distance between the second input and the target: integer variable, constructed through feature engineering
• Distance between the last input and the target: integer variable, constructed through feature engineering
• Standard deviation of the values in the month list: continuous variable, constructed through feature engineering
• Target value is in the month list: boolean, present in the log data
• Pencil use: boolean, present in the log data
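For illustration, a minimal sketch of how such features could be derived from process data is given below. The per-student log structure and the field names (months_entered, time_spent_seconds, pencil_used) are hypothetical, chosen only to make the example self-contained; they are not the actual format of the DEPP platform's logs.

```python
# Hedged sketch of the feature engineering step, assuming a simplified log
# record per student: the ordered list of months entered in the "calculation
# and graph" tool, the total time spent, and a pencil-use flag.
import statistics

TARGET = 390  # number of months at which both trees reach the same height

def build_features(log):
    """Turn one (hypothetical) student log record into the analysis features."""
    months = log["months_entered"]                      # e.g. [500, 350, 400, 390]
    distances = [abs(m - TARGET) for m in months]
    # An "alternation" is a change of side around the target between two tries.
    sides = [m > TARGET for m in months if m != TARGET]
    alternations = sum(1 for a, b in zip(sides, sides[1:]) if a != b)
    return {
        "month_list_length": len(months),
        "first_input_200_600": (200 <= months[0] <= 600) if months else False,
        "n_alternations": alternations,
        "time_spent": log["time_spent_seconds"],
        "dist_first_input": distances[0] if distances else None,
        "dist_second_input": distances[1] if len(distances) > 1 else None,
        "dist_last_input": distances[-1] if distances else None,
        "months_std": statistics.pstdev(months) if len(months) > 1 else 0.0,
        "target_tested": TARGET in months,
        "pencil_used": log["pencil_used"],
    }

# Example on one hypothetical record:
print(build_features({"months_entered": [500, 350, 400, 390],
                      "time_spent_seconds": 212, "pencil_used": False}))
```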
Classical machine learning analysis was then applied to the data. Supervised learning was implemented to determine the model's power to predict students' achievement and to estimate the weight of the variables in the prediction. Unsupervised learning
aimed at clustering the sample. Mean values of the most important variables were then
calculated for each of the clusters resulting from the unsupervised learning.
The type of statistical analysis used, including a power calculation when appropriate,
is presented in the following part of the paper. All of the analysis has been done using
Python 3.0 and specifically the scikit-learn library.
Supervised learning
The choice of the algorithms to use in order to build the model is partly determined by the task we are trying to achieve. In our case, we are looking for a supervised classification, as the objective is to predict a label (the "correct" boolean) from the other features. Another criterion is the explainability of our modeling. We are not trying in this study to build a predictor in itself; rather, we aim at determining which features are the most predictive of the score of a student. Therefore, we excluded the use of neural networks. Finally, the efficiency of the model remains the last criterion. To be able to compare the different algorithms, we chose the area under the ROC curve as our measure of fit, because it is an indicator that can be calculated for any kind of algorithm.
Random forests
Random forests (Breiman 2001) are an ensemble learning method that works by building
a multitude of decision trees during learning, before returning the class mode (classifica-
tion) or the average forecast (regression) of individual trees. The purpose of a decision tree is to create a model that predicts the value of a target variable from several input variables (Hopcroft et al. 1983). A decision tree or classification tree is a tree in which
each internal node (non-leaf) is marked with an input characteristic. Arcs from a node
labelled with an input characteristic are labelled with each of the possible values of the
target or output characteristic, or the arc leads to a subordinate decision node on a dif-
ferent input characteristic. Each leaf of the tree is labelled with a class or probability dis-
tribution on the classes, which means the data set has been classified by the tree either in
a specific class or in a particular probability distribution.
Decision trees are nevertheless known for their many disadvantages. The first is the tendency to overfit the training set. The second is their non-deterministic aspect: the order in which the features are used generates a completely different tree structure.
The training algorithm for random forests applies the general technique of bootstrap aggregating, or bagging (Breiman 1996), to tree learners. Given a training set X = x1, …, xn with responses Y = y1, …, yn, bagging repeatedly (B times) selects a
random sample with replacement of the training set and fits trees to these samples.
After training, predictions for unseen samples x’ can be made by averaging the pre-
dictions from all the individual regression trees on x’, or by taking the majority vote in
the case of classification trees.
The number of samples/trees, B, is a free parameter. An optimal number of trees B can be found using cross-validation. However, the question of the order of the features remains unsolved. That is why random forests differ slightly from general bagging: they use a modified tree learning algorithm that selects, at each candidate split in the learning process, a random subset of the features. This process is sometimes called "feature bagging" or the "random subspace method" (Barandiaran 1998). The reason for doing this is the correlation of the trees in an ordinary bootstrap sample: if one or a few features are very strong predictors for the response variable (target output), these features will be selected in many of the B trees, causing them to become correlated.
The number of trees for the random forest (200) has been determined through cross-validation. The other hyperparameters (number of leaves, maximal depth) have been tuned automatically.
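As a sketch of this tuning step, the snippet below selects the number of trees by cross-validation with scikit-learn, using ROC AUC as the comparison criterion; the feature matrix and success label are synthetic placeholders standing in for the engineered features described above, not the CEDRE data.

```python
# Hedged sketch: choosing the number of trees by cross-validation, with ROC AUC
# as the scoring criterion. X and y are synthetic placeholders, not CEDRE data.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=3000, n_features=10, random_state=0)

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100, 200, 400]},
    scoring="roc_auc",
    cv=5,
)
search.fit(X, y)
forest = search.best_estimator_   # in the study, 200 trees were retained
print(search.best_params_, round(search.best_score_, 3))
```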
Area undertheROC curve
The receiver operating characteristic (ROC), also known as the performance charac-
teristic or sensitivity/specificity curve, is a measure of the performance of a binary
classifier.
Graphically, the ROC measurement is often represented as a curve that gives the
rate of true positives (fraction of positives that are actually detected) versus the rate
of false positives (fraction of negatives that are incorrectly detected).
ROC curves are often used in statistics to show the progress of a binary classifier as the discrimination threshold varies. Sensitivity is given by the fraction of positives classified as positive, and antispecificity (1 minus specificity) by the fraction of negatives classified as positive. The antispecificity is plotted on the x-axis and the sensitivity on the y-axis to form the ROC diagram. Each threshold value S provides a point on the ROC curve, which ranges from (0, 0) to (1, 1).
• At (0, 0) the classifier always declares 'negative': there are no false positives, but also no true positives. The proportions of true and false negatives depend on the underlying population.
• At (1, 1) the classifier always declares 'positive': there are no true negatives, but also no false negatives. The proportions of true and false positives depend on the underlying population.
• A random classifier will draw a line from (0, 0) to (1, 1).
• At (0, 1) the classifier has no false positives or false negatives, and is therefore perfectly accurate, never getting it wrong.
• At (1, 0) the classifier has no true negatives or true positives, and is therefore perfectly inaccurate, always being wrong. Simply inverting its predictions makes it a perfectly accurate classifier.
When using normalized units, the area under the curve is equal to the probability that
a classifier will rank a randomly chosen positive instance higher than a randomly chosen
negative one (assuming ‘positive’ ranks higher than ‘negative’).
The area under the ROC curve (ROC AUC) is a common method used for model comparison (Bradley 1997).
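A minimal sketch of this criterion with scikit-learn follows; the labels and scores are placeholders.

```python
# Sketch of the ROC AUC criterion on placeholder data: the AUC equals the
# probability that a randomly chosen positive is ranked above a randomly
# chosen negative by the classifier's score.
from sklearn.metrics import roc_auc_score, roc_curve

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                        # placeholder labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.55, 0.9]    # placeholder scores

fpr, tpr, thresholds = roc_curve(y_true, y_score)        # antispecificity, sensitivity
print("ROC AUC:", roc_auc_score(y_true, y_score))
```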
Unsupervised learning
Unsupervised learning is a machine learning technique used to detect patterns in a data
set, with no prior information about this data. It is mainly used as a clustering technique,
to group or segment the data, in order to identify commonalities. Clustering algorithms can be classified into several families. The main ones are the density-based and centroid-based algorithms. In this study, we chose to use the most widely used algorithm of each of these families (Wierzchoń and Kłopotek 2018).
DBSCAN
DBSCAN (density-based spatial clustering of applications with noise) (Ester et al. 1996) is a density-based data partitioning algorithm to the extent that it relies on the estimated density of clusters to perform partitioning. The DBSCAN algorithm uses two parameters: the distance ε and the minimum number of points MinPts that must be within a radius ε for these points to be considered as a cluster. The input parameters are therefore an estimate of the point density of the clusters. The basic idea of the algorithm is then, for a given point, to retrieve its ε-neighbourhood and to check that it contains MinPts points or more. This point is then considered as part of a cluster. We then go through the ε-neighbourhood step by step to find all the points in the cluster. The DBSCAN algorithm can be abstracted into two steps. First, find the points in the ε-neighbourhood of every point, and identify the core points with more than MinPts neighbours. Second, find the connected components of core points on the neighbour graph, ignoring all non-core points. Assign each non-core point to a nearby cluster if the cluster is an ε-neighbour, otherwise assign it to noise.
Advantages of the DBSCAN algorithm are numerous: DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means; DBSCAN can find arbitrarily shaped clusters; due to the MinPts parameter, the single-link effect (different clusters being connected by a thin line of points) is reduced; DBSCAN has a notion of noise, and is robust to outliers; the parameters MinPts and ε can be set by a domain expert, if the data is well understood. Drawbacks, however, can also be listed: DBSCAN is not entirely deterministic, as border points that are reachable from more than one cluster can be part of either cluster, depending on the order in which the data are processed; it does not cope well with data sets with large differences in densities, because the MinPts-ε combination cannot then be chosen appropriately for all clusters; if the data and scale are not well understood, choosing a meaningful distance threshold ε can be difficult.
The determination of the two parameters (ε = 0.55 and MinPts = 40) was done in order to minimize the number of outliers.
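The snippet below sketches this clustering step with scikit-learn, using the parameter values reported above (ε = 0.55, MinPts = 40); the standardized feature matrix is a synthetic placeholder.

```python
# Hedged sketch of the DBSCAN step with the parameters given above
# (eps = 0.55, min_samples = 40); the data are a synthetic placeholder.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=3000, centers=4, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)     # standardized stand-in for the feature table

labels = DBSCAN(eps=0.55, min_samples=40).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "outliers:", int(np.sum(labels == -1)))
```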
K‑means
k-means partitioning is a data partitioning method and a combinatorial optimization problem (MacQueen 1967). Given points and an integer k, the problem is to
divide the points into k groups, often called clusters, in order to minimize a certain
function. We consider the distance from a point to the average of the points of its
cluster (called the centroid); the function to minimize is the sum of the squares of
these distances.
The initialization is a determining factor in the quality of the results (local minimum). There are two common initialization methods: Forgy's algorithm (Lloyd 1982) and random partitioning. Forgy's algorithm assigns the k initial means to k randomly selected input data points. Random partitioning randomly assigns a cluster to each data point and then calculates the initial mean points.
Given an initial set of k means randomly initialized, Lloyd-Forgy’s algorithm pro-
ceeds by alternating between two steps. First, the assignment step assigns each observation to the cluster whose mean is at the least distance. This is intuitively the "nearest" mean. Mathematically, the assignment step amounts to partitioning the observations according to the Voronoi diagram generated by the means. Then the update step calculates the new means (centroids) of the observations in the new clusters. The algorithm has converged when the assignments no longer change, although it is not guaranteed to find the optimum. The k-means method is a fast and simple method for clustering. It is systematically convergent, and offers an easy visualization. On the other hand, the number of clusters k is an input parameter: an inappropriate choice of k may yield poor results. That is why, when performing k-means, it is important to determine the number of clusters in the data set through cross-validation.
Moreover, convergence to a local minimum may produce counterintuitive results.
The k-means++ algorithm (Arthur and Vassilvitskii 2006) tackles this by specifying a procedure to initialize the cluster centers before proceeding with the standard k-means optimization iterations.
K-means++
The idea behind this method is that the more dispersed the initial k cluster centers
are, the better: the first cluster center is uniformly randomly selected from the data
points, and then each subsequent cluster center is selected from the remaining data
points with a probability proportional to its distance squared from the nearest exist-
ing cluster center.
This seeding method significantly improves the final error of the k-means. Although the initial selection in the algorithm takes longer, the k-means part converges very quickly after this seeding, so the algorithm actually reduces the computation time. Moreover, the k-means++ algorithm guarantees an approximation ratio of O(log k), with k the number of clusters used. This is a significant improvement over the standard k-means, which can generate clusters that are arbitrarily worse than the optimum (local minima).
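As a sketch, the snippet below runs k-means with the k-means++ seeding and k = 4, the value used in the analysis reported later; the data are again a synthetic placeholder.

```python
# Hedged sketch of k-means clustering with k-means++ seeding and k = 4;
# the data are a synthetic placeholder, not the CEDRE feature table.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=3000, centers=4, n_features=10, random_state=0)

kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0)
km_labels = kmeans.fit_predict(X)
print(kmeans.inertia_)   # sum of squared distances to the nearest centroid
```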
Results
Supervised learning: detect thesignicant features
Our first objective is to identify the variables that best explain the student’s success.
To do this, we will form a model based on the collected data and some secondary data
constructed from it. ese secondary elements come from the didactic analysis car-
ried out a priori.
After comparing several methods, we chose the random forests. For this method,
the area under the ROC curve reaches 0.78 (Fig. 2). This is more than satisfactory for a binary classifier.
The analysis of variable importance in this model allows us to select the parameters that are decisive in the student's success in completing the item (Fig. 3).
Fig. 2 Receiver operating characteristic
Fig. 3 Feature importance
The first important characteristic is the number of values tested in the tree data table. The second is the variance of the values tested, i.e. the extent to which these values are concentrated around the target value. In the same spirit, we also find the difference between the first input value and the target value, and also (to avoid the noise of the first value) the difference between the second input value and the target value. In addition, we also find the number of alternations around the target value, a characteristic that expresses the choice of a dichotomous search procedure. Finally, but to a lesser extent, we find the time spent resolving the exercise and the number of tools used. The fact that time is less important for classification than other features, though counterintuitive, is consistent with previous research (Qiao and Jiao 2018).
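The ranking of features reported in Fig. 3 can be obtained from the fitted forest's impurity-based importances; a sketch is given below, with placeholder data and with feature names following the list defined earlier.

```python
# Sketch of extracting and ranking the forest's impurity-based feature
# importances; the data are placeholders, and the feature names follow the
# list given in the Data and methods section.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

feature_names = ["month_list_length", "months_std", "dist_first_input",
                 "dist_second_input", "dist_last_input", "n_alternations",
                 "time_spent", "first_input_200_600", "target_tested", "pencil_used"]

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, len(feature_names)))   # placeholder feature table
y = rng.integers(0, 2, size=3000)                 # placeholder success label

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = pd.Series(forest.feature_importances_, index=feature_names)
print(importances.sort_values(ascending=False))
```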
Clustering using DBSCAN
In parallel with this modeling, we have sought to segment our population to be able to
describe the different strategies developed by the students.
To do this, we first chose the DBSCAN algorithm to avoid the difficulty of determining
a priori the number of clusters. Fig. 4 shows the result of this grouping in the space created by the first two dimensions generated by a PCA on the data.
The PCA results are not of importance here. They are simply used to allow a graphical projection of the clusters onto a two-dimensional space.
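A sketch of this projection with scikit-learn follows; the features and the cluster labels are synthetic placeholders.

```python
# Sketch of the two-dimensional PCA projection used only to plot the clusters;
# the data and the DBSCAN labels are synthetic placeholders.
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=3000, centers=4, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)
labels = DBSCAN(eps=0.55, min_samples=40).fit_predict(X)

X_2d = PCA(n_components=2).fit_transform(X)       # first two principal components
plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```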
The limit of DBSCAN is the difficulty of processing clusters with very different densities. DBSCAN avoids this constraint by treating as outliers all values that would cause cluster densities to diverge too widely.
For this dataset, the success rate of the item is 47.3%, and the 4 clusters have approximately the same size. This well-balanced result might not be very surprising considering the behavior of density-based algorithms with items of average difficulty. Let us suppose that we were working on an item with a more extreme success rate, very easy or very difficult. In that case, the profiles we are looking for would have been of very different densities, with one or two large clusters and other small or sparse ones. DBSCAN might have failed to detect all the clusters, and would have labeled the small ones as outliers. Or worse, the algorithm would have merged them into the bigger ones.
Fig. 4 Result of DBSCAN
Using ak‑means method toassess theclustering
This drawback of DBSCAN is of no concern for centroid-based algorithms, such as k-means. By trying to minimize the distance between each point and the
centroid of its cluster, this method allows for clusters with different shapes, sizes and
densities.
By choosing k = 4, we are observing the stability of the cluster distribution. The density is not a determining factor for the k-means clustering and all observations are to be assigned to a cluster. Therefore, if the cluster distribution we obtain is comparable to the
one from DBSCAN, it tells us that a 4-cluster partition with varying densities would not
be a better partition.
We will build 4 clusters in the same space constituted by the first two dimensions of
the PCA, and compare it to the clustering offered by the DBSCAN.
The distribution of observations is very close to the one resulting from DBSCAN (Fig. 5). We will therefore use the segmentation from DBSCAN to try to identify each one of the clusters.
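One simple way to check this agreement, not used as such in the study but consistent with the comparison described here, is to cross-tabulate the two label vectors; a sketch on placeholder data follows.

```python
# Hedged sketch: cross-tabulating DBSCAN and k-means labels to check that the
# two partitions broadly agree; the data are synthetic placeholders.
import pandas as pd
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=3000, centers=4, n_features=10, random_state=0)
X = StandardScaler().fit_transform(X)

db_labels = DBSCAN(eps=0.55, min_samples=40).fit_predict(X)
km_labels = KMeans(n_clusters=4, init="k-means++", n_init=10,
                   random_state=0).fit_predict(X)

# Rows: DBSCAN clusters (-1 = outliers); columns: k-means clusters.
print(pd.crosstab(db_labels, km_labels))
```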
Fig. 5 K-means clustering (PCA-reduced data)
Fig. 6 Average proportion of value per Label
Characterization ofthe4 categories ofstudents
To do this, we will calculate the reduced average proportion in the cluster for each important characteristic (Fig. 6). We will also calculate the reduced average proportion for three additional characteristics: first, those that express success or failure (is the value correct? has the correct value been entered in the spreadsheet?), and second, a characteristic raised by the didactic analysis (has the graphic tool "Pencil" been used by the student?).
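The sketch below illustrates this characterization step: each feature is standardized and then averaged within each cluster label. The feature table and the labels are placeholders, with column names following the features defined earlier.

```python
# Sketch of the cluster characterization step: mean standardized value of each
# important feature per cluster label; data and labels are placeholders.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({                           # placeholder feature table
    "month_list_length": rng.integers(1, 15, 3000),
    "months_std": rng.normal(100, 30, 3000),
    "n_alternations": rng.integers(0, 6, 3000),
    "pencil_used": rng.integers(0, 2, 3000),
    "correct": rng.integers(0, 2, 3000),
})
labels = rng.integers(0, 4, 3000)             # placeholder cluster labels

z = (df - df.mean()) / df.std()               # standardize each feature
profile = z.groupby(labels).mean()            # one row per cluster
print(profile.round(2))
```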
The comparison of these reduced average proportions between the clusters shows very clearly that two axes of division can be distinguished.
First, the model identifies successful and unsuccessful students. As expected, the discriminating variables between these two groups are the presence of the "correct answer" and whether the target value was entered in the "calculation and graph" tool.
Second, the model identifies two approaches to the problem. Students in clusters 3 and 4 used the pencil a lot, recorded few values in the "calculation and graph" tool, and these values were highly concentrated around the target value. In contrast, students in clusters 1 and 2 used the pencil less and entered many values. The values were quite scattered and at a certain distance from the target value. These students also alternated more often between lower and higher values.
Discussion
We can argue that direct analysis of student process data on this interactive item is a
powerful tool to determine not only the student’s success in completing the item, but
also the strategy he or she uses to solve it. This log analysis also confirms the strategies discerned by the didactical analysis.
Clustering analysis distinguishes 4 clusters corresponding to 4 different student profiles. Each cluster's size represents 25% of the responding students. Two profiles (green and orange on the graph) achieved the task. The other two (blue and red) correspond to students who failed. Apart from the obvious variable "value 390 tested", no other variable makes it possible to discriminate between success and failure. This result is disappointing in the sense that the model cannot explain students' achievement on the item, which is one of our research questions. However, variables of interest in the a priori analysis make it possible to describe profiles along a dimension other than achievement. Two clusters (orange and red in Fig. 2) show a large number of inputs, a large number of alternating inputs, a first input far from the target and a large distribution of inputs. These characteristics allow us to infer that students from these groups preferred a "trial and error" solving strategy, approaching the underlying concept of function in its operational aspect. Half of them achieved the task successfully, the other half did not. The other two groups share a different and opposite profile description according to the same variables: a small number of inputs, a small number of alternating inputs, a first input close to the target, a narrow distribution of inputs. Moreover, this second category of students used the pencil more often, altogether identifying strategies related to the structural aspect of functions. As with the first two groups, the structural approach led to failing the task for half of the students favoring it.
Hence, if the didactical and analytical models used in this study could not help us
explain grade 9 students' achievement on this item, they could nevertheless help us
identify two solution strategies well known in the didactics literature. From a curriculum designer's perspective, this kind of result on a national level could be very valuable. It could help decision making based on the evidence of the manifestation of both structural and operational understanding among grade 9 students. Of course, the analysis should be replicated and extended to a full set of items potentially aiming at discriminating between the two conceptions within the variation and relationships domain or even in other domains. Furthermore, as it is strongly based on research findings in didactics, such a result can be fruitfully disseminated to subject specialists such as policy makers and stakeholders in charge of teacher training, contributing to a better use of large-scale assessment findings at the classroom level.
However, there are limitations to relying on log data analysis in order to determine
learning profiles. One of them lies in the fact that, even if the logs carry a lot of information, this information remains partial when it comes to describing the complete cognitive activity at play when solving the item. Quite a large amount of "idle time" is usually
recorded in the logs. Log data alone is unable to help us understand what students did
during this time. Moreover, there is a strong chance that students are not that “idle”
when no activity is logged within the assessment system: a lot can happen within a class-
room, even during a standardized test administration. The use of scratch paper or a
personal calculator, as well as interactions between students or with the test adminis-
trator, are sometimes critical within the problem-solving process. Taking these external
factors into account, in addition to the log data, would contribute to addressing fully
the question of the interactive items’ validity in terms of capturing learning strategies.
Complementary research, focusing on user experience, would then consist of collecting
and analyzing data from actual test administration observations, possibly enriched with
eye-tracking technology or “think aloud” recordings and case studies.
The potential of technology-enriched problem solving situations in large-scale assessments is obvious. The type of findings this study produced is actionable from a teacher's perspective in order to address students' specific needs. The DEPP is currently designing and implementing a census-based assessment in mathematics at the beginning of grade 10. Its main objective is to report individual profiles in mathematics, consisting of a score in various mathematical sub-domains. Being able to additionally deliver a qualitative profile based on students' strategic behavior when solving technology-enhanced problems would add value to the national feedback and help teachers to better support students in their learning. Assuming they benefit from consistent and didactics-based training, teachers would be able to differentiate teaching in the classroom according to the cognitive profile of each student provided by the national testing platform. A lot
needs to be achieved before being able to devise this kind of assessment instrument. One
obstacle, in particular, to the generalization of the present study is the fact that analysis
depends on the studied item’s characteristics. e feature engineering and construction
of new variables, for instance, cannot be exactly replicated for another item, due to the
fact that it depends on the specific tools available in the item. Therefore, this research is not easily reproducible on a different set of items or other situations. Developing an experimental methodology for each situation raises a lot of issues regarding large-scale assessment constraints, but this very promising first step certainly shows the need for further research and new partnerships. The DEPP is now exploring ways to industrialize
its methods in order to run automatic log data analysis. More specifically, it is engaging in partnerships to address research objectives regarding the relationship between problem solving strategies and achievement on a mathematics test rather than on a single item. Among these objectives, we are investigating whether lower-achieving/higher-achieving students consistently adopt one strategy over another, whether higher-performing students adapt their strategies to the task and whether the choice of strategy is part of the competency.
Acknowledgements
We thank Victor Azria and Stéphane Germain of CapGemini France for supporting the statistical analysis. We thank the
SCRIPT of the Ministry of National Education, Children and Youth of Luxembourg and Vretta Inc. for supporting the
technology-enhanced items’ development.
Authors’ contributions
FS contributed to the design of the theoretical framework. SK and RDS contributed to the statistical analysis. All authors
read and approved the final manuscript.
Funding
The DEPP is a department of the ministry of education and is therefore publicly funded.
Availability of data and materials
Availability of data and materials is contingent on specific agreement between the ministry of education in France and
interested research institutions.
Competing interests
The authors declare that they have no competing interests.
Received: 10 January 2020 Accepted: 19 May 2020
References
Arthur, D., & Vassilvitskii, S. (2006). k-means++: The advantages of careful seeding. ilpubs.stanford.edu.
Barandiaran, I. (1998). The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analy-
sis and Machine Intelligence, 20(8), 1–22.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern
Recognition, 30(7), 1145–1159.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Brousseau, G. (2006). Theory of didactical situations in mathematics: Didactique des Mathématiques, 1970–1990. Berlin:
Springer Science & Business Media.
Doorman, M., Drijvers, P., Gravemeijer, K., Boon, P., & Reed, H. (2012). Tool use and the development of the function con-
cept: from repeated calculations to functional thinking. International Journal of Science and Mathematics Education,
10(6), 1243–1267. https ://doi.org/10.1007/s1076 3-012-9329-0.
Douady, R. (1991). Tool, Object, Setting, Window : Elements for Analysing and Constructing Didactical Situations in
Mathematics. In A. J. Bishop, S. Mellin-Olsen, & J. Van Dormolen (Éd.), Mathematical Knowledge : Its Growth Through
Teaching (p. 107–130). Springer Netherlands. https ://doi.org/10.1007/978-94-017-2195-0_6
Drijvers, P. (2019). Digital assessment of mathematics: opportunities, issues and criteria. Mesure et Évaluation En Éducation,
41(1), 41–66. https ://doi.org/10.7202/10558 96ar.
Drijvers, P., Ball, L., Barzel, B., Heid, M. K., Cao, Y., & Maschietto, M. (2016). Uses of digital technology in lower secondary math-
ematics education. Berlin: Springer.
Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al. (1996). A density-based algorithm for discovering clusters in large spatial
databases with noise. Kdd, 96, 226–231.
Gobert, J. D., Pedro, M. S., Raziuddin, J., & Baker, R. S. (2013). From log files to assessment metrics: measuring students’
science inquiry skills using educational data mining. Journal of the Learning Sciences, 22(4), 521–563. https ://doi.
org/10.1080/10508 406.2013.83739 1.
Greiff, S., Niepel, C., Scherer, R., & Martin, R. (2016). Understanding students’ performance in a computer-based assess-
ment of complex problem solving: an analysis of behavioral data from computer-generated log files. Computers in
Human Behavior, 61, 36–46. https ://doi.org/10.1016/j.chb.2016.02.095.
Greiff, S., Wüstenberg, S., & Avvisati, F. (2015). Computer-generated log-file analyses as a window into students’ minds? A
showcase study based on the PISA 2012 assessment of problem solving. Computers and Education, 91, 92–105. https
://doi.org/10.1016/j.compe du.2015.10.018.
Hao, J., Shu, Z., & von Davier, A. (2015). Analyzing process data from game/scenario-based tasks: an edit distance
approach. Journal of Educational Data Mining, 7(1), 33–50.
Hao, J., Smith, L., Mislevy, R., von Davier, A., & Bauer, M. (2016). Taming log files from game/simulation-based assessments:
data models and data analysis tools: taming log files from game/simulation-based assessments. ETS Research Report
Series, 2016(1), 1–17. https ://doi.org/10.1002/ets2.12096 .
Page 20 of 20
Sallesetal. Large-scale Assess Educ (2020) 8:7
He, Q., & von Davier, M. (2015). Identifying feature sequences from process data in problem-solving items with n-grams.
Quantitative psychology research (pp. 173–190). Berlin: Springer.
Hopcroft, J. E., Ullman, J. D. (1983). Data structures and algorithms.
Kilpatrick, J., Swafford, J., Findell, B., National Research Council (U.S.), & Mathematics Learning Study Committee. (2001).
Adding it up : Helping children learn mathematics. National Academy Press. http://site.ebrar y.com/id/10038 695
Laborde, C. (2018). Intégration des technologies de mathématiques dans l’enseignement. In Guide de l’enseignant. Ensei-
gner les mathématiques. (Belin, pp. 336–366). https ://publi math.irem.univ-mrs.fr/bibli o/PGE18 015.htm
Lloyd, S. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129–137.
MacQueen, J., & others. (1967). Some methods for classification and analysis of multivariate observations. Proceedings of
the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1, 281–297. Oakland.
MEN, DEPP. (2017). Cedre 2014: Mathématiques en fin de collège (No. 209). Retrieved from Ministère de l’éducation nation-
ale website: https ://www.educa tion.gouv.fr/cid12 2693/cedre -2014-mathe matiq ues-en-fin-de-colle ge.html
OECD. (2013). PISA 2012 assessment and analytical framework: Mathematics, reading, science, problem solving and financial
literacy. Paris: OECD.
Qiao, X., & Jiao, H. (2018). Data mining techniques in analyzing process data: a didactic. Frontiers in Psychology, 9, 2231.
Rabardel, P. (2002). People and technology: A cognitive approach to contemporary instruments. (Université Paris 8).
Robert, A. (1998). Outils d’analyses des contenus mathématiques à enseigner au lycée et à l’université, Recherches
en didactique des mathématiques, Vol 18 2 pp. 139–190. Recherches En Didactique Des Mathématiques, Vol 18 2,
139–190.
Roditi, E., Salles, F. (2015). Nouvelles analyses de l’enquête PISA 2012 en mathématiques, un autre regard sur les résultats.
Revue Éducation et formations, (n° 86–87), 24.
Sfard, A. (1991). On the dual nature of mathematical conceptions: reflections on processes and objects as different sides
of the same coin. Educational Studies in Mathematics, 22(1), 1–36.
Stacey, K., & Wiliam, D. (2013). Technology and assessment in mathematics. In M. A. (Ken) Clements, A. J. Bishop, C. Keitel,
J. Kilpatrick, & F. K. S. Leung (Eds.), Third International Handbook of Mathematics Education (p. 721–751). https ://doi.
org/10.1007/978-1-4614-4684-2_23
Trouche, L. (2003). From artifact to instrument: mathematics teaching mediated by symbolic calculators. Interacting with
Computers, 15(6), 783–800. https ://doi.org/10.1016/j.intco m.2003.09.004.
Wierzchoń, S. T., & Kłopotek, M. A. (2018). Cluster analysis. In S. Wierzchoń & M. Kłopotek (Eds.), Modern Algorithms of Clus-
ter Analysis (pp. 9–66). Cham: Springer International Publishing. https ://doi.org/10.1007/978-3-319-69308 -8_2.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.