
Automatic Processing of Text Responses in Large-Scale Assessments

  • DIPF | Leibniz Institute for Research and Information in Education

Abstract

In automatic coding of short text responses, a computer categorizes responses. In this thesis, free software has been developed that is capable of (i) grouping text responses into semantically homogeneous types, (ii) coding the types, and (iii) extracting features. Results showed fair to good up to excellent agreement between the software’s and humans’ coding (76–98%) of 41,990 responses. Two further studies showed how the software can advance the assessment process and content-related research areas.
Lehrstuhl für Empirische Bildungsforschung
Zentrum für Internationale Bildungsvergleichsstudien (ZIB) e.V.
Automatic Processing of Text Responses in
Large-Scale Assessments
Fabian Zehner
Complete reprint of the dissertation approved by the faculty TUM School of Education of the Technische
Universität München for the award of the academic degree of a
Doktor der Philosophie (Dr. phil.)
Chair: Prof. Dr. Kristina Reiss
Examiners of the dissertation: 1. Prof. Dr. Manfred Prenzel
2. Prof. Dr. Frank Goldhammer,
Deutsches Institut für Internationale Pädagogische Forschung (DIPF)
3. Prof. Dr. Peter Hubwieser
The dissertation was submitted to the Technische Universität München on 31 March 2016
and accepted by the faculty TUM School of Education on 22 July 2016.
In automatic coding of short text responses, a computer categorizes or scores
responses. In this dissertation, free software has been developed that is capable
of (i) grouping text responses into semantically homogeneous types, (ii) coding the
types (e.g., correct/incorrect), and (iii) extracting further features from the responses.
The software overcomes the crucial disadvantages of open-ended response
formats, opens new doors for the assessment process, and makes raw responses in
large-scale assessments accessible as a new source of information. Three studies
analyzed n = 41,990 responses from the German sample of the Programme for
International Student Assessment (PISA) 2012 with different research interests, which
were balanced according to three pillars. Publication (A) introduced the software
and evaluated its performance. This involved pillar (I), which investigates, opti-
mizes, and evaluates the algorithms and statistical models for automatic coding.
Publication (B), studying how to train the software with less data, also concerned
pillar (I) but then covered pillar (II) too, which deals with potential innovations
in the assessment process. The article demonstrated how coding guides can be
created or improved by the automatic system incorporating the variety of the em-
pirical data. Pillar (III), attempting to add to content-related research questions,
was covered in publication (C). It analyzed differences in the text responses of girls
and boys to shed light on the gender gap in reading. Results showed fair to good up
to excellent agreement beyond chance between the software’s and humans’ coding
(76–98%), according to publication (A). Publication (B) demonstrated that, on av-
erage, established PISA coding guides only covered about 28 percent of empirically
occurring response types, and, at the same time, the software enabled the automatic
expansion of the coding guides in order to cover the remaining 72 percent. Publi-
cation (C) concluded that the difficulties some boys face in reading were associated
with a reduced availability and flexibility of their cognitive situation model and their
struggle to correctly identify a question’s aim. The analysis showed, among other things,
that boy-specific responses were characterized by remarkably fewer propositions,
and those few propositions turned out to be irrelevant more often than those in
girl-specific responses. The findings of the studies in pillars (II) and (III) illustrate
how the developed approach and software can advance research fields and innovate
educational assessment. The three pillars raised by this dissertation will be fortified
in further studies. One of the crucial challenges will be to balance use cases and
further software development, in order to take advantage of the innovations in nat-
ural language processing while identifying the demands in the social sciences. This
paper ends with a detailed discussion on limitations, implications, and directions of
the presented research.
Keywords: Computer-Automated Scoring, Automatic Short Answer Grading,
Automatic Coding, Open-Ended Responses, Short Text Responses, Reading Liter-
acy, Coding Guides, Gender Gap
A research project never evolves in a researcher’s mind alone but rather within a diverse
setting of circumstances and individuals. While being introduced to the academic
and scientific world, I was highly privileged to be embedded into the lively theme park of
the PISA studies with all its thrilling roller coasters and their most competent as well as
kind operators.
There is Professor Dr. Manfred Prenzel. He pointed out the demands of the park investors
and engineers to me and evaluated the newest inventions from the lab. His way of
communicating and of treating content-related issues has hopefully left its mark on me.
There are Professor Dr. Frank Goldhammer and Dr. Christine Sälzer. They navigated
me through the labyrinth-like paths in the park. They helped me to focus on investigating
only one roller coaster instead of riding on two at a time; well, actually not the whole
roller coaster in one study but only the blue screw at the lower left wheel of the third
car, which tended to wobble a little in the final bend.
Further companions made the park attractions interesting and comforting at the same
time. There is the national PISA team with, again, Dr. Christine Sälzer, Dr. Anja Schie-
pe-Tiska, Stefanie Schmidtner, Jörg-Henrik Heine, Julia Mang, and Elisabeth González
Rodríguez. It is a pleasure to furnish the park with additional signposts, to install the
newest swing ride and bumper cars, and to discuss them with you. There are the innu-
merable research assistants who helped record the park visitors’ talk with enormous
effort and engagement. There is Jörg-Henrik Heine, always available for a stimulating
chat about techniques of log rides, LaTeX, R, or the latest IRT gossip. There is Dr.
Markus Gebhardt, who showed me how to thrill the park visitors with an aquaponic
system. There is Professor Dr. Torsten Zesch, who helped me a lot in setting up the new
and fancy roller coaster firmware. There is the park’s Department of Communication in
the form of the TUM English Writing Center with Jeremiah Hendren, its staff, and its
tutors. They taught me not to unethically bother stakeholders with linguistic redundancy
and complexity.
There is my family. They are responsible for the outcome of my research for obvious
reasons.
There is my wife. You are the plain motive for me to be in the park.
Contents
1 The Multiple Faces of Language
2 Methods and Results
   2.1 Publication (A): The Software for Processing Text Responses
      2.1.1 Context and Related Work
      2.1.2 Proposed Collection of Methods for Automatic Coding
      2.1.3 Participants and Materials
      2.1.4 Research Questions
      2.1.5 Result Highlights
   2.2 Publication (B): The Use and Improvement of Coding Guides
      2.2.1 Context and Related Work
      2.2.2 Adaptation of the Processing of Responses
      2.2.3 Research Questions
      2.2.4 Result Highlights
   2.3 Publication (C): The Reading Gender Gap in Text Responses
      2.3.1 Context and Related Work
      2.3.2 Automatic Processing of Features in Text Responses
      2.3.3 Research Questions
      2.3.4 Result Highlights
3 Discussion
   3.1 Discussing and Linking the Main Findings
   3.2 Limitations and Future Directions
References
A Dissertation Publications
1 The Multiple Faces of Language
The studies in this dissertation deal with the automatic processing of short text responses
in assessments. Put in a nutshell, they used software for automatic processing
in order to (A) identify correct responses, (B) empirically improve reference responses
in coding guides, and (C) capture features in responses that are sensitive to subgroup
differences. While only the richness of natural language enabled these innovations at all,
at the same time, they were restricted by some facets of language. In this sense, language
has multiple faces that influenced the studies either constructively or destructively.
Language is one of our main ways to express cognitions. Since the quality of the cog-
nitions is often at the core of educational and psychological research, many assessment
instruments operate through language by presenting the stimulus and capturing the re-
sponse. For the latter, the response can be assessed via closed response formats, such as
multiple choice, or open-ended response formats, such as free-text fields. Because closed
formats allow data to be easily processed, they have become state of the art. However,
sometimes only open-ended response formats, as opposed to closed ones, assess the full scope of
the intended construct (e.g., for reading: Millis, Magliano, Wiemer-Hastings, Todaro, &
McNamara, 2011; Rauch & Hartig, 2010; Rupp, Ferne, & Choi, 2006; for mathematics:
Birenbaum & Tatsuoka, 1987; Bridgeman, 1991), which is why open-ended questions are
typically included in established large-scale assessments, such as the Programme for
International Student Assessment (PISA; OECD, 2013b), despite their drawbacks.
Closed response formats evoke different cognitive processes than open-ended ones do.
The two formats involve at least two different faces of language. One is the severe judge
in a robe, faced with facts. The judge strikes the gavel onto the lectern upon evaluating
each given response option as either wrong or right. The other face is an old and com-
plex three-headed tortoise oracle, with its three heads in charge of information querying,
solution finding, and solution expressing, the three heads constantly arguing with each
other in order to come up with a result.
PISA assesses hundreds of thousands of fifteen-year-old students, who are instructed
to respond to questions about a read text, mathematical problem, or scientific matter.
The variability and sheer mass of the responses necessitate huge manual efforts from
trained human coders deciding on the correctness of the responses. The human coders, in
turn, entail some disadvantages (cf. Bejar, 2012): their subjective perspectives, varying
abilities (e.g., coding experiences and stamina), the need for consensus building measures
(i.a., coder trainings), and in turn a high demand on resources (i.a., time).
In order to overcome these disadvantages of the open-ended response format, a com-
puter can be used to automatically code the responses. Therefore, in this dissertation,
software has been developed that is capable of coding and scoring responses. Furthermore,
responses contain linguistic information that hitherto could not be used in large-
scale assessments due to the large volume of data. Besides grouping the responses into
semantically homogeneous types, the software thus also captures response features that
provide further insights into the respondents. For example, the software checks whether
information given in a response has been repeated from the stimulus or constitutes newly
added information from the respondent.
The studies in this dissertation are centered around the software and assemble at
three pillars. Studies in pillar (I) concern the development of the software for processing
responses. They aid in identifying, evaluating, and improving appropriate algorithms,
techniques, and statistical models (Zehner, Goldhammer, & Sälzer, 2015; Zehner, Sälzer,
& Goldhammer, 2016). Studies in pillar (II) deal with new possibilities in the assessment
process introduced by automatic processing of text responses. For example, the aforementioned
publication by Zehner et al. (2015) demonstrated how established coding guides
can be improved by sampling prototypical responses from the empirical data as new
reference responses. In this way, the coding guides cover the broad range of response
types that human coders meet during their work. In pillar (III), the studies use the
software to contribute to open content-related research questions. Zehner, Goldhammer,
and Sälzer (submitted) explored features in boys’ and girls’ responses to further explain
the gender differences repeatedly found in reading. The studies analyzed data from the
German PISA 2012 sample and mainly focused on questions assessing reading literacy,
but they also included first evidence for mathematical and scientific literacy. Further
studies will fortify and elaborate on each of the three pillars in the future.
All the software developments and analyses described have been accompanied by
the multiple faces of language. On the one hand, natural language is abundant in
information—a white-haired, long-bearded wise man sitting at the campfire who can
tell a dozen stories when stumbling across a single word. This facet of language al-
lows the assessment of multiple features from responses that furnish information about
a respondent. It also enables the automatic scoring of responses as opposed to manual
scoring, because the given linguistic information suffices to make scoring decisions. On
the other hand, natural language is not an absolute system with symbols (words, phrases)
of which each uniquely refers to one and only one entity (e.g., an object). Rather, in the
terminology of Peirce (1897/1955), a reader perceives a symbol (word) and forms an idea
of this symbol in their own mind, which is the interpretant—this process is often called
grounding (Glenberg, de Vega, & Graesser, 2008). The symbol (word) itself is related
to an object. Since the grounding process with the resulting interpretant is related to
the object but not conclusively identical across different readers, language needs to be
understood as a communication medium with information loss. Hence, a writer’s inter-
pretant tends to differ from the reader’s interpretant. In this light, language is also a
mystical fortune teller swirling over a crystal ball and extrapolating questionable ideas
from observable signs. This face of language is very salient for human coders when con-
fronted with student responses. Fortunately, human language comprehension is designed
to deal with this fuzziness. However, this is a source for inconsistencies in human ratings.
Computers, in contrast, are designed to work precisely and reliably. When attempting to
make a computer behave fuzzily, computer engineers face remarkable obstacles, for
example, when trying to generate a truly random number, which might appear to be a
trivial task compared to the processing of natural language.
The faces of language depicted here are not exhaustive. For example, there is the
red-headed, smirking twelve-year-old boy, excitedly shifting from one foot to the other,
who is longing to tell the latest pun. Or there is the eager eleven-year-old girl, starting
to learn a foreign language with eyes wide open due to the surprise when a new world
appears on her horizon. Good professional writers are experts in utilizing the natural lan-
guage’s different faces, in order to manipulate what their readers perceive while reading.
Analogously, we as researchers should be able to use the opportunities coming along with
open-ended response formats instead of only being constrained by their complications. This
dissertation contributes to lowering the required manual effort through automatic processing,
and it attempts to increase both the usage frequency of the open-ended format and the
comprehensiveness of the information gained from responses. That said, it needs to be emphasized
that closed response formats are not generally considered to be worse than open-ended
ones. But the researcher should simply not base the decision for either response format
on the coding effort but on the construct.
2 Methods and Results
In this dissertation, software has been developed, and the studies were balanced
according to three pillars. Publication (A) introduced the software, presented empirical evidence
regarding its performance, and showed which parameter values and methods worked best
for the German PISA data. This involved pillar (I), which investigates, optimizes, and
evaluates the algorithms and statistical models for automatic processing. Publication (B),
studying how to train the cluster model with less data, also partly concerned pillar (I)
but then covered pillar (II) too, which deals with potential innovations in the assessment
process through automatic text response processing. The article demonstrated how cod-
ing guides can be improved by automatic identification of response types in the empirical
data. Pillar (III), attempting to add to open, content-related research questions, was
covered in publication (C). It analyzed differences in the responses of girls and boys to
further shed light on the gender gap in reading. The following subsections outline the
corresponding research interests, methodological approaches, and main findings.
2.1 Publication (A): The Software for Processing Text Responses
The software development, its evaluation, further analyses, the designs of experiments,
and the structuring along with the writing of the manuscript were carried out in the
context of the dissertation by the first author. Also, the initial interest in the study
was raised by the first author. The two co-authors advised on the analyses, on strategic
decisions, such as the selection of data and the target journal, and on the final editing
of the manuscript. It has been submitted to the journal Educational and Psychological
Measurement in December 2014 and accepted in February 2015. It was made
available online in June 2015 and appeared in a print issue in April 2016.
Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses via
clustering in educational assessment. Educational and Psychological Measurement, 76(2), 280–303. doi:
2.1.1 Context and Related Work
The paper proposed a collection of baseline methods that is capable of categorizing or
scoring student responses automatically. The methods were implemented by software
components under open licenses. The feasibility and validity of the method collection were
demonstrated by using data assessed in PISA 2012 in Germany. Large-scale assessments,
in particular, typically require enormous effort and resources in terms of human coding of
open-ended responses. Naturally, they are a highly suitable field to apply automatic cod-
ing. This seems particularly true for international studies, since their inherent endeavor
is to maximize consistency across different test languages (cf. OECD, 2013b).
For twenty years now (Burstein, Kaplan, Wolff, & Lu, 1996), the issue of natural
language processing for automatic coding of short text responses has been addressed by
several research groups. However, in contrast to the strongly related but somewhat older field
of essay grading (Dikli, 2006), its methods are not commonly used in practice. In fact,
a large number of open natural language processing libraries are available. These can
be combined with statistical components in order to conduct automatic coding. In the
last two decades, software and approaches have been published that can be used for
automatic coding of short text responses, such as c-rater (Leacock & Chodorow, 2003)
and AutoTutor (Graesser et al., 1999). Only in recent years has automatic coding
been receiving notable attention from larger research projects such as Smarter Balanced
Assessment (Smarter Balanced Assessment Consortium, 2014).
It is beyond the scope of this paper to provide an extensive overview of existing
systems; the reader might want to refer to the comprehensive overview by Burrows,
Gurevych, and Stein (2014). The proposed approach and software differ from existing
systems in the following characteristics. First, the approach does not yet include the
most powerful machine learning methods. Second, however, this offers more flexibility
and transparency for researchers in the social sciences, because they are largely familiar
with the methods and tools. Third, the software will be made available free
and open to encourage researchers to use automatic coding. Fourth, the software can
be adapted, for example, by using the scripting language R, in order to tailor the response
processing to the study’s research questions.
2.1.2 Proposed Collection of Methods for Automatic Coding
The proposed procedure for automatic coding of text responses can be split into three
phases. First, each text response is transformed into a quantified representation of its
semantics. Second, the responses are grouped into response types by their semantics in a
clustering model. Third, the response types are assigned to interpretable codes, such as
incorrect, by machine learning. Figure 1 illustrates the procedure.
Phase One. In a first step of phase one, the text response is preprocessed (cf. part A
in Figure 1). Besides further basic transformations, such as punctuation removal, digit
removal, and decapitalization, tokenizing splits the response into word chunks. Next,
spelling correction is applied. Functional words (stop words), which do not carry crucial
semantic information, are omitted. Finally, stemming cuts off affixes in the response. In
the next step, representations of the responses’ word semantics are needed. Therefore,
a big text corpus is analyzed with a statistical method in order to build the machine’s
lexical knowledge. Wikipedia, for instance, can serve as an adequate corpus. Vector
space models are commonly used to model semantics, called semantic spaces. These are
hyper-dimensional spaces in which each word is represented by a vector. The semantic
similarity of words is defined by the angle between two of these vectors. Semantic spaces
can be computed by, for instance, Latent Semantic Analysis (LSA; Deerwester, Dumais,
Furnas, Landauer, & Harshman, 1990) or Explicit Semantic Analysis (ESA; Gabrilovich
& Markovitch, 2007). Now, having semantic knowledge about words, the computer can
extract the response semantics (cf. part B in Figure 1). The simplest way is to compute
the centroid vector of all words in the preprocessed response. In the case of utilizing LSA,
the result could be a 300-dimensional vector constituting the response’s total semantics.
This centroid vector represents the response in all further analyses.
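The steps of phase one can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the stop-word list, the crude suffix-stripping stemmer, and the tiny three-dimensional semantic space are stand-ins for the real components (a spell checker, a proper stemmer, and an LSA space with roughly 300 dimensions estimated from a large corpus such as Wikipedia).

```python
import re
import string

# Illustrative stand-ins for the real preprocessing resources.
STOP_WORDS = {"the", "a", "is", "of", "and"}

def preprocess(response):
    """Decapitalize, strip punctuation/digits, tokenize, drop stop words, stem."""
    text = response.lower()                                   # decapitalization
    text = re.sub("[%s0-9]" % re.escape(string.punctuation), " ", text)
    tokens = text.split()                                     # tokenizing
    tokens = [t for t in tokens if t not in STOP_WORDS]       # stop-word removal
    return [re.sub(r"(ing|ed|s)$", "", t) for t in tokens]    # crude stemming

# Toy "semantic space": each known word maps to a vector.
SPACE = {
    "dog":  [0.9, 0.1, 0.0],
    "bark": [0.8, 0.2, 0.1],
    "cat":  [0.1, 0.9, 0.0],
}

def centroid(tokens):
    """Average the vectors of all known words: the response's total semantics."""
    vecs = [SPACE[t] for t in tokens if t in SPACE]
    if not vecs:
        return [0.0] * 3
    return [sum(dim) / len(vecs) for dim in zip(*vecs)]

tokens = preprocess("The dogs barked!")   # -> ["dog", "bark"]
vec = centroid(tokens)
```

The resulting centroid vector is what enters the clustering of phase two.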
2 METHODS AND RESULTS Automatic Processing of Text Responses
Figure 1: Schematic Figure of Automatic Coding (Zehner et al., 2016, p. 4)—In phase one, the
computer preprocesses the response (A) and extracts its semantics (B). In the second and third
phase, these numerical semantic representations are used for clustering (C) and for machine
learning, in which a code is assigned to each cluster (D). Finally, a new response receives the
code of the cluster which is most similar to it (E).
Phase Two. In the second phase, groups of similar responses are built by an ag-
glomerative hierarchical cluster analysis (cf. part C in Figure 1). The groups constitute
response types. The method has several adjustable parameters. First, (dis-)similarity is
typically defined as the (arc-)cosine of two vectors. Second, it is suitable to use Ward’s
method (Ward, 1963) as the clustering algorithm. Third, the number of clusters is deter-
mined by the researcher in order to attain the best solution.
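The bottom-up logic of phase two can be illustrated with a toy agglomerative routine. For brevity it uses average linkage, whereas the software relies on Ward's method via standard statistical libraries; the function below is thus a didactic sketch under simplified assumptions, not the actual implementation.

```python
from math import acos, sqrt

def cosine_dist(a, b):
    """(Arc-)cosine dissimilarity between two semantic vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return acos(max(-1.0, min(1.0, dot / norms)))

def agglomerate(vectors, k):
    """Toy bottom-up clustering: repeatedly merge the two closest clusters
    until k clusters remain; returns clusters as lists of vector indices."""
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):  # average pairwise distance between two clusters
        dists = [cosine_dist(vectors[i], vectors[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > k:
        a, b = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merged = clusters[a] + clusters[b]
        clusters = [c for n, c in enumerate(clusters) if n not in (a, b)] + [merged]
    return clusters

# Four response vectors: two "dog-like", two "cat-like".
responses = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
types = agglomerate(responses, k=2)   # two semantically homogeneous types
```

In practice, the number of clusters k is the parameter the researcher varies to find the best solution.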
Up to this point, the procedure works with text response data only. This is especially
reasonable for data sifting and is called unsupervised learning. The procedure does not
utilize another criterion, such as a variable indicating whether a response is correct or
incorrect. Such an external criterion would allow model optimization, which is then
called supervised as opposed to unsupervised learning. If some kind of supervised learning
is the task’s overall aim, such as scoring responses, the model parameter values should
be varied systematically in order to choose the best performing one. This is reasonable,
among others, for the most important parameter choice, which is the number of clusters.
Phase Three. Where the overall goal is automatic coding, meaning that a text
response needs to be assigned to one of a fixed range of values (e.g., correct), an external
criterion is used from which to learn the relation between the semantic vectors and the
intended code (cf. part D in Figure 1). This is supervised learning and constitutes
the third phase. The external criterion can be judgments by human coders. Supervised
machine learning procedures are separated into two steps: training and testing. The training
step only uses a subset of the data and puts the rest aside for testing. The method used
in the paper is called stratified, repeated tenfold cross-validation. Details on the method
can be found in Witten, Frank, and Hall (2011), and for comparisons with other methods
see Borra and Di Ciaccio (2010).
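The idea behind such a stratified split can be sketched as follows; the function name and the round-robin dealing of indices are illustrative choices, not the paper's implementation, and a real analysis would use k = 10 with several repeats.

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k=10, repeats=1, seed=42):
    """Yield (train, test) index lists; each test fold roughly preserves the
    label proportions (stratified), and the split is redrawn `repeats` times."""
    rng = random.Random(seed)
    for _ in range(repeats):
        by_label = defaultdict(list)
        for idx, lab in enumerate(labels):
            by_label[lab].append(idx)
        folds = [[] for _ in range(k)]
        for indices in by_label.values():
            rng.shuffle(indices)
            for pos, idx in enumerate(indices):
                folds[pos % k].append(idx)        # deal indices round-robin
        for test in folds:
            held_out = set(test)
            train = [i for i in range(len(labels)) if i not in held_out]
            yield train, test

labels = ["correct"] * 6 + ["incorrect"] * 4
splits = list(stratified_kfold(labels, k=2))      # two folds, one repeat
```

Each fold thus keeps the correct/incorrect ratio of the full data set, so the trained model is evaluated on held-out responses that resemble the training distribution.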
For building the classification cluster model, the code of a response type (e.g., correct)
is determined by the highest conditional probability of the codes for all responses that
are assigned to the type. In this way, unseen responses can be classified by using the
cluster model and the codes of the response types (cf. part E in Figure 1). To do so, the
highest similarity between the response and a cluster centroid determines the assignment
of a response to a response type. The corresponding response type code is then assumed
to be the response code. This is applied to the test data which had not been included in
the model training, in order to compute the model performance. The simplest coefficient
for the model performance is the percentage of agreement between computer and human.
Another important coefficient for the inter-rater agreement is kappa (Cohen, 1960), which
corrects for a priori probabilities of code occurrences.
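The code assignment and the two agreement coefficients can be sketched as follows; the example codes are invented for illustration.

```python
from collections import Counter

def cluster_code(human_codes):
    """A cluster receives the code with the highest conditional probability,
    i.e. the majority human code among the responses assigned to it."""
    return Counter(human_codes).most_common(1)[0][0]

def percent_agreement(a, b):
    """Simplest performance coefficient: share of identical codes."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def cohens_kappa(a, b):
    """Chance-corrected agreement (Cohen, 1960): observed agreement relative
    to the agreement expected from the raters' marginal code frequencies."""
    p_o = percent_agreement(a, b)
    p_e = sum((a.count(c) / len(a)) * (b.count(c) / len(b)) for c in set(a) | set(b))
    return (p_o - p_e) / (1 - p_e)

# Invented example codes for five test responses.
human   = ["correct", "correct", "incorrect", "incorrect", "correct"]
machine = ["correct", "correct", "incorrect", "correct",   "correct"]
```

Here the raw agreement is 80 percent, while kappa is noticeably lower because both raters assign correct most of the time, so much of the raw agreement is expected by chance.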
2.1.3 Participants and Materials
The analyzed n = 41,990 responses come from the German PISA 2012 sample. This
includes a representative sample of fifteen-year-old students as well as a representative
sample of ninth-graders in Germany. A detailed sample description can be found in
Prenzel, Sälzer, Klieme, and Köller (2013) and OECD (2014). In PISA 2012, reading,
mathematics, and science were assessed paper-based. Hence, the paper booklets needed
to be scanned, and responses were transcribed by six persons. That is why not all items
but only ten transcribed ones, including eight reading, one mathematics, and one science
item, were at hand. All items were coded dichotomously; that is, responses either got
full or no credit. Item and response contents could not be reported due to the items’ confidentiality.
2.1.4 Research Questions
In order to evaluate the proposed collection of methods for automatic coding, the article
answers three main research questions. The second research question splits into seven
different analyses that investigate the proposed collection of methods in detail.
1. How well does the proposed collection of methods for automatic coding perform
using German PISA 2012 responses?
2. Which of the following steps and parameter configurations in the proposed collection
of methods lead to higher system performance?
(I) vector space model as opposed to testing for the existence of plain words
(II) automatic as opposed to no or manual spelling correction
(III) Latent Semantic Analysis (LSA) as opposed to Explicit Semantic Analysis (ESA)
(IV) different text corpora as the base for the semantic space
(V) different numbers of LSA dimensions for the semantic space
(VI) different distance metrics for the clustering
(VII) different agglomeration methods for the clustering
3. How well does the proposed collection of methods for automatic coding perform
using smaller sample sizes?
2.1.5 Result Highlights
The agreement between human raters and the system reached high percentages from 76
to 98 percent (M = 88%) across the ten items. According to Fleiss’s definition (Fleiss,
1981), the kappa agreement ranged from fair to good up to excellent agreement beyond
chance (.46 ≤ κ ≤ .96). The models’ reliabilities were acceptable with κ ≥ .80 except for
two items with κ = .74 and κ = .77.
The analyses studying the second research question demonstrated which parameter
values or methods outperformed others. (I) Even for a question in which four terms needed
to be repeated from the stimulus, a vector space model yielded higher performance than
testing for the existence of plain words. The vector space model represents semantic
concepts and, thus, among others, takes synonyms into account. (II) Spelling correc-
tion was important for some items but not for others. The automatic spelling correction
reached performance levels similar to the manual correction. (III) LSA outperformed
ESA. (IV) If text corpora were large enough, their domain-specificity did not matter
anymore. (V) With at least 100 dimensions, the variation of the number of LSA dimen-
sions did not impact the performance. (VI) The distance metrics arccosine, Euclidean
distance, and Manhattan distance performed equally well but significantly better than
others. (VII) The agglomeration methods Ward’s method, McQuitty’s method, and complete
linkage performed equally well but also significantly better than others.
In the sample size simulation experiment, the system performed equally well despite a
large loss of data points from n = 4,152 down to a sample size of about n = 1,600. With a
small but acceptable loss of performance, the system worked well with sample sizes down
to about n = 250. With smaller sample sizes than this, the system should not be used
for similar data.
2.2 Publication (B): The Use and Improvement of Coding Guides
The study was initiated by the first author in the context of the dissertation. The ad-
ditional software development, analyses, and the structuring along with the writing of
the manuscript were carried out by the first author. The two co-authors again advised
on strategic decisions, such as the presentation of the manuscript. The article appeared
in the conference proceedings of the IEEE ICDM Workshop on Data Mining for Educa-
tional Assessment and Feedback (ASSESS 2015) that was held at the IEEE International
Conference on Data Mining in Atlantic City, NJ, in November, 2015.
Zehner, F., Goldhammer, F., & Sälzer, C. (2015). Using and improving coding guides for and by
automatic coding of PISA short text responses. In Proceedings of the IEEE ICDM Workshop on Data
Mining for Educational Assessment and Feedback (ASSESS 2015). doi: 10.1109/icdmw.2015.189
2.2.1 Context and Related Work
This study adapted the approach for automatic coding of text responses presented in
publication (A). The adaptation reduces the manual effort for model training by using
reference responses from so-called coding guides to start the model training. These are
documents for manual coding. At the same time, the study attempted to show how
to automatically improve the coding guides by adding empirical response types. The
procedure can also be used to systematically create coding guides from scratch. For the
analyses, the data described for publication (A) were used.
In order to satisfy requirements of machine learning procedures, most systems for
automatic coding rely on relatively large amounts of manually coded training data. The
training data are expensive to collect and can also contain incorrect codes, mainly due to
the required mass. Hence, different research groups have recently striven to find appropriate procedures to train models with less but maximally informative data (Zesch, Heilman, & Cahill, 2015; Dronen, Foltz, & Habermehl, 2014; Ramachandran & Foltz, 2015; Sukkarieh & Stoyanchev, 2009).
Conceptually comparable to the procedure proposed in this study, the Powergrading approach was published with partly remarkable performance (Basu, Jacobs, & Vanderwende, 2013). Despite the powerful method, its excellent performance might have partly stemmed from the relatively low language diversity in the analyzed responses evoked by the questions. Powergrading's drawback is that it uses a fixed number of clusters, which is not plausible for typical assessment needs because different questions naturally evoke different numbers of response types. Moreover, the unsupervised system described in Mohler and Mihalcea (2009) also makes use of LSA similarities between responses, however, without grouping them. Their central idea is to expand the response key by words from the empirical responses with the highest similarity. A weakness of this approach is that, instead of allowing for different lines of reasoning, it might only add synonyms to the response key, which should already have been considered similar by the vector space model. Since the processing of natural language is relevant to various domains, such as dialog systems, a vast variety of implementations with different adaptations is available in the literature. For example, extrema are an interesting concept for weighting single words in the vectors of a vector space model (Forgues, Pineau, Larchevêque, & Tremblay, 2014, December). Yet, the authors found that extrema do not outperform the baseline in the domain of questions.
2.2.2 Adaptation of the Processing of Responses
Briefly depicted, the approach described in publication (A) is used to group the empirical responses into types. In contrast to the original procedure, the researcher determines the number of clusters based on the development of the residuals (Rasch, Kubinger, & Yanagida, 2011). In a next step, the software processes the reference responses from the coding guides in the same way as the empirical responses. Next, the reference responses are projected into the semantic space, and each is assigned to the most similar type. Ideally, all types then have unambiguously coded reference responses assigned. This model can serve for automatic coding, but it can contain the following conflicts.
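Before turning to the conflicts, the projection-and-assignment step just described can be sketched. The following Python toy (invented vectors and function names, not the software's actual interface) assigns coded reference responses to the most similar response type via cosine similarity to the type centroids:

```python
import numpy as np

def assign_to_types(reference_vecs, centroids):
    """Assign each reference response (a row vector in the semantic space)
    to the most similar response type, i.e., the type centroid with the
    highest cosine similarity."""
    ref = reference_vecs / np.linalg.norm(reference_vecs, axis=1, keepdims=True)
    cen = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    sims = ref @ cen.T            # cosine similarities, shape (n_refs, n_types)
    return sims.argmax(axis=1), sims.max(axis=1)

# Toy example: two type centroids, three coded reference responses.
centroids = np.array([[1.0, 0.1], [0.1, 1.0]])
references = np.array([[0.9, 0.2], [0.2, 0.8], [1.0, 0.0]])
types, sims = assign_to_types(references, centroids)
print(types)  # [0 1 0]
```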
Conflicts of type I represent response types without reference responses. In contrast, in Conflict II, multiple reference responses are assigned to one response type although they belong to different classes (e.g., correct and incorrect). This would reveal an insufficient semantic space or cluster model. In Conflict III, a reference response is excluded because it is less similar to its cluster centroid than 95 percent of the responses that had been assigned to the type. This conflict unveils reference responses that do not have empirical equivalents.
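The 95-percent criterion of Conflict III can be sketched as follows; this Python toy uses Euclidean distances to the centroid and invented vectors purely for illustration (the software's actual distance metric may differ):

```python
import numpy as np

def is_conflict_iii(reference_vec, member_vecs, centroid, pct=95):
    """Flag a reference response as Conflict III: it is farther from the
    type's centroid than `pct` percent of the empirical responses assigned
    to that type, i.e., it lacks an empirical equivalent."""
    member_dists = np.linalg.norm(member_vecs - centroid, axis=1)
    ref_dist = np.linalg.norm(reference_vec - centroid)
    return ref_dist > np.percentile(member_dists, pct)

members = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
centroid = members.mean(axis=0)
print(is_conflict_iii(np.array([2.0, 2.0]), members, centroid))   # True
print(is_conflict_iii(np.array([0.05, 0.05]), members, centroid)) # False
```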
For Conflict I, a new empirical response needs to be sampled so that it can be added to the coding guide. When a regression is carried out, the most informative responses can be selected by optimal design algorithms (e.g., Fedorov, 1972), which turned out to be highly effective (Dronen et al., 2014). For clustering approaches, the overall goal is to identify one response that is most prototypical for the whole type. Often, responses close to their centroid are simply assumed to be the most prototypical (e.g., Zesch et al., 2015). In Ramachandran and Foltz (2015), a list heuristic was used to find the response with the highest similarities and the most connections. In other terms, the heuristic seeks the densest area within a cluster. In order to examine dense areas in clusters, the present study adopted an approximation similar to this list heuristic, making use of the fact that dense areas comprise many responses with relatively low
pairwise distances. For this, responses' pairwise distances to other responses within the cluster were sorted in increasing order. Finally, those responses with the most relatively low distances belonged to the densest area. This procedure defines dense areas analogously to kernel density estimates, which generally provide powerful methods for identifying dense areas but cannot be applied to hyperdimensional spaces (cf. Scott, 1992).
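The list-heuristic approximation just described can be sketched in Python; the median-based threshold for "relatively low" pairwise distances is an assumption made for this illustration:

```python
import numpy as np

def densest_response(cluster_vecs):
    """Approximate the densest area of a cluster: for each response, count
    how many of its pairwise distances to the other responses are relatively
    low (below the median of all pairwise distances in the cluster) and take
    the response with the most low distances as the prototype candidate."""
    diffs = cluster_vecs[:, None, :] - cluster_vecs[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    n = len(cluster_vecs)
    threshold = np.median(dists[np.triu_indices(n, k=1)])
    low_counts = (dists < threshold).sum(axis=1) - 1  # exclude self-distance 0
    return int(low_counts.argmax())

# Toy cluster: a dense group near the origin plus one remote response.
cluster = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [3.0, 3.0]])
print(densest_response(cluster))  # 0 (a member of the dense group, not index 3)
```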
2.2.3 Research Questions
The analyses in the study answered three research questions in order to evaluate estab-
lished coding guides, provide empirical evidence about which responses should be sampled
as new prototypes, and evaluate the newly proposed procedure.
1. How many Conflicts I, II, and III occur for PISA coding guides and the German
PISA 2012 data?
(I) empirical response type without reference response equivalent
(II) contradicting reference responses within a response type
(III) reference response without empirical response type equivalent
2. Which responses constitute the densest area within a response type and, thus, good prototypes?
3. How well does the new procedure perform ...
(a) compared to the original procedure?
(b) dependent on the number of clusters being extracted?
2.2.4 Result Highlights
A relatively small number of response types with contradicting reference responses (Conflict II) occurred across all items (0–2). These cases showed insufficiently specified response types in the corresponding semantic space or clustering model. A larger number, namely 21 percent on average, of the reference responses in the coding guides were redundant due to the lack of empirical equivalents (Conflict III). Analogously, 72 percent of the
empirical response types on average were not covered by the reference responses in the
coding guides (Conflict I). The presented numbers are highly dependent on the number
of extracted clusters, but nevertheless, they show the large potential for improvements in
the coding guides.
The analysis focusing on the second research question demonstrated that responses
close to the centroid belong to the relatively densest area within the response type. That
was true across all response types and items. Because the employed list heuristic can be
misleading in cases in which a very small dense area is present in the response type, the
responses close to the centroid appear to constitute the optimal prototypes. They are located in the relatively densest area and are most representative of all responses in the type.
The performance of the new procedure turned out to vary unreliably if fewer than 100 clusters were extracted. From this point on, the performance became more accurate and reliable, without deviating too much from the original procedure, which uses all responses for training, as opposed to the new procedure, which only uses about 2 percent when extracting 100 clusters. The requirement for a relatively high number of, and thus small, clusters probably stemmed from the fact that only one response was sampled as a new prototype for response types that lacked a reference response. It remains an open question how to balance the number of clusters and the number of sampled responses.
2.3 Publication (C): The Reading Gender Gap in Text Responses
The initial interest in the research matter was brought up by the first author in the
context of the dissertation. The additional software development, further analyses, and
the structuring along with the writing of the manuscript were carried out by the first
author. The two co-authors advised on the analyses, on strategic decisions, such as the
presentation of the data and the target journal, and on the final editing of the manuscript.
The manuscript was submitted to a journal in March 2016; the peer review is currently pending.
Zehner, F., Goldhammer, F., & Sälzer, C. (submitted). Further Explaining the Reading Gender Gap:
An Automatic Analysis of Student Text Responses in the PISA Assessment.
2.3.1 Context and Related Work
Reading literacy is characterized by remarkably consistent gender differences of relevant magnitudes (e.g., Mullis, Martin, Foy, & Drucker, 2012; NCES, 2015; OECD, 2014).
For this vivid research area, raw responses to open-ended questions have not been an
accessible source of information hitherto. Particularly in large-scale assessments, their
linguistic information could not be considered due to the mass of data. This study, first,
attempted to add to the understanding of the gender gap in reading literacy by analyzing
the natural language in short text responses. It identified responses that are typical for
either of the genders and contrasted their features by processing the responses via the
software described in publication (A). Second, it compiled a theoretical framework about
how and which linguistic features in responses can be mapped to the preceding cognitive processes.
In PISA, girls consistently outperformed boys (OECD, 2014). That is the case across all countries—with only three non-significant exceptions—and across all cycles, constituting 268 comparisons in total. With the scale's standard deviation of 100, the OECD average of the gender gap ranged from 31 points in 2000 to 39 points in 2009. These numbers show an astonishingly stable figure that appears even more remarkable in light of the fact that the effect was not always replicated in other studies (Elley, 1994; Hyde & Linn, 1988; Thorndike, 1973; White, 2007; Wolters, Denton, York, & Francis, 2014). The Progress in International Reading Literacy Study (PIRLS; Mullis et al., 2012) and the National Assessment of Educational Progress (National NAEP; NCES, 2015) found consistent but smaller effects. For adults, the Programme for the International Assessment of Adult Competencies (PIAAC) did not find significant gender differences in reading literacy for most participating countries, including Germany (OECD, 2013a).
All these variations across studies as well as the consistency within but not across
PISA, PIRLS, and NAEP support the view of Lafontaine and Monseur (2009). They
attributed the varying gender differences across studies to methodological decisions such
as age- versus grade-based populations, and they emphasized the effect of whether the
response format is open-ended or closed. Besides methodological reasons, mostly reading engagement and reading strategies have been emphasized (Artelt, Naumann, & Schneider, 2010).
Whether they are the original source of the differences or only a mediator of external influences, the difference would always be inherent to the students' cognitions. The
theoretical framework depicts the features in responses that can be mapped back to the cognitions. Here, the framework is only briefly sketched. During reading and comprehending, the students build cognitive situation models comprising propositions explicitly given by, inferred from, or associated with the text, called micro- and macropropositions (Kintsch & van Dijk, 1978). In order to answer a question such as in the PISA assessment, the students identify the question focus and category. Then, they query their episodic and generic knowledge structures, including the situation model, and winnow them down to propositions that are relevant and compatible with the question focus and category (Graesser & Franklin, 1990). The students use these propositions to formulate the final response by concatenating them and enriching them with linguistic structures specific to the question category (Graesser & Clark, 1985).
For the semantic measures (micro and relevance), gender types instead of the students' real gender served as the split criterion. That is, only semantic response types that comprised a significantly dominating proportion of one of the genders were included in these analyses and determined the group assignment. The rationale was that it was not reasonable to assume that all responses by one gender were semantically homogeneous, but rather that cognitive types exist that are dominated by one gender. For the analyses, the
data of the reading items described for publication (A) were used.
2.3.2 Automatic Processing of Features in Text Responses
Besides its capability to group responses into semantic types, the software now additionally
annotated words with their parts of speech (POS). Specific words of particular POS were
then considered proposition entities (PEs), genuinely referring to the situation model.
The software captured (I) the proposition entity count (PEC), representing how many
PEs were incorporated into a response. (II) The micro measure indicated the semantic
similarity of the response’s PEs to the stimulus and question. PEs with low values in the
micro measure constituted macropropositions. (III) The relevance measure indicated the
semantic similarity of the response’s PEs to correct example responses from the PISA
coding guides. Both (II) and (III) used the maximum cosine similarity.
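A minimal Python sketch of measures (I) to (III); the two-dimensional toy vectors stand in for real LSA vectors, and the threshold of 0.5 for distinguishing macro- from micropropositions is an assumption made purely for illustration:

```python
import numpy as np

def max_cosine(pe_vecs, target_vecs):
    """Maximum cosine similarity of each proposition entity (PE) to a set of
    target vectors: stimulus/question words for the micro measure, correct
    example responses from the coding guides for the relevance measure."""
    a = pe_vecs / np.linalg.norm(pe_vecs, axis=1, keepdims=True)
    b = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    return (a @ b.T).max(axis=1)  # one value per PE

pes = np.array([[1.0, 0.0], [0.0, 1.0]])       # two PEs from one response
stimulus = np.array([[1.0, 0.0], [0.7, 0.7]])  # two stimulus/question words
pec = len(pes)                     # (I) proposition entity count
micro = max_cosine(pes, stimulus)  # (II) micro measure
is_macro = micro < 0.5             # low micro similarity -> macroproposition
print(pec, micro)
```

The relevance measure (III) would use the same `max_cosine` function with vectors of correct example responses as targets.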
2.3.3 Research Questions
The analyses resulted from three research questions that investigated the gender gap in
reading literacy and further explored the measures that were derived from the theoretical framework.
1. How do girls and boys differ in the number of propositions they use in their responses?
2. How do gender-specific responses differ with respect to the use of micro- vs. macropropositions and the extent to which these are relevant?
3. How are the measures for response features related to further variables, such as
reading literacy and test motivation?
2.3.4 Result Highlights
A linear regression, controlling for the response correctness, revealed differences across
gender types from 2.8 to 4.9 PEs. This shows that girl types always integrated significantly
more elements from the situation model into their responses. That was true for correct
as well as for incorrect responses. For four items, correct responses were additionally
associated with more PEs, irrespective of the gender type. The analyses furthermore delivered empirical evidence for the plausibility of analyzing gender types instead of, or in addition to, plain gender.
In order to not confound the measures with the PEC, the relative frequencies of
relevant and irrelevant micro- and macropropositions within a response were analyzed.
Summarized, girl types used more relevant and fewer irrelevant PEs than boys did. Similar
to the PEC, this was generally the case for correct and incorrect responses, whereas
the effect of the gender type on the frequency of relevant PEs was more pronounced for
incorrect responses than for correct ones. This was similarly true for irrelevant PEs. Only
for three items, the correct responses were similar across gender types, while the incorrect
responses repeated the described figure of girl types using more relevant PEs. The girl
types furthermore adapted more successfully to the level of reasoning within the situation
model. That is, they used micropropositions if the question referred to the text base and
macropropositions when the question asked for information that was not explicitly stated
in the text. Boy types tended to do the exact opposite.
Briefly summed up, the boy types seem to involve a less stable situation model and a struggle with retrieving and inferring from it. The high number of PEs integrated in girl-specific responses emphasizes that girl types can liberally juggle the information in the situation model. The relatively few PEs that boy types tend to use are more often irrelevant than those in girl-specific responses—and this is true regardless of the response correctness for almost all items.
3 Discussion
The first subsection disentangles the approaches and findings of the three studies and then links them. Corresponding to pillar (I), publication (A) constitutes the basic software development and evaluation. It forms the base for the two other studies. They are representatives of pillars (II) and (III), demonstrating by means of use cases how automatic processing of text responses can enhance assessment in research and how it can add to open content-related research questions. Both the achievements and the limitations are discussed. The second subsection takes up the constraints and outlines the implications and potential concrete future developments of the research initiated in the dissertation.
3.1 Discussing and Linking the Main Findings
Prior to publication (A), a software had been developed in the context of the dissertation that automatically categorizes and codes text responses. The software assigns, for instance, the codes correct and incorrect to responses. Despite the common terminology scoring (e.g., Leacock & Chodorow, 2003) and grading (e.g., Burrows et al., 2014), the dissertation's publications use the term coding throughout, in order to underline that scoring is a special case of coding and that the proposed collection of methods is also capable of nominal, polytomous coding. Not all scoring techniques are capable of nominal coding, for example, those employing regression (e.g., Dronen et al., 2014). Generally, the plain development of the software in publication (A) served as the base for publications (B) and (C). The software's positive evaluation constituted a necessary requirement for legitimating further applications, such as the ones in the subsequent publications. The agreement beyond chance between the automatically generated codes and those produced by humans was judged as fair to good up to excellent, and the performance was judged as acceptable down to sample sizes of about 250 participants. Besides this minimum requirement of a positive evaluation, particularly the software's potential to group responses into semantically homogeneous types without requiring supervised training, which would
involve manually annotated data, catalyzed the wide-ranging possibilities the software
offers. The significance of innovations attainable through the software is apparent in two
key findings of the latter two publications. First, publication (B) showed that, on average,
the analyzed established coding guides only covered about 28 percent of empirically occurring response types. At the same time, the software enabled the automatic extension
of the coding guides in order to cover the remaining 72 percent. Second, publication (C)
concluded that the difficulties some boys face in reading were associated with a reduced
availability and flexibility of their cognitive situation model and their struggle to correctly
identify a question’s aim. The two examples show how the software can be utilized as a
tool in assessments and the research process. Previously, text responses had been present
in studies but had seldom been accessible as a source of information due to their large
mass and unstructured form. Except for the goal of scoring, the volume of responses in
large-scale assessments as well as the complications in objectively processing them made
text responses a productive dairy cow that has rarely been milked.
The potential implications of the dissertation’s development and findings are best
regarded in the picture of the previously described three pillars. In future extensions of
the presented research, it will be necessary to balance the further development of the
software, on the one hand (pillar I), and its use cases for the assessment process and
content-related research, on the other hand (pillars II and III). The technical aspect of
natural language processing is a research matter of great interest in computer science
at the moment (Cambria & White, 2014). Due to the growing demand for language processing in artificial intelligence (e.g., Woo, Botzheim, & Kubota, 2014), the analysis of big data (cf. Jin & Hammer, 2014), and dialog systems (e.g., Shrestha, Vulić, & Moens, 2015), an unwieldy body of corresponding technical innovations is being published and will
continue to be published over the next decade. For the aim of improving human-machine
interaction, the processing of natural language constitutes an indispensable component
of complex computer systems, such as intelligent cars (e.g., Kennewick et al., 2015), and will become even more prevalent in everyday human life with the further spread of the
Internet of Things (cf. Ding, Jin, Ren, & Hao, 2013; Whitmore, Agarwal, & Da Xu, 2015).
Claiming to be devoted to applied research, the extension of the presented studies will
face the crucial challenge of selecting the relevant innovations from this technical research
that can further enhance social science research and assessments. Thus, the balance of the
three pillars will be important since only the identification of (feasible) demands in the
social sciences, in the form of use cases, will point out which implementations of natural
language processing are capable of advancing them.
In the assessment area, textual responses have customarily been used to extrapolate,
for example, the participants’ competency. Because this textual information is elevated to
an aggregated level by the further processing through scoring and scaling, it might appear
acceptable that the prevailing paradigm in educational and psychological assessment goes without an explicit theoretical framework about the underlying language. In contrast, for a study that uses basic linguistic information from text responses, it appears imperative
to supply a corresponding theoretical framework for the operationalizations. Which were
the processes that resulted in the analyzed text? How are the processes and their outcome
related to the construct / domain of interest? The discussion of these questions has primarily been evolving in and is noticed by discourse and survey research (Graesser & Clark, 1985; Graesser & Franklin, 1990; Moore & Wiemer-Hastings, 2003; Tourangeau, Rips, & Rasinski, 2009). A corresponding theoretical framework for the context of reading
assessments was specified in publication (C). It draws the path that information takes
from the reading stimulus through the tested person’s cognitions to finally show up in
the text response. Nevertheless, further elaboration on the framework will be necessary
and will add further insights about the observed measures.
Another crucial question has not been dealt with explicitly in the dissertation’s articles
yet and will be the subject of a work yet to come: What is the conceptual understanding
of semantics (meaning) in the implemented software and studies? This matter is often
critically discussed and considered in philosophy, discourse research, or computer science disciplines such as artificial intelligence (cf. Cambria & White, 2014; de Vega, Glenberg, & Graesser, 2008; Foltz, 2003; Peirce, 1897/1955). In the dissertation, the basic concept
for operationalizing semantics utilizes LSA. The founders of LSA believe that we gain knowledge about the meaning of words mainly from the occurrences of the words themselves (Landauer & Dumais, 1997; Landauer, 2011), as opposed to the sensory world experiences we associate with the objects related to the words. These two model types are referred to as amodal symbol models versus embodiment models (Glenberg et al., 2008). Both are supported by empirical evidence. LSA's heart is the singular value decomposition. It is a powerful, purely mathematical method to implement an amodal symbol model, because it sensitively traces relationships of words via their co-occurrences and, even more importantly, via their non-co-occurrences (Martin & Berry, 2011). In the end, each word's meaning is represented by the direction its vector points in. To put it more precisely, the meanings are encoded by the angles between the vectors. Thus, the words
are assumed to be pure symbols, and their interrelations represent their meanings. Studies
that work at the linguistic level of text responses need to explicitly specify their implied
understanding of language and meaning in order to verify the appropriateness of the
employed operationalizations.
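The operationalization sketched above can be reproduced in a few lines of numpy: a toy term-document matrix is decomposed by SVD, and word meanings are compared via the cosines of the angles between the resulting row vectors. The matrix and the choice of a single latent dimension are purely illustrative, not the dissertation's corpora:

```python
import numpy as np

# Toy term-document matrix (rows: words, columns: documents). "car" and
# "auto" never co-occur in the same document but share contexts with "wheel".
#                 d1  d2  d3  d4  d5
X = np.array([[1., 0., 1., 0., 0.],   # car
              [0., 1., 0., 1., 0.],   # auto
              [1., 1., 1., 1., 0.],   # wheel
              [0., 0., 0., 0., 1.]])  # flower

# Truncated singular value decomposition: keep k latent dimensions; each
# word's meaning is the direction of its row vector in the latent space.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 1
word_vecs = U[:, :k] * s[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Despite zero co-occurrence, "car" and "auto" point in the same direction
# in the dominant dimension, while the unrelated "flower" has weight ~0.
print(cosine(word_vecs[0], word_vecs[1]))  # ~1.0
```

This illustrates the tracing of non-co-occurrences mentioned above: the latent dimension pulls together words that never co-occur but share contexts.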
In addition to the symbolic concept of word meaning, further characteristics of the
applied language models need consideration. For the most part, the dissertation employed
the paradigm bag of words, which is inherent to LSA as well. That is, words are assumed
to be isolated units not interacting with each other and syntactic information is most often
neglected. All words in a response or in a text corpus document are thrown into the bag,
irrespective of their original order and relations. It is apparent to every individual with
a minimum of verbal abilities that there is more to natural language than the oblivious
concatenation of syntactically context-free words. Such language use could be pictured
as a beheaded chicken, wildly running around in its barn and crashing into every wall and
object it meets on its way. At the same time, every individual might have experienced situations with improper communication, such as a phone call with a bad signal, trying to make a foreign-language speaker understand the way to the underground, or chatting with a very small child. Sometimes these situations come along with momentous
misunderstandings; sometimes they go surprisingly smoothly and clearly. The limitations of the bag of words are intuitive, diverse, and routinely criticized, among others, in the context of co-occurrence, which is the second paradigm in LSA (e.g., Glenberg & Mehta, 2008; Higgins et al., 2014; Leacock & Chodorow, 2003; Perfetti, 1998). However, the studies
presented in the dissertation are only three representatives of a large body of studies
(a) using the paradigms bag of words as well as co-occurrence and yet (b) resulting
in remarkable, valid findings that advance the studies' research areas (e.g., Graesser et al., 1999). In spite of their apparent shortcomings, these paradigms manage to capture the relationships of words sufficiently well to produce new, valid, reliable, and objective knowledge. For researchers of applied disciplines, the trade-off
between feasibility and complexity of models is a critical challenge. The bag of words is a prototypical example of this trade-off when the condition of feasibility needs to be met. That said, it is important to emphasize that there is a fine line between choosing a feasible model and simply justifying any concept by its easy implementation; the models need to prove their effectiveness in producing knowledge of scientific value. As shown, the applied paradigms are capable of doing so.
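The bag-of-words paradigm and its blindness to word order and negation can be made concrete in a few lines of Python; the tiny stop word list is an assumption for illustration (many common stop word lists do include "not"):

```python
from collections import Counter

STOP_WORDS = frozenset({"the", "a", "is", "not"})  # illustrative stop word list

def bag_of_words(text):
    """Reduce a text to an unordered multiset of word counts; word order,
    syntax, and the filtered stop words (including the negation "not")
    are lost."""
    return Counter(t for t in text.lower().split() if t not in STOP_WORDS)

# The two responses contradict each other, yet their bags are identical:
a = bag_of_words("The answer is correct")
b = bag_of_words("The answer is not correct")
print(a == b)  # True
```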
3.2 Limitations and Future Directions
The limitations as well as the general structure of future works stemming from the presented studies have already been touched on in the previous subsection. This final subsection names concrete research questions that need to be answered in subsequent studies.
The discourse is structured along the three pillars.
The central limitation of the dissertation's software and analyses is introduced in pillar (I) in the general layout of the software: the use of the bag of words. Among others, it neglects syntax, relations between words, pronoun correlates, and negation by dedicated words such as not (as opposed to negation by using antonyms). As the result of the use of
the bag of words, so-called stop words are often ignored, because they are assumed to only
carry unimportant semantic information. As already mentioned, the techniques in natural
language processing still show excitingly steep improvements with regard to their conceptual scopes. For example, textual entailment is a promising concept that is able to verify whether one sentence can be transformed into another one without losing or biasing the original information; the interested reader might want to check the EXCITEMENT project (, [2016-03-09]). Interestingly, it is a well-known phenomenon in the area of machine learning that steep conceptual improvements are often not mirrored in accordingly steep performance improvements on tasks when the new concepts are implemented. Rather, very simple models,
like the bag of words, already attain remarkable performance levels, whereas the use of
additional, more sophisticated, and maybe also more complex techniques often only adds
a fraction of incremental performance (e.g., see Higgins et al., 2014). Textual entailment is a prototypical example of this phenomenon. Although one would assume that the consideration of linguistic dependencies, negation, and linguistic inferences would give a large boost to systems' performance levels, that is often not the case, if at all (e.g., Dzikovska, Nielsen, & Leacock, 2016). The main weakness of more fine-grained computational linguistic techniques, like textual entailment, often is their high dependency on well-formed language (Dzikovska et al., 2016; Higgins et al., 2014); this extends even to punctuation, about which students in low-stakes assessments typically do not care at all. Especially
responses in low-stakes assessments, such as in the analyzed PISA data, are teeming with linguistic mistakes (spelling, syntax, punctuation), and regularly even human coders are not able to identify the response intention. Although it seems paradoxical, for these data, imprecise methods like the bag of words appear to work more precisely, because the necessary fuzziness for extrapolating the response intention is inherent to their low precision.
Further developments for normalizing or recovering the language (similar to automatic
spelling correction) might aid in making the fine-grained techniques usable for this area
in the next decade. In addition to the discussed issues, further progress in the context
of pillar (I) will be possible when considering further models such as Latent Dirichlet Allocation (Blei, Ng, & Jordan, 2003; e.g., Saif, Ab Aziz, & Omar, 2016) or using hybrid techniques such as LSA combined with fuzzy logic (Mittal & Devi, 2016). Several other
steps in the automatic coding process can and will be improved, for example by utilizing more powerful machine learning methods such as support vector machines. Nonetheless, the balance between methodological power on the one hand and transparency as well as flexibility for practitioners (researchers in the social sciences) on the other is a central objective of the presented research and needs to be considered carefully. Next, the software was evaluated with German data only. Generally, the collection of methods and the software are easily scalable to other languages; this will be the core of a further line of studies. Snowball, one implementation of one of many stemming algorithms, can serve as a showcase for the collection's scalability: it is available for six Germanic, six Italic, two Slavic, and two Uralic languages as well as for Turkish, Irish, Armenian, and Basque (Porter, 2001).
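To illustrate the role stemming plays in such a pipeline, a deliberately crude suffix stripper is sketched below; the real Snowball algorithms are considerably more refined, and the suffix list here is purely illustrative:

```python
# Crude suffix stripping in the spirit of stemming algorithms such as
# Snowball; the suffix list is illustrative, not the actual algorithm.
GERMAN_SUFFIXES = ("ungen", "ung", "heit", "keit", "en", "er", "e", "n", "s")

def crude_stem(word, suffixes=GERMAN_SUFFIXES, min_stem=3):
    """Strip the longest matching suffix while keeping a minimal stem length."""
    word = word.lower()
    for suffix in suffixes:  # ordered longest first
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word
```

Mapping inflected forms such as "Lösungen" and "Lösung" to one stem shrinks the vocabulary the bag of words has to cover, which is precisely what makes the approach scalable across morphologically rich languages.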
Finally, the publication of the free software is due; it will allow practitioners to take advantage of automatic coding of short text responses free of charge.
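The dimensionality reduction underlying LSA, one of the semantic models discussed for pillar (I), can be sketched as a truncated singular value decomposition of a term-by-response matrix. The tiny matrix below is made up for illustration; the models in the dissertation were trained on large corpora:

```python
import numpy as np

# Toy term-by-response matrix: rows are terms, columns are responses.
A = np.array([
    [1., 1., 0., 0.],
    [1., 0., 1., 0.],
    [0., 1., 1., 1.],
    [0., 0., 1., 1.],
])

def lsa_reduce(matrix, k):
    """Rank-k approximation via truncated SVD (the core of LSA)."""
    U, s, Vt = np.linalg.svd(matrix, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]
```

By the Eckart-Young theorem, keeping more singular values can only decrease the reconstruction error; the low-rank space is where semantically related responses move closer together.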
In the context of pillar (II), further innovations regarding the assessment process are to be expected. For example, automatic coding is highly suitable for being employed in computerized adaptive testing (CAT), because it allows CAT's scope to be extended to open-ended response formats (e.g., He, 2015). Since CAT aims at high economic efficiency, the corresponding studies will need to investigate how many coding errors are acceptable before the adaptive testing procedure ceases to be efficient. Another possible innovation of the assessment process could be a probing mechanism during assessment. The background is that some test takers' responses lack one single bit of information needed to cross the threshold from incorrect to correct. But this does not mean that the test taker was unable to access this last bit. Since assessment is typically interested in the latent construct, it should indeed distinguish between those test takers who are able to supply the missing bit and those who are not. It might turn out that a specific automatically identified response type contains such kinds of responses; that is, they are almost correct but miss the last important bit of information. Automatic coding employed in computer-based assessment would then offer the opportunity to probe the test taker for the missing bit. Of course, the differential implications of such a mechanism, which selectively probes specific students, would need thorough investigation as to whether it harms the comparability across test takers. The mechanism would be similar to the concept underlying learning tests (Guthke, 1992), which offer the student feedback about the misconception observable in the (multiple-choice) response. The difference, of course, is that learning tests aim to actually teach the student by evoking specific cognitive processes, whereas this would be a confounding aspect for the probing mechanism, in which the feedback only serves as another stimulus to check whether the student actually has access to the missing bit of information. This probing would be similar to what human test administrators do when test persons give borderline responses. Further exemplary ideas of how the developed software can innovate assessments can be found in publication (A).
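Such a probing mechanism could hinge on a simple decision rule over the similarity between a response and a prototype of the correct answer. The sketch below is purely hypothetical: the thresholds, the `triage` rule, and the prototype representation are assumptions for illustration, not part of the developed software.

```python
import math

def cosine(u, v):
    """Cosine similarity between two response vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def triage(response_vec, correct_prototype, lower=0.55, upper=0.75):
    """Hypothetical rule: responses just below the 'correct' threshold are
    flagged for an automatic probe instead of being coded incorrect outright."""
    sim = cosine(response_vec, correct_prototype)
    if sim >= upper:
        return "correct"
    if sim >= lower:
        return "probe"
    return "incorrect"
```

With these assumed thresholds, a response whose vector lies close to, but not quite at, the correct prototype would trigger a probe rather than an immediate incorrect code, mirroring what a human test administrator does with borderline responses.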
For pillar (III), aiming to enhance content-related research questions, the main innovation of the dissertation is that raw responses become accessible as a new source of information. First, new assessment instruments can be developed that use open-ended response formats; this way, they could improve the assessed scope of the construct. For example, personality questionnaires often force the respondents to evaluate their personality along given categories. But in the sense of Kelly (1955), it is reasonable to assume that some personality characteristics might be constructs that are only peripherally relevant for some persons, while being central to the behavioral tendencies of others. Open-ended assessment would allow capturing personality profiles that are not forced into a previously specified dimensionality structure, such as in the Big Five (Costa & McCrae, 1985) or the 16 PF (Cattell, Cattell, & Cattell, 1993). Nonetheless, at the same time, the automatic processing would allow grouping the responses in order to locate each response relative to others' responses. Analogous instruments for other constructs in educational research are similarly conceivable. Second, automatic coding of text responses makes a more objective assessment of personality traits possible. For example, He (2013) impressively demonstrated how written texts can be used to predict psychiatric classifications. Third, cultural comparisons in international studies can be carried out at a new level of information; namely, semantic types of responses can be compared. Fourth, not only research can profit from the new developments in natural language processing; assessment in the educational context can also be improved by digital learning environments that operate through written language (e.g., Graesser et al., 1999). In particular, recent innovations in the processing of speech (e.g., Zechner, Higgins, & Xi, 2007) promise to foster developments in the area of learning environments. Fifth, further construct domains will be of interest and will require the development of other approaches to processing the responses. For example, the processing of mathematical equations is a recently addressed topic (e.g., Lan, Vats, Waters, & Baraniuk, 2015; Smarter Balanced Assessment Consortium, 2014), as are scientific reasoning (e.g., Guo, Xing, & Lee, 2015) and programming source texts (e.g., Glassman, Singh, & Miller, 2014; Nguyen, Piech, Huang, & Guibas, 2014; Srikant & Aggarwal, 2014). The last research interest in particular is currently driven by platforms for Massive Open Online Courses (MOOCs), and the entire research field of automatic coding will likely be dominated by this application in the coming years (e.g., Jorge Diez, Alonso, Troncoso, & Bahamonde, 2015; Mi & Yeung, 2015). Similarly to large-scale assessments, MOOCs sometimes require thousands of students to be evaluated, and this mass of data challenges the evaluation procedures.
References
Artelt, C., Naumann, J., & Schneider, W. (2010). Lesemotivation und Lernstrategien.
In E. Klieme et al. (Eds.), PISA 2009 (pp. 73–112). Münster: Waxmann.
Basu, S., Jacobs, C., & Vanderwende, L. (2013). Powergrading: A clustering approach
to amplify human effort for short answer grading. Transactions of the Association
for Computational Linguistics,1, 391–402.
Bejar, I. I. (2012). Rater cognition: Implications for validity. Educational Measurement:
Issues and Practice,31 (3), 2–9. doi: 10.1111/j.1745-3992.2012.00238.x
Birenbaum, M., & Tatsuoka, K. K. (1987). Open-ended versus multiple-choice response
formats - It does make a difference for diagnostic purposes. Applied Psychological
Measurement,11 (4), 385–395. doi: 10.1177/014662168701100404
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. Journal of
Machine Learning Research,3, 993–1022.
Borra, S., & Di Ciaccio, A. (2010). Measuring the prediction error. A comparison of cross-
validation, bootstrap and covariance penalty methods. Computational Statistics &
Data Analysis,54 (12), 2976–2989. doi: 10.1016/j.csda.2010.03.004
Bridgeman, B. (1991). A comparison of open-ended and multiple-choice question formats
for the quantitative section of the Graduate Record Examinations General Test.
ETS Research Report Series,1991 (2), i–25. doi: 10.1002/j.2333-8504.1991.tb01402
Burrows, S., Gurevych, I., & Stein, B. (2014). The Eras and Trends of Automatic Short
Answer Grading. International Journal of Artificial Intelligence in Education, 1–58.
doi: 10.1007/s40593-014-0026-8
Burstein, J., Kaplan, R. M., Wolff, S., & Lu, C. (1996). Using lexical semantic techniques
to classify free-responses. In E. Viegas (Ed.), Proceedings of the ACL SIGLEX Work-
shop on Breadth and Depth of Semantic Lexicons (pp. 20–29). Santa Cruz, Califor-
nia: Association for Computational Linguistics.
Cambria, E., & White, B. (2014). Jumping NLP curves: A review of natural language
processing research [review article]. IEEE Computational Intelligence Magazine,
9(2), 48–57. doi: 10.1109/MCI.2014.2307227
Cattell, R. B., Cattell, A. K., & Cattell, H. E. P. (1993). 16PF Fifth Edition Question-
naire. Champaign, IL: Institute for Personality and Ability Testing.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psy-
chological Measurement,20 (1), 37–46. doi: 10.1177/001316446002000104
Costa, P. T., & McCrae, R. R. (1985). The NEO personality inventory. Odessa: Psy-
chological Assessment Resources.
de Vega, M., Glenberg, A. M., & Graesser, A. C. (Eds.). (2008). Symbols and embodiment:
Debates on meaning and cognition. Oxford: Oxford Univ. Press.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990).
Indexing by latent semantic analysis. Journal of the American Society for Infor-
mation Science,41 (6), 391–407. doi: 10.1002/(SICI)1097-4571(199009)41:6<391::
Dikli, S. (2006). An overview of automated scoring of essays. Journal of Technology,
Learning, and Assessment,5(1).
Ding, Y., Jin, Y., Ren, L., & Hao, K. (2013). An intelligent self-organization scheme for
the Internet of Things. IEEE Computational Intelligence Magazine,8(3), 41–53.
doi: 10.1109/MCI.2013.2264251
Dronen, N., Foltz, P. W., & Habermehl, K. (2014). Effective sampling for large-scale
automated writing evaluation systems. arXiv preprint arXiv:1412.5659 .
Dzikovska, M. O., Nielsen, R. D., & Leacock, C. (2016). The joint student response
analysis and recognizing textual entailment challenge: Making sense of student
responses in educational applications. Language Resources and Evaluation,50 (1),
67–93. doi: 10.1007/s10579-015-9313-8
Elley, W. B. (1994). The IEA study of reading literacy. Oxford: Pergamon Press.
Fedorov, V. V. (1972). Theory of optimal experiments. New York: Academic Press.
Fleiss, J. L. (1981). The measurement of interrater agreement. In J. L. Fleiss, B. Levin,
& M. C. Paik (Eds.), Statistical methods for rates and proportions (pp. 598–626).
New York: John Wiley & Sons.
Foltz, P. W. (2003). Quantitative Cognitive Models of Text and Discourse Com-
prehension. In A. C. Graesser, M. A. Gernsbacher, & S. R. Goldman (Eds.),
Handbook of discourse processes (pp. 487–523). Mahwah, NJ: Erlbaum. doi:
Forgues, G., Pineau, J., Larchevêque, J.-M., & Tremblay, R. (2014, De-
cember). Bootstrapping Dialog Systems with Word Embeddings. Mon-
treal. Retrieved from
Gabrilovich, E., & Markovitch, S. (2007). Computing semantic relatedness using
Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th Interna-
tional Joint Conference on Artificial Intelligence (pp. 1606–1611).
Glassman, E. L., Singh, R., & Miller, R. C. (2014). Feature engineering for clustering
student solutions. In Proceedings of the first ACM conference on Learning@ scale
conference (pp. 171–172).
Glenberg, A. M., de Vega, M., & Graesser, A. C. (2008). Framing the debate. In M. de
Vega, A. M. Glenberg, & A. C. Graesser (Eds.), Symbols and embodiment (pp. 1–9).
Oxford: Oxford Univ. Press.
Glenberg, A. M., & Mehta, S. (2008). The limits of covariation. In M. de Vega, A. M. Glen-
berg, & A. C. Graesser (Eds.), Symbols and embodiment (pp. 11–32). Oxford: Ox-
ford Univ. Press.
Graesser, A. C., & Clark, L. F. (1985). Structures and procedures of implicit knowledge
(Vol. 17). Norwood, NJ: Ablex.
Graesser, A. C., & Franklin, S. P. (1990). QUEST: A cognitive model of question
answering. Discourse Processes,13 (3), 279–303. doi: 10.1080/01638539009544760
Graesser, A. C., Wiemer-Hastings, K., Wiemer-Hastings, P., Kreuz, R., Tutoring Re-
search Group, et al. (1999). AutoTutor: A simulation of a human tutor. Journal
of Cognitive Systems Research,1(1), 35–51.
Guo, Y., Xing, W., & Lee, H.-S. (2015). Identifying students’ mechanistic explanations in
textual responses to science questions with association rule mining. In Proceedings
of the IEEE ICDM Workshop on Data Mining for Educational Assessment and
Feedback (ASSESS 2015). doi: 10.1109/icdmw.2015.225
Guthke, J. (1992). Learning tests-the concept, main research findings, problems and
trends. Learning and Individual Differences,4(2), 137–151. doi: 10.1016/1041
He, Q. (2013). Text mining and IRT for psychiatric and psychological assessment. Uni-
versity of Twente.
He, Q. (2015, April). Combining automated scoring constructed responses and com-
puterized adaptive testing. Presented at the 2015 Annual Meeting of the National
Council on Measurement in Education (NCME), Chicago, IL.
Higgins, D., Brew, C., Heilman, M., Ziai, R., Chen, L., Cahill, A., . . . Blackmore, J.
(2014). Is getting the right answer just about choosing the right words? The role
of syntactically-informed features in short answer scoring. CoRR,abs/1403.0801 .
doi: arXiv:1403.0801v2
Hyde, J. S., & Linn, M. C. (1988). Gender differences in verbal ability: A meta-analysis.
Psychological Bulletin,104 (1), 53–69. doi: 10.1037/0033-2909.104.1.53
Jin, Y., & Hammer, B. (2014). Computational intelligence in big data [guest editorial].
IEEE Computational Intelligence Magazine,9(3), 12–13. doi: 10.1109/MCI.2014
Jorge Diez, O. L., Alonso, A., Troncoso, A., & Bahamonde, A. (2015). Including content-
based methods in peer-assessment of open-response questions. In Proceedings of the
IEEE ICDM Workshop on Data Mining for Educational Assessment and Feedback
(ASSESS 2015). doi: 10.1109/icdmw.2015.256
Kelly, G. A. (1955). The psychology of personal constructs: Volume 1: A theory of
personality. New York: WW Norton and Company.
Kennewick, R. A., Locke, D., Kennewick, M. R., Kennewick, R., Freeman, T., & Elston,
S. F. (2015). Mobile systems and methods for responding to natural language speech
utterance. Google Patents. Retrieved from
Kintsch, W., & van Dijk, T. A. (1978). Toward a model of text comprehension and
production. Psychological Review,85 (5), 363. doi: 10.1037/0033-295X.85.5.363
Lafontaine, D., & Monseur, C. (2009). Gender gap in comparative studies of reading com-
prehension: To what extent do the test characteristics make a difference? European
Educational Research Journal,8(1), 69–79. doi: 10.2304/eerj.2009.8.1.69
Lan, A. S., Vats, D., Waters, A. E., & Baraniuk, R. G. (2015). Mathematical language
processing: Automatic grading and feedback for open response mathematical ques-
tions. In Proceedings of the Second (2015) ACM Conference on Learning@ Scale
(pp. 167–176). doi: 10.1145/2724660.2724664
Landauer, T. K. (2011). LSA as a theory of meaning. In T. K. Landauer, D. S. McNamara,
S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 3–34).
Mahwah, NJ: Erlbaum. doi: 10.4324/9780203936399
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The Latent
Semantic Analysis theory of acquisition, induction, and representation of knowledge.
Psychological Review,104 (2), 211–240. doi: 10.1037/0033-295x.104.2.211
Leacock, C., & Chodorow, M. (2003). C-rater: Automated scoring of short-answer
questions. Computers and the Humanities,37 , 389–405.
Martin, D. I., & Berry, M. W. (2011). Mathematical foundations behind Latent Semantic
Analysis. In T. K. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.),
Handbook of Latent Semantic Analysis (pp. 35–55). Mahwah, NJ: Erlbaum. doi: 10
Mi, F., & Yeung, D.-Y. (2015). Temporal models for predicting student dropout in
Massive Open Online Courses. In Proceedings of the IEEE ICDM Workshop on Data
Mining for Educational Assessment and Feedback (ASSESS 2015). doi: 10.1109/
Millis, K., Magliano, J., Wiemer-Hastings, K., Todaro, S., & McNamara, D. S. (2011). As-
sessing and improving comprehension with Latent Semantic Analysis. In T. K. Lan-
dauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Se-
mantic Analysis (pp. 207–225). Mahwah, NJ: Erlbaum. doi: 10.4324/9780203936399
Mittal, H., & Devi, M. S. (2016). Computerized evaluation of subjective answers using
hybrid technique. In S. H. Saini, R. Sayal, & S. S. Rawat (Eds.), Innovations in
Computer Science and Engineering (pp. 295–303). Singapore: Springer Singapore.
doi: 10.1007/978-981-10-0419-3_35
Mohler, M., & Mihalcea, R. (2009). Text-to-text semantic similarity for automatic short
answer grading. In Proceedings of the 12th Conference of the European Chapter of
the Association for Computational Linguistics (pp. 567–575). doi: 10.3115/1609067
Moore, J. D., & Wiemer-Hastings, P. (2003). Discourse in computational linguistics and
artificial intelligence. In A. C. Graesser, M. A. Gernsbacher, & S. R. Goldman
(Eds.), Handbook of Discourse Processes (pp. 439–485). Mahwah, NJ: Erlbaum.
doi: 10.4324/9781410607348
Mullis, I. V. S., Martin, M. O., Foy, P., & Drucker, K. T. (2012). PIRLS 2011 Interna-
tional results in reading. Chestnut Hill, MA: Boston College.
NCES. (2015). The nation’s report card: 2015 mathematics and reading assessments.
Retrieved 16.01.2016, from
Nguyen, A., Piech, C., Huang, J., & Guibas, L. (2014). Codewebs: scalable homework
search for massive open online programming courses. In Proceedings of the 23rd
international conference on World wide web (pp. 491–502).
OECD. (2013a). OECD Skills Outlook 2013: First Results from the Survey of Adult
Skills. OECD Publishing. doi: 10.1787/9789264204256-en
OECD. (2013b). PISA 2012 assessment and analytical framework: Mathematics, reading,
science, problem solving and financial literacy. OECD Publishing. doi: 10.1787/
OECD. (2014). PISA 2012 results: What students know and can do (volume I, revised
edition, February 2014). Paris: OECD Publishing. doi: 10.1787/9789264208780-en
Peirce, C. S. (1897/1955). Logic as Semiotic: The Theory of Signs. In J. Buchler (Ed.),
Philosophical writings of Peirce (pp. 98–119). New York NY: Dover Publications.
Perfetti, C. A. (1998). The limits of co-occurrence: Tools and theories in language.
Discourse Processes,25 (2-3), 363–377. doi: 10.1080/01638539809545033
Porter, M. (2001). Snowball: A language for stemming algorithms. Retrieved from
Prenzel, M., Sälzer, C., Klieme, E., & Köller, O. (Eds.). (2013). PISA 2012: Fortschritte
und Herausforderungen in Deutschland. Münster: Waxmann.
Ramachandran, L., & Foltz, P. W. (2015). Generating reference texts for short answer
scoring using graph-based summarization. In Association for Computational Lin-
guistics (Ed.), Proceedings of the Tenth Workshop on Innovative Use of NLP for
Building Educational Applications (pp. 207–212). doi: 10.3115/v1/w15-0624
Rasch, D., Kubinger, K. D., & Yanagida, T. (2011). Statistics in psychology using
R and SPSS. Chichester, West Sussex, UK: John Wiley & Sons. doi: 10.1002/
Rauch, D. P., & Hartig, J. (2010). Multiple-choice versus open-ended response formats
of reading test items: A two-dimensional IRT analysis. Psychological Test and
Assessment Modeling,52 (4), 354–379.
Rupp, A. A., Ferne, T., & Choi, H. (2006). How assessing reading comprehension with
multiple-choice questions shapes the construct: a cognitive processing perspective.
Language Testing,23 (4), 441–474. doi: 10.1191/0265532206lt337oa
Saif, A., Ab Aziz, M. J., & Omar, N. (2016). Reducing explicit semantic representation
vectors using latent dirichlet allocation. Knowledge-Based Systems. doi: 10.1016/
Scott, D. W. (1992). Multivariate density estimation: Theory, practice, and visualization.
New York: Wiley. doi: 10.1002/9780470316849
Shrestha, N., Vulić, I., & Moens, M.-F. (2015). Semantic role labeling of speech
transcripts. In A. Gelbukh (Ed.), Computational Linguistics and Intelligent Text
Processing (Vol. 9042, pp. 583–595). Springer International Publishing. doi:
Smarter Balanced Assessment Consortium. (2014). Smarter Balanced pi-
lot automated scoring research studies. Retrieved 2016-02-25, from
Srikant, S., & Aggarwal, V. (2014). A system to grade computer programming skills
using machine learning. In Proceedings of the 20th ACM SIGKDD international
conference on Knowledge discovery and data mining (pp. 1887–1896). doi: 10.1145/
Sukkarieh, J. Z., & Stoyanchev, S. (2009). Automating model building in c-rater. In
Proceedings of the 2009 Workshop on Applied Textual Inference (pp. 61–69). doi:
Thorndike, R. L. (1973). Reading comprehension education in fifteen countries: An
empirical study. International studies in evaluation, III. Stockholm: Almqvist and
Tourangeau, R., Rips, L. J., & Rasinski, K. A. (2009). The psychology of survey
response (10. print ed.). Cambridge: Cambridge Univ. Press. doi: 10.1017/
Ward, J. H., Jr. (1963). Hierarchical grouping to optimize an objective function. Ameri-
can Statistical Association Journal,58 (301), 236–244. doi: 10.1080/01621459.1963
White, B. (2007). Are girls better readers than boys? Which boys? Which girls?
Canadian Journal of Education/Revue canadienne de l’éducation, 554–581. doi:
Whitmore, A., Agarwal, A., & Da Xu, L. (2015). The Internet of Things: A survey of
topics and trends. Information Systems Frontiers,17 (2), 261–274. doi: 10.1007/
Witten, I. H., Frank, E., & Hall, M. A. (2011). Data mining: Practical machine learning
tools and techniques (3rd ed ed.). Burlington, MA: Morgan Kaufmann. doi: 10
Wolters, C. A., Denton, C. A., York, M. J., & Francis, D. J. (2014). Adolescents’
motivation for reading: group differences and relation to standardized achievement.
Reading and Writing,27 (3), 503–533. doi: 10.1007/s11145-013-9454-3
Woo, J., Botzheim, J., & Kubota, N. (2014). Conversation system for natural communi-
cation with robot partner. In 2014 10th France-Japan/ 8th Europe-Asia Congress
on Mecatronics (MECATRONICS) (pp. 349–354). doi: 10.1109/MECATRONICS
Zechner, K., Higgins, D., & Xi, X. (2007). Speechrater: A construct-driven approach to
scoring spontaneous non-native speech. In Proceedings of the SLaTE Workshop on
Speech and Language Technology in Education.
Zehner, F., Goldhammer, F., & Sälzer, C. (2015). Using and improving coding guides
for and by automatic coding of PISA short text responses. In Proceedings of the
IEEE ICDM Workshop on Data Mining for Educational Assessment and Feedback
(ASSESS 2015). doi: 10.1109/icdmw.2015.189
Zehner, F., Goldhammer, F., & Sälzer, C. (submitted). Further Explaining the Read-
ing Gender Gap: An Automatic Analysis of Student Text Responses in the PISA Assessment.
Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text
responses via clustering in educational assessment. Educational and Psychological
Measurement,76 (2), 280–303. doi: 10.1177/0013164415590022
Zesch, T., Heilman, M., & Cahill, A. (2015). Reducing annotation efforts in supervised
short answer scoring. In Association for Computational Linguistics (Ed.), Proceed-
ings of the Tenth Workshop on Innovative Use of NLP for Building Educational
Applications (pp. 124–132). doi: 10.3115/v1/w15-0615
A Dissertation Publications
Publication (A)
Zehner, F., Sälzer, C., & Goldhammer, F. (2016). Automatic coding of short text responses
via clustering in educational assessment. Educational and Psychological Measurement,76 (2),
280–303. doi: 10.1177/0013164415590022
Publication (B)
Zehner, F., Goldhammer, F., & Sälzer, C. (2015). Using and improving coding guides for
and by automatic coding of PISA short text responses. In Proceedings of the IEEE ICDM
Workshop on Data Mining for Educational Assessment and Feedback (ASSESS 2015). doi: 10.1109/icdmw.2015.189
Publication (C)
Zehner, F., Goldhammer, F., & Sälzer, C. (submitted). Further Explaining the Reading Gender
Gap: An Automatic Analysis of Student Text Responses in the PISA Assessment.
... Newer publications are based on bag-of-words, a procedure based on term frequencies [9,10]. The latest trend which has proven to outperform traditional approaches is to use neural network-based embeddings, such as Word2vec [11]. ...
Full-text available
Massive open online courses and other online study opportunities are providing easier access to education for more and more people around the world. However, one big challenge is still the language barrier: Most courses are available in English, but only 16% of the world’s population speaks English [1]. The language challenge is especially evident in written exams, which are usually not provided in the student’s native language. To overcome these inequities, we analyze AI-driven cross-lingual automatic short answer grading. Our system is based on a Multilingual Bidirectional Encoder Representations from Transformers model [2] and is able to fairly score free-text answers in 26 languages in a fully-automatic way with the potential to be extended to 104 languages. Augmenting training data with machine translated task-specific data for fine-tuning even improves performance. Our results are a first step to allow more international students to participate fairly in education.KeywordsCross-lingual automatic short answer gradingArtificial intelligence in educationNatural language processingDeep learning
... A good overview of rule-based and statistical-based approaches in ASAG before the deep learning era is given in [5]. Newer publications are based on bag-of-words, a procedure based on term frequencies [5,6]. The latest trend which has proven to outperform traditional approaches is to use neural network-based embeddings, such as Word2vec [7]. ...
Full-text available
We investigate and compare state-of-the-art deep learning techniques for Automatic Short Answer Grading. Our experiments demonstrate that systems based on the Bidirectional Encoder Representations from Transformers (BERT) [1] performed best for English and German. Our system achieves a Pearson correlation coefficient of 0.73 and a Mean Absolute Error of 0.4 points on the Short Answer Grading data set of the University of North Texas [2]. On our German data set we report a Pearson correlation coefficient of 0.78 and a Mean Absolute Error of 1.2 points. Our approach has the potential to greatly simplify the life of proofreaders and to be used for learning systems that prepare students for exams: 31% of the student answers are correctly graded and in 40% the system deviates on average by only 1 point out of 6, 8 and 10 points.KeywordsAutomatic short answer gradingArtificial intelligence in educationNatural language processingDeep learning
Full-text available
Lay Description What is already known about this topic? Formative assessments are needed to test and monitor the development of learners’ knowledge throughout a unit to provide them with appropriate automated feedback. Constructed response items which require learners to formulate their own free‐text responses are well suited for testing their active knowledge. Assessing constructed responses in an automated fashion is a widely researched topic, but the problem is far from solved and most of the work focuses on predicting holistic scores or grades. To allow for a more fine‐grained and analytic assessment of learners’ knowledge, systems which go beyond predicting simple grades are required. To guarantee that models are stable and make their predictions for the correct reasons, methods for explaining the models are required. What this papers adds? A core topic in physics education is the concept of energy. We implement and evaluate multiple systems based on natural language processing technology for assessing learners’ conceptual knowledge about energy physics using transformer language models as well as feature‐based approaches. The systems assess students’ knowledge about various forms of energy, indicators for the same and the transformation of energy from form into another. As our systems are based on machine learning methodology, we introduce a novel German short answer dataset for training them to detect the respective knowledge elements within students’ free‐text responses. We evaluate the performance of these systems using this dataset as well as the well‐established SciEntsBand‐3‐Way dataset and manage to achieve, to our best knowledge, new state‐of‐the‐art results for the latter. Moreover, we apply methodology for explaining model predictions to assess whether predictions are carried out for the correct reasons. 
Implications for practice and/or policy It is indeed possible to assess constructed responses for the demonstrated knowledge about energy physics in an analytic fashion using natural language processing. Transformer language models can outperform more specialized feature‐based approaches for this task in terms of predictive and descriptive accuracy. Co‐occurrences of different concepts within the same responses can lead models to learn undesired shortcuts which make them unstable.
This edited book is a collection of selected research papers presented at the 2021 2nd International Conference on Artificial Intelligence in Education Technology (AIET 2021), held in Wuhan, China on July 2-4, 2021. AIET establishes a platform for AI in education researchers to present research, exchange innovative ideas, propose new models, as well as demonstrate advanced methodologies and novel systems. Rapid developments in artificial intelligence (AI) and the disruptive potential of AI in educational use has drawn significant attention from the education community in recent years. For educators entering this uncharted territory, many theoretical and practical questions concerning AI in education are raised, and issues on AI’s technical, pedagogical, administrative and socio-cultural implications are being debated. The book provides a comprehensive picture of the current status, emerging trends, innovations, theory, applications, challenges and opportunities of current AI in education research. This timely publication is well-aligned with UNESCO’s Beijing Consensus on Artificial Intelligence (AI) and Education. It is committed to exploring how best to prepare our students and harness emerging technologies for achieving the Education 2030 Agenda as we move towards an era in which AI is transforming many aspects of our lives. Providing a broad coverage of recent technology-driven advances and addressing a number of learning-centric themes, the book is an informative and useful resource for researchers, practitioners, education leaders and policy-makers who are involved or interested in AI and education.
Full-text available
To ensure quality in education, this paper proposes and implements a hybrid technique for computerized evaluation of subjective answers containing only text. The evaluation is performed using statistical techniques—Latent semantic analysis (LSA) and bilingual evaluation understudy (BLEU) along with soft computing technique, fuzzy logic. LSA is used to identify the semantic similarity between two concepts. BLEU helps prevent overrating a student answer. Fuzzy logic is used to map the outputs of LSA and BLEU. The hybrid technique is implemented using Java programming language, MatLab, Java open source libraries, and WordNet—a lexical database of English words. A database of 50 questions from different subjects of Computer Science along with answers is collected for testing purpose. The accuracy of the results varied from 0.72 to 0.99. The results are encouraging when compared with existing techniques. The tool can be used as preliminary step for evaluation.
Conference Paper
Full-text available
Reasoning about causal mechanisms is central to scientific inquiry. In science education, it is important for teachers and researchers to detect students’ mechanistic explanations as evidence of their learning, especially related to causal mechanisms. In this paper, we introduce a semiautomated method that combines association rule mining with human rater’s insight to characterize students’ mechanistic explanations from their written responses to science questions. We show an example of applying this method to students’ written responses to a question about climate change and compare mechanistic reasoning between high- and low-scoring student groups. Such analysis provides important insight into students’ current knowledge structure and informs teachers and researchers about future design of instructional interventions.
Conference Paper
Speech data has been established as an extremely rich and important source of information. However, we still lack suitable methods for the semantic annotation of speech transcribed by automatic speech recognition (ASR) systems. For instance, the semantic role labeling (SRL) task on ASR data remains unsolved, and the results achieved are significantly lower than on regular text data. SRL for ASR data is difficult and complex because of the absence of sentence boundaries and punctuation, grammar errors, wrongly transcribed words, and word deletions and insertions. In this paper we propose a novel approach to SRL for ASR data based on the following idea: (1) combine evidence from different segmentations of the ASR data, (2) jointly select a good segmentation, and (3) label it with the semantics of PropBank roles. Experiments with the OntoNotes corpus show improvements over state-of-the-art SRL systems on ASR data. As an additional contribution, we semi-automatically align the predicates found in the ASR data with the predicates in the OntoNotes gold standard, a difficult and challenging task in itself; the resulting alignments can serve as gold-standard alignments for future research.
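The segmentation-selection step (1)–(2) can be pictured with a toy search: enumerate candidate sentence segmentations of an unpunctuated token stream and pick the one a scoring function prefers. This is purely illustrative; the scoring function stands in for the paper's joint model and is an assumption, as is the exhaustive enumeration (real systems would use dynamic programming over boundary candidates).

```python
from itertools import combinations

def segmentations(tokens):
    """Yield every way to split `tokens` into contiguous segments."""
    n = len(tokens)
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0, *cuts, n)
            yield [tokens[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]

def best_segmentation(tokens, score):
    """Jointly select the segmentation maximizing the summed segment score."""
    return max(segmentations(tokens), key=lambda segs: sum(score(s) for s in segs))
```

With a toy scorer that rewards segments opening with a likely sentence-initial word, the search recovers plausible sentence boundaries from a punctuation-free stream.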
We introduce a new approach to the machine-assisted grading of short answer questions. We follow past work in automated grading by first training a similarity metric between student responses, but then go on to use this metric to group responses into clusters and subclusters. The resulting groupings allow teachers to grade multiple responses with a single action, provide rich feedback to groups of similar answers, and discover modalities of misunderstanding among students; we refer to this amplification of grader effort as “powergrading.” We develop the means to further reduce teacher effort by automatically performing actions when an answer key is available. We show results in terms of grading progress with a small “budget” of human actions, both from our method and an LDA-based approach, on a test corpus of 10 questions answered by 698 respondents.
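The clustering idea behind powergrading can be sketched with a greedy grouper. This is a simplified stand-in, not the paper's method: Jaccard overlap on token sets replaces the trained similarity metric, and the 0.5 threshold is an arbitrary assumption. The payoff is the same, though: one grading action per cluster covers all of its members.

```python
def jaccard(a, b):
    """Token-set overlap between two responses (stand-in for a learned metric)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def cluster_responses(responses, threshold=0.5):
    """Greedily attach each response to the first cluster whose exemplar
    is similar enough; otherwise start a new cluster. A teacher can then
    grade each cluster with a single action."""
    clusters = []
    for r in responses:
        for c in clusters:
            if jaccard(r, c[0]) >= threshold:
                c.append(r)
                break
        else:
            clusters.append([r])
    return clusters
```

Subclusters, as in the paper, would come from re-running the same procedure inside each cluster with a higher threshold.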
Mobile systems and methods that overcome the deficiencies of prior-art speech-based interfaces for telematics applications through the use of a complete speech-based environment for information query, retrieval, presentation, and local or remote command. This environment makes significant use of context, prior information, domain knowledge, and user-specific profile data to achieve a natural environment for one or more users making queries or commands in multiple domains. Through this integrated approach, a complete speech-based natural language query and response environment can be created. The invention creates, stores, and uses extensive personal profile information for each user, thereby improving the reliability of determining the context and presenting the expected results for a particular question or command. The invention may organize domain-specific behavior and information into agents that are distributable or updatable over a wide area network. The invention can be used in dynamic environments, such as mobile vehicles, to control and communicate with both vehicle systems and remote systems and devices.