ReaderBench Learns Dutch: Building
a Comprehensive Automated Essay Scoring
System for Dutch Language
Mihai Dascalu, Wim Westera, Stefan Ruseti, Stefan Trausan-Matu, and Hub Kurvers
Faculty of Automatic Control and Computers,
University Politehnica of Bucharest,
313 Splaiul Independenţei, 60042 Bucharest, Romania
Academy of Romanian Scientists,
Splaiul Independenţei 54, 050094 Bucharest, Romania
Open University of the Netherlands, Heerlen, The Netherlands
Abstract. Automated Essay Scoring has gained wider applicability and usage
with the integration of advanced Natural Language Processing techniques, which
enable in-depth analyses of discourse in order to capture the specificities of
written texts. In this paper, we introduce a novel Automated Essay Scoring
method for the Dutch language, built within the ReaderBench framework, which
encompasses a wide range of textual complexity indices, as well as an automated
segmentation approach. Our method was evaluated on a corpus of 173
technical reports automatically split into sections and subsections, thus forming
a hierarchical structure on which textual complexity indices were subsequently
applied. The stepwise regression model explained 30.5% of the variance in
students' scores, while a Discriminant Function Analysis predicted with
substantial accuracy (75.1%) whether they are high or low performance students.
Keywords: Automated Essay Scoring · Textual complexity assessment ·
Academic performance · ReaderBench framework · Dutch semantic models
1 Introduction
Automated Essay Scoring (AES) is one of the important benefits of Natural Language
Processing (NLP) in assisting teachers. AES may analyze the degree to which a student
covers in the written text the concepts acquired within the learning process. In addition,
it should also analyze the quality of the text, namely its coherence and complexity.
Latent Semantic Analysis (LSA) [1,2] was one of the first methods to introduce the
possibility of measuring semantic similarity when comparing a text written by a
student to the corresponding learning base. Later on, Latent Dirichlet Allocation
(LDA) [3] was introduced as a topic modeling technique that overcomes some problems
of LSA. Even if LSA and LDA are powerful techniques, due to their inherent bag of
words approach, they cannot be used alone for evaluating the complexity and quality of
a written text.
© Springer International Publishing AG 2017
E. André et al. (Eds.): AIED 2017, LNAI 10331, pp. 52-63, 2017.
DOI: 10.1007/978-3-319-61425-0_5
Our aim is to build a comprehensive Automated Essay Scoring model for the Dutch
language. However, text complexity is a hard-to-define concept and, therefore, it cannot
be measured with only a few metrics. Moreover, the complexity of a text is directly
related to its ease of reading and to comprehension, which means it also involves
human reader particularities, for example, age, level of knowledge, socio-cultural
features, and even skill and motivation. Coherence, the main feature of a good
discourse and of a good quality text, and a premise for reducing complexity, is also
related to human perception and is very hard to measure [4]. Cohesion is a concept that
is simpler to handle and operationalize, and it is tightly connected to semantic similarity.
Many metrics and qualitative criteria for analyzing complexity have been proposed,
as discussed in the next section, and various computer systems for computing
such metrics have become available [5]. In the research presented in this paper, we used
the ReaderBench NLP framework [6,7], which integrates a wide range of metrics and
techniques, covering both the cognitive and socio-cultural paradigms. ReaderBench
makes extensive use of Cohesion Network Analysis (CNA) [8,9] in order to
represent discourse in terms of semantic links; this enables the computation of various
local and global cohesion measures described later on. In addition, ReaderBench is
grounded in Bakhtin's dialogism [10], which provides a unified framing for both
individual and collaborative learning [9,11].
An important parameter that should be considered for AES is the specific language.
First, LSA, LDA and any statistical approaches for analyzing essays require text
corpora written in the language of the essays. Second, there may be significant differences
among languages with respect to the average length of sentences and even words, size
of vocabulary, discourse structuring, etc. The Dutch language, in contrast to English,
contains a high number of compound words (which inherently decreases the number of
tokens per phrase); moreover, besides compound words, general words tend to be longer
[12]. In this context, this paper presents the stages required for porting the ReaderBench
framework, which was developed mainly for English, to the Dutch language.
The paper continues with a state of the art section, followed by an in-depth
presentation of the steps required to build our comprehensive Dutch assessment
model. Our evaluation is based on a corpus of student reports in the domain of
environmental sciences. While engaging in a serious game, students adopt the role of
principal researcher for investigating a multifaceted environmental problem and, on
various occasions throughout the game, they are required to report on their findings.
After discussing the results, the fifth section presents the conclusions, as well as further
enhancements to be integrated within our approach.
2 State of the Art
The idea of quantifying textual complexity or difficulty has been studied intensively
over the years, having in mind two major goals: presenting readers with materials
aligned with their level of comprehension, and evaluating learners' abilities and
knowledge levels from their writing traces. In our current research, we are focusing on
the latter goal, evaluating students' writing capabilities in order to discover significant
correlations to their knowledge level.
From a global perspective, textual complexity is relative to the student's knowledge
of the domain, language familiarity, interest and personal motivation [6]. In addition,
the reader's education, cognitive capabilities and prior experiences influence
readability and comprehension [6]. In accordance with the Common Core State Standards
Initiative [13], textual complexity can be evaluated from three different perspectives:
quantitative (e.g., word frequency, word/phrase length), qualitative (e.g., clarity,
structure, language familiarity) and from the reader and task orientation (e.g.,
motivation, prior knowledge or interest). In practice, these dimensions of textual complexity
can be used to determine if a student is prepared for college or for a career. The scope
of the standard is to reduce and eliminate knowledge gaps by offering students a
coherent flow of materials that have a slightly higher textual complexity in order to
challenge the reader.
A significant effort has been put into developing automated tools for textual
complexity assessment as part of the linguistic research domain. E-Rater [14] is one of the
first automated systems to evaluate text difficulty based on three general classes of
essay features: structure (e.g., sentence syntax, proportion of spelling, grammar, usage
or mechanics errors), organization based on various discourse features, and content
based on prompt-specific vocabulary. Several other tools for automated essay grading
or for assessing the textual complexity of a given text have been developed and
employed in various educational programs [5,15]: Lexile (MetaMetrics), ATOS
(Renaissance Learning), Degrees of Reading Power: DRP Analyzer (Questar
Assessment, Inc.), REAP (Carnegie Mellon University), SourceRater (Educational Testing
Service), Coh-Metrix (University of Memphis), Markit (Curtin University of
Technology) [16], IntelliMetric [17] or Writing Pal (Arizona State University) [18,19].
In terms of the Dutch language, there are only a few systems that perform automated
essay scoring by integrating multiple textual complexity indices. T-Scan (http://…) is
one of the most elaborate solutions as it considers
multiple features, including [20]: lexical and sentence complexity, referential cohesion
and lexical diversity, relational coherence, concreteness, personal style,
verbs and time, as well as probability features, all derived from Coh-Metrix [21-23].
Besides T-Scan, various Dutch surface tools have been reported that provide lexical
indices for text difficulty, as well as recommendations to reorganize the text: e.g.,
Texamen, Klinkende Taal and Accessibility Leesniveau Tool [24].
3 Building the Dutch Complexity Model
3.1 The NLP Processing Pipeline for Dutch Language
Before establishing a comprehensive list of textual complexity indices that can be used
to predict a learner's understanding level, we first need to build a Natural Language
Processing (NLP) pipeline for the Dutch language. This processing pipeline integrates key
techniques that are later also used within the scoring algorithm. Multiple challenges
were encountered besides mere translation issues while adapting our ReaderBench
framework from English to Dutch; thus, we see fit to provide prescriptive
information regarding our NLP-specific processes.
First, a new thorough dictionary was required to perform a comprehensive cleaning
of the input text, by filtering and selecting only dictionary words. Elimination of noise
within the unsupervised training process of semantic models, as well as facile
identification of typos, are important elements while building our textual complexity model.
Moreover, as the essays used were academic reports, we also had to
include low-frequency, scientific words in order to grasp the specificity
of our texts. E-Lex (formerly named TST-lexicon) [25] is a lexical database of the Dutch
language consisting of both one-word and multi-word lexicons, and it represented the
best starting point after manually reviewing multiple dictionaries. Besides providing a
comprehensive list of words, E-Lex was also used to build a static lemmatizer that
reduces each inflected word form to its corresponding lemma, therefore normalizing the
input text.
Second, similar to the requirement of a new dictionary, a new stop words list (i.e.,
words having limited or no content information) was required in order to disregard
certain words for scoring purposes. Again, upon manual review, we opted for the list at
http://…, which was expanded with numbers,
interjections, as well as frequent words with low semantic meaning. These words induced
noise within the emerging topics from Latent Dirichlet Allocation (LDA) [3] by having
a high occurrence rate, as well as a high probability, in multiple topics.
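The first two steps can be illustrated with a minimal Python sketch. It is not the ReaderBench implementation (which is written in Java): the tiny DICTIONARY, STOP_WORDS and LEMMAS resources below are hypothetical stand-ins for E-Lex, the expanded stop words list and the static lemmatizer, and the function name content_lemmas is our own.

```python
import re

# Hypothetical miniature resources; the paper relies on E-Lex and an expanded stop words list.
DICTIONARY = {"student", "rapport", "schrijven", "milieu", "probleem"}
STOP_WORDS = {"de", "het", "een", "en", "over"}
# Static lemmatizer: a lookup table from inflected form to lemma.
LEMMAS = {"studenten": "student", "schrijft": "schrijven", "problemen": "probleem"}


def content_lemmas(text):
    """Lowercase, tokenize, lemmatize, then keep dictionary words that are not stop words."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of Unicode letters
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    return [lem for lem in lemmas if lem in DICTIONARY and lem not in STOP_WORDS]


print(content_lemmas("De studenten schrijven een rapport over milieu problemen."))
```

Only the lemmatized content words survive; typos and stop words are dropped because they are absent from the dictionary or present in the stop words list.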
Third, new semantic models, namely vector space models based on Latent
Semantic Analysis [1] and Latent Dirichlet Allocation topic distributions [3], needed to
be trained. The Corpus of Contemporary Dutch (Hedendaags Nederlands; 1.35 billion
words) represented the best alternative in
terms of dimension, breadth of topics, as well as novelty of comprised documents.
After preprocessing, the corpus was reduced to around 500 million content words from
approximately 11.5 million paragraphs, each surpassing the minimum imposed
threshold of at least 20 content words. The LSA space was built using the stochastic
SVD decomposition from Apache Mahout [26], which was applied on the
term-document matrix weighted with log-entropy, across 300 dimensions. LDA made
use of parallel Gibbs sampling implemented in Mallet [27] and the model was created
with 100 topics, as suggested by Blei [28]. A manual inspection of the top 100 words from
each LDA topic suggested that the space was adequately constructed, given
that the most representative words from each topic were semantically related to one
another.
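As an illustration of the log-entropy weighting applied before the SVD, the following sketch uses the commonly cited formulation (local weight log(1 + tf), global weight 1 + Σ p log p / log n). The paper does not spell out the exact variant, so this formula and the function name log_entropy are assumptions, and the toy matrix is invented.

```python
import math


def log_entropy(term_doc):
    """term_doc[i][j] = raw count of term i in document j; returns the weighted matrix.

    Assumes at least two documents and that every term occurs at least once.
    """
    n_docs = len(term_doc[0])
    weighted = []
    for row in term_doc:
        total = sum(row)
        # Sum of p_ij * log(p_ij) over the documents in which the term occurs (<= 0).
        entropy = sum((c / total) * math.log(c / total) for c in row if c > 0)
        # Global weight in [0, 1]: 0 for evenly spread terms, 1 for single-document terms.
        g = 1.0 + entropy / math.log(n_docs)
        weighted.append([g * math.log(1 + c) for c in row])
    return weighted


weights = log_entropy([[2, 2], [4, 0]])
print(weights)
```

A term spread evenly over all documents receives a global weight of 0, while a term concentrated in a single document keeps its full local weight, which is why the weighting dampens uninformative high-frequency words.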
Fourth, complementary to our LSA and LDA models, the Open Dutch WordNet,
the most complete Dutch lexical semantic database to date with more than 115,000
synsets, was also integrated, enabling the following: (a) the identification of lexical
chains and word sense disambiguation [29], as well as (b) the computation of various
semantic distances in ontologies, namely Wu-Palmer, Leacock-Chodorow and path
length distances [30].
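To make the Wu-Palmer measure concrete, the sketch below computes it on a hypothetical five-node hypernym tree rather than on the Open Dutch WordNet. It uses the usual formulation 2 · depth(lcs) / (depth(a) + depth(b)) with the root at depth 1, which yields a similarity in (0, 1]; the PARENT table and function names are illustrative assumptions.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {"dier": None, "zoogdier": "dier", "hond": "zoogdier",
          "kat": "zoogdier", "vogel": "dier"}


def path_to_root(node):
    """Return the list of nodes from `node` up to the root (node first, root last)."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def wu_palmer(a, b):
    """Wu-Palmer similarity based on the deepest common ancestor (lcs) of a and b."""
    ancestors_b = set(path_to_root(b))
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("hond", "kat"), wu_palmer("hond", "vogel"))
```

Concepts sharing a deep ancestor (hond, kat) score higher than concepts that only meet at the root (hond, vogel).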
3.2 Textual Complexity Indices
Starting from the wide range of textual complexity indices available within the
ReaderBench framework [6,7] for the English language, and based on the previously
described NLP processing pipeline, we present the multitude of textual complexity
indices that we have made available for the Dutch language.
In contrast to the systems mentioned within the state of the art section, and besides
covering multiple layers of analysis ranging from surface indices and syntax to
semantics, ReaderBench focuses on text cohesion and discourse connectivity. The
framework provides a more in-depth perspective of discourse structure based on
Cohesion Network Analysis [8,9], a multi-layered cohesion graph [31] that considers
semantic links between different text constituents. We further describe the indices
integrated in our framework and used for this study, categorized by their textual
analysis scope.
Surface, lexicon and syntax analyses. The first approaches to text complexity were
developed by Page [32] in his search to develop an automatic grading system for
students' essays. Page discovered a strong correlation between human intrinsic
variables (trins) and proxes (i.e., computer approximations or textual complexity indices),
thus showing that statistical analyses can provide reliable automated estimations.
Our model integrates the most representative and predictive proxes from Page's initial
study, corroborated with other surface measures frequently used in other automated
essay grading systems (e.g., average word/phrase/paragraph length, average unique/
content words per paragraph, average commas per sentence/paragraph). Entropy at
word level, derived from Shannon's Information Theory [33], is a relevant metric for
quantifying textual complexity based on the hypothesis that a more complex text
contains more information, more diverse concepts and requires more working memory.
In contrast, character entropy is a language-specific characteristic [34] and does not
exhibit a significant variance in texts written in English. Moreover, of particular interest
at this level, due to the inherent implications in co-reference resolution, are the different
categories of pronouns (i.e., first, second and third person, interrogative, and indefinite
pronouns), implemented as predefined word lists and considered within our model.
Coverage statistics with regard to specific pronoun usage were computed at sentence,
paragraph, and document levels.
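Word entropy can be sketched in a few lines of Python. The logarithm base (2 here) and whether the input consists of raw tokens or lemmas are not stated in the text, so both are assumptions of this example.

```python
import math
from collections import Counter


def word_entropy(words):
    """Shannon entropy (in bits) of the word distribution of a non-empty text."""
    counts = Counter(words)
    total = len(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


print(word_entropy(["milieu", "beleid", "water", "lucht"]))    # four distinct words
print(word_entropy(["milieu", "milieu", "milieu", "milieu"]))  # one repeated word
```

A text that repeats a single word has zero entropy, whereas a more varied vocabulary raises the value, which matches the hypothesis that diverse texts carry more information.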
Semantic analysis and discourse structure. In order to comprehend a text, the reader
must create a coherent and well-connected representation of the information, commonly
referred to as the situation model [35]. According to McNamara et al. [15], textual
complexity is linked with cohesion in terms of comprehension, as the lack of cohesion
can artificially increase the perceived difficulty of a text. Thus, our model uses a local
and global evaluation of cohesion within the CNA graph, computed as the average
value of the semantic similarities of all links at intra- and inter-paragraph levels [31,36].
Cohesion is estimated as the average value of [6]: (a) Wu-Palmer semantic distances
applied on the WordNet lexicalized ontology, (b) cosine similarity in Latent Semantic
Analysis (LSA) vector space models, and (c) the inverse of the Jensen-Shannon
dissimilarity (JSD) between Latent Dirichlet Allocation (LDA) topic distributions [37].
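A minimal sketch of this aggregated cohesion score for one link between two text segments is given below. Two points are assumptions rather than details confirmed by the text: the "inverse" of the JSD is read as 1 − JSD (with base-2 logarithms, so that the divergence lies in [0, 1]), and the three components are weighted equally. The Wu-Palmer value is passed in as a precomputed number, and all function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two LSA vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def kl(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i = 0 contribute nothing."""
    return sum(a * math.log2(a / b) for a, b in zip(p, q) if a > 0)


def js_similarity(p, q):
    """1 - Jensen-Shannon divergence between two LDA topic distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 1.0 - (kl(p, m) + kl(q, m)) / 2


def cohesion(wu_palmer_sim, lsa_u, lsa_v, lda_p, lda_q):
    """Unweighted mean of the three semantic similarity components."""
    return (wu_palmer_sim + cosine(lsa_u, lsa_v) + js_similarity(lda_p, lda_q)) / 3


print(cohesion(0.8, [1.0, 0.0], [1.0, 1.0], [0.7, 0.3], [0.6, 0.4]))
```

Averaging such link scores over all intra-paragraph or inter-paragraph links would then give the local and global cohesion indices described above.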
Besides semantic models, lexical chains provide a strong basis for assessing text
cohesion and several indices have also been introduced: (a) the average and the
maximum span of lexical chains (the distance in words between the first and the last
occurrence of words pertaining to the same chain), (b) the average number of lexical
chains per paragraph, as well as (c) the percentage of words that are included in lexical
chains (i.e., words that are not isolated within the discourse, but inter-linked with other
concepts from the same chain).
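Assuming that lexical chains have already been identified and are represented as lists of word positions (a hypothetical input format, not the ReaderBench data model), indices (a) and (c) can be sketched as follows.

```python
def chain_indices(chains, n_words):
    """Compute span and coverage indices for a non-empty list of lexical chains.

    chains: list of lists of 0-based word positions, one list per lexical chain.
    n_words: total number of words in the text.
    """
    # Span: distance in words between the first and last occurrence within a chain.
    spans = [max(chain) - min(chain) for chain in chains]
    covered = {pos for chain in chains for pos in chain}
    return {
        "avg_span": sum(spans) / len(spans),
        "max_span": max(spans),
        "pct_in_chains": 100.0 * len(covered) / n_words,
    }


print(chain_indices([[0, 5, 12], [3, 4]], n_words=20))
```

Index (b), the average number of chains per paragraph, would additionally require the paragraph boundaries, which are left out of this sketch.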
In addition, starting from the Referentiebestand Nederlands (RBN) [38], several
discourse connectors identifiable via cue phrases have been added to our complexity
model in order to provide a fine-grained view over the discourse with regard to the
following relevant relationships: cause, circumstance, comparison, concession,
condition, conjunctive, contrast, degree, disjunctive, effect, exception, nonrestrictive, other,
purpose, restriction, time, and interrogative.
Word complexity represents a mixture of different layers of discourse analysis
covering a wide set of estimators for each word's difficulty: (a) syllable count,
(b) distance in characters between the inflected form, lemma and word stem (adding
multiple prefixes or suffixes increases the difficulty of using a certain word),
(c) specificity reflected in the inverse document frequency from the LSA/LDA training
corpus, (d) the average and the maximum path distance in the hypernym tree based on
all word senses and (e) the word polysemy count from WordNet [39]. In order to reflect
individual scores at sentence and paragraph level, all these indices were averaged, taking
into consideration only lemmatized content words generated after applying the NLP
processing pipeline. Moreover, normalized occurrences at both paragraph and sentence
levels of all major word categories from the Dutch LIWC dictionary [40] have been
considered, providing additional insights in terms of underlying concept categories.
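As an example of how one word-level estimator is lifted to sentence level, the sketch below averages the specificity estimator (c) over lemmatized content words. The plain log(N / df) variant of inverse document frequency and the invented document frequencies are assumptions; the paper does not state which smoothing, if any, was used.

```python
import math


def idf(word, doc_freq, n_docs):
    """Inverse document frequency, assumed here as the unsmoothed log(N / df)."""
    return math.log(n_docs / doc_freq[word])


def average_specificity(lemmas, doc_freq, n_docs):
    """Mean idf over the content lemmas of a sentence that occur in the training corpus."""
    scored = [idf(w, doc_freq, n_docs) for w in lemmas if w in doc_freq]
    return sum(scored) / len(scored) if scored else 0.0


# Hypothetical document frequencies from a corpus of 1,000 documents.
DOC_FREQ = {"milieu": 10, "probleem": 100, "rapport": 1000}
print(average_specificity(["milieu", "probleem", "rapport"], DOC_FREQ, n_docs=1000))
```

Rare, domain-specific lemmas raise the sentence score, while words that appear in every document contribute nothing.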
3.3 Automated Text Segmentation
The previously introduced textual complexity indices become less relevant when facing
longer documents comprising thousands or tens of thousands of words. Besides the
computational power required for building a complete CNA graph that captures all
potential cohesive links, different sections might exhibit different traits which can be
easily overlooked at document level. A commonly encountered approach is to
automatically split longer texts using an imposed fixed window of words. The most
frequently used threshold value is 1,000 words [5]. However, this method fails to
consider the natural discourse structure of the text, its hierarchical decomposition, as
most documents contain sections, subsections and so forth, constituent elements that
emerge as a more viable manner of splitting the text. Therefore, the headings from the
initial document produce a hierarchical structure in which each section contains its own
text and a possibly empty list of subsections.
Thus, we developed a new segmentation method applicable to Microsoft Word
documents, assuming that sections are correctly annotated with the appropriate heading
styles reflecting their hierarchical structure (e.g., Heading 1 is automatically considered as
a section, Heading 2 a subsection, Heading 3 a subsubsection, etc.). From a technical
perspective, due to the constraint that the entire framework is written in Java,
we have opted to rely on the Apache POI library for parsing the
.docx documents. The newly generated meta-document contains multiple layers of
well-defined and self-contained document segments on which we can apply the
previously introduced textual complexity indices. The results for each textual complexity
index and for each extracted section are averaged in order to obtain the scores for the
entire meta-document.
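The segmentation logic can be sketched independently of the .docx parsing step. The Python example below is not the Java/Apache POI implementation: it assumes the headings have already been extracted as (level, title, text) tuples, builds the section hierarchy with a stack, and averages an index over all sections with equal weights. Skipping sections without own text is our assumption, as are all names used.

```python
def build_hierarchy(blocks):
    """blocks: (level, title, text) tuples in document order; level 1 = 'Heading 1'."""
    root = {"title": "ROOT", "level": 0, "text": "", "children": []}
    stack = [root]
    for level, title, text in blocks:
        node = {"title": title, "level": level, "text": text, "children": []}
        # Close every open section at the same or a deeper level.
        while stack[-1]["level"] >= level:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root


def flatten(node):
    """Yield all sections below `node` in document order."""
    for child in node["children"]:
        yield child
        yield from flatten(child)


def document_score(root, index_fn):
    """Average a textual complexity index over all sections that contain text."""
    sections = [s for s in flatten(root) if s["text"].strip()]
    return sum(index_fn(s["text"]) for s in sections) / len(sections)


blocks = [(1, "Inleiding", "a b c"), (2, "Doel", "a b"), (1, "Analyse", "a b c d")]
tree = build_hierarchy(blocks)
print(document_score(tree, lambda text: len(text.split())))
```

Here the word count stands in for an arbitrary complexity index; any function from section text to a number can be plugged in as index_fn.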
4 Results
4.1 Corpus
The corpus used for performing a preliminary validation of our model consisted of 173
technical reports in Dutch written by master's degree students from the Open University
of the Netherlands and Utrecht University. The students play an online game in the
domain of environmental policy, which confronts them with multidimensional
environmental problems. During the game, they are required to upload technical reports about
their findings, in subsequent stages (i.e., analysis, 2 design tasks, 2 evaluation tasks and
a final evaluation) [41]. As these reports need to be evaluated manually by teachers in
very short time spans, the need for Automated Essay Scoring arose. All essays are
scored by human tutors on the basis of an assessment framework and scores express a
linear variable ranging from 1 (utterly weak) to 10 (excellent). The reports used for this
experiment address only the first stage (i.e., analysis) and contained an average of 1832
words (SD = 790), ranging from a minimum of 243 words to a maximum of 6186
words. All reports were manually corrected in terms of formatting in order to ensure an
appropriate usage of heading styles, a process that afterwards facilitates their automated
segmentation.
Because of the limited number of students whose scores span multiple levels, we
applied a binary split of student scores into two distinct classes: high performance
students with scores ≥ 7, while the rest were catalogued as low performance students.
Moreover, for the scope of these preliminary experiments, we opted to rely only on the
LDA topic model besides WordNet, instead of both LSA and LDA. This was due to the
fact that only the LDA space was inspected by native speakers to check that it
comprised relevant word associations within corresponding topics.
4.2 Statistical Analyses
The Dutch indices from ReaderBench that lacked normal distributions were discarded
(e.g., average number of sentences, words and content words, average number of
commas at paragraph and sentence levels, word polysemy counts, different connectors
and word lists at paragraph and sentence level). Correlations between the selected
indices and the dependent variable (the students' score for their technical report) were
then calculated for the remaining indices to determine whether there was a statistically
significant relation (p < .05). Indices that were highly collinear (r ≥ .9) were flagged,
and the index with the strongest correlation with the assigned score corresponding to
each report was retained, while the other indices were removed. The remaining indices
were included as predictor variables in a stepwise regression to explain the variance in
the students' scores, as well as predictors in a Discriminant Function Analysis [42]
used to classify students based on their performance.
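The collinearity filter can be read as a greedy procedure, sketched below under that assumption: indices are ranked by the absolute value of their correlation with the score, and an index is kept only if it is not collinear (|r| ≥ .9) with an index that was already kept. The data and index names are invented for illustration, and the normality and significance checks are omitted.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def prune_collinear(indices, scores, threshold=0.9):
    """indices: {name: values}. Among collinear indices keep the one most correlated with scores."""
    ranked = sorted(indices, key=lambda k: abs(pearson(indices[k], scores)), reverse=True)
    kept = []
    for name in ranked:
        if all(abs(pearson(indices[name], indices[k])) < threshold for k in kept):
            kept.append(name)
    return kept


scores = [1, 2, 3, 4, 5]
indices = {
    "words": [1, 2, 3, 4, 5],       # perfectly correlated with the score
    "chars": [5, 10, 15, 20, 26],   # nearly collinear with "words"
    "entropy": [3, 1, 4, 1, 5],     # weakly related to both
}
print(prune_collinear(indices, scores))
```

In this toy example "chars" is removed because it is collinear with the more predictive "words", while "entropy" survives.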
4.3 Relationship Between ReaderBench and Students' Final Scores
To address our research question of automatically scoring students' reports, we
computed correlations between the ReaderBench indices that were normally distributed
and were not multicollinear and the students' final scores. As shown in Table 1, medium to
weak effects were found for ReaderBench indices related to the number of words,
paragraphs, unique words per sentence, lexical chains, lower local cohesion induced by
a more varied vocabulary (higher word entropy), different types of discourse
connectors at both sentence and paragraph levels (concession, condition, circumstance), as
well as pronouns (both third person and indefinite).
The correlations indicate that students who received higher scores had longer
reports in terms of words and paragraphs, greater word entropy, used more discourse
connectors and pronouns, and produced more unique words. Moreover, students who
received higher scores had lower inner cohesion per paragraph, indicating more
elaborate paragraphs that reflect a mixture of diverse ideas.
Table 1. Correlations between ReaderBench indices and report score.
Index                                                                             r       p
Logarithmic number of words                                                    .461   <.001
Average number of lexical chains per paragraph                                 .338   <.001
Average sentence-paragraph cohesion (Wu-Palmer semantic distance in WordNet)  -.284   <.001
Average number of concession connectors per paragraph                          .269   <.001
Average number of condition connectors per paragraph                           .260    .001
Word entropy                                                                   .258    .001
Average number of circumstance connectors per paragraph                        .254    .001
Percentage of words that are included in lexical chains                        .250    .001
Average number of indefinite pronouns per sentence                             .237    .002
Average sentence length (number of characters)                                 .193    .011
Average number of third person pronouns per sentence                           .187    .014
Average number of circumstance connectors per sentence                         .187    .014
Average number of unique content words per sentence                            .184    .015
Number of paragraphs                                                           .160    .035
Average number of condition connectors per sentence                            .154    .044
4.4 Regression Analysis and Discriminant Function Analysis
To analyze which ReaderBench features best predicted the students' score, we
conducted a stepwise regression analysis using the 15 significant indices as the
independent variables. This yielded a significant model, F(3, 169) = 24.676, p < .001,
r = .552, R² = .305. Three variables were significant and positive predictors of report
scores: logarithmic number of words, average number of pronouns per sentence
(indefinite), percentage of words that are included in lexical chains. These variables
explained 30.5% of the variance in the students' report scores.
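The reported statistics are mutually consistent, which can be checked with the standard relation between R² and the F statistic of a linear model with k predictors and n observations, F = (R² / k) / ((1 − R²) / (n − k − 1)). The short sketch below applies it to the rounded values given above, so a small rounding difference with respect to the reported F is expected; the function name is our own.

```python
def f_statistic(r_squared, n, k):
    """F statistic of a linear regression with k predictors fitted on n observations."""
    return (r_squared / k) / ((1 - r_squared) / (n - k - 1))


r = 0.552                 # multiple correlation coefficient reported above
r_squared = r ** 2
print(round(r_squared, 3))                          # 0.305
print(round(f_statistic(r_squared, n=173, k=3), 1))
```

With n = 173 and k = 3 the denominator degrees of freedom are 169, matching F(3, 169), and the resulting value is close to the reported 24.676.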
The stepwise Discriminant Function Analysis (DFA) retained three different
variables as significant predictors (i.e., 1. logarithmic number of words, 2. average number
of indefinite pronouns per sentence, and 3. average sentence-paragraph cohesion using
Wu-Palmer semantic distance), and removed the remaining variables as non-significant
predictors.
The results show that the DFA using these three indices correctly allocated 132 of
the 173 students from our dataset, χ²(df = 3, n = 173) = 40.948, p < .001, for an
accuracy of 76.3% (the chance level for this analysis is 50%). For the leave-one-out
cross-validation (LOOCV), the discriminant analysis allocated 130 of the 173 students
for an accuracy of 75.1% (see the confusion matrix reported in Table 2 for results). The
measure of agreement between the actual student performance and that assigned by our
model produced a weighted Cohen's Kappa of .517, demonstrating moderate
agreement.
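Accuracy and Cohen's Kappa follow directly from a confusion matrix, as the sketch below shows for the whole-set counts of Table 2. With only two categories, weighted and unweighted Kappa coincide, so the plain formula is used; the function name and the row/column orientation (rows as actual classes) are assumptions of this example.

```python
def accuracy_and_kappa(matrix):
    """matrix[i][j] = number of students of actual class i assigned to class j."""
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    # Agreement expected by chance, from the marginal distributions.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / n ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa


whole_set = [[54, 21], [20, 78]]  # rows: low, high; columns: low, high
accuracy, kappa = accuracy_and_kappa(whole_set)
print(round(accuracy, 3), round(kappa, 3))  # 0.763 0.517
```

Applied to the whole-set counts, the sketch reproduces the reported 76.3% accuracy and a Kappa of about .517.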
5 Conclusions
The ReaderBench NLP framework was extended to support automatic scoring of
students' technical reports written in the Dutch language. Existing textual complexity
indices and methods had to be adapted from English, and specifically
tweaked for Dutch, thus introducing one of the most comprehensive models
available for Dutch to our knowledge. Moreover, we have also introduced an automatic
segmentation method that creates a hierarchical structure based on document sections
and headings.
Table 2. Confusion matrix for DFA classifying students based on performance
Set               Actual   Predicted Low   Predicted High   Total
Whole set         Low                 54               21      75
Whole set         High                20               78      98
Cross-validated   Low                 53               22      75
Cross-validated   High                21               77      98
Initial results indicate that our model, which goes beyond the replication of the
English version of ReaderBench due to the performed customizations, has a high
accuracy and is suitable for automatically scoring Dutch technical reports. In addition,
the performance of our model is comparable to systems available for the English language.
Our framework integrates the widest range of textual complexity indices available for
the Dutch language, emphasizing the semantic dimension of the analysis instead of
frequently used surface measures. Nevertheless, we must point out that the variance
explained by the regression model, as well as the weighted Cohen's Kappa, are rather
low in contrast to the accuracy of the DFA model, which only assumes a binary
classification. Only two indices (i.e., logarithmic number of words, which has the highest
correlation, and average number of indefinite pronouns per sentence) were retained in
both the linear regression and the DFA model. The remaining index of each model is
specific to it, as the two models are fundamentally different: the regression
model predicts a linear score, while the DFA performs a classification into two
performance categories.
As limitations, we must also point out the discrepancies in the evaluation of the
technical reports, as the automatic evaluation is mostly focused on students' writing
style, while the tutors evaluate the technical quality of the report. Moreover, the
population for our study consists of master's degree students who have, in general,
relatively high writing skills; in turn, this may reduce the variance in complexity
among the essays. Therefore, new metrics should be introduced in order to address the
technical soundness of a document in relation to a given theme or an imposed set of
topics of interest. Moreover, the Dutch language imposes additional challenges, like the
high number of compound words. When building semantic
models, these words could be more relevant if taken separately. Thus, automated
splitting rules should be enforced upon compound words in order to provide a clearer
contextualization of the input text.
Acknowledgments. This work was partially funded by the 644187 EC H2020 Realising an
Applied Gaming Eco-system (RAGE) project, by the FP7 208-212578 LTfLL project, as well as
by University Politehnica of Bucharest through the "Excellence Research Grants" Program
UPB-GEX 12/26.09.2016.
References
1. Landauer, T.K., Dumais, S.T.: A solution to Plato's problem: the Latent Semantic Analysis
theory of acquisition, induction and representation of knowledge. Psychol. Rev. 104(2),
211-240 (1997)
2. Miller, T.: Essay assessment with Latent Semantic Analysis. J. Educ. Comput. Res. 29(4),
495-512 (2003)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. J. Mach. Learn. Res. 3(4-5),
993-1022 (2003)
4. Crossley, S.A., McNamara, D.S.: Text coherence and judgments of essay quality: models of
quality and coherence. In: 33rd Annual Conference of the Cognitive Science Society,
pp. 1236-1241. Cognitive Science Society, Boston (2011)
5. Nelson, J., Perfetti, C., Liben, D., Liben, M.: Measures of Text Difficulty: Testing their
Predictive Value for Grade Levels and Student Performance. Council of Chief State School
Officers, Washington, DC (2012)
6. Dascalu, M.: Analyzing Discourse and Text Complexity for Learning and Collaborating.
Studies in Computational Intelligence, vol. 534. Springer, Cham (2014)
7. Dascalu, M., Dessus, P., Bianco, M., Trausan-Matu, S., Nardy, A.: Mining texts, learner
productions and strategies with ReaderBench. In: Peña-Ayala, A. (ed.) Educational Data
Mining. SCI, vol. 524, pp. 345-377. Springer, Cham (2014). doi:10.1007/978-3-319-02738-
8. Dascalu, M., McNamara, D.S., Trausan-Matu, S., Stavarache, L.L., Allen, L.K.: Cohesion
network analysis of CSCL participation. Behav. Res. Methods, pp. 1-16 (2017)
9. Dascalu, M., Trausan-Matu, S., McNamara, D.S., Dessus, P.: ReaderBench: automated
evaluation of collaboration based on cohesion and dialogism. Int. J. Comput. Support.
Collaborative Learn. 10(4), 395-423 (2015)
10. Bakhtin, M.M.: The dialogic imagination: four essays. The University of Texas Press, Austin
(1981)
11. Dascalu, M., Allen, K.A., McNamara, D.S., Trausan-Matu, S., Crossley, S.A.: Modeling
comprehension processes via automated analyses of dialogism. In: 39th Annual Meeting of
the Cognitive Science Society (CogSci 2017). Cognitive Science Society, London (2017, in
press)
12. Duyck, W., Desmet, T., Verbeke, L.P., Brysbaert, M.: WordGen: A tool for word selection
and nonword generation in Dutch, English, German, and French. Behav. Res. Methods
36(3), 488-499 (2004)
13. National Governors Association Center for Best Practices & Council of Chief State School
Officers: Common Core State Standards. Authors, Washington D.C. (2010)
14. Powers, D.E., Burstein, J., Chodorow, M., Fowles, M.E., Kukich, K.: Stumping e-rater®:
Challenging the Validity of Automated Essay Scoring. Educational Testing Service,
Princeton (2001)
15. McNamara, D.S., Graesser, A.C., Louwerse, M.M.: Sources of text difficulty: Across the
ages and genres. In: Sabatini, J.P., Albro, E., O'Reilly, T. (eds.) Measuring up: Advances in
How we Assess Reading Ability, pp. 89-116. R&L Education, Lanham (2012)
16. Williams, R., Dreher, H.: Automatically grading essays with Markit©. J. Issues Informing
Sci. Inform. Technol. 1, 693-700 (2004)
17. Elliot, S.: IntelliMetric: from here to validity. In: Shermis, M.D., Burstein, J.C. (eds.)
Automated Essay Scoring: A Cross Disciplinary Approach, pp. 71-86. Lawrence Erlbaum
Associates, Mahwah (2003)
18. Crossley, S.A., Allen, L.K., McNamara, D.S.: The Writing Pal: a writing strategy tutor. In:
Crossley, S.A., McNamara, D.S. (eds.) Handbook on Educational Technologies for Literacy.
Taylor & Francis, Routledge, New York (in press)
19. McNamara, D.S., Crossley, S.A., Roscoe, R., Allen, L.K., Dai, J.: A hierarchical
classification approach to automated essay scoring. Assessing Writ. 23, 35-59 (2015)
20. Pander Maat, H.L.W., Kraf, R.L., van den Bosch, A., van Gompel, M., Kleijn, S., Sanders,
T.J.M., van der Sloot, K.: T-Scan: a new tool for analyzing Dutch text. Comput. Linguist.
Neth. J. 4, 53-74 (2014)
21. Graesser, A.C., McNamara, D.S., Louwerse, M.M., Cai, Z.: Coh-Metrix: Analysis of text on
cohesion and language. Behav. Res. Methods Instrum. Comput. 36(2), 193-202 (2004)
22. Graesser, A.C., McNamara, D.S., Kulikowich, J.M.: Coh-Metrix: Providing multilevel
analyses of text characteristics. Educ. Res. 40(5), 223234 (2011)
23. McNamara, D.S., Graesser, A.C., McCarthy, P., Cai, Z.: Automated Evaluation of Text and
Discourse with Coh-Metrix. Cambridge University Press, Cambridge (2014)
62 M. Dascalu et al.
24. Kraf, R., Lentz, L., Pander Maat, H.: Drie Nederlandse instrumenten voor het automatisch
voorspellen van begrijpelijkheid. Een klein consumentenonderzoek. Tijdschift voor
Taalbeheersing 33(3), 249265 (2011)
25. CGN Consortium: e-Lex, lexicale databank (lexical database). Instituut voor Nederlandse
Taal, Leiden, the Netherlands (2017)
26. Owen, S., Anil, R., Dunning, T., Friedman, E.: Mahout in Action. Manning Publications
Co., Greenwich (2011)
27. McCallum, A.K.: MALLET: A Machine Learning for Language Toolkit (2002). http://
28. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 7784 (2012)
29. Galley, M., McKeown, K.: Improving word sense disambiguation in lexical chaining. In:
18th International Joint Conference on Articial Intelligence (IJCAI 2003), pp. 14861488.
Morgan Kaufmann Publishers, Inc., Acapulco (2003)
30. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of lexical semantic
relatedness. Comput. Linguist. 32(1), 1347 (2006)
31. Trausan-Matu, S., Dascalu, M., Dessus, P.: Textual complexity and discourse structure in
computer-supported collaborative learning. In: Cerri, Stefano A., Clancey, William J.,
Papadourakis, G., Panourgia, K. (eds.) ITS 2012. LNCS, vol. 7315, pp. 352357. Springer,
Heidelberg (2012). doi:10.1007/978-3-642-30950-2_46
32. Wresch, W.: The imminence of grading essays by computer25 years later. Comput.
Compos. 10(2), 4558 (1993)
33. Shannon, C.E.: Prediction and entropy of printed English. Bell Syst. Tech. J. 30, 5064
34. Gervasi, V., Ambriola, V.: Quantitative assessment of textual complexity. In: Barbaresi, M.L.
(ed.) Complexity in Language and Text, pp. 197228. Plus, Pisa, Italy (2002)
35. van Dijk, T.A., Kintsch, W.: Strategies of Discourse Comprehension. Academic Press,
New York (1983)
36. Dascalu, M., Dessus, P., Trausan-Matu, Ş., Bianco, M., Nardy, A.: ReaderBench, an
environment for analyzing text complexity and reading strategies. In: Lane, H.Chad, Yacef,
K., Mostow, J., Pavlik, P. (eds.) AIED 2013. LNCS, vol. 7926, pp. 379388. Springer,
Heidelberg (2013). doi:10.1007/978-3-642-39112-5_39
37. Manning, C.D., Schütze, H.: Foundations of statistical Natural Language Processing. MIT
Press, Cambridge (1999)
38. van der Vliet, H.: The Referentiebestand Nederlands as a multi-purpose lexical database. Int.
J. Lexicogr. 20(3), 239257 (2007)
39. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 3941
40. Zijlstra, H., van Meerveld, T., van Middendorp, H., Pennebaker, J.W., Geenen, R.:
De Nederlandse versie van de Linguistic Inquiry and Word Count (LIWC), een
gecomputeriseerd tekstanalyseprogramma [Dutch version of the Linguistic Inquiry and
Word Count (LIWC), a computerized text analysis program]. Gedrag & Gezondheid 32,
273283 (2004)
41. Westera, W., Nadolski, N., Hummel, H.: Serious gaming analytics: what studentslog les
tell us about gaming and learning. Int. J. Serious Games 1(2), 3550 (2014)
42. Klecka, W.R.: Discriminant Analysis. Quantitative Applications in the Social Sciences
Series, vol. 19. Sage Publications, Thousand Oaks (1980)
ReaderBench Learns Dutch 63