
# ReaderBench Learns Dutch: Building a Comprehensive Automated Essay Scoring System for Dutch Language

Mihai Dascalu¹,², Wim Westera³, Stefan Ruseti¹, Stefan Trausan-Matu¹,², and Hub Kurvers³

¹ Faculty of Automatic Control and Computers, University Politehnica of Bucharest, 313 Splaiul Independenţei, 60042 Bucharest, Romania
{mihai.dascalu,stefan.ruseti,stefan.trausan}@cs.pub.ro
² Splaiul Independenţei 54, 050094 Bucharest, Romania
³ Open University of the Netherlands, Heerlen, The Netherlands
{wim.westera,hub.kurvers}@ou.nl
Abstract. Automated Essay Scoring has gained wider applicability and usage with the integration of advanced Natural Language Processing techniques, which enable in-depth analyses of discourse in order to capture the specificities of written texts. In this paper, we introduce a novel Automated Essay Scoring method for the Dutch language, built within the ReaderBench framework, which encompasses a wide range of textual complexity indices, as well as an automated segmentation approach. Our method was evaluated on a corpus of 173 technical reports automatically split into sections and subsections, thus forming a hierarchical structure on which textual complexity indices were subsequently applied. The stepwise regression model explained 30.5% of the variance in students' scores, while a Discriminant Function Analysis predicted with substantial accuracy (75.1%) whether they are high or low performance students.
Keywords: Automated Essay Scoring · Textual complexity assessment
1 Introduction
Automated Essay Scoring (AES) is one of the important benefits of Natural Language Processing (NLP) in assisting teachers. AES can analyze the degree to which a student's written text covers the concepts acquired within the learning process. In addition, it should also assess the quality of the text, that is, its coherence and complexity.
Latent Semantic Analysis (LSA) [1, 2] was one of the first methods to introduce the possibility of measuring semantic similarity when comparing a text written by a student to the corresponding learning base. Later on, Latent Dirichlet Allocation (LDA) [3] was introduced as a topic modeling technique that overcomes some problems of LSA. Even if LSA and LDA are powerful techniques, due to their inherent bag-of-words approach, they cannot be used alone for evaluating the complexity and quality of a written text.

E. André et al. (Eds.): AIED 2017, LNAI 10331, pp. 52–63, 2017.
DOI: 10.1007/978-3-319-61425-0_5
Our aim is to build a comprehensive Automated Essay Scoring model for the Dutch language. However, text complexity is a hard-to-define concept and, therefore, cannot be measured with only a few metrics. Moreover, the complexity of a text is directly related to its ease of reading and to comprehension, which means it also involves particularities of the human reader, for example, age, level of knowledge, socio-cultural features, and even skill and motivation. Coherence, the main feature of good discourse and of a high-quality text, as well as a premise for reducing complexity, is also related to human perception and is very hard to measure [4]. Cohesion is a concept that is simpler to handle and operationalize, and is tightly connected to semantic similarity.

Many metrics and qualitative criteria for analyzing complexity have been proposed, as discussed in the next section, and various computer systems for computing such metrics have become available [5]. In the research presented in this paper, we used the ReaderBench NLP framework [6, 7], which integrates a wide range of metrics and makes extensive use of Cohesion Network Analysis (CNA) [8, 9] in order to represent discourse in terms of semantic links; this enables the computation of various local and global cohesion measures described later on. In addition, ReaderBench is grounded in Bakhtin's dialogism [10], which provides a unified framing for both individual and collaborative learning [9, 11].
An important parameter that should be considered for AES is the specific language. First, LSA, LDA and any statistical approaches for analyzing essays require text corpora written in the language of the essays. Second, there may be significant differences among languages with respect to the average length of sentences and even words, size of vocabulary, discourse structuring, etc. Dutch, in contrast to English, contains a high number of compound words (which inherently decreases the number of tokens per phrase); moreover, besides compound words, words in general tend to be longer [12]. With this in mind, this paper presents the stages required for porting the ReaderBench framework, which was developed mainly for English, to the Dutch language.

The paper continues with a state-of-the-art section, followed by an in-depth presentation of the steps required to build our comprehensive Dutch assessment model. Our evaluation is based on a corpus of student reports in the domain of environmental sciences. While engaging in a serious game, students adopt the role of principal researcher investigating a multifaceted environmental problem and, on various occasions throughout the game, they are required to report their findings. After discussing the results, the fifth section presents the conclusions, as well as further enhancements to be integrated within our approach.
2 State of the Art
The idea of quantifying textual complexity or difficulty has been studied intensively over the years, having in mind two major goals: presenting readers with materials aligned with their level of comprehension, and evaluating learners' abilities and knowledge levels from their writing traces. In our current research, we focus on the latter goal, evaluating students' writing capabilities in order to discover significant correlations with their knowledge level.

From a global perspective, textual complexity is relative to the student's knowledge of the domain, language familiarity, interest and personal motivation, as well as reading ability and comprehension [6]. In accordance with the Common Core State Standards Initiative [13], textual complexity can be evaluated from three different perspectives: quantitative (e.g., word frequency, word/phrase length), qualitative (e.g., clarity, structure, language familiarity) and from the reader and task orientation (e.g., motivation, prior knowledge or interest). In practice, these dimensions of textual complexity can be used to determine whether a student is prepared for college or for a career. The scope of the standard is to reduce and eliminate knowledge gaps by offering students a coherent flow of materials of slightly higher textual complexity.
A significant effort has been put into developing automated tools for textual complexity assessment as part of the linguistic research domain. E-Rater [14] is one of the first automated systems to evaluate text difficulty based on three general classes of essay features: structure (e.g., sentence syntax, proportion of spelling, grammar, usage or mechanics errors), organization based on various discourse features, and content based on prompt-specific vocabulary. Several other tools for automated essay grading or for assessing the textual complexity of a given text have been developed and employed in various educational programs [5, 15]: Lexile (MetaMetrics), ATOS (Renaissance Learning), Degrees of Reading Power: DRP Analyzer (Questar Assessment, Inc.), REAP (Carnegie Mellon University), SourceRater (Educational Testing Service), Coh-Metrix (University of Memphis), Markit (Curtin University of Technology) [16], IntelliMetric [17] or Writing Pal (Arizona State University) [18, 19].

In terms of the Dutch language, there are only a few systems that perform automated essay scoring by integrating multiple textual complexity indices. T-Scan (http://languagelink.let.uu.nl/tscan) is one of the most elaborate solutions as it considers multiple features, including [20]: lexical and sentence complexity, referential cohesion and lexical diversity, relational coherence, concreteness, personal style, verbs and time, as well as probability features, all derived from Coh-Metrix [21–23]. Besides T-Scan, various Dutch surface tools have been reported that provide lexical indices for text difficulty, as well as recommendations to reorganize the text: e.g., Texamen, Klinkende Taal and Accessibility Leesniveau Tool [24].
3 Building the Dutch Complexity Model
3.1 The NLP Processing Pipeline for Dutch Language
Before establishing a comprehensive list of textual complexity indices that can be used to predict a learner's understanding level, we first need to build a Natural Language Processing (NLP) pipeline for the Dutch language. This processing pipeline integrates key techniques that are later on also used within the scoring algorithm. Multiple challenges emerged while porting the framework from English to Dutch; thus, we see fit to provide prescriptive information regarding our NLP-specific processes.
First, a new thorough dictionary was required to perform a comprehensive cleaning of the input text, by filtering and selecting only dictionary words. Elimination of noise within the unsupervised training process of semantic models, as well as facile identification of typos, are important elements while building our textual complexity model. Moreover, as the essays used were academic reports, we were also constrained to include low-frequency, scientific words in order to be able to grasp the specificity of our texts. E-Lex (formerly named TST-lexicon) [25] is a lexical database of the Dutch language consisting of both one-word and multi-word lexicons, and it represented the best starting point after manually reviewing multiple dictionaries. Besides providing a comprehensive list of words, E-Lex was also used to build a static lemmatizer that reduces each inflected word form to its corresponding lemma, therefore normalizing the input.
Second, similar to the requirement of a new dictionary, a new stop words list (i.e., words having limited or no content information) was required in order to disregard certain words for scoring purposes. Again, upon manual review, we opted for http://snowball.tartarus.org/algorithms/dutch/stop.txt, which was expanded with numbers, interjections, as well as frequent words with low semantic meaning. These words induced noise within the topics emerging from Latent Dirichlet Allocation (LDA) [3] by having a high occurrence rate, as well as a high probability, in multiple topics.
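The first two steps above amount to a filtering stage: keep dictionary words, normalize them via a static lemma lookup, and drop stop words. The following is an illustrative Python sketch (the actual framework is implemented in Java, and the toy lexicon stands in for E-Lex):

```python
# Illustrative preprocessing stage: dictionary filtering, static
# lemmatization, and stop-word removal (the real pipeline is in Java).

def build_lemmatizer(lexicon_entries):
    """Map each inflected form to its lemma (static lookup table)."""
    return {inflected.lower(): lemma.lower() for inflected, lemma in lexicon_entries}

def preprocess(tokens, lemmas, stop_words):
    """Keep dictionary words only, lemmatize them, drop stop words."""
    cleaned = []
    for token in tokens:
        token = token.lower()
        if token not in lemmas:        # not a dictionary word -> noise/typo
            continue
        lemma = lemmas[token]
        if lemma in stop_words:        # no content information
            continue
        cleaned.append(lemma)
    return cleaned

# Toy lexicon in the spirit of E-Lex: (inflected form, lemma) pairs.
lemmas = build_lemmatizer([("huizen", "huis"), ("gelopen", "lopen"), ("de", "de")])
stop_words = {"de", "het", "een"}
print(preprocess(["De", "huizen", "xyzzy", "gelopen"], lemmas, stop_words))
# -> ['huis', 'lopen']
```

Out-of-dictionary tokens ("xyzzy") are silently discarded, mirroring the noise-elimination step described above.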
Third, new semantic models, namely vector space models based on Latent Semantic Analysis [1] and Latent Dirichlet Allocation topic distributions [3], needed to be trained. The Corpus of Contemporary Dutch (Hedendaags Nederlands; 1.35 billion words; http://corpushedendaagsnederlands.inl.nl) represented the best alternative in terms of size, breadth of topics, as well as recency of comprised documents. After preprocessing, the corpus was reduced to around 500 million content words from approximately 11.5 million paragraphs, each surpassing the imposed threshold of at least 20 content words. The LSA space was built using the stochastic SVD decomposition from Apache Mahout [26], which was applied on the term-document matrix weighted with log-entropy, across 300 dimensions. LDA made use of the parallel Gibbs sampling implemented in Mallet [27] and the model was created with 100 topics, as suggested by Blei [28]. A manual inspection of the top 100 words from each LDA topic suggested that the space was adequately constructed, as the most representative words from each topic were semantically related to one another.
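The log-entropy weighting applied to the term-document matrix before the SVD can be illustrated as follows; this is a stdlib-only Python sketch of the standard weighting scheme, not the Mahout implementation used by the authors:

```python
import math

def log_entropy_weight(counts):
    """Weight a term-document count matrix with log-entropy.

    counts: dict mapping term -> list of raw counts per document.
    Each weighted cell is local * global, where
      local  = log(1 + tf_ij)
      global = 1 + sum_j p_ij * log(p_ij) / log(n),  p_ij = tf_ij / gf_i
    """
    weighted = {}
    for term, row in counts.items():
        n = len(row)
        gf = sum(row)
        entropy = 0.0
        for tf in row:
            if tf > 0:
                p = tf / gf
                entropy += p * math.log(p) / math.log(n)
        g = 1.0 + entropy  # in [0, 1]: rare, focused terms score high
        weighted[term] = [g * math.log(1 + tf) for tf in row]
    return weighted

# A term spread evenly over all documents gets global weight ~0,
# while a term concentrated in one document keeps full weight.
w = log_entropy_weight({"spread": [1, 1, 1], "focused": [3, 0, 0]})
print(round(w["focused"][0], 3))  # global weight 1.0 -> log(4) = 1.386
```

In practice the matrix is sparse and the decomposition is computed stochastically; the weighting formula itself is what matters here.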
Fourth, complementary to our LSA and LDA models, the Open Dutch WordNet, the most complete Dutch lexical semantic database to date with more than 115,000 synsets, was also integrated, enabling the following: (a) the identification of lexical chains and word sense disambiguation [29], as well as (b) the computation of various semantic distances in ontologies, namely Wu-Palmer, Leacock-Chodorow and path length distances [30].
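As an illustration of such ontology-based measures, Wu-Palmer similarity relates the depths of two concepts to the depth of their least common subsumer (LCS) in the hypernym tree. A stdlib-only Python sketch over a toy hypernym map (the node names are invented examples, not Open Dutch WordNet identifiers):

```python
def depth(node, parent):
    """Number of edges from node up to the root of the hypernym tree."""
    d = 0
    while node in parent:
        node = parent[node]
        d += 1
    return d

def lcs(a, b, parent):
    """Least common subsumer: deepest shared ancestor of a and b."""
    ancestors = {a}
    node = a
    while node in parent:
        node = parent[node]
        ancestors.add(node)
    node = b
    while node not in ancestors:
        node = parent[node]
    return node

def wu_palmer(a, b, parent):
    """sim = 2 * depth(lcs) / (depth(a) + depth(b))."""
    c = lcs(a, b, parent)
    return 2 * depth(c, parent) / (depth(a, parent) + depth(b, parent))

# Toy hypernym tree: hond (dog) and kat (cat) share the parent huisdier (pet).
parent = {"hond": "huisdier", "kat": "huisdier", "huisdier": "dier", "dier": "entiteit"}
print(wu_palmer("hond", "kat", parent))  # 2*2 / (3+3) = 0.666...
```

Leacock-Chodorow and path-length distances are computed analogously from shortest paths in the same hierarchy.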
3.2 Textual Complexity Indices
Starting from the wide range of textual complexity indices available within the ReaderBench framework [6, 7] for English, and based on the previously described NLP processing pipeline, we present the multitude of textual complexity indices that we have made available for the Dutch language.

In contrast to the systems mentioned within the state-of-the-art section, and besides covering multiple layers of analysis ranging from surface indices and syntax to semantics, ReaderBench focuses on text cohesion and discourse connectivity. The framework provides a more in-depth perspective of discourse structure based on Cohesion Network Analysis [8, 9], a multi-layered cohesion graph [31] that considers semantic links between different text constituents. We further describe the indices integrated in our framework and used for this study, categorized by their textual analysis scope.
Surface, lexicon and syntax analyses. The first approaches to text complexity were developed by Page [32] in his search to develop an automatic grading system for students' essays. Page discovered a strong correlation between human intrinsic variables (trins) and proxes (i.e., computer approximations or textual complexity indices), thus proving that statistical analyses can provide reliable automated textual estimations. Our model integrates the most representative and predictive proxes from Page's initial study, corroborated with other surface measures frequently used in other automated essay grading systems (e.g., average word/phrase/paragraph length, average unique/content words per paragraph, average commas per sentence/paragraph). Entropy at word level, derived from Shannon's Information Theory [33], is a relevant metric for quantifying textual complexity, based on the hypothesis that a more complex text encompasses a more varied vocabulary. In contrast, character entropy is a language-specific characteristic [34] and does not exhibit a significant variance in texts written in English. Moreover, of particular interest at this level, due to the inherent implications in co-reference resolution, are the different categories of pronouns (i.e., first, second and third person, interrogative, and indefinite pronouns), implemented as predefined word lists and considered within our model. Coverage statistics with regard to specific pronoun usage were computed at sentence, paragraph, and document levels.
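Word-level entropy, in the Shannon sense, follows directly from the distribution of (lemmatized) words; a stdlib-only Python sketch:

```python
import math
from collections import Counter

def word_entropy(words):
    """Shannon entropy (in bits) of the word distribution:
    H = sum_w p(w) * log2(1 / p(w)).  Higher values indicate a more
    varied vocabulary, which the model treats as a complexity cue."""
    counts = Counter(words)
    total = len(words)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# A text that repeats one word has zero entropy; four distinct
# equiprobable words yield log2(4) = 2 bits.
print(word_entropy(["huis"] * 4))                     # 0.0
print(word_entropy(["huis", "boom", "kat", "hond"]))  # 2.0
```

The same formula applied over characters instead of words gives the character-entropy index mentioned above.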
Semantic analysis and discourse structure. In order to comprehend a text, the reader must create a coherent and well-connected representation of the information, commonly referred to as the situation model [35]. According to McNamara et al. [15], textual complexity is linked with cohesion in terms of comprehension, as the lack of cohesion can artificially increase the perceived difficulty of a text. Thus, our model uses a local and global evaluation of cohesion within the CNA graph, computed as the average value of the semantic similarities of all links at intra- and inter-paragraph levels [31, 36]. Cohesion is estimated as the average value of [6]: (a) Wu-Palmer semantic distances applied on the WordNet lexicalized ontology, (b) cosine similarity in Latent Semantic Analysis (LSA) vector space models, and (c) the inverse of the Jensen-Shannon dissimilarity (JSD) between Latent Dirichlet Allocation (LDA) topic distributions [37].
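The two distributional components of this cohesion estimate can be sketched as follows (stdlib-only Python; the short vectors stand in for real LSA vectors and LDA topic distributions):

```python
import math

def cosine(u, v):
    """Cosine similarity between two LSA-style vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def jsd(p, q):
    """Jensen-Shannon divergence between two topic distributions (base 2),
    bounded in [0, 1]; the model uses its inverse as a similarity."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)
    return (kl(p, m) + kl(q, m)) / 2

identical = [0.5, 0.3, 0.2]
print(jsd(identical, identical))         # 0.0 for identical distributions
print(round(cosine([1, 0], [1, 1]), 3))  # 0.707
```

Averaging these similarities over all intra- and inter-paragraph links of the CNA graph yields the local and global cohesion scores.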
Besides semantic models, lexical chains provide a strong basis for assessing text cohesion, and several indices have also been introduced: (a) the average and the maximum span of lexical chains (the distance in words between the first and the last occurrence of words pertaining to the same chain), (b) the average number of lexical chains per paragraph, as well as (c) the percentage of words that are included in lexical chains (i.e., words that are not isolated within the discourse, but inter-linked with other concepts from the same chain).
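Given chains already represented as lists of word positions, these indices reduce to simple aggregations; a Python sketch (chain identification itself, via WordNet relations, is not shown):

```python
def chain_span(chain):
    """Distance in words between the first and last occurrence in a chain."""
    return max(chain) - min(chain)

def lexical_chain_indices(chains, total_words):
    """Average/maximum chain span and fraction of words covered by chains."""
    spans = [chain_span(c) for c in chains]
    covered = len({pos for c in chains for pos in c})
    return {
        "avg_span": sum(spans) / len(spans),
        "max_span": max(spans),
        "pct_in_chains": covered / total_words,
    }

# Two toy chains over a 20-word text: positions of semantically related words.
chains = [[0, 4, 9], [2, 12]]
print(lexical_chain_indices(chains, total_words=20))
# -> {'avg_span': 9.5, 'max_span': 10, 'pct_in_chains': 0.25}
```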
In addition, starting from the Referentiebestand Nederlands (RBN) [38], several discourse connectors identifiable via cue phrases have been added to our complexity model in order to provide a fine-grained view over the discourse with regard to the following relevant relationships: cause, circumstance, comparison, concession, condition, conjunctive, contrast, degree, disjunctive, effect, exception, nonrestrictive, other, purpose, restriction, time, and interrogative.
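Cue-phrase-based connector detection amounts to matching predefined word lists per relationship type; an illustrative Python sketch (the Dutch cue words shown are examples, not the RBN inventory, and multi-word cue phrases are omitted for brevity):

```python
# Hypothetical excerpt of a cue-phrase inventory per relationship type;
# the actual lists are derived from the Referentiebestand Nederlands.
CONNECTORS = {
    "concession": {"hoewel", "toch"},          # although, still
    "condition": {"als", "indien", "tenzij"},  # if, in case, unless
    "cause": {"omdat", "want"},                # because
}

def count_connectors(tokens):
    """Count connector occurrences per relationship type in a token list."""
    counts = {relation: 0 for relation in CONNECTORS}
    for token in tokens:
        for relation, cues in CONNECTORS.items():
            if token.lower() in cues:
                counts[relation] += 1
    return counts

tokens = "Hoewel het regent , gaan we toch wandelen omdat het moet".split()
print(count_connectors(tokens))  # {'concession': 2, 'condition': 0, 'cause': 1}
```

Normalizing these counts per sentence or per paragraph yields the connector indices used in the model.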
Word complexity represents a mixture of different layers of discourse analysis covering a wide set of estimators for each word's difficulty: (a) syllable count, (b) distance in characters between the inflected form, lemma and word stem (adding multiple prefixes or suffixes increases the difficulty of using a certain word), (c) specificity reflected in the inverse document frequency within the LSA/LDA training corpus, (d) the average and the maximum path distance in the hypernym tree based on all word senses, and (e) the word polysemy count from WordNet [39]. In order to reflect individual scores at sentence and paragraph level, all these indices were averaged, taking into consideration only lemmatized content words generated after applying the NLP processing pipeline. Moreover, normalized occurrences at both paragraph and sentence levels of all major word categories from the Dutch LIWC dictionary [40] have been considered, providing additional insights in terms of underlying concept categories.
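Two of these estimators are straightforward to illustrate; a Python sketch of the inflected-form/lemma character distance and the inverse document frequency (both simplified relative to the actual indices, which also involve the word stem):

```python
import math

def affix_distance(inflected, lemma):
    """Character-length difference between an inflected form and its lemma;
    a rough proxy for added prefixes/suffixes."""
    return abs(len(inflected) - len(lemma))

def idf(word, documents):
    """Inverse document frequency: rare, specific words score high."""
    df = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / df)

docs = [{"huis", "boom"}, {"huis", "fotosynthese"}, {"huis", "kat"}]
print(affix_distance("gelopen", "lopen"))   # 2
print(round(idf("fotosynthese", docs), 3))  # log(3/1) = 1.099
print(idf("huis", docs))                    # log(3/3) = 0.0
```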
3.3 Automated Text Segmentation
The previously introduced textual complexity indices become less relevant when facing longer documents comprising thousands or tens of thousands of words. Besides the computational power required for building a complete CNA graph that captures all potential cohesive links, different sections might exhibit different traits which can be easily disregarded at document level. A commonly encountered approach is to automatically split longer texts using an imposed fixed window of words, the most frequently used threshold being 1,000 words [5]. However, this method fails to consider the natural discourse structure of the text and its hierarchical decomposition: most documents contain sections, subsections and so forth, constituent elements that emerge as a more viable manner of splitting the text. Therefore, the headings from the initial document produce a hierarchical structure in which each section contains its own text and a (possibly empty) list of subsections.
Thus, we developed a new segmentation method applicable to Microsoft Word documents, assuming that sections are correctly annotated with the appropriate heading styles reflecting their hierarchical structure (e.g., Heading 1 is automatically considered a section, Heading 2 a subsection, Heading 3 a subsubsection, etc.). From a technical perspective, since the entire framework is written in Java, we opted to rely on the Apache POI library (https://poi.apache.org) for parsing the .docx documents. The newly generated meta-document contains multiple layers of well-defined and self-contained document segments on which we can apply the previously introduced textual complexity indices. The results for each textual complexity index and for each extracted section are averaged in order to obtain the scores for the entire meta-document.
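The heading-based hierarchy construction can be sketched as follows; a Python illustration of the tree-building step (the actual implementation parses .docx heading styles via Apache POI, in Java):

```python
def build_hierarchy(blocks):
    """Fold a flat list of (heading_level, title) blocks into a section tree.

    blocks: sequence of (level, title), where level 1 = section,
    2 = subsection, etc.  Each node holds its title and subsections.
    """
    root = {"title": None, "subsections": []}
    stack = [(0, root)]  # (level, node) path from root to current section
    for level, title in blocks:
        node = {"title": title, "subsections": []}
        # Pop back up until the top of the stack is shallower than this heading.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1]["subsections"].append(node)
        stack.append((level, node))
    return root

doc = [(1, "Analysis"), (2, "Problem"), (2, "Stakeholders"), (1, "Design")]
tree = build_hierarchy(doc)
print([s["title"] for s in tree["subsections"]])
# -> ['Analysis', 'Design']
print([s["title"] for s in tree["subsections"][0]["subsections"]])
# -> ['Problem', 'Stakeholders']
```

Each node of the resulting tree then receives its own text span, and the complexity indices are computed per node and averaged upward.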
4 Results
4.1 Corpus
The corpus used for performing a preliminary validation of our model consisted of 173 technical reports in Dutch written by master's degree students from the Open University of the Netherlands and Utrecht University. The students play an online game in the domain of environmental policy, which confronts them with multidimensional environmental problems. During the game, they are required to upload technical reports about their findings in subsequent stages (i.e., analysis, 2 design tasks, 2 evaluation tasks and a final evaluation) [41]. As these reports need to be evaluated manually by teachers within very short time spans, the need for Automated Essay Scoring arose. All essays are scored by human tutors on the basis of an assessment framework, and scores express a linear variable ranging from 1 (utterly weak) to 10 (excellent). The reports used for this experiment address only the first stage (i.e., analysis) and contained an average of 1,832 words (SD = 790), ranging from a minimum of 243 words to a maximum of 6,186 words. All reports were manually corrected in terms of formatting in order to ensure an appropriate usage of heading styles, a process that afterwards facilitated their automated assessment.
Because of the limited number of students whose scores span multiple levels, we applied a binary split of student scores into two distinct classes: high performance students with scores ≥ 7, while the rest were catalogued as low performance students. Moreover, for the scope of these preliminary experiments, we opted to rely only on the LDA topic model besides WordNet, instead of both LSA and LDA. This was due to the fact that only the LDA space was inspected by native speakers with regard to comprising relevant word associations within corresponding topics.
4.2 Statistical Analyses

Indices that were not normally distributed were first eliminated (e.g., average number of sentences, words and content words, average number of commas at paragraph and sentence levels, word polysemy counts, different connectors and word lists at paragraph and sentence level). Correlations between the selected indices and the dependent variable (the students' score for their technical report) were then calculated for the remaining indices to determine whether there was a statistically significant relation (p < .05). Indices that were highly collinear (r ≥ .9) were flagged, and the index with the strongest correlation with the score assigned to each report was retained, while the other indices were removed. The remaining indices were included as predictor variables in a stepwise regression to explain the variance in the students' scores, as well as predictors in a Discriminant Function Analysis [42] used to classify students based on their performance.
4.3 Relationship Between ReaderBench and Students' Final Scores

To address our research question of automatically scoring students' reports, we conducted correlations between their final scores and the ReaderBench indices that were normally distributed and not multicollinear. As shown in Table 1, medium to weak effects were found for ReaderBench indices related to the number of words, paragraphs, unique words per sentence, lexical chains, lower local cohesion induced by a more varied vocabulary (higher word entropy), different types of discourse connectors at both sentence and paragraph levels (concession, condition, circumstance), as well as pronouns (both third person and indefinite). Overall, higher-scoring students wrote longer reports in terms of words and paragraphs, exhibited greater word entropy, used more discourse connectors and pronouns, and produced more unique words. Moreover, the negative correlation with sentence-paragraph cohesion suggests that higher-scoring students elaborated paragraphs that reflect a mixture of diverse ideas.
Table 1. Correlations between ReaderBench indices and report score.

| Index | r | p |
|---|---|---|
| Logarithmic number of words | .461 | <.001 |
| Average number of lexical chains per paragraph | .338 | <.001 |
| Average sentence-paragraph cohesion (Wu-Palmer semantic distance in WordNet) | -.284 | <.001 |
| Average number of concession connectors per paragraph | .269 | <.001 |
| Average number of condition connectors per paragraph | .260 | .001 |
| Word entropy | .258 | .001 |
| Average number of circumstance connectors per paragraph | .254 | .001 |
| Percentage of words that are included in lexical chains | .250 | .001 |
| Average number of indefinite pronouns per sentence | .237 | .002 |
| Average sentence length (number of characters) | .193 | .011 |
| Average number of third person pronouns per sentence | .187 | .014 |
| Average number of circumstance connectors per sentence | .187 | .014 |
| Average number of unique content words per sentence | .184 | .015 |
| Number of paragraphs | .160 | .035 |
| Average number of condition connectors per sentence | .154 | .044 |
4.4 Regression Analysis and Discriminant Function Analysis
To analyze which ReaderBench features best predicted the students' score, we conducted a stepwise regression analysis using the 15 significant indices as the independent variables. This yielded a significant model, F(3, 169) = 24.676, p < .001, r = .552, R² = .305. Three variables were significant and positive predictors of report scores: logarithmic number of words, average number of indefinite pronouns per sentence, and percentage of words that are included in lexical chains. These variables explained 30.5% of the variance in the students' report scores.

The stepwise Discriminant Function Analysis (DFA) retained three different variables as significant predictors (i.e., 1. logarithmic number of words, 2. average number of indefinite pronouns per sentence, and 3. average sentence-paragraph cohesion using the Wu-Palmer semantic distance), and removed the remaining variables as non-significant predictors.
The results show that the DFA using these three indices correctly allocated 132 of the 173 students from our dataset, χ²(df = 3, n = 173) = 40.948, p < .001, for an accuracy of 76.3% (the chance level for this analysis is 50%). For the leave-one-out cross-validation (LOOCV), the discriminant analysis correctly allocated 130 of the 173 students, for an accuracy of 75.1% (see the confusion matrix reported in Table 2 for results). The measure of agreement between the actual student performance and that assigned by our model produced a weighted Cohen's kappa of .517, demonstrating moderate agreement.
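For a binary split, the weighted kappa reduces to the unweighted statistic, which can be recomputed directly from the whole-set confusion matrix reported in Table 2; a stdlib-only Python sketch:

```python
def cohens_kappa(matrix):
    """Cohen's kappa from a square confusion matrix (rows = actual,
    columns = predicted): kappa = (p_o - p_e) / (1 - p_e)."""
    total = sum(sum(row) for row in matrix)
    # Observed agreement: fraction on the diagonal.
    p_observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Chance agreement: product of marginal row/column proportions.
    p_expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / total ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Whole-set confusion matrix from Table 2 (actual low/high vs. predicted).
matrix = [[54, 21], [20, 78]]
print(round(cohens_kappa(matrix), 3))  # 0.517, matching the reported value
```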
5 Conclusions
The ReaderBench NLP framework was extended to support automatic scoring of students' technical reports written in Dutch. Existing textual complexity indices and methods had to be adapted from English and specifically tweaked for the Dutch language, thus introducing one of the most comprehensive models available for Dutch to our knowledge. Moreover, we have also introduced an automatic segmentation method that creates a hierarchical structure based on document sections and subsections.
Table 2. Confusion matrix for DFA classifying students based on performance.

| | Predicted Low | Predicted High | Total |
|---|---|---|---|
| Whole set: Low | 54 | 21 | 75 |
| Whole set: High | 20 | 78 | 98 |
| Cross-validated: Low | 53 | 22 | 75 |
| Cross-validated: High | 21 | 77 | 98 |
Initial results indicate that our model, which goes beyond a mere replication of the English version of ReaderBench due to the performed customizations, has a high accuracy and is suitable for automatically scoring Dutch technical reports. In addition, the performance of our model is comparable to systems available for English. Our framework integrates the widest range of textual complexity indices available for Dutch, emphasizing the semantic dimension of the analysis instead of the frequently used surface measures. Nevertheless, we must point out that the variance explained by the regression model, as well as the weighted Cohen's kappa, are rather low in contrast to the accuracy of the DFA model, which only assumes a binary classification. Only the index with the highest correlation (i.e., logarithmic number of words) was retained in both the linear regression and the DFA model. The remaining indices are specific to each model, which are fundamentally different: the regression model predicts a linear score, while the DFA performs a classification into two performance categories.
As limitations, we must also point out the discrepancies in the evaluation of the technical reports, as the automatic evaluation is mostly focused on students' writing style, while the tutors evaluate the technical quality of the report. Moreover, the population for our study consists of master's degree students who have, in general, relatively high writing skills; in turn, this may reduce the variance in complexity among the essays. Therefore, new metrics should be introduced in order to address the technical soundness of a document in relation to a given theme or an imposed set of topics of interest. Moreover, the Dutch language imposes additional challenges, like the high number of compound words. With respect to building semantic models, these words could be more relevant if taken separately. Thus, automated splitting rules should be enforced upon compound words in order to provide a clearer contextualization of the input text.
Acknowledgments. This work was partially funded by the 644187 EC H2020 Realising an Applied Gaming Eco-system (RAGE) project, by the FP7 208-212578 LTfLL project, as well as by University Politehnica of Bucharest through the Excellence Research Grants Program UPB-GEX 12/26.09.2016.
References
1. Landauer, T.K., Dumais, S.T.: A solution to Platos problem: the Latent Semantic Analysis
theory of acquisition, induction and representation of knowledge. Psychol. Rev. 104(2),
211240 (1997)
2. Miller, T.: Essay assessment with Latent Semantic Analysis. J. Educ. Comput. Res. 29(4),
495512 (2003)
3. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet Allocation. J. Mach. Learn. Res. 3(45),
9931022 (2003)
4. Crossley, S.A., McNamara, D.S.: Text coherence and judgments of essay quality: models of
quality and coherence. In: 33rd Annual Conference of the Cognitive Science Society,
pp. 12361231. Cognitive Science Society, Boston (2011)
5. Nelson, J., Perfetti, C., Liben, D., Liben, M.: Measures of Text Difculty: Testing their
Predictive Value for Grade Levels and Student Performance. Council of Chief State School
Ofcers, Washington, DC (2012)
6. Dascalu, M.: Analyzing Discourse and text complexity for learning and collaborating,
Studies in Computational Intelligence, vol. 534. Springer, Cham (2014)
7. Dascalu, M., Dessus, P., Bianco, M., Trausan-Matu, S., Nardy, A.: Mining texts, learner
productions and strategies with ReaderBench. In: Peña-Ayala, A. (ed.) Educational Data
Mining. SCI, vol. 524, pp. 345377. Springer, Cham (2014). doi:10.1007/978-3-319-02738-
8_13
8. Dascalu, M., McNamara, D.S., Trausan-Matu, S., Stavarache, L.L., Allen, L.K.: Cohesion
network analysis of CSCL participation. Behavior Research Methods, PP. 116 (2017)
9. Dascalu, M., Trausan-Matu, S., McNamara, D.S., Dessus, P.: ReaderBench automated
evaluation of collaboration based on cohesion and dialogism. Int. J. Comput. Support.
Collaborative Learn. 10(4), 395423 (2015)
10. Bakhtin, M.M.: The dialogic imagination: four essays. The University of Texas Press, Austin
(1981)
11. Dascalu, M., Allen, K.A., McNamara, D.S., Trausan-Matu, S., Crossley, S.A.: Modeling
comprehension processes via automated analyses of dialogism. In: 39th Annual Meeting of
the Cognitive Science Society (CogSci 2017). Cognitive Science Society, London (2017, in
Press)
12. Duyck, W., Desmet, T., Verbeke, L.P., Brysbaert, M.: WordGen: A tool for word selection
and nonword generation in Dutch, English, German, and French. Behav. Res. Methods
36(3), 488499 (2004)
13. National Governors Association Center for Best Practices & Council of Chief State School
Ofcers: Common Core State Standards. Authors, Washington D.C. (2010)
14. Powers, D.E., Burstein, J., Chodorow, M., Fowles, M.E., Kukich, K.: Stumping e-rater®:
Challenging the Validity of Automated Essay Scoring. Educational Testing Service,
Princeton (2001)
15. McNamara, D.S., Graesser, A.C., Louwerse, M.M.: Sources of text difficulty: Across the
ages and genres. In: Sabatini, J.P., Albro, E., O'Reilly, T. (eds.) Measuring up: Advances in
How we Assess Reading Ability, pp. 89–116. R&L Education, Lanham (2012)
16. Williams, R., Dreher, H.: Automatically grading essays with Markit©. J. Issues Informing
Sci. Inform. Technol. 1, 693–700 (2004)
17. Elliot, S.: IntelliMetric: from here to validity. In: Shermis, M.D., Burstein, J.C. (eds.)
Automated Essay Scoring: A Cross Disciplinary Approach, pp. 71–86. Lawrence Erlbaum
Associates, Mahwah (2003)
18. Crossley, S.A., Allen, L.K., McNamara, D.S.: The Writing Pal: a writing strategy tutor. In:
Crossley, S.A., McNamara, D.S. (eds.) Handbook on Educational Technologies for Literacy.
Taylor & Francis, Routledge, New York (in press)
19. McNamara, D.S., Crossley, S.A., Roscoe, R., Allen, L.K., Dai, J.: A hierarchical
classification approach to automated essay scoring. Assessing Writ. 23, 35–59 (2015)
20. Pander Maat, H.L.W., Kraf, R.L., van den Bosch, A., van Gompel, M., Kleijn, S., Sanders,
T.J.M., van der Sloot, K.: T-Scan: a new tool for analyzing Dutch text. Comput. Linguist.
Neth. J. 4, 53–74 (2014)
21. Graesser, A.C., McNamara, D.S., Louwerse, M.M., Cai, Z.: Coh-Metrix: Analysis of text on
cohesion and language. Behav. Res. Methods Instrum. Comput. 36(2), 193–202 (2004)
22. Graesser, A.C., McNamara, D.S., Kulikowich, J.M.: Coh-Metrix: Providing multilevel
analyses of text characteristics. Educ. Res. 40(5), 223–234 (2011)
23. McNamara, D.S., Graesser, A.C., McCarthy, P., Cai, Z.: Automated Evaluation of Text and
Discourse with Coh-Metrix. Cambridge University Press, Cambridge (2014)
24. Kraf, R., Lentz, L., Pander Maat, H.: Drie Nederlandse instrumenten voor het automatisch
voorspellen van begrijpelijkheid. Een klein consumentenonderzoek [Three Dutch instruments
for automatically predicting comprehensibility: a small consumer study]. Tijdschrift voor
Taalbeheersing 33(3), 249–265 (2011)
25. CGN Consortium: e-Lex, lexicale databank (lexical database). Instituut voor Nederlandse
Taal, Leiden, the Netherlands (2017)
26. Owen, S., Anil, R., Dunning, T., Friedman, E.: Mahout in Action. Manning Publications
Co., Greenwich (2011)
27. McCallum, A.K.: MALLET: A Machine Learning for Language Toolkit (2002). http://
mallet.cs.umass.edu/
28. Blei, D.M.: Probabilistic topic models. Commun. ACM 55(4), 77–84 (2012)
29. Galley, M., McKeown, K.: Improving word sense disambiguation in lexical chaining. In:
18th International Joint Conference on Artificial Intelligence (IJCAI 2003), pp. 1486–1488.
Morgan Kaufmann Publishers, Inc., Acapulco (2003)
30. Budanitsky, A., Hirst, G.: Evaluating WordNet-based measures of lexical semantic
relatedness. Comput. Linguist. 32(1), 13–47 (2006)
31. Trausan-Matu, S., Dascalu, M., Dessus, P.: Textual complexity and discourse structure in
computer-supported collaborative learning. In: Cerri, S.A., Clancey, W.J.,
Papadourakis, G., Panourgia, K. (eds.) ITS 2012. LNCS, vol. 7315, pp. 352–357. Springer,
Heidelberg (2012). doi:10.1007/978-3-642-30950-2_46
32. Wresch, W.: The imminence of grading essays by computer – 25 years later. Comput.
Compos. 10(2), 45–58 (1993)
33. Shannon, C.E.: Prediction and entropy of printed English. Bell Syst. Tech. J. 30, 50–64
(1951)
34. Gervasi, V., Ambriola, V.: Quantitative assessment of textual complexity. In: Barbaresi, M.L.
(ed.) Complexity in Language and Text, pp. 197–228. Plus, Pisa, Italy (2002)
35. van Dijk, T.A., Kintsch, W.: Strategies of Discourse Comprehension. Academic Press,
New York (1983)
36. Dascalu, M., Dessus, P., Trausan-Matu, Ş., Bianco, M., Nardy, A.: ReaderBench, an
environment for analyzing text complexity and reading strategies. In: Lane, H.C., Yacef,
K., Mostow, J., Pavlik, P. (eds.) AIED 2013. LNCS, vol. 7926, pp. 379–388. Springer,
Heidelberg (2013). doi:10.1007/978-3-642-39112-5_39
37. Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT
Press, Cambridge (1999)
38. van der Vliet, H.: The Referentiebestand Nederlands as a multi-purpose lexical database. Int.
J. Lexicogr. 20(3), 239–257 (2007)
39. Miller, G.A.: WordNet: a lexical database for English. Commun. ACM 38(11), 39–41
(1995)
40. Zijlstra, H., van Meerveld, T., van Middendorp, H., Pennebaker, J.W., Geenen, R.:
De Nederlandse versie van de Linguistic Inquiry and Word Count (LIWC), een
gecomputeriseerd tekstanalyseprogramma [Dutch version of the Linguistic Inquiry and
Word Count (LIWC), a computerized text analysis program]. Gedrag & Gezondheid 32,
273–283 (2004)
41. Westera, W., Nadolski, R., Hummel, H.: Serious gaming analytics: what students' log files
tell us about gaming and learning. Int. J. Serious Games 1(2), 35–50 (2014)
42. Klecka, W.R.: Discriminant Analysis. Quantitative Applications in the Social Sciences
Series, vol. 19. Sage Publications, Thousand Oaks (1980)
... In this paper, we investigate how automated essay scoring methods can help to reduce teachers' workload. In a preliminary study, the automated essay scoring method for the reports in the VIBOA game demonstrated a validity level of 75.1% in correctly predicting a pass or a failure (Dascalu, Westera, Ruseti, Trausan-Matu, & Kurvers, 2017). Although this rate is substantial, it is considered insufficient for practical application. ...
... In a previous study (Dascalu, McNamara, et al., 2017; Dascalu, Westera, et al., 2017), the ReaderBench framework (Dascalu, 2014; Dascalu et al., 2014) was used to evaluate the quality of the technical reports from the previous dataset. A standard NLP pipeline for the Dutch language was used to process each essay, followed by the computation of a series of textual complexity indices. ...
... Text cohesion is evaluated at local and global levels, by using the CNA cohesion graph (Dascalu et al., 2017; Dascalu, Trausan-Matu, McNamara, & Dessus, 2015). This graph is constructed using the semantic similarities computed with the LSA and LDA models, as well as WordNet-based distances. ...
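The cohesion graph described in the snippet above links text fragments by their semantic similarity. As a rough, self-contained illustration of the idea only — using plain lexical cosine similarity over word counts instead of trained LSA/LDA models or WordNet distances, with made-up sentences and an arbitrary threshold — such a graph can be sketched as:

```python
from collections import Counter
from itertools import combinations
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters.
    num = sum(a[w] * b[w] for w in a)
    den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def cohesion_graph(sentences, threshold=0.1):
    # Nodes are sentence indices; an edge links every pair whose
    # similarity exceeds the threshold, weighted by that similarity.
    vectors = [Counter(s.lower().split()) for s in sentences]
    edges = {}
    for i, j in combinations(range(len(sentences)), 2):
        sim = cosine(vectors[i], vectors[j])
        if sim > threshold:
            edges[(i, j)] = round(sim, 3)
    return edges

graph = cohesion_graph([
    "the cohesion graph links related sentences",
    "related sentences share semantic content",
    "unrelated material gets no edge at all",
])
print(graph)  # only the first two sentences are connected
```

In ReaderBench the edge weights come from semantic models rather than raw word overlap, but the resulting weighted graph is used the same way: local cohesion from edges between adjacent units, global cohesion from edges to the enclosing section.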
... (b) linguistic services used to process and append semantic meta-information to text resources, and (c) linguistic applications that rely on machine learning and data mining techniques, and are designed for various educational experiments and visualizations. ReaderBench implements various metrics and categories of textual complexity indices that can be used to leverage the automated classification of datasets in multiple languages, such as English [2], French [3], Romanian [4], and Dutch [5]. ...
... The semantic cohesion of the texts increased over time, but no significant differences were observed between the two regions. The fourth experiment [5] was run on a set of 173 technical reports written in Dutch belonging to high or low performance students. Due to the length of the documents, a multi-level hierarchical structure was automatically generated based on the section headings. ...
Conference Paper
Full-text available
Assessing textual complexity is a difficult, but important endeavor, especially for adapting learning materials to students’ and readers’ levels of understanding. With the continuous growth of information technologies spanning through various research fields, automated assessment tools have become reliable solutions to automatically assessing textual complexity. ReaderBench is a text processing framework relying on advanced Natural Language Processing techniques that encompass a wide range of text analysis modules available in a variety of languages, including English, French, Romanian, and Dutch. To our knowledge, ReaderBench is the only open-source multilingual textual analysis solution that provides unified access to more than 200 textual complexity indices including: surface, syntactic, morphological, semantic, and discourse specific factors, alongside cohesion metrics derived from specific lexicalized ontologies and semantic models.
... Several free-text (e.g., essay) writing assessment tools have been proposed and developed for L2 tasks; some of them are: automatic essay feedback generation [30], TAACO (Tool for the Automatic Analysis of Cohesion) [152,162], ETS e-rater® scoring [163], Coh-Metrix [164], GAMET (Grammar And Mechanics Error Tool) [165], ReaderBench (a multi-lingual framework for analyzing text complexity) [166,167], ACT CRASE+® [168], CNN-based AES [169], and CRNN-based AES [170]. ...
Preprint
Full-text available
... AES models based on the feature-engineering approach predict scores using textual features that are manually designed by human experts (e.g., Dascalu et al. 2017; Mark and Shermis 2016; Nguyen and Litman 2018). Typical features include essay length and the number of grammatical and spelling errors. ...
Article
Full-text available
Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to grading by humans. Although traditional AES models typically rely on manually designed features, deep neural network (DNN)-based AES models that obviate the need for feature engineering have recently attracted increased attention. Various DNN-AES models with different characteristics have been proposed over the past few years. To our knowledge, however, no study has provided a comprehensive review of DNN-AES models while introducing each model in detail. Therefore, this review presents a comprehensive survey of DNN-AES models, describing the main idea and detailed architecture of each model. We classify the AES task into four types and introduce existing DNN-AES models according to this classification.
... Many AES methods have been developed over the past decades, and can generally be categorized as feature-engineering and neural-network approaches (Hussein et al., 2019; Ke and Ng, 2019). The feature-engineering approach predicts scores using handcrafted features such as essay length or spelling errors (e.g., (Amorim et al., 2018; Dascalu et al., 2017; Mark D. Shermis, 2016; Nguyen and Litman, 2018)). The advantages of this approach include interpretability and explainability. ...
Conference Paper
Full-text available
Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to grading by human raters. Conventional AES typically relies on handcrafted features, whereas recent studies have proposed AES models based on deep neural networks (DNNs) to obviate the need for feature engineering. Furthermore, hybrid methods that integrate handcrafted features in a DNN-AES model have been recently developed and have achieved state-of-the-art accuracy. One of the most popular hybrid methods is formulated as a DNN-AES model with an additional recurrent neural network (RNN) that processes a sequence of handcrafted sentence-level features. However, this method has the following problems: 1) It cannot incorporate effective essay-level features developed in previous AES research. 2) It greatly increases the numbers of model parameters and tuning parameters, increasing the difficulty of model training. 3) It has an additional RNN to process sentence-level features, making extension to various DNN-AES models complex. To resolve these problems, we propose a new hybrid method that integrates handcrafted essay-level features into a DNN-AES model. Specifically, our method concatenates handcrafted essay-level features to a distributed essay representation vector, which is obtained from an intermediate layer of a DNN-AES model. Our method is a simple DNN-AES extension, but significantly improves scoring accuracy.
... The feature-engineering approach predicts scores using manually tuned features such as essay length and number of spelling errors (e.g., [3,5,22,28]). Advantages of this approach include interpretability and explainability. However, these approaches generally require extensive feature redesigns to achieve high prediction accuracy. ...
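Several of the snippets above describe the feature-engineering approach: extract handcrafted indices such as essay length and spelling-error counts, then combine them in a scoring model. The following toy sketch makes that pipeline concrete; the word list, weights, and bias are invented for illustration and are not taken from any of the cited systems, which tune such parameters on graded corpora.

```python
# Hypothetical mini-lexicon standing in for a real spell-check dictionary.
KNOWN_WORDS = {"the", "essay", "argues", "that", "automated", "scoring",
               "helps", "teachers"}

def extract_features(essay):
    # Two classic handcrafted features: token count and
    # number of tokens not found in the lexicon ("spelling errors").
    tokens = essay.lower().split()
    length = len(tokens)
    spelling_errors = sum(1 for t in tokens if t not in KNOWN_WORDS)
    return [length, spelling_errors]

def score(essay, weights=(0.5, -1.0), bias=1.0):
    # Toy linear scorer: length raises the score, misspellings lower it.
    # In real systems these weights come from regression on human grades.
    feats = extract_features(essay)
    return bias + sum(w * f for w, f in zip(weights, feats))

print(score("the essay argues that automated scoring helps teachers"))  # 5.0
```

The interpretability advantage mentioned in the snippets follows directly from this form: each feature's contribution to the final score is just its weight times its value.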
Chapter
Full-text available
Automated essay scoring (AES) is the task of automatically assigning scores to essays as an alternative to human grading. Conventional AES methods typically rely on manually tuned features, which are laborious to effectively develop. To obviate the need for feature engineering, many deep neural network (DNN)-based AES models have been proposed and have achieved state-of-the-art accuracy. DNN-AES models require training on a large dataset of graded essays. However, assigned grades in such datasets are known to be strongly biased due to effects of rater bias when grading is conducted by assigning a few raters in a rater set to each essay. Performance of DNN models rapidly drops when such biased data are used for model training. In the fields of educational and psychological measurement, item response theory (IRT) models that can estimate essay scores while considering effects of rater characteristics have recently been proposed. This study therefore proposes a new DNN-AES framework that integrates IRT models to deal with rater bias within training data. To our knowledge, this is a first attempt at addressing rating bias effects in training data, which is a crucial but overlooked problem.
... The aim of this experiment is to build a comprehensive Automated Essay Scoring model for the Dutch language (Dascalu et al., 2017c) and also to propose a method for reducing the number of essays that need to be graded by teachers, while maintaining an acceptable accuracy. However, text complexity is difficult to define and cannot be measured with only a few metrics. ...
Thesis
Full-text available
... Several existing text analysis tools can calculate a huge number of textual features for input texts. ReaderBench [22] is an open-source multilingual framework that makes use of natural language processing techniques to provide text analysis tools. The framework is multilingual [23]: text analysis tools are available in Dutch, French, Romanian, and English. ...
Chapter
Full-text available
For predicting and improving the quality of essays, text analytic metrics (surface, syntactic, morphological and semantic features) can be used to provide formative feedback to the students. In this study, the intent was to find a small number of features that exhibit a fair proxy of the scores given by the human raters. Using an existing corpus and a text analysis tool for the Dutch language, a large number of features were extracted. Artificial neural networks, the Levenberg–Marquardt algorithm and backward elimination were used to reduce the number of extracted features automatically. Irrelevant features were eliminated based on the inter-rater agreement between predicted and human scores calculated using Cohen's kappa (κ). By using our algorithm, the number of features in this study was reduced from 457 to 23. The selected features were grouped into six different categories. Of these categories, we believe that the features present in the groups "Word Difficulty" and "Lexical Diversity" are most useful for providing automated formative feedback to the students. The approach presented in this research paper is the first step towards our ultimate goal of providing meaningful formative feedback to the students for enhancing their writing skills and capabilities.
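The abstract above uses Cohen's kappa as the criterion for agreement between predicted and human scores. The statistic corrects raw percent agreement for agreement expected by chance. A minimal sketch, with made-up pass/fail labels rather than the study's data:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    # kappa = (observed agreement - chance agreement) / (1 - chance agreement)
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's marginal label frequencies.
    expected = sum(counts_a[c] * counts_b[c]
                   for c in set(rater_a) | set(rater_b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical human vs. predicted pass/fail labels for six essays.
human     = ["pass", "pass", "fail", "fail", "pass", "fail"]
predicted = ["pass", "pass", "fail", "pass", "pass", "fail"]
print(round(cohens_kappa(human, predicted), 3))  # 0.667
```

Here five of six labels agree (83%), but kappa is only 0.667 because half that agreement would be expected by chance given the label frequencies, which is why kappa is the stricter criterion for feature selection.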
... This study demonstrates the effectiveness of the proposed model through experiments using actual data. AES methods using machine learning, such as topic models or deep neural networks, have recently attracted attention in AI fields as another approach to reduce the burden of essay grading [2,3,10,16,20,41]. The accuracy of such methods is limited, however. ...
Chapter
Full-text available
Essay-writing tests are widely used in various assessment contexts to measure higher-order abilities of learners. However, a persistent difficulty is that ability measurement accuracy strongly depends on rater characteristics. To resolve this problem, many item response theory (IRT) models have been proposed that can estimate learners’ abilities in consideration of rater effects. One remaining difficulty, however, is that measurement accuracy is reduced when few raters are assigned to each essay, a common situation in practical testing contexts. To address this problem, we propose a new rater-effect IRT model integrating a supervised topic model that can estimate abilities from raters’ scores and the textual content of written essays. By reflecting textual content features in IRT-based ability estimates, the model can improve ability measurement accuracy when there are few raters for each essay. Furthermore, learners’ abilities can be estimated using essay textual content alone, without ratings, when model parameters are known. Finally, scores for unrated essays can be estimated from textual content, so the model can be used for automated essay scoring. We evaluate the effectiveness of the proposed model through experiments using actual data.
... The multi-layered textual complexity model behind ReaderBench is highly extensible and supports different languages, such as: English (Crossley, Paquette, Dascalu, McNamara, & Baker, 2016; Dascalu, Popescu, Becheru, Crossley, & Trausan-Matu, 2016), French (Dascalu, Stavarache, Trausan-Matu, Dessus, & Bianco, 2014), Romanian (Gifu, Dascalu, Trausan-Matu, & Allen, 2016), and Dutch (Dascalu, Westera, Ruseti, Trausan-Matu, & Kurvers, 2017). In addition, several languages including Spanish, Italian, and Latin are partially covered. ...
Article
Full-text available
The broad use of computer-supported collaborative-learning (CSCL) environments (e.g., instant messenger chats, forums, blogs in online communities, and massive open online courses) calls for automated tools to support tutors in the time-consuming process of analyzing collaborative conversations. In this article, the authors propose and validate the cohesion network analysis (CNA) model, housed within the ReaderBench platform. CNA, grounded in theories of cohesion, dialogism, and polyphony, is similar to social network analysis (SNA), but it also considers text content and discourse structure and, uniquely, uses automated cohesion indices to generate the underlying discourse representation. Thus, CNA enhances the power of SNA by explicitly considering semantic cohesion while modeling interactions between participants. The primary purpose of this article is to describe CNA analysis and to provide a proof of concept, by using ten chat conversations in which multiple participants debated the advantages of CSCL technologies. Each participant’s contributions were human-scored on the basis of their relevance in terms of covering the central concepts of the conversation. SNA metrics, applied to the CNA sociogram, were then used to assess the quality of each member’s degree of participation. The results revealed that the CNA indices were strongly correlated to the human evaluations of the conversations. Furthermore, a stepwise regression analysis indicated that the CNA indices collectively predicted 54% of the variance in the human ratings of participation. The results provide promising support for the use of automated computational assessments of collaborative participation and of individuals’ degrees of active involvement in CSCL environments.
Article
Full-text available
In this paper we explore existing log files of the VIBOA environmental policy game. Our aim is to identify relevant player behaviours and performance patterns. The VIBOA game is a 50-hour master-level serious game that supports inquiry-based learning: students adopt the role of an environmental consultant in the (fictitious) consultancy agency VIBOA, and have to deal with complex, multi-faceted environmental problems in an academic and methodologically sound way. A sample of 118 master students played the game. We used learning analytics to extract relevant data from the logging and find meaningful patterns and relationships. We observed substantial behavioural variability across students. Correlation analysis suggests a behavioural trait that reflects the rate of “switching” between different game objects or activities. We were able to establish a model that uses switching indicators as predictors for the efficiency of learning. We also found slight evidence that students who display increased switching behaviours need more time to complete the games. We conclude the paper by critically evaluating our findings, making explicit the limitations of our study and making suggestions for future research that links together learning analytics and serious gaming.
Article
Full-text available
As Computer Supported Collaborative Learning (CSCL) gains a broader usage, the need for automated tools capable of supporting tutors in the time-consuming process of analyzing conversations becomes more pressing. Moreover, collaboration, which presumes the intertwining of ideas or points of view among participants, is a central element of dialogue performed in CSCL environments. Therefore, starting from dialogism and a cohesion-based model of discourse, we propose and validate two computational models for assessing collaboration. The first model is based on a cohesion graph and can be perceived as a longitudinal analysis of the ongoing conversation, thus accounting for collaboration from a social knowledge-building perspective. In the second approach, collaboration is regarded from a dialogical perspective as the intertwining or synergy of voices pertaining to different speakers, therefore enabling a transversal analysis of subsequent discussion slices.
Conference Paper
Dialogism provides the grounds for building a comprehensive model of discourse and is focused on the multiplicity of perspectives (i.e., voices). Dialogism can be present in any type of text, while voices become themes or recurrent topics emerging from the discourse. In this study, we examine the extent to which differences between self-explanations and think-alouds can be detected using computational textual indices derived from dialogism. Students (n = 68) read a text about natural selection and were instructed to generate self-explanations or think-alouds. The linguistic features of these text responses were analyzed using ReaderBench, an automated text analysis tool. A discriminant function analysis using these features correctly classified 80.9% of the students’ assigned experimental conditions (self-explanation vs. think-aloud). Our results indicate that self-explanation promotes text processing that focuses on connected ideas, rather than separate voices or points of view covering multiple topics.
Article
Coh-Metrix is among the broadest and most sophisticated automated textual assessment tools available today. Automated Evaluation of Text and Discourse with Coh-Metrix describes this computational tool, as well as the wide range of language and discourse measures it provides. Section I of the book focuses on the theoretical perspectives that led to the development of Coh-Metrix, its measures, and empirical work that has been conducted using this approach. Section II shifts to the practical arena, describing how to use Coh-Metrix and how to analyze, interpret, and describe results. Coh-Metrix opens the door to a new paradigm of research that coordinates studies of language, corpus analysis, computational linguistics, education, and cognitive science. This tool empowers anyone with an interest in text to pursue a wide array of previously unanswerable research questions.