Automated Scoring for Essay Questions In E-learning
Manar Joundy Hazar1
Zinah Hussein Toman2
Sarah Hussein Toman3
1Computer Center, University of Al-Qadisiyah, Qadisiyah, Iraq.
2Computer Science Department, College of Computer Science and Information Technology, University of Al-Qadisiyah
3Roads and Transport Department, College of Engineering, University of Al-Qadisiyah
Abstract
E-learning, the use of technology to support and promote learning, began decades ago; the most prominent example of intelligent software agents in e-learning is Intelligent Tutoring Systems (ITSs). Assessment plays a significant role in the educational process. Automated Essay Scoring (AES) is the computer technology that evaluates and scores subjective answers. In this work we present a model for evaluating students' essay answers based on linguistic knowledge. Simulation results obtained from the proposed system indicate high precision compared with other methods. We used the root mean square error (RMSE) and the Pearson correlation coefficient to measure the system's efficiency, and we used student answers from the University of North Texas data set, which is accessible on the Internet.
Keywords: E-learning, WordNet, NLP, semantic similarity, essay question.
1. Introduction
E-learning is an effective method of learning that uses modern, Internet-based technologies. It enables courses to be delivered to students at any time and anywhere, and supports interaction with students in an effective and simple manner. E-learning has become one of the most important requirements of the educational process, keeping pace with the rapid development of educational institutions around the world, and it has played an essential role in the development of education [1].
In recent years e-learning has become a significant technology trend. It offers richer resources than the traditional classroom to simplify teaching, and it overcomes the time and space restrictions of traditional learning. It allows students to study independently, freeing them from the restrictions of traditional educational techniques [2].
Figure 1: The basic components of an ITS
Assessment of learning outcomes with tests and examinations can take many forms and grading methods. Questions can be of specific types, from multiple-choice questions to questions that involve natural language answers such as short answers or essays. Grading may be done either manually or automatically by computational methods. In this research we focus on essay answer questions and automatic grading techniques.
Many researchers argue that the subjective nature of essay evaluation leads to disparities in the scores given by different human raters, which students consider a major source of injustice.
This problem may be addressed by adopting automated essay evaluation tools. An automated assessment system will at least be consistent in the way essays are scored, and massive savings in time and cost can be achieved if the technique can evaluate essays within the tolerance granted to a human evaluator. Furthermore, according to Hirst (2000), using computers to increase our understanding of the textual features and cognitive skills involved in creating and understanding written texts will provide a number of benefits to the educational process. In addition, it will help in developing more effective technologies, such as search engines and question-answering systems, to provide universal access to electronic information [4].
Summit (an exam-preparation assistant) is a tool for evaluating students' essays based on their content, using a semantic text analysis technique known as Latent Semantic Analysis (LSA) [5]. Essentially, LSA represents each word as a vector in a high-dimensional space, so that the nearness between two vectors is closely related to the semantic similarity between the two words.
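To make the vector intuition concrete, the following minimal Python sketch (with invented toy vectors, not real LSA output, which would come from an SVD of a term-document matrix) shows how cosine similarity over word vectors captures semantic nearness:

```python
import math

# Toy word vectors (hypothetical co-occurrence weights over 4 contexts);
# in real LSA these come from a singular value decomposition.
vectors = {
    "car":  [4.0, 0.2, 0.1, 3.5],
    "auto": [3.8, 0.1, 0.3, 3.2],
    "bird": [0.1, 4.2, 3.9, 0.2],
}

def cosine(u, v):
    """Cosine of the angle between two vectors: closer to 1.0 = more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_car_auto = cosine(vectors["car"], vectors["auto"])
sim_car_bird = cosine(vectors["car"], vectors["bird"])
print(sim_car_auto > sim_car_bird)  # "car" is far closer to "auto" than to "bird"
```

The same cosine measure is what makes LSA robust to synonymy: words that occur in similar contexts end up as nearby vectors even if they never co-occur directly.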
This change in how course content is delivered and how students interact with it is a major departure from the traditional classroom course. Perhaps the most interesting difference is where the introduction of material and the deeper engagement with it take place. Traditionally, the introduction is provided through an in-class lecture, and deeper engagement happens outside the classroom through homework. In the arrangement described above, the introduction takes place outside the classroom and the engagement occurs within it.
In this area, most researchers agree that it is inappropriate to use objective questions to measure some complex aspects of achievement. Learning outcomes involving the ability to recall, organize, and integrate ideas, the ability to express oneself in writing, and the ability to go beyond mere identification to the interpretation and application of data require a less restrictive response format than objective test items allow. In measuring such outcomes, the highest levels of Bloom's taxonomy (1956), specifically evaluation and synthesis, are where the essay question serves its most useful purpose [7].
Unlike multiple-choice questions (MCQs), essays call for subjective answers rather than the exact answers of MCQs (e.g., true or false) [8]. The student's skill and ability therefore play a large role in producing a strong answer free of misspellings and grammatical errors, which would otherwise reduce the student's mark. Hence, automatic essay evaluation is a challenging task because a complete assessment is needed to validate the answers accurately [9].
2.1 Related work
Several researchers have addressed the problem of automated essay scoring, also called automatic essay assessment, using various techniques. The key idea behind these techniques lies in a set of essays manually scored by humans, against which the essay to be assessed is compared. The manually scored essays are usually called pre-scored or training essays, whereas the essay to be assessed by the computer is called the tested essay or the automatically scored essay [10].
The earliest system proposed for essay assessment is the Project Essay Grader (PEG), which focused on the writing style of a given essay in order to produce the score. Writing style covers essay length and mechanics such as spelling errors, capitalization, grammar, and diction. This approach was criticized for its lack of semantic analysis, in which the content itself is ignored.
Graesser et al. (2000) used several methods to automatically score the essay test answers provided by a sample of students, among them a competency model grounded in pedagogy. One researcher suggested an approach for intelligently assessing students' test answers using a grammar checker and providing feedback after evaluating the student's answer. Another proposal, a student answer identification program, uses a flow-control diagram to measure similarities [11].
Satav, Nanekar, Pingale, and Nupur [12] presented an examination system based on SQL and Microsoft .NET (C# and ASP.NET). They implemented an examination system for computer applications. Their system has only examination and evaluation subsystems. It includes different types of questions related to computer applications, such as multiple choice, fill in the blank, true/false, and programming design questions.
Mohamed Jaballah, Saad Harous, and Sane M. Yagi [13] developed an Arabic examination system for students at the University of Sharjah. Their system includes examination and grading subsystems for different question types such as true/false, multiple choice, fill in the blank, and essay questions. The exam paper is generated automatically by the examination system. However, their grading subsystem is not automated, and grading is done manually by the teacher.
Chen Xiangjun and Wu Fangsheng [14] proposed an examination system that provides login activity recording, user management, and test question management. It includes an examination subsystem and a grading subsystem based on matching student answers against the answer key.
3. Methodology
A framework is proposed in this section for the evaluation of students in educational environments based on linguistic knowledge. The framework takes the student answer as input, which is passed through the three layers illustrated in Figure 2.
Figure 2: Layers of the proposed framework
The suggested system is based on linguistic knowledge. Students' answers are obtained electronically and compared with the ideal answer stored in the system; the acquired answer is likewise assessed electronically.
3.1 The Pre-Processing Layer:
This layer is powered by the OpenNLP framework, an open-source, statistics-based NLP toolkit. OpenNLP is used to process natural languages such as English. It performs the tasks of sentence detection, tokenization, part-of-speech tagging, and detection of common named entities such as people, organizations, places, cities, countries, and more [15].
[Figure 2, as labeled in the diagram: input text (student answer) → pre-processing (sentence detector, tokenizer, named-entity finder, stop words, stemming, parsing, co-reference) → processing (synonym extraction via WordNet, summarization, WSD, similarity between summary text and ideal answer) → post-processing (student score).]
3.1.1 Sentence Detection
The sentence is defined as the longest whitespace-trimmed character sequence between two punctuation marks. The OpenNLP Sentence Detector can decide whether or not a punctuation character marks the end of a sentence. This rule excludes the beginning and end of the text: the first non-whitespace character is assumed to begin a sentence, and the last non-whitespace character is assumed to end one [16, 17].
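These rules can be approximated with a simple illustrative splitter. This is a hedged toy sketch: OpenNLP's detector is a trained statistical model, not a regular expression, and a regex splitter mishandles abbreviations like "Mr.".

```python
import re

def detect_sentences(text):
    """Split on sentence-ending punctuation followed by whitespace;
    the first/last non-whitespace characters begin and end sentences."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p.strip() for p in parts if p]

sents = detect_sentences("  He arrived. He sat down! Was he late?  ")
# ['He arrived.', 'He sat down!', 'Was he late?']
```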
3.1.2 Tokenization
The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation marks, numbers, etc. [18].
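A minimal illustrative tokenizer in the same spirit (a toy regular-expression sketch, not the OpenNLP implementation, which is model-based) might look like:

```python
import re

def tokenize(sentence):
    """Segment a sentence into word, number, and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence)

tokens = tokenize("It costs 5 dollars, right?")
# ['It', 'costs', '5', 'dollars', ',', 'right', '?']
```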
3.1.3 Name Finder
The Name Finder can detect named entities and numbers in text. To detect entities, the Name Finder needs a model; the model depends on the language as well as the entity type it was trained on.
The OpenNLP project provides a number of pre-trained name finder models, trained on various corpora, that are freely available. To find names in raw text, the text must first be segmented into tokens and sentences. It is important that the tokenization of the training data and of the input text be identical [19].
3.1.4 Stop Words
A correct English sentence contains grammatical function words that do not contribute to its meaning. For example, words such as "from" and "to" are stop words: the sentence needs them to be grammatically complete, but they do not convey its core meaning. In the phrase "a fast car", the words "car" and "fast" are the content-bearing part of the sentence, while the function words only express the relationship among them. Stop words appear in most sentences, which can inflate the overall similarity score, so the suggestion is to eliminate these words before comparing sentences.
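The elimination step can be sketched as follows (the stop-word list here is a small invented sample, not a standard list):

```python
# Minimal illustrative stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "the", "is", "to", "from", "and", "at", "for", "of", "in"}

def remove_stop_words(tokens):
    """Drop function words that carry little content before comparison."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

content = remove_stop_words(["a", "fast", "car", "is", "in", "the", "garage"])
# ['fast', 'car', 'garage']
```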
3.1.5 POS Tagger
Part-of-speech (POS) tagging aims to assign each word in a given text a tag from a fixed set of parts of speech such as noun, verb, adjective, or adverb. Many words have several potential tags, so a POS tagger is employed to disambiguate them. The primary role of the POS tagger is therefore to determine the exact tag for each word in context [20].
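As a toy illustration only (real taggers such as OpenNLP's use statistical models and surrounding context, not a fixed lexicon), tagging can be sketched as a lookup from word to its most frequent tag:

```python
# Invented mini lexicon mapping each word to its most frequent tag;
# a real tagger resolves ambiguous words from context.
LEXICON = {"the": "DT", "dog": "NN", "barks": "VBZ", "loud": "JJ"}

def tag(tokens, default="NN"):
    """Assign each token a part-of-speech tag from the lexicon."""
    return [(t, LEXICON.get(t.lower(), default)) for t in tokens]

tagged = tag(["The", "dog", "barks"])
# [('The', 'DT'), ('dog', 'NN'), ('barks', 'VBZ')]
```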
3.1.6 Stemming:
Words with the same meaning appear in various morphological forms. To capture their similarity, they are normalized into a common root form; obtaining the stem typically involves suffix deletion. Automated word stemming plays an important role in information retrieval systems, since it reduces the total number of terms in the system. We use the Porter stemming algorithm [21] for this purpose.
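A drastically simplified suffix-stripping sketch conveys the idea; this is not the full Porter algorithm, which applies several ordered rule phases with measure conditions:

```python
# Invented suffix list, ordered longest-first so "es" is tried before "s".
SUFFIXES = ["ational", "ization", "fulness", "ing", "ed", "es", "s"]

def stem(word):
    """Strip the longest matching suffix, keeping a stem of length >= 3."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(stem("jumped"), stem("cats"))  # jump cat
```

Note the difference from real Porter stemming: this sketch turns "running" into "runn", whereas Porter also repairs the doubled consonant to yield "run".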
3.1.7 Chunking
Chunking divides the text into syntactically related groups of words, such as noun phrases and verb phrases, but does not specify their internal structure or their role in the main sentence [20].
3.1.8 Parsing
Parsing is one of the most important stages, providing rich information about sentence grammar. A parse represents a text in a tree structure derived from grammar rules and the input sentence.
Parsing is done in a top-down or a bottom-up manner. Top-down parsers construct the parse tree by deriving the input sentence from the root node S down to the leaves [15]; the parser's search algorithm expands all trees that have S as their root, using information about the possible trees beneath it. In bottom-up parsing, construction of the parse tree starts with the words of the input and tries to build trees over them, again by applying grammar rules. The parser succeeds if it builds a tree rooted in the start symbol S that covers all of the input [22].
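The top-down procedure can be sketched as a tiny recursive-descent parser over an invented toy grammar (a simplification: real parsers handle ambiguity and backtracking far more carefully):

```python
# Invented toy grammar and lexicon: S -> NP VP, NP -> DT N, VP -> V NP | V.
GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["DT", "N"]],
    "VP": [["V", "NP"], ["V"]],
}
LEXICON = {"the": "DT", "dog": "N", "ball": "N", "chased": "V"}

def parse(symbol, tokens, pos):
    """Top-down: derive `symbol` from tokens[pos:]; return (tree, next_pos) or None."""
    if symbol not in GRAMMAR:                      # pre-terminal: consume one word
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            return (symbol, tokens[pos]), pos + 1
        return None
    for production in GRAMMAR[symbol]:             # expand from the root downwards
        children, p = [], pos
        for child in production:
            step = parse(child, tokens, p)
            if step is None:
                break
            subtree, p = step
            children.append(subtree)
        else:                                      # whole production matched
            return (symbol, children), p
    return None

result = parse("S", "the dog chased the ball".split(), 0)
```

A full parse succeeds when the returned position covers all five input tokens; the fallback production VP -> V lets the same sketch accept "the dog chased".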
3.1.9 Co-reference
In this phase, all references are matched to their corresponding named entity in the text, regardless of the form of the reference; a reference usually matches a name, a full name, or a pronoun. Some work has shown that co-reference resolution can be used to improve summarization systems that rely heavily on word-frequency features [15].
A simple example is pronominal reference, as in "John will travel tomorrow; he bought the ticket yesterday", where "he" refers to "John". If such mentions are counted together, they may be weighted as more important. This type of analysis is not widely used in summarization systems due to performance and accuracy problems.
3.2 Intermediate Processing Layer:
This layer is the core of the proposed model for evaluating summaries. Processing at this stage involves applying grammatical and semantic information to summary evaluation. First, the source text and the summary text are split into sets of sentences. Then the similarity between each sentence of the summary text and every sentence of the source text is computed using a combination of word-order similarity and semantic similarity; the maximum value is taken as the similarity score for that summary sentence.
3.2.1 WordNet
WordNet is an online lexical reference system whose design is inspired by current
psycholinguistic theories of human lexical memory. English nouns, verbs, and adjectives are organized
into synonym sets, each representing one underlying lexical concept. Different relations link the
synonym sets. Each wordnet represents an autonomous structure of language-specific lexicalizations,
which are interconnected via an Inter-Lingual-Index. The wordnets are built at different sites from
existing resources, starting from a shared level of basic concepts and extended top-down. The Collins
and Quillian model proposed a hierarchical structure of concepts, where more specific concepts inherit
information from their superordinate, more general concepts; only knowledge particular to more
specific concepts needs to be stored with such concepts. Thus, it took subjects longer to confirm a
statement like “canaries have feathers” than the statement “birds have feathers” since, presumably, the
property “has-feathers” is stored with the concept bird and not redundantly with the concept for each
kind of bird. This means that the similarity procedure will consist only of noun comparisons, verb
action, and soon.
Figure 3: Example of WordNet Structure
The availability of the WordNet database was an important starting point. The synset model is simple enough to provide a fundamental relationship between a concept and its corresponding words. Moreover, WordNet covers the bulk of the English lexicon and provides an unusually fine-grained conceptual differentiation. It is also particularly convenient computationally, because it was developed for easy access and traversal of its hierarchies. Starting with WordNet, we chose a subset of the appropriate synsets to represent the concepts of interest [23].
The first decision in the mapping was to settle on the relationships that would be used to map onto WordNet synsets. There are three possible relationships: synonymy, hypernymy, and instantiation. A few examples illustrate these three relationships and their use in mapping onto WordNet synsets; consider, for example, an entry in the WordNet noun database.
Formally, Resnik's semantic similarity is defined as follows:

sim_res(c1, c2) = max_{c ∈ S(c1, c2)} IC(c)

where S(c1, c2) is the set of concepts that subsume both c1 and c2. An information-theoretic similarity measure built on the same information-content (IC) idea is due to Lin, who explains his definition of similarity as follows: "The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are."

Formally, this definition can be expressed as:

sim_lin(c1, c2) = 2 · sim_res(c1, c2) / (IC(c1) + IC(c2))
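Both measures are easy to compute once IC values are available; the sketch below uses invented IC values and subsumer lists purely for illustration (real IC values are estimated from corpus frequencies):

```python
# Hypothetical information-content values: more specific concepts carry
# higher IC. SUBSUMERS[c] lists c and everything above it in the hierarchy.
IC = {"entity": 0.0, "animal": 1.2, "bird": 2.5, "canary": 4.1, "robin": 4.0}
SUBSUMERS = {"canary": ["canary", "bird", "animal", "entity"],
             "robin":  ["robin", "bird", "animal", "entity"]}

def sim_res(c1, c2):
    """Resnik: IC of the most informative common subsumer."""
    common = set(SUBSUMERS[c1]) & set(SUBSUMERS[c2])
    return max(IC[c] for c in common)

def sim_lin(c1, c2):
    """Lin: shared information relative to each concept's own description."""
    return 2 * sim_res(c1, c2) / (IC[c1] + IC[c2])

print(sim_res("canary", "robin"), sim_lin("canary", "robin"))
```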
3.2.2 Summarization:
To obtain a short sentence, unnecessary information must be removed from the source-text sentence, and the deletion strategy is used for that. Unimportant information includes trivial details such as examples, scenarios, or repetitions of information already stated [24]. The basic job of the deletion strategy is to remove non-essential material such as stop words. Given the original sentence and its concise form, let Ss be the concise sentence, Os the original sentence, Len(Os) the length of Os, and Len(Ss) the length of Ss. The first rule of the deletion strategy is that Len(Ss) must be less than Len(Os).

Sentence partitioning: each sentence must be split into a list of words in order to measure similarity. The sentence is tokenized after first eliminating the stop words. Stop words are words that should be removed from a sentence before any natural language processing technique is applied; there is no single fixed set of stop words, and any set can be used. The most common choice is the set of functional words such as "the", "for", and "at". Finally, the separator characters between words are recognized and the word list is created.
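The deletion rule and sentence partitioning can be sketched together (toy stop-word list; Len is taken to be the word count here by assumption):

```python
STOP_WORDS = {"the", "for", "and", "at", "a", "of", "is", "in", "to", "that"}

def condense(sentence):
    """Deletion strategy: drop stop words so that Len(Ss) < Len(Os)."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return [w for w in words if w not in STOP_WORDS]

Os = "The loop runs at least once, and the test is at the end."
Ss = condense(Os)
print(Ss)  # ['loop', 'runs', 'least', 'once', 'test', 'end']
assert len(Ss) < len(Os.split())   # the first rule of the deletion strategy
```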
3.2.3 Word Ambiguity
Each POS class in WordNet is divided into synsets, sets of synonyms. A single word can be polysemous, that is, it can have several senses; all of these senses belong to the same POS class but to different synsets. When calculating the degree of similarity between two words, it is necessary to make sure that the correct senses of the words are compared. This is done by disambiguating each word's sense in its given context. The algorithm most often used for word sense disambiguation is the Lesk algorithm, proposed by Michael E. Lesk in 1986 [25]. The algorithm uses a word's dictionary definition (gloss) to determine how much it has in common with the context of another word.
Word sense disambiguation must be performed during the sentence similarity process, since a word can have different meanings in different contexts. The Adapted Lesk algorithm is considered one of the most accurate algorithms in the field of word sense disambiguation.
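A simplified Lesk sketch over an invented two-sense gloss dictionary shows the overlap idea (the Adapted Lesk algorithm goes further and also uses the glosses of related synsets):

```python
# Hypothetical mini gloss dictionary for the ambiguous word "bank".
GLOSSES = {
    "bank_1": "sloping land beside a body of water such as a river",
    "bank_2": "financial institution that accepts deposits and lends money",
}

def lesk(context_words, glosses):
    """Pick the sense whose gloss overlaps most with the context words."""
    def overlap(gloss):
        return len(set(gloss.split()) & set(context_words))
    return max(glosses, key=lambda sense: overlap(glosses[sense]))

sense = lesk("he sat on the bank of the river".split(), GLOSSES)
# 'bank_1': the gloss shares 'of' and 'river' with the context
```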
3.3 The Post-Processing Layer
The auto-grading system uses natural language processing algorithms to measure the match between the stored answer and the student's answer to free-response questions. This layer produces the system's output: it reports the measured similarity as the student's grade. The following equation, the matching average, is used to determine the final score (FS) for any written summary of the student:

FS = 2 · Token_Match(X, Y) / (|X| + |Y|)

where Token_Match(X, Y) is the number of matching word tokens between X and Y.
The Dice coefficient formula returns the ratio of the number of tokens that can be matched to the total number of tokens; Dice will always return a higher value than the matching average [26, 27]. This policy needs a predefined threshold in order to select the matching pairs whose values exceed the given threshold.

Dice(X, Y) = 2 · |X ∩ Y| / (|X| + |Y|)
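The Dice coefficient is straightforward to compute over unique tokens; the sketch below uses invented example sentences, not answers from our dataset:

```python
def dice(x_tokens, y_tokens):
    """Dice coefficient: 2|X ∩ Y| / (|X| + |Y|) over unique tokens."""
    x, y = set(x_tokens), set(y_tokens)
    return 2 * len(x & y) / (len(x) + len(y))

ideal   = "both have a termination test".split()
student = "both have a stopping test".split()
score = dice(ideal, student)
# 2*4 / (5+5) = 0.8
```

In practice a threshold (e.g. accepting only pairs whose score exceeds some cutoff) is applied on top of this value, as described above.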
4.1. Metrics
We evaluate the system by measuring its performance. We measure the correlation between two variables using the Pearson correlation coefficient, whose value lies in [-1, 1], where 1 means a perfect positive correlation, 0 no correlation, and -1 a perfect negative correlation. We also measure the deviation between assessed and actual values using the root mean square error (RMSE) metric.
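Both metrics are standard and can be sketched directly (the score vectors below are invented examples, not data from our experiments):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rmse(predicted, actual):
    """Root mean square error between assessed and actual scores."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

human  = [8.0, 5.0, 9.0, 3.0]   # hypothetical lecturer marks (0-10)
system = [7.5, 5.5, 8.5, 3.5]   # hypothetical automatic scores
print(pearson(system, human), rmse(system, human))
```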
4.2 Simulation Environment
The presented system was implemented using the OpenNLP open-source library to process text; it provides all the required NLP tasks (tokenization, POS tagging, chunking, parsing, etc.). The interface of the proposed system was built using Visual Studio 2013.
4.4. Results and Discussion
To evaluate the performance of the proposed model, we conducted three experiments. In the first, we evaluated the system's performance against human judgment in grading students' answers. In the second, we compared the proposed framework with human scoring, and finally, in the third experiment, we compared it with state-of-the-art and existing approaches.
4.4.1. Test 1
In the summer of 2017, a dataset of student responses was collected from University of Al-Qadisiyah students in a Web programming course. It contains one essay question. Table 1 displays the question used in this study, and Figure 4 shows sample answers from three students.
Table 1: First Test
Suggested method: the essay answers to the question ("list the benefits of ...") were ranked by five human graders by hand; all five manual graders are computer science lecturers. The mark given to each answer was in the range 0 to 10. Three samples were taken, and the automatic score for each answer was then compared with the lecturer's score.
4.4.2. Test 2
To evaluate our technique for grading student responses, its performance was measured against human grading to determine the similarity between the student response and the human score.
Figure 4: Manual student answer samples
Table 2: Proposed technique against human scoring
Sample question, precise answer, and student responses
What are the similarities between iteration and recursion?
They both involve repetition; they both have termination tests; they can both occur infinitely.
Student 1
Both are based on a control statement.
Student 2
Both involve a controlled repetition structures, and they both have a termination
test. Also both of them can loop forever.
Student 3
They are methods of repeating the same task.
What is the main difference between a while and a do...while statement?
The block inside a do...while statement will execute at least once.
Student 1
Do while loop always executes once. A while loop's conditional statement has to
be true for it to run.
Student 2
the loop body always executes at least once
Student 3
Do ...while runs the embedded code at least once, the do command does not
In the final stage we independently test one component of our overall grading system. To assess the performance of the suggested technique, we must use an evaluation metric, and several evaluation metrics are commonly used in NLP applications. In our test, the evaluation compares the proposed framework against the dataset using the correlation metric over the full data set. Table 3 shows the results of the proposed framework compared with each of these measures, and Figures 5, 6, and 7 compare the knowledge-based, corpus-based, and baseline measures with the proposed framework, whose result was higher than the best of these measures.
Table 3: Result of the presented method against each measures using correlations.
knowledge-based
shortest path
LSA Wikipedia
ESA Wikipedia
Proposed model
Figure 5: Comparison of knowledge-based measures and the proposed method
4.4.3. Test 3
To evaluate the suggested framework on the data set, we chose the following approaches: corpus-based and baseline.
Figure 6: Suggested method Vs Corpus-based.
[Chart data from Figures 5-6: correlation coefficients of knowledge-based measures (shortest path, Leacock & Chodorow, Lesk, Wu & Palmer, Resnik, Lin, Jiang & Conrath, Hirst & St-Onge; values including 0.373, 0.3368, 0.3926, 0.45) and corpus-based measures versus the proposed model (LSA BNC 0.4072, LSA Wikipedia 0.4288, ESA Wikipedia 0.4684, proposed model 0.4901).]
Figure 7: Suggested method Vs. Baseline
As shown in Figures 5, 6, and 7, the suggested method clearly achieved a higher correlation coefficient than all the other methods, and we can see the similarity between the suggested method and human judgment.
Lastly, Table 4 shows the RMSE of the suggested framework and of some other state-of-the-art systems, giving an indication of the performance of these grading methods [26].
Table 4: RMSE measures
4.5 Conclusion
In this paper we offered an analysis and a discussion of the most recently presented assessment approaches in tutoring, and described how intelligent tutoring systems (ITSs) can be used for student assessment using NLP tools and algorithms. This work studied an automated assessment of learners' input that combined a lexico-syntactic (entailment) approach to textual analysis with a variety of other textual assessment measures. In the proposed framework we use NLP to discover the semantic similarity between the student answer and the ideal answer in an automatic grading system. The experimental results show the efficiency of the presented framework and prove that it is an accurate method.
4.6 Future Work
Essay questions may contain mathematical symbols and formulas, which is a big challenge, so as future work this framework can be extended to cover other formats in essay questions, such as mathematical formatting, and other languages, such as Arabic.
1. Motiwalla, L. F. (2007). Mobile learning: A framework and evaluation. Computers &
education, 49(3), 581-596.
2. Wang, T. H. (2014). Developing an assessment-centered e-Learning system for
improving student learning effectiveness. Computers & Education, 73, 189-203.
3. Valenti, S., Neri, F., & Cucchiarelli, A. (2003). An overview of current research on
automated essay grading. Journal of Information Technology Education: Research, 2,
4. Bin, L., & Jian-Min, Y. (2011, September). Automated essay scoring using multi-
classifier fusion. In International Conference on Information and Management
Engineering (pp. 151-157). Springer, Berlin, Heidelberg.
5. Felder, R. M., Woods, D. R., Stice, J. E., & Rugarcia, A. (2000). The future of
engineering education II. Teaching methods that work. Chemical Engineering
Education, 34(1), 26-39.
6. Lage, M. J., Platt, G. J., & Treglia, M. (2000). Inverting the classroom: A gateway to
creating an inclusive learning environment. The Journal of Economic Education, 31(1),
7. Ghosh, S. (2010). Online Automated Essay Grading System as a Web Based Learning
(WBL) Tool in Engineering Education. Web-Based Engineering Education: Critical
Design and Effective Tools: Critical Design and Effective Tools, 53.
8. Landauer, T. K. (2003). Automatic essay assessment. Assessment in education:
Principles, policy & practice, 10 (3), 295-308.
9. Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short
answer grading. International Journal of Artificial Intelligence in Education, 25(1), 60-
10. J. Kurs, M. Lungu, and O. Nierstrasz, "Top-Down Parsing with Parsing Contexts", In
Proceedings of International Workshop on Smalltalk Technologies (IWST14), England,
2014, pp. 1-7.
11. Graesser, A. C., Wiemer-Hastings, P., Wiemer-Hastings, K., Harter, D., Tutoring
Research Group, T. R. G., & Person, N. (2000). Using latent semantic analysis to
evaluate the contributions of students in AutoTutor. Interactive learning environments,
8(2), 129-147.
12. Baker, R. S., D'Mello, S. K., Rodrigo, M. M. T., & Graesser, A. C. (2010). Better to be
frustrated than bored: The incidence, persistence, and impact of learners’ cognitive–
affective states during interactions with three different computer-based learning
environments. International Journal of Human-Computer Studies, 68(4), 223-241.
13. Jaballah, M., Harous, S., & Yagi, S. M. (2008, April). UOS EASY EXAM Arabic
Computer-Based Examination System. In Information and Communication
Technologies: From Theory to Applications, 2008. ICTTA 2008. 3rd International
Conference on (pp. 1-5). IEEE.
14. Ho, Y. S., Sang, J., Ro, Y. M., Kim, J., & Wu, F. (Eds.). (2015). Advances in
Multimedia Information Processing--PCM 2015: 16th Pacific-Rim Conference on
Multimedia, Gwangju, South Korea, September 16-18, 2015, Proceedings (Vol. 9314).
15. Bond, F., & Foster, R. (2013). Linking and extending an open multilingual wordnet. In
Proceedings of the 51st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 1352-1362).
16. S. Anantpure, H. Jain, N. Alhat, S. Bhor, and S. Guru, “Literature Survey On Syntax
Parser For English Language Using Grammar Rules ˮ, International Journal of Advance
Foundation and Research in Computer (IJAFRC), Vol. 2, Special Issue (NCRTIT 2015),
January 2015, pp. 327-333.
17. S. W. Tu, M. Peleg, S. Carini, M. Bobak, J. Ross, D. Rubin, and I. Sim, “A Practical
Method for Transforming Free-Text Eligibility Criteria into Computable Criteria”,
Journal of Biomedical Informatics, Vol. 44, Issue 2, 2011, pp. 239-250.
18. B. A. Galitsky, “Transfer Learning of Syntactic Structures for Building Taxonomies for
Search Engines”, Engineering Applications of Artificial Intelligence, Vol. 26, Issue 10,
November 2013, pp. 2504-2515.
19. Willett, P. (2006). The Porter stemming algorithm: then and now. Program, 40(3), 219-
20. G. Wilcock, “Text Annotation with OpenNLP and UIMA”, Proceedings of the 17th
Nordic Conference of Computational Linguistics, University of Southern Denmark,
Odense, 2009, pp. 7-8.
21. C. Arora, M. Sabetzadeh, L. Briand, F. Zimmer, and R. Gnaga, “Automatic Checking of
Conformance to Requirement Boilerplates via Text Chunking: An Industrial Case Study
ˮ, 7th ACM/IEEE International Symposium on Empirical Software Engineering and
Measurement (ESEM ), USA, 2013, pp.35-44.
22. Agirre, E., & Edmonds, P. (Eds.). (2007). Word sense disambiguation: Algorithms and
applications (Vol. 33). Springer Science & Business Media.
23. Banerjee, Satanjeev, and Ted Pedersen. An Adapted Lesk Algorithm for Word Sense
Disambiguation Using WordNet. Tech. Duluth: University of Minnesota. Print
24. T. N. Dao and T. Simpson, "Measuring Similarity Between Sentences", WordNet.Net
Technical Report, 2005.
25. S. Banerjee and T. Pedersen, "An Adapted Lesk Algorithm for Word Sense
Disambiguation Using WordNet", Proceedings of the Third International Conference on
Computational Linguistics and Intelligent Text Processing, Springer Berlin Heidelberg,
pp. 136-145.
26. M. Mohler, R. Bunescu, and R. Mihalcea, "Learning to Grade Short Answer Questions
Using Semantic Similarity Measures and Dependency Graph Alignments", Association for
Computational Linguistics, pp. 752-762, 2011.
... Educational institutions are challenged to preserve academic integrity through objective evaluation of students' performance [1], [2]. As such, technologies that support the improvement of the assessment process are vital in today's connected learning environment [3]. Prior research has explored the impact of technologies on the learning process, and recent advances [4], various platforms [5], and diverse users [6], [7] point to the need for an innovative approach to assessing learners' comprehension of a subject matter [1], [2], [7A]. ...
Conference Paper
People increasingly rely on technology across industries, including education, where it serves as a platform for knowledge acquisition. In this study, the researchers propose an Automated Essay Scoring (AES) system to help teachers reduce the time spent grading students' essays, prevent biased ratings, and provide a feedback mechanism for students. The study proposes a neural-network architecture and the Bayesian Linear Ridge Regression algorithm to improve the correctness and reliability of the AES system and the scoring dimensions used, and adopts a hybrid Agile model for a structured development approach. The dataset is the Hewlett Foundation's Automated Student Assessment Prize (ASAP) from Kaggle, supplemented with data gathered from reliable sources to test the accuracy and reliability of the AES system.
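As a rough illustration of the regression step described above, the sketch below fits a plain ridge regression to two invented surface features of an essay (the study's Bayesian Linear Ridge Regression additionally places priors over the weights). All feature choices and numbers here are illustrative assumptions, not taken from the study.

```python
# Hypothetical sketch: predicting an essay score with ridge regression over
# a bias term and a word-count feature. The toy targets follow the linear
# rule score = 2 + 0.01 * word_count.

def ridge_fit(X, y, lam=1e-6):
    """Solve (X^T X + lam*I) w = X^T y by Gaussian elimination with pivoting."""
    n = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) + (lam if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(X, y)) for i in range(n)]
    for col in range(n):                      # forward elimination
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for c in range(col, n):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * n
    for i in reversed(range(n)):              # back substitution
        w[i] = (b[i] - sum(A[i][j] * w[j] for j in range(i + 1, n))) / A[i][i]
    return w

def predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy training set: [bias, word_count] -> human score
X = [[1, 100], [1, 200], [1, 300], [1, 400]]
y = [3.0, 4.0, 5.0, 6.0]
w = ridge_fit(X, y)
print(round(predict(w, [1, 250]), 2))         # ≈ 4.5
```

In practice AES systems use many more features (spelling, syntax, semantic similarity to model answers), but the fitting step has the same shape.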
... This reduces the workload of manual scoring as well as the subjectivity of expert raters. The growing deployment of e-Learning systems in higher education has led to numerous studies on assessing and improving the consistency of AES systems for efficient evaluation of high-grade essays (Hazar et al., 2019). AES has been a hot topic in the academic literature for several years now. ...
E-learning is gradually gaining prominence in higher education, with universities enlarging provision and more students enrolling. The effectiveness of automated essay scoring (AES) thus holds strong appeal for universities seeking to manage increasing learner numbers and reduce the costs associated with human raters. The growth of e-learning systems in higher education and the demand for consistent writing assessment have spurred research interest in improving the accuracy of AES systems. This paper presents a transformer-based neural network model for improved AES performance using Bi-LSTM and the RoBERTa language model on Kaggle's ASAP dataset. The proposed model applies a Bi-LSTM over the pre-trained RoBERTa language model to address the coherence of essays, which is ignored by traditional essay scoring methods, including traditional NLP pipelines, deep learning-based methods, and mixtures of both. Comparison of the experimental results with human raters shows that the proposed model outperforms existing methods in terms of QWK score, and the comparative analysis demonstrates its applicability to automated essay scoring at the higher education level.
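The QWK (quadratic weighted kappa) score mentioned above measures agreement between two raters while penalizing large disagreements quadratically. A minimal sketch, assuming integer scores in the range 0..k-1:

```python
# Quadratic Weighted Kappa between two rating sequences a and b over k
# ordered categories: 1 - (weighted observed disagreement) /
# (weighted expected disagreement under independent raters).

def quadratic_weighted_kappa(a, b, k):
    O = [[0] * k for _ in range(k)]           # observed rating matrix
    for i, j in zip(a, b):
        O[i][j] += 1
    ha = [a.count(c) for c in range(k)]       # per-rater histograms
    hb = [b.count(c) for c in range(k)]
    n = len(a)
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (i - j) ** 2 / (k - 1) ** 2   # quadratic disagreement weight
            num += w * O[i][j]
            den += w * ha[i] * hb[j] / n      # expected counts from histograms
    return 1.0 - num / den

print(quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 2], 3))  # 1.0 (perfect agreement)
```

A score of 1.0 means perfect agreement, 0 means chance-level agreement, and negative values mean systematic disagreement.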
Grading essay-type examinations requires teachers to evaluate descriptive answers written by students, which is a challenging task. It is often laborious, requiring more time than grading multiple-choice questions, and the process is subjective, leading to inaccuracies and considerable variation in grading. In recent times, many researchers have developed techniques for Automated Grading of Essays (AGE) using various machine learning methods. This paper reviews and summarizes the existing literature on methods for automated grading of essays, covering approaches such as Natural Language Processing and Deep Learning, and identifies a few areas for further research on the AGE problem.
This paper reports on a study undertaken in a Chinese university in order to investigate the effectiveness of an online automated essay marking system in the context of a Blended Learning course design. Two groups of undergraduate learners studying English were required to write essays as part of their normal course. One group had their essays marked by an online automated essay marking and feedback system, the second, control group were marked by a tutor who provided feedback in the normal way. Their essay scores and attitudes to the essay writing tasks were compared. It was found that learners were not disadvantaged by the automated essay marking system. Their mean performance was better (p<0.01) than the tutor marked control for seven of the essays and showed no difference for three essays. In no case did the tutor marked essay group score higher than the automated system. Correlations were performed that indicated that for both groups there was a significant improvement in performance (p<0.05) over the duration of the course and that there was a significant relationship between essay scores for the groups (p<0.01). An investigation of attitude to the automated system as compared to the tutor marked system was more complex. It was found that there was a significant difference in the attitudes of those classified as low and high performers (p<0.05). In the discussion these findings are placed in a Blended Learning context.
Computational techniques for scoring essays have recently come into use. Their bases and development methods raise both old and new measurement issues. However, coming principally from computer and cognitive sciences, they have received little attention from the educational measurement community. We briefly survey the state of the technology, then describe one such system, the Intelligent Essay Assessor (IEA). IEA is based largely on Latent Semantic Analysis (LSA), a machine-learning model that induces the semantic similarity of words and passages by analysis of large bodies of domain-relevant text. IEA's dominant variables are computed from comparisons with pre-scored essays of highly similar content as measured by LSA. Over many validation studies with a wide variety of topics and test-takers, IEA correlated with human graders as well as they correlated with each other. The technique also supports other educational applications. Critical measurement questions are posed and discussed.
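A toy sketch of the LSA machinery underlying systems like IEA (not the IEA implementation itself): a term-document matrix is reduced by SVD, and essays are compared by cosine similarity in the resulting latent space. The documents and the choice of k below are illustrative assumptions.

```python
# Latent Semantic Analysis sketch: SVD of a term-document count matrix,
# then cosine similarity between documents in the reduced space.
import numpy as np

docs = ["the cat sat", "the cat sat", "dogs run fast"]
vocab = sorted({w for d in docs for w in d.split()})
X = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                                          # latent dimensions kept
doc_vecs = (S[:k, None] * Vt[:k]).T            # one row per document

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

print(cosine(doc_vecs[0], doc_vecs[1]))        # ≈ 1.0: identical essays
print(cosine(doc_vecs[0], doc_vecs[2]))        # ≈ 0.0: no shared vocabulary
```

In an IEA-style setting, a student essay would be projected into a space trained on a large body of domain text and compared against pre-scored essays rather than raw documents.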
We apply a paradigm of transfer learning to build a taxonomy of entities intended to improve search engine relevance in a vertical domain. The taxonomy construction process starts from the seed entities and mines available source domains for new entities associated with these seed entities. New entities are formed by applying the machine learning of syntactic parse trees (their generalizations) to the search results for existing entities to form commonalities between them. These commonality expressions then form parameters of existing entities, and are turned into new entities at the next learning iteration. To match natural language expressions between source and target domains, we use syntactic generalization, an operation which finds a set of maximal common sub-trees of constituency parse trees of these expressions. Taxonomy and syntactic generalization are applied to relevance improvement in search and text similarity assessment. We conduct an evaluation of the search relevance improvement in vertical and horizontal domains and observe significant contribution of the learned taxonomy in the former, and a noticeable contribution of a hybrid system in the latter domain. We also perform industrial evaluation of taxonomy and syntactic generalization-based text relevance assessment and conclude that a proposed algorithm for automated taxonomy learning is suitable for integration into industrial systems. The proposed algorithm is implemented as a component of Apache OpenNLP project.
An example is given of an inclusive learning environment built on the cornerstone of new learning technologies that is conducive to reaching students with diverse learning styles.
Conference Paper
The method of multi-classifier fusion was applied to essay scoring. In this paper, each essay was represented by a Vector Space Model (VSM). After removing stopwords, we extracted content and linguistic features from the essays, expressing each vector by corresponding weights. Three classical approaches, Document Frequency (DF), Information Gain (IG), and the Chi-square Statistic (CHI), were used to select features according to predetermined thresholds. We then classified each test essay into the appropriate category using different classifiers, namely Naive Bayes (NB), K Nearest Neighbors (KNN), and Support Vector Machine (SVM), and finally built an ensemble classifier from those component classifiers. After training the multi-classifier fusion technique, experiments on same-topic CET4 essays from the Chinese Learner English Corpus (CLEC) show that precision over 73% was achieved.
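The chi-square (CHI) feature-selection step described above can be sketched as scoring each term by its 2x2 contingency with the class label and keeping the highest-scoring terms. The corpus and labels below are invented for illustration.

```python
# Chi-square feature selection for text classification: for each term,
# build the 2x2 contingency table against a binary class label and score
# chi2 = n*(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D)).

def chi2(term, docs, labels):
    A = sum(1 for d, y in zip(docs, labels) if y == 1 and term in d)
    B = sum(1 for d, y in zip(docs, labels) if y == 0 and term in d)
    C = sum(1 for d, y in zip(docs, labels) if y == 1 and term not in d)
    D = sum(1 for d, y in zip(docs, labels) if y == 0 and term not in d)
    n = A + B + C + D
    den = (A + B) * (C + D) * (A + C) * (B + D)
    return n * (A * D - B * C) ** 2 / den if den else 0.0

docs = [{"clear", "thesis", "good", "the"}, {"good", "logic", "the"},
        {"vague", "the"}, {"rambling", "the"}]
labels = [1, 1, 0, 0]                          # 1 = high-scoring essay

terms = sorted({t for d in docs for t in d})
ranked = sorted(terms, key=lambda t: chi2(t, docs, labels), reverse=True)
print(ranked[0])   # "good": perfectly separates the two classes
```

Terms appearing in every document ("the") get a zero score, while class-discriminative terms rise to the top; DF and IG selection follow the same pattern with different scoring functions.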
Automated Essay Grading (AEG) or Automated Essay Scoring (AES) systems are no longer a myth; they are a reality. Today, typed (not handwritten) essays are graded not only by examiners and teachers but also by machines. The TOEFL exam is one of the best examples of this application: students' essays are evaluated both by a human and by a web-based automated essay grading system, and the average is taken. Many researchers consider essays the most useful tool for assessing learning outcomes, implying the ability to recall, organize, and integrate ideas, and the ability to supply, rather than merely identify, interpretation and application of data. Automated Writing Evaluation Systems, also known as Automated Essay Assessors, might provide precisely the platform we need to explicate many of the features that characterize good and bad writing, and many of the linguistic, cognitive, and other skills that underlie the human capability for both reading and writing. They can also provide timely feedback that writers and students can use to improve their writing skills. Careful research over the last couple of years has helped us understand the existing systems, which are based on AI and machine learning techniques and NLP (Natural Language Processing) techniques, find their loopholes, and finally propose a system that works in the Indian context, presently for English influenced by local languages. Currently, most essay grading systems are used for grading essays written in pure English or other European languages. No one in today's world can ignore the use of English in engineering education, or more broadly in professional courses. All engineering branches or streams are normally supported by modern English, sometimes known as English-for-Engineers.
This write-up focuses on existing automated essay grading systems and the basic technologies behind them, and proposes a new framework to show how these AEG systems can best be used in engineering education. E-learning has created a path for alternative education, and Web-Based Learning (WBL) has made that path much easier. Using AEG systems in a web-based learning environment helps students know, use, and understand English much better than they would in normal classroom-based study. Such AEG systems are especially useful for non-English-speaking students, that is, students whose mother tongue is not English and whose English is often influenced by local languages. An AEG system will help such students not only write better English essays but also score better in English and in other subjects taught in English.
Conference Paper
We create an open multilingual wordnet with large wordnets for over 26 languages and smaller ones for 57 languages. It is made by combining wordnets with open licences, data from Wiktionary and the Unicode Common Locale Data Repository. Overall there are over 2 million senses for over 100 thousand concepts, linking over 1.4 million words in hundreds of languages.
Automatic short answer grading (ASAG) is the task of assessing short natural language responses to objective questions using computational methods. The active research in this field has increased enormously of late with over 80 papers fitting a definition of ASAG. However, the past efforts have generally been ad-hoc and non-comparable until recently, hence the need for a unified view of the whole field. The goal of this paper is to address this aim with a comprehensive review of ASAG research and systems according to history and components. Our historical analysis identifies 35 ASAG systems within 5 temporal themes that mark advancement in methodology or evaluation. In contrast, our component analysis reviews 6 common dimensions from preprocessing to effectiveness. A key conclusion is that an era of evaluation is the newest trend in ASAG research, which is paving the way for the consolidation of the field.
Conference Paper
Context. Boilerplates have long been used in Requirements Engineering (RE) to increase the precision of natural language requirements and to avoid ambiguity problems caused by unrestricted natural language. When boilerplates are used, an important quality assurance task is to verify that the requirements indeed conform to the boilerplates. Objective. If done manually, checking conformance to boilerplates is laborious, presenting a particular challenge when the task has to be repeated multiple times in response to requirements changes. Our objective is to provide automation for checking conformance to boilerplates using a Natural Language Processing (NLP) technique, called Text Chunking, and to empirically validate the effectiveness of the automation. Method. We use an exploratory case study, conducted in an industrial setting, as the basis for our empirical investigation. Results. We present a generalizable and tool-supported approach for boilerplate conformance checking. We report on the application of our approach to the requirements document for a major software component in the satellite domain. We compare alternative text chunking solutions and argue about their effectiveness for boilerplate conformance checking. Conclusion. Our results indicate that: (1) text chunking provides a robust and accurate basis for checking conformance to boilerplates, and (2) the effectiveness of boilerplate conformance checking based on text chunking is not compromised even when the requirements glossary terms are unknown. This makes our work particularly relevant to practice, as many industrial requirements documents have incomplete glossaries.
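As a much-simplified illustration of boilerplate conformance checking (the study itself uses NLP text chunking rather than regular expressions), a single hypothetical Rupp-style template can be encoded as a regex and each requirement matched against it:

```python
# Hedged sketch: conformance checking against one assumed boilerplate of the
# form "The <system> shall <verb> <rest>." -- both the template and the
# example requirements are invented for illustration.
import re

BOILERPLATE = re.compile(r"^The \w+ shall (provide|support|allow|reject) .+\.$")

def conforms(requirement: str) -> bool:
    """Return True if the requirement matches the boilerplate template."""
    return bool(BOILERPLATE.match(requirement))

reqs = [
    "The satellite shall provide telemetry every 30 seconds.",
    "Telemetry would be nice to have.",
]
for r in reqs:
    print(conforms(r), "-", r)
```

Text chunking makes this robust to variation that a fixed regex cannot handle, e.g. multi-word system names or verbs outside a fixed list, which is why the study's approach works even with incomplete glossaries.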
This research used Web-based two-tier diagnostic assessment and Web-based dynamic assessment to develop an assessment-centered e-Learning system, named the ‘GPAM-WATA e-Learning system.’ This system consists of two major designs: (1) personalized dynamic assessment, meaning that the system automatically generates dynamic assessment for each learner based on the results of the pre-test of the two-tier diagnostic assessment; (2) personalized e-Learning material adaptive annotation, meaning that the system annotates the e-Learning materials each learner needs to enhance learning based on the results of the pre-test of the two-tier diagnostic assessment and dynamic assessment. This research adopts a quasi-experimental design, applying GPAM-WATA e-Learning system to remedial Mathematics teaching of the ‘Speed’ unit in an elementary school Mathematics course. 107 sixth-graders from four classes in an elementary school participated in this research (55 male and 52 female). With each class as a unit, they were divided into four different e-Learning models: (1) the personalized dynamic assessment and personalized e-Learning material adaptive annotation group (n = 26); (2) the personalized dynamic assessment and non-personalized e-Learning material adaptive annotation group (n = 28); (3) the non-personalized dynamic assessment and personalized e-Learning material adaptive annotation group (n = 26); and (4) the non-personalized dynamic assessment and non-personalized e-Learning material adaptive annotation group (n = 27). Before remedial teaching, all students took the prior knowledge assessment and the pre-test of the summative assessment and two-tier diagnostic assessment. Students then received remedial teaching and completed all teaching activities. After remedial teaching, all students took the post-test of the summative assessment and two-tier diagnostic assessment. 
It is found that compared to the e-Learning models without personalized dynamic assessment, e-Learning models with personalized dynamic assessment are significantly more effective in facilitating student learning achievement and improvement of misconceptions, especially for students with low-level prior knowledge. This research also finds that personalized e-Learning material adaptive annotation significantly affects the percentage of reading time students spend on the e-Learning materials they need to enhance learning. However, it does not appear to predict student learning achievement and improvement of misconceptions.