Automated Scoring for Essay Questions In E-learning
Manar Joundy Hazar1
Zinah Hussein Toman2
Sarah Hussein Toman3
1Computer Center, University of Al-Qadisiyah, Qadisiyah, Iraq.
2Computer Science Department, College of Computer Science and Information Technology, University of Al-Qadisiyah
3Roads and Transport Department, College of Engineering, University of Al-Qadisiyah
E-learning, the use of technology to support and promote learning, began decades ago. One of the most important examples of intelligent software agents in e-learning is the Intelligent Tutoring System (ITS). Assessment plays a significant role in the educational process. Automated Essay Scoring (AES) is the computer technology that evaluates and scores subjective answers. In this work we present a model for evaluating students' essay answers based on linguistic knowledge. Simulation results obtained from the proposed system indicate high precision compared with other methods. We used the root mean square error (RMSE) and the Pearson correlation coefficient to measure the system's efficiency, and we used student answers from the University of North Texas data collection, which is publicly available on the Internet.
Keywords: E-learning, WordNet, NLP, semantic similarity, essay question.
1. INTRODUCTION
E-learning is an effective method of learning that uses modern technologies based on the Internet. It enables courses to be delivered to students anytime and anywhere, and allows interaction with students in an effective and simple manner. E-learning has become one of the most important requirements of the educational process: it keeps pace with the rapid development of educational institutions around the world and has played an essential role in the development of education.
In recent years e-learning has become a significant technology trend. The resources it offers are richer than those of the traditional classroom, which simplifies teaching, and it overcomes the time and space restrictions of traditional learning. It allows students to study independently, freeing them from the constraints of traditional educational techniques.
Figure 1: The basic components of an ITS
Assessment of learning outcomes with tests and examinations can take many forms and grading approaches. Designed questions range from multiple-choice questions to questions that require natural language answers, such as short answers or essays. Grading may be done either manually or automatically by computational methods. In this research we focus on essay-type answers and automatic grading techniques.
Many researchers argue that the subjective nature of essay evaluation leads to disparities in the scores given by different human raters, which students consider a major source of injustice. This problem can be addressed by adopting automated essay evaluation tools. An automated assessment system will at least be consistent in the way essays are scored, and large savings of time and cost can be achieved if the technique can grade essays within the tolerance granted to a human evaluator. Furthermore, according to Hirst (2000), using computers to increase our understanding of the textual features and cognitive skills involved in creating and understanding written texts will provide a number of benefits to the educational community. In addition, it will help in developing more effective technologies, such as search engines and question-answering systems, to provide universal access to electronic information.
Summit (an exam preparation assistant) is a tool that evaluates students' essays based on their content, using a semantic text analysis technique known as Latent Semantic Analysis (LSA). Essentially, LSA represents each word as a vector in a high-dimensional space so that the nearness between two vectors is closely related to the semantic similarity between the two words.
This change in how course content is delivered and how students interact with it is a major departure from the traditional classroom course. Perhaps the most interesting aspect is where the introduction to the material and the deeper engagement with it take place. Traditionally, the introduction is provided through an in-class lecture, and deeper engagement happens outside the classroom through homework. In the arrangement described above, the introduction takes place outside the classroom and the engagement occurs within it.
In this area, most researchers agree that it is inappropriate to use objective questions to measure some complex aspects of achievement. Learning outcomes involving the ability to recall, organize, and integrate ideas, to express oneself in writing, and to supply more than mere identification of the interpretation and application of data require a less constrained response format than objective test items allow. In measuring such outcomes, the highest levels of Bloom's taxonomy (1956) (specifically evaluation and synthesis) correspond to the purposes the essay question serves best.
1.2 PROBLEM STATEMENT
Unlike multiple choice questions (MCQs), essays contain subjective answers rather than the exact answers of MCQs (e.g., true or false). The student's skill and ability therefore play a large role in producing a strong answer free of misspellings and grammatical errors, which would otherwise reduce the student's mark. Hence, automatic essay evaluation is a challenging task because a complete assessment is needed in order to validate the answers accurately.
2.1 Related work
Several researchers have addressed the problem of automated essay scoring, also called automatic essay assessment, using various techniques. The key idea behind these techniques lies in a set of essays manually scored by humans, against which the essay to be assessed is compared. Usually, the manually scored essays are called pre-scored or training essays, whereas the essay to be assessed by the computer is called the tested essay or the automatically scored essay.
The earliest system proposed for essay assessment was the Project Essay Grader (PEG), which focused on the writing style of a given essay in order to produce a score. Writing style covers essay length and mechanics such as spelling errors, capitalization, grammar, and diction. This approach was criticized for its lack of semantic analysis, in which the content itself is evaluated.
Graesser et al. (2000) used several methods for scoring the essay answers submitted by a sample of students, one of which evaluates the essay against an efficiency model grounded in pedagogy. One researcher suggested a model for intelligent assessment of students' test answers using a grammar checker that provides feedback after evaluating the student's answer. Another proposed a student-answer identification program that uses a control-flow diagram to measure similarity.
Satav, Nanekar, Pingale, and Nupur presented an examination system based on SQL and Microsoft .NET (C# and ASP.NET). They implemented an examination system for computer applications. Their system has only examination and evaluation subsystems, and it includes different types of questions related to computer applications, such as multiple choice, fill in the blank, true/false, and programming design questions.
Mohamed Jaballah, Saad Harous, and Sane M. Yagi developed an Arabic examination system for students at the University of Sharjah. Their system includes examination and grading subsystems for different question types, such as true/false, multiple choice, fill in the blank, and essay questions. The exam paper is generated automatically by the examination system. However, the grading subsystem was not automated: grading is done manually by the teacher through a grading interface.
Chen Xiangjun and Wu Fangsheng proposed an examination system that provides login activity recording, user management, and test question management. It includes an examination subsystem and a grading subsystem based on matching the students' answers against an answer key.
In this chapter a framework is proposed for the evaluation of students in educational environments based on linguistic knowledge. The framework takes the student's answer as input, which is passed through the three layers illustrated in Figure 2.
Figure 2:Layers of the proposed framework
The suggested system is based on linguistic knowledge. Students' answers are obtained electronically and compared to the ideal answer stored in the system. Both the acquired answer and the ideal answer are processed through the layers described below.
3.1 The Pre-Processing Layer:
This process is powered by the OpenNLP framework, an open-source statistical parser. OpenNLP is used to process natural languages such as English. It performs sentence detection, tokenization, part-of-speech tagging, and named-entity detection for common entity types such as people, organizations, places, cities, and countries.
3.1.1 Sentence Detection
A sentence is defined as the longest whitespace-trimmed character sequence between two punctuation marks. The OpenNLP Sentence Detector decides whether a punctuation character marks the end of a sentence or not. This rule does not apply at the boundaries of the text: the first non-whitespace character is assumed to begin a sentence, and the last non-whitespace character is assumed to end one [16, 17].
3.1.2 Tokenizer
The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation marks, numbers, etc.
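Since OpenNLP itself is a Java library, the two steps just described can be illustrated with a minimal rule-based Python sketch (unlike OpenNLP's statistical models, this is purely pattern-based, and the function names are our own):

```python
import re

def split_sentences(text):
    # Split on whitespace that follows sentence-ending punctuation,
    # a rule-based stand-in for OpenNLP's statistical sentence detector.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def tokenize(sentence):
    # Segment a sentence into word/number tokens and punctuation tokens.
    return re.findall(r"\w+|[^\w\s]", sentence)

text = "E-learning began decades ago. It supports students anywhere!"
print(split_sentences(text))  # two sentences
print(tokenize("It ran."))    # -> ['It', 'ran', '.']
```

A real tokenizer would also handle abbreviations and hyphenated words, which this regex deliberately ignores.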
3.1.3 Name Finder
The Name Finder can detect named entities and numbers in text. To detect entities, the Name Finder needs a model. The model depends on the language and on the entity type on which it was trained.
OpenNLP ships with a number of pre-trained name finder models for various entity types, available for free. To find names, the raw text must first be divided into sentences and tokens. It is important that the tokenization of the training data and of the input text be identical.
3.1.4 Stop Words
A correct English sentence contains some grammatical function words that do not contribute to its meaning. For example, words such as "from" and "to" are stop words: the sentence needs them to be grammatically complete, but they do not carry its real meaning. In the phrase "a fast car", the words "car" and "fast" are the essential part, while the remaining words only describe the relationship between them. Stop words appear in most sentences, which can inflate the overall similarity score, so they should be eliminated before sentences are compared.
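This elimination step can be sketched in a few lines of Python; the stop-word list below is a tiny illustrative sample, not a standard list:

```python
# A tiny illustrative stop-word list; real systems use larger standard lists.
STOP_WORDS = {"the", "a", "an", "from", "to", "in", "of", "and", "is", "for", "at"}

def remove_stop_words(tokens):
    # Drop function words so shared stop words do not inflate
    # the overlap score between two answers.
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words("a fast car is in the garage".split()))
# -> ['fast', 'car', 'garage']
```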
3.1.5 POS Tagger
Part-of-speech (POS) tagging assigns each word in a text one tag from a fixed set of parts of speech, such as noun, verb, adjective, or adverb. Many words have several possible tags, and the POS tagger is used to disambiguate them. The primary role of the POS tagger is therefore to determine the exact tag for each word in context.
3.1.6 Stemming
Words with the same meaning appear in various morphological forms. To capture their similarity, they are normalized into a common root form; producing the stem involves suffix deletion. Automated word stemming plays an important role in information retrieval systems, and automatic suffix removal has a major effect on them, since it reduces the total number of terms in the system. We use the Porter stemming algorithm for this purpose.
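The suffix-deletion idea can be illustrated with a toy stemmer (a deliberate simplification: the real Porter algorithm applies measure-based rules in five ordered phases, which this sketch omits):

```python
def simple_stem(word):
    # Strip one common English suffix, keeping a stem of at least
    # three letters -- a toy approximation of Porter-style deletion.
    for suffix in ("ization", "ational", "ingly", "ness", "ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print(simple_stem("retrieving"))  # -> 'retriev'
print(simple_stem("systems"))    # -> 'system'
```

Note that, like the real Porter stemmer, this maps "retrieving" to a stem ("retriev") that is not itself a dictionary word; stems only need to be consistent, not readable.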
3.1.7 Chunking and Parsing
Chunking divides the text into related groups of words, such as noun phrases and verb phrases, but does not specify their internal structure or their role in the main sentence.
Parsing provides rich information about the full grammar of a sentence: the parser represents the text as a tree structure built from grammatical rules and the input sentence. Parsing is done in a top-down or a bottom-up manner. Top-down parsers construct the parse tree by deriving the input sentence from the root node S down to the leaves. The search algorithm of the parser expands all trees that have S as the root, using information about the possible trees beneath it, and may build several candidate trees in parallel. In bottom-up parsing, construction of the parse tree starts with the words of the input and tries to build trees over them, again by applying grammar rules. The parse succeeds if the parser manages to build a tree rooted in the start symbol S that covers all of the input.
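The top-down strategy can be sketched as a recursive-descent parser over a toy three-rule grammar; the grammar and word lists below are invented purely for illustration:

```python
def parse_sentence(tokens):
    # Top-down (recursive-descent) parse for the toy grammar:
    #   S -> NP VP,  NP -> Det N,  VP -> V NP
    # The word lists below are invented for illustration.
    DET = {"the", "a"}
    NOUN = {"student", "answer", "system"}
    VERB = {"grades", "reads"}

    def parse_np(i):
        # NP -> Det N
        if i + 1 < len(tokens) and tokens[i] in DET and tokens[i + 1] in NOUN:
            return ("NP", tokens[i], tokens[i + 1]), i + 2
        return None, i

    def parse_vp(i):
        # VP -> V NP
        if i < len(tokens) and tokens[i] in VERB:
            np, j = parse_np(i + 1)
            if np:
                return ("VP", tokens[i], np), j
        return None, i

    # S -> NP VP, and the parse must cover the whole input.
    np, i = parse_np(0)
    if np:
        vp, j = parse_vp(i)
        if vp and j == len(tokens):
            return ("S", np, vp)
    return None

print(parse_sentence("the system grades the answer".split()))
```

Each `parse_*` function expands one grammar rule from the root downward, exactly the derivation order described above; a failed expansion returns `None` so the caller can reject the sentence.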
3.1.8 Co-reference Resolution
In this phase all of the references in the text are matched to their equivalent named entity, regardless of the form the reference takes: it may be a first name, a full name, or a pronoun. Some work has shown that co-reference resolution can improve summarization systems that rely heavily on word-frequency features.
A simple example is pronominal reference: in "John will travel tomorrow; he bought the ticket yesterday", the word "he" refers to "John". If such references are counted together, the entity may be weighted as more important. This type of analysis is not widely used in summarization systems due to performance and accuracy problems.
3.2 The Intermediate Processing Layer:
This layer represents the core of the proposed model for evaluating summaries. Processing at this stage involves applying grammatical and semantic information to the summary evaluation. Initially, the source text and the summary text are split into sets of sentences. Then the similarity between each sentence of the summary and every sentence of the source text is computed using a combination of word-order similarity and semantic similarity. The maximum value is taken as the similarity score of that summary sentence.
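The max-over-source-sentences step can be sketched as follows, with plain word overlap (Jaccard) standing in for the combined word-order and semantic similarity measure:

```python
def jaccard(s1, s2):
    # Word-overlap similarity between two sentences -- a simple
    # stand-in for the combined word-order/semantic measure.
    a, b = set(s1.lower().split()), set(s2.lower().split())
    return len(a & b) / len(a | b) if (a or b) else 0.0

def score_summary_sentence(summary_sentence, source_sentences):
    # Each summary sentence gets the maximum similarity it reaches
    # against any sentence of the source text.
    return max(jaccard(summary_sentence, s) for s in source_sentences)

score = score_summary_sentence("the cat sat", ["the cat sat", "a dog ran"])
print(score)  # -> 1.0
```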
3.2.1 WordNet
WordNet is an on-line lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory. English nouns, verbs, and adjectives are organized into synonym sets (synsets), each representing one underlying lexical concept, and different relations link the synonym sets. Each wordnet represents an autonomous structure of language-specific lexicalizations, which are interconnected via an Inter-Lingual-Index. The wordnets are built at different sites from existing resources, starting from a shared level of basic concepts and extended top-down. The Collins and Quillian model proposed a hierarchical structure of concepts, where more specific concepts inherit information from their superordinate, more general concepts; only knowledge particular to a specific concept needs to be stored with it. Thus, it took subjects longer to confirm a statement like "canaries have feathers" than the statement "birds have feathers" since, presumably, the property "has feathers" is stored with the concept bird and not redundantly with the concept for each kind of bird. This means that the similarity procedure will consist only of comparisons between nouns, between verbs, and so on.
Figure 3: Example of WordNet Structure
The availability of the WordNet database was an important starting point. The synset model is simple enough to provide a fundamental relationship between a concept and the corresponding words. Moreover, WordNet covers the entire English lexicon and provides an unusually fine-grained conceptual differentiation. It is also particularly convenient from a computational point of view, because it was developed for easy access to and traversal of its hierarchies. Starting from WordNet, we chose a subset of appropriate synsets to represent the relevant concepts.
The first decision in the mapping was to settle which relationships would be used to map onto WordNet synsets. There are three possible relationships: synonymy, hypernymy, and instantiation. A few examples illustrate these three relationships and their use in mapping onto WordNet groups; consider an entry in the WordNet noun database.
Formally, the semantic similarity is defined as follows:

sim_res(c1, c2) = max_{c ∈ S(c1, c2)} IC(c)
where S(c1, c2) is the set of concepts that subsume both c1 and c2, and IC is the information content. The similarity measure of Lin uses the same IC idea. Lin explains his definition of similarity: "The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are."
Formally, the definition above can be expressed as:

sim_lin(c1, c2) = 2 · sim_res(c1, c2) / (IC(c1) + IC(c2))
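The Resnik and Lin definitions can be checked on a toy taxonomy; the concept probabilities below are invented for illustration (real systems derive information content from WordNet plus corpus frequencies):

```python
import math

# Toy taxonomy and concept probabilities (invented for illustration);
# real systems derive IC from WordNet plus corpus counts.
PARENT = {"canary": "bird", "robin": "bird", "bird": "animal", "animal": None}
PROB = {"canary": 0.01, "robin": 0.02, "bird": 0.10, "animal": 0.50}

def ic(c):
    # Information content: IC(c) = -log p(c).
    return -math.log(PROB[c])

def ancestors(c):
    # The concept itself plus all of its superordinate concepts.
    out = []
    while c is not None:
        out.append(c)
        c = PARENT[c]
    return out

def sim_res(c1, c2):
    # Resnik similarity: IC of the most informative common subsumer.
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(ic(c) for c in common)

def sim_lin(c1, c2):
    # Lin similarity: 2 * sim_res(c1, c2) / (IC(c1) + IC(c2)).
    return 2 * sim_res(c1, c2) / (ic(c1) + ic(c2))

print(round(sim_lin("canary", "robin"), 3))  # shared subsumer is "bird"
```

With these probabilities, "canary" and "robin" come out moderately similar (their most informative common subsumer is "bird"), while any concept compared with itself scores exactly 1.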
3.2.2 Sentence Compression and Partitioning
To obtain a short sentence, unnecessary information must be removed from the source sentence, so a deletion strategy is used. Unimportant information includes trivial details such as examples, scenarios, or repetitions of information stated elsewhere. The basic job of the deletion strategy is to remove non-essential material such as stop words and parenthetical interpretations. Given an original sentence Os and its concise version Ss, let Len(Os) denote the length of Os and Len(Ss) the length of Ss. The first rule of the deletion strategy is that Len(Ss) must be less than Len(Os).
Sentence partitioning: each sentence must be divided into a list of words in order to measure similarity. The sentence is tokenized after first eliminating the stop words. Stop words are words that should be removed from a sentence before any natural language processing technique is applied; there is no single fixed set of stop words, but the most common choice is the set of functional words such as "the", "for", and "at". Finally, the separator characters between words are recognized and the word list is created.
3.2.3. Word Sense Disambiguation
Each POS class in WordNet is divided into synsets containing synonyms. A word can be polysemous, that is, it can have several senses. All of these senses belong to the same POS class but to different synsets. When calculating the degree of similarity between two words, it is necessary to make sure that the correct senses of the words are compared. This is done by disambiguating each word in its given context. The algorithm most often used for word sense disambiguation is the Lesk algorithm, proposed by Michael E. Lesk in 1986. The algorithm uses the dictionary definitions (glosses) of word senses to determine which sense shares the most material with the context.
Word sense disambiguation must be done as part of the sentence similarity process, since a word has different meanings in different contexts. The adapted Lesk algorithm is considered one of the most accurate algorithms in the field of word sense disambiguation.
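The gloss-overlap idea of Lesk can be sketched over a hypothetical two-sense inventory; the sense labels and glosses below are invented for illustration:

```python
# Hypothetical mini sense inventory for "bank"; the sense labels
# and glosses are invented for illustration.
GLOSSES = {
    "bank_1": "sloping land beside a body of water such as a river",
    "bank_2": "a financial institution that accepts deposits and lends money",
}

def simplified_lesk(context, glosses):
    # Choose the sense whose gloss shares the most words with the
    # context sentence -- the core idea of Lesk (1986).
    ctx = set(context.lower().split())
    return max(glosses,
               key=lambda sense: len(ctx & set(glosses[sense].lower().split())))

print(simplified_lesk("he sat on the bank of the river", GLOSSES))  # -> bank_1
```

The adapted Lesk algorithm cited above extends this by also counting overlaps with the glosses of related synsets (hypernyms, hyponyms, etc.), not just the target sense's own gloss.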
3.3 The Post-Processing Layer
The auto-grading system uses natural language processing algorithms to measure the match between the stored answer and the student's answer for free-response questions. This layer produces the results of the method: it reports the measured similarity as the student's grade. The following equation (the matching average) is used to determine the final score (FS) for any written summary of the student:

FS = 2 · Token_Match(X, Y) / (|X| + |Y|)

where Token_Match(X, Y) is the number of matching word tokens between X and Y.
The Dice coefficient returns the proportion of tokens that can be matched over the total number of tokens; Dice will always return a higher value than the matching average [26, 27]. This policy needs a predefined threshold to select the matching pairs whose values are greater than the given threshold:

Dice(X, Y) = 2 · |X ∩ Y| / (|X| + |Y|)
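A minimal Python sketch of the Dice coefficient over token sets (the example answers are invented for illustration):

```python
def dice(x_tokens, y_tokens):
    # Dice coefficient: 2 * |X ∩ Y| / (|X| + |Y|) over the two token sets.
    x, y = set(x_tokens), set(y_tokens)
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))

ideal = "both involve repetition and a termination test".split()
student = "both have repetition and a termination condition".split()
print(round(dice(ideal, student), 3))  # 5 shared tokens out of 7 + 7 -> 0.714
```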
We evaluate the system by measuring its performance in two ways. We measure the correlation between two variables using the Pearson correlation coefficient, whose value lies in [-1, 1], where 1 means a positive correlation, 0 no correlation, and -1 a negative correlation. We also measure the deviation between the assessed and actual values using the root mean square error (RMSE).
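Both metrics are straightforward to compute; a sketch over hypothetical score lists:

```python
import math

def pearson(xs, ys):
    # Pearson correlation coefficient between two equal-length score lists.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def rmse(predicted, actual):
    # Root mean square error between predicted and actual scores.
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

human = [8.0, 6.0, 9.0, 4.0]  # hypothetical lecturer scores
auto = [7.5, 6.5, 8.5, 4.5]   # hypothetical automatic scores
print(round(pearson(human, auto), 3), round(rmse(auto, human), 3))
```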
4.2 Simulation Environment
The presented system was implemented using the OpenNLP open-source library to process text. It provides all the required NLP tasks (tokenization, POS tagging, chunking, parsing, etc.). The interface of the proposed system was built using Visual Studio 2013.
4.4. Results and Discussion
To evaluate the performance of the proposed model, we conducted three experiments. In the first, we evaluated the performance of the system against human judgment in grading students' answers. In the second, we compared the proposed framework with human calculations, and in the third we compared it with state-of-the-art and existing systems.
4.4.1. Test 1
In the summer of 2017, a dataset of student responses was collected from Al-Qadisiyah University students in a Web programming course. It contains one essay question. Table 1 displays the question used in this study, and Figure 4 shows sample answers from three students.
Table 1: First Test
list the benefits of
The answer essays were ranked manually by five human graders, all of whom are computer science lecturers. The mark given to each answer was in the range 0 to 10. Three samples were taken, and the automatic score for every answer was compared with the lecturer's score.
4.4.2. Test 2
To evaluate our technique for grading student responses, its performance was measured against human grading to determine the similarity between the student response and the human score.
Figure 4: Manual student answer samples
Table 2: proposed technique against human scoring
Sample questions, ideal answers, and student responses:

Question: What are the similarities between iteration and recursion?
Ideal answer: They both involve repetition; they both have termination tests; they can both occur infinitely.
Student responses:
- Both are based on a control statement.
- Both involve a controlled repetition structure, and they both have a termination test. Also, both of them can loop forever.
- They are methods of repeating the same task.

Question: What is the main difference between a while and a do...while statement?
Ideal answer: The block inside a do...while statement will execute at least once.
Student responses:
- A do...while loop always executes once. A while loop's conditional statement has to be true for it to run.
- The loop body always executes at least once.
- Do...while runs the embedded code at least once; the do command does not
In the final stage we independently tested one component of our overall grading system. To assess the performance of the suggested technique, an evaluation metric is needed; several evaluation metrics are commonly used in NLP applications. In our test, the evaluation compares the proposed framework against the dataset using the correlation metric over the full dataset. Table 3 shows the results of the proposed framework compared with each of these measures, and Figures 5, 6, and 7 compare the knowledge-based, corpus-based, and baseline measures with the proposed framework, which achieved a higher correlation than each of them.
Table 3: Result of the presented method against each measures using correlations.
Figure 5: comparison of knowledge-based and proposed method
4.4.3. Test 3
To evaluate the suggested framework on the dataset, we chose the following approaches for comparison: corpus-based and baseline measures.
Figure 6: Suggested method Vs Corpus-based.
(Compared measures: Shortest Path, Leacock & Chodorow, Lesk, Wu & Palmer, Resnik, Lin, Jiang & Conrath, and Hirst & St-Onge among the knowledge-based measures, and LSA BNC, LSA Wikipedia, and ESA Wikipedia among the corpus-based and baseline measures; the proposed model reached a correlation of 0.4901 against 0.4072, 0.4288, and 0.4684 for the strongest competing measures.)
Figure 7: Suggested method Vs. Baseline
As shown in Figures 5, 6, and 7, the suggested method clearly achieved a higher correlation coefficient than all other methods. We can also notice the closeness between the suggested method's scores and human judgment.
Lastly, Table 4 shows the RMSE of the suggested framework and of some other state-of-the-art systems, giving an indication of the performance of these grading methods.
Table 4: RMSE measures
4.5 Conclusion
In this paper we offered an analysis and discussion of the most recently presented assessment approaches in tutoring, and described how intelligent tutoring systems (ITSs) can be used for student assessment by means of NLP tools and algorithms. This work studied the automated assessment of learners' input using a lexico-syntactic (entailment) scheme for textual analysis alongside a variety of other textual assessment measures. In the proposed framework we use NLP to discover the semantic similarity between the student answer and the ideal answer in an automatic grading system. The experimental results show the efficiency of the presented framework and demonstrate that it is an accurate method.
4.6 Future Work
Essay questions may contain mathematical symbols and formulas, which is a big challenge, so as future work this framework can be extended to cover other formats in essay questions, such as mathematical notation, and also other languages, such as Arabic.
1. Motiwalla, L. F. (2007). Mobile learning: A framework and evaluation. Computers &
education, 49(3), 581-596.
2. Wang, T. H. (2014). Developing an assessment-centered e-Learning system for
improving student learning effectiveness. Computers & Education, 73, 189-203.
3. Valenti, S., Neri, F., & Cucchiarelli, A. (2003). An overview of current research on
automated essay grading. Journal of Information Technology Education: Research, 2,
4. Bin, L., & Jian-Min, Y. (2011, September). Automated essay scoring using multi-
classifier fusion. In International Conference on Information and Management
Engineering (pp. 151-157). Springer, Berlin, Heidelberg.
5. Felder, R. M., Woods, D. R., Stice, J. E., & Rugarcia, A. (2000). The future of
engineering education II. Teaching methods that work. Chemical Engineering
Education, 34(1), 26-39.
6. Lage, M. J., Platt, G. J., & Treglia, M. (2000). Inverting the classroom: A gateway to
creating an inclusive learning environment. The Journal of Economic Education, 31(1),
7. Ghosh, S. (2010). Online Automated Essay Grading System as a Web Based Learning
(WBL) Tool in Engineering Education. Web-Based Engineering Education: Critical
Design and Effective Tools: Critical Design and Effective Tools, 53.
8. Landauer, T. K. (2003). Automatic essay assessment. Assessment in education:
Principles, policy & practice, 10 (3), 295-308.
9. Burrows, S., Gurevych, I., & Stein, B. (2015). The eras and trends of automatic short
answer grading. International Journal of Artificial Intelligence in Education, 25(1), 60-
10. J. Kurs, M. Lungu, and O. Nierstrasz, “Top-Down Parsing with Parsing Contexts”, In
Proceedings of International Workshop on Smalltalk Technologies (IWST14), England,
2014, pp. 1-7.
11. Graesser, A. C., Wiemer-Hastings, P., Wiemer-Hastings, K., Harter, D., Tutoring
Research Group, T. R. G., & Person, N. (2000). Using latent semantic analysis to
evaluate the contributions of students in AutoTutor. Interactive learning environments,
12. Baker, R. S., D'Mello, S. K., Rodrigo, M. M. T., & Graesser, A. C. (2010). Better to be
frustrated than bored: The incidence, persistence, and impact of learners’ cognitive–
affective states during interactions with three different computer-based learning
environments. International Journal of Human-Computer Studies, 68(4), 223-241.
13. Jaballah, M., Harous, S., & Yagi, S. M. (2008, April). UOS EASY EXAM Arabic
Computer-Based Examination System. In Information and Communication
Technologies: From Theory to Applications, 2008. ICTTA 2008. 3rd International
Conference on (pp. 1-5). IEEE.
14. Ho, Y. S., Sang, J., Ro, Y. M., Kim, J., & Wu, F. (Eds.). (2015). Advances in
Multimedia Information Processing--PCM 2015: 16th Pacific-Rim Conference on
Multimedia, Gwangju, South Korea, September 16-18, 2015, Proceedings (Vol. 9314).
15. Bond, F., & Foster, R. (2013). Linking and extending an open multilingual wordnet. In
Proceedings of the 51st Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers) (Vol. 1, pp. 1352-1362).
16. S. Anantpure, H. Jain, N. Alhat, S. Bhor, and S. Guru, "Literature Survey On Syntax Parser For English Language Using Grammar Rules", International Journal of Advance Foundation and Research in Computer (IJAFRC), Vol. 2, Special Issue (NCRTIT 2015), January 2015, pp. 327-333.
17. S. W. Tu, M. Peleg, S. Carini, M. Bobak, J. Ross, D. Rubin, and I. Sim, “A Practical
Method for Transforming Free-Text Eligibility Criteria into Computable Criteria”,
Journal of Biomedical Informatics, Vol. 44, Issue 2, 2011, pp. 239–250.
18. B. A. Galitsky, “Transfer Learning of Syntactic Structures for Building Taxonomies for
Search Engines”, Engineering Applications of Artificial Intelligence, Vol. 26, Issue 10,
November 2013, pp. 2504-2515.
19. Willett, P. (2006). The Porter stemming algorithm: then and now. Program, 40(3), 219-
20. G. Wilcock, “Text Annotation with OpenNLP and UIMA”, Proceedings of the 17th
Nordic Conference of Computational Linguistics, University of Southern Denmark,
Odense, 2009, pp. 7-8.
21. C. Arora, M. Sabetzadeh, L. Briand, F. Zimmer, and R. Gnaga, "Automatic Checking of Conformance to Requirement Boilerplates via Text Chunking: An Industrial Case Study", 7th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), USA, 2013, pp. 35-44.
22. Agirre, E., & Edmonds, P. (Eds.). (2007). Word sense disambiguation: Algorithms and
applications (Vol. 33). Springer Science & Business Media.
23. Banerjee, Satanjeev, and Ted Pedersen. An Adapted Lesk Algorithm for Word Sense
Disambiguation Using WordNet. Tech. Duluth: University of Minnesota. Print
24. T. N. Dao and T. Simpson, "Measuring Similarity between sentences," WordNet. Net,
25. S. Banerjee, T. Pedersen, "An adapted Lesk algorithm for word sense disambiguation
using. WordNet.," Proceedings of the Third International Conference on Computational
Linguistics and Intelligent Text Processing, Springer Berlin Heidelberg., pp. 136-145,
26. M. Mohler, R. Bunescu, R. Mihalcea, "Learning to Grade Short Answer Questions using
Semantic Similarity Measures and Dependency Graph Alignments," Association for
Computational Linguistics, pp. 752–762, 2011.