Building a Model for finding Quality of Affirmation in a Discussion Forum
Aparna Lalingkar, Prakhar Mishra, Sridhar Mandyam, Jagatdeep Pattanaik and Srinath Srinivasa
Web Science Lab, IIIT Bangalore, Bangalore, India
aparna.l@iiitb.ac.in, prakhar.mishra@iiitb.org, Sridhar.mandyamk@iiitb.ac.in, jagatdeep.pattanaik@iiitb.org, sri@iiitb.ac.in
Abstract— Education is an inherently social activity. People
like to exchange thoughts and learn from each other: this is
why we are interested in discussions. But discussions can be
messy and vague. In order to make discussions meaningful,
relevant mediations may need to be made, whenever
discussions lose clarity. Discussion forum data is in the form of
a sequence of questions, answers, and comments on the
answers. Taken together, this data is called an affirmation.
Knowing the clarity of an affirmation makes it possible to
intervene in a discussion to steer it towards agreement or
conclusion. We have built a model of clarity for a branch of an affirmation, based on scores of agreements, disagreements, and partial answers. We used three classifier models in our experimentation; of these, the random forest classifier gave the best F1 score, 0.7315. For our model, the ratio of strong agreement to partial agreement was 1.44 for the training data and 1.37 for the testing data, which is quite close. In the future, we plan to enhance our model at the sentence level to capture more detail and to compute an aggregate score of clarity for an affirmation.
Keywords-Social Learning; Discussion Forum Analysis;
Mediation; Quality of Affirmation
I. INTRODUCTION
Learning is an inherently social activity. Bandura’s [1]
social learning theory hypothesizes that people learn from
each other through observation, imitation, and modeling. In
digital space, social learning can happen through online
discussion forums [2], online chat platforms such as
WhatsApp [3], computer-supported collaborative learning
systems (CSCLs) [4], question-answering systems [5], etc.
Online discussion forums are used for promoting
conversational learning. Various studies have highlighted the
importance of instructors playing an active role [6] and
making thoughtful comments [7] in discussion forums. In the
paradigm of Navigated Learning (NL), the importance of
technology-based mediation in social learning has been
discussed by Lalingkar et al. [8].
Shorn of the technical matter contained in a discussion, we think of the ‘quality’ of a discussion as a measure of the articulated agreement or assertion of understanding among participants, expressed in natural language. We consider better clarity to have emerged in a discussion when a larger proportion of participants make assertions of understanding, which we shall refer to as an affirmation. It is important to determine
the quality of the discussion automatically, even while it is in progress, for then we can consider developing tools that could help steer a discussion towards better quality and assist in improving learning outcomes by mediating in the discussion. In this
paper, we first discuss quality of affirmations and the state of
the art with respect to the automatic calibration of the quality
of affirmations. We then discuss the dataset considered for
experimentation and the feature sets created for
implementation in the classification models. Finally, we
discuss experimentation and results followed by conclusion
and future work.
II. QUALITY OF AFFIRMATIONS
The data of an online discussion forum has the following
format: a participant asks a question (Q), which could have
various answers, for example A1, A2, A3, etc. Each answer could optionally have multiple comments (for example, A1 has two comments, C1.1 and C1.2), and each comment could, in turn, have multiple replies (for example, C1.1 has two replies, R1.1 and R1.2). The whole conversation
tree for one question (Q) is called an affirmation (see figure
1).
Figure 1. An affirmation in a discussion forum
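The conversation tree just described can be sketched as a small recursive data structure; the following is a minimal illustration (the `Node` type and `depth` helper are our own, not part of the platform):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    """A post in an affirmation tree: a question, answer, comment, or reply."""
    text: str
    children: List["Node"] = field(default_factory=list)

def depth(node: Node) -> int:
    """Depth of the conversation tree rooted at `node` (a lone post has depth 1)."""
    if not node.children:
        return 1
    return 1 + max(depth(c) for c in node.children)

# The affirmation of Figure 1: Q -> A1 (with comments C1.1 and C1.2), A2, A3;
# C1.1 has two replies.
q = Node("Q", [
    Node("A1", [Node("C1.1", [Node("R1.1"), Node("R1.2")]), Node("C1.2")]),
    Node("A2"),
    Node("A3"),
])
```

A branch, as used later in the paper, is one root-to-leaf path of this tree, e.g. Q, A1, C1.1, R1.1.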
The quality of an affirmation could be decided by
identifying which conversation threads have been concluded
or have been resolved. If there is partial resolution in a
conversation thread, then relevant questions could be
inserted into it to steer the conversation towards the
concluding state or resolution. Several studies [9, 10, 11, 12]
have discussed identification of resolution, partial resolution
and lack of resolution in a discussion. Generally, it is
observed that resolutions are preferred responses that are
temporally immediate and stated explicitly without any
hesitation or doubt. Expressions such as “Thanks, now it is
clear”, “I agree with you”, “Got it, thanks”, “Thank you, it
worked” are clear and strong indicators of resolution. On the
other hand, a lack of resolution is marked by non-preferred responses such as hesitations, repeated questioning, requests for clarification, contradictions, etc. Expressions such as “I do not agree with you”, “I am not convinced”, “This is a wrong answer”, “I don’t think it’s a viable idea”, and “It is still unclear” are strong indicators of a lack of resolution. Partial resolutions are indicated when people ask for explanations, apologize, continue questioning even after an agreement, or do not conclude their answers. Whenever there is partial agreement, there is implicitly also partial disagreement. Expressions such as “Yes, I agree, but what if…?”, “Can you elaborate?”, “Can you clarify?”, “Could you please give an example and explain?”, and “Sorry, but can you give some direction?” are indicators of partial agreement and disagreement.
PRE-PRINT
Menini & Tonelli [13] classified the points of view of two political parties (based on documents used in an electoral campaign, and not in dialogue form), first by using manual annotation and then by using supervised learning. However, it is also important to see how a group reaches agreement in a discussion forum by considering the structure of the conversation, as studied by Rosenthal & McKeown [14].
Detection of agreement and disagreement in several types of digital discussions has been popular in the literature. Baym [10] analyzed computer-mediated discussions in terms of agreements and disagreements; the findings indicated that disagreements are more serious than agreements and contain directly contradictory statements, as compared to agreements, which contain affirmative assertions. Kelsey et al. [15] demonstrated detection of disagreement in a discussion of news items using a pseudo-monologic rhetorical structure. Hillard et al. [16] studied the detection of agreements and disagreements in a set of meetings using a combination of unsupervised training on an unlabelled data set and supervised training on a small amount of labelled data. Misra & Walker [17] worked on topic-independent identification of agreements and disagreements in social media dialogues using linguistic features such as unigrams and bigrams. Wang et al. [18] studied a conditional random field-based approach for detecting agreements and disagreements among speakers in English broadcast conversations. We decided to
investigate: How can we build models for finding quality of
affirmation in an academic discussion forum? Our basic
approach was to annotate the branches of discussion trees
manually and then train the models based on various feature
sets. The data set and the complete experimental setup are explained in the next section.
III. DATA SET AND METHODOLOGY
A. Data Set
For the experimentation, we obtained discussion forum
data from an online learning platform where courses on
subjects such as Machine Learning and Data Science are
offered for working professionals. As described in section II,
we received the discussion data in the form of questions and
answers to those questions: a total of 17,947 questions and
54,309 answers. Some of the answers had comments as well.
The data had the following structure: each question had
[(Question ID, Time stamp), Questioner ID, Question text];
each answer had [(Answer ID, Time stamp), Answerer ID,
Answer text]; and each comment had [(Comment ID, Time
stamp), Commenter ID, Comment text]. After initial manual
analysis of the data, we noticed that some data points were
missing, so we sought additional data from the online
learning platform to add value to our experimental setup. The
new updated data contained a total of 8,954 questions and 13,018 answers, and had the following structure: each question had [(Question ID, Time stamp), Questioner ID, Question Title, Question Text, Upvote]; each answer had [(Answer ID, Time Stamp), Answerer ID, Answer Text, TA Verification, Upvote]; and each comment had [(Comment ID, Time stamp), Commenter ID, Comment text].
B. Data Annotation
In the discussion forum data, when one answer to a question and the comments on that answer are followed together, we call it a branch of the conversation tree, or of the affirmation (as defined in section II). After analyzing some data manually,
we randomly selected 150 branches from the data set so that
we could annotate them manually to get the baseline. For this
we automatically created Google forms by using Google
cloud script. We identified five classes (Strong Agreement,
Partial Agreement, Neutral, Partial Disagreement, Strong
Disagreement) to get more granularity for annotation. Eleven
people annotated the sampled branches of the data. We obtained 70% of the annotations directly from the collective annotation, but we needed to discuss the remaining 30%. During the discussion, we found that there was confusion between the partial agreement, partial disagreement, and neutral classes. We designed the feature sets based on our manual analysis and this discussion.
C. Feature Sets
For the learning model, we developed three types of
feature sets: structural features, linguistic features, and meta-
features.
1) Structural Features:
Structural features are based mainly on the structure of the data (in this case, the structure of the branches of an affirmation). The following structural features were considered (the data type of each feature is given in parentheses, and its intended signal in square brackets):
a) Depth of each branch (int) [more discussion indicates a lack of clarity; useful for detecting partial agreement and disagreement]
b) Number of unique people engaged (int) [indicates clarity and the interest factor]
c) Number of upvotes (int) [correctness of the response]
d) Whether the answer is verified by a TA (bool) [correctness of the response]
e) Whether the last comment is a question (bool) [the conversation remains open; useful for detecting partial agreement]
f) Whether the questioner comes back (bool) [interest factor]
g) Count of intermediate questions (int) [more questions indicate a lack of clarity; useful for detecting partial agreement]
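As a rough sketch, the structural features above can be computed from a branch represented as an ordered list of posts. The record fields and function names below are our own assumptions, not the platform's schema, and the question test is a simple heuristic:

```python
def structural_features(branch, questioner_id):
    """Sketch of structural feature extraction for one branch of an affirmation.
    `branch` is a list of posts ordered as [answer, comment, comment, ...];
    each post is a dict with at least "author" and "text" fields."""
    def is_question(text):
        return text.rstrip().endswith("?")  # crude heuristic for a question

    authors = {p["author"] for p in branch}
    return {
        "depth": len(branch),                                 # feature (a)
        "unique_people": len(authors),                        # feature (b)
        "upvotes": branch[0].get("upvotes", 0),               # feature (c)
        "ta_verified": branch[0].get("ta_verified", False),   # feature (d)
        "last_is_question": is_question(branch[-1]["text"]),  # feature (e)
        "questioner_returns": questioner_id in authors,       # feature (f)
        "intermediate_questions":                             # feature (g)
            sum(is_question(p["text"]) for p in branch[:-1]),
    }
```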
2) Linguistic Features:
Linguistic features are based on the words and key phrases used and on the linguistic structure of the data. We used unigrams and bigrams to obtain key phrases, and TF-IDF (Term Frequency-Inverse Document Frequency) to convert these unigrams and bigrams into numbers.
a) TF-IDF weight is a statistical measure used to
evaluate how important a word is to a document in a
collection or corpus.
b) TF(t) = (Number of times term t appears in a
document) / (Total number of terms in the document)
c) IDF(t) = log_e(Total number of documents /
Number of documents with term t in it)
d) TF-IDF weight(t) = TF(t) * IDF(t)
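These formulas translate directly into code; the sketch below uses whitespace tokenization and the natural logarithm, per the IDF definition above (helper names are ours):

```python
import math

def tf(term, doc):
    # TF(t) = (count of t in the document) / (total terms in the document)
    words = doc.split()
    return words.count(term) / len(words)

def idf(term, docs):
    # IDF(t) = log_e(total documents / documents containing t)
    containing = sum(1 for d in docs if term in d.split())
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    # TF-IDF weight(t) = TF(t) * IDF(t)
    return tf(term, doc) * idf(term, docs)

corpus = ["got it thanks", "thanks it worked", "i do not agree"]
```

For example, "thanks" appears in 2 of the 3 documents and once among the 3 terms of the first, so its weight there is (1/3) * ln(3/2).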
3) Meta-features:
Meta-features are features that combine the results from structural and linguistic features with our observations. The following are the meta-features we used:
a) Polarity (range: -1 to +1) [17] [indicates acceptance and rejection; useful for detecting agreement and disagreement] (https://textblob.readthedocs.io/en/dev/quickstart.html)
b) Upper-case word count [indicates technical vocabulary and emphasis given to a word]
c) Number of intermediate questions per number of unique people [interest factor]
d) Punctuation count [17]
e) Title-case word count
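The counting meta-features above are simple to compute; a minimal stdlib-only sketch (function and key names are ours; polarity itself would come from a sentiment tool such as TextBlob, which we omit here):

```python
import string

def meta_features(text, n_intermediate_questions, n_unique_people):
    """Counting meta-features for one post or branch text."""
    words = text.split()
    return {
        # all-caps tokens, e.g. "API", suggest technical vocabulary or emphasis
        "upper_case_words": sum(1 for w in words if w.isupper()),
        "title_case_words": sum(1 for w in words if w.istitle()),
        "punctuation_count": sum(1 for ch in text if ch in string.punctuation),
        "questions_per_person": n_intermediate_questions / max(n_unique_people, 1),
    }
```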
D. Training and Validation
For training we used the same 150 conversation threads that were used for creating the baseline. We used several classic and state-of-the-art classifiers, including Random Forest (RF) [19], XGB [20], and Naïve Bayes (NB) [21], to train the models. We compared the results for all three to find out which one gave better results. For validation we used stratified k-fold cross-validation (k = 7) [22]: we divided the 150 samples into 7 parts, trained the model on 6 parts, and validated it on the remaining 7th part. This was repeated for each part, i.e., 7 times. To find the best parameters, we used grid search [23], with model parameters including the number of trees, the maximum number of features, and the minimum leaf sample size.
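The validation scheme just described can be sketched as follows: a minimal stratified k-fold splitter (our own implementation; in practice a library routine such as scikit-learn's StratifiedKFold would be used):

```python
from collections import defaultdict

def stratified_folds(labels, k=7):
    """Split sample indices into k folds, keeping class proportions roughly
    equal across folds (round-robin within each class). Each fold serves
    once as the validation part while the other k-1 folds train the model."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds
```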
To measure the performance we used Precision, Recall
(https://en.wikipedia.org/wiki/Precision_and_recall) and F1
score (https://en.wikipedia.org/wiki/F1_score).
Precision = (True Positive) / (True Positive+False Positive)
= (True Positive) / (Total Predicted Positive)
Recall = (True Positive) / (True Positive + False Negative)
= (True Positive) / (Total Actual Positive)
F1 = 2 (Precision * Recall) / (Precision + Recall)
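For a binary case, the formulas above reduce to a few lines (the function name is ours):

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and
    false-negative counts, per the formulas above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```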
We conducted several experiments, of which we now discuss the three that contributed to the model. The other experiments were discarded because the features and parameters used in them did not improve the model. All the features discussed in section III (C) were used to train the classification model in all three experiments.
IV. EXPERIMENTATION AND RESULTS
A. Experiment I
For the first experiment, we considered all the five
classes (Strong Agreement, Partial Agreement, Neutral,
Partial Disagreement, Strong Disagreement) for classifying
the randomly selected 150 branches from the data. The
manual annotation results for the five classes are shown in
figure 2. The annotated data was used to train and validate all three classifiers (described in section III (D)), using all the features described in section III (C). Table I shows the F1 score results (experiment I) for all three classifiers.
Figure 2. Manual annotation results for five classes
TABLE I. F1 SCORE (EXPERIMENT I) FOR ALL THREE CLASSIFIERS
Classifier              F1 Score
Random Forest (RF)      0.5098
XGB                     0.4722
Naïve Bayes (NB)        0.2483
It was observed that whenever a participant partially agrees with some aspect, there is implicitly also partial disagreement. In the literature, other researchers had considered three classes (i.e. Strong Agreement, Partial Disagreement, and Strong Disagreement) for classification of the data. So, for the second experiment we decided to combine three classes (i.e. Partial Agreement, Neutral, and Partial Disagreement) into the single class Partial. The manual annotation results for three classes are shown in figure 3. Table II shows the F1 score results (experiment II) for all three classifiers.
B. Experiment II
Figure 3. Manual annotation results for three classes
TABLE II. F1 SCORE (EXPERIMENT II) FOR ALL THREE CLASSIFIERS
Classifier              F1 Score
Random Forest (RF)      0.6687
XGB                     0.6291
Naïve Bayes (NB)        0.2185
Observe that the F1 scores for the Random Forest and XGB classifiers improved in experiment II. Hence, the decision to combine Partial Agreement, Neutral, and Partial Disagreement was right.
C. Experiment III
Manually inspecting the data set, we found that Neutral was
the most ambiguous of the five classes, so we decided to
remove it and leave only Strong Agreement, Partial, and
Strong Disagreement, where Partial includes Partial
Agreement + Partial Disagreement. The manual annotation
results for the three classes without Neutral are shown in
figure 4. Table III shows the F1 score results (experiment
III) for all three classifiers. Clearly, after removing Neutral and combining Partial Agreement and Partial Disagreement into the single class Partial, the F1 scores of all three classifiers improved.
Figure 4. Manual annotation results for three classes except neutral
TABLE III. F1 SCORE (EXPERIMENT III) FOR ALL THREE CLASSIFIERS
Classifier              F1 Score
Random Forest (RF)      0.7315
XGB                     0.6291
Naïve Bayes (NB)        0.4146
To further improve the F1 score, we experimented with additional meta-features, such as the number of digits, lexical diversity, the number of URLs, use of more stop words, Word2Vec + PCA (k-dim), count encoding of existing categorical variables, subjectivity, and count vectorization. However, adding these meta-features did not make any difference to the F1 score for any of the classifiers, so we decided not to include them and retained the previously used feature sets. Table IV shows the comparison
of all three classifiers across all three experiments.
TABLE IV. COMPARISON OF F1 SCORES OF ALL CLASSIFIERS
Classifier    Exp-I (F1 score)    Exp-II (F1 score)    Exp-III (F1 score)
RF            0.5098              0.6687               0.7315
XGB           0.4722              0.6291               0.6518
NB            0.2483              0.2185               0.4146
After comparing the F1 scores of all three classifiers across all three experiments, we concluded that it would be best to proceed with the Random Forest classifier model and the existing feature set.
D. Testing of the Model
We randomly selected 150 unseen (not annotated) branches from the data set and classified them with the existing three-class Random Forest model. The ratio of Strong Agreement to Partial Agreement is 1.44 in the training data and 1.37 in the testing data, which is quite close; hence the model also behaves consistently on unseen data. It can be observed that in both the training and testing data we consistently have few or almost zero Strong Disagreements. When we analyzed the data manually, we observed that people are hesitant to express Strong Disagreement directly: instead, they raise doubts, ask questions, and leave the discussion incomplete and inconclusive. This observation is supported by the classification of the training and testing data.
V. DISCUSSION, CONCLUSION, AND FUTURE WORK
At this point we can highlight two main contributions of our work. Firstly, our data set is a structured, text-based conversation, so we have included more structural features and meta-features along with linguistic features (discussed in section III (C)), which differs from the existing literature. Secondly, our main goal is to detect the points at which to intervene, so that we can use the affirmation scores to push an academic discussion towards a concluding state. We have obtained a model that produces branch-level scores.
We have shown that in an academic discussion forum on an online platform (without considering the scientific/technical content matter), the discussion threads can be classified as Strong Agreement, Partial, or Strong Disagreement by considering structural features, linguistic features, and meta-features. Note that for experiment I we are comparing five classes with five classes, and for experiment III we are comparing three classes with three classes. Since we used a supervised learning approach, which needs annotated samples, and we have only 150 annotated samples, the F1 score is based on those 150 samples; given more annotated samples, the F1 score is likely to increase. We have shown that the Random Forest classifier model gives an F1 score of 0.7315, which is good, and that the ratio of Strong Agreement to Partial Agreement is 1.44 in the training data and 1.37 in the testing data, which is quite close. The classification also supported our observation that in academic discussions people prefer not to express Strong Disagreement directly.
In future work, this model will be further enhanced at the sentence level to find an aggregate score of clarity for an affirmation. This additional information will give the classification greater granularity and provide the precise information needed for academic intervention, either by asking questions or by providing probes to steer the conversation towards a concluding state. The existing model does not use a rule engine, which we plan to incorporate to further enhance the model. We also plan to implement a chat-bot so that this enhanced model can be applied in mediation as described by Lalingkar et al. [8].
ACKNOWLEDGMENT
This paper is based on work that was conducted in
Mphasis CoE for Cognitive Computing at IIITB and
supported by a CSR grant from the Mphasis F1 Foundation.
We thank the Mphasis F1 Foundation for funding the work. We also thank upGrad, the educational platform that offers the online courses, for providing us the discussion forum data set for experimentation. We also thank Prof. G. Srinivasraghavan of IIITB for providing valuable insights for feature selection in the model.
REFERENCES
[1] A. Bandura, “Social learning theory”, Englewood Cliffs, NJ:
Prentice-hall, vol. 1, 1977.
[2] M. J. Thomas, “Learning within incoherent structures: The space of
online discussion forums”. Journal of Computer Assisted Learning,
18(3), 2002, pp.351-366.
[3] H. Avci and T. Adiguzel, “A case study on mobile-blended
collaborative learning in an English as a foreign language (EFL)
context.” International Review of Research in Open and Distributed
Learning, 18(7), 2017.
[4] P. Jermann & P. Dillenbourg, “Group mirrors to support interaction
regulation in collaborative problem solving”. Computers &
Education, 51(1), 2008, pp. 279-296
[5] O. Lebedeva and L. Zaitseva, “Question Answering Systems in
Education and their Classifications”. In Joint International
Conference on Engineering Education & International Conference on
Information Technology, 2014, pp. 359-366.
[6] G. Salmon, “E-moderating: the key to teaching and learning online”.
London: Kogan-Page, 2000.
[7] R. M. Paloff & K. Pratt, “Building learning communities in
cyberspace – effective strategies for the online classroom”. San
Francisco: Jossey-Bass, 1999.
[8] A. Lalingkar, S. Srinivasa & P. Ram, “Characterization of
Technology-based Mediations for Navigated Learning”, Advanced
Computing and Communications Systems, 3(2), 2019, pp. 33-47.
[9] M. Wojatzki, T. Zesch, S. Mohammad, and S. Kiritchenko, “Agree or
disagree: Predicting judgments on nuanced assertions”. In
Proceedings of the Seventh Joint Conference on Lexical and
Computational Semantics, June 2018, pp. 214-224.
[10] N. K. Baym, “Agreements and disagreements in a computer-
mediated discussion”. Research on language and social interaction,
29(4), 1996, pp.315-345.
[11] A. Sheldon, “Conflict talk: Sociolinguistic challenges to self-assertion
and how young girls meet them”, Merrill-Palmer Quarterly, 38, 1992,
pp. 95-117.
[12] M. Mulkay, “Agreement and disagreement in conversations and letters”. Text, 5, 1986, pp. 201-227.
[13] S. Menini, and S. Tonelli. "Agreement and disagreement: Comparison
of points of view in the political domain." In Proceedings of COLING
2016, the 26th International Conference on Computational
Linguistics: Technical Papers, 2016, pp. 2461-2470.
[14] S. Rosenthal, and K. McKeown. "I couldn’t agree more: The role of
conversational structure in agreement and disagreement detection in
online discussions." In Proceedings of the 16th Annual Meeting of the
Special Interest Group on Discourse and Dialogue, 2015, pp. 168-177
[15] A. Kelsey, G. Carenini, and R. Ng. "Detecting disagreement in
conversations using pseudo-monologic rhetorical structure." In
Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP),2014, pp. 1169-1180.
[16] D. Hillard, M. Ostendorf, and E. Shriberg. "Detection of agreement
vs. disagreement in meetings: Training with unlabeled data." In
Proceedings of the 2003 Conference of the North American Chapter
of the Association for Computational Linguistics on Human
Language Technology: companion volume of the Proceedings of
HLT-NAACL 2003--short papers-Volume 2, Association for
Computational Linguistics, 2003, pp. 34-36.
[17] A. Misra and M. Walker. "Topic independent identification of
agreement and disagreement in social media dialogue." In
Proceedings of the SIGDIAL 2013 Conference, 2013, pp. 41–50.
[18] W. Wang, S. Yaman, K. Precoda, C. Richey, and G. Raymond.
"Detection of agreement and disagreement in broadcast
conversations." In Proceedings of the 49th Annual Meeting of the
Association for Computational Linguistics: Human Language
Technologies: short papers-Volume 2, Association for Computational
Linguistics, 2011, pp. 374-378.
[19] J. Ali, R. Khan, N. Ahmad, and I. Maqsood. "Random forests and
decision trees." International Journal of Computer Science Issues
(IJCSI) Vol 9, Issue 5, No. 3, 2012, pp. 272-278.
[20] J. C. Reis, A. Correia, F. Murai, A. Veloso, F. Benevenuto, and E.
Cambria. "Supervised Learning for Fake News Detection." IEEE
Intelligent Systems, 34(2), 2019, pp. 76-81.
[21] J. Ren, S. D. Lee, X. Chen, B. Kao, R. Cheng, and D. Cheung. "Naive
bayes classification of uncertain data." In 2009 Ninth IEEE
International Conference on Data Mining, 2009, pp. 944-949.
[22] N. A. Diamantidis, D. Karlis, & E. A. Giakoumakis, “Unsupervised
stratification of cross-validation for accuracy estimation”. Artificial
Intelligence, 116(1-2), 2000, pp. 1-16.
[23] J. Y. Hesterman, L. Caucci, M. A. Kupinski, H. H. Barrett, & L. R.
Furenlid, “Maximum-likelihood estimation with a contracting-grid
search algorithm”. IEEE transactions on nuclear science, 57(3), 2010,
pp. 1077-1084.