From Stances' Imbalance to Their Hierarchical Representation and Detection



Stance detection has gained increasing interest from the research community due to its importance for fake news detection. The goal of stance detection is to categorize the overall position of a subject towards an object into one of four classes: agree, disagree, discuss, and unrelated. One of the major problems faced by current machine learning models used for stance detection is the severe imbalance among these classes. As a result, most models fail to correctly classify instances that fall into the minority classes. In this paper, we address this problem by proposing a hierarchical representation of these classes, which combines the agree, disagree, and discuss classes under a new related class. Further, we propose a two-layer neural network that learns from this hierarchical representation and controls the error propagation between the two layers using the Maximum Mean Discrepancy regularizer. Compared with conventional four-way classifiers, this model has two advantages: (1) the hierarchical architecture mitigates the class imbalance problem; (2) the regularization helps the model better discern between the related and unrelated stances. Extensive experiments demonstrate that the proposed model achieves state-of-the-art accuracy for stance detection.
Qiang Zhang
University College London
London, United Kingdom
Shangsong Liang
Sun Yat-sen University
Guangzhou, China
Aldo Lipani
University College London
London, United Kingdom
Zhaochun Ren
Shandong University
Qingdao, China
Emine Yilmaz
University College London
London, United Kingdom
Online news is usually less well substantiated than news from traditional services such as magazines or newspapers. A large volume of fake news is being produced for political or economic purposes. Fake news refers to news articles that purport to be factual but contain misstatements of fact intended to arouse passions, attract viewership, or deceive. Verifying news content requires retrieving evidences and determining their stance with respect to the news claims, which poses new challenges for the conventional stance detection task. We define evidence as text, e.g. web pages and documents, that can be used to prove whether news content is true or not. Moreover, automatic stance detection has broad applications in information retrieval and text entailment [34, 42].
The task of stance detection is to identify the stance of an evidence towards a given news claim. Stances can be categorized into four classes: agree, disagree, discuss and unrelated. Two characteristics make the stance detection task peculiar. On the one hand, news claims and evidences are often unrelated, which generates a severe class imbalance problem. On the other hand, since the classes other than unrelated are by definition related, identifying an evidence as related or unrelated to a news claim is intuitively a semantically different problem from identifying it as belonging to one of the other three classes. These two characteristics suggest the natural presence of a hierarchical structure among stance classes.
Stance detection has been studied in the areas of information extraction and natural language processing. However, previous methods tackle the task as a multiclass classification problem, neglecting the hierarchical structure in stance classes. Also, the commonly used four-way classifiers are easily influenced by the class imbalance problem. In this paper, we address this issue by modeling the stance detection task with a two-layer neural network. The first layer aims at identifying the relatedness of the evidence, while the second layer aims at classifying the evidences identified as related into the other three classes: agree, disagree and discuss. Moreover, we study three levels of dependence assumptions between the two layers: (1) independent, when there is no error propagation between the two layers; (2) dependent, when the error propagation is left free; and (3) learned, when the error propagation is controlled by Maximum Mean Discrepancy (MMD). We show that, when learned, the neural network (a) better separates the distributions of related and unrelated stances and (b) outperforms the state of the art in accuracy on the stance detection task.
The remainder of the paper is organized as follows: § 2 summarizes the related work; § 3 defines the stance detection task; § 4 details the proposed hierarchical classification model and the regularization term; § 5 describes the datasets used and the experimental setup; § 6 is devoted to experimental results; and § 7 concludes the paper.
Machine learning techniques have been widely researched to tackle the stance detection task. Previous works focus on political or congressional floor debates and online forums. Most of these works rely on content-based features, such as sentiment analysis and topic-specific features learned from labeled datasets for a closed set of topics.
Two methods only consider the agree, disagree and discuss classes: Bar-Haim et al. [7] split the stance detection task into three sub-tasks and propose a Contrast Classification Algorithm to distinguish the agree and disagree classes; Augenstein et al. [4] build a neural network architecture based on bidirectional conditional encoding on a Twitter dataset. A long short-term memory (LSTM) network encodes the claim and another LSTM encodes the text with the encoded claim as initial states. These methods fail to consider the unrelated class.
Two other methods consider all the classes, but use two different models: Bourgonje et al. [10] use lemmatized n-gram matching and a rule-based procedure to decide the evidence relatedness, and a three-way logistic regression classifier to distinguish among the related classes; Wang et al. [43] first develop a gradient boosted decision tree (GBDT) model to determine the evidence relatedness, and then use another GBDT model to distinguish the stances of the text towards the claim. These methods involve feature engineering in separate models and cannot be jointly optimized to achieve the best performance.
Other methods that also consider all the classes were developed during the Fake News Challenge stage 1 (FNC-1). The winning team uses a 50%/50% weighted average between a GBDT model and a convolutional neural network (CNN). The second best performance is achieved by an ensemble of five multi-layer perceptrons (MLPs), where the input features include bag-of-words and semantic analysis features in addition to the baseline features developed by the challenge organizers. Unlike the above two solutions, the third best team does not use ensemble methods. They use TF-IDF features and an MLP as a four-way classifier. Zhang et al. [48] propose a ranking method to tackle the task and achieve empirical performance improvements. However, these methods all neglect the hierarchical structure among the four types of stances and suffer from class imbalance.
Deep learning-based methods have also been applied in the FNC-1. Bajaj utilizes LSTM, CNN and their variants to detect stances, and finds that an attention-augmented CNN obtains the best performance. Rakholia and Bhargava analyze the effectiveness of different ways of text encoding, such as independent encoding, bidirectional conditional encoding and attentive readers, and conclude that the attentive reader model is the most suitable for the task. Ma et al. [23] propose a multi-task learning algorithm that jointly detects rumours and stances. However, all these methods fail to achieve high accuracy for the agree and disagree classes.
There are three major defects in all the aforementioned methods: (a) they neglect the hierarchical relationships among the four stances; (b) they suffer from the class imbalance problem; and (c) they fail to achieve acceptable detection performance for the agree and disagree classes.
The stance detection task consists in classifying the stance of an evidence towards a claim as one of four classes: agree, disagree, discuss and unrelated. The formal definitions of these four stances are:
agree – the evidence supports the claim;
disagree – the evidence denies the claim;
discuss – the evidence does not have a position about the claim;
unrelated – the evidence is not about the claim.
In this section, we detail our proposed two-layer neural network for stance detection. § 4.1 outlines the model. In order to better differentiate between the related and unrelated classes, we design an MMD regularization term in § 4.2. This is then integrated into the two-layer neural network loss function in § 4.3. In Figure 1, we show the architecture of our model.
4.1 Two-Layer Neural Network
Let the input of the neural network be a real-valued feature vector, denoted as $v$. The four-class label can be transformed into a one-hot vector $y \in \{0, 1\}^4$. The $i$-th dimension of $y$ is 1 when the stance is the $i$-th element in the label set {agree, disagree, discuss, unrelated} and 0 otherwise. The hidden layer with parameters $\theta_u$ learns to map $v$ to a $k$-dimensional hidden representation $u$:
$$u = f_u(v; \theta_u). \tag{1}$$
For the two-layer classication, the rst layer decides whether the
evidence is related to a claim. Hence, the rst classication layer is
called the relatedness layer. This layer is parameterized by
learns to produce a 2-dimensional normalized vector ˆ
ras follows:
Note that the
function is included in
to normalize the
2-dimensional vector, so each component of the vector
the probability that the neural network assigns
to the related and
unrelated classes, i.e., p(rel ated)and p(unrel ated).
The second layer classies the evidence into the related classes,
i.e., agree, disagree, or discuss stances. Hence, the second classica-
tion layer is called the stance layer. The stance layer is parameterized
by θsand learns to produce a 3-dimensional normalized vector ˆ
r· (1,0);θs),(3)
where the vector multiplication
r· (
extracts the rst element
. Note that the
function is also included in
to nor-
malize the 3-dimensional vector, so that each component of the
denotes the conditional probability that the neural network
 
Relatedness Layer
 
    
 
Stance Layer
Input Output
Figure 1: The architecture of our proposed two-layer neural network.
to agree, disagree and discuss given that
is related, i.e.,
p(aдree |related),p(disaдree |r elat ed), and p(discuss|rel ated ).
We dene the classication loss by the Kullback-Leibler (KL) di-
vergence [
], which measures the dierence between the network
outputs and labels:
is the ground-truth relatedness of the input data.
is com-
puted from a label yas follows:
r=( (y,e4),(y=e4)),(5)
where is the indicator function,
is a 4-dimensional one-hot
vector with fourth element equal to 1. When
is veried, it
indicates that the label belongs to the unrelated class. Similarly, the
stance classication loss can be dened as:
is the ground-truth stance of the input data.
is computed
from a label yas follows:
s=( (y=e1),(y=e2),(y=e3)),(7)
are 4-dimensional one-hot vectors with rst, second,
and third elements equal to 1. When
is veried, it indicates
that the label belongs to the agree class, when
is veried,
it indicates that the label belongs to the disagree class, and when
is veried, it indicates that the label belongs to the discuss
Finally, we now dene the loss function for the two-layer neural
network as the linear combination between the loss function of the
relatedness layer (lr) and the loss function of the stance layer (ls):
where αleverages the importance of the two classication layers.
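To make the label construction and the combined loss concrete, the following is a minimal Python sketch. It assumes that the KL divergence is taken from the ground-truth vector to the network output and that unrelated instances contribute no stance loss; the function names are hypothetical and are not taken from the released software.

```python
import math

LABELS = ("agree", "disagree", "discuss", "unrelated")

def relatedness_target(label):
    """Ground-truth relatedness r = (related, unrelated) for a stance label."""
    is_unrelated = label == "unrelated"
    return (0.0 if is_unrelated else 1.0, 1.0 if is_unrelated else 0.0)

def stance_target(label):
    """Ground-truth stance s over (agree, disagree, discuss); all zeros if unrelated."""
    return tuple(1.0 if label == name else 0.0 for name in LABELS[:3])

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted); entries with zero target mass contribute nothing."""
    return sum(t * math.log(t / max(p, eps))
               for t, p in zip(target, predicted) if t > 0)

def classification_loss(label, r_hat, s_hat, alpha):
    """Combined loss l_r + alpha * l_s for one instance (assumed form)."""
    l_r = kl_divergence(relatedness_target(label), r_hat)
    # Assumption: the stance loss is only counted for related instances.
    l_s = kl_divergence(stance_target(label), s_hat) if label != "unrelated" else 0.0
    return l_r + alpha * l_s
```

With one-hot targets, each KL term reduces to the negative log-probability of the correct class, so an unrelated instance is penalized only through the relatedness output.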
4.2 Maximum Mean Discrepancy
The classication of related/unrelated stances is a dierent task
from that of agree/disagree/discuss stances. Therefore, data repre-
sentations from the relatedness layer and the stance layer can be
seen as samples drawn from two dierent distributions. In order
to measure distribution discrepancy between these two layers, we
employ the Maximum Mean Discrepancy (MMD) [
] as a regular-
ization term. The MMD does not involve density estimation and
thus is a non-parametric way of measuring the dierence between
distributions. MMD has achieved success in face recognition and
image annotation [15].
MMD is dened as follows:
Denition 4.1. Maximum Mean Discrepancy [
]: “Let
two Borel probability distributions over a space
and let
be sets with independent identically distributed samples drawn
. The MMD is dened by a class
of map functions
X → H as:
(Ep[ψ(x)] − Eq[ψ(z)]).(9)
Here, xand zare samples from Xand Z.
In other words, the MMD equation denes the largest possible
distance between two expectations over the set of function
. More-
over, “when
is the reproducing kernel Hilbert space (RKHS) [
this means that for all
x∈ X
, the linear point evaluation function
exists and is continuous. When
is the unit
ball in a universal RKHS, it is guaranteed that
detect any discrepancy between pand q[9, 35].
denote the distribution for the rst layer samples (un-
related hidden representations) in our model, with sample set
1, . . . , u1
and according to Eq.
their generating set
1, . . . , v1
. And,
denotes the distribution for the second
layer samples (agree, disagree and discuss hidden representations),
with sample set
1, . . . , u2
and according to Eq.
generating set
1, . . . , v2
are the number of
samples in
. Thus we have
, where
is a
matrix in the projection layer.
jare the space dimensions. According to Eq. (1), the hidden repre-
is parameterized by
, thus the empirical expression
of MMD is parameterized by θuand θd:
i;θu) − 1
By constantly changing the projection layer parameterized by
we nd the maximum expectation dierence between the represen-
tations of the two classication layers.
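As a rough illustration of an empirical MMD between the unrelated and related hidden representations, the sketch below projects each representation with a matrix and measures the distance between the two projected means. The linear projection and the Euclidean norm are assumptions made for this example, and the function names are hypothetical.

```python
import math

def project(theta_d, u):
    """Linear projection of a k-dimensional vector u by a j x k matrix theta_d."""
    return [sum(w * x for w, x in zip(row, u)) for row in theta_d]

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def empirical_mmd(theta_d, unrelated_reps, related_reps):
    """Euclidean distance between the projected means of the two sample sets."""
    mean_p = mean_vector([project(theta_d, u) for u in unrelated_reps])
    mean_q = mean_vector([project(theta_d, u) for u in related_reps])
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(mean_p, mean_q)))
```

For example, with an identity projection, unrelated samples averaging to (1, 0) and related samples averaging to (1, 4) yield a discrepancy of 4, while two identical sample sets yield 0.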
4.3 Optimization
The more dierent two distributions are, the larger the MMD is.
Hence, in order to make the distributions easier to be distinguished,
a larger MMD regularization term is preferred, and we treat the
regularization term as an extra goal besides classication. We in-
tegrate the two-layer classication loss (see Eq.
) and the MMD
regularization term (see Eq.
) into a single objective function
). Specically, we add these two sub-goals with a hyperparameter
βas follows:
L(θu,θr,θs,θd)=lc(θu,θr,θs) − β·d(θu,θd),(12)
leverages the importance of the regularization. The larger
the MMD regularization term is, the easier is for the classier to
distinguish between the related and unrelated stances. Thus, the
sign of the regularization term is negative.
The optimization involves the minimization of the objective $L$ with respect to $\theta_u$, $\theta_r$, $\theta_s$, and $\theta_d$ as follows:
$$(\hat{\theta}_u, \hat{\theta}_r, \hat{\theta}_s, \hat{\theta}_d) = \arg\min_{\theta_u, \theta_r, \theta_s, \theta_d} L(\theta_u, \theta_r, \theta_s, \theta_d). \tag{13}$$
Optimizing the model consists of two sub-goals. On the one hand, we want to maximize the distribution discrepancy between the two classification layers. On the other hand, we want to minimize the classification loss of both layers. Both sub-goals involve updating the feature layer parameter $\theta_u$, but in opposite update directions. The optimization process does not stop until a saddle point is reached, i.e., a point at which the feature layer parameters serve both sub-goals well. Algorithm 1 shows the parameter update process, which is based on the mini-batch gradient descent algorithm.
4.4 Prediction
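Given as input a feature vector $v$, the classifier outputs the following probabilities: $p(\text{unrelated})$, $p(\text{agree} \mid \text{related})$, $p(\text{disagree} \mid \text{related})$, and $p(\text{discuss} \mid \text{related})$. However, the last three probabilities are not comparable with the first one. To make them comparable, we derive the unconditional probabilities $p(\text{agree})$, $p(\text{disagree})$, and $p(\text{discuss})$.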
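Algorithm 1: Parameter update process based on the mini-batch gradient descent algorithm.
input: Sample mini-batch {v_i, r_i, s_i}, i = 1, ..., n; mini-batch size n; hyperparameters α, β, and µ
output: θ_u, θ_r, θ_s, θ_d
Initialize θ_u, θ_r, θ_s, θ_d;
repeat
    /* forward propagation */
    l_r ← 0; l_s ← 0;
    for i from 1 to n do
        u_i ← f_u(v_i; θ_u);
        r̂_i ← f_r(u_i; θ_r);
        l_r^i ← KL(r_i ‖ r̂_i);
        l_r ← l_r + l_r^i;
        if r_i · (1, 0) = 1 then
            /* classify related */
            ŝ_i ← f_s(u_i, r̂_i · (1, 0); θ_s);
            l_s^i ← KL(s_i ‖ ŝ_i);
        else
            /* unrelated */
            l_s^i ← 0;
        l_s ← l_s + l_s^i;
    d = MMD({u_i, r_i}, i = 1, ..., n);
    /* backward propagation */
    θ_s ← θ_s − µ · α · ∂l_s/∂θ_s;
    θ_r ← θ_r − µ · (∂l_r/∂θ_r + α · ∂l_s/∂θ_r);
    θ_d ← θ_d + µ · β · ∂d/∂θ_d;
    θ_u ← θ_u − µ · (∂l_r/∂θ_u + α · ∂l_s/∂θ_u − β · ∂d/∂θ_u);
until θ_u, θ_r, θ_s, θ_d converge;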
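By observing that the class agree is assumed to be related, and thus $p(\text{agree}, \text{related}) = p(\text{agree})$, we derive that:
$$\begin{aligned} p(\text{agree}) &= p(\text{agree}, \text{related}) \\ &= p(\text{agree} \mid \text{related}) \times p(\text{related}) \\ &= p(\text{agree} \mid \text{related}) \times (1 - p(\text{unrelated})). \end{aligned} \tag{14}$$
Similarly, for the other two classes we derive that:
$$\begin{aligned} p(\text{disagree}) &= p(\text{disagree} \mid \text{related}) \times (1 - p(\text{unrelated})), \\ p(\text{discuss}) &= p(\text{discuss} \mid \text{related}) \times (1 - p(\text{unrelated})). \end{aligned} \tag{15}$$
Thereby, the actual output of the model is
$$\hat{y} = (p(\text{agree}), p(\text{disagree}), p(\text{discuss}), p(\text{unrelated})), \tag{16}$$
where the class with the highest probability corresponds to the predicted stance.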
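The prediction rule can be sketched in a few lines of Python. This is only an illustration of the derivation: the function names are hypothetical, and the inputs are assumed to be the already normalized outputs of the two layers.

```python
LABELS = ("agree", "disagree", "discuss", "unrelated")

def combine_probabilities(p_unrelated, stance_given_related):
    """Turn p(unrelated) and p(stance | related) into four comparable probabilities."""
    p_related = 1.0 - p_unrelated
    y_hat = [p * p_related for p in stance_given_related]
    y_hat.append(p_unrelated)
    return y_hat

def predict(p_unrelated, stance_given_related):
    """Return the most probable stance together with the full output vector."""
    y_hat = combine_probabilities(p_unrelated, stance_given_related)
    best = max(range(len(LABELS)), key=lambda i: y_hat[i])
    return LABELS[best], y_hat
```

For instance, with p(unrelated) = 0.4 and conditional stance probabilities (0.7, 0.2, 0.1), the agree class receives 0.7 × 0.6 = 0.42 and is predicted, even though unrelated has the largest raw output of the first layer's two classes taken in isolation from the stance layer.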
We start this section by presenting the datasets and evaluation measures relevant to the stance detection task. Then, we describe the features used by our model and the model parameterization. Finally, we present the baselines. The software used to run the experiments of this paper is available on the website of the first author.
Experiments are conducted on two publicly available datasets: the Emergent dataset and the FNC-1 dataset. In these two datasets, a claim is a news article headline and an evidence is the content of a news article. These datasets are split into train and test subsets; see Table 1 for statistics about the splits.
The FNC-1 dataset consists of 75,385 instances. Each instance in the dataset is a claim-evidence pair labeled as one of the four stances: agree, disagree, discuss and unrelated. The ratio of training data to testing data in the FNC-1 dataset is approximately 2:1. Every class accounts for a similar percentage in the train and test subsets. The unrelated stances are the majority (over 70%) in both subsets, while the disagree stances account for less than 3%. The agree and discuss stances account for less than 10% and 20%, respectively.
The Emergent dataset is similar to the FNC-1 dataset; however, it contains only agree, disagree and discuss stances. Hence, it needs to be augmented with unrelated stances. Following the way the unrelated stances of the FNC-1 dataset were labeled, we manually labeled unrelated stances by pairing a claim with an unrelated evidence, i.e., an evidence originally paired with another claim. Moreover, to make the class distributions less imbalanced, we make the ratio of related to unrelated stances approximately 1:1. The augmented Emergent dataset contains 4,071 training labels and 1,024 testing labels, a ratio of approximately 4:1. Class distributions between train and test subsets are similar.
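The augmentation idea can be illustrated with a short Python sketch. It is not the manual procedure used for the paper: it assumes that random pairing stands in for manual labeling, that each evidence belongs to exactly one claim, and that one unrelated pair is created per related pair; the function name is hypothetical.

```python
import random

def add_unrelated_pairs(related_pairs, seed=0):
    """Append one unrelated (claim, evidence) pair per related pair.

    related_pairs: list of (claim_id, evidence_id, stance) tuples.
    An unrelated pair joins a claim with evidence that belongs to another claim.
    """
    rng = random.Random(seed)
    # Assumption: every evidence id is attached to exactly one claim id.
    evidence_owner = {evidence: claim for claim, evidence, _ in related_pairs}
    claims = sorted({claim for claim, _, _ in related_pairs})
    if len(claims) < 2:
        raise ValueError("at least two distinct claims are needed")
    evidences = sorted(evidence_owner)
    unrelated = []
    # Duplicates are allowed in this simplified version.
    while len(unrelated) < len(related_pairs):
        claim = rng.choice(claims)
        evidence = rng.choice(evidences)
        if evidence_owner[evidence] != claim:
            unrelated.append((claim, evidence, "unrelated"))
    return list(related_pairs) + unrelated
```

Generating as many unrelated pairs as related ones gives the roughly 1:1 ratio between related and unrelated stances described above.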
Compared to the FNC-1 dataset, the class distribution of the augmented Emergent dataset is more balanced. The percentage of unrelated stances is about 50%, whereas the percentages of agree and disagree stances are about 24% and 8%, respectively. Both datasets have similar percentages of discuss stances.
5.2 Evaluation Measure
In line with the FNC-1 challenge, the evaluation is based on a weighted two-level scoring system built on the accuracy measure. This evaluation measure, called relative score, evaluates a model by splitting the stance detection task into two sub-tasks: the related/unrelated classification sub-task and the agree/disagree/discuss classification sub-task. The former sub-task is given a 25% weight because it is considered easier than the latter, which is given a 75% weight.
We report the following evaluation measures: relative score, accuracy, and accuracy on a per class basis.
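A minimal Python sketch of such a two-level score is given below. It assumes the scoring follows the FNC-1 convention: 0.25 points when the related/unrelated decision is correct, a further 0.75 points when a related instance also receives the correct stance, and normalization by the score of a perfect prediction. The function names are hypothetical.

```python
RELATED = frozenset({"agree", "disagree", "discuss"})

def instance_score(gold, predicted):
    """Score one prediction with the 25% / 75% two-level weighting."""
    score = 0.0
    # Level 1: related vs. unrelated decision.
    if (gold in RELATED) == (predicted in RELATED):
        score += 0.25
    # Level 2: exact stance, only meaningful for related instances.
    if gold in RELATED and gold == predicted:
        score += 0.75
    return score

def relative_score(gold_labels, predicted_labels):
    """Achieved score as a percentage of the best achievable score."""
    achieved = sum(instance_score(g, p) for g, p in zip(gold_labels, predicted_labels))
    best = sum(instance_score(g, g) for g in gold_labels)
    return 100.0 * achieved / best
```

Under this weighting a correctly detected unrelated instance is worth 0.25 points and a fully correct related instance 1.0 point, so errors on the minority related classes weigh more heavily than their frequency alone would suggest.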
5.3 Feature Extraction
To represent claims and evidences we use a bag-of-words approach. For each claim and evidence we generate a TF-IDF vector, and for each claim-evidence pair we compute their cosine similarity. We also include the FNC-1 official features in the input feature vector. The final set of features includes:
TF-IDF vectors of claims;
TF-IDF vectors of evidences;
Cosine similarity (CosSim) between the claim vector and the evidence vector;
Ratio of word overlap (WordLap) between the claim and the evidence;
An indicator of whether a claim has refuting words (RefWord);
The polarity (Pol) of the claim and the evidence;
The number of overlapping n-grams (NGrams) for n ∈ {4, 5, 6} between the claim and the evidence.
For the TF-IDF vectors, we only use the 2,000 most frequent terms, excluding stop-words. All of these features are concatenated to form the input feature vector v.
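The TF-IDF and cosine similarity features can be sketched with the Python standard library as follows. The tiny stop-word list, the whitespace tokenizer and the smoothed IDF formula are assumptions made for this example, not the exact settings of the experiments, and the function names are hypothetical.

```python
import math
from collections import Counter

# Illustrative stop-word list; a real list would be much longer.
STOP_WORDS = frozenset({"the", "a", "an", "of", "to", "in", "is", "and"})

def tokenize(text):
    """Lower-case whitespace tokenization with stop-word removal."""
    return [tok for tok in text.lower().split() if tok not in STOP_WORDS]

def build_vocabulary(documents, max_terms=2000):
    """Keep the max_terms most frequent non-stop-word terms."""
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return [term for term, _ in counts.most_common(max_terms)]

def idf_weights(documents, vocabulary):
    """Smoothed inverse document frequency (an assumed variant)."""
    n = len(documents)
    doc_sets = [set(tokenize(doc)) for doc in documents]
    return {term: math.log((1 + n) / (1 + sum(term in s for s in doc_sets))) + 1.0
            for term in vocabulary}

def tfidf_vector(text, vocabulary, idf):
    """Term frequency times IDF for each vocabulary term."""
    tf = Counter(tokenize(text))
    return [tf[term] * idf[term] for term in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def pair_features(claim, evidence, vocabulary, idf):
    """Concatenate claim vector, evidence vector and their cosine similarity."""
    c = tfidf_vector(claim, vocabulary, idf)
    e = tfidf_vector(evidence, vocabulary, idf)
    return c + e + [cosine_similarity(c, e)]
```

The remaining hand-crafted features (word overlap, refuting words, polarity and n-gram overlap) would be appended to the same vector in an analogous way.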
5.4 Experimental Setting
The following hyperparameters have been set via five-fold cross-validation on the train subsets:
The dimension k of hidden representations is set to 100;
The dimension j of the MMD is set to 10;
The activation function used in the hidden layers is set to
The parameter α is set to 1.3 and 1.5 for the Emergent and FNC-1 datasets, respectively;
The parameter β is set to 0.001.
We include an L2 regularization term for the MLP weight parameters in the final loss function to mitigate overfitting. Dropout is also used to mitigate overfitting, with the rate set to 0.6. We train in mini-batches of size 64 over the entire train subset. Note that the gradient steps in Algorithm 1 can easily be replaced by a more powerful optimizer such as the Adam optimizer. Early stopping is applied when the classification loss on the validation subset does not decrease for three consecutive iterations. The whole model is implemented with TensorFlow.
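The early stopping rule can be illustrated with the following Python sketch. It assumes that "does not decrease" is judged against the best validation loss seen so far, which is one common interpretation; the function name is hypothetical.

```python
def early_stopping_iteration(validation_losses, patience=3):
    """Return the index at which training would stop, or None if it never stops.

    Training stops once the validation loss has failed to improve on the best
    value seen so far for `patience` consecutive iterations.
    """
    best = float("inf")
    stale = 0
    for iteration, loss in enumerate(validation_losses):
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return iteration
    return None
```

A patience of three matches the setting described above; the parameters from the iteration with the best validation loss would normally be the ones kept.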
5.5 Baselines
We compare our model against the methods mentioned in Section 2, which are detailed in the following. Among them we distinguish between methods that use the same features as ours and methods that learn their representations. We start with the latter type, which we call representation learning-based baselines:
Bidirectional LSTM (BiLSTM). Augenstein et al. [4] build a neural network architecture based on bidirectional LSTM on a Twitter dataset. One LSTM encodes the claim, and another LSTM encodes the evidence with the encoded claim set as initial states. The 100-d GloVe word embedding is used as input [30];
Attentive CNN (AtCNN). Bajaj builds an attention-augmented CNN. The claim and the evidence are input to a convolutional neural network to obtain hidden representations, and the attention mechanism is employed to locate the words or phrases most influential on the final results;
Table 1: Statistics of the datasets.
Subset    Stance      Emergent               FNC-1
                      Number   Percentage    Number   Percentage
Training  agree          992        24.37     3,678         7.36
          disagree       303         7.44       840         1.68
          discuss        776        19.06     8,909        17.83
          unrelated    2,000        49.13    36,545        73.13
          total        4,071                 49,972
Testing   agree          246        24.02     1,903         7.49
          disagree        91         8.89       697         2.74
          discuss        187        18.26     4,464        17.57
          unrelated      500        48.83    18,349        72.20
          total        1,024                 25,413
Memory Network (MN). Mohtarami et al. [26] develop an end-to-end memory network for stance detection. The network operates at the paragraph level and integrates convolutional and recurrent neural networks, as well as a similarity matrix, as part of the overall architecture;
Ranking Model (RM). Zhang et al. [48] build a ranking method to tackle stance detection and achieve empirical performance improvements. A ranking loss function is proposed to replace Softmax and maximize the representation difference between the four classes of stance.
We now review the second type of baselines, i.e., those methods that use the same features as our method. We call these feature engineering-based baselines:
Official Baseline (OB). This is the FNC-1 official baseline, which uses one gradient boosting decision trees model for four-way classification;
Logistic Regression (LR). Bourgonje et al. [10] use lemmatized n-gram matching and a rule-based procedure to decide relatedness, and three-way logistic regression to distinguish among the related classes;
Gradient Boosted Decision Trees (GBDT). Wang et al. [43] develop two GBDT models, one to determine the relatedness of an evidence to a claim, and another to distinguish among the related classes;
Multi-Layer Perceptron (MLP). This model achieved the third best performance in FNC-1. It extracts TF-IDF vectors and the cosine similarity between claims and evidences as input features, and uses an MLP as the four-class classifier.
In this section, we start by analyzing the dependency assumption.
Then, we compare and contrast our model against the baselines.
Next, we provide a sensitivity analysis of the hyperparameters. We
conclude with an impact analysis of the features used by the model.
6.1 Dependency Assumption
In Figure 2 we show the effect of the three dependency assumptions by visualizing the learned representations using a t-SNE projection. We observe that when the classifiers are assumed independent, i.e., the classification is performed in cascade and no error is propagated from the second layer to the first during training, the learned representation separates the unrelated class well from the related ones. When the classifiers are assumed dependent, i.e., the two classifiers are trained together and the error is left free to propagate from the second layer to the first, the learned representation is not very well separated. However, when the dependence assumption of the two classifiers is learned via the MMD regularization, i.e., the two classifiers are trained together with the error propagation controlled by the regularizer, the learned representation is again well separated, as in the first case. Well-separated representations suggest a greater discriminative power of the model: the unrelated and related classes are almost linearly separable.
The last three rows of Tables 2 and 3 show the performance of our model on the two test subsets for each of the three assumptions: independent, dependent, and learned. Looking at the accuracy of the unrelated class, we observe that the accuracy is greater when the learned representations are well separated, as in the independent and learned cases. Furthermore, looking at all the other scores, we observe that the learned assumption outperforms both the independent and dependent assumptions in all other cases, demonstrating that jointly learning the relatedness and the stance of the evidences towards claims is beneficial to the stance detection task.
6.2 Overall Performance
In Tables 2 and 3 we compare our model against the state-of-the-art
models. Our model achieves the best stance detection performance
for the relative score on both datasets. The model achieves 89.30%
on the augmented Emergent test subset and 88.15% on the FNC-1
test subset.
By comparing with four-way classification baselines (OB, MLP, BiLSTM, AtCNN, MN and RM) we demonstrate the advantage of separating the relatedness detection from the stance detection. We observe that these classifiers perform poorly on the disagree class, which is caused by the large percentage difference between the minority disagree class and the majority unrelated class. Further, the more imbalanced the evaluation dataset becomes, the worse the four-way classifiers perform on the minority disagree class.
Figure 2: t-SNE visualization of the hidden representations on the training data. The hidden representations come from models trained (a) with separated layers (Independent), (b) together but without regularization (Dependent), and (c) together with MMD regularization (Learned).
Table 2: Performance comparison of our model against the state-of-the-art models on the augmented Emergent dataset.
Model                         Accuracy (%)                              Relative Score (%)
                              agree   disagree   discuss   unrelated
Feature Engineering-Based Baselines
OB                            33.56      23.44     70.23       84.00    74.86
LR (Bourgonje et al.)         66.73      40.51     78.33       78.00    83.45
GBDT (Wang et al.)            80.62      50.42     83.52       88.00    87.53
MLP (Riedel et al.)           58.53      23.64     79.05       95.00    85.43
RM (Zhang et al.)             64.56      40.42     85.45       96.00    87.69
Representation Learning-Based Baselines
BiLSTM (Augenstein et al.)    43.21      12.57     78.55       96.00    81.37
AtCNN (Bajaj)                 44.78      14.60     72.44       97.00    83.56
MN (Mohtarami et al.)         54.64      40.05     72.10       89.00    85.92
Our Models
Independent                   74.54      45.32     82.59       95.49    86.33
Dependent                     63.54      44.68     68.35       95.00    86.72
Learned                       82.52      69.05     84.30       97.00    89.30
By comparing with baselines that separate the relatedness detection from the stance detection (LR and GBDT) we demonstrate the superiority of a single end-to-end model. LR and GBDT are generally better on the disagree class than the four-way classifiers, although their overall performance is worse than that of our model.
In Figure 3 we show the confusion matrices of our model. Here we observe the detection performance on a per class basis. For the related/unrelated classification, we correctly classify 97.00% and 99.53% of the unrelated instances on the augmented Emergent and the FNC-1 test subsets, respectively. We can see that there is some misclassification between the agree and unrelated classes, and between the discuss and unrelated classes. The misclassification of the disagree class accounts for the largest error of the unrelated instances.
Our model achieves an accuracy of 69.05% and 72.35% for the disagree class on the Emergent and the FNC-1 test subsets, respectively. The classification accuracy is largely improved compared to the state of the art. Some misclassification error exists between agree and disagree. However, our model can distinguish between the discuss and the disagree classes with few errors. Although the number of discuss instances is the largest and the number of disagree instances is the smallest among the related classes, our model does not mistake disagree instances for discuss ones, i.e., the model has learned the core representation difference between these two classes. Due to ambiguous expressions, misclassification between agree and discuss is the cause of most errors between these classes, which leads to a slightly worse accuracy for the discuss class on the Emergent (84.30%) and FNC-1 (77.49%) test subsets.
Table 3: Performance comparison of our model against the state-of-the-art models on the FNC-1 dataset.
Model                         Accuracy (%)                              Relative Score (%)
                              agree   disagree   discuss   unrelated
Feature Engineering-Based Baselines
OB                            10.51       1.00     79.66       97.98    75.20
LR (Bourgonje et al.)         67.42      31.61     75.23       95.36    80.63
GBDT (Wang et al.)            82.93      69.82     33.52       95.42    86.72
MLP (Riedel et al.)           44.04       6.60     81.38       97.90    81.72
RM (Zhang et al.)             64.90      27.26     84.41       99.12    86.66
Representation Learning-Based Baselines
BiLSTM (Augenstein et al.)    35.96       0.94     80.33       98.54    78.70
AtCNN (Bajaj)                 38.67       8.24     70.63       91.25    75.77
MN (Mohtarami et al.)         16.92      60.22     81.27       95.50    79.92
Our Models
Independent                   72.41      37.90     68.23       97.43    83.47
Dependent                     61.34      42.93     59.38       99.05    85.32
Learned                       80.61      72.35     77.49       99.53    88.15
Figure 3: The confusion matrices of our model for the augmented Emergent (on the left) and FNC-1 (on the right) datasets.
Two reasons account for the improved empirical performance of our model. The first is the mitigation of the class imbalance problem. Contrary to the four-way classifiers, which directly compare the disagree and unrelated instances, the hierarchical model avoids the direct comparison of the minority disagree class (which is less than 2% in the FNC-1 dataset) with the majority unrelated one (which is more than 70% in the FNC-1 dataset). The second is the MMD term, which maximizes the discrepancy between the unrelated class and the aggregated related classes. Since the agree, disagree and discuss classes belong to the same class, the related class, the MMD regularization promotes the emergence of features that are useful to separate the class pairs: agree from unrelated, disagree from unrelated, and discuss from unrelated.
6.3 Hyperparameters Sensitivity
In this subsection we discuss the sensitivity of our model to its hyperparameters. The most influential hyperparameters for the proposed model are α and β. The former controls the relative importance of the classification layers. The latter weights the regularization.
In Figures 4(a) and 4(b) we show how the performance of the model changes when varying α and β for the augmented Emergent and FNC-1 test subsets. α is searched between 0.1 and 3 in steps of 0.1, and β is searched over a set of candidate values. For α, we observe that the performance of the model improves quickly as α increases and peaks at 1.5 and 1.3 for the FNC-1 and augmented Emergent datasets, respectively; the performance then decreases slightly as α is increased further. We hypothesize that the optimal α is related to the class balance between the unrelated class and the related ones: the more unbalanced the dataset is towards the unrelated class, the larger the optimal α. For β, we observe that the performance is the highest when β is set to 0.001. This happens for both the augmented Emergent and FNC-1 test subsets. These optimal values of α and β observed on the test subsets are equal to the ones found when training the model.
Figure 4: Sensitivity of the trained model when varying the parameters α and β on the test subsets of (a) the augmented Emergent dataset (on the left) and (b) the FNC-1 dataset (on the right).
Table 4: Performance of our model with different feature sets on the FNC-1 dataset. “/” denotes that no feature set is removed.
Removed Feature Set    Accuracy (%)
                       agree   disagree   discuss   unrelated
CosSim                 71.53      85.08     78.76       69.37
WordLap                67.43      80.49     77.31       77.89
RefWord                74.43      64.37     77.03       97.49
Pol                    60.49      67.93     80.92       98.79
NGrams                 74.27      75.73     87.82       84.52
/                      80.61      72.35     77.49       99.53
6.4 Feature Analysis
In this subsection we evaluate and discuss the importance of each feature towards the final prediction. To examine the influence of each feature on the final performance, we follow a leave-one-feature-set-out approach and record the classification accuracy on the stance detection task. The following analysis is only based on the FNC-1 dataset. Similar results are observed on the augmented Emergent dataset.
In Table 4 we show the results of this analysis. We observe that removing the CosSim feature leads to a large decrease in accuracy for the unrelated class. Similarly, the use of WordLap has a positive effect for the agree class, and it also contributes to the unrelated class. The RefWord and Pol features help for the agree and disagree classes, while removing the NGrams feature leads to an increase in accuracy on the discuss class, i.e., the NGrams feature causes confusion between the discuss class and the other classes.
In this paper, we studied the problem of stance detection: the classification of the stance of an evidence towards a claim into one of four classes: agree, disagree, discuss and unrelated.
We proposed a hierarchical representation of the stance classes, where the classes agree, disagree and discuss are combined into a class referred to as the related class. The main idea here is to divide a concept into sub-concepts that are organized in a hierarchical structure, and to design constraints between sub-concepts in order to make the model parameter optimization more sensible. The primary advantage of this hierarchical representation is that it helps to overcome the class imbalance problem.
This hierarchical representation has inspired the proposed two-layer neural network to tackle the stance detection task. The first layer performs a related/unrelated classification, while the second layer performs a more fine-grained classification among the related classes. Furthermore, we have empirically demonstrated that (1) it is advantageous to learn these two classification tasks together, and (2) the dependency between these two layers can be learned through an MMD regularization term, which measures the representation discrepancy between the two layers. Experiments on two publicly available datasets have shown that our model is able to outperform the state-of-the-art stance detection methods.
As future work, we plan to enrich the proposed model in two ways. First, by integrating a credibility evaluation of information sources as features. Second, by improving the explainability of the model, showing via attention mechanisms which words or phrases are the most influential in predicting the stance.
This project was funded by the EPSRC Fellowship titled "Task Based
Information Retrieval", grant reference number EP/P024289/1. We
acknowledge the support of NVIDIA Corporation with the donation
of the Titan Xp GPU used for this research.
