Journal of Theoretical and Applied Information Technology, 30th June 2019, Vol. 97, No. 12. ISSN: 1992-8645, E-ISSN: 1817-3195. www.jatit.org. © 2005 – ongoing JATIT & LLS.
EFFECTIVE SEMANTIC TEXT SIMILARITY METRIC USING
NORMALIZED ROOT MEAN SCALED SQUARE ERROR
1ISSA ATOUM, 2MARUTHI ROHIT AYYAGARI
1Department of Software Engineering, The World Islamic Sciences and Education, Jordan
2College of Business, University of Dallas, Texas, USA
E-mail: 1issa.atoum@wise.edu.jo, 2rayyagari@udallas.edu
ABSTRACT
The Pearson correlation is a performance measure that indicates the extent to which two variables are linearly
related. When Pearson is applied to the semantic similarity domain, it shows the degree of correlation
between scores of dataset test-pairs, the human and the observed similarity scores. However, the Pearson
correlation is sensitive to outliers of benchmark datasets. Although many works have tackled the outlier
problem, little research has focused on the internal distribution of the benchmark dataset’s bins. A
representative and well-distributed text benchmark dataset embodies a wide range of similarity score values;
therefore, the benchmark dataset could be considered a cross-sectional dataset. Although a perfect text
similarity method could report a high Pearson correlation, the standard Pearson correlation is unaware of
correlated individual text pairs in a single dataset’s cross-section due to outliers. Therefore, this paper
proposes the normalized mean scaled square error method, inferred from the standard scaled error to
eliminate the outliers. The newly proposed metric was applied to five benchmark datasets. Results showed
that the metric is interpretable, robust to outliers, and competitive to other related metrics.
Keywords: Pearson, Absolute Error, Text Similarity, Correlation, Scaled Square Error, Outliers
1. INTRODUCTION
Under heavy noise conditions, extracting
the correlation coefficient between two sets of
stochastic variables is nontrivial [1]. The
performance of a Text Similarity (TS) method is
most often calculated by the Pearson correlation
between the human-mean scores (the first variable, or the
reference) and the method's observed scores (the second
variable). Formally, the performance of a text
similarity method is calculated as the covariance of
the two variables divided by the product of their
standard deviations, which yields a value between -1
and 1. When this value is high, it implies a high
correlation with the human scores; therefore, the
similarity method becomes favorable over another
method for a specific task.
Although the Pearson correlation is
theoretically well established and used in many domains,
the Pearson correlation taken in isolation may
incidentally indicate invalid causation. It was shown
that correlation might (humorously) indicate that
babies are delivered by storks [2]. Similarly, and
using the same correlation, it was reported that the
consumption of cocoa flavanols results in an acute
improvement in visual and cognitive functions [3].
Therefore, the simplicity of a correlation could hide
the considerable complexity of interpreting its
meaning [4]. Moreover, the Pearson correlation
captures only linear relationships, which limits its
ability to predict correlation in domains whose data are not
normally distributed. For example, it was shown that
the Pearson correlation is not a good predictor of the
reliability of characteristics of interest [5]. Despite
the ever-increasing interest in other alternatives [6]–
[10], the Pearson correlation is still dominant in
text similarity domains such as those related to the
SemEval task workshop series [11], [12].
In spite of the simplicity and
interpretability of the Pearson correlation in the text
similarity domain [13]–[15], the cosine similarity,
among others, is getting attention from scholars,
especially in word embedding applications [16],
[17]. It has been pointed out that the Pearson correlation does
not provide sufficiently justifiable results in the software
engineering domain [18]. Therefore, the Pearson
correlation should be adapted or modified to handle
software engineering issues related to software
requirements engineering and testing [19]–[21].
One major problem of the Pearson correlation
is outliers. Outliers have a pronounced influence on
the slope of the regression line, and consequently on
the value of the correlation coefficient. The problem
is known in the literature as Anscombe's quartet
[22], as shown in Figure 1. Anscombe's
quartet comprises four datasets that have nearly
identical Pearson correlations (0.816), yet they
appear very different when graphed. Therefore,
dataset distributions should be analyzed to handle
outliers.
When a benchmark dataset is designed, it
usually covers pairs of text in at
least three bins that vary in similarity
from low (L), through medium (M), to high (H).
An appropriate similarity method should work well
in all cases of dataset scores: L, M, and H. The
inherent problem of the standard Pearson correlation
lies in the way it is calculated. The standard Pearson
correlation does not take the
cross-sectional property of the dataset into consideration; instead, it
considers all values, including outliers. Therefore, a
high Pearson correlation does not guarantee the
suitability of the similarity method for its application.
Based on the assumption that a useful
benchmark dataset is cross-sectional, we claim that
there are at least four different similarity methods:
the low-similarity method (α), the medium-similarity
method (β), the high-similarity method (Ω), and the
optimal similarity method (δ). The α method is fair
when the dataset (or the cross-section) has low
human scores, the β method is fair when it has
medium human scores, and the Ω method is fair when it has
high human scores. In contrast, the optimal method (δ) should work
in all cases of the dataset.
Figure 2 illustrates the problem with the four
types of similarity methods using our crafted demo
dataset. The demo dataset reports a 0.7 correlation for the
α, β, and Ω methods and 1.0 for the optimum method
(δ). On the one hand, an α method (Figure 2a) has a
high correlation with text pairs that have low
similarity as per the human means (pairs 1-3). On the
other hand, an Ω method (Figure 2c) has a high
correlation with text pairs that have high similarity
as per the human means (pairs 7-9). In between, the β
method (Figure 2b) has a high correlation with text
pairs that have medium similarity as per the human
means (pairs 4-6).
Figure 2a is an example of a similarity
method that works very well on text pairs that have
low similarity, Figure 2b is an example of a
similarity method that works very well on text pairs
that have medium similarity, and Figure 2c is an
example of a similarity method that works very
well on text pairs that are nearly identical. The
objective is to find a similarity measure that
works very well on all benchmark scales; a useful
method should therefore reduce the errors between
actual and observed scores. For a task that
needs to discover similar text, such as plagiarism detection,
the Ω method is favorable, and for tasks that need to find
irrelevant text (irrelevant documents), the α method
is suitable. The standard Pearson method, however, is not able
to account for such variabilities in text
similarity scores. The goal is to choose a method that
gives a high correlation in all cases, such as the optimum method
in Figure 2d.
Figure 1: Effect of outliers on Pearson's correlation (Anscombe's quartet).
Figure 2: Effect of the similarity method on Pearson's correlation (r = 0.7 for panels a, b, and c; r = 1.00 for panel d): (a) α method, (b) β method, (c) Ω method, (d) δ method, each plotted against the human scores of pairs 1-9.
Although there are many alternatives to
the Pearson correlation, most text similarity
competitions (e.g., the SemEval series [11], [12]) still use
the Pearson correlation as the standard metric. Nevertheless,
a growing body of research is pushing toward a
new correlation measure in the text similarity
domain. However, most ranked correlation
methods, such as Spearman [7] and the Kendall tau
correlation [23], suffer from ties and are
suitable only for datasets that are ranked in nature [24].
Therefore, the aim is to find a method that handles the
issues of the Pearson correlation and provides
alternatives that have not been studied deeply in the
semantic similarity domain.
Hyndman and Koehler [25] proposed the
scaling absolute error methods to scale down
observed values in the finance domain. Compared to
the relative error methods, the scaling absolute error
method is independent of the scale of the observed
data, and it can remove the problems of undefined
means and infinite variance. Hyndman extended the
scaling absolute error method to the Mean Squared
Scaled Error (MSSE).
In our context, the absolute error measure is
the difference between the text-pair human score and
the similarity method observed score. The MSSE is
a function of absolute error of human and observed
scores concerning the mean variability of observed
scores. Consequently, the MSSE should be able to
reduce the absolute errors presented in Figure 2. We
normalize the MSSE (NMSSE) to a scale between 0
and 1 using the exponential function. The NMSSE,
compared to Pearson correlation, ranks text
similarity methods based on the target text
application task.
Practically, and as a proof of concept, our
proposed metric shows the divergence of some
commonly cited works. Although the LSA measure
of [26] reported a good Pearson correlation, it
misjudges text-pair scores, reporting an absolute
relative error approaching 80%. Moreover, methods
that depend on a large corpus tend to overestimate
scores of text pairs [27]. The objective of this paper
is to propose a new approach that could be used to
eliminate data outliers and provide a performance
metric to select the best text similarity method.
First, Pearson and its related measures are
explained. Next, the proposed metric is explained.
Then, the metric is evaluated. After that, we
highlight the research implications and limitations.
Finally, the paper is concluded.
2. RELATED WORKS
2.1. Pearson Correlation
The Pearson correlation was proposed
long ago [28], yet it is still applicable as an evaluation
metric for many SemEval task workshop series
[11], [12].
The Pearson correlation is calculated as the
covariance of the two variables divided by the
product of their standard deviations [29]. In the text
similarity domain, the variables are the group of human-mean
scores and the related group of observed test scores. So, if we have one set of scores $\{h_1, \ldots, h_n\}$
representing the human-mean scores of a list of text
pairs and another set $\{o_1, \ldots, o_n\}$
containing the $n$ observed scores (from a text similarity
method), Pearson's correlation coefficient $r$ is
given by (1).

$$ r = \frac{\sum_{i=1}^{n}\left(h_i - \bar{h}\right)\left(o_i - \bar{o}\right)}{\sqrt{\sum_{i=1}^{n}\left(h_i - \bar{h}\right)^{2}}\;\sqrt{\sum_{i=1}^{n}\left(o_i - \bar{o}\right)^{2}}} \qquad (1) $$

where $n$ is the number of text pairs, $h_i$ and $o_i$ are the
$i$-th human-mean (reference) and test (observed) scores, and $\bar{h}$ and $\bar{o}$ are the means
of the gold-standard and test scores, respectively.
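As a reference implementation of equation (1), the following minimal NumPy sketch (our own, with illustrative scores) computes $r$; np.corrcoef(human, observed)[0, 1] returns the same value.

```python
import numpy as np

def pearson_r(human, observed):
    """Pearson correlation between human-mean and observed scores, per equation (1)."""
    h = np.asarray(human, dtype=float)
    o = np.asarray(observed, dtype=float)
    h_dev, o_dev = h - h.mean(), o - o.mean()          # deviations from each mean
    return (h_dev * o_dev).sum() / (
        np.sqrt((h_dev ** 2).sum()) * np.sqrt((o_dev ** 2).sum())
    )

# Illustrative scores only (not taken from the benchmark datasets)
print(round(pearson_r([0.1, 0.4, 0.6, 0.9], [0.2, 0.35, 0.7, 0.8]), 3))
```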
2.2. Ranking Methods
The Spearman method [7] is considered
one of the most cited alternatives to the Pearson
correlation; however, it is not used regularly in the text
similarity domain because it works on ranked data,
which is not reasonable in text similarity [24].
Similarly, the Kendall tau correlation [23], which
calculates the proportion of concordant ranks between datasets, is
rarely seen in the text similarity domain.
Several other methods measure the gain of
a document based on its position in the result list [8]–
[10]; however, these methods suffer from ties and
are not suitable for scaled text similarity
measures [30]. Hoeffding's D, a non-parametric
measure, measures the difference
between the joint ranks and the product of their
marginal ranks [31]. The distance correlation, as its
name implies, is based on the distance (usually
Euclidean) between observations to measure the dependence between two
variables [32], [33]. The maximal information
coefficient (MIC) measures the strength of the
linear or non-linear association between two
variables [34]; however, it does not perform well with
low sample sizes [35].
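For reference, both ranked alternatives mentioned above are available in SciPy; a minimal sketch with illustrative scores (the tie in the human scores shows why tie handling matters):

```python
from scipy.stats import spearmanr, kendalltau

human = [0.1, 0.4, 0.6, 0.9, 0.9]        # note the tied human scores
observed = [0.2, 0.35, 0.7, 0.8, 0.75]

rho, _ = spearmanr(human, observed)      # Spearman rank correlation
tau, _ = kendalltau(human, observed)     # Kendall tau (the tau-b variant handles ties)
print(round(rho, 3), round(tau, 3))
```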
2.3. Error methods
Error methods are used to quantify the
difference, or percentage difference, between actual and forecast
values. The absolute error computes the amount of
error in a trial. The relative error extends
the absolute error by expressing it relative to the original (true)
value. These methods are easy to use [36].
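A minimal sketch of these two error measures, with illustrative values:

```python
def absolute_error(actual, predicted):
    """Absolute error: magnitude of the difference between actual and predicted values."""
    return abs(actual - predicted)

def relative_error(actual, predicted):
    """Relative error: absolute error relative to the true value.
    Undefined when the actual value is zero, one of the issues scaled errors avoid."""
    return abs(actual - predicted) / abs(actual)

print(round(absolute_error(0.8, 0.6), 2), round(relative_error(0.8, 0.6), 2))  # 0.2 0.25
```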
3. PROPOSED METRIC
Equation (2) defines the absolute error ($AE_j$) of a
text pair $j$ as the difference between the human (actual) score $h_j$
and the observed (predicted) score $o_j$ of a text similarity measure.

$$ AE_j = \left| h_j - o_j \right| \qquad (2) $$

The scaled error ($SE_j$) for each text pair $j$
is given by equation (3), where $n$ is the number of
text pairs in the benchmark dataset and $\bar{o}$ is the
mean of the observed similarity scores.

$$ SE_j = \frac{AE_j}{\frac{1}{n}\sum_{i=1}^{n}\left| o_i - \bar{o} \right|} \qquad (3) $$

The mean scaled square error (MSSE) is then defined by
equation (4).

$$ MSSE = \frac{1}{n}\sum_{j=1}^{n} SE_j \qquad (4) $$

The lowest value of the MSSE is zero, reached when the
absolute error between actual and predicted values is zero;
it tends to infinity when all predicted values are
identical, that is, when the mean of the observed scores ($\bar{o}$)
equals every predicted value ($o_j$). Therefore, we
normalize the MSSE to the range (0, 1) to
allow a quantitative comparison between different
datasets, as shown in equation (5), where
MSSE is given by equation (4) and $e$ is the
exponential function. The NMSSE approaches 1
when the error is at its maximum and 0 when the
error is very low. Therefore, when ranking similarity
methods, the lower the NMSSE, the better.

$$ NMSSE = 1 - e^{-MSSE} \qquad (5) $$
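A minimal NumPy sketch of equations (2)–(5) as reconstructed above (our own implementation, not the authors' code); applied to the δ method scores listed later in Table 3, it reproduces the NMSSE of about 0.17 reported in Table 4.

```python
import numpy as np

def nmsse(human, observed):
    """Normalized mean scaled square error, following equations (2)-(5)."""
    h = np.asarray(human, dtype=float)
    o = np.asarray(observed, dtype=float)
    ae = np.abs(h - o)                        # eq. (2): absolute error per text pair
    scale = np.mean(np.abs(o - o.mean()))     # mean absolute deviation of the observed scores
    se = ae / scale                           # eq. (3): scaled error (blows up if all o are identical)
    msse = se.mean()                          # eq. (4): mean scaled error over the n pairs
    return 1.0 - np.exp(-msse)                # eq. (5): normalized to the range (0, 1)

# Human means and delta-method scores from Table 3 (demo dataset)
human = [0.01, 0.12, 0.23, 0.34, 0.45, 0.57, 0.68, 0.79, 0.90]
delta = [0.01, 0.11, 0.21, 0.30, 0.41, 0.51, 0.60, 0.75, 0.81]
print(round(nmsse(human, delta), 2))          # approximately 0.17, as in Table 4
```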
4. EVALUATION AND DISCUSSION
4.1. Datasets used in the Experiments
Table 1 shows the set of datasets used in the
experiment. The datasets are split into two
categories: development (6,427 text pairs) and test
datasets (1,909 text pairs). The goal of the split was
to support text similarity measures that depended on
pre-training or test training[12]; however, in our
case, we used both datasets for the selected text
measures. We filter datasets from stopwords using
the nltk stop words’ list.
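A minimal sketch of the stopword-filtering step (our own; the example sentence is illustrative and simple whitespace tokenization is assumed):

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)        # one-time download of the NLTK stopword corpus
STOP_WORDS = set(stopwords.words("english"))

def remove_stopwords(sentence):
    """Drop tokens that appear in the NLTK English stopword list."""
    return " ".join(t for t in sentence.lower().split() if t not in STOP_WORDS)

print(remove_stopwords("A man is playing a guitar on the stage"))  # -> "man playing guitar stage"
```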
4.2. Selected Text Similarity Measures
For this paper, the selected text measures
illustrate the applicability of the proposed metric
over a wide range of text similarity measures, as
shown in Table 2.
Table 1: Benchmark datasets

Dataset | Dev. | Test | Total | Description
Demo Crafted Dataset | - | 9 | 9 | We prepared this dataset to illustrate similarity measures' problems and to apply the proposed metric to a simple, easy-to-view dataset.
STS-30 | - | 30 | 30 | 30 sentence pairs collected by Li [37] based on dictionary definitions of words from [38].
SemEval STS | 1500 | 1379 | 2879 | Text from image captions, news headlines, and user forums, part of the text similarity tasks of the SemEval series [12].
SICK | 4927 | 500 | 5427 | Sentences Involving Compositional Knowledge (SICK): English sentences from the 8K ImageFlickr and the SemEval 2012 STS MSR-Video Description datasets [39].
Table 2: Methods used in this experiment

α method: A demo method used on our crafted demo dataset. The α method demonstrates a text similarity method that leans toward dissimilar text pairs. It produces observed scores that are 95% accurate to the human means for the first three pairs and random values for the remaining pairs.

β method: A demo method used on our crafted demo dataset. The β method demonstrates a text similarity method that leans toward moderately similar text pairs. It produces observed scores that are 95% accurate to the human means for pairs 4-6 and random values for the remaining pairs.

Ω method: A demo method used on our crafted demo dataset. The Ω method demonstrates a text similarity method that leans toward highly similar text pairs. It produces observed scores that are 95% accurate to the human means for pairs 7-9 and random values for the remaining pairs.

δ method: A demo method that represents the method scoring the highest Pearson correlation. It produces observed scores that are 95% accurate to the human-mean scores.

InferSent: InferSent (INF for short) is a sentence embedding trained on fastText vectors by Facebook research. INF is a BiLSTM with max pooling trained on the 570k English sentence pairs of the SNLI dataset [40].

GSE: Google's universal sentence encoder (GSE) converts any text to a semantic vector. The semantic measure is based on deep learning over the semantic space. We use Encoder 2 from Google TensorFlow Hub.

TSM: The Text Similarity Measure (TSM) is a WordNet-based measure that calculates the semantic similarity of two sentences using information from WordNet and corpus statistics [27].

WMD: The Word Mover's Distance (WMD) method uses the word embeddings of the words in two texts to measure the minimum amount that the words in one text need to "travel" in semantic space to reach the words of the other text [41]. We use the pre-trained GloVe word vectors (840B tokens) and the fastText word vectors (2 million word vectors), denoted W2V.

SIF: The Smooth Inverse Frequency (SIF) method gives less weight to unrelated words: word embeddings are weighted based on the estimated relative frequency of a word in a reference corpus together with the common component analysis technique [42]. We use the pre-trained GloVe word vectors (840B tokens) and the fastText word vectors (2 million word vectors), denoted W2V.
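As a rough illustration of how a WMD-based similarity can be produced with off-the-shelf tooling (this is our sketch, not the authors' exact setup; the vector file path and the 1/(1+d) distance-to-similarity mapping are assumptions):

```python
from gensim.models import KeyedVectors

# Hypothetical word2vec-format vector file; the paper uses pre-trained GloVe (840B tokens)
# and fastText vectors. Computing WMD in gensim additionally requires the POT/pyemd package.
vectors = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

def wmd_similarity(sent_a, sent_b):
    """Word Mover's Distance mapped to a bounded similarity score (assumed mapping)."""
    distance = vectors.wmdistance(sent_a.lower().split(), sent_b.lower().split())
    return 1.0 / (1.0 + distance)   # smaller distance -> similarity closer to 1
```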
4.3. NMSSE Illustrated over the Demo Dataset
For illustration, and to show various cases
of text similarity measures over a wide range of
datasets, we use a demo dataset for this experiment.
Table 3 shows the list of crafted text-pair scores
for the four crafted methods α, β, Ω, and δ, as
described in Table 2. The table shows the cross-sections
of the dataset (bins 1 to 3) and the score of
each individual pair under the crafted methods.
Figure 3 shows the Pearson correlation, the
Spearman correlation, and the proposed NMSSE metric for the
data in Table 3. The figure also shows the Pearson
correlation for text pairs 1-3, 4-6, and 7-9, labeled
Pearson_Q1, Pearson_Q2, and Pearson_Q3, respectively.
Results show that methods that are good at measuring
dissimilar text (the α method) have a high Pearson
correlation on the first three text pairs (Pearson_Q1),
while methods that are good at measuring highly similar
text (the Ω method) have a high correlation on the last three
text pairs (Pearson_Q3). Between the two, the β method shows a high Pearson
correlation on pairs 4-6 (Pearson_Q2).
The reported findings of the three demo
methods indicate that, within their favored bins, the absolute error between
human scores and predicted scores is low. Therefore,
for a task that needs to discover similar text, such as
plagiarism detection, the Ω method is favorable, and for semantic
tasks that need to find irrelevant text (irrelevant documents),
the α method is appropriate.
Figure 3: Crafted dataset correlations: Pearson, NMSSE, Pearson_Q1, Pearson_Q2, and Pearson_Q3 for the α, β, and Ω methods.
Table 3: The demo similarity methods

Bin | Pair | Human | α Sim. | β Sim. | Ω Sim. | δ Sim.
1 | 1 | 0.01 | 0.01 | 0.29 | 0.47 | 0.01
1 | 2 | 0.12 | 0.12 | 0.00 | 0.19 | 0.11
1 | 3 | 0.23 | 0.22 | 0.01 | 0.16 | 0.21
2 | 4 | 0.34 | 0.73 | 0.33 | 0.33 | 0.30
2 | 5 | 0.45 | 0.27 | 0.43 | 0.10 | 0.41
2 | 6 | 0.57 | 0.76 | 0.54 | 0.93 | 0.51
3 | 7 | 0.68 | 0.79 | 0.61 | 0.64 | 0.60
3 | 8 | 0.79 | 0.40 | 0.24 | 0.75 | 0.75
3 | 9 | 0.90 | 0.67 | 0.70 | 0.85 | 0.81
In contrast, the δ method, the best method,
has a smooth absolute error except for the outlier
in pair number 8. The best method (δ)
shows the lowest errors over the dataset.
The unproductive behavior of the Pearson
correlation shown in Figure 3 is further illustrated in Figure
4. According to Figure 4, the NMSSE is lowest
for the α method on the first section because the α method does
well on pairs 1-3. The NMSSE is likewise lowest
for the Ω method on the third section, since the Ω method does well
for pairs 7-9. The same applies to the
β method, which does well for pairs
4-6. The best, optimum method δ shows low NMSSE values
across all three cross-sections of the
dataset.
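To make the cross-section analysis concrete, the following sketch (our own, using the Table 3 scores) computes the whole-dataset Pearson correlation and the per-bin correlations that Figure 3 labels Pearson_Q1–Q3; the nmsse sketch from Section 3 can be applied per bin in the same way to obtain the NMSSE_q1–q3 values of Figure 4.

```python
import numpy as np

# Demo dataset from Table 3: human means and the four crafted methods' observed scores
human = np.array([0.01, 0.12, 0.23, 0.34, 0.45, 0.57, 0.68, 0.79, 0.90])
methods = {
    "alpha": np.array([0.01, 0.12, 0.22, 0.73, 0.27, 0.76, 0.79, 0.40, 0.67]),
    "beta":  np.array([0.29, 0.00, 0.01, 0.33, 0.43, 0.54, 0.61, 0.24, 0.70]),
    "omega": np.array([0.47, 0.19, 0.16, 0.33, 0.10, 0.93, 0.64, 0.75, 0.85]),
    "delta": np.array([0.01, 0.11, 0.21, 0.30, 0.41, 0.51, 0.60, 0.75, 0.81]),
}
bins = {"Q1": slice(0, 3), "Q2": slice(3, 6), "Q3": slice(6, 9)}

for name, scores in methods.items():
    overall = np.corrcoef(human, scores)[0, 1]   # about 0.7 for alpha/beta/omega, 1.0 for delta
    per_bin = {q: round(np.corrcoef(human[s], scores[s])[0, 1], 2) for q, s in bins.items()}
    print(f"{name}: r={overall:.2f}, per-bin r={per_bin}")
```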
Table 4 shows the statistics of the demo
data as per equations (2)-(4). Although α and β are
similar in absolute error, they differ in scaled
error because β has a higher MASE, as shown in
Figure 4. The root cause is that as $\sum_j |o_j - \bar{o}|$
increases, the denominator in equation (3)
increases and, as a result, the scaled error is
reduced. Conversely, if every predicted score $o_j$ is close
to the mean of all predictions $\bar{o}$, the denominator
approaches zero and we get the highest possible error. Although the Ω method has
the highest test-score mean variability ($\sum_j |o_j - \bar{o}|$),
it ranked third using the NMSSE. As
shown in Figure 5, the scaled errors are reduced
when the method matches the type of similarity in the
cross-section.
Figure 4: NMSSE performance over sections of the dataset (NMSSE_q1, NMSSE_q2, NMSSE_q3) for the α, β, and Ω methods.
Figure 5: Scaled errors of the α, β, Ω, and δ methods over the nine demo text pairs.
Table 4: Statistics of the crafted dataset

Statistic | α | β | Ω | δ
Σ_j |h_j − o_j| | 1.52 | 1.51 | 1.59 | 0.39
Σ_j |o_j − ō| | 2.38 | 1.76 | 2.4 | 2.04
Σ_j SE_j | 5.74 | 7.72 | 5.35 | 1.72
MAE | 0.17 | 0.17 | 0.16 | 0.04
MASE | 0.64 | 0.86 | 0.59 | 0.19
Pearson | 0.70 | 0.70 | 0.70 | 1.00
Spearman | 0.70 | 0.67 | 0.62 | 1.00
NMSSE | 0.47 | 0.58 | 0.48 | 0.17
Furthermore, we calculate the variability
between a performance metric (including the
NMSSE) on the whole dataset and the value of the
same metric on each section of the dataset, Q1,
Q2, and Q3. The target is to select the
performance metric that has the lowest variability, i.e., a
metric that works well in many situations. Figure 6
shows the variability of Pearson, Spearman,
and the proposed NMSSE with respect to the three
sections of the dataset (pairs 1-3, 4-6, and 7-9,
respectively). The lowest variability was in the NMSSE
for the best method, δ, whereas the Pearson measure
shows higher variability due to outliers in each
dataset section. We deduce that the NMSSE is effective
in scaling data and in removing outliers. However,
the NMSSE shows relatively higher variability in
the Q1 section because most pairs in this section
have low similarity scores, which affects the
denominator in equation (3).
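A small sketch of this variability check, assuming "variability" is the absolute gap between a metric's whole-dataset value and its value on each section (the paper does not spell out the exact formula); metric_fn can be the pearson_r or nmsse sketches given earlier.

```python
import numpy as np

def section_variability(metric_fn, human, observed, sections):
    """Absolute difference between a metric on the whole dataset and on each section."""
    h, o = np.asarray(human, dtype=float), np.asarray(observed, dtype=float)
    whole = metric_fn(h, o)
    return {name: abs(whole - metric_fn(h[s], o[s])) for name, s in sections.items()}

# The demo sections Q1-Q3 correspond to pairs 1-3, 4-6, and 7-9
sections = {"Q1": slice(0, 3), "Q2": slice(3, 6), "Q3": slice(6, 9)}
```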
4.4. Practical Evaluation of NMSSE
Tables 5-7 show the performance of the
NMSSE, the Pearson correlation, Spearman, and the
MAE for the selected methods presented in Table 2.
The scores were calculated using a weighted
average based on the number of text pairs in
both the development and test benchmark datasets. The
predicted values and human-mean scores were
normalized to the range 0 to 1 to make the errors
comparable across methods.
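A small sketch of the scoring pipeline described above; the min-max normalization and the illustrative dev/test values are our assumptions, since the paper only states that scores were weighted by the number of pairs and normalized to the 0-1 range.

```python
def weighted_score(split_scores, split_sizes):
    """Average per-split scores, weighted by the number of text pairs in each split."""
    return sum(s * n for s, n in zip(split_scores, split_sizes)) / sum(split_sizes)

def min_max_normalize(values):
    """Rescale a list of scores to the range [0, 1] (assumed normalization scheme)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Illustrative only: combining hypothetical dev and test scores for SICK (4,927 and 500 pairs, Table 1)
print(round(weighted_score([0.81, 0.83], [4927, 500]), 2))
```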
The NMSSE is proposed to rank text
similarity methods. As Table 5 shows, if an
application is looking for an alternative text
similarity method, the GSE is preferred over the other
methods as it has the lowest NMSSE. The only
restriction in this scenario is that the application
should be based on a dataset that imitates a domain similar
to the SICK dataset. On the STS dataset
(Table 6), the SIF method is the best method as it has
the lowest NMSSE. Similarly, on the 30-pair dataset
(STS-65) shown in Table 7, SIF had the lowest
NMSSE. We emphasize that the proposed metric is
fine-grained with respect to the benchmark dataset, which
gives our metric an advantage over other methods.
Figure 6: Variability over data segments: Pearson, Spearman, and NMSSE for the α, β, and Ω methods over segments Q1, Q2, and Q3.
Table 5: Weighted scores on the SICK dataset (Dev, Test)

Metric | GSE | INF | SIF (W2V) | SIF (GLOVE) | WMD (GLOVE) | WMD (W2V) | TSM
Pearson | 0.82 | 0.76 | 0.73 | 0.72 | 0.64 | 0.64 | 0.48
Spearman | 0.77 | 0.70 | 0.61 | 0.59 | 0.59 | 0.59 | 0.43
MAE | 0.09 | 0.12 | 0.16 | 0.15 | 0.43 | 0.43 | 0.15
NMSSE | 0.43 | 0.54 | 0.52 | 0.53 | 1.00 | 1.00 | 0.98

Table 7: STS-65 scores

Metric | GSE | INF | SIF (W2V) | SIF (GLOVE) | WMD (GLOVE) | WMD (W2V) | TSM
Pearson | 0.78 | 0.80 | 0.80 | 0.73 | 0.69 | 0.74 | 0.52
Spearman | 0.80 | 0.79 | 0.77 | 0.79 | 0.63 | 0.68 | 0.47
MAE | 0.27 | 0.39 | 0.12 | 0.16 | 0.47 | 0.42 | 0.35
NMSSE | 0.91 | 1.00 | 0.37 | 0.50 | 1.00 | 1.00 | 1.00

Table 6: Weighted scores on the STS dataset (Dev, Test)

Metric | GSE | INF | SIF (W2V) | SIF (GLOVE) | WMD (GLOVE) | WMD (W2V) | TSM
Pearson | 0.78 | 0.75 | 0.73 | 0.72 | 0.55 | 0.61 | 0.36
Spearman | 0.77 | 0.74 | 0.70 | 0.71 | 0.55 | 0.61 | 0.37
MAE | 0.23 | 0.21 | 0.18 | 0.20 | 0.37 | 0.38 | 0.24
NMSSE | 0.88 | 0.92 | 0.72 | 0.83 | 1.00 | 1.00 | 0.94
Figure 7: Ranking methods using NMSSE. Weighted Pearson, Spearman, MAE, and NMSSE scores over all five benchmark datasets: GSE (0.81, 0.77, 0.14, 0.59), INF (0.75, 0.71, 0.15, 0.67), SIF (0.72, 0.60, 0.41, 0.82), TSM (0.43, 0.41, 0.18, 0.96), WMD (0.62, 0.59, 0.41, 1.00).
The application of the NMSSE handles the
problematic issues of the Pearson correlation, as shown
in Figure 7. The figure shows the weighted scores
over all five benchmark datasets. The leaders are
the GSE and the INF methods, as they have the
lowest NMSSE compared to the other methods. Over the
datasets, the traditional edge-counting TSM
method outperformed the frequency-based (SIF) and word-distance
(WMD) methods due to the addition of
knowledge from WordNet exploited by the TSM.
We noticed that the WMD method got the highest
NMSSE due to the scale term in equation (3) (the mean
absolute deviation of the observed scores), which was on the
order of 10^-6; consequently, the NMSSE becomes high as this
denominator becomes low. The root
cause of the low scale value was that the predicted
values of the WMD method had a mean of 0.5;
in other words, the average difference between
the predicted scores and the mean of the
predictions approaches zero. Figure 8 shows the
WMD method scores and the human scores for the
1,380 text pairs of the STS test benchmark dataset.
The figure shows that the WMD overestimates or
underestimates scores by an almost constant value.
Therefore, the WMD got the highest NMSSE.
4.5. Comparing NMSSE with Related Methods
To our knowledge, there is no complete performance metric
for the text similarity domain. We
carry out a comparison between the proposed
NMSSE and other methods over the following
criteria:
A. Interpretability: a useful performance metric
should be easy to use and interpret; therefore,
its output can be easily compared within a
predefined scale.
B. Dependency: a useful metric should find the
dependency between the human scores and
the predicted scores.
C. In-group relationship: a useful metric should
indicate how each value in the group is related
to each other. As the human scores in a
benchmark dataset have a range of values
between 0 to 5, the predicted scores should
have similar consistent behavior.
D. Robustness to outliers: performance metrics
should resolve outliers’ issues without
affecting the ultimate performance metric
score.
E. Scale: a performance metric that has a
numeric value (e.g., 0 to 1) is quantifiable
when compared to values resulting from
other related applications.

Figure 8: NMSSE of WMD over the STS dataset: WMD prediction scores versus human scores (Hscore) for the 1,380 STS test pairs.
Table 8 compares our metric with a list of
selected metrics against these criteria, indicating for each
metric whether a criterion is satisfied. Although most of the
compared methods are interpretable (A), they suffer
from outliers (D). The MAE can be made
interpretable by computing the relative or percentage
error. The drawback of the MAE is that it does not
take the in-group predicted scores into consideration
(C), and it does not provide a standard scale (E). We
underline that we are not looking to replace the Pearson
correlation but to add extra information that could be
utilized by researchers in the natural language processing
and machine learning communities.
Table 8: Comparison of the proposed metric and related approaches

Criterion | NMSSE | Pearson | Ranking Methods | MAE
A. Interpretability | | | |
B. Dependency | | | |
C. In-group relationship | | | |
D. Robustness to outliers | | | |
E. Scale | | | |
5. IMPLICATIONS
The implications of this research are both
theoretical and practical. The new measure suggests
revisiting the long-standing use of the Pearson
correlation. In practice, applications
should select the similarity method with the lowest
possible normalized error. Although the scaled error
method was borrowed from an unrelated domain
(the finance domain), the newly proposed normalized
scaled square error could be used in other domains
where outliers have a significant effect on natural
language processing tasks. Since the proposed metric
is robust to outliers and provides an interpretable
scaled value, it would be practical for comparing text
in domains such as plagiarism detection and text
entailment.
6. LIMITATION
Despite the fact that the proposed method
is superior in ranking and text evaluation, more
research is needed before generalizing the results. The
method was applied to five datasets only, and it has
not been applied practically in any semantic text
similarity task.
Future research should target generalizing
the results for text similarity by annotating current
and new datasets to allow the comparison of the
proposed approach with other alternatives. Further
experiments are also needed to identify the situations in
which the Pearson correlation would be preferred over
the proposed normalized mean scaled square error method.
In the future, the proposed approach should
be evaluated using simulations and by applying the
proposed method to a large empirical dataset.
7. CONCLUSION
This paper proposes a new semantic
similarity metric that can be used to compare and
rank semantic similarity methods. The proposed
metric reduces dataset noise by scaling the absolute
error by the mean absolute difference between the
observed scores and their mean. The
metric was compared with Pearson, Spearman, and
the Mean Absolute Error. Results showed that the
newly proposed normalized scaled square error is
effective in reducing skewness and is applicable in
domains with different observed scores. In the
future, we plan to run several simulations over the
new metric and to evaluate it with extra-large
benchmark datasets.
REFERENCES
[1] N. Moriya, “Noise-Related Multivariate
Optimal Joint-Analysis in Longitudinal
Stochastic Processes,” Prog. Appl. Math.
Model., pp. 223–260, 2008.
[2] T. Höfer, H. Przyrembel, and S. Verleger,
“New evidence for the theory of the stork,”
Paediatr. Perinat. Epidemiol., vol. 18, no. 1,
pp. 88–92, 2004.
[3] D. T. Field, C. M. Williams, and L. T.
Butler, “Consumption of cocoa flavanols
results in an acute improvement in visual
and cognitive functions,” Physiol. Behav.,
vol. 103, no. 3–4, pp. 255–260, 2011.
[4] R. Aggarwal and P. Ranganathan,
“Common pitfalls in statistical analysis: The
use of correlation techniques,” Perspect.
Clin. Res., vol. 7, no. 4, p. 187, 2016.
[5] O. Hryniewicz and J. Karpiński,
“Prediction of reliability - the pitfalls of
using Pearson’s correlation,” Eksploat. i
Niezawodn., vol. 16, 2014.
[6] F. Serinaldi, A. Bárdossy, and C. G. Kilsby,
“Upper tail dependence in rainfall extremes:
would we know it if we saw it?,” Stoch.
Environ. Res. risk Assess., vol. 29, no. 4, pp.
1211–1233, 2015.
[7] C. Spearman, “The proof and measurement
of association between two things,” Am. J.
Psychol., vol. 15, no. 1, pp. 72–101, 1904.
[8] K. Järvelin and J. Kekäläinen, “IR
evaluation methods for retrieving highly
relevant documents,” in Proceedings of the
23rd annual international ACM SIGIR
conference on Research and development in
information retrieval, 2000, pp. 41–48.
[9] J. Kekäläinen, “Binary and graded relevance
in IR evaluations—comparison of the effects
on ranking of IR systems,” Inf. Process.
Manag., vol. 41, no. 5, pp. 1019–1033, 2005.
[10] D. Katerenchuk and A. Rosenberg,
“RankDCG: Rank-Ordering Evaluation
Measure,” CoRR, vol. abs/1803.0, 2018.
[11] E. Agirre, D. Cer, M. Diab, A. Gonzalez-
Agirre, D. Cer, and A. Gonzalez-Agirre,
“Semeval-2012 task 6: A pilot on semantic
textual similarity,” in Proceedings of the
First Joint Conference on Lexical and
Computational Semantics-Volume 1:
Proceedings of the main conference and the
shared task, and Volume 2: Proceedings of
the Sixth International Workshop on
Semantic Evaluation, 2012, no. 3, pp. 385–
393.
[12] D. M. Cer, M. T. Diab, E. Agirre, I. Lopez-
Gazpio, and L. Specia, “SemEval-2017 Task
1: Semantic Textual Similarity -
Multilingual and Cross-lingual Focused
Evaluation,” CoRR, vol. abs/1708.0, 2017.
[13] S. Nithya, A. Srinivasan, M. Senthilkumar,
and others, “Calculating the user-item
similarity using Pearson’s and cosine
correlation,” in 2017 International
Conference on Trends in Electronics and
Informatics (ICEI), 2017, pp. 1000–1004.
[14] I. Atoum, “A Novel Framework for
Measuring Software Quality-in-use based
on Semantic Similarity and Sentiment
Analysis of Software Reviews,” J. King
Saud Univ. - Comput. Inf. Sci., 2018.
[15] I. Atoum, A. Otoom, and N.
Kulathuramaiyer, “A Comprehensive
Comparative Study of Word and Sentence
Similarity Measures,” International Journal
of Computer Applications, vol. 135, no. 1.
Foundation of Computer Science (FCS),
NY, USA, pp. 10–17, 2016.
[16] J. Pennington, R. Socher, and C. D.
Manning, “Glove: Global vectors for word
representation,” in Proceedings of the
Empiricial Methods in Natural Language
Processing (EMNLP 2014), 2014, vol. 12,
pp. 1532–1543.
[17] Y. Li, L. Xu, F. Tian, L. Jiang, X. Zhong,
and E. Chen, “Word embedding revisited: A
new representation learning and explicit
matrix factorization perspective,” in
Proceedings of the 24th International Joint
Conference on Artificial Intelligence,
Buenos Aires, Argentina, 2015, pp. 3650–
3656.
[18] I. Atoum, “A Scalable Operational
Framework for Requirements Validation
Using Semantic and Functional Models,” in
Proceedings of the 2Nd International
Conference on Software Engineering and
Information Management, 2019, pp. 1–6.
[19] M. R. Ayyagari and I. Atoum, “CMMI-DEV
Implementation Simplified:A Spiral
Software Model,” Int. J. Adv. Comput. Sci.
Appl., vol. 10, no. 4, pp. 445–450, 2019.
[20] M. R. Ayyagari, “iScrum: Effective
Innovation Steering using Scrum
Methodology,” Int. J. Comput. Appl., vol.
178, no. 10, pp. 8–13, May 2019.
[21] I. Atoum, “Requirements Elicitation
Approach for Cyber Security Systems,” i-
manager’s J. Softw. Eng., vol. 10, no. 3, pp.
1–5, 2016.
[22] F. J. Anscombe, “Graphs in Statistical
Analysis,” Am. Stat., vol. 27, no. 1, pp. 17–
21, 1973.
[23] M. G. Kendall, “A new measure of rank
correlation,” Biometrika, vol. 30, no. 1/2, pp.
81–93, 1938.
[24] J. D. Gibbons and M. Kendall, “Rank
correlation methods,” Edward Arnold, 1990.
[25] R. J. Hyndman and A. B. Koehler, “Another
look at measures of forecast accuracy,” Int.
J. Forecast., vol. 22, no. 4, pp. 679–688,
Oct. 2006.
[26] J. O’Shea, Z. Bandar, K. Crockett, and D.
McLean, “A Comparative Study of Two
Short Text Semantic Similarity Measures,”
in Agent and Multi-Agent Systems:
Technologies and Applications, vol. 4953,
N. Nguyen, G. Jo, R. Howlett, and L. Jain,
Eds. Springer Berlin Heidelberg, 2008, pp.
172–181.
[27] I. Atoum and A. Otoom, “Efficient Hybrid
Semantic Text Similarity using Wordnet and
a Corpus,” International Journal of
Advanced Computer Science and
Applications(IJACSA), vol. 7, no. 9. The
Science and Information (SAI) Organization
Limited, pp. 124–130, 2016.
[28] Royal Society (Great Britain), Proceedings of the
Royal Society of London, vol. 58. Taylor &
Francis, 1895.
[29] J. Lee Rodgers and W. A. Nicewander,
“Thirteen ways to look at the correlation
coefficient,” Am. Stat., vol. 42, no. 1, pp. 59–
66, 1988.
[30] L. A. Goodman and W. H. Kruskal,
“Measures of association for cross
classifications,” in Measures of association
for cross classifications, Springer, 1979, pp.
2–34.
[31] W. Hoeffding, “A non-parametric test of
independence,” Ann. Math. Stat., pp. 546–
557, 1948.
[32] G. J. Székely, M. L. Rizzo, and others,
“Brownian distance covariance,” Ann. Appl.
Stat., vol. 3, no. 4, pp. 1236–1265, 2009.
[33] G. J. Székely, M. L. Rizzo, N. K. Bakirov,
and others, “Measuring and testing
dependence by correlation of distances,”
Ann. Stat., vol. 35, no. 6, pp. 2769–2794,
2007.
[34] D. N. Reshef et al., “Detecting novel
associations in large data sets,” Science,
vol. 334, no. 6062, pp. 1518–1524, 2011.
[35] R. Heller, Y. Heller, and M. Gorfine, “A
consistent multivariate test of association
based on ranks of distances,” Biometrika,
vol. 100, no. 2, pp. 503–510, 2012.
[36] Z. Wang and A. C. Bovik, “Mean squared
error: Love it or leave it? A new look at
signal fidelity measures,” IEEE Signal
Process. Mag., vol. 26, no. 1, pp. 98–117,
2009.
[37] Y. Li, D. McLean, Z. Bandar, J. D. O’Shea,
and K. Crockett, “Sentence Similarity Based
on Semantic Nets and Corpus Statistics,”
IEEE Trans. Knowl. Data Eng., vol. 18, no. 8,
pp. 1138–1150, 2006.
[38] H. Rubenstein and J. B. Goodenough,
“Contextual correlates of synonymy,”
Commun. ACM, vol. 8, no. 10, pp. 627–633,
Oct. 1965.
[39] M. Marelli et al., “A SICK cure for the
evaluation of compositional distributional
semantic models.,” in LREC, 2014, pp. 216–
223.
[40] A. Conneau, D. Kiela, H. Schwenk, L.
Barrault, and A. Bordes, “Supervised
Learning of Universal Sentence
Representations from Natural Language
Inference Data,” in Proceedings of the 2017
Conference on Empirical Methods in
Natural Language Processing, 2017, pp.
670–680.
[41] M. Kusner, Y. Sun, N. Kolkin, and K.
Weinberger, “From word embeddings to
document distances,” in International
Conference on Machine Learning, 2015, pp.
957–966.
[42] S. Arora, Y. Liang, and T. Ma, “A Simple
but Tough-to-Beat Baseline for Sentence
Embeddings,” in International Conference
on Learning Representations, 2017.