Inf Syst Front (2013) 15:399–410
DOI 10.1007/s10796-012-9404-7
Semantic similarity measurement using historical Google
search patterns
Jorge Martinez-Gil · José F. Aldana-Montes
Published online: 15 January 2013
© Springer Science+Business Media New York 2013
Abstract Computing the semantic similarity between
terms (or short text expressions) that have the same meaning
but which are not lexicographically similar is an impor-
tant challenge in the information integration field. The
problem is that techniques for textual semantic similarity
measurement often fail to deal with words not covered by
synonym dictionaries. In this paper, we try to solve this
problem by determining the semantic similarity for terms
using the knowledge inherent in the search history logs from
the Google search engine. To do this, we have designed
and evaluated four algorithmic methods for measuring the
semantic similarity between terms using their associated
history search patterns. These algorithmic methods are: a)
frequent co-occurrence of terms in search patterns, b) com-
putation of the relationship between search patterns, c)
outlier coincidence on search patterns, and d) forecasting
comparisons. We have shown experimentally that some of
these methods correlate well with respect to human judg-
ment when evaluating general purpose benchmark datasets,
and significantly outperform existing methods when evalu-
ating datasets containing terms that do not usually appear in
dictionaries.
Keywords Information integration · Web Intelligence ·
Semantic similarity
J. Martinez-Gil (✉) · J. F. Aldana-Montes
Department of Computer Science, University of Malaga,
Boulevard Louis Pasteur 35, Malaga, Spain
e-mail: jorgemar@acm.org
J. F. Aldana-Montes
e-mail: jfam@lcc.uma.es
1 Introduction
Semantic similarity measurement relates to computing the
similarity between terms or short text expressions, hav-
ing the same meaning or related information, but which
are not lexicographically similar (Li et al. 2003). This is a
key challenge in many computer science related fields:
for instance, in data warehouse integration, when creating
mappings that mutually link components of data warehouse
schemas (semi)automatically (Banek et al. 2007); in the
field of identity matching, where personal and social identity
features are used (Li et al. 2011); or in the entity resolution
field, where two given text objects have to be compared
(Kopcke et al. 2010). But the problem is that semantic simi-
larity changes over time and across domains (Bollegala et al.
2008). The traditional approach for solving this problem has
consisted of using manually compiled taxonomies such as
WordNet (Budanitsky and Hirst 2006). The problem is that
with the emergence of social networks or instant messag-
ing systems (Retzer et al. 2012), a lot of (sets of) terms
(proper nouns, brands, acronyms, new words, and so on) are
not included in these kinds of taxonomies; therefore, sim-
ilarity measures that are based on these kinds of resources
cannot be used in these tasks. However, we think that the
great advances in the web research field have provided new
opportunities for developing accurate solutions.
On the other hand, Collective Intelligence (CI) is an
active field of research that explores the potential of collab-
orative work in order to solve complex problems (Scarlat
and Maries 2009). Scientists from the fields of sociology,
mass behavior, and computer science have made impor-
tant contributions to this field. It is supposed that when a
group of people collaborate or compete with each other,
intelligence or behavior that otherwise did not exist sud-
denly emerges. We use the name Web Intelligence (WI)
when these people use the Web as a means of collabo-
ration. We want to profit from the fact that through their
interactions with the web search engines, users provide
an interesting kind of information that can be converted
into knowledge reusable for solving problems related with
semantic similarity measurement.
To do this, we are going to use Google Trends (Choi and
Varian 2009) which is a web application owned by Google
Inc. based on Google Search (Brin and Page 1998). This
web application shows how often a particular search-term
is entered relative to the total search-volume across various
specific regions, categories, time frames and properties. We
are working under the assumption that users express
themselves by searching for the same real-world concepts
at the same time, but represented with different
lexicographies. Therefore, the main contributions of this
work can be summarized
as follows:
– We propose for the first time (to the best of our
knowledge) to use historical search patterns from web
search engine users to determine the degree of semantic
similarity between (sets of) terms. We are especially
interested in measuring the similarity between new
terms or short text expressions.
– We propose and evaluate four algorithmic methods
for measuring the semantic similarity between terms
using their historical search patterns. These algorithmic
methods are: a) frequent co-occurrence of terms in
search patterns, b) computation of the relationship
between search patterns, c) outlier coincidence on
search patterns, and d) forecasting comparisons.
The rest of this paper is organized as follows: Section 2
describes the related work proposed in the literature.
Section 3 describes the key aspects of our contribution,
including the different ways of computing the semantic
similarity. Section 4 presents a statistical evaluation of our
approach in relation to existing ones. Section 5 presents
a discussion based on our results, and finally, Section 6
describes the conclusions and future lines of research.
2 Related work
We have not found proposals addressing the problem of
semantic similarity measurements using search logs. Only
Nandi & Bernstein have proposed a technique which was
based on logs from virtual shops for computing similarity
between products (Nandi and Bernstein 2009). However, a
number of approaches have addressed the semantic similar-
ity measurement (Hliaoutakis et al. 2006; Patwardhan et al.
2003; Petrakis et al. 2006; Sanchez et al. 2010; Sanchez
et al. 2010), and the use of WI techniques for solving
computational problems (Jung and Thanh Nguyen 2008;
Scarlat and Maries 2009; Sparck Jones 2006) separately.
With regard to the first topic, identifying semantic
similarities between terms is not only an indicator of
mastery of a language, but also a key aspect in many
computer-related fields. It should be taken into account
that semantic similarity measures can help computers to
distinguish one object
from another, group them based on the similarity, classify a
new object inside the group, predict the behavior of the new
object or simplify all the data into reasonable relationships.
There are a lot of disciplines that could benefit from these
capabilities (Hjorland 2007). Among the most relevant areas
is the data warehouse field, where applications are
characterized by heterogeneous models that have to be
analyzed and matched either manually or semi-automatically at design
time (Fong et al. 2009). The main advantage of matching
these models consists of enabling a broader knowledge base
for decision-support systems, knowledge discovery and data
mining than each of the independent warehouses could offer
separately (Banek et al. 2007). There is also the possibility
of avoiding model matching by manually copying all data
into a centralized warehouse, but this task entails a great
cost in terms of resource consumption, and the results are
not reusable in other situations. Designing good semantic
similarity measures allows us to build a mechanism for
automatic query translation (which is a prerequisite for a
successful decoupled integration) in an efficient, cheap and
highly reusable manner.
Several works have been developed over the last few
years proposing different ways to measure semantic similar-
ity. Petrakis et al. stated that according to the data sources
and the way in which they are used, different families of
methods can be identified (Petrakis et al. 2006). These
families are:
– Edge Counting Measures: based on the length of the
path linking the terms and on the position of the terms
in the taxonomy.
– Information Content Measures: measure the difference
in the information content of the two terms as a function
of their probability of occurrence in a text corpus.
– Feature based Measures: measure the similarity
between terms as a function of their properties or based
on their relationships to other similar terms.
– Hybrid Measures: combine all of the above.
Our proposal does not fit well into any of these families
of methods; rather, it proposes a new one: WI-based
Measures. Regarding the use of WI techniques for solving
computational problems, we have found many approaches.
– Aggregate information, which consists of creating lists
of items generated in the aggregate by the users
(Dhurandhar 2011). Some examples are a Top List of
items bought, a Top Search Items list, or a List of
Recent Items.
– Ratings, reviews, and recommendations, which consist
of understanding how collective information from users
can influence others (Hu et al. 2012).
– User-generated content, like blogs or wikis, which
consists of extracting some kind of intelligence from
contributions by users (Liu and Zhang 2012).
Finally, in order to compare our approaches with
existing ones, we consider techniques which are based
on dictionaries. We have chosen the Path Length algorithm
(Pedersen et al. 2004) which is a simple edge counting tech-
nique. The score is inversely proportional to the number of
nodes along the shortest path between the definitions. The
shortest possible path occurs when the two definitions are
the same, in which case the length is 1. Thus, the maxi-
mum score is 1. Another approach, proposed by Lesk (1986),
consists of finding overlaps in the definitions of the
two terms. The score is the sum of the squares of the overlap
lengths. The Leacock and Chodorow algorithm (Leacock
et al. 1998) takes into account the depth of the taxonomy
in which the definitions are found. An Information Content
(IC) measure was proposed by Resnik (1995); it computes
the common information between concepts a and b,
represented by the IC of their most specific common
ancestor subsuming both concepts in the taxonomy to
which they belong. Finally, the Vector Pairs technique
(Banerjee and Pedersen 2003) is a feature based measure
which works by comparing the co-occurrence vectors from
the WordNet definitions of concepts.
Now we propose using a kind of WI technique for deter-
mining the semantic similarity between terms that consists
of comparing the historical web search logs from the users.
The rest of this paper is devoted to explaining,
evaluating, and discussing semantic similarity measurement
using historical search patterns from the Google search engine.
3 Contribution
Web searching is the process of typing freeform text, either
words or small phrases, in order to look for websites, pho-
tos, articles, bookmarks, blog entries, videos, and more. In
a globalized world, our assumption is that large sets of peo-
ple will search for the same things at the same time but
probably from different parts of the world and using dif-
ferent lexicographies. We would like to use this in order
to identify similarities between text expressions. Although
our proposal also works with longer text statements, we are
going to focus on short expressions only.
The problem which we are facing consists of measur-
ing the semantic similarity between two given (sets of)
terms. Semantic similarity is a concept that extends beyond
synonymy and is often called semantic relatedness in the lit-
erature. According to Bollegala et al., a certain degree of
semantic similarity can be observed not only between syn-
onyms (e.g. lift and elevator), but also between meronyms
(e.g. car and wheel), hyponyms (leopard and cat), related
words (e.g. blood and hospital) as well as between antonyms
(e.g. day and night) (Bollegala et al. 2007). To do that, we
are going to work with time series. The reason is that Google
stores the user queries in the form of time series in order to
offer or exploit this information efficiently in the future.
According to the Australian Bureau of Statistics,1 a time
series is a collection of observations of well-defined data
items obtained through repeated measurements over time.
For example, measuring the value of retail sales each month
of the year would comprise a time series. This is because
sales revenue is well defined, and consistently measured at
equally spaced intervals. Thus, data which are collected
irregularly or only once do not form a time series.
The similarity problem in time series consists of com-
puting the similarity for two sequences of real numbers
(which represent some measurements of a real variable at
equal time intervals). However, this is not a trivial task,
because the notion of similarity varies even between
different people. Nevertheless, it is possible to offer a
minimal notion of what a similarity measure is from a
mathematical point of view:
Definition 1 (Similarity measure) A similarity measure sm
is a function sm : μ1 × μ2 → R that associates the
similarity of two input terms μ1 and μ2 with a similarity
score sc ∈ R in the range [0, 1].
We use the expression semantic similarity to indicate that
we are comparing the meaning of terms instead of compar-
ing the way they are written. For example, the terms card
and car are quite similar from a lexicographical point of
view but do not share the same meaning.
Before beginning to discuss our proposal, it is necessary
to take into account that in this work we have worked
under the assumption that Google has not suffered any
transient malfunction when taking measurements of the
user searches, so that the morphology of the search patterns
is due only to user searches on the Web. Once the problem
is clear, the first, and perhaps most intuitive, solution may
consist of viewing each sequence as a point in n-dimensional
space and defining the similarity between the two sequences
there. This solution would be trivial to compute, but there
is an important problem: there are no scales on the graphics,
due to the normalized results, and therefore we do not know
what the absolute numbers are.
1http://www.abs.gov.au/
To avoid these kinds of problems, we propose using four
different ways to define and compute the semantic simi-
larity: Measuring the co-occurrence of terms in search pat-
terns, identifying the relationships between search patterns,
determining the outlier coincidence on search patterns, and
making forecasting comparisons. None of the proposed
algorithms takes into account the scale of the results;
instead, they consider features like frequent co-occurrences,
correlations, anomalies, or future trends, respectively.
Moreover, it should be
taken into account that for the rest of this work, we are
going to evaluate our four approaches using two benchmark
datasets:
– The Miller & Charles benchmark dataset, which is a dataset
of term pairs rated by a group of 38 human beings
(Miller and Charles 1991). Term pairs are rated on a
scale from 0 (no similarity) to 4 (complete similarity).
Miller & Charles ratings have been considered as the
traditional benchmark dataset to evaluate solutions that
involve semantic similarity measures (Bollegala et al.
2007).
– A new dataset, which we will name Martinez &
Aldana, rated by a group of 20 people belonging
to several countries, indicating a value of
0 for not similar terms and 1 for totally similar terms.
This benchmark dataset has been created to evaluate
terms that are not frequently covered by dictionaries or
thesauri but which are used by people. Therefore, we
will be able to determine the most appropriate algorithm
for comparing the semantic similarity of new words.
This could be useful in domains that create new text
expressions very frequently.
The comparison between these two benchmark datasets
and our results is made using Pearson’s Correlation Coeffi-
cient, which is a statistical measure for the comparison of
two sets of values. The results lie in the real interval
[−1, 1], where −1 represents the worst case (totally differ-
ent values) and 1 represents the best case (totally equivalent
values). Note that all tables, except those for the Miller &
Charles ratings, are normalized into values in the [0, 1] range
for ease of comparison. Pearson’s correlation coefficient is
invariant against a linear transformation (Bollegala et al.
2007). As a general rule, for all the tables in the section
below, the first two columns show the terms of the pair
to be studied, the third column presents the results
from the benchmark dataset, and finally the fourth column
represents the value returned by our algorithm.
3.1 Co-occurrence of terms in search patterns
The first algorithmic method that we propose consists of
measuring how often two terms appear in the same query.
Co-occurrence of terms in a text corpus is usually used
as evidence of semantic similarity in the literature
(Bollegala et al. 2007; Cilibrasi and Vitányi 2007; Sanchez
et al. 2010). We propose adapting this paradigm for our
purposes. To do this, we are going to calculate the joint
probability that a user query contains both search
terms over time. Figure 1 shows an example of the co-
occurrence of the terms car and automobile over time.
These terms appear together in 6 years and the search log is 6
years old, so the resulting score is 6 divided by 6, thus 1.
Therefore, we have evidence of their semantic similarity.
The method that we propose to measure the similar-
ity using the notion of co-occurrence consists of using the
following formula:
sim(a, b) = (number of years the terms co-occur) / (number of years registered in the log)    (1)
We think that this formula is appropriate because it
computes a similarity score ranging from 0 (the (set of)
terms never appear together) to 1 (they appear together in
the same queries every year).
Fig. 1 Search pattern containing both terms car and automobile. User queries have frequently included both terms at the same time, so there
is evidence that both terms represent the same object
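The scoring in Eq. (1) can be sketched as follows. This is a minimal illustration: in practice the per-year co-occurrence flags would have to be derived from the Google search logs, which are not publicly available in raw form, so the input list here is an assumption.

```python
def co_occurrence_score(yearly_cooccurrence):
    """Eq. (1): fraction of logged years in which both terms
    appeared together in the same user queries."""
    if not yearly_cooccurrence:
        return 0.0
    return sum(yearly_cooccurrence) / len(yearly_cooccurrence)

# Example from the paper: car and automobile co-occur in all
# 6 years covered by the search log, so the score is 6/6 = 1.
print(co_occurrence_score([True, True, True, True, True, True]))  # 1.0
```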
Table 1 Results for the study of the co-occurrence using the Miller &
Charles dataset
Miller-Charles Co-occurrence
rooster voyage 0.080 0.000
noon string 0.080 0.000
glass magician 0.110 0.000
cord smile 0.130 0.000
coast forest 0.420 0.625
lad wizard 0.420 0.000
monk slave 0.550 0.000
forest graveyard 0.840 0.000
coast hill 0.870 0.750
food rooster 0.890 0.000
monk oracle 1.100 0.000
car journey 1.160 0.750
brother lad 1.660 0.000
crane implement 1.680 0.000
brother monk 2.820 0.000
implement tool 2.950 0.000
bird crane 2.970 0.625
bird cock 3.050 0.000
food fruit 3.080 1.000
furnace stove 3.110 0.875
midday noon 3.420 0.000
magician wizard 3.500 0.125
asylum madhouse 3.610 0.000
coast shore 3.700 0.750
boy lad 3.760 0.250
journey voyage 3.840 0.375
gem jewel 3.840 0.500
automobile car 3.920 1.000
Score 1.000 0.364
Table 1 shows us the results obtained using this method.
The problem is that there are terms that are not semantically
similar but are searched together frequently, for instance:
coast and forest, or coast and hill in this dataset. However,
our technique provides good results in most cases; therefore,
the correlation of this technique with respect to human judg-
ment is moderate and could be useful in cases where a
dictionary or thesaurus is not available.
Table 2 shows us the results obtained using the study of
co-occurrence over the specific benchmark. The problem is
that there are terms that are not semantically similar but are
searched together frequently, for instance the terms sustain-
able and renewable or slumdog and underprivileged. How-
ever, the global score is good, which confirms that it
could be used for identifying similarities when dictionaries
or other kinds of external resources do not exist.
Table 2 Results for the study of the co-occurrence using the Martinez
& Aldana dataset
Martinez-Aldana Co-occurrence
peak oil apocalypse 0.056 0.000
bobo bohemian 0.185 0.000
windmills offshore 0.278 0.000
copyleft copyright 0.283 0.000
whalewatching birdwatching 0.310 0.000
tweet snippet 0.314 0.000
subprime risky business 0.336 0.000
imo in my opinion 0.376 0.000
buzzword neologism 0.383 0.000
quantitative easing money flood 0.410 0.000
glamping luxury camping 0.463 0.000
slumdog underprivileged 0.482 0.500
i18n internationalization 0.518 0.000
vuvuzela soccer horn 0.523 0.125
pda computer 0.526 1.000
sustainable renewable 0.536 0.625
sudoku number place 0.538 0.000
terabyte gigabyte 0.573 0.625
ceo chief executive officer 0.603 0.375
tanorexia tanning addiction 0.608 0.000
the big apple New York 0.641 0.500
asap as soon as possible 0.661 0.000
qwerty keyboard 0.676 1.000
thx thanks 0.784 0.375
vlog video blog 0.788 0.000
wifi wireless network 0.900 1.000
hi-tech high technology 0.903 0.000
app application 0.915 1.000
Score 1.000 0.523
3.2 Correlation between search patterns
The correlation between two variables is the degree to
which there is a kind of relationship between them (Aitken
2007). Correlation is usually expressed as a coefficient
which measures the strength of a relationship between two
variables. We propose using two ways to measure the cor-
relation: Pearson’s correlation coefficient and Spearman’s
correlation coefficient.
The first measure of correlation that we propose is the
Pearson’s correlation coefficient (Egghe and Leydesdorff
2009). Using this measure means that we are interested in
the “shape” of the time series instead of their quantitative
values. The philosophy behind this technique can be appre-
ciated in Fig. 2, where the terms furnace and stove present
almost exactly the same “shape” and, therefore, the semantic
similarity between them is supposed to be very high. More-
over, Pearson’s correlation coefficient can be computed as
follows:
ρX,Y = cov(X, Y) / (σX σY) = E[(X − μX)(Y − μY)] / (σX σY)    (2)
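A direct implementation of Eq. (2) over two equal-length (normalized) search series might look as follows; the sample series are invented for illustration:

```python
from math import sqrt

def pearson(x, y):
    """Eq. (2): Pearson's correlation coefficient between two series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Two series with the same "shape" correlate perfectly,
# regardless of their absolute scale.
print(pearson([1, 2, 3, 4], [10, 20, 30, 40]))  # ≈ 1.0
```

Because the coefficient depends only on deviations from each series' own mean, the normalization that Google Trends applies to its curves does not affect the score.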
Table 3 shows us the results for the general purpose
benchmark dataset. Some term pairs present negative corre-
lation, i.e. one of them presents an ascending pattern while
the other presents a descending one, so the final quality of
the method is going to be decreased. Therefore, negative
correlations worsen the final score.
Table 4 shows us the results for the specific benchmark
dataset. As in the Miller & Charles benchmark dataset,
some term pairs present negative correlation, i.e. one of
them presents an ascending pattern whilst the other presents
a descending one, so the final quality of the method is not
good.
The second measure is the Spearman correlation coef-
ficient that determines how well the relationship between
two variables can be described using a monotonic function
(Aitken 2007). This is the formula to compute it:
ρX,Y = 1 − (6 Σ di²) / (n(n² − 1))    (3)
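Equation (3) can be implemented via rank differences as sketched below. Note that this simplified formula assumes no tied ranks; the example data are invented.

```python
def ranks(values):
    """Rank of each value, 1 = smallest (assumes no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    r = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        r[idx] = rank
    return r

def spearman(x, y):
    """Eq. (3): Spearman's rank correlation coefficient."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))

print(spearman([10, 20, 30, 40], [1, 4, 9, 16]))   # 1.0 (monotonic)
print(spearman([10, 20, 30, 40], [16, 9, 4, 1]))   # -1.0
```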
After using this correlation coefficient in our exper-
iments, we have determined that it is not useful for our
purposes, because no correlation was detected (a value near
zero). We have discovered that an increase in the web
searches for a term does not imply an increase in the
number of web searches for a synonym, so this kind of cor-
relation is not good for determining the semantic similarity
between terms using historical search logs, and therefore it
is not considered further in this paper.
Table 3 Results for the Pearson’s correlation using the Miller &
Charles dataset
Miller-Charles Pearson
rooster voyage 0.080 0.060
noon string 0.080 0.338
glass magician 0.110 0.405
cord smile 0.130 0.007
coast forest 0.420 0.863
lad wizard 0.420 0.449
monk slave 0.550 0.423
forest graveyard 0.840 0.057
coast hill 0.870 0.539
food rooster 0.890 0.128
monk oracle 1.100 0.234
car journey 1.160 0.417
brother lad 1.660 0.101
crane implement 1.680 0.785
brother monk 2.820 0.121
implement tool 2.950 0.771
bird crane 2.970 0.610
bird cock 3.050 0.507
food fruit 3.080 0.286
furnace stove 3.110 0.728
midday noon 3.420 0.026
magician wizard 3.500 0.622
asylum madhouse 3.610 0.149
coast shore 3.700 0.183
boy lad 3.760 0.090
journey voyage 3.840 0.438
gem jewel 3.840 0.155
automobile car 3.920 0.840
Score 1.000 0.163
Fig. 2 Historical search log for the terms Furnace and Stove. According to the Pearson coefficient, the similarity between these time series is
high, which suggests that the two words may represent quite similar objects
Table 4 Results for Pearson’s correlation using the Martinez &
Aldana benchmark dataset
Martinez-Aldana Pearson
peak oil apocalypse 0.056 0.100
bobo bohemian 0.185 0.147
windmills offshore 0.278 0.779
copyleft copyright 0.283 0.127
whalewatching birdwatching 0.310 0.090
tweet snippet 0.314 0.159
subprime risky business 0.336 0.000
imo in my opinion 0.376 0.831
buzzword neologism 0.383 0.459
quantitative easing money flood 0.410 0.165
glamping luxury camping 0.463 0.000
slumdog underprivileged 0.482 0.010
i18n internationalization 0.518 0.966
vuvuzela soccer horn 0.523 0.828
pda computer 0.526 0.900
sustainable renewable 0.536 0.640
sudoku number place 0.538 0.220
terabyte gigabyte 0.573 0.060
ceo chief executive officer 0.603 0.163
tanorexia tanning addiction 0.608 0.000
the big apple New York 0.641 0.200
asap as soon as possible 0.661 0.455
qwerty keyboard 0.676 0.124
thx thanks 0.784 0.272
vlog video blog 0.788 0.838
wifi wireless network 0.900 0.659
hi-tech high technology 0.903 0.867
app application 0.915 0.473
Score 1.000 0.106
3.3 Outlier coincidence on search patterns
There is no formal mathematical definition of what consti-
tutes an outlier. Grubbs said that “An outlying observation,
or outlier, is one that appears to deviate markedly from other
members of the sample in which it occurs” (Grubbs 1969).
So our proposal consists of looking for elements of a time
series that stand out from the rest of the series. Outliers can
have many causes, but we assume that, in this context, out-
liers in historical search patterns may occur due to historical
events, and that users search for information related to such
an event at the same time, but perhaps using different words,
from different parts of the world, on the same search engine.
Figure 3 shows us a screenshot from Google Trends
where the time series representing the terms gem and jewel
can be seen. There is a common outlying observation in the
year 2007. We do not know the reason, but this information
is not necessary for our purpose. We look for overlapping
outliers in order to determine the similarity between search
patterns.
Various indicators are used to identify outliers. We use
the proposal of Rousseeuw and Leroy, who affirm that an
outlier is an observation whose value is more than
2.5 standard deviations from the mean (Rousseeuw
and Leroy 2005).
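Under this 2.5-standard-deviation rule, the outlier-coincidence method can be sketched as below. The paper does not spell out how overlapping outliers are combined into a single score, so the Jaccard-style overlap used here is our assumption, and the spiky series is invented:

```python
from statistics import mean, stdev

def outlier_positions(series):
    """Indices whose value deviates more than 2.5 standard
    deviations from the mean (Rousseeuw and Leroy's rule)."""
    m, s = mean(series), stdev(series)
    return {i for i, v in enumerate(series) if abs(v - m) > 2.5 * s}

def outlier_coincidence(a, b):
    """Overlap of the two outlier sets (Jaccard index; our assumption)."""
    oa, ob = outlier_positions(a), outlier_positions(b)
    if not oa and not ob:
        return 0.0          # no outliers, so no coincidence to measure
    return len(oa & ob) / len(oa | ob)

flat = [1.0] * 20
spiky = [1.0] * 20
spiky[10] = 50.0            # a single search boom
print(outlier_positions(spiky))           # {10}
print(outlier_coincidence(spiky, spiky))  # 1.0
print(outlier_coincidence(spiky, flat))   # 0.0
```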
Table 5 shows us the results obtained by this method
using the Miller & Charles benchmark dataset. The obtained
correlation for this benchmark dataset is low, because only
terms which have suffered a search boom in their search
histories can be identified as similar.
Table 6 shows us the results obtained by this method
using the Martinez & Aldana benchmark dataset. The
obtained correlation for this benchmark dataset is low,
because only terms which present outliers can be compared;
thus, there can be no outlier coincidence if there are no
outliers in the historical search pattern.
So we have seen that the major problem of this technique
is that not all terms present outliers; there can be no outlier
coincidence if outliers do not exist. Therefore, our method
does not fit all situations well. However, the score
shows us that this kind of technique could be very use-
ful in situations where outliers exist, e.g. sustainable and
renewable, or i18n and internationalization.
3.4 Forecasting comparison
Our proposal concerning the forecasting comparison method
consists of comparing the predictions for the (sets of) terms
over the following months. There are a number of methods
for time series forecasting, but the problem is that people’s
behavior cannot be predicted, or at least can be influenced
by complex or random causes, making the predictions unre-
liable; i.e. it is possible to predict searches related to ice
cream every summer (there is a cause-effect relationship),
but it is not possible to predict searches related to cars,
because cars are a kind of non-stationary good. Anyway, we
wish to obtain a quantitative result for the quality of this
approach in order to compare it with the others, as we can
extract positive hints from it.
To do this, we propose training an artificial neural net-
work to predict the results for the user searches. We deter-
mine the similarity between two (sets of) term(s) on the
basis of the similarity between these predictions. We have
chosen forecasting based on neural networks. The reason
Fig. 3 Historical search log for the terms Gem and Jewel, which are considered synonyms in the Miller & Charles benchmark dataset. There
is a perfect coincidence in their respective outliers, which is represented in the interval from November 18, 2007 to December 2, 2007
Table 5 Obtained results from the outlier coincidence method using
the Miller & Charles benchmark dataset
Miller-Charles Outlier
rooster voyage 0.080 0.000
noon string 0.080 0.000
glass magician 0.110 0.000
cord smile 0.130 0.000
coast forest 0.420 0.000
lad wizard 0.420 0.000
monk slave 0.550 0.000
forest graveyard 0.840 0.000
coast hill 0.870 0.000
food rooster 0.890 0.000
monk oracle 1.100 0.000
car journey 1.160 0.000
brother lad 1.660 0.000
crane implement 1.680 0.307
brother monk 2.820 0.000
implement tool 2.950 0.037
bird crane 2.970 0.000
bird cock 3.050 0.000
food fruit 3.080 0.000
furnace stove 3.110 0.500
midday noon 3.420 0.000
magician wizard 3.500 0.000
asylum madhouse 3.610 0.000
coast shore 3.700 0.000
boy lad 3.760 0.000
journey voyage 3.840 0.889
gem jewel 3.840 1.000
automobile car 3.920 0.000
Score 1.000 0.372
Table 6 Obtained results from the outlier coincidence method using the
Martinez & Aldana benchmark dataset
Martinez-Aldana Outlier
peak oil apocalypse 0.056 0.000
bobo bohemian 0.185 0.000
windmills offshore 0.278 0.400
copyleft copyright 0.283 0.000
whalewatching birdwatching 0.310 0.000
tweet snippet 0.314 0.000
subprime risky business 0.336 0.000
imo in my opinion 0.376 0.000
buzzword neologism 0.383 0.000
quantitative easing money flood 0.410 0.000
glamping luxury camping 0.463 0.454
slumdog underprivileged 0.482 0.000
i18n internationalization 0.518 0.375
vuvuzela soccer horn 0.523 0.333
pda computer 0.526 0.000
sustainable renewable 0.536 0.800
sudoku number place 0.538 0.000
terabyte gigabyte 0.573 0.000
ceo chief executive officer 0.603 0.000
tanorexia tanning addiction 0.608 0.009
the big apple New York 0.641 0.000
asap as soon as possible 0.661 0.000
qwerty keyboard 0.676 0.000
thx thanks 0.784 0.000
vlog video blog 0.788 0.000
wifi wireless network 0.900 0.000
hi-tech high technology 0.903 0.308
app application 0.915 0.000
Score 1.000 0.007
is that artificial neural networks have been widely used as
time series forecasters for real situations (Patuwo and Hu
1998). In order to train our neural network we have chosen
the following parameters:
– Neuron counts: an input layer of 12, a hidden layer of 12, and an output layer of 1
– Learning: a learning rate of 0.05, a momentum of 0.5, and a maximum of 10,000 iterations
– Activation function: bipolar sigmoid
– Time period: 6 months; after testing different time periods, we concluded that this period is the best
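The configuration above can be sketched with a small hand-rolled network (a hypothetical illustration, not the authors' actual implementation; the toy seasonal series and random seed are invented). tanh plays the role of the bipolar sigmoid, and the momentum term accumulates past gradients in the standard way.

```python
import numpy as np

def make_windows(series, w=12):
    """Sliding windows: the 12 previous values are the input,
    the next value is the forecasting target."""
    X = np.array([series[i:i + w] for i in range(len(series) - w)])
    y = np.array(series[w:])
    return X, y

class TinyMLP:
    """A 12-12-1 network with tanh ('bipolar sigmoid') units, trained by
    full-batch gradient descent with momentum, as configured above."""
    def __init__(self, n_in=12, n_hid=12, lr=0.05, momentum=0.5, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (n_in, n_hid))
        self.W2 = rng.normal(0.0, 0.1, (n_hid, 1))
        self.v1 = np.zeros_like(self.W1)
        self.v2 = np.zeros_like(self.W2)
        self.lr, self.mom = lr, momentum

    def forward(self, X):
        self.h = np.tanh(X @ self.W1)          # hidden layer
        return np.tanh(self.h @ self.W2)       # output layer

    def fit(self, X, y, iters=10_000):
        y = y.reshape(-1, 1)
        for _ in range(iters):
            out = self.forward(X)
            d2 = (out - y) * (1.0 - out ** 2)            # output delta
            d1 = (d2 @ self.W2.T) * (1.0 - self.h ** 2)  # hidden delta
            self.v2 = self.mom * self.v2 - self.lr * (self.h.T @ d2) / len(X)
            self.v1 = self.mom * self.v1 - self.lr * (X.T @ d1) / len(X)
            self.W2 += self.v2
            self.W1 += self.v1
        return self

# A toy seasonal "search volume" series, scaled into (-1, 1) so the tanh
# output layer can reach the targets.
series = 0.8 * np.sin(2 * np.pi * np.arange(60) / 12)
X, y = make_windows(series)

mse_before = np.mean((TinyMLP().forward(X).ravel() - y) ** 2)  # untrained
net = TinyMLP().fit(X, y)
mse_after = np.mean((net.forward(X).ravel() - y) ** 2)
print(mse_after < mse_before)  # True: the network learns the pattern
```

With two such trained networks, one per term, the similarity would then be obtained by correlating their predictions.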
In order to compare the predictions we have chosen
Pearson's correlation coefficient, because in our previ-
ous experiments we have shown that it performs better than the
Table 7 Results from the search forecasting comparison using the Miller
& Charles benchmark dataset
Word pair Miller-Charles Forecast
rooster voyage 0.080 0.661
noon string 0.080 0.108
glass magician 0.110 0.235
cord smile 0.130 0.176
coast forest 0.420 0.703
lad wizard 0.420 0.647
monk slave 0.550 0.971
forest graveyard 0.840 0.355
coast hill 0.870 0.218
food rooster 0.890 0.770
monk oracle 1.100 0.877
car journey 1.160 0.478
brother lad 1.660 0.707
crane implement 1.680 0.083
brother monk 2.820 0.154
implement tool 2.950 0.797
bird crane 2.970 0.315
bird cock 3.050 0.893
food fruit 3.080 0.229
furnace stove 3.110 0.876
midday noon 3.420 0.932
magician wizard 3.500 0.595
asylum madhouse 3.610 0.140
coast shore 3.700 0.631
boy lad 3.760 0.369
journey voyage 3.840 0.690
gem jewel 3.840 0.940
automobile car 3.920 0.689
Score 1.000 0.193
Table 8 Results from the search forecasting comparison using the
Martinez & Aldana benchmark dataset
Word pair Martinez-Aldana Forecast
peak oil apocalypse 0.056 0.359
bobo bohemian 0.185 0.671
windmills offshore 0.278 0.731
copyleft copyright 0.283 0.352
whalewatching birdwatching 0.310 0.626
tweet snippet 0.314 0.010
subprime risky business 0.336 0.011
imo in my opinion 0.376 0.136
buzzword neologism 0.383 0.924
quantitative easing money flood 0.410 0.548
glamping luxury camping 0.463 0.166
slumdog underprivileged 0.482 0.701
i18n internationalization 0.518 0.401
vuvuzela soccer horn 0.523 0.374
pda computer 0.526 0.964
sustainable renewable 0.536 0.869
sudoku number place 0.538 0.137
terabyte gigabyte 0.573 0.896
ceo chief executive officer 0.603 0.396
tanorexia tanning addiction 0.608 0.267
the big apple New York 0.641 0.830
asap as soon as possible 0.661 0.711
qwerty keyboard 0.676 0.879
thx thanks 0.784 0.760
vlog video blog 0.788 0.752
wifi wireless network 0.900 0.204
hi-tech high technology 0.903 0.117
app application 0.915 0.322
Score 1.000 0.027
Spearman coefficient. Table 7 shows the results obtained
for the Miller & Charles benchmark dataset. Table 8 shows
the results obtained for the Martinez & Aldana benchmark
dataset. The final score obtained is not particularly good,
due to the negative correlations for some term pairs. Making
forecasting comparisons does not seem to be very effective
for determining semantic similarity on this benchmark
dataset.
4 Evaluation
In order to evaluate the considered approaches, we adopt
the Pearson correlation coefficient (Egghe and Leydesdorff
2009) as a measure of the strength of the relation between
Table 9 Results for the statistical study concerning the general purpose
benchmark dataset
Algorithm Score
Resnik 0.814
Leacock 0.782
Path length 0.749
Vector pairs 0.597
Outlier 0.372
Co-occur. 0.364
Lesk 0.348
Prediction 0.193
Pearson 0.163
human judgment of similarity and values from computa-
tional approaches. However, Pirro stated that to have a
deeper interpretation of the results it is also necessary to
evaluate the significance of this relation (Pirro 2009). To
do this, we use the p-value technique, which shows how
unlikely it is that a given correlation coefficient, r, would
occur given no relation in the population (Pirro 2009). Note
that the smaller the p-value, the more significant the relation.
Moreover, the larger the correlation value the stronger the
relation. The p-value for Pearson’s correlation coefficient is
based on the test statistic defined as follows:

s = \frac{r \cdot \sqrt{n-2}}{\sqrt{1-r^{2}}} \qquad (4)
where r is the correlation coefficient and n is the number
of pairs of data. When the p-value is less than 0.05, then
we can say the obtained value is statistically significant. We
have obtained that, for our benchmark datasets, all values
above 0.25 are statistically significant.
Before explaining the obtained results it is necessary
to state that all results have been obtained from data col-
lected before the 22nd May 2011. Results from third party
approaches have been obtained using the tool offered by
Pedersen.2
Table 9 shows the results for the general purpose benchmark
dataset, i.e. Miller & Charles. Existing techniques are
better than most of our approaches. However, Outlier and
Co-occurrence techniques present a moderate accuracy. The
rest of the approaches do not seem to be as good as most of
the techniques based on synonym dictionaries when iden-
tifying the semantic similarity for well known terms. The
reason is that the knowledge represented in a dictionary is
considered to be highly reliable, and therefore, it is not
possible for artificial techniques to surpass it.
2http://marimba.d.umn.edu/
Table 10 Results for the statistical study concerning the benchmark
dataset containing terms that are not frequently included in
dictionaries
Algorithm Score
Co-occur. 0.523
Vector pairs 0.207
Pearson 0.106
Lesk 0.079
Path length 0.061
Prediction 0.027
Outlier 0.007
Leacock 0.005
Resnik 0.016
Table 10 shows the results for the specific purpose benchmark
dataset, i.e. Martinez & Aldana. Our approaches
present, in general, better quality than those currently
in existence. This is the case for the co-occurrence
technique, which significantly beats all the others. Moreover, we
have experimentally confirmed our hypothesis that using
historical search patterns can be more beneficial when the
terms to be analyzed are not covered by dictionaries.
5 Discussion
Search trends in users' web search data have traditionally
been shown to be very useful for modeling real-world
phenomena. Now, we have proposed another way to
reuse these search patterns: comparing them in order to
determine the semantic similarity between their associated
terms. Despite the results that we have obtained, there are
several problems related to the use of historical search
patterns for determining the semantic similarity between
text expressions:
– Terms typed by users can have multiple meanings depending on their context
– Users employ multiple terms to search for both singulars and plurals
– Many of these results rely on the careful choice of queries that prior knowledge suggests should correspond with the phenomenon
On the other hand, our proposal has a number of additional
advantages with respect to other existing techniques.
It is not time consuming, since it does not require parsing
large corpora. We have shown that it correlates well with
respect to human judgment (even better than some other
preexisting measures). Moreover, our work could be
considered seminal for new research lines:
– The time series representing the historical search pattern for a given term could be used as a kind of semantic fingerprint, that is, a piece of data which identifies a term on the Web. If two semantic fingerprints are similar, it could be supposed that the terms represent similar real-world entities.
– The results of this work are also applicable to studying the stability of ontology mappings. This means that it is possible to establish semantic correspondences between any kind of ontologies according to time constraints.
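The semantic fingerprint idea could be sketched, for example, with a cosine comparison of two search patterns treated as vectors (one of several possible vector comparisons; the series values below are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two search patterns treated as vectors:
    values near 1 indicate similarly shaped 'semantic fingerprints'."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

gem   = [10, 11, 9, 80, 10, 9]    # invented pattern with one spike
jewel = [20, 22, 18, 95, 21, 19]  # scaled, but similarly shaped
car   = [50, 52, 49, 51, 50, 48]  # flat, unrelated pattern
print(cosine_similarity(gem, jewel) > cosine_similarity(gem, car))  # → True
```

Since cosine is scale-invariant, two terms with proportional search volumes but different absolute popularity would still obtain a high fingerprint match.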
6 Conclusions
In this paper, we have proposed a novel idea for determining
the semantic similarity between (sets of) terms which con-
sists of using the knowledge inherent in the historical search
logs from the Google search engine.
To validate our hypothesis, we have designed and eval-
uated four algorithmic methods for measuring the seman-
tic similarity between terms using their associated history
search patterns. These algorithmic methods are: a) frequent
co-occurrence of terms in search patterns, b) computation
of the relationship between search patterns, c) outlier coin-
cidence in search patterns, and d) forecasting comparisons.
We have shown experimentally that the method which
studies the co-occurrence of terms in the search patterns
correlates well with respect to human judgment when evalu-
ating general purpose benchmark datasets, and significantly
outperforms existing methods when evaluating datasets con-
taining terms that do not usually appear in dictionaries.
Moreover, we have found that the other three additional
methods seem to be better than most of the existing ones
when dealing with this special kind of emerging term.
As future work, we want to keep working towards apply-
ing new time series comparison algorithms so that we can
determine which are the most appropriate approaches for
solving this problem and implement them in real informa-
tion systems where the automatic measurement of semantic
similarity between terms (or short text expressions) may
be necessary. Moreover, we want to analyze the possibility
of smartly combining our algorithmic methods in order to
determine whether or not two terms are semantically similar.
Acknowledgements We would like to thank the reviewers for
their time and consideration. We thank Lisa Huckfield for proofreading
this manuscript. This work has been funded by the Spanish Ministry of
Innovation and Science through: REALIDAD: Efficient Analysis, Man-
agement and Exploitation of Linked Data., Project Code: TIN2011-
25840 and by the Department of Innovation, Enterprise and Science
from the Regional Government of Andalucia through: Towards a plat-
form for exploiting and analyzing biological linked data, Project Code:
P11-TIC-7529.
References
Aitken, A. (2007). Statistical mathematics. Oliver & Boyd.
Badea, B., & Vlad, A. (2006). Revealing Statistical Independence of
Two Experimental Data Sets: An Improvement on Spearman’s
Algorithm. In ICCSA (pp. 1166–1176).
Banek, M., Vrdoljak, B., Min Tjoa, A., Skocir, Z. (2007). Automating
the Schema Matching Process for Heterogeneous Data Ware-
houses. In DaWaK (pp. 45–54).
Banek, M., Vrdoljak, B., Tjoa, A.M. (2007). Using Ontologies
for Measuring Semantic Similarity in Data Warehouse Schema
Matching Process. In CONTEL (pp. 227–234).
Banerjee, S., & Pedersen, T. (2003). Extended Gloss Overlaps as a
Measure of Semantic Relatedness. In IJCAI (pp. 805–810).
Bollegala, D., Matsuo, Y., Ishizuka, M. (2007). Measuring seman-
tic similarity between words using web search engines. In WWW
(pp. 757–766).
Bollegala, D., Honma, T., Matsuo, Y., Ishizuka, M. (2008). Mining for
personal name aliases on the web. In WWW (pp. 1107–1108).
Brin, S., & Page, L. (1998). The Anatomy of a Large-Scale Hyper-
textual Web Search Engine. Computer Networks,30(1–7), 107–
117.
Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based Mea-
sures of Lexical Semantic Relatedness. Computational Linguis-
tics,32(1), 13–47.
Choi, H., & Varian, H. (2009). Predicting the present with Google
Trends. Technical Report, Economics Research Group, Google.
Cilibrasi, R., & Vitányi, P.M. (2007). The Google Similarity Distance.
IEEE Transactions on Knowledge and Data Engineering,19(3),
370–383.
Dhurandhar, A. (2011). Improving predictions using aggregate infor-
mation. In KDD (pp. 1118–1126).
Egghe, L., & Leydesdorff, L. (2009). The relation between
Pearson's correlation coefficient r and Salton's cosine measure.
CoRR abs/0911.1318.
Fong, J., Shiu, H., Cheung, D. (2009). A relational-XML data ware-
house for data aggregation with SQL and XQuery. Software,
Practice and Experience,38(11), 1183–1213.
Grubbs, F. (1969). Procedures for Detecting Outlying Observations in
Samples. Technometrics,11(1), 1–21.
Hliaoutakis, A., Varelas, G., Petrakis, E.G.M., Milios, E. (2006). Med-
Search: A Retrieval System for Medical Information Based on
Semantic Similarity. In ECDL (pp. 512–515).
Hu, N., Bose, I., Koh, N.S., Liu, L. (2012). Manipulation of online
reviews: An analysis of ratings, readability, and sentiments. Deci-
sion Support Systems (DSS),52(3), 674–684.
Hjorland, H. (2007). Semantics and knowledge organization. ARIST,
41(1), 367–405.
Jung, J.J., & Thanh Nguyen, N. (2008). Collective Intelligence for
Semantic and Knowledge Grid. Journal of Universal Computer
Science (JUCS),14(7), 1016–1019.
Kopcke, H., Thor, A., Rahm, E. (2010). Evaluation of entity resolution
approaches on real-world match problems. PVLDB,3(1), 484–493.
Leacock, C., Chodorow, M., Miller, G.A. (1998). Using Corpus
Statistics and WordNet Relations for Sense Identification. Compu-
tational Linguistics,24(1), 147–165.
Lesk, M. (1986). Information in Data: Using the Oxford English
Dictionary on a Computer. SIGIR Forum,20(1–4), 18–21.
Li, J., Alan Wang, G., Chen, H. (2011). Identity matching using per-
sonal and social identity features. Information Systems Frontiers,
13(1), 101–113.
Li, Y., Bandar, A., McLean, D. (2003). An approach for Measuring
Semantic Similarity between Words Using Multiple Information
Sources. IEEE Transactions on Knowledge and Data Engineering,
15(4), 871–882.
Liu, B., & Zhang, L. (2012). A Survey of Opinion Mining and
Sentiment Analysis. In Mining Text Data (pp. 415–463).
Miller, G., & Charles, W. (1991). Contextual Correlates of Semantic
Similarity. Language and Cognitive Processes,6(1), 1–28.
Nandi, A., & Bernstein, P.A. (2009). HAMSTER: Using Search Click-
logs for Schema and Taxonomy Matching. PVLDB,2(1), 181–192.
Patuwo, B.E., & Hu, M. (1998). Forecasting with artificial neural net-
works: The state of the art. International Journal of Forecasting,
14(1), 35–62.
Patwardhan, S., Banerjee, S., Pedersen, T. (2003). Using Measures
of Semantic Relatedness for Word Sense Disambiguation. In
CICLing (pp. 241–257).
Pedersen, T., Patwardhan, S., Michelizzi, J. (2004). Word-
Net::Similarity - Measuring the Relatedness of Concepts. In AAAI
(pp. 1024–1025).
Petrakis, E.G.M., Varelas, G., Hliaoutakis, A., Raftopoulou, P. (2006).
X-Similarity: Computing Semantic Similarity between Concepts
from Different Ontologies. JDIM,4(4), 233–237.
Pirro, G. (2009). A semantic similarity metric combining features and
intrinsic information content. Data and Knowledge Engineering,
68(11), 1289–1308.
Resnik, P. (1995). Using Information Content to Evaluate Semantic
Similarity in a Taxonomy. In IJCAI (pp. 448–453).
Retzer, S., Yoong, P., Hooper, V. (2012). Inter-organisational knowl-
edge transfer in social networks: A definition of intermediate ties.
Information Systems Frontiers,14(2), 343–361.
Rousseeuw, P.J., & Leroy, A.M. (2005). Robust Regression and Outlier
Detection. John Wiley & Sons Inc.
Sanchez, D., Batet, M., Valls, A. (2010). Web-Based Semantic Sim-
ilarity: An Evaluation in the Biomedical Domain. International
Journal of Software and Informatics,4(1), 39–52.
Sanchez, D., Batet, M., Valls, A., Gibert, K. (2010). Ontology-driven
web-based semantic similarity. Journal of Intelligent Information
Systems,35(3), 383–413.
Scarlat, E., & Maries, I. (2009). Towards an Increase of Collec-
tive Intelligence within Organizations Using Trust and Reputation
Models. In ICCCI (pp. 140–151).
Sparck Jones, K. (2006). Collective Intelligence: It’s All in the Num-
bers. IEEE Intelligent Systems (EXPERT),21(3), 64–65.
Tuan Duc, N., Bollegala, D., Ishizuka, M. (2010). Using Relational
Similarity between Word Pairs for Latent Relational Search on the
Web . In Web Intelligence (pp. 196–199).
Dr. Jorge Martinez-Gil is a senior researcher in the Department
of Computer Languages and Computing Sciences at the University
of Malaga (Spain). His main research interests are related with the
interoperability in the World Wide Web. In fact, his PhD thesis has
addressed the ontology meta-matching and reverse ontology matching
problems. Dr. Martinez-Gil has published several papers in prestigious
journals like SIGMOD Record, Knowledge and Information Systems,
Knowledge Engineering Review, Online Information Review, Journal
of Computer Science & Technology, Journal of Universal Computer
Science, and so on. Moreover, he is a reviewer for conferences and
journals related to the Data and Knowledge Engineering field.
Prof. Dr. José F. Aldana-Montes is currently a full professor in the
Department of Languages of Computing Sciences at the Higher Tech-
nical School of Computer Science Engineering from the University of
Malaga (Spain) and Head of Khaos Research, a group for research-
ing about semantic aspects of databases. Prof. Aldana-Montes has
more than 20 years of experience in research about several aspects
of databases, semistructured data and semantic technologies and its
application to such fields as bioinformatics or tourism. He is author
of several relevant papers in top journals and conferences. Related
to teaching, he has been teaching theoretical and practical aspects of
databases, data integration and semantic web at all possible university
levels.
... The measurements of IC consist of ontological knowledge, which is a drawback because they depend entirely on the coverage and details of the input ontology [3]. With the appearance of social networks [66,67], diverse concepts or terms such as proper names, brands, acronyms, and new words are not contained in application and domain ontologies. Thus, we cannot compute the information content supported by the knowledge resource with this information source. ...
Preprint
Full-text available
This research uses the computing of conceptual distance to measure information content in Wikipedia categories. The proposed metric, generality, relates information content to conceptual distance by determining the ratio of information a concept provides to others compared to the information it receives. The DIS-C algorithm calculates generality values for each concept, considering each relationship's conceptual distance and distance weight. The findings of this study were compared to current methods in the field and found to be comparable to results obtained using the WordNet corpus. This method offers a new approach to measuring information content applied to any relationship or topology in conceptualization.
... Research along these lines has grown exponentially in the last few years. The truth is that¯nding reasonable solutions can bene¯t a wide range of¯elds and domains [9]. However, some researchers have decided not to propose more and more measures of semantic similarity but to apply criteria of rationality that allow them to take advantage of all the work done so far, just in the way that AutoML operates [10]. ...
Article
The challenge of assessing semantic similarity between pieces of text through computers has attracted considerable attention from industry and academia. New advances in neural computation have developed very sophisticated concepts, establishing a new state of the art in this respect. In this paper, we go one step further by proposing new techniques built on the existing methods. To do so, we bring to the table the stacking concept that has given such good results and propose a new architecture for ensemble learning based on genetic programming. As there are several possible variants, we compare them all and try to establish which one is the most appropriate to achieve successful results in this context. Analysis of the experiments indicates that Cartesian Genetic Programming seems to give better average results.
... On the other hand, semantic similarity identifies the concepts having common characteristics. The computational methods of semantic similarity developed so far use different information, knowledge resources and approaches, e.g., historical google search patterns [13], feature-based approaches using Wikipedia [23], Wikipedia-based information [24], multiple information sources [27], contextual correlation [31,37] and many of them have proven to be useful in some specific applications of computational intelligence. The task of SC is to extract information from databases in the sense of clarification of words and to explain the meaning of various constituents of sentences (words or phrases) or sentences themselves in a natural language via semantic relatedness or semantic similarity among constituents. ...
Article
Full-text available
The classical automata, fuzzy finite automata, and rough finite state automata are some formal models of computing used to perform the task of computation and are considered to be the input device. These computational models are valid only for fixed input alphabets for which they are defined and, therefore, are less user-friendly and have limited applications. The semantic computing techniques provide a way to redefine them to improve their scope and applicability. In this paper, the concept of semantically equivalent concepts and semantically related concepts in information about real-world applications datasets are used to introduce and study two new formal models of computations with semantic computing (SC), namely, a rough finite-state automaton for SC and a fuzzy finite rough automaton for SC as extensions of rough finite-state automaton and fuzzy finite-state automaton, respectively, in two different ways. The traditional rough finite-state automata can not deal with situations when external alphabet or semantically equivalent concepts are given as inputs. The proposed rough finite-state automaton for SC can handle such situations and accept such inputs and is shown to have successful real-world applications. Similarly, a fuzzy finite rough automaton corresponding to a fuzzy automaton is also failed to process input alphabet different from their input alphabet, the proposed fuzzy finite rough automaton for SC corresponding to a given fuzzy finite automaton is capable of processing semantically related input, and external input alphabet information from the dataset obtained by real-world applications and provide better user experience and applicability as compared to classical fuzzy finite rough automaton.
... SMART is semi-automatic because it requires the intervention of an expert for the validation of results. In [44], the authors propose an approach to determine the semantic similarity of terms using the knowledge present in the search history logs from Google. For this purpose, they exploit four techniques that evaluate: (i) frequent co-occurrences of terms in search patterns; (ii) relationships between search patterns; (iii) outlier coincidence on search patterns; (iv) forecasting comparisons. ...
Preprint
Full-text available
The knowledge of interschema properties (e.g., synonymies, homonymies, hyponymies, sub-schema similarities) plays a key role for allowing decision making in sources characterized by disparate formats. In the past, a wide amount and variety of approaches to derive interschema properties from structured and semi-structured data have been proposed. However, currently, it is esteemed that more than 80% of data sources are unstructured. Furthermore, the number of sources generally involved in an interaction is much higher than in the past. As a consequence, the necessity arises of new approaches to address the interschema property derivation issue in this new scenario. In this paper, we aim at providing a contribution in this setting by proposing an approach capable of uniformly extracting interschema properties from a huge number of structured, semi-structured and unstructured sources.
... According to the research done by the relevant researchers (Brenneke et al. 2011;Ferragina and Guli 2008;Martinez-Gil and Aldana-Montes 2013;Pushpa et al. 2011;Saxena et al. 2016;Verma et al. 2014), interactive search can be divided into two methods based on whether or not the user Table 1 shows the comparison of these two ways. ...
Article
Full-text available
In this paper, we develop an interactive hierarchical topic search system. In our system, the generation of topic names is mainly based on the N-gram statistical language model. The construction of hierarchical tree relationships between topics is mainly based on the concept of mathematical sets. In this study, the concept of mathematical sets not only helps the system to construct a topic hierarchy tree quickly, but also allows different users to use different binary operations to generate different interactive search results. In general, this study has the following three advantages. First, the generated topic names are presented in a hierarchical form rather than a flat form. Secondly, the interactive search for this study was achieved by non-stored user search and click history. Therefore, our approach can avoid personal privacy and large storage space issues. Finally, the concept of mathematical sets not only allows us to generate topic trees in linear time, but also allows users to run all possible binary operations to meet various interactive search needs.
... SMART is semi-automatic because it requires the intervention of an expert for the validation of results. In [37], the authors propose an approach to determine the semantic similarity of terms using the knowledge present in the search history logs from Google. For this purpose, they exploit four techniques that evaluate: (i) frequent co-occurrences of terms in search patterns; (ii) relationships between search patterns; (iii) outlier coincidence on search patterns; (iv) forecasting comparisons. ...
Article
The knowledge of interschema properties (e.g., synonymies, homonymies, hyponymies and subschema similarities) plays a key role for allowing decision-making in sources characterized by disparate formats. In the past, wide amount and variety of approaches to derive interschema properties from structured and semi-structured data have been proposed. However, currently, it is esteemed that more than 80% of data sources are unstructured. Furthermore, the number of sources generally involved in an interaction is much higher than in the past. As a consequence, the necessity arises of new approaches to address the interschema property derivation issue in this new scenario. In this paper, we aim at providing a contribution in this setting by proposing an approach capable of uniformly extracting interschema properties from a huge number of structured, semi-structured and unstructured sources.
... On the other hand, the methods of semantic search using WordNet are limited in scope and scalability. Especially, with the emergence of social networks or instant messaging systems [38,51], a lot of (sets of) concepts or terms (proper nouns, brands, acronyms, new words, conversational words, technical terms and so on) are not included in domain ontologies and WordNet, therefore, semantic search that is based on these kinds of knowledge sources (i.e., domain ontologies or WordNet) cannot be used in these tasks. For another example, the methods based on similarity reasoning [18,53] also exploit ontologies or WordNet to compute similarity (or relatedness) between words. ...
Article
Full-text available
Classical or traditional Information Retrieval (IR) approaches rely on the word-based representations of query and documents in the collection. The specification of the user information need is completely based on words figuring in the original query in order to retrieve documents containing those words. Such approaches have been limited due to the absence of relevant keywords as well as the term variation in documents and user’s query. The purpose of this paper is to present a new method to Semantic Information Retrieval (SIR) to solve the limitations of existing approaches. Concretely, we propose a novel method SIRWWO (Semantic Information Retrieval using Wikipedia, WordNet, and domain Ontologies) for SIR by combining multiple knowledge sources Wikipedia, WordNet, and Description Logic (DL) ontologies. In order to illustrate the approach SIRWWO, we first present the notion of Labeled Dynamic Semantic Network (LDSN) by extending the notions of dynamic semantic network and extended semantic net based on WordNet (and DAML ontology library). According to the notion of LDSN, we obtain the notion of Weighted Dynamic Semantic Network (WDSN, intuitively, each edge in WDSN is assigned to a number in the [0, 1] interval) and give the WDSN construction method using Wikipedia, WordNet, and DL ontology. We then propose a novel metric to measure the semantic relatedness between concepts based on WDSN. Lastly, we investigate the approach SIRWWO by using semantic relatedness between users’ query keywords and digital documents. The experimental results show that our proposals obtain comparable and better performance results than other traditional IR system Lucene.
Article
Full-text available
Academic virtual community provides an environment for users to exchange knowledge, so it gathers a large amount of knowledge resources and presents a trend of rapid and disorderly growth. We learn how to organize the scattered and disordered knowledge of network community effectively and provide personalized service for users. We focus on analyzing the knowledge association among titles in an all-round way based on deep learning, so as to realize effective knowledge aggregation in academic virtual community. We take ResearchGate (RG) “online community” resources as an example and use Word2Vec model to realize deep knowledge aggregation. Then, principal component analysis (PCA) is used to verify its scientificity, and Wide & Deep learning model is used to verify its running effect. The empirical results show that the knowledge aggregation system of “online community” works well and has scientific rationality.
Chapter
This article describes how the traditional web search is essentially based on a combination of textual keyword searches with an importance ranking of the documents depending on the link structure of the web. However, one of the dimensions that has not been captured to its full extent is that of semantics. Currently, combining search and semantics gives birth to the idea of the semantic search. The purpose of this article is to present some new methods to semantic search to solve some shortcomings of existing approaches. Concretely, the authors propose two novel methods to semantic search by combining formal concept analysis, rough set theory, and similarity reasoning. In particular, the authors use Wikipedia to compute the similarity of concepts (i.e., keywords). The experimental results show that the authors' proposals perform better than some of the most representative similarity search methods and sustain the intuitions with respect to human judgements.
Article
Full-text available
Semantic relatedness between words is a core concept in natural language processing. While countless approaches have been proposed, measuring which one works best is still a challenging task. Thus, in this article, we give a comprehensive overview of the evaluation protocols and datasets for semantic relatedness covering both intrinsic and extrinsic approaches. One the intrinsic side, we give an overview of evaluation datasets covering more than 100 datasets in 20 different languages from a wide range of domains. To provide researchers with better guidance for selecting suitable dataset or even building new and better ones, we describe also the construction and annotation process of the datasets. We also shortly describe the evaluation metrics most frequently used for intrinsic evaluation. As for the extrinsic side, several applications involving semantic relatedness measures are detailed through recent research works and by explaining the benefit brought by the measures.
Article
Full-text available
The relation between Pearson’s correlation coefficient and Salton’s cosine measure is revealed based on the different possible values of the division of the ℓ1-norm by the ℓ2-norm of a vector. These different values yield a sheaf of increasingly straight lines which together form a cloud of points, which constitutes the investigated relation. These theoretical results are tested against the author co-citation relations among 24 informetricians, for whom two matrices can be constructed based on co-citations: the asymmetric occurrence matrix and the symmetric co-citation matrix. Both examples completely confirm the theoretical results. The results enable us to specify an algorithm which provides a threshold value for the cosine above which none of the corresponding Pearson correlations would be negative. Using this threshold value can be expected to optimize the visualization.
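The connection between the two measures can be illustrated numerically: Pearson's r is the cosine of the mean-centered vectors, so the two measures coincide once the vectors are centered. A minimal pure-Python sketch follows; the count vectors are invented for illustration, not taken from the 24-author co-citation dataset.

```python
from math import sqrt

def cosine(u, v):
    """Salton's cosine measure between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def pearson(u, v):
    """Pearson's r, computed as the cosine of the mean-centered vectors."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    return cosine([a - mu for a in u], [b - mv for b in v])

# Hypothetical non-negative co-citation counts for two authors.
u = [5, 3, 0, 1]
v = [4, 2, 1, 1]
```

For non-negative count data like this, centering shifts the vectors and r comes out below the raw cosine, which is why a cosine threshold can guarantee non-negative correlations.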
Book
Full-text available
This is a book, not a paper.
Article
Full-text available
In the past few years, web documents have received great attention as a new source of individual opinions and experiences. This situation has produced increasing interest in methods for automatically extracting and analyzing individual opinions from web documents such as customer reviews, weblogs, and comments on news. This interest is due to the easy accessibility of documents on the web, as well as the fact that these documents are already machine-readable. At the same time, the development of practical Machine Learning methods in Natural Language Processing (NLP) and Information Retrieval has increased considerably, making wide use of these available corpora. Recently, many researchers have focused on this area, trying to extract opinion information and analyze it automatically with computers. This new research domain is usually called Opinion Mining and Sentiment Analysis. Until now, researchers have developed several techniques for solving the problem. This paper tries to cover some of the techniques and approaches used in this area.
Article
Sentiment analysis or opinion mining is the computational study of people's opinions, appraisals, attitudes, and emotions toward entities, individuals, issues, events, topics and their attributes. The task is technically challenging and practically very useful. For example, businesses always want to find public or consumer opinions about their products and services. Potential customers also want to know the opinions of existing users before they use a service or purchase a product. With the explosive growth of social media (i.e., reviews, forum discussions, blogs and social networks) on the Web, individuals and organizations are increasingly using public opinions in these media for their decision making. However, finding and monitoring opinion sites on the Web and distilling the information contained in them remains a formidable task because of the proliferation of diverse sites. Each site typically contains a huge volume of opinionated text that is not always easily deciphered in long forum postings and blogs. The average human reader will have difficulty identifying relevant sites and accurately summarizing the information and opinions contained in them. Moreover, it is also known that human analysis of text information is subject to considerable biases, e.g., people often pay greater attention to opinions that are consistent with their own preferences. People also have difficulty, owing to their mental and physical limitations, producing consistent results when the amount of information to be processed is large. Automated opinion mining and summarization systems are thus needed, as subjective biases and mental limitations can be overcome with an objective sentiment analysis system. In the past decade, a considerable amount of research has been done in academia [58,76]. There are also numerous commercial companies that provide opinion mining services. In this chapter, we first define the opinion mining problem. 
From the definition, we will see the key technical issues that need to be addressed. We then describe various key mining tasks that have been studied in the research literature and their representative techniques. After that, we discuss the issue of detecting opinion spam or fake reviews. Finally, we also introduce the research topic of assessing the utility or quality of online reviews. © 2012 Springer Science+Business Media, LLC. All rights reserved.
Article
Procedures are given in the report for determining statistically whether the highest observation, or the lowest observation, or the highest and lowest observations, or the two highest observations, or the two lowest observations, or perhaps more of the observations in the sample may be considered to be outlying observations or discrepant values. Statistical tests of significance are useful in this connection either in the absence of assignable physical causes or to support a practical judgement that some of the experimental observations are aberrant. Both the statistical formulae and illustrative applications of the procedures to practical examples are given, thus representing a rather complete treatment of significance tests for outliers in single univariate samples.
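The single-outlier procedure the report describes can be sketched with its test statistic: the largest absolute deviation from the sample mean, divided by the sample standard deviation. A minimal pure-Python sketch follows; the sample data are invented, and the resulting statistic must still be compared against a tabulated critical value for the given sample size and significance level, which is not reproduced here.

```python
from math import sqrt

def outlier_statistic(sample):
    """Return (G, suspect) for a single suspected outlying observation,
    where G = max_i |x_i - mean| / s and s is the sample standard
    deviation (n - 1 denominator). Compare G against a tabulated
    critical value to decide whether the suspect is discrepant."""
    n = len(sample)
    mean = sum(sample) / n
    s = sqrt(sum((x - mean) ** 2 for x in sample) / (n - 1))
    suspect = max(sample, key=lambda x: abs(x - mean))
    return abs(suspect - mean) / s, suspect
```

Note that the suspect value itself inflates both the mean and s, which is exactly why the decision requires the dedicated critical-value tables rather than a naive z-score cutoff.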
Article
The relationship between semantic and contextual similarity is investigated for pairs of nouns that vary from high to low semantic similarity. Semantic similarity is estimated by subjective ratings; contextual similarity is estimated by the method of sorting sentential contexts. The results show an inverse linear relationship between similarity of meaning and the discriminability of contexts. This relation is obtained for two separate corpora of sentence contexts. It is concluded that, on average, for words in the same language drawn from the same syntactic and semantic categories, the more often two words can be substituted into the same contexts, the more similar in meaning they are judged to be.
Article
Notice of Violation of IEEE Publication Principles "The Anatomy of a Large-Scale Hyper Textual Web Search Engine" by Umesh Sehgal, Kuljeet Kaur, Pawan Kumar in the Proceedings of the Second International Conference on Computer and Electrical Engineering, 2009. ICCEE '09, December 2009, pp. 491-495 After careful and considered review of the content and authorship of this paper by a duly constituted expert committee, this paper has been found to be in violation of IEEE's Publication Principles. This paper contains significant portions of original text from the paper cited below. The original text was copied with insufficient attribution (including appropriate references to the original author(s) and/or paper title) and without permission. Due to the nature of this violation, reasonable effort should be made to remove all past references to this paper, and future references should be made to the following article: "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Lawrence Page Computer Networks and ISDN Systems, Volume 30, Issue 1-7, Elsevier, April 1998, pp. 107-117