A Study on Web Resources’ Navigation for e-Learning:
Usage of Fourier Domain Scoring on Web Pages Ranking Method
Diana Purwitasari, Yasuhisa Okazaki, Kenzi Watanabe
Saga University, Saga, Japan
diana@ai.is.saga-u.ac.jp
Abstract
Using existing web resources for e-Learning is a promising idea, especially for reducing the cost of authoring. Being open-source, completely free, and frequently updated, Wikipedia is a good candidate. Even though Wikipedia is structured by categories, those categories are not always updated dynamically when modifications occur. As a web resource for e-Learning, Wikipedia therefore needs a navigation path that maps the learning material semantically rather than merely following its category structure. The desired learning material can be provided in response to a search request. In this paper we introduce the use of Fourier Domain Scoring (FDS) as a ranking method for searching a collection of Wikipedia web pages. Unlike methods that recognize only the number of occurrences of query terms, FDS also recognizes how the query terms are spread throughout the content of web pages. Based on the experiments, we conclude that the non-relevant results retrieved are mainly caused by a characteristic of Wikipedia: because any part of a Wikipedia page can be changed by anyone, it is possible that only some parts of the retrieved web pages are strongly related to the query terms.
1. Introduction
Using existing web resources for e-Learning is a promising idea, especially for reducing the cost of authoring. Being open-source, completely free, and frequently updated, Wikipedia is a good candidate. In addition, Wikipedia is one of the 100 most popular websites worldwide 1. Although not all of its resources are extensive, informative, and accurate, Wikipedia is still an ideal place to start and to get a global picture of the desired material.
1 See http://www.alexa.com/site/ds/top_sites?ts_mode=global&lang=none, retrieved March 14, 2007.
Wikipedia is structured by categories [1], even though some of them are not well ordered or are not updated dynamically every time a modification happens. Anyone who is willing to register a name can instantly make modifications that alter the structure of categories in Wikipedia. As a web resource for e-Learning, Wikipedia therefore needs a navigation path that semantically maps the available learning material onto the desired learning material. Because the content is frequently updated, this navigation path should be generated adaptively from the contents of the material and not merely from its structure.
The desired learning material can be provided in response to a search request. One of the challenges in the search process is interpreting the results; in other words, handling a large set of answer web pages, or ranking web pages so as to present the ones that are really of interest to the learner [2]. Before creating an adaptively generated navigation path through Wikipedia, we introduce in this paper the use of Fourier Domain Scoring (FDS) [3] as a ranking method for searching a collection of Wikipedia web pages.
Based on the experiments, we conclude that the non-relevant results retrieved are mainly caused by a characteristic of Wikipedia. Any part of a Wikipedia web page can be changed by anyone, and sometimes those changes are not sufficiently relevant to the main topic of the page. We therefore conclude that it is possible for some parts of the retrieved web pages to be strongly related to the query terms while the remaining parts are not.
2. Fourier Domain Scoring for Web Pages Ranking
2.1. Introduction of Fourier Domain Scoring
Most search systems use variations of the Boolean or vector space model for ranking. The concept behind the vector space model is to convert each web page and the query into vectors so that they can be compared easily [2]. The problem with vector space techniques is that any spatial information contained in the web pages is lost: once a web page is converted into a vector, the number of times each term appears is represented in the vector, but the positions of the terms are ignored. Rather than storing only the frequency of terms, FDS stores term signals that show how the terms are spread throughout a web page. By comparing the magnitude and phase values of the query term signals found in web pages, FDS can identify web pages in which the query terms occur often and appear together [3].
In FDS, the occurrence counts and positions of terms are captured by the magnitude and phase precision values of term signals. A relevant web page should have large magnitudes, and the phases of the terms matching the query terms should be similar. Each web page has one or more term signals, where each signal carries magnitude and phase precision information. To calculate the effect of the query term signals on a web page, the magnitude and phase precision values of the term signals are examined and combined to create the web page score. Most of the top results in [3] use Sum Magnitudes (Section 2.4) and Zero Phase Precision (Section 2.5) to examine the term signals. FDS then ranks web pages by sorting the score values in decreasing order.
2.2. Weighting of a web page
A web page commonly contains at least hundreds of terms, which eventually results in almost as many term signals. The terms are grouped into bins to reduce the size of the term vectors. If the number of bins is set to $B$, a web page containing $W$ terms has $W/B$ terms in each bin. Let $t$ be an index term in document $d$ and let $B$ be the chosen number of bins; note that from now on the word document also refers to a web page. A weight $\omega_{d,t,b} > 0$ is associated with the number of occurrences of term $t$ inside document $d$ in location bin $b$, where $b \in \{0, \ldots, B-1\}$. For each term $t$ inside document $d$, the term signal vector is
$$\vec{\omega}_{d,t} = [\omega_{d,t,0}, \ldots, \omega_{d,t,B-1}]^T.$$
Let $N$ be the total number of web pages and $n_t$ be the number of web pages in which the index term $t$ appears. The weighting scheme for $\omega_{d,t,b}$, derived from the TF-IDF scheme [2], is then
$$\omega_{d,t,b} = \frac{freq_{d,t,b}}{\max_{i,j} freq_{d,t=i,b=j}} \cdot \log \frac{N}{n_t}$$
where $freq_{d,t,b}$ is the number of occurrences of term $t$ inside document $d$ in location bin $b$, and $\max_{i,j} freq_{d,t=i,b=j}$ is the maximum occurrence count over all terms $i$ and location bins $j$ inside document $d$.
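For illustration, the following is a minimal sketch (not the authors' implementation) of computing the bin weights $\omega_{d,t,b}$ for one document; `tokens`, `doc_freq`, and `num_docs` are hypothetical names for the preprocessed term sequence, the document frequency table $n_t$, and the collection size $N$.

```python
import math
from collections import defaultdict

def bin_weights(tokens, doc_freq, num_docs, num_bins=8):
    """Return {term: [w_{d,t,0}, ..., w_{d,t,B-1}]} for one document."""
    # Count occurrences of each term per location bin (freq_{d,t,b}).
    freq = defaultdict(lambda: [0] * num_bins)
    for pos, term in enumerate(tokens):
        b = pos * num_bins // len(tokens)        # map token position to bin 0..B-1
        freq[term][b] += 1
    # Normalize by the largest bin count in the document and apply the IDF factor.
    max_freq = max(count for counts in freq.values() for count in counts)
    weights = {}
    for term, counts in freq.items():
        idf = math.log(num_docs / doc_freq.get(term, 1))
        weights[term] = [(count / max_freq) * idf for count in counts]
    return weights
```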
2.3. Fourier transforming of a web page
The sequence $\omega_{d,t,0}, \ldots, \omega_{d,t,B-1}$ is still information in the time or spatial domain and needs to be transformed into the frequency domain. The Fourier transform defines the relationship between a signal in the time or spatial domain and its representation in the frequency domain, known as the (Fourier) spectrum. A spectrum is made up of a number of frequency components, each with a real and an imaginary part or, equivalently, with an associated magnitude and phase that represent the same information. The discrete form of the Fourier transform is [3]:
$$v_{d,t,\beta} = \sum_{b=0}^{B-1} \omega_{d,t,b} \exp\left(\frac{-2\pi i}{B}\beta b\right), \quad \beta \in \{0, \ldots, B-1\}$$
where $v_{d,t,\beta}$ is the projection of the term signal $\vec{\omega}_{d,t}$ onto a sinusoidal wave of frequency $\beta$. The spectral component number $\beta$ is an element of the set $\{0, \ldots, B-1\}$.
The Discrete Fourier Transform produces the following mapping [3]:
$$\omega_{d,t,b} \rightarrow v_{d,t,b} = H_{d,t,b} \exp(i\phi_{d,t,b})$$
where $v_{d,t,b}$ is the $b$th frequency component of term $t$ in document $d$, $H_{d,t,b}$ and $\phi_{d,t,b}$ are the magnitude and phase of frequency component $v_{d,t,b}$, respectively, and $i$ is $\sqrt{-1}$.
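As a sketch of this step (assuming, rather than following the authors' code, that numpy's FFT can stand in for the sum above, since it uses the same $\exp(-2\pi i \beta b / B)$ convention), one term signal can be turned into its magnitudes and phases as follows.

```python
import numpy as np

def term_spectrum(term_signal):
    """term_signal: the B bin weights w_{d,t,b} of one term; returns (H_{d,t,b}, phi_{d,t,b})."""
    spectrum = np.fft.fft(np.asarray(term_signal, dtype=float))  # v_{d,t,beta}
    magnitudes = np.abs(spectrum)    # H_{d,t,b}
    phases = np.angle(spectrum)      # phi_{d,t,b}
    return magnitudes, phases
```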
2.4. Calculating the magnitude value of a web page
Sum Magnitudes $H^m_{d,b}$ [3] takes into account only the magnitude part $H_{d,t,b}$ of the frequency component $v_{d,t,b}$, and ensures that more weight is given to web pages with more occurrences of the query terms. Let $T$ be the set of query terms; the magnitude of every term matching a query term is multiplied by the weight $\omega_{q,t}$ of the query term $t$. The magnitude value $H^m_{d,b}$ of each frequency component $b$ inside document $d$ is then:
$$H^m_{d,b} = \sum_{t \in T} H_{d,t,b} \cdot \omega_{q,t}$$
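A minimal sketch of Sum Magnitudes, assuming `spectra[t]` holds the `(magnitudes, phases)` pair from the previous sketch and `query_weights[t]` is the query term weight $\omega_{q,t}$ (both names are hypothetical):

```python
import numpy as np

def sum_magnitudes(spectra, query_weights):
    """Return the vector H^m_{d,b} over the B frequency components of one document."""
    num_bins = len(next(iter(spectra.values()))[0])
    h_m = np.zeros(num_bins)
    for term, w_qt in query_weights.items():
        if term in spectra:              # terms absent from the page contribute nothing
            magnitudes, _ = spectra[term]
            h_m += magnitudes * w_qt     # H_{d,t,b} * w_{q,t}
    return h_m
```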
2.5. Calculating the phase precision value of a web page
Zero Phase Precision $\Phi^z_{d,b}$ [3] includes only the phases of frequency components with nonzero magnitude, because a term with zero magnitude does not occur and its phase can be left out. For each frequency component, the phase values of the terms matching the query terms in the document are summed and averaged over the total number of query terms, $\#(T)$. The phase precision value $\Phi^z_{d,b}$ of each frequency component $b$ inside document $d$ is:
$$\Phi^z_{d,b} = \sqrt{\left(\sum_{t \in T} \frac{\cos \phi_{d,t,b}}{\#(T)}\right)^2 + \left(\sum_{t \in T} \frac{\sin \phi_{d,t,b}}{\#(T)}\right)^2}$$
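A sketch of Zero Phase Precision under the same assumed data layout as the previous sketches; components where a query term has zero magnitude are skipped, as described above.

```python
import numpy as np

def zero_phase_precision(spectra, query_terms):
    """Return the vector Phi^z_{d,b} for one document."""
    num_bins = len(next(iter(spectra.values()))[0])
    cos_sum = np.zeros(num_bins)
    sin_sum = np.zeros(num_bins)
    for term in query_terms:
        if term not in spectra:
            continue
        magnitudes, phases = spectra[term]
        nonzero = magnitudes > 0         # leave out phases of zero-magnitude components
        cos_sum[nonzero] += np.cos(phases[nonzero])
        sin_sum[nonzero] += np.sin(phases[nonzero])
    n = len(query_terms)                 # #(T)
    return np.sqrt((cos_sum / n) ** 2 + (sin_sum / n) ** 2)
```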
2.6. Calculating the score value of a web page
After the magnitudes and phase precisions of the frequency components in a web page have been obtained, the next step is to combine them into the web page score vector by multiplying their values. The score value for each frequency component $s_{d,b}$ is:
$$s_{d,b} = H^m_{d,b} \cdot \Phi^z_{d,b}$$
To get the document score $S_d$, the scores of the frequency components in the web page score vector are summed up. Four methods selected from [3] are studied in this paper for that calculation.
The first method, called Sum All Components, combines the scores $s_{d,b}$ of all frequency components in document $d$. However, the Nyquist-Shannon sampling theorem [4] states that the highest frequency component found in a real signal is equal to half of the sampling rate; this implies that, if there are $B$ frequency components in a term signal, analyzing the signal only requires examining frequency components $0$ to $B/2$ [3]. Under that assumption, only half of the frequency component scores in the web page score vector are needed, and the web page score is:
$$S_d = \sum_{b=1}^{B/2} s_{d,b}$$
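Continuing the sketch, the component scores and the Sum All Components document score follow directly from the two formulas above; the summation range $b = 1 \ldots B/2$ is taken as stated in the text.

```python
import numpy as np

def sum_all_components(h_m, phi_z):
    """Combine magnitude and phase precision vectors into the document score S_d."""
    s = h_m * phi_z                      # s_{d,b} = H^m_{d,b} * Phi^z_{d,b}
    half = len(s) // 2
    return float(np.sum(s[1:half + 1]))  # sum over components b = 1 .. B/2
```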
If any element of the web page score vector contains a high value, resulting from either a high magnitude or a high phase precision value, then that web page should be considered relevant to the query terms. That idea is the concept behind the other methods for calculating the web page score, which consider only the sum of the two greatest frequency component scores in the web page score vector:
$$S_d = s_{d,b_1} + s_{d,b_2}$$
The second method, Sum Largest Score Vector Components, selects the two largest of the frequency component scores in the web page score vector. Based on the magnitude and phase precision information represented in these components, the query terms inside them occur often and appear together. The condition of the Sum Largest Score Vector Components method is:
$$s_{d,b_1}, s_{d,b_2} \geq \max_{\forall b \neq b_1, b_2}(s_{d,b})$$
The third method, Sum Largest Phase Precision Components, selects the scores of the two frequency components in the web page score vector that have the largest phase precision values. The reason is that a frequency component with a larger phase precision value has term signal positions that are most similar to those of the query terms. The condition of the Sum Largest Phase Precision Components method is:
$$\Phi^z_{d,b_1}, \Phi^z_{d,b_2} \geq \max_{\forall b \neq b_1, b_2}(\Phi^z_{d,b})$$
The fourth method, Sum Largest Magnitude Components, selects the scores of the two frequency components in the web page score vector that have the largest magnitude values. The reason is that a frequency component with a larger magnitude value contains query terms that appear often. The condition of the Sum Largest Magnitude Components method is:
$$H^m_{d,b_1}, H^m_{d,b_2} \geq \max_{\forall b \neq b_1, b_2}(H^m_{d,b})$$
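The three "two largest" variants can be sketched with one helper that picks two components by score, phase precision, or magnitude and sums their scores (a sketch under the same assumed names as above, not the authors' implementation):

```python
import numpy as np

def sum_two_largest(h_m, phi_z, select_by="score"):
    """S_d for the Sum Largest {Score Vector | Phase Precision | Magnitude} Components methods."""
    s = h_m * phi_z
    selector = {"score": s, "phase": phi_z, "magnitude": h_m}[select_by]
    b1, b2 = np.argsort(selector)[-2:]   # indices of the two largest selector values
    return float(s[b1] + s[b2])
```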
3. Experiments
The October 26, 2006 snapshot of English Wikipedia articles, without images, was used as the data set. We selected 34 web pages linked directly from the main page of the category Statistic 2. We used Oracle Text [5] to extract terms and create an inverted index enabling quick retrieval of term vectors. Before the web pages were indexed, preprocessing was performed, consisting of removing stop words and stemming with Porter's stemming algorithm [6]. The stop word list contained about 700 words, including the stop word list provided by Oracle Text, HTML tags, and some words that are common in Wikipedia web pages.
The data sets used in [3] are part of the TREC English document collection [7], which already provides sets of queries and relevance lists for testing purposes. On the other hand, there are no sample queries or lists of relevant web pages for the Statistic data set collected from Wikipedia. Therefore, in this experiment we selected the words "Probability, Distribution" as query terms and judged 10 out of the 34 web pages to be relevant. The experiments in [3], using their own data, showed that Sum All Components with eight bins provided the best precision in about half of the results. Based on that information, these experiments used the combination of Sum All Components, Sum Magnitudes, and Zero Phase Precision.
2See http://en.wikipedia.org/wiki/Statistic
Table 1. Results of experiments with respect to the number of bins B per term signal

PREC-N   B=2    B=4    B=8    B=16   B=32   B=64
N=5      0.80   0.80   1.00   1.00   1.00   0.80
N=10     0.60   0.60   0.80   0.80   0.80   0.70
N=15     0.53   0.53   0.67   0.67   0.60   0.60
Table 2. Results of experiments with respect to the scoring methods of web pages

PREC-N   Sum all      Sum largest score    Sum largest        Sum largest
         components   vector components    phase precision    magnitude components
N=5      1.00         0.80                 0.80               0.80
N=10     0.80         0.80                 0.70               0.80
N=15     0.67         0.60                 0.60               0.60
As variations of the number of bins, the values 2, 4, 8, 16, 32, and 64 were tested. Precision of the retrieved results is reported as PREC-N for N = 5, 10, and 15; for example, PREC-5 is the ratio of the number of relevant web pages among the top 5 retrieved web pages.
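The PREC-N values in Tables 1 and 2 can be reproduced with a small helper; `ranked_pages` and `relevant_pages` are hypothetical names for the FDS ranking and the hand-made relevance list.

```python
def precision_at_n(ranked_pages, relevant_pages, n):
    """PREC-N: fraction of the top N ranked pages that appear in the relevance list."""
    relevant = set(relevant_pages)
    hits = sum(1 for page in ranked_pages[:n] if page in relevant)
    return hits / n
```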
The results show that 8 and 16 bins give better precision (see Table 1). However, the execution time for calculating the weighting and Fourier transforming of the web pages is another consideration: with 16 bins it takes three times longer than with eight bins. We therefore selected eight as the number of bins for the Statistic web pages of Wikipedia in this experiment.
For the remaining experiments on scoring web pages, eight bins were used. In general, Sum All Components showed the best results, while Sum Largest Phase Precision Components showed the worst compared to the other methods (see Table 2). Sum All Components considers how the query term signals appear and are spread throughout the entire content of a web page, which makes the relevance assessment more accurate. On the other hand, Sum Largest Phase Precision Components gives greater weight to the parts of web pages where the query terms appear together. Since the most basic feature of Wikipedia is its anyone-can-edit policy, changes sometimes introduce inconsistency between the topic and the content of a web page. We therefore conclude that the undesirable results of Sum Largest Phase Precision Components are caused by this Wikipedia policy.
4. Conclusions and future work
Unlike methods that recognize only the number of occurrences of the query terms, FDS also recognizes how the query terms are spread throughout the content of web pages. Based on the experiments, we conclude that the non-relevant results retrieved are mainly caused by a characteristic of Wikipedia. Any part of a Wikipedia web page can be changed by anyone, and sometimes those changes are not sufficiently relevant to the main topic. For that reason, we conclude that it is possible for some parts of the retrieved web pages to be strongly related to the query terms while the remaining parts are not.
To avoid such problems, in the future we intend to extract important terms that are truly semantically similar to the main topic of a web page. We are also considering improving the quality of the search results with a technique that exploits the additional information inherent in the hyperlink structure of web pages. Analysis of the hyperlink structure can determine a popularity score for each web page; the content score computed with FDS would then be combined with the popularity score to determine an overall score for each relevant web page.
References
[1] T. Holloway, M. Božičević, and K. Börner. Analyzing and Visualizing the Semantic Coverage of Wikipedia and its Authors. http://arxiv.org/abs/cs.IR/0512085, Dec. 2005.
[2] R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[3] L.A.F. Park, K. Ramamohanarao, and M. Palaniswami. Fourier domain scoring: A novel document ranking method. IEEE Transactions on Knowledge and Data Engineering, 16(5):529-539, May 2004.
[4] Wikipedia. Nyquist-Shannon Sampling Theorem. http://en.wikipedia.org/wiki/Nyquist-Shannon_sampling_theorem.
[5] Oracle Technology Network. Oracle Text in Oracle Database. http://www.oracle.com/technology/products/text.
[6] W.B. Frakes and R. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.
[7] National Institute of Standards and Technology. Text Retrieval Conference: Data - English Documents. http://trec.nist.gov/data/docs_eng.html.