A Study on Web Resources’ Navigation for e-Learning:
Usage of Fourier Domain Scoring on Web Pages Ranking Method
Diana Purwitasari, Yasuhisa Okazaki, Kenzi Watanabe
Saga University, Saga, Japan
diana@ai.is.saga-u.ac.jp
Abstract
Using existing web resources for e-Learning is a promising idea, especially for reducing the cost of authoring. Being open source, completely free, and frequently updated, Wikipedia is a good candidate. Although Wikipedia is structured by categories, these are not always updated dynamically when modifications occur. To serve as a web resource for e-Learning, Wikipedia needs a navigation path that maps the learning material semantically rather than merely following the category structure. The desired learning material can be provided in response to a search request. In this paper we introduce the use of Fourier Domain Scoring (FDS) as the ranking method for searching a collection of Wikipedia web pages. Unlike methods that recognize only the number of occurrences of query terms, FDS also recognizes how the query terms are spread throughout the content of web pages. Based on the experiments, we conclude that the non-relevant results retrieved are mainly caused by a characteristic of Wikipedia: since any part of a Wikipedia web page can be changed by anyone, it is possible that only some parts of a retrieved web page are strongly related to the query terms.
1. Introduction
Using existing web resources for e-Learning is a promising idea, especially for reducing the cost of authoring. Being open source, completely free, and frequently updated, Wikipedia is a good candidate. In addition, Wikipedia is one of the 100 most popular websites worldwide.1 Although not all of its resources are extensive, informative, and accurate, Wikipedia can still be an ideal place to start and to get a global picture of the desired material.
1 See http://www.alexa.com/site/ds/top_sites?ts_mode=global&lang=none, retrieved March 14, 2007.
Wikipedia is structured by categories [1], even though some of them are not ordered or are not updated dynamically every time a modification happens. Anyone willing to register a name can instantly make modifications and alter the category structure of Wikipedia. As a web resource for e-Learning, Wikipedia requires a navigation path that maps the available learning material onto the desired learning material semantically. Because the content is frequently updated, the navigation path should be generated adaptively from the contents of the material, not merely from its structure.
The desired learning material can be provided in response to a search request. One of the challenges in the search process is how to interpret the results; in other words, how to handle a large answer set of web pages by ranking them so as to provide the ones that really are of interest to the learner [2]. Before creating an adaptively generated navigation path for Wikipedia, we introduce in this paper the use of Fourier Domain Scoring (FDS) [3] as the ranking method for searching a collection of Wikipedia web pages.
Based on the experiments, we conclude that the non-relevant results retrieved are mainly caused by a characteristic of Wikipedia. Changes to Wikipedia web pages can be made in any part by anyone, and sometimes those changes are not sufficiently relevant to the main topic of the page. We therefore conclude that some parts of a retrieved web page may be strongly related to the query terms while the remaining parts are not.
2. Fourier Domain Scoring for Web Pages Ranking
2.1. Introduction of Fourier Domain Scoring
Most systems that perform a search process use variations of the Boolean or vector space model for ranking. The concept behind the vector space model is to convert each web page and the query into vectors so that they can easily be compared [2]. The problem with vector space techniques is that any spatial information contained in the web pages is lost: once the web pages are converted into vectors, the number of times each term appears is represented in the vector, but the positions of the terms are ignored. Rather than storing only the frequencies of terms, FDS stores term signals that show how the terms are spread throughout a web page. By comparing the magnitude and phase values of the query term signals present in web pages, FDS can identify web pages in which the query terms occur frequently and appear together [3].
The occurrence counts and positions of terms correspond to the magnitude and phase precision values of the term signals in FDS. A relevant web page should have large magnitudes, and the phases of the terms matching the query terms should be similar. Each web page has one or more term signals, where each signal carries magnitude and phase precision information. To calculate the effect of the query term signals on a web page, the magnitude and phase precision values of the term signals are examined and combined to create the web page score. Most of the top results in [3] use Sum Magnitudes (Section 2.4) and Zero Phase Precision (Section 2.5) to examine term signals. FDS then ranks web pages by sorting their score values in decreasing order.
2.2. Weighting of a web page
Commonly a web page contains at least hundreds of terms, which results in almost as many term signals. The terms are grouped into bins to reduce the size of the term vector. If the number of bins is set to B, a web page containing W terms has W/B terms in each bin. Let t be an index term in document d and B the chosen number of bins. (From here on, the word "document" also refers to a web page.) A weight \omega_{d,t,b} > 0 is associated with the number of occurrences of term t in document d in bin b, where b \in \{0, \ldots, B-1\}. For each term t in document d, the vector is

\omega_{d,t} = [\omega_{d,t,0}, \ldots, \omega_{d,t,B-1}]^T

Let N be the total number of web pages and n_t the number of web pages in which the index term t appears. The weighting scheme for \omega_{d,t,b}, derived from the TF-IDF scheme [2], is:

\omega_{d,t,b} = \frac{freq_{d,t,b}}{\max_{i,j} freq_{d,t=i,b=j}} \cdot \log\frac{N}{n_t}

where freq_{d,t,b} is the number of occurrences of term t in document d in bin b, and \max_{i,j} freq_{d,t=i,b=j} is the maximum occurrence count over all terms i and bins j in document d.
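As an illustration of the binning and weighting scheme above, the following sketch (the function names and the toy document are our own, not part of the original system) builds the binned TF-IDF-style weights \omega_{d,t,b} for one term:

```python
import math

def binned_frequencies(tokens, term, B):
    """Count occurrences of `term` in each of B equal-width bins of the document."""
    freq = [0] * B
    W = len(tokens)
    for pos, tok in enumerate(tokens):
        if tok == term:
            # Map the token position onto a bin index in {0, ..., B-1}.
            freq[min(pos * B // W, B - 1)] += 1
    return freq

def weight_signal(tokens, term, B, N, n_t):
    """TF-IDF-style weights w_{d,t,b}: bin frequency normalized by the maximum
    bin frequency over all terms in the document, scaled by log(N / n_t)."""
    vocab = set(tokens)
    max_freq = max(max(binned_frequencies(tokens, t, B)) for t in vocab)
    idf = math.log(N / n_t)
    return [f / max_freq * idf for f in binned_frequencies(tokens, term, B)]

# Toy document: "probability" appears near the start and near the end.
doc = ("probability theory studies distribution of random events "
       "a distribution assigns probability to outcomes").split()
print(weight_signal(doc, "probability", B=4, N=34, n_t=10))
```

With B = 4 bins, the two occurrences of "probability" fall into the first and last bins, so only those two bins receive nonzero weight.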
2.3. Fourier transforming of a web page
The sequence \omega_{d,t,0}, \ldots, \omega_{d,t,B-1} is still information in the time or spatial domain and needs to be transformed into the frequency domain. The Fourier transform defines the relationship between a signal in the time or spatial domain and its representation in the frequency domain, known as the (Fourier) spectrum. A spectrum is made up of a number of frequency components, each with a real and an imaginary part or, equivalently, with an associated phase in addition to a magnitude representing the same information. The discrete form of the Fourier transform is [3]:

v_{d,t,\beta} = \sum_{b=0}^{B-1} \omega_{d,t,b} \exp\left(-\frac{2\pi i}{B}\beta b\right), \quad \beta \in \{0, \ldots, B-1\}

where v_{d,t,\beta} is the projection of the term signal \omega_{d,t} onto a sinusoidal wave of frequency \beta, and the spectral component number \beta is an element of the set \{0, \ldots, B-1\}.

The discrete Fourier transform thus produces the following mapping [3]:

\omega_{d,t,b} \rightarrow v_{d,t,b} = H_{d,t,b} \exp(i \varphi_{d,t,b})

where v_{d,t,b} is the b-th frequency component of term t in document d, H_{d,t,b} and \varphi_{d,t,b} are the magnitude and phase of frequency component v_{d,t,b}, respectively, and i is \sqrt{-1}.
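The transform can be sketched in a few lines of plain Python (a didactic implementation; a real system would use an FFT library):

```python
import cmath

def dft(signal):
    """v_beta = sum_b w_b * exp(-2*pi*i*beta*b / B), for beta = 0..B-1."""
    B = len(signal)
    return [sum(w * cmath.exp(-2j * cmath.pi * beta * b / B)
                for b, w in enumerate(signal))
            for beta in range(B)]

# Term signal from Sec. 2.2 (illustrative weights in B = 8 bins).
w = [1.2, 0.0, 0.0, 0.6, 0.0, 0.0, 0.0, 1.2]
v = dft(w)

H = [abs(c) for c in v]            # magnitudes H_{d,t,b}
phi = [cmath.phase(c) for c in v]  # phases    phi_{d,t,b}

# The beta = 0 component is the plain sum of the weights, with phase 0.
print(round(H[0], 6), round(phi[0], 6))  # -> 3.0 0.0
```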
2.4. Calculate magnitude value of a web page
Sum Magnitudes H^m_{d,b} [3] takes into account only the magnitude part H_{d,t,b} of the frequency component v_{d,t,b}, ensuring that more weight is given to web pages with more occurrences of the query terms. Let T be the set of query terms; the magnitude of every term matching a query term is then multiplied by the weight \omega_{q,t} of the query term t. The magnitude value H^m_{d,b} of each component b in document d is:

H^m_{d,b} = \sum_{t \in T} H_{d,t,b} \cdot \omega_{q,t}
2.5. Calculate phase precision value of a web page
Zero Phase Precision \Phi^z_{d,b} [3] includes only the phases of frequency components with nonzero magnitude, because a term with zero magnitude does not occur and its phase value can be left out. For each frequency component, the phase values of the terms equal to the query terms are summed and averaged over the total number of query terms, \#(T). The phase precision value \Phi^z_{d,b} of each frequency component b in document d is:

\Phi^z_{d,b} = \sqrt{\left(\sum_{t \in T} \frac{\cos \varphi_{d,t,b}}{\#(T)}\right)^2 + \left(\sum_{t \in T} \frac{\sin \varphi_{d,t,b}}{\#(T)}\right)^2}
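Both scoring ingredients can be sketched as follows. The function and variable names are ours; the dictionaries map each term to its per-component magnitudes and phases, assuming the spectra have already been computed as in Section 2.3:

```python
import cmath
import math

def sum_magnitudes(H, w_q):
    """H^m_{d,b} = sum over query terms t of H_{d,t,b} * w_{q,t}."""
    B = len(next(iter(H.values())))
    return [sum(H[t][b] * w_q[t] for t in w_q) for b in range(B)]

def zero_phase_precision(H, phi, w_q, eps=1e-9):
    """Phi^z_{d,b}: length of the averaged unit phase vectors of the query
    terms; components with (numerically) zero magnitude are left out."""
    B = len(next(iter(H.values())))
    nT = len(w_q)
    out = []
    for b in range(B):
        c = sum(math.cos(phi[t][b]) for t in w_q if H[t][b] > eps) / nT
        s = sum(math.sin(phi[t][b]) for t in w_q if H[t][b] > eps) / nT
        out.append(math.sqrt(c * c + s * s))
    return out

# Two query terms with identical spectra: their phases agree, so the
# phase precision is 1 wherever the magnitude is nonzero.
v = [3.0 + 0j, 1.0 + 1.0j, 0.0 + 0j, 1.0 - 1.0j]
H = {"probability": [abs(c) for c in v], "distribution": [abs(c) for c in v]}
phi = {"probability": [cmath.phase(c) for c in v],
       "distribution": [cmath.phase(c) for c in v]}
w_q = {"probability": 1.0, "distribution": 1.0}

print(sum_magnitudes(H, w_q))
print([round(p, 3) for p in zero_phase_precision(H, phi, w_q)])
```

Note how component b = 2, whose magnitude is zero for both terms, contributes nothing to either quantity.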
2.6. Calculate score value of a web page
After the magnitude and phase precision of the frequency components of a web page have been obtained, the next step is to combine them into the web page score vector by multiplying their values. The score value for each frequency component s_{d,b} is:

s_{d,b} = H^m_{d,b} \cdot \Phi^z_{d,b}

To get the document score S_d, the scores of the frequency components in the web page score vector are summed. Four methods selected from [3] are studied in this paper for that calculation.
The first method, called Sum All Components, combines the scores s_{d,b} of all frequency components in document d. However, the Nyquist-Shannon sampling theorem [4] states that the highest frequency component found in a real signal equals half of the sampling rate; this implies that if there are B frequency components in the term signal, then analyzing the signal only requires examining frequency components 0 to B/2 [3]. Under that assumption, only half of the frequency component scores in the web page score vector are needed, and the web page score is:

S_d = \sum_{b=1}^{B/2} s_{d,b}

If any element of the web page score vector has a high value, resulting from either a high magnitude or a high phase precision, then that web page should be considered relevant to the query terms. That idea is the concept behind the other methods for calculating the web page score, which consider only the sum of the two greatest frequency component scores in the web page score vector.
S_d = s_{d,b_1} + s_{d,b_2}

The second method, Sum Largest Score Vector Components, selects the two largest scores among the frequency component scores in the web page score vector. Based on the magnitude and phase precision information represented in those frequency components, the query terms inside them occur frequently and appear together. The condition of the Sum Largest Score Vector Components method is:

s_{d,b_1}, s_{d,b_2} \geq \max_{b \neq b_1, b_2}(s_{d,b})

The third method, Sum Largest Phase Precision Components, selects the scores of the two frequency components in the web page score vector that have the largest phase precision values. The reasoning is that a frequency component with a larger phase precision value has term signal positions most similar to the query terms. The condition of the Sum Largest Phase Precision Components method is:

\Phi^z_{d,b_1}, \Phi^z_{d,b_2} \geq \max_{b \neq b_1, b_2}(\Phi^z_{d,b})

The fourth method, Sum Largest Magnitude Components, selects the scores of the two frequency components in the web page score vector that have the largest magnitude values. The reasoning is that a frequency component with a larger magnitude contains query terms that occur frequently. The condition of the Sum Largest Magnitude Components method is:

H^m_{d,b_1}, H^m_{d,b_2} \geq \max_{b \neq b_1, b_2}(H^m_{d,b})
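A compact sketch of the four combination methods (our own illustrative code with hypothetical method names; we assume the two-largest methods also draw from components 1..B/2, which the text does not state explicitly):

```python
def document_score(s, Hm, Pz, method="sum_all"):
    """Combine component scores s_b = H^m_b * Phi^z_b into a document score S_d.
    Following the Nyquist argument in the text, only components 1..B/2 are used."""
    B = len(s)
    idx = list(range(1, B // 2 + 1))
    if method == "sum_all":            # Sum All Components
        return sum(s[b] for b in idx)
    if method == "largest_score":      # Sum Largest Score Vector Components
        key = s
    elif method == "largest_phase":    # Sum Largest Phase Precision Components
        key = Pz
    else:                              # Sum Largest Magnitude Components
        key = Hm
    b1, b2 = sorted(idx, key=lambda b: key[b])[-2:]
    return s[b1] + s[b2]

# Illustrative per-component magnitudes and phase precisions (B = 8).
Hm = [3.0, 2.0, 0.5, 1.0, 0.2, 0.0, 0.1, 0.4]
Pz = [1.0, 0.9, 1.0, 0.3, 0.8, 0.0, 0.5, 0.6]
s = [h * p for h, p in zip(Hm, Pz)]

print(document_score(s, Hm, Pz, "sum_all"))
```

Note that the three two-component methods can pick different components: here the largest magnitudes sit at components whose phase precision is low, so "largest_magnitude" yields a lower score than "largest_score".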
3. Experiments
The October 26, 2006 snapshot of the English Wikipedia articles, without images, was used as the data set. We selected 34 web pages linked directly from the main page of the category Statistic.2 We used Oracle Text [5] to extract terms and create an inverted index enabling quick retrieval of term vectors. Before the web pages were indexed, preprocessing was performed, consisting of stop word removal and stemming with Porter's stemming algorithm [6]. The stop word list contained about 700 words, including the stop word list provided by Oracle Text, HTML tags, and some words common in Wikipedia web pages.
In [3] the data sets used are part of the TREC English document collection [7], which is already provided with query sets and relevance lists for testing purposes. On the other hand, there are no sample queries or relevance lists for the Statistic data set collected from Wikipedia. Therefore, in this experiment we selected the words "Probability, Distribution" as the query terms and judged 10 of the 34 web pages to be relevant.
The experiments in [3], using their own data, showed that about half of the best results used Sum All Components and that eight bins provided the best precision. Based on that information, these experiments used the combination methods
2 See http://en.wikipedia.org/wiki/Statistic
Table 1. Results of experiments with respect to the number of bins B per term signal

  PREC-N   B=2    B=4    B=8    B=16   B=32   B=64
  N=5      0.80   0.80   1.00   1.00   1.00   0.80
  N=10     0.60   0.60   0.80   0.80   0.80   0.70
  N=15     0.53   0.53   0.67   0.67   0.60   0.60
Table 2. Results of experiments with respect to the scoring methods of web pages

  PREC-N   Sum all      Sum largest      Sum largest        Sum largest
           components   score vector     phase precision    magnitude
                        components       components         components
  N=5      1.00         0.80             0.80               0.80
  N=10     0.80         0.80             0.70               0.80
  N=15     0.67         0.60             0.60               0.60
of Sum All Components, Sum Magnitudes, and Zero Phase Precision. As variations of the number of bins, the values 2, 4, 8, 16, 32, and 64 were selected. The precision of relevant documents is reported in the format PREC-N with N = 5, 10, and 15 (e.g., PREC-5 is the fraction of relevant web pages among the top 5 retrieved web pages).
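PREC-N itself is straightforward to compute; a minimal sketch (identifiers are illustrative):

```python
def precision_at_n(ranked_ids, relevant_ids, n):
    """PREC-N: fraction of the top-n retrieved pages that are relevant."""
    return sum(1 for d in ranked_ids[:n] if d in relevant_ids) / n

# Hypothetical ranking of five retrieved pages against a relevance judgment.
ranked = ["p3", "p7", "p1", "p9", "p4"]
relevant = {"p3", "p1", "p9", "p5"}
print(precision_at_n(ranked, relevant, 5))  # 3 of the top 5 are relevant -> 0.6
```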
The results showed that 8 and 16 bins give better precision (see Table 1). However, the execution times for weighting and Fourier transforming the web pages are another consideration: with 16 bins they take three times as long as with 8. We therefore selected eight as the number of bins for the Statistic web pages of Wikipedia in this experiment.
For the remaining experiments to score the web pages, eight bins were used. In general, Sum All Components gave the best results, while Sum Largest Phase Precision Components gave the worst compared to the other methods (see Table 2). Sum All Components considers how the query term signals appear and spread throughout the entire content of a web page, which makes the relevance assessment more accurate. On the other hand, Sum Largest Phase Precision Components gives greater weight to the parts of web pages where the query terms appear together. The most basic feature of Wikipedia web pages is the anyone-can-edit policy, so changes sometimes introduce inconsistency between the topic and the content of a page. We therefore conclude that the undesirable results of Sum Largest Phase Precision Components are caused by this Wikipedia policy.
4. Conclusions and future work
Unlike other methods that recognize only the number of occurrences of query terms, FDS can also recognize how the query terms are spread throughout the content of web pages. Based on the experiments, we conclude that the non-relevant results retrieved are mainly caused by a characteristic of Wikipedia. Changes to Wikipedia web pages can be made in any part by anyone, and sometimes those changes are not sufficiently relevant to the main topic. For that reason, we conclude that some parts of a retrieved web page may be strongly related to the query terms while the remaining parts are not.
To avoid such problems, in the future we are interested in extracting important terms that are really semantically similar to the main topic of a web page. We are also considering improving the quality of search results with a technique that exploits the additional information inherent in the hyperlink structure of web pages. Hyperlink structure analysis can determine a popularity score for each web page; the content score computed with FDS can then be combined with the popularity score to determine an overall score for each relevant web page.
References
[1] T. Holloway, M. Božičević, and K. Börner. Analyzing and Visualizing the Semantic Coverage of Wikipedia and its Authors. http://arxiv.org/abs/cs.IR/0512085, Dec. 2005.
[2] R. B. Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison-Wesley, 1999.
[3] L. A. F. Park, K. Ramamohanarao, and M. Palaniswami. Fourier domain scoring: A novel document ranking method. IEEE Transactions on Knowledge and Data Engineering, 16(5):529-539, May 2004.
[4] Wikipedia. Nyquist-Shannon Sampling Theorem. http://en.wikipedia.org/wiki/Nyquist-Shannon_sampling_theorem.
[5] Oracle Technology Network. Oracle Text in Oracle Database. http://www.oracle.com/technology/products/text.
[6] W. B. Frakes and R. B. Yates. Information Retrieval: Data Structures and Algorithms. Prentice Hall, 1992.
[7] National Institute of Standards and Technology. Text Retrieval Conference: Data - English Documents. http://trec.nist.gov/data/docs_eng.html.