Automatically Generating an E-textbook on the Web
World Wide Web: Internet and Web Information Systems (2005)
© 2005 Springer Science + Business Media, Inc. Manufactured in The Netherlands.
DOI: 10.1007/s11280-005-1319-5
JING CHEN
QING LI
WEIJIA JIA
Department of Computer Engineering and Information Technology, City University of Hong Kong,
Hong Kong SAR, PR China
jerryjin@cityu.edu.hk
itqli@cityu.edu.hk
itjia@cityu.edu.hk
Published online: 25 August 2005
Abstract
Nowadays, people tend to learn from the Web because it is convenient and rich in free information. The main
means of learning on the Web is to submit a query to a search engine and then browse through
the returned results to find relevant information. Although in many cases a search engine such as Google works
quite well, the results returned are often not appropriate for learning. In this paper, we present a novel
approach to automatically generating an E-textbook for a user-specified topic hierarchy. Such a technology can ease
the learning process to a great extent.
Keywords:
E-textbook, Web learning, topic hierarchy, concept mining, weighting strategy, expanding algorithm
1. Introduction
The Web is a vast storehouse of information. People can find information on almost every
aspect of life. So is the case for teachers and students. To them, the Web plays an increasingly
crucial role in the processes of teaching and learning.
Learning through the Web typically involves the use of a search engine (e.g., Google
[1]). A user first chooses a topic of interest as the query and submits it to the search engine,
then goes through the retrieved results and visits the Web pages that seem to match
that interest. However, learning from such search engine results can be
very difficult. With luck, the top few returned results hit the target directly. But often
they are index pages containing links considered relevant. Even if the contained
links are indeed relevant, the truly desired pages may still be several links away. If the
links are not genuinely relevant, one can hardly get anything important at all. In this
near ‘groping’ process, one is likely to stray from the initial intent. These obstacles
can make learning on the Web tedious and time-consuming, and may even turn learners
away.
One reason for such diversity in the results is that Web pages are authored
for a large variety of purposes. For example, for a keyword/phrase query ‘data structures
and algorithms’, some of the search engine results are the syllabus of a real course ‘data
structures and algorithms’, and such pages often contain links to lecture notes of the course.
Other results are designed to introduce a book titled ‘data structures and algorithms’, giving
only a table of contents with no details and no extra links. Such pages are hardly helpful
for learning. Still others are devised to help people find useful links to
implementations of algorithms, which may help those who are already familiar with the algorithms.
Selecting different pages in the retrieved list can result in a very different learning
experience. While computer experts claim that they can always find what they want, novices
often complain that they hardly get anything reasonable at all.
An ideal solution would be to give people studying on the Web an experience similar
to that of traditional learning. The traditional process mainly consists of the following
two steps. First, one gets to know what a topic mainly concerns. Then one reads chapter by
chapter about sub-topics and forms a clearer idea of the main topic. In other words, it is like
reading through a textbook. Moreover, learners prefer Web pages with higher-quality
content. High-quality Web pages often share several common attributes¹:
• Self-contained: A Web page is self-contained if it contains as much information as possible
about the concerned topic. In other words, a self-contained page should not only contain
elaborate information about the topic itself, but also enumerate its sub-topics.
With such pages, users do not need to visit more pages in order to gain a
deep understanding of the topic.
• Descriptive: An informative page should give descriptions and/or definitions of the topic.
These Web pages are especially helpful to users who do not yet have a concrete idea
of the topic.
• Authoritative: A Web page is regarded as authoritative if it is cited by many reputable
‘hub’ pages. Such popularity often implies good quality because people tend to refer to
pages that they consider to be ‘good’.
In this paper, we advocate a novel approach to automatically generating an E-textbook
on the Web for a user-specified topic. Our approach starts from a concept tree of the target
topic. The root node of the concept tree is the target topic, and the remaining nodes are sub-topics
that explain and enrich it. The approach first generates a query for each node and obtains the
corresponding candidate Web pages through a search engine. By mining the URLs
and the content of the candidate pages, Web pages ‘suitable’ for the learning process are
saved and assigned to the proper node. After all nodes in the concept tree are processed,
the E-textbook is complete. With such a challenging yet feasible technology, teachers can
resort to the E-textbook when preparing their teaching material, and students can learn more
about topics of interest and deepen their understanding.
The rest of the paper is organized as follows. In Section 2, we review recent work
on document organization and Web concept mining. Then we describe in Section 3 our
method for constructing a concept tree to represent the knowledge structure of a selected
topic; the user interface designed to browse the concept tree is also presented there. Section
4 presents an experimental study, and we summarize our contributions in Section 5.
2. Related work
Much research effort has been devoted to bringing order to the documents in a large document
collection. Halkidi [6] devised a system called THESUS that uses link semantics. They first
gather a set of popular Web documents according to Web directories like DMOZ [11],
and adopt a hierarchy of concepts (ontology) and a thesaurus to convert keywords from
all pages’ incoming links to semantics. Then they group documents into semantically
coherent subsets, and label the subsets for easier browsing. In [9], the author introduced an
approach that automatically builds hierarchical word-based summaries for a given set of
documents. A language model is used first to characterize documents in the data set, and
then a graph-theoretic algorithm is applied to identify the key content-bearing words for
the hierarchical summary.
Compared to the achievements in organizing document sets, limited work
has been done to distinguish data of higher quality within the massive Web collection,
especially for Web learning. To our knowledge, Liu et al.’s approach [10] was the first to
explicitly point out the task of compiling a book on the Web. In their technique, they first
identify the sub-topics or salient concepts of a specified topic, and then find the informative
pages containing definitions and descriptions to present to the user. The process is performed
in an interactive way, in which users choose the salient concepts of interest to explore further.
The weakness of their approach is that it lacks an explicit description of the concept structure
of the specified topic. In practice, when people read a book, they first look at the table of
contents and then go on to read the detailed chapters. The table of contents helps a reader to
measure progress and decide on the subsequent learning plan.
On some occasions, users may know the skeleton of a topic without having a deep
understanding of it. What they need is therefore a tool that can fill the skeleton with the
most relevant and important documents automatically. As presented in this paper,
such a requirement can be addressed by allowing users to specify the topic hierarchy and
having the system automatically generate the complete textbook for them.
3. E-textbook construction
An E-textbook is based on a user-specified topic hierarchy, which we call the concept
tree. The nodes of a concept tree are labeled with salient concepts of the topic and sub-topics.
The root node is the concerned topic, and its offspring are sub-topics, and then
children of these nodes extend the sub-topics further. Each node in the concept tree contains
several informative pages discussing the corresponding concept. In our application, we favor
descriptive Web pages containing definitions of salient concepts. Figure 1 shows an example
of a simplified concept tree corresponding to topic ‘data structures and algorithms’.
In the following, we first present the definitions and annotations of the concept tree; the
data structure and operations to be used in our later algorithms are also introduced.
• N_i denotes the ith node of the concept tree, where the tree is traversed in the breadth-first
order; the root node N_0 stands for the concerned topic;
Figure 1. Example of the concept tree for ‘data structures and algorithms’.
• L_i is the label of N_i, and it characterizes the central concept of a topic or sub-topic by
means of a few words;
• Q_i is the query phrase generated for N_i, which is submitted to a search engine to collect
the candidate Web pages of N_i for the final E-textbook;
• CandidatePool is a buffer which stores all retrieved Web pages of Q_i, awaiting further
mining;
• For N_i, there exists a descriptive Web page list called DescriptList_i. The pages in
DescriptList_i mainly come from CandidatePool, but not all the retrieved pages are
inserted into DescriptList_i. Web pages inserted to leaf nodes are informative ones that
bear concept definitions of L_i, while those for internal nodes may contain descriptions
or overview of L_i and sub-topics of L_i;
• For N_i, AnchList_i is a sorted list of links contained in Web pages in CandidatePool. Each
item in AnchList_i is a 3-tuple (URL_j, AnchorText_j, Weight_j), where URL_j is the URL of the
jth link from a Web page in CandidatePool, AnchorText_j is the anchor text corresponding
to URL_j, and Weight_j is the weight we have calculated for the Web page that contains
URL_j. The links in AnchList_i are used to expand the descriptive page list if necessary;
• ConceptPool includes all labels of tree nodes in the concept tree;
• parent(N_i) denotes the parent of node N_i;
• child(k, N_i) stands for the kth offspring of node N_i.
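The notation above can be made concrete with a small sketch. The Python below is illustrative only and is not the system described in this paper: the class name ConceptNode, the helper index_breadth_first, the 1-based reading of child(k, N_i), and the sample labels are all assumptions. It numbers the nodes in breadth-first order so that N_0 is the root, and derives ConceptPool from the node labels.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare nodes by identity; the tree contains parent/child cycles
class ConceptNode:
    label: str                                    # L_i
    children: List["ConceptNode"] = field(default_factory=list)
    parent: Optional["ConceptNode"] = None        # parent(N_i); None for the root
    index: int = -1                               # i, assigned in breadth-first order
    descript_list: List[str] = field(default_factory=list)  # DescriptList_i, as URLs

    def add_child(self, label: str) -> "ConceptNode":
        node = ConceptNode(label, parent=self)
        self.children.append(node)
        return node


def child(k: int, node: ConceptNode) -> ConceptNode:
    """Return the kth offspring of node (1-based here, which is an assumption)."""
    return node.children[k - 1]


def index_breadth_first(root: ConceptNode) -> List[ConceptNode]:
    """Assign index i to every node in breadth-first order and return [N_0, N_1, ...]."""
    order: List[ConceptNode] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        node.index = len(order)
        order.append(node)
        queue.extend(node.children)
    return order


# Hypothetical labels, not the actual content of Figure 1.
root = ConceptNode("data structures and algorithms")
ds = root.add_child("data structures")
alg = root.add_child("algorithms")
ds.add_child("linked list")
alg.add_child("sorting")

nodes = index_breadth_first(root)
concept_pool = [n.label for n in nodes]  # ConceptPool: all labels in the tree
```

Storing the parent link on each node keeps parent(N_i) a constant-time lookup, at the cost of a cyclic structure, which is why equality is left as identity comparison.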
For a given concept tree, tree nodes are traversed in the breadth-first order. For each
node, ‘suitable’ Web pages for learning are selected from the retrieved document set of the
corresponding query. The process of building an E-textbook includes the following steps:
1. Dataset collection: to generate the query phrase for each node and gather the retrieved
pages by the search engine;
2. Mining: to preprocess Web pages in the initial retrieved set, and mine the results to find
‘suitable’ Web pages;
3. Expansion: to expand the result list for nodes that require more pages;
4. Result presentation: to display ‘suitable’ Web pages to users.
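The four steps above can be sketched as a driver loop over the concept tree. This is a minimal outline under stated assumptions, not the paper's implementation: the function build_etextbook, the callables collect, mine and expand, and the min_pages threshold are hypothetical names, and the criterion for when a node "requires more pages" is simply assumed to be a page count. Pages are represented by URL strings, and query generation is folded into collect.

```python
from collections import deque
from typing import Callable, Dict, List

Page = str  # assumption: a page is identified by its URL


def build_etextbook(
    root: str,
    children: Dict[str, List[str]],
    collect: Callable[[str], List[Page]],
    mine: Callable[[str, List[Page]], List[Page]],
    expand: Callable[[str, List[Page], List[Page]], List[Page]],
    min_pages: int = 1,  # assumed threshold for triggering expansion
) -> Dict[str, List[Page]]:
    """Visit nodes breadth-first and map each label to its selected pages."""
    book: Dict[str, List[Page]] = {}
    queue = deque([root])
    while queue:
        label = queue.popleft()
        candidate_pool = collect(label)               # step 1: dataset collection
        descript_list = mine(label, candidate_pool)   # step 2: mining
        if len(descript_list) < min_pages:            # step 3: expansion, if needed
            descript_list = expand(label, candidate_pool, descript_list)
        book[label] = descript_list
        queue.extend(children.get(label, []))
    return book                                       # step 4 is left to the caller


# Stub stages standing in for a real search engine and real mining logic.
tree = {"algorithms": ["sorting", "searching"]}
fake_index = {"algorithms": ["u1", "u2"], "sorting": ["u3"], "searching": []}
book = build_etextbook(
    "algorithms",
    tree,
    collect=lambda label: fake_index[label],
    mine=lambda label, pool: pool[:1],
    expand=lambda label, pool, found: found + ["expanded:" + label],
)
```

Passing the stages in as callables keeps the traversal separate from the search engine and the mining logic, so each stage can be replaced or tested on its own.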
Figure 2. The process of automatically constructing an E-textbook.
3.1. Dataset collection
The initial dataset is collected by generating a query phrase for each node in the concept
tree, and submitting the query to a search engine. The pages retrieved by the search engine
are then saved to a buffer called CandidatePool for further processing.
Typically, Query Expansion (QE) techniques [10,13,14] are utilized to generate query phrases.
Searching on the Web often suffers from low precision. One of the main reasons is that
query terms are often too general to determine the context in which they appear, which is
termed the ambiguity problem in [10]. For instance, the word ‘kingdom’ in botanical
sciences means a major category in biological taxonomy. If ‘kingdom’ alone is submitted
to a search engine, it would be impossible to decide on its context, and many
irrelevant Web pages would appear in the result. Different QE techniques
have been introduced to cope with the ambiguity problem, aiming to represent the needed
information accurately. We employ a typical QE solution based upon our concept
tree. Internal and leaf nodes are treated with distinct strategies to generate query
phrases. The queries are then submitted to a search engine. For node N_i in the concept
tree:
• If N_i is a leaf node, the corresponding query Q_i consists of the label L_i itself:
Q_i = L_i;
• If N_i is an internal node, the query Q_i is created from L_i and the labels of its offspring:
Q_i = L_i ∪ {L_j | N_j = child(k, N_i) for some k}
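The two query rules can be illustrated with a short sketch. The function names generate_query and to_search_string are hypothetical, and rendering each label as a double-quoted phrase is an assumption about search engine syntax rather than something specified here.

```python
from typing import List


def generate_query(label: str, child_labels: List[str]) -> List[str]:
    """Build Q_i as an ordered list of phrases.

    Leaf node (no children): Q_i is the label L_i alone.
    Internal node: Q_i is L_i together with the labels of its offspring.
    """
    query = [label]
    for child_label in child_labels:
        if child_label not in query:  # set union: skip duplicate labels
            query.append(child_label)
    return query


def to_search_string(query: List[str]) -> str:
    """Render Q_i as a search string; quoting each phrase is an assumption."""
    return " ".join(f'"{phrase}"' for phrase in query)


leaf_query = generate_query("sorting", [])
internal_query = generate_query("algorithms", ["sorting", "searching"])
search_string = to_search_string(internal_query)
```

Adding the offspring labels gives the search engine extra context for a general term such as ‘algorithms’, which is how the concept tree is used here to reduce ambiguity.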