A Fine-Grained Digestion of News Webpages
through Event Snippet Extraction
Rui Yan
Dept. of Computer Science
Peking University, China
r.yan@pku.edu.cn
Liang Kong
Dept. of Machine Intelligence
Peking University, China
kongliang@pku.edu.cn
Yu Li
School of Computer Science
Beihang University, China
carp84@gmail.com
Yan Zhang
Dept. of Machine Intelligence
Peking University, China
zhy@cis.pku.edu.cn
Xiaoming Li
Dept. of Computer Science
Peking University, China
lxm@pku.edu.cn
ABSTRACT
We describe a framework to digest news webpages at a finer
granularity: extracting event snippets from their contexts. "Events"
are atomic text snippets, and a news article is composed of
multiple event snippets. Event Snippet Extraction (ESE) aims
to mine these snippets out. The problem is important because
its solutions can be applied to many information mining and
retrieval tasks. The challenge is to exploit rich features to
detect snippet boundaries, including various semantic, syntactic
and visual features. We run experiments to demonstrate the
effectiveness of our approaches.
Categories and Subject Descriptors
H.4.m [Information Systems]: Miscellaneous
General Terms
Algorithms, Experimentation, Measurement
Keywords
News digestion, event snippet extraction, web mining
1. INTRODUCTION
With the large volume of news webpages on the Web, news
digestion is increasingly an essential component of web content
analysis. We investigate the problem of Event Snippet Extraction
(ESE), which divides a news webpage into event-centered snippets.
ESE is well motivated: event distillation improves the retrieval
experience by presenting only the relevant parts of a page rather
than the whole page, and is of potential use in applications such
as discourse analysis. News clustering and classification can also
operate at a more accurate granularity with less noise, as can
content extraction and webpage deduplication. In sum, fine-grained
digestion by ESE opens the door to wide use on the Web.
ESE is related to traditional text segmentation, which often
fails to be event-oriented [1]. [2] proposes an introductory
approach, but it can still be improved: we consider more detailed
elements such as rich semantic, syntactic and visual features.
Copyright is held by the author/owner(s).
WWW 2011, March 28–April 1, 2011, Hyderabad, India.
ACM 978-1-4503-0637-9/11/03.
2. SNIPPET EXTRACTION
Based on the topic drift principle investigated in [2], we
treat a sentence $s$ with a timestamp as a potential head
sentence ($s_h$) of an event snippet ($S$). Assume that in the news
document $D$ ($D = \{s_1, s_2, \ldots, s_{|D|}\}$) each $S$ can be
represented as $\langle t : \{s\} \rangle$, where $t$ is the timestamp of $s_h$ and $\{s\}$
is the set of sentences that belong to $S$. Suppose there are
$m$ snippets in $D$; then $(\bigcup_{k=1}^{m} S_k) \subseteq D$ and $\forall i \neq j,\ S_i \cap S_j = \emptyset$.
The original sentence order is preserved within snippets. Neighboring
contexts tend to describe the same event due to the
semantic consecutiveness of natural-language discourse. A
snippet expands by absorbing texts pertinent to its event.
2.1 Semantic Relevance
Intuitively, the semantic relevance $\mathrm{Rel}$ of a pending sentence
($s_p$) to $S$ can be measured by the probability of $s_p$ being generated
from the language model of the snippet ($LM(S)$), as defined
in Equation (1). Sentences with low probability are clearly
off-event and not related to the expanding snippet.

$$\mathrm{Rel}(s_p, S) = p(s_p \mid LM(S)) = \left[ \prod_{w \in s_p} \frac{\sum_{s_i \in S} tf(w, s_i) + \lambda}{(1+\lambda) \cdot \sum_{s_i \in S} |s_i|} \right]^{\frac{1}{|s_p|}} \quad (1)$$

$\lambda$ is empirically set to 0.01 as a smoothing factor, $|s|$
is the size of sentence $s$, and $tf(w, s)$ is the term frequency of
word $w$ in $s$. Equation (1) assumes all sentences are equally
weighted, while in fact some sentences in $S$ have a larger probability
of being on-event than others. We denote this probability
as sentence significance. Semantic, syntactic and visual
features distinguish significance, and we exploit them next.
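Read this way, Equation (1) is the geometric mean, over the words of the pending sentence, of their smoothed generation probabilities under the snippet's language model. A minimal sketch, with function and variable names of our own choosing and sentences represented as token lists:

```python
from math import prod

LAMBDA = 0.01  # smoothing factor, as set in the paper

def relevance(pending, snippet):
    """Equation (1): geometric mean over the pending sentence's words
    of their smoothed probability under the snippet's language model.
    `pending` is a token list; `snippet` is a list of token lists."""
    total_len = sum(len(s) for s in snippet)  # sum of |s_i| over S

    def word_prob(w):
        tf = sum(s.count(w) for s in snippet)  # term frequency of w in S
        return (tf + LAMBDA) / ((1 + LAMBDA) * total_len)

    return prod(word_prob(w) for w in pending) ** (1.0 / len(pending))
```

An on-event sentence that reuses the snippet's vocabulary scores higher than one built from unseen words, which receive only the λ smoothing mass.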
2.2 Weighted Semantic Relevance
Distance Decay (DD). The tendency of contexts to agglomerate
attenuates as the distance from the head sentence $s_h$
grows, i.e., a distance decay. According to our investigation
and the statistics in [2], the snippet length $L$ follows a
Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$. Given $x = \|s_p\|$, where $\|s\|$
is the offset of sentence $s$ from $s_h$, the distance decay $f_d(x)$ is:

$$f_d(x) = P(x < L) = \int_{x}^{+\infty} \mathcal{N}(t; \mu, \sigma^2)\, dt. \quad (2)$$
Temporal Proximity (TP). Re-mention of adjacent
temporal information may strengthen event continuity
and raise significance, but a huge time gap indicates separate
events. Let $\Delta t = |t_n - t|$, where $t_n$ is the new time contained
in sentence $s_t$ and $t$ is the timestamp from $s_h$, and let $T_D$ be the time span
of $D$. The temporal proximity is $f_t(x) = e^{-\alpha \times \frac{\Delta t}{T_D}} \times f_d(x - \|s_t\|)$.
Named Entities (NE). A sentence with named entities
($s_e$) might indicate strong relevance if the entities are connected
in existing knowledge bases (e.g., WordNet or Wikipedia),
but [2] assumed equal distance between all adjacent entities in
hierarchical taxonomy structures. Leaf/lower-level entities
should be closer than general concepts from higher levels.
Consider a fragment <health [food safety, public health
organization (Centers for Disease Control, World Health
Organization)]>: (CDC, WHO) are closer than (food safety,
public health organization). We model synonyms, hyponyms
and hypernyms into entity distance. We assign a distance
weight $w_e$ to every entity, $w_e = 1 + \sum_{e_k \in H(e)} w_{e_k}$, where
$H(e)$ is the hyponym set of entity $e$. The distance from a
hypernym $e$ to one of its hyponyms $e_k$ is defined as:

$$dist = \frac{\sum_{e_k \in H(e)} w_{e_k}}{|H(e)|}. \quad (3)$$

The weight of a leaf node is set to 1. $dist$ and weight are measured
separately, and penalization costs more for category
entities. The entity influence is $f_e(x) = e^{-\beta \times dist} \times f_d(x - \|s_e\|)$.
$\alpha$ and $\beta$ are scaling factors. $f_d(x)$, $f_e(x)$ and $f_t(x)$ affect sentence
significance separately, and there may be more than one $s_e$ or
$s_t$ in $S$. For snippet completeness we choose the maximum
$f_e(x)$ and $f_t(x)$ and take the arithmetic average of the three.
Conjunctive Indicators (CI). Conjunctions such as
"however", "so", etc. reflect the author's intention of a semantic
bridge between adjacent sentences, which raises
sentence significance. For a sentence containing one of these
conjunctive indicators, we assume it shares the same significance
as the sentence immediately before it. The conjunctive
influence is local and does not accumulate over following texts.

$$sig(x) = sig(x-1) \quad \text{if } (s_x \cup s_{x-1}) \cap CI \neq \emptyset. \quad (4)$$
Layout Presentation (LP). The visual structure of the
news article in the webpage gives clues to the event
atoms, since writing style reflects event organization as well.
Line break. When we meet a <br> or <p> tag, we interpret the
line break as the author's intention of topic drift.
Visual elements. An inserted image, table or hyperlink
(<img>, <a>, etc.) indicates a similar effect as line
breaks, due to news writing style.
The effects of line breaks and visual elements are accumulative.
After $\tau$ visual changes, the probability drops by
$\prod_{\tau}(1 - r_i)$. The $r_i$ are not equal in general due to specific contexts, but
for simplicity we assume they all equal $r$. Hence the final $sig(\cdot)$ is:

$$sig(x) = \left( f_d(x) + \max\{f_e(x)\} + \max\{f_t(x)\} \right) \times (1-r)^{\tau} / 3 \quad (5)$$
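Putting Equations (2)–(5) together, a sentence's significance averages the distance decay with the strongest temporal and entity evidence, damped by layout changes. A hedged sketch under that reading, with interfaces and names of our own; the constants are the values reported in Section 3:

```python
import math

ALPHA, BETA, R = 0.6, 0.5, 0.174  # scaling factors and layout damping

def f_t(delta_t, span, fd_shifted):
    """Temporal proximity: e^(-alpha * dt / T_D) * f_d(x - ||s_t||)."""
    return math.exp(-ALPHA * delta_t / span) * fd_shifted

def f_e(dist_e, fd_shifted):
    """Entity influence: e^(-beta * dist) * f_d(x - ||s_e||)."""
    return math.exp(-BETA * dist_e) * fd_shifted

def significance(fd_x, entity_terms, temporal_terms, tau):
    """Equation (5): average f_d with the strongest entity and
    temporal factors, damped by tau accumulated layout changes."""
    fe = max(entity_terms, default=0.0)    # max f_e(x) over entities in S
    ft = max(temporal_terms, default=0.0)  # max f_t(x) over timestamps in S
    return (fd_x + fe + ft) * (1 - R) ** tau / 3.0
```

Each line break or visual element multiplies the score by another factor of $(1-r)$, so significance decays geometrically with layout evidence of topic drift.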
Combining Significance. Each sentence in a snippet affects
the following sentences, either increasing or decreasing their
significance. We apply $sig(\cdot)$ in Equation (1) and obtain a
weighted relevance score from all sentence pairs between $s_p$
and the sentences in the expanding snippet $S$. We add $s_p$ to
$S$ when the relevance exceeds a threshold.

$$p(s_p \mid LM(S)) = \left[ \prod_{w \in s_p} \frac{\sum_{s_i \in S} sig(s_i) \cdot tf(w, s_i) + \lambda}{(1+\lambda) \cdot \sum_{s_i \in S} sig(s_i) \cdot |s_i|} \right]^{\frac{1}{|s_p|}} \quad (6)$$
3. EXPERIMENTS
In a 10-fold cross-validation manner, we test our proposed
approaches on a corpus of 1000 webpages from the Xinhua
News website. There are on average 1.893 snippets per news
document, and over all snippets, $\mu = 6.97$, $\sigma = 2.11$. Gold
standards are created by human annotators. $\alpha$, $\beta$ and $r$ are set
experimentally to 0.6, 0.5 and 0.174, respectively. We stick
to the precision/recall evaluation metrics of [2]. Figure 1
shows the experimental results of semantic relevance (SeRel)
and weighted semantic relevance (WSeRel) compared with
TextTiling proposed in [1], and TTM and LGM proposed in [2].
The performance of the different features is shown in Figure 2.
Figure 1: Performance
Figure 2: Features
WSeRel generally outperforms the others. TextTiling shows
significant weakness because it is not event-oriented. The
contribution of significance is obvious (+26.56%) when comparing
WSeRel with SeRel. DD is the most essential feature for
snippet expansion; TP, NE and CI are also necessary. LP seems
not to perform well, due to misleading line breaks and visual
noise. We present a system demonstration snapshot in Figure 3.
Figure 3: Fine-grained news digestion system demo.
4. CONCLUSIONS
We describe a fine-grained news digestion framework based on
ESE, utilizing semantic, syntactic and visual features. ESE
is ongoing infrastructure work that facilitates other research.
We show that our approach outperforms rival methods.
5. ACKNOWLEDGMENTS
This work is partially supported by NSFC under Grant
Nos. 60933004, 61050009 and 61073081.
6. REFERENCES
[1] M. A. Hearst. Texttiling: segmenting text into
multi-paragraph subtopic passages. Comput. Linguist.,
23(1):33–64, 1997.
[2] R. Yan, Y. Li, Y. Zhang, and X. Li. Event recognition
from news webpages through latent ingredients
extraction. In AIRS ’10, pages 490–501.