Coherence-oriented Crawling and Navigation using Patterns for Web Archives*

Myriam Ben Saad, Zeynep Pehlivan, and Stéphane Gançarski
LIP6, University P. and M. Curie, 4 place Jussieu, 75005 Paris, France
{myriam.ben-saad,zeynep.pehlivan,stephane.gancarski}@lip6.fr

* This research is supported by the French National Research Agency ANR in the CARTEC Project (ANR-07-MDCO-016).
Abstract. We point out, in this paper, the issue of improving the coherence of web archives under limited resources (e.g. bandwidth, storage space, etc.). Coherence measures how much a collection of archived page versions reflects the real state (or snapshot) of a set of related web pages at different points in time. An ideal approach to preserve the coherence of archives is to prevent page content from changing during the crawl of a complete collection. However, this is practically infeasible because web sites are autonomous and dynamic. We propose two solutions: a priori and a posteriori. As an a priori solution, our idea is to crawl sites during off-peak hours (i.e. the periods of time where very little change is expected on the pages) based on patterns. A pattern models the behavior of the importance of page changes during a period of time. As an a posteriori solution, based on the same patterns, we introduce a novel navigation approach that enables users to browse the most coherent page versions at a given query time.

Keywords: Web Archiving, Data Quality, Pattern, Navigation
1 Motivation
The major challenge of web archiving institutions (Internet Archive, etc.) is to collect, preserve and enable future generations to browse off-line a rich part of the Web even after it is no longer reachable on-line. However, maintaining a good quality of archives is not an easy task because the web evolves over time and allocated resources are usually limited (e.g. bandwidth, storage space, etc.). In this paper, we focus on coherence, which measures how much a collection of archived page versions reflects the real web at different points in time. When users navigate through the archive, they may want to browse a collection of related pages instead of individual pages. This collection is a set of linked pages which may or may not share the same context, topic, domain name, etc. It can be a web site, generally including a home page and located on the same server. But it can also be a set of interconnected web pages belonging to different sites. Coherence ensures that if users reach a page version, they can also reach the versions of other pages of the same collection corresponding to the same point in time. In fact, during navigation in the archive, users may browse a page version which refers to another page, while the two page versions have never appeared at the same time on the real web. This may lead to conflicts or inconsistencies between page versions, and such page versions are considered temporally incoherent.
Figure 1 depicts an example of incoherence while browsing at time $t_q$. We consider two pages $P_1$ and $P_2$ of a site, updated respectively at times $t_2$ and $t_5$. A and A' are the two versions of page $P_1$ captured at times $t_1$ and $t_3$. B is the only version of page $P_2$ captured at $t_4$. If the archive is queried at time $t_q$, we can obtain as a result the two versions A' and B because they are the "closest" to $t_q$. These two versions are not coherent because they have never appeared at the same time on the site (due to page updates). However, the two versions A and B are coherent because they appear "as of" time point $t_1$.
Fig. 1. Archive Coherence
The problem of incoherence usually arises when pages change their content during the crawl of an entire collection. This problem can be handled by two solutions: a priori and a posteriori. The a priori solution aims to minimize page incoherence at crawling time by adjusting the crawler's strategy. The a posteriori solution operates at browsing time by enabling users to navigate through the most coherent page versions. The a priori solution adjusts the crawler's strategy to maximize the coherence of collected page versions independently of other quality measures (e.g. completeness, freshness, etc.). As it is impossible to obtain 100% coherence in the archive due to limited resources, an a posteriori solution is also needed. Thus, our challenge is to optimize browsing so that users navigate through the most coherent page versions at a given query time. In [2], we discovered periodic patterns from TV channel pages which describe the behavior of (regular) changes over time. By exploiting these patterns, we propose, in this paper, novel coherence-oriented approaches to crawling and browsing that improve archive quality and user navigation.
This paper is structured as follows. In Section 2, related work is discussed. Section 3 defines a coherence measure. Section 4 describes a pattern-based web archiving model. Section 5 proposes a crawling strategy to maximize archive coherence. Section 6 introduces a coherence-oriented approach to improve archive navigation. Section 7 presents preliminary results. Section 8 concludes.
2 Related Work
In recent years, there has been increasing interest in improving the coherence of web archives. In [13], the authors propose a crawling strategy to improve the coherence of crawled sites. However, they do not mention in which order sites should be visited. In [14], they present visualization strategies to help archivists understand the nature of coherence defects in the archive. In another study, they define two quality measures (blur and sharp) and propose a framework, coined SHARC [9], to optimize page captures. The two policies [13, 9] are based on multiple revisits of web pages. However, in our work, we assume that web crawlers have limited resources which prevent them from revisiting pages too often. Other studies are also closely related to our work in the sense that they aim at optimizing crawlers. To guess at which frequency each page should be visited, crawl policies are based on three major factors: (i) the relevance/importance of pages (e.g. PageRank) [7], (ii) information longevity [11] and (iii) frequency of changes [5, 9]. A factor that has been ignored so far is the importance of changes between page versions. Moreover, the change frequency used by most policies is estimated from a homogeneous Poisson process, which is not valid when pages are updated frequently, as demonstrated in [3]. Our research is applied to the archive of the French National Audiovisual Institute (INA), which preserves the pages of national radio and TV channels. These pages are updated several times a day and, hence, the Poisson model cannot be used, as explained above. In [2], we discovered periodic patterns from TV channel pages using a statistical analysis technique. Based on these patterns, we propose, in this paper, a crawl policy to improve the coherence of archives.
This paper also presents a new navigation method that takes into account the temporal coherence between the source page and the destination page. Although several browsers have been proposed to navigate over historical web data [10, 15], they are only interested in navigation between versions of the same page, by showing the changes across versions. As far as we know, no approach proposes to improve navigation in web archives by taking temporal coherence into account. The reason, as explained in [4], may be that temporal coherence only impacts the very regular users who spend a lot of time navigating in web archives. Even though archive initiatives do not have many users today, we believe that popular web archives (e.g. Internet Archive, Google News Archive) will attract more and more regular users over time.
3 Coherence Measure
We define in this section a quality measure, inspired by [13], which assesses the coherence of archives. The following notations are used in the paper.
- $S_i$ is a collection of linked web pages $P_i^j$.
- $AS_i$ is a (historical) archive of $S_i$.
- $P_i^j[t]$ is a version of a page $P_i^j$ ($P_i^j \in$ collection $S_i$) captured at time $t$.
- $Q(t_q, AS_i)$ is a query which asks for the closest versions (or snapshot) of $AS_i$ to the time $t_q$.
- $R(Q(t_q, AS_i))$ is the set of versioned pages obtained as a result of querying $AS_i$ at time $t_q$. A' and B in Figure 1 both belong to $R(Q(t_q, AS))$.
- $\omega(P_i^j[t])$ is the importance of the version $P_i^j[t]$. It depends on (i) the weight of the page $P_i^j$ (e.g. PageRank) and (ii) the importance of changes between $P_i^j[t]$ and its last archived version. The importance of changes between two page versions can be evaluated with the estimator proposed in [1].
Definition 1 (Coherent Versions). The $N_i$ versions of $R(Q(t_q, AS_i))$ are coherent if there is a time point (or an interval), called $t_{coherence}$, such that there exists a non-empty intersection among the invariance intervals $[\mu_j, \overline{\mu}_j]$ of all versions:

$$\forall P_i^j[t] \in R(Q(t_q, AS_i)),\ \exists\, t_{coherence} : t_{coherence} \in \bigcap_{j=1}^{N_i} [\mu_j, \overline{\mu}_j] \neq \emptyset \qquad (1)$$

where $\mu_j$ and $\overline{\mu}_j$ are respectively the time points of the change preceding and the change following the capture of the version $P_i^j[t]$.

As shown in Figure 2, the three versions $P_i^1[t_1]$, $P_i^2[t_2]$ and $P_i^3[t_3]$ on the left are coherent because there is an interval $t_{coherence}$ that satisfies the coherence constraint (1). However, the three page versions on the right are not coherent because there is no point in time satisfying the coherence constraint (1).
Fig. 2. Coherence Example [13]
Definition 2 (Query-Result Coherence). The coherence of the query result $R(Q(t_q, AS_i))$, also called weighted coherence, is the weight of the largest set of coherent versions divided by the total weight of the $N_i$ versions of $R(Q(t_q, AS_i))$. We assume that $\{P_i^1[t_1], \ldots, P_i^{\rho}[t_{\rho}]\} \subseteq R(Q(t_q, AS_i))$ are the $\rho$ coherent versions, i.e. those satisfying constraint (1), where $\rho$ is the largest number of coherent versions composing $R(Q(t_q, AS_i))$. The coherence of $R(Q(t_q, AS_i))$ is

$$Coherence(R(Q(t_q, AS_i))) = \frac{\sum_{k=1}^{\rho} \omega(P_i^k[t_k])}{\sum_{k=1}^{N_i} \omega(P_i^k[t_k])}$$

where $\omega(P_i^k[t_k])$ is the importance of the version $P_i^k[t_k]$.

Instead of evaluating the coherence of all the versions composing the query result $R(Q(t_q, AS_i))$, we can restrict the coherence measure to the $\eta$ top pages of $AS_i$, which are the most relevant ones. Such a measure is useful to preserve in particular the coherence of the most browsed (important) pages, such as home pages and their related pages. The coherence of rarely browsed pages can be considered less important.
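To make the measure concrete, the following is a minimal Java sketch of Definitions 1 and 2. The class and field names (CoherenceMeasure, VersionCapture, etc.) are illustrative assumptions, not the paper's implementation; the sketch only relies on the fact that a maximal set of closed intervals sharing a common point can be found by testing the left endpoint of each interval.

```java
import java.util.*;

/** Illustrative sketch of the weighted coherence measure (Definitions 1 and 2). */
class CoherenceMeasure {
    static class VersionCapture {
        final double weight;  // omega(P_i^j[t]): page weight combined with change importance
        final double muLow;   // mu_j: time of the last change before the capture
        final double muHigh;  // mu'_j: time of the next change after the capture
        VersionCapture(double weight, double muLow, double muHigh) {
            this.weight = weight; this.muLow = muLow; this.muHigh = muHigh;
        }
    }

    /** Weighted coherence of a query result R(Q(t_q, AS_i)): total weight of the
        largest coherent subset divided by the total weight of all versions. */
    static double weightedCoherence(List<VersionCapture> result) {
        double totalWeight = 0;
        for (VersionCapture v : result) totalWeight += v.weight;
        int bestCount = 0;
        double bestWeight = 0;
        // Sweep candidate values of t_coherence: the left endpoints of the invariance intervals.
        for (VersionCapture candidate : result) {
            double t = candidate.muLow;
            int count = 0;
            double weight = 0;
            for (VersionCapture v : result) {
                if (v.muLow <= t && t <= v.muHigh) { count++; weight += v.weight; }
            }
            if (count > bestCount) { bestCount = count; bestWeight = weight; }
        }
        return totalWeight == 0 ? 1.0 : bestWeight / totalWeight;
    }
}
```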
4 Pattern Model
Our work aims at improving the coherence of archived web collections by using patterns. We describe here the pattern model.
4.1 Pattern
A pattern models the behavior of a page's changes over periods of time, for example during a day. It is periodic and may depend on the day of the week and on the hour within the day. Pages with a similar change behavior can be grouped to share a common pattern.

Definition 3 (Pattern). A pattern of a page $P_i^j$ with an interval length $l$ is a nonempty sequence $Patt(P_i^j) = \{(\omega_1, T_1); \ldots; (\omega_k, T_k); \ldots; (\omega_{N_T}, T_{N_T})\}$, where $N_T$ is the total number of periods in the pattern and $\omega_k$ is the estimated importance of changes in the period $T_k$.
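As an illustration, a pattern can be represented as follows. This Java sketch is ours and assumes, for simplicity, periods of equal length covering one daily cycle; the field and method names are hypothetical.

```java
/** Illustrative representation of a pattern Patt(P_i^j) (Definition 3):
    a cycle of N_T periods of length l, each with an estimated change importance. */
class Pattern {
    final double periodLength;        // l, e.g. in hours
    final double[] changeImportance;  // omega_1 .. omega_NT, values in [0,1]

    Pattern(double periodLength, double[] changeImportance) {
        this.periodLength = periodLength;
        this.changeImportance = changeImportance;
    }

    /** Estimated importance of changes at time t (same unit as periodLength,
        measured from the start of the cycle). */
    double importanceAt(double t) {
        int k = (int) Math.floor(t / periodLength) % changeImportance.length;
        return changeImportance[k];
    }

    /** Index of the off-peak period, i.e. the period with the least expected changes. */
    int offPeakPeriod() {
        int best = 0;
        for (int k = 1; k < changeImportance.length; k++)
            if (changeImportance[k] < changeImportance[best]) best = k;
        return best;
    }
}
```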
4.2 Pattern-based Archiving
As shown in Figure 3, patterns are discovered from archived page versions by using an analyzer. The first step of the analyzer consists in segmenting each captured page into blocks that describe the hierarchical structure of the page. Then, successive versions of the same page are compared to detect structural changes (changes that affect the structure of the blocks composing the page) and content changes (changes that modify links, images and texts inside the blocks) by using the Vi-DIFF algorithm [12]. Afterwards, the importance of changes between two successive versions is evaluated with the estimator proposed in [1]. This estimator returns a normalized value between 0 and 1. An importance value near 1 (respectively near 0) denotes that the changes between versions are very important (respectively irrelevant, e.g. advertisements or decoration). After that, a periodic pattern which models the behavior of change importance is discovered for each page based on statistical analysis. In [2], we presented, through a case study, the steps and algorithms used to discover patterns from French TV channel pages. Discovered patterns are periodically updated so that they always reflect the current behavior. They can be used to improve the coherence of archives. They can also be exploited by the browser to enable users to navigate through the most coherent page versions, as shown in Figure 3.
Fig. 3. Pattern-based Archiving
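The last step of this pipeline can be sketched as follows. The Java snippet below only illustrates how observed change importances could be aggregated into a daily pattern: it assumes the importances have already been produced by Vi-DIFF [12] and the estimator of [1], and it replaces the statistical analysis of [2] with a simple per-period average. All names are ours, not the authors' API.

```java
import java.util.List;

/** Illustrative sketch: build a daily pattern from observed change importances. */
class PatternDiscovery {
    /** observations: pairs (hour of the capture within the day, importance of the
        changes since the previous capture). Returns a Pattern (see the Section 4.1
        sketch) with nPeriods periods covering 24 hours. */
    static Pattern discoverDailyPattern(List<double[]> observations, int nPeriods) {
        double periodLength = 24.0 / nPeriods;
        double[] sum = new double[nPeriods];
        int[] count = new int[nPeriods];
        for (double[] obs : observations) {
            int k = (int) Math.floor(obs[0] / periodLength) % nPeriods;
            sum[k] += obs[1];
            count[k]++;
        }
        double[] omega = new double[nPeriods];
        for (int k = 0; k < nPeriods; k++)
            omega[k] = (count[k] == 0) ? 0.0 : sum[k] / count[k];
        return new Pattern(periodLength, omega);
    }
}
```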
5 Coherence-oriented Crawling
An ideal approach to preserve the coherence of archives is to prevent page content from changing during the crawl of a complete collection. As this is practically impossible, our idea is to crawl each collection during the periods of time where very little (or useless) change is expected to occur on its pages. Such periods are named off-peak periods. Based on the discovered patterns, these periods can be predicted for each page and grouped to obtain a common off-peak period for the collection, as shown in Figure 4. To improve coherence, it is better to start by crawling $S_2$ before $S_1$ in order to coincide with their off-peak periods (Figure 4).
Fig. 4. Crawling collections at off-peak periods
5.1 Crawling Strategy
Given a limited amount of resources, our challenge is to schedule collections according to their off-peak periods in such a way that the coherence of the archive is improved. We define an urgency function that computes the priority of crawling a collection $S_i$ at time $t$. The urgency $U(S_i, t, \eta)$ of crawling the collection $S_i$ at time $t$ is

$$U(S_i, t, \eta) = [1 - \varphi(S_i, T_k, \eta)] \times (t - t_{lastRefresh})$$

- $t$ is the current time ($t \in T_k$),
- $t_{lastRefresh}$ is the last time the collection $S_i$ was refreshed,
- $\eta$ is the number of pages considered to evaluate the coherence of $AS_i$,
- $\varphi(S_i, T_k, \eta)$ is the average importance of the changes predicted by the patterns during the period $T_k$ for the collection $S_i$:

$$\varphi(S_i, T_k, \eta) = \frac{\sum_{j=1}^{\eta} \omega_k^j}{\eta}$$

where $\omega_k^j$ is the importance of changes defined in $Patt(P_i^j)$ ($1 \le j \le \eta$) at period $T_k$.

The urgency of a collection depends on the importance of changes predicted by the patterns and on the duration between the current time and the last refresh time. The less important the changes expected in period $T_k$, the higher the priority given to crawling the collection $S_i$. Only the $M$ top collections with the highest priority are downloaded at each period $T_k$. The value $M$ is fixed according to the available resources (e.g. bandwidth, etc.). Once the $M$ collections to be crawled are selected, their pages are downloaded in descending order of the change importance predicted by their patterns in period $T_k$. It is better to start by crawling the pages with the highest change importance because the risk of incoherence heavily depends on the time at which each page is downloaded: capturing static pages at the end of the crawl period does not affect the coherence of the archived collection. A pseudo code of this strategy is given in Algorithm 1.
Algorithm 1: Coherence-oriented Crawling

Input:
  $S_1, S_2, \ldots, S_i, \ldots, S_N$ - list of collections
  $Patt(P_i^1), Patt(P_i^2), \ldots, Patt(P_i^j), \ldots, Patt(P_i^n)$ - list of page patterns

Begin
  for each collection $S_i$, $i = 1, \ldots, N$, in period $T_k$ do
    compute $U(S_i, t, \eta) = [1 - \varphi(S_i, T_k, \eta)] \times (t - t_{lastRefresh})$
    collectionList.add($S_i$, $U(S_i, t, \eta)$)   /* kept in descending order of urgency */
  end for
  for $i = 1, \ldots, M$ do
    $S_i \leftarrow$ collectionList.select($i$)
    $t_{lastRefresh} \leftarrow t$
    pageList $\leftarrow$ getPagesOfCollection($S_i$)
    reorder(pageList, $\omega_k$)   /* in descending order of change importance */
    for each page $P_i^j$ in pageList do
      download page $P_i^j$
    end for
  end for
End
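For illustration, a possible in-memory realization of this strategy is sketched below in Java. It reuses the Pattern sketch from Section 4.1 and assumes that the index k of the current crawl period is known; the class and method names (SiteCollection, schedule, etc.) are hypothetical, not the paper's implementation.

```java
import java.util.Comparator;
import java.util.List;

/** Illustrative sketch of Algorithm 1 (coherence-oriented crawl scheduling). */
class CrawlScheduler {
    static class PageInfo {
        final String url;
        final Pattern pattern;   // change pattern of the page (Definition 3)
        PageInfo(String url, Pattern pattern) { this.url = url; this.pattern = pattern; }
    }
    static class SiteCollection {
        final String name;
        final List<PageInfo> pages;  // the eta top pages used for the coherence measure
        double lastRefresh;          // t_lastRefresh
        SiteCollection(String name, List<PageInfo> pages) { this.name = name; this.pages = pages; }
    }

    /** phi(S_i, T_k, eta): average change importance predicted for period k. */
    static double avgPredictedImportance(SiteCollection s, int periodK) {
        double sum = 0;
        for (PageInfo p : s.pages) sum += p.pattern.changeImportance[periodK];
        return sum / s.pages.size();
    }

    /** U(S_i, t, eta) = [1 - phi(S_i, T_k, eta)] * (t - t_lastRefresh). */
    static double urgency(SiteCollection s, double t, int periodK) {
        return (1.0 - avgPredictedImportance(s, periodK)) * (t - s.lastRefresh);
    }

    /** Select the M most urgent collections for period k, mark them refreshed, and order
        each selected collection's pages by decreasing predicted change importance. */
    static List<SiteCollection> schedule(List<SiteCollection> all, double t, int periodK, int m) {
        all.sort(Comparator.comparingDouble((SiteCollection s) -> -urgency(s, t, periodK)));
        List<SiteCollection> selected = all.subList(0, Math.min(m, all.size()));
        for (SiteCollection s : selected) {
            s.lastRefresh = t;
            s.pages.sort(Comparator.comparingDouble(
                    (PageInfo p) -> -p.pattern.changeImportance[periodK]));
        }
        return selected;  // pages of each collection are then downloaded in this order
    }
}
```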
6 Coherence-oriented Navigation
In web archives, navigation, also known as surfing, is enriched with the temporal dimension. In [10], web archive navigation is divided into two categories: horizontal navigation and vertical navigation. Horizontal navigation lets users browse chronologically among the different versions of a page, while vertical navigation lets users browse by following hyperlinks between pages, as on the live web. In this paper, we are interested in vertical navigation. Although it looks like navigation on the real web, the issues induced by web archives (temporal incoherence and incompleteness) lead to broken or defective links which prevent complete navigation. As noted above, it is impossible to obtain 100% coherence in the archive because allocated resources are usually limited and pages are too dynamic. If the system does not hold a version that was crawled exactly at the requested time, it usually returns the nearest (or most recent) version. Even by finding the nearest version from a multi-archive view, as in the Memento framework [8], it is not certain that this version reflects the navigation as it was on the real web. We introduce here a navigation approach that enables users to navigate through the most coherent versions.

In the remainder of the paper, the notion of collections of pages is not used anymore because, while navigating in the archive, we focus on the coherence of two linked pages: (i) the source page and (ii) the destination page pointed to by a hyperlink from the source page. In the following, a page is denoted by $P_j$ and a version of the page crawled at instant $t$ is denoted by $P_j[t]$.
6.1 Informal Overview
A simple example is given in Figure 5 to better explain our coherence-oriented navigation approach. Consider a user who starts navigating in the archive from the version of page $P_1$ captured at $t_q$ ($P_1[t_q]$). This user wants to follow a hyperlink to browse page $P_2$. The closest version of $P_2$ before $t_q$ is $P_2[t_1]$ and the closest version of $P_2$ after $t_q$ is $P_2[t_2]$. They are the candidate destination versions. We assume that the patterns of $P_1$ and $P_2$, which describe the behavior of changes, are known. These patterns are used to decide which version is the most coherent. As shown in Figure 5, the subpatterns defined over the periods $[t_1, t_q]$ (in red) and $[t_q, t_2]$ (in green) are extracted from the patterns of $P_1$ and $P_2$. To find the most coherent version with respect to $P_1[t_q]$, we estimate the importance of changes for each subpattern. The smaller the importance of changes predicted by the subpatterns, the smaller the risk of incoherence. Thus, the group of subpatterns (a) and (b) is compared to the other group of subpatterns (c) and (d) by using the importance of changes. The group of subpatterns which has the smallest total importance of changes is selected. This means that the navigation through the corresponding page versions in the selected group is more coherent. In the example, the group of subpatterns (a) and (b) has a smaller importance of changes than the group of (c) and (d). Thus, the most coherent version $P_2[t_1]$ (corresponding to subpattern (b)) is returned to the user.
Fig. 5. Coherence-oriented Navigation
6.2 Formal Definitions
In this section, we give the formal definitions of the approach explained in the previous section.

Definition 4 (SubPattern). Given a pattern $Patt(P_j) = \{(\omega_1, T_1); \ldots; (\omega_k, T_k); \ldots; (\omega_{N_T}, T_{N_T})\}$, the subpattern $SubPatt(Patt(P_j), [t_x, t_y])$ is the part of the pattern valid for a given period $[t_x, t_y]$:

$$SubPatt(Patt(P_j), [t_x, t_y]) = \{(\omega_k, T_k); (\omega_{k+1}, T_{k+1}); \ldots; (\omega_l, T_l)\}$$

where $1 \le k \le l \le N_T$, $t_x \in T_k$ and $t_y \in T_l$.

Definition 5 (Pattern Changes Importance). The function $\Psi(Patt(P_j))$ estimates the total importance of changes defined in the given (sub)pattern $Patt(P_j)$. It is the sum of all the change importances $\omega_i$ of $Patt(P_j)$:

$$\Psi(Patt(P_j)) = \sum_{i=k}^{l} \omega_i$$
Definition 6 (Navigational Incoherence). Let $P_s[t_q]$ be the source version where the navigation starts and let $P_d[t_x]$ be the destination version pointed to by a hyperlink. The navigational incoherence $\Upsilon$ between the two versions $P_s[t_q]$ and $P_d[t_x]$ is the sum of the change importances predicted by their corresponding subpatterns during the period $[t_q, t_x]$, where $t_q$ and $t_x$ are respectively the instants at which the source and the destination versions were captured:

$$\Upsilon(P_s[t_q], P_d[t_x]) = \Psi(SubPatt(Patt(P_s), [t_q, t_x])) + \Psi(SubPatt(Patt(P_d), [t_q, t_x]))$$

where $[t_q, t_x]$ denotes the interval between the two capture instants (the destination version may have been captured before or after $t_q$).

Definition 7 (Most Coherent Version). To find the most coherent destination version, the navigational incoherence $\Upsilon$ between the source version and each candidate destination version is computed, and the destination version with the smallest $\Upsilon$ is returned. The reason for choosing the smallest $\Upsilon$ is that the probability of being incoherent depends on the importance of changes of the subpatterns: if there are fewer changes, the source and the destination versions are expected to be more coherent. The most coherent version is defined as follows:

$$MCoherent(P_s[t_q], \{P_d[t_x], P_d[t_y]\}) = \begin{cases} P_d[t_x] & \text{if } \Upsilon(P_s[t_q], P_d[t_x]) < \Upsilon(P_s[t_q], P_d[t_y]) \\ P_d[t_y] & \text{otherwise} \end{cases}$$
Example 1. We take the same example as in Figure 5 to explain the process (here $P_s = P_1$ and $P_d = P_2$). We assume that the importance of changes for the four subpatterns a, b, c and d are respectively 0.6, 0.1, 0.7 and 0.6. The most coherent version $MCoherent(P_1[t_q], \{P_2[t_1], P_2[t_2]\})$ is $P_2[t_1]$ because $\Upsilon(P_1[t_q], P_2[t_1])$ is smaller than $\Upsilon(P_1[t_q], P_2[t_2])$, where $\Upsilon(P_1[t_q], P_2[t_1]) = 0.6 + 0.1 = 0.7$ and $\Upsilon(P_1[t_q], P_2[t_2]) = 0.7 + 0.6 = 1.3$.
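A minimal Java sketch of Definitions 5 to 7 follows. It reuses the Pattern sketch from Section 4.1, assumes time is expressed in the same unit as the pattern's period length, and uses hypothetical names throughout; it is not the authors' implementation.

```java
/** Illustrative sketch of coherence-oriented navigation (Definitions 5-7). */
class CoherentNavigation {
    /** Psi over the subpattern of `pattern` restricted to the interval between tFrom and
        tTo (Definitions 4 and 5): sum of the predicted change importances of the periods
        overlapping that interval, whatever the order of the two instants. */
    static double subPatternImportance(Pattern pattern, double tFrom, double tTo) {
        double lo = Math.min(tFrom, tTo), hi = Math.max(tFrom, tTo);
        int first = (int) Math.floor(lo / pattern.periodLength);
        int last  = (int) Math.floor(hi / pattern.periodLength);
        double sum = 0;
        for (int k = first; k <= last; k++)
            sum += pattern.changeImportance[k % pattern.changeImportance.length];
        return sum;
    }

    /** Upsilon(P_s[t_q], P_d[t_x]): navigational incoherence between the source version
        captured at tq and a candidate destination version captured at tx (Definition 6). */
    static double navigationalIncoherence(Pattern src, double tq, Pattern dst, double tx) {
        return subPatternImportance(src, tq, tx) + subPatternImportance(dst, tq, tx);
    }

    /** MCoherent: return the capture time of the candidate destination version with the
        smallest navigational incoherence w.r.t. the source version (Definition 7). */
    static double mostCoherentVersion(Pattern src, double tq, Pattern dst, double[] candidateTimes) {
        double best = candidateTimes[0];
        for (double tx : candidateTimes)
            if (navigationalIncoherence(src, tq, dst, tx)
                    < navigationalIncoherence(src, tq, dst, best)) best = tx;
        return best;
    }
}
```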
7 Experimental Evaluation
We evaluate here the effectiveness of coherence-oriented crawling and navigation. As it is impossible to capture exactly all the page changes that occur on web sites in order to measure coherence, we conducted simulation experiments based on real patterns obtained from French TV channel pages [2]. The experiments, written in Java, were run on a PC running Linux with a 3.20 GHz Intel Pentium 4 processor and 1.0 GB of RAM. Each page is described by its real pattern, and the corresponding importance of changes is generated according to this pattern. In addition, the following parameters are set: the number of pages per collection, the duration of the simulation, the number of periods in the patterns, and the amount of allocated resources (i.e. the maximum number of sites (or pages) that can be captured per time period).
7.1 Coherence-oriented Crawling Experiments
We evaluated the coherence obtained by our Pattern strategy (cf. Algorithm 1) against the following related crawl policies: Relevance [7], which downloads the most important sites and pages first, in a fixed order based on PageRank; SHARC [9], which repeatedly selects, in a fixed order, the sites to be crawled and then downloads each entire site while ensuring that the most frequently changing pages are downloaded close to the middle of the capture interval; Coherence [13], which repeatedly downloads sites in a circular order and, within a site, starts by crawling the pages that have the lowest probability of causing incoherence in the archive; and Frequency [5], which selects sites in a circular order and crawls pages according to their frequency of changes estimated by a Poisson model [6]. All experiments conducted to evaluate these strategies were done under the same conditions (i.e. a maximum of $M$ sites can be captured at each period $T$).
Fig. 6. Weighted Coherence
Figure 6 shows the weighted coherence (cf. Section 3) obtained by the different strategies with respect to the percentage of sites crawled per period, $M = 10\%$ to $50\%$. We varied the number $\eta$ of top pages considered to evaluate the coherence of a site ($\eta = 50\%, 100\%$). As we can see, our Pattern strategy, which crawls collections according to their off-peak periods, outperforms its competitors SHARC, Coherence, Relevance and Frequency. It improves coherence by around 10% independently of the percentage of sites crawled per period. This improvement is even more visible when the patterns of the collections differ significantly from one another.
7.2 Coherence-oriented Navigation Experiments
Similarly to the crawling experiments, we implemented our navigation approach (cf. Section 6) over a simulated archive based on the real patterns obtained from French TV channel pages. The experiment consists in simulating the navigation from a source page $P_s$ to different destination pages $P_d$ by following all outgoing links from $P_s$. In addition, we implemented two related navigation strategies: Nearest and Recent. The Nearest policy navigates to the version closest to the query time $t_q$. The Recent policy navigates to the closest version before the query time $t_q$. The coherence of our navigation policy Pattern is compared to the Nearest and Recent strategies based on Definition 1 of Section 3. As we use a simulator, we know from the beginning of the experiments which version of the destination page is the most coherent. For each strategy, we count how many times the most coherent version (i.e. the version satisfying the coherence constraint (1)) is chosen, and this number is then divided by the total number of outgoing links in the source page.
Fig. 7. Coherence-oriented Navigation
Figure 7 shows the percentage of coherent versions obtained by the different strategies (Pattern, Nearest, Recent) with respect to the total number of outgoing links of the source page. As presented on the horizontal axis, the number of outgoing links from the source page $P_s$ varies from 10 to 100. We have included in brackets the percentage of cases where the destination page version $P_d[t]$ nearest to the query time $t_q$ is incoherent with the source page version $P_s[t_q]$. It is important to point out that the percentage of incoherent cases presented in brackets is computed as an average over several executions of the simulated experiments. For example, a source page with 70 outgoing links at time $t_q$ has about 20.7% of links for which the nearest version of the destination page is incoherent. As seen in Figure 7, our pattern-based navigation policy outperforms its competitors Nearest and Recent. It improves coherence by around 10% compared to Nearest and by around 40% compared to Recent. These results are not only significant but also important, since navigation is one of the main tools used by archive users such as historians, journalists, etc.
8 Conclusion and Future Work
This paper addresses the important issue of improving the coherence of archives under limited resources. We proposed two solutions: a priori and a posteriori. The a priori solution adjusts the crawler's strategy to improve archive coherence by using patterns. We have demonstrated that reordering the collections of web pages to crawl according to their off-peak periods can improve archive coherence by around 10% compared to current policies in use. Moreover, as an a posteriori solution, we proposed a novel browsing approach using patterns that enables users to navigate through the most coherent page versions. Experimental results have shown that our approach can improve coherence during navigation by around 10% compared to the related policy Nearest and by around 40% compared to Recent. To the best of our knowledge, this work is the first to exploit patterns to improve the coherence of crawling and navigation. As a future direction, we intend to test the two proposed solutions on real data. Our challenge is to enable users to navigate through the most coherent versions in reasonable time. Further study is needed to evaluate how far users perceive the coherence improvements when they navigate in the archive.
References
1. M. Ben Saad and S. Gançarski. Using visual pages analysis for optimizing web archiving. In EDBT/ICDT PhD Workshops, Lausanne, Switzerland, 2010.
2. M. Ben Saad and S. Gançarski. Archiving the Web using page changes patterns: A case study. In ACM/IEEE Joint Conference on Digital Libraries (JCDL '11), Ottawa, Canada, 2011.
3. B. Brewington and G. Cybenko. How dynamic is the web? In WWW '00: Proceedings of the 9th International Conference on World Wide Web, pages 257-276, 2000.
4. A. Brokes, L. Coufal, Z. Flashkova, J. Masanès, J. Oomen, R. Pop, T. Risse, and H. Smulders. Requirement analysis report: Living Web Archive. Technical Report FP7-ICT-2007-1, 2008.
5. J. Cho and H. Garcia-Molina. Effective page refresh policies for web crawlers. ACM Trans. Database Syst., 28(4):390-426, 2003.
6. J. Cho and H. Garcia-Molina. Estimating frequency of change. ACM Trans. Internet Technol., 3(3):256-290, 2003.
7. J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In Computer Networks and ISDN Systems, pages 161-172, 1998.
8. H. Van de Sompel, M. L. Nelson, R. Sanderson, L. Balakireva, S. Ainsworth, and H. Shankar. Memento: Time travel for the web. CoRR, abs/0911.1112, 2009.
9. D. Denev, A. Mazeika, M. Spaniol, and G. Weikum. SHARC: Framework for quality-conscious web archiving. Proc. VLDB Endow., 2(1):586-597, 2009.
10. A. Jatowt, Y. Kawai, S. Nakamura, Y. Kidawara, and K. Tanaka. A browser for browsing the past web. In Proceedings of the 15th International Conference on World Wide Web, WWW '06, pages 877-878, New York, NY, USA, 2006.
11. C. Olston and S. Pandey. Recrawl scheduling based on information longevity. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, pages 437-446, New York, NY, USA, 2008.
12. Z. Pehlivan, M. Ben Saad, and S. Gançarski. Vi-DIFF: Understanding web pages changes. In 21st International Conference on Database and Expert Systems Applications (DEXA '10), Bilbao, Spain, 2010.
13. M. Spaniol, D. Denev, A. Mazeika, G. Weikum, and P. Senellart. Data quality in web archiving. In WICOW '09: Proceedings of the 3rd Workshop on Information Credibility on the Web, pages 19-26, New York, NY, USA, 2009.
14. M. Spaniol, A. Mazeika, D. Denev, and G. Weikum. "Catch me if you can": Visual analysis of coherence defects in web archiving. In 9th International Web Archiving Workshop (IWAW 2009), pages 27-37, Corfu, Greece, 2009.
15. J. Teevan, S. T. Dumais, D. J. Liebling, and R. L. Hughes. Changing how people view changes on the web. In UIST '09: Proceedings of the 22nd Annual ACM Symposium on User Interface Software and Technology, pages 237-246, 2009.
The Web is a dynamic information environment. Web content changes regularly and people revisit Web pages frequently. But the tools used to access the Web, including browsers and search engines, do little to explicitly support these dynamics. In this paper we present DiffIE, a browser plug-in that makes content change explicit in a simple and lightweight manner. DiffIE caches the pages a person visits and highlights how those pages have changed when the person returns to them. We describe how we built a stable, reliable, and usable system, including how we created compact, privacy-preserving page representations to support fast difference detection. Via a longitudinal user study, we explore how DiffIE changed the way people dealt with changing content. We find that much of its benefit came not from exposing expected change, but rather from drawing attention to unexpected change and helping people build a richer understanding of the Web content they frequent.