Structural and Visual Comparisons for Web Page Archiving
Marc Teva Law, Nicolas Thome, Stéphane Gançarski, Matthieu Cord
LIP6, UPMC - Sorbonne University, Paris, France
{Marc.Law, Nicolas.Thome, Stephane.Gancarski, Matthieu.Cord}@lip6.fr
ABSTRACT
In this paper, we propose a Web page archiving system that
combines state-of-the-art comparison methods based on the
source codes of Web pages, with computer vision techniques.
To detect whether successive versions of a Web page are
similar or not, our system is based on: (1) a combination
of structural and visual comparison methods embedded in a
statistical discriminative model, (2) a visual similarity mea-
sure designed for Web pages that improves change detection,
(3) a supervised feature selection method adapted to Web
archiving. We train a Support Vector Machine model with
vectors of similarity scores between successive versions of
pages. The trained model then determines whether two ver-
sions, defined by their vector of similarity scores, are similar
or not. Experiments on real archives validate our approach.
Categories and Subject Descriptors
H.3.7 [Information Storage and Retrieval]: Digital Li-
braries
Keywords
Web archiving, Digital preservation, Change detection algo-
rithms, Pattern recognition, Support vector machines
1. INTRODUCTION
With the explosion of available information on the World
Wide Web, archiving the Web is a cultural necessity in pre-
serving knowledge. Most of the time, Web archiving is per-
formed by Web crawlers (bots) that capture Web pages and
the associated media (images, videos, etc.). To update archives, crawlers have to regularly revisit pages, but they generally do not know if or when changes occurred. Crawlers cannot constantly revisit a site and download a new version of a page because they usually have limited resources (bandwidth, storage space, etc.) with respect to the huge number of pages to archive. Hence, it is technically impossi-
ble to maintain a complete history of all the versions of Web
pages of the Web, or even of an important part of it. The problem for archivists is then to optimize crawling so that new versions are captured and/or kept only when changes are important, while limiting the loss of useful information.
A way to optimize crawling is to estimate the behavior of a
site in order to guess when or with which frequency it must
be visited, and thus to study the importance of changes be-
tween successive versions [1]. For instance, the change of
an advertisement link, illustrated in Figures 1(a,b), is not
related to the main information shared by the page. In con-
trast, changes in Figure 1(c) are significant. The crawling of
the second version was thus necessary. In this paper, simi-
larity functions for Web page comparison are investigated.
Most archivists only take into account the Web page source code (code string, DOM tree, etc.) [2] and not the visual rendering [3, 4, 1]. However, the code may not be sufficient to describe the content of Web pages: images are usually defined only by their URL addresses, and scripts may be written in many different languages, which makes them hard to compare. Ben Saad et al. [1] propose to use the tree obtained by
running the VIPS [3] algorithm on the rendered page. They
obtain a rich semantic segmentation into blocks and then
estimate a function of the importance of changes between
page versions by comparing the different blocks. The VIPS
structure of a Web page is a segmentation tree based on its
DOM tree. It detects visual structures in the rendering of a
Web page (e.g. tables) and tries to keep nodes (blocks) as
homogeneous as possible. Two successive paragraphs without HTML tags will tend to be kept in the same node, whereas table elements with different background colors will be separated into different nodes. Image processing methods have
been proposed for Web page segmentation. Cao et al. [5]
preprocess the rendering of Web pages by an edge detection
algorithm, and iteratively divide zones until all blocks are
indivisible. They do not take the source code of Web pages
into account. In the context of phishing detection, Fu et
al. [6] compute similarities between Web pages using color
and spatial visual feature vectors. However, they are only
interested in the detection of exact copies.
In this paper, we investigate structural and visual features to build an efficient page comparison system for Web archiving. We claim that combining structural and visual information is fundamental to obtaining a powerful semantic similarity [7]: structural comparison catches dissimilarity when different scripts have the same rendering or when hyperlinks change, while visual comparison catches it when the code of a page is unchanged but a loaded image was updated. Methods
combining structural and visual features have been proposed for content extraction [8]; they use the relative positions between page elements but no visual appearance features.
Figure 1: Similar and dissimilar versions of Web pages. (a) Similar versions; (b) zoom on the only difference between the versions in (a); (c) dissimilar versions. The versions in (a) share the same information: they are identical except for the links shown in (b) and do not need to be crawled twice. The versions in (c) have the same banner and menus, but the main information of the page has changed, so a second crawl is necessary.
Additionally, we propose a machine learning framework to
set all the similarity parameters and combination weights.
We claim that, in this manner, we obtain a semantic similarity close to archivists' judgement. Our contribution is three-fold: (1) a
complete hybrid Web page comparison framework combin-
ing computer vision and structural comparison methods, (2)
a new measure dedicated to Web archiving that only consid-
ers the visible part of pages without scrolling, (3) a machine
learning based approach for supervised feature selection to
increase prediction accuracy by eliminating noisy features.
2. WEB PAGE COMPARISON SCHEME
Two versions of a given Web page are considered similar if
the changes that occurred between them are not important
enough to archive both of them. They are dissimilar other-
wise (see Figure 1). To compare versions of Web pages, we
first extract features from them as described below.
2.1 Visual descriptors
Important changes between page versions will often pro-
duce differences between the visual rendering of those ver-
sions. We propose to quantify these differences by com-
puting and comparing the visual features in each page ver-
sion. Each version is described as an image of its rendering
capture (snapshot). We compute a visual signature on this
captured image for each page. Images are first described by
color descriptors, because they seem appropriate for Web page changes and are already used in phishing Web page detection [6]. We also incorporate powerful edge-based SIFT descriptors [9], because they give state-of-the-art performance in real image classification tasks.
For image representation, we follow the well-known Bag
of Words (BoW) representation [10, 11]. The vector repre-
sentation of the rendered Web page is computed based on a
sampling of local descriptors, coding and pooling over a vi-
sual dictionary. Recent comparisons for image classification
point out the outstanding performances of a regular dense
sampling [12, 13]. We apply a first strategy, called whole Web page feature, which regularly samples descriptors over the rendering of the whole page. However, the most significant information is certainly not equally distributed over the whole captured Web page. As noted in [14], the most important information is generally located in the visible part of pages without scrolling. A second strategy, called top of Web page feature, provides a visual vector using only the features located in the visible part of the page without scrolling.
Since the visible part of a Web page without scrolling depends on the browser window size, we take a generic window height of 1000 pixels, greater than 90% of users' browser resolutions, to ensure we do not miss information directly visible to most users. In the following sections, we denote the visible part of Web pages without scrolling as the top of Web pages.
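As a rough illustration of this pipeline, the sketch below computes a dense-SIFT bag-of-words signature for the whole snapshot and for its top 1000 pixels. The sampling step, the codebook training, and all function names are assumptions made for illustration, not the authors' implementation (which also uses HSV color descriptors).

```python
# Sketch (assumed parameters): dense SIFT + bag-of-words signatures for the
# "whole Web page" and "top of Web page" strategies.
import cv2
import numpy as np
from sklearn.cluster import KMeans

TOP_HEIGHT = 1000  # visible part without scrolling (value used in the paper)
STEP = 16          # dense sampling step in pixels (assumption)

def dense_sift(gray):
    """Compute SIFT descriptors on a regular grid of keypoints."""
    sift = cv2.SIFT_create()
    grid = [cv2.KeyPoint(float(x), float(y), float(STEP))
            for y in range(STEP // 2, gray.shape[0], STEP)
            for x in range(STEP // 2, gray.shape[1], STEP)]
    _, descriptors = sift.compute(gray, grid)
    return descriptors

def bow_signature(descriptors, codebook):
    """Hard-assign each descriptor to its nearest visual word and pool into a histogram."""
    words = codebook.predict(descriptors)
    return np.bincount(words, minlength=codebook.n_clusters).astype(float)

def page_signatures(snapshot_bgr, codebook):
    """Return the whole-page and top-of-page BoW vectors of a rendered snapshot."""
    gray = cv2.cvtColor(snapshot_bgr, cv2.COLOR_BGR2GRAY)
    whole = bow_signature(dense_sift(gray), codebook)
    top = bow_signature(dense_sift(gray[:TOP_HEIGHT]), codebook)
    return whole, top

# The codebook is learned offline on descriptors pooled from training snapshots:
#   codebook = KMeans(n_clusters=200).fit(stacked_training_descriptors)
```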
2.2 Structural descriptors
We extract various features directly from the code of Web
pages. For instance, we extract Jaccard indices [2], similarity values that indicate how well hyperlinks and image URL addresses are preserved between versions. We assume that similar pages tend to keep the same hyperlinks and images. We also extract features from the difference tree returned by the VI-DIFF algorithm [4], which detects operations between the VIPS structures of two versions, e.g. insertions, deletions or updates of VIPS blocks, as well as a boolean value indicating whether the two versions have the same VIPS structure. The more operations are detected, the less similar the versions are assumed to be. We denote the features extracted from the VI-DIFF algorithm as VI-DIFF features.
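A minimal sketch of these structural similarities is given below: a Jaccard index over hyperlink (or image URL) sets, and a hypothetical summarization of VI-DIFF operation counts into scalar features. The HTML extraction and the VI-DIFF output format are assumptions for illustration; the actual formats used in [2] and [4] may differ.

```python
# Sketch of structural similarity features (names and formats are assumptions).
from bs4 import BeautifulSoup

def jaccard(a, b):
    """Jaccard index between two sets; two empty sets are considered identical."""
    a, b = set(a), set(b)
    return 1.0 if not a and not b else len(a & b) / len(a | b)

def hyperlinks(html):
    soup = BeautifulSoup(html, "html.parser")
    return {tag.get("href") for tag in soup.find_all("a") if tag.get("href")}

def image_urls(html):
    soup = BeautifulSoup(html, "html.parser")
    return {tag.get("src") for tag in soup.find_all("img") if tag.get("src")}

def vidiff_features(ops, n_blocks_a, n_blocks_n, same_structure):
    """ops: assumed dict of VI-DIFF operation counts, e.g. {'insert': 2, 'delete': 0, 'update': 1}."""
    identical_a = max(n_blocks_a - ops.get("delete", 0) - ops.get("update", 0), 0)
    identical_n = max(n_blocks_n - ops.get("insert", 0) - ops.get("update", 0), 0)
    # Symmetrized ratio of identical blocks: the more operations, the lower the score.
    ratio = 0.5 * (identical_a / max(n_blocks_a, 1) + identical_n / max(n_blocks_n, 1))
    return [ratio, float(same_structure), float(sum(ops.values()) == 0)]

# For a pair of versions (V_A, V_N):
#   s_links  = jaccard(hyperlinks(html_a), hyperlinks(html_n))
#   s_images = jaccard(image_urls(html_a), image_urls(html_n))
```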
2.3 Similarity between versions
Let $V_A$ be the last archived version of a Web page and $V_N$ the new version of the same Web page. We extract several visual and structural descriptors (see sections 2.1 and 2.2), and use different metrics (Euclidean, $\chi^2$ distances, etc.) to compare them. Heuristics may be used to set them individually and to select the best similarity function with a manually-tuned threshold to discriminate dissimilar pairs of Web pages from the similar ones.
We propose here an alternate scheme embedding all the similarity functions into a learning framework. Let the $M$ visual feature/metric associations and the $N$ structural similarities be aggregated in a vector $x$. We can write $x^T$ as:
$x^T = [\, s_v^1(V_A, V_N), \dots, s_v^M(V_A, V_N),\; s_s^1(V_A, V_N), \dots, s_s^N(V_A, V_N) \,]$.
We observed that none of the similarities we experimen-
tally extracted presented a trivial individual decision bound-
ary. However, all of them did seem to follow certain ex-
pected patterns, some of them working better than others.
Instead of using them individually, we propose to combine
those different similarities in a binary classification scheme
that returns whether a couple of versions are similar or not
by using x, the vector of their similarity scores. Combining
both approaches then seems appropriate to have a better
understanding of the changes as perceived by human users.
Learning combinations of complementary descriptors also
makes the categorization task more efficient [15]. We in-
vestigate in the next section a statistical learning strategy
based on a labeled dataset to classify the vectors x.
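To make the construction of $x$ concrete, a minimal sketch is given below; the helper names (visual distances and structural similarities) refer to the hypothetical functions sketched in the previous sections, not to an API from the paper.

```python
# Sketch: aggregate the M visual and N structural similarity scores of a
# version pair into the vector x fed to the classifier.
import numpy as np

def similarity_vector(visual_scores, structural_scores):
    """visual_scores: [s_v^1, ..., s_v^M]; structural_scores: [s_s^1, ..., s_s^N]."""
    return np.asarray(list(visual_scores) + list(structural_scores), dtype=float)

# Example with hypothetical feature names:
#   x = similarity_vector(
#       [d_sift_whole, d_sift_top, d_color_whole, d_color_top],
#       [s_links, s_images, *vidiff_feats])
```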
3. CLASSIFICATION FRAMEWORK
We are interested in learning distances [16] between ver-
sions in a supervised framework to determine whether two
versions are similar or not. However, it is not a version classi-
fication problem as in many distance learning problems [17].
Indeed, we do not want to classify samples (versions) but
similarities. Moreover, our similarities are based on human judgement and allow for subtleties, as shown in Figure 1.
We then propose to express the learning of the combination of similarities as a binary classification in similarity space: for any couple of versions $(V_A, V_N)_i$, let their class be $y_i = 1$ if $V_A$ and $V_N$ are similar, and $y_i = -1$ otherwise. Let $x_i$ be a vector derived from heterogeneous similarities between $V_A$ and $V_N$ (as defined in subsection 2.3). We train a linear Support Vector Machine (SVM) to determine $w = \sum_j \alpha_j y_j x_j$ such that the sign of $\langle w, x_i \rangle = \sum_j \alpha_j y_j \langle x_j, x_i \rangle$ gives us the class of $(V_A, V_N)_i$. The similarity vectors $x_j$ of training couples $(V_A, V_N)_j$ are used to train the SVM. For any test couple $(V_A, V_N)_i$, the trained SVM returns (1) whether $y_i = 1$ or $y_i = -1$, (2) whether $V_A$ and $V_N$ are similar or dissimilar, (3) whether $V_N$ needs to be archived or not, given that $V_A$ is already archived. These three propositions are equivalent.
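A sketch of this step, with scikit-learn as a stand-in linear SVM solver, is shown below; the regularization parameter C and the function names are assumptions, not the authors' settings.

```python
# Sketch: train a linear SVM on similarity vectors and use it to decide
# whether a new version V_N must be archived (C value is assumed).
import numpy as np
from sklearn.svm import SVC

def train_similarity_svm(X, y, C=1.0):
    """X: (n_pairs, M + N) matrix of similarity vectors x_i; y: +1 (similar) / -1 (dissimilar)."""
    clf = SVC(kernel="linear", C=C)
    clf.fit(X, y)
    return clf  # clf.coef_.ravel() exposes the primal weights w

def needs_archiving(clf, x):
    """True if the pair is predicted dissimilar, i.e. the new version should be archived."""
    return clf.predict(np.asarray(x, dtype=float).reshape(1, -1))[0] == -1
```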
To study the contributions of the different types of fea-
tures in the discrimination task, we first train a linear SVM
with all the features. Each element $w_k$ of $w$ corresponds to the weight associated with the $k$-th similarity feature of $x$. Therefore, if the learned $w_k$ are close or equal to 0, the corresponding similarity features of $x$ are not determinant for categorization. Such similarities are considered noisy and irrelevant (not discriminant) in determining whether two versions are similar or not. To go one step further, we also propose a more explicit feature selection method based on automatic normal-based feature selection [18], which uses the fact that a feature $k$ whose weight $w_k$ is close to 0 has a smaller effect on the prediction than features with large absolute values of $w_k$. Features with small $|w_k|$ are thus good candidates for
removal. The number of selected features may be set based
on data storage and calculation constraints, or iteratively
reduced using a validation set.
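The normal-based selection can be sketched as follows: rank the features by $|w_k|$ from a first linear SVM, keep the top ones and retrain. The number of kept features and the C value are free parameters (assumptions here), to be set by validation or storage constraints as discussed above.

```python
# Sketch: normal-based feature selection [18] on the similarity features.
import numpy as np
from sklearn.svm import SVC

def select_features(X, y, n_keep, C=1.0):
    """Keep the n_keep features with the largest |w_k| and retrain on them."""
    w = SVC(kernel="linear", C=C).fit(X, y).coef_.ravel()
    keep = np.argsort(np.abs(w))[::-1][:n_keep]   # indices of the largest |w_k|
    clf_reduced = SVC(kernel="linear", C=C).fit(X[:, keep], y)
    return keep, clf_reduced
```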
4. EXPERIMENT RESULTS
4.1 Dataset and settings
We work on a dataset of about 1000 pairs of Web pages
manually annotated as “similar” or “dissimilar”, provided by the Internet Memory Foundation (http://internetmemory.org/). The pages are captured
from many different governmental Web sites from the United
Kingdom about education, health, sport, justice, industry, security, etc. Identical couples of versions are removed and not taken into account in the evaluation. In the end, 202 pairs of Web page versions were kept: 147 similar couples and 55 dissimilar couples (72.8% and 27.2%, respectively).
To compute visual similarities, we use SIFT and HSV
(Hue Saturation Value) color descriptors with visual code-
books of sizes 100 and 200. These are relatively small com-
pared to the sizes used on large image databases but con-
sistent with the size of our dataset; bigger codebooks did not improve our classification task. The BoWs of page versions are computed using the two strategies described in section 2.1: (1) over the rendering of whole Web pages and (2) over the top of Web pages. Euclidean and $\chi^2$ distances are then computed between the BoWs of successive page versions, normalized using the L2-norm and L1-norm, respectively. For each couple of page versions, we also compute the VIPS structures [3] and the VI-DIFF difference trees [4], from which we extract structural similarity values, e.g. the (symmetrized) ratio of identical nodes and boolean values for criteria such as an identical VIPS structure. In the end, we have 16 visual and 25 structural features.
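The two visual distances can be written explicitly as below; the epsilon guard against empty histograms is an assumption, while the normalizations follow the description above.

```python
# Sketch: Euclidean distance on L2-normalized BoWs and chi-squared distance
# on L1-normalized BoWs (eps added to avoid division by zero).
import numpy as np

def euclidean_l2(h1, h2, eps=1e-10):
    h1 = h1 / (np.linalg.norm(h1) + eps)
    h2 = h2 / (np.linalg.norm(h2) + eps)
    return float(np.linalg.norm(h1 - h2))

def chi2_l1(h1, h2, eps=1e-10):
    h1 = h1 / (h1.sum() + eps)
    h2 = h2 / (h2.sum() + eps)
    return float(0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps)))
```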
4.2 Binary classification
We use leave-one-out cross-validation (on the 202 pairs) to
evaluate our model. We compare our results to a baseline classifier that always predicts the most represented class in the dataset, yielding a baseline accuracy of 72.8%.
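The evaluation protocol can be sketched as follows, again with scikit-learn as a stand-in solver; the C value is an assumption.

```python
# Sketch: leave-one-out accuracy of the linear SVM on the labeled pairs.
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

def loo_accuracy(X, y, C=1.0):
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="linear", C=C).fit(X[train_idx], y[train_idx])
        correct += int(clf.predict(X[test_idx])[0] == y[test_idx][0])
    return correct / len(y)
```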
Evaluation of visual features.
Whole Web page | Top of Web page | Accuracy (%)
None           | SIFT            | 84.2
None           | color           | 82.7
None           | SIFT + color    | 87.1
SIFT           | None            | 79.7
color          | None            | 80.7
SIFT + color   | None            | 83.2
SIFT + color   | SIFT + color    | 85.1
Table 1: Visual feature classification performances. The first two columns indicate the selected visual features.
We first use only the visual information of pages; the structural similarities of $x$ are ignored. The accuracies when
selecting different subsets of local descriptors (SIFT and
color) sampled on whole pages or top of pages are presented
in Table 1. SIFT and color descriptors achieve good per-
formances for Web page change detection. Using the top of pages (87.1%) is much more discriminant than using whole pages (83.2%), and combining both strategies (85.1%) still performs worse than using the top of pages alone (87.1%). Important changes are more likely to be directly observable at the top of pages, whereas changes at the bottom of Web pages, often advertisements, tend to be less important and noisier. The obtained accuracies validate our approach.
Evaluation of structural features.
Jaccard Indices | VI-DIFF | Accuracy (%)
Yes             | No      | 85.1
No              | Yes     | 76.7
Yes             | Yes     | 87.6
Table 2: Structural feature classification performances. The first two columns indicate the selected structural features.
We study in Table 2 the accuracies obtained when only subsets of the structural similarities are used. Jaccard indices of links are the most discriminant structural features (85.1%), but the other structural features extracted from VI-DIFF are still informative, about 4% better than the baseline classifier.
Structural and visual feature combination evaluation.
We investigate the combination of structural and visual
features in Table 3. The accuracy when combining all of
them (90.1%) is better than when using only structural
(87.6%) or visual (87.1%) features. Visual and structural features are thus complementary.
Structural      | Visual          | Accuracy (%)
All             | All             | 90.1
All             | Top of Web page | 92.1
Jaccard indices | All             | 91.6
Jaccard indices | Top of Web page | 93.1
Table 3: Structural and visual feature classification performances. The first two columns indicate the selected feature similarities.
Furthermore, we combine in Table 3 the visual and structural features that gave the best accuracies in the previous sections. An exhaustive manual selection among all 41 structural and visual features to find the set that maximizes prediction accuracy would be too time-consuming. The accuracy improves to 93.1% when combining only the Jaccard indices of links and the top-of-page visual representations.
Concerning misclassified examples, we observed that many dissimilar pairs of versions predicted as similar were news pages in which old news items were shifted towards the bottom of the page by more recent ones. These shifts do not affect the BoW distances, since we do not take the spatial position of image patches into account. Many similar pairs of versions predicted as dissimilar contained a lot of new irrelevant hyperlinks (significantly more than in Figure 1(b)). A better detection of important regions and of their shifts in position could improve the decision by ignoring the related visual and structural comparisons.
Figure 2: Feature selection performances.
We also investigate the automatic normal-based feature selection method described in section 3, corresponding to the blue curve in Figure 2. The best accuracy obtained with this automatic method is 92.6%, reached when the 13 to 15 features with the highest absolute values in $w$ are selected. It is comparable to our best accuracy of 93.1% (Table 3 and the red cross in Figure 2), obtained with 10 selected features.
5. CONCLUSION
We have proposed a complete Web page comparison framework that is effective for Web archiving. We combine structural and visual features to understand the behavior of Web sites and estimate when, or with which frequency, they must be visited. We confirm that structural as well as visual information is useful for change detection. We explore several features and similarities; one of the main results is that important changes generally appear in the visible part of Web pages without scrolling. Moreover, we propose a new scheme that learns an optimal similarity combination as a classification problem. Experiments on real Web pages validate our strategy: a large set of pages with labels provided by archivists has been used for a quality evaluation of our visual and structural similarity method.
Acknowledgments. This work was partially supported
by the SCAPE Project. The SCAPE Project is co-funded
by the European Union under FP7 ICT-2009.4.1.
6. REFERENCES
[1] M. Ben Saad, S. Gançarski, and Z. Pehlivan, “A novel
web archiving approach based on visual pages
analysis,” in IWAW 2009.
[2] M. Oita and P. Senellart, “Deriving dynamics of web
pages: A survey,” in TWAW, March 2011.
[3] D. Cai, S. Yu, J. Wen, and W. Ma, “Vips: a
vision-based page segmentation algorithm,” Microsoft
Technical Report, MSR-TR-2003-79-2003, 2003.
[4] Z. Pehlivan, M. Ben Saad, and S. Gançarski,
“Vi-DIFF: Understanding Web Pages Changes,” in
DEXA 2010.
[5] J. Cao, B. Mao, and J. Luo, “A segmentation method
for web page analysis using shrinking and dividing,”
JPEDS, vol. 25, 2010.
[6] A.Y. Fu, L. Wenyin, and X. Deng, “Detecting phishing
web pages with visual similarity assessment based on
earth mover’s distance (emd),” TDSC, vol. 3, 2006.
[7] N. Thome, D. Merad, and S. Miguet, “Learning
articulated appearance models for tracking humans: A
spectral graph matching approach,” Signal Processing:
Image Communication, vol. 23, no. 10, 2008.
[8] A. Spengler and P. Gallinari, “Document structure
meets page layout: Loopy random fields for web news
content extraction,” in DocEng, 2010.
[9] D. Lowe, “Distinctive image features from
scale-invariant keypoints,” IJCV, vol. 60, 2004.
[10] W.Y. Ma and B.S. Manjunath, “Netra: A toolbox for
navigating large image databases,” in ICIP 1997.
[11] J. Fournier, M. Cord, and S. Philipp-Foliguet, “Retin:
A content-based image indexing and retrieval system,”
PAA, vol. 4, no. 2, pp. 153–173, 2001.
[12] S. Avila, N. Thome, M. Cord, E. Valle, and A. Araújo,
“Bossa: Extended bow formalism for image
classification,” in ICIP 2011.
[13] K. Chatfield, V. Lempitsky, A. Vedaldi, and
A. Zisserman, “The devil is in the details: an
evaluation of recent feature encoding methods,”
BMVC, 2011.
[14] R. Song, H. Liu, J.R. Wen, and W.Y. Ma, “Learning
block importance models for web pages,” in WWW
2004.
[15] D. Picard, N. Thome, and M. Cord, “An efficient
system for combining complementary kernels in
complex visual categorization tasks,” in ICIP 2010.
[16] L. Yang and R. Jin, “Distance metric learning: A
comprehensive survey,” Michigan State University, pp.
1–51, 2006.
[17] A. Frome, Y. Singer, and J. Malik, “Image retrieval
and classification using local distance functions,” in
NIPS 2006.
[18] D. Mladenić, J. Brank, M. Grobelnik, and
N. Milic-Frayling, “Feature selection using linear
classifier weights: interaction with classification
models,” in SIGIR 2004.