Fig. 8: Change Detection Execution Time
Source publication
Conference Paper
Full-text available
Nowadays, many applications are interested in detecting and discovering changes on the web to help users understand page updates and, more generally, web dynamics. Web archiving is one of these fields where detecting changes on web pages is important. Archiving institutes are collecting and preserving different web site versions for future ge...

Context in source publication

Context 1
... the whole process is executed 10,000 times and the results are averaged (normalized). The results are given in Figure 8. ...
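The context above only states that the measurement was repeated and averaged. Below is a minimal sketch of such a timing protocol in Python; detect_changes is a stand-in for the actual Vi-DIFF delta computation, and the version strings and run count are purely illustrative.

import statistics
import time

def detect_changes(old_version, new_version):
    """Placeholder for the change detection step being timed (e.g. a delta computation)."""
    return old_version != new_version

def mean_execution_time_ms(old_version, new_version, runs=10_000):
    """Run the detection `runs` times and return the average execution time in milliseconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        detect_changes(old_version, new_version)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)

print(f"{mean_execution_time_ms('<html>v1</html>', '<html>v2</html>'):.4f} ms")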

Citations

... It compares web pages by first creating visual semantic blocks based on page segmentation and then comparing the block nodes of the two documents. It then applies a change detection technique that computes the delta between the two semantic versions of the web pages and thus identifies the changes that have occurred [12]. ...
Article
Full-text available
In this competitive era, one needs to stay updated with all the information required for professional and personal growth. Due to the vast amount of information, however, it is difficult to cope with ever-changing data. The WWW is the main source of data. It consists of millions of web pages, some of which are static while others are dynamic. The contents of dynamic web pages, such as news websites, stock prices and weather broadcasts, change frequently. The changes include the insertion of new data, the deletion of data, or the modification of existing data. To detect the changes occurring in these web pages, various change detection tools and algorithms have been developed. This paper discusses the categories of changes occurring on structured documents and the corresponding algorithms and tools that can be applied to retrieve the relevant changes.
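As a rough illustration of the three change categories mentioned above (insertion, deletion, modification), the following Python sketch labels differences between two page versions represented as sequences of block texts. It uses the standard-library difflib rather than any of the surveyed algorithms, and the sample blocks are invented.

import difflib

def classify_changes(old_blocks, new_blocks):
    """Label differences between two block sequences as insertion, deletion, or modification."""
    matcher = difflib.SequenceMatcher(a=old_blocks, b=new_blocks)
    changes = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "insert":
            changes.append(("insertion", [], new_blocks[j1:j2]))
        elif op == "delete":
            changes.append(("deletion", old_blocks[i1:i2], []))
        elif op == "replace":
            changes.append(("modification", old_blocks[i1:i2], new_blocks[j1:j2]))
    return changes

old = ["headline: markets steady", "weather: sunny", "footer"]
new = ["headline: markets rally", "weather: sunny", "breaking: new report", "footer"]
for kind, before, after in classify_changes(old, new):
    print(kind, before, "->", after)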
... It is an adaptation of the V-DIFF format to our context [13]. • Get the segmentation in MoB HTML format: returns the original HTML with the segmentation information included. ...
Article
Full-text available
Web page segmentation is an important task in Web page analysis. The objective is to divide a Web page into blocks, each one representing a coherent part (or segment) of the content. In this work we describe the development of the Manual-design of Blocks (MoB) tool. We also describe how to obtain a ground truth of segmentations and how to compute the "best manual segmentation". The best manual segmentation is defined based on our experience and the data obtained; in this investigation we define one way to obtain it, although we do not consider it the only way. The best segmentation is then available for use in the evaluation of segmentation algorithms with the Block-o-Matic framework. A Web API and a Web repository for managing the data are also described. Acceptance test results are presented in this document.
... X-Diff: an effective change detection algorithm for XML documents [Wang et al. 2003]. Vi-DIFF [Pehlivan et al. 2010]: detects content and structural changes, including the visual representation of webpages. ...
... Vi-DIFF: Understanding Webpages Changes [Pehlivan et al. 2010]. Level order traversal [Yadav et al. 2007]: a breadth-first traversal algorithm that considers changes in the document tree in order to detect them. ...
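The level-order idea referenced above can be sketched as a breadth-first walk over two document trees that reports tag, text and child-count differences level by level. The sketch below is an illustrative approximation in Python using the standard library, not the algorithm of Yadav et al. (2007) or Vi-DIFF, and the sample documents are invented.

from collections import deque
from xml.etree import ElementTree as ET

def level_order_diff(old_root, new_root):
    """Walk two document trees breadth-first and report tag, text and child-count differences."""
    diffs = []
    queue = deque([(old_root, new_root, old_root.tag)])
    while queue:
        old_node, new_node, path = queue.popleft()
        if old_node.tag != new_node.tag:
            diffs.append(f"{path}: tag {old_node.tag!r} -> {new_node.tag!r}")
        if (old_node.text or "").strip() != (new_node.text or "").strip():
            diffs.append(f"{path}: text changed")
        old_children, new_children = list(old_node), list(new_node)
        if len(old_children) != len(new_children):
            diffs.append(f"{path}: child count {len(old_children)} -> {len(new_children)}")
        for old_child, new_child in zip(old_children, new_children):
            queue.append((old_child, new_child, f"{path}/{old_child.tag}"))
    return diffs

v1 = ET.fromstring("<page><h1>News</h1><p>Markets steady</p></page>")
v2 = ET.fromstring("<page><h1>News</h1><p>Markets rally</p><p>More</p></page>")
print(level_order_diff(v1, v2))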
Preprint
Full text can be found here: https://arxiv.org/abs/1901.02660 -- The majority of currently available webpages are dynamic in nature and are changing frequently. New content gets added to webpages, and existing content gets updated or deleted. Hence, people find it useful to be alert for changes in webpages that contain information that is of value to them. In the current context, keeping track of these webpages and getting alerts about different changes have become significantly challenging. Change Detection and Notification (CDN) systems were introduced to automate this monitoring process, and to notify users when changes occur in webpages. This survey classifies and analyzes different aspects of CDN systems and different techniques used for each aspect. Furthermore, the survey highlights the current challenges and areas of improvement present within the field of research.
... In another study, Law et al. (2012) presented a method to learn visual similarity in order to detect whether successive versions of web pages were similar or not. They investigated structural and visual similarities with the help of vision-based dense SIFT (Lowe, 2004) descriptors computed on page screenshots and of difference trees returned by the VI-DIFF (Pehlivan et al., 2010) algorithm. Furthermore, they used only the topmost 1000 pixels of the screenshots, on the assumption that most Internet users make their similarity judgment just by looking at the top of the web pages. ...
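As an illustration of the kind of visual features described above, the sketch below computes SIFT descriptors on a fixed grid over the topmost band of a page screenshot. It assumes opencv-python version 4.4 or later, where SIFT_create is available in the main module; the grid step and band height are illustrative, not the parameters used by Law et al.

import cv2

def dense_sift_top(screenshot_path, top_pixels=1000, step=16):
    """Compute SIFT descriptors on a fixed grid over the topmost band of a page screenshot."""
    image = cv2.imread(screenshot_path, cv2.IMREAD_GRAYSCALE)
    if image is None:
        raise FileNotFoundError(screenshot_path)
    band = image[:min(top_pixels, image.shape[0]), :]            # keep only the topmost rows
    keypoints = [cv2.KeyPoint(float(x), float(y), float(step))   # fixed-scale grid keypoints
                 for y in range(step // 2, band.shape[0], step)
                 for x in range(step // 2, band.shape[1], step)]
    sift = cv2.SIFT_create()
    _, descriptors = sift.compute(band, keypoints)
    return descriptors                                            # array of shape (num_keypoints, 128)

# descriptors of two versions of the same page could then feed a learned similarity model:
# desc_old = dense_sift_top("version_old.png")
# desc_new = dense_sift_top("version_new.png")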
Article
In this paper, we propose a ranking approach that considers visual similarities among web pages by using structure and vision-based features. Throughout the study, we aim to understand and represent the visual structure of web pages the way people do, by focusing on layout similarity through the wireframe design. The study is composed of two parts. In the first part, structural similarities are analyzed with the proposed concept of "layout components" along with visual inspection of DOM trees. In this way, five types of structural layout components are proposed and revealed. Moreover, whitespaces are also utilized since they are important cues in the visual perception of web pages. In the second part, a computer-vision-based method named histogram of oriented gradients (HOG) is employed to reveal local visual cues in terms of edge orientations. Following the feature extraction phases, the extracted feature histograms are mapped onto a spatial-information-preserving, multilevel and multi-resolution bag-of-features representation named spatial pyramid matching. In this way, three goals were achieved: (1) the visual layouts of web pages were mapped and compared in a multi-resolution scheme; (2) the intermediate step of visual segmentation was removed; and (3) efficient and easily comparable web page layout signatures were generated. We also conducted a questionnaire study covering 312 subjects. This helped us create a benchmark dataset of similarity scores collected from individuals; so far, no corpus oriented to web page layout similarity ranking exists in the literature. Our suggested approach achieved a remarkable ranking performance for top-5 and top-10 retrieval results. According to the findings of the comparative study, our approach outperforms several structure- and vision-based studies in the literature. With this achievement, web pages can be employed as query items to find other, similar web pages, treated as web pages rather than as mere images.
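A condensed sketch of the HOG-plus-spatial-pyramid idea follows, assuming grayscale screenshots as NumPy arrays and scikit-image for the HOG computation; the histogram-intersection score can then be used to rank pages. The resize target, pyramid depth and cell sizes are illustrative choices, not the paper's parameters.

import numpy as np
from skimage.feature import hog
from skimage.transform import resize

def pyramid_hog(screenshot, levels=2):
    """HOG histograms over a coarse spatial pyramid: whole page, then a 2x2 grid, concatenated."""
    image = resize(screenshot, (512, 512), anti_aliasing=True)  # fixed size for comparable signatures
    features = []
    for level in range(levels):
        cells = 2 ** level
        step_y, step_x = image.shape[0] // cells, image.shape[1] // cells
        for i in range(cells):
            for j in range(cells):
                patch = image[i * step_y:(i + 1) * step_y, j * step_x:(j + 1) * step_x]
                features.append(hog(patch, orientations=9, pixels_per_cell=(32, 32),
                                    cells_per_block=(1, 1), feature_vector=True))
    return np.concatenate(features)

def histogram_intersection(a, b):
    """Similarity in [0, 1] used for ranking: higher means more similar layouts."""
    a = a / (a.sum() + 1e-9)
    b = b / (b.sum() + 1e-9)
    return float(np.minimum(a, b).sum())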
... In the context of Web archiving, segmentation can be used to extract interesting parts to be stored. By giving relative weights to blocks according to their importance, it also allows for detecting important changes (changes in important blocks) between page versions [12]. This is useful for crawl optimization, as it permits tuning crawlers so that they revisit pages with important changes more often [15]. ...
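The weighting idea in the snippet above can be sketched as a score that aggregates per-block change magnitudes by block importance and maps it to a revisit interval. The weights, sample values and the linear mapping below are invented for illustration and are not taken from the cited works.

def change_importance(changed_blocks, block_weights):
    """Aggregate per-block change magnitudes (0..1), weighted by the relative importance of each block."""
    total = sum(block_weights.values()) or 1.0
    return sum(block_weights.get(block, 0.0) * magnitude
               for block, magnitude in changed_blocks.items()) / total

def revisit_interval_days(importance, base_days=30.0, min_days=1.0):
    """Shrink the crawler's revisit interval for pages whose important blocks change a lot."""
    return max(min_days, base_days * (1.0 - importance))

weights = {"main-content": 0.7, "menu": 0.1, "ads": 0.2}   # hypothetical block importances
changes = {"main-content": 0.8, "ads": 0.5}                # fraction of each block that changed
score = change_importance(changes, weights)
print(score, revisit_interval_days(score))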
Conference Paper
Full-text available
Web archives are not exempt from format obsolescence. In the near future, Web pages written in the HTML4 format could become obsolete. We will have to choose between two preservation strategies: emulation or migration. The first option is the most evident; however, due to the size of the Web and the amount of information that Web archives handle, it is not practical. On the other hand, migration to the HTML5 format seems plausible. This is a challenge because we need to modify a page (in HTML4 format) and include elements that do not even exist in that format (such as the HTML5 semantic elements). Using Web page segmentation, we show that, with the appropriate granularity, blocks resemble these semantic elements. We present the use of our segmentation tool, BoM (Block-o-Matic), for helping achieve the migration of Web pages from the HTML4 format to the HTML5 format in the context of Web archives. We also present an evaluation framework for Web page segmentation that helps produce the metrics needed to compare the original and migrated versions. If both versions are similar, the migration has been successful. We show the experiments and results obtained on a sample of 40 pages. We made the manual segmentations for each page using our MoB tool. Results show that there is no data loss in the migration process, but in the migrated version (after adding the semantic elements) the margin is changed. That is, whitespace is added that changes element positions, shifting elements slightly on the page. While this is imperceptible to the human eye, it is difficult for systems to handle without prior knowledge of this situation.
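A toy sketch of the block-to-semantic-element migration described above is given below, using BeautifulSoup. The mapping from block labels to HTML5 elements and the CSS selectors are hypothetical; BoM's actual pipeline works on rendered segmentations, not on hand-written selectors.

from bs4 import BeautifulSoup

# hypothetical mapping from detected block labels to HTML5 semantic elements
BLOCK_TO_SEMANTIC = {"header": "header", "menu": "nav", "content": "article", "footer": "footer"}

def migrate_blocks(html4, blocks):
    """Wrap each segmented block (located here by a CSS selector) in an HTML5 semantic element."""
    soup = BeautifulSoup(html4, "html.parser")
    for label, selector in blocks.items():
        semantic = BLOCK_TO_SEMANTIC.get(label)
        node = soup.select_one(selector)
        if semantic and node is not None:
            node.wrap(soup.new_tag(semantic))   # content is preserved, only the wrapper is added
    return str(soup)

page = '<div id="top">Logo</div><div id="menu"><a href="/">Home</a></div><div id="body">Text</div>'
print(migrate_blocks(page, {"header": "#top", "menu": "#menu", "content": "#body"}))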
... (changes in important blocks) from distinct versions of a page [12]. This is useful for crawl optimization, as it allows tuning of crawlers so that they revisit pages with important changes more often [15]. ...
Conference Paper
Full-text available
In this paper, we present a framework for evaluating segmentation algorithms for Web pages. Web page segmentation consists in dividing a Web page into coherent fragments, called blocks. Each block represents one distinct information element in the page. We define an evaluation model that includes different metrics to evaluate the quality of a segmentation obtained with a given algorithm. Those metrics compute the distance between the obtained segmentation and a manually built segmentation that serves as a ground truth. We apply our framework to four state-of-the-art segmentation algorithms (BOM, Block Fusion, VIPS and JVIPS) on several categories (types) of Web pages. Results show that the tested algorithms usually perform rather well for text extraction, but may have serious problems for the extraction of geometry. They also show that the relative quality of a segmentation algorithm depends on the category of the segmented page.
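For the geometric side, a metric of the kind described above can be sketched as an intersection-over-union match between predicted blocks and ground-truth blocks. The 0.5 threshold and the rectangles below are illustrative and do not reproduce the paper's exact metrics.

def iou(a, b):
    """Intersection-over-union of two block rectangles given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union else 0.0

def geometry_score(predicted, ground_truth, threshold=0.5):
    """Fraction of ground-truth blocks matched by some predicted block with IoU above the threshold."""
    matched = sum(1 for gt in ground_truth if any(iou(gt, p) >= threshold for p in predicted))
    return matched / len(ground_truth) if ground_truth else 1.0

ground_truth = [(0, 0, 800, 100), (0, 100, 800, 600)]   # manually built segmentation
predicted    = [(0, 0, 790, 110), (0, 120, 800, 560)]   # output of a segmentation algorithm
print(geometry_score(predicted, ground_truth))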
... In the context of Web archiving, segmentation can be used to extract interesting parts to be stored. By giving relative weights to blocks according to their importance, it also allows for detecting important changes (changes in important blocks) between page versions [PBSG10]. This is useful for crawl optimization, as it permits tuning crawlers so that they revisit pages with important changes more often [SG10]. ...
... They are the input for ViDIFF.jar, which produces a Delta file describing the changes between the two versions according to [PBSG10]. This delta file, together with both XML files, is the input to the Marcalizer component, which gives the final score. ...
... At the end of this step, two XML trees representing the segmented Web pages are returned. The XML format of such trees is called ViXML [PBSG10]. The Web page segmentation is considered only for the structure and hybrid comparison types. ...
Article
Full-text available
Web pages are becoming more complex than ever, as they are generated by Content Management Systems (CMS). Thus, analyzing them, i.e. automatically identifying and classifying different elements of Web pages, such as the main content or menus, becomes difficult. A solution to this issue is provided by Web page segmentation, which refers to the process of dividing a Web page into visually and semantically coherent segments called blocks. The quality of a Web page segmenter is measured by its correctness and its genericity, i.e. the variety of Web page types it is able to segment. Our research focuses on enhancing this quality and measuring it in a fair and accurate way. We first propose a conceptual model for segmentation, as well as Block-o-Matic (BoM), our Web page segmenter. We propose an evaluation model that takes the content as well as the geometry of blocks into account in order to measure the correctness of a segmentation algorithm against a predefined ground truth. The quality of four state-of-the-art algorithms is experimentally tested on four types of pages. Our evaluation framework allows testing any segmenter, i.e. measuring its quality. The results show that BoM presents the best performance among the four segmentation algorithms tested, and also that the performance of segmenters depends on the type of page to segment. We present two applications of BoM. Pagelyzer uses BoM for comparing two Web page versions and decides whether they are similar or not. It is the main contribution of our team to the European project Scape (FP7-IP). We also developed a migration tool for Web pages from the HTML4 format to the HTML5 format in the context of Web archives.
... Several applications and domains that want to keep track of those changes focus on temporal aspects of (usually textual) information on the Web. Some examples of such applications are large-scale information monitoring and delivery systems [Douglis et al., 1998, Liu et al., 2000, Lim and Ng, 2001, Flesca and Masciari, 2003, Jacob et al., 2004], active databases [Jacob et al., 2004], servicing of continuous queries [Abiteboul, 2002], Web cache optimization [Cho and Garcia-Molina, 2000], and Web archiving [Ben Saad and Gançarski, 2011, Pehlivan et al., 2010]. All these applications use change detection methods at the semi-structured data level. ...
... Change detection for webpage archiving had already been investigated at LIP6 [Pehlivan et al., 2010, Ben Saad and Gançarski, 2011] to compare pages via their DOM trees after rendering. In order to extend previous works that exploited visual content via the structural architecture of pages, we propose to integrate computer vision methods into this webpage analysis task. ...
Article
Full-text available
This thesis focuses on distance metric learning for image and webpage comparison. Distance metrics are used in many machine learning and computer vision contexts such as k-nearest-neighbor classification, clustering, support vector machines, information/image retrieval and visualization. In this thesis, we focus on Mahalanobis-like distance metric learning, where the learned model is parameterized by a symmetric positive semidefinite matrix. It learns a linear transformation such that the Euclidean distance in the induced projected space satisfies the learning constraints. First, we propose a method based on comparisons between relative distances that takes rich relations between data into account and exploits similarities between quadruplets of examples. We apply this method to relative attributes and hierarchical image classification. Second, we propose a new regularization method that controls the rank of the learned matrix, limiting the number of independent parameters and overfitting. We show the interest of our method on synthetic and real-world recognition datasets. Eventually, we propose a novel webpage change detection framework in an archiving context. For this purpose, we use temporal distance relations between different versions of the same webpage. The metric, learned in a totally unsupervised way, detects important regions and ignores unimportant content such as menus and advertisements. We show the interest of our method on different websites.
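A minimal sketch of the Mahalanobis-like distance described above: with M = L^T L the matrix is positive semidefinite by construction, and a low-rank L limits the number of independent parameters. The dimensions and the random L below are placeholders; in the thesis, L is learned from temporal distance constraints between versions of a page.

import numpy as np

def mahalanobis_like(x, y, L):
    """d(x, y) = sqrt((x - y)^T M (x - y)) with M = L^T L, positive semidefinite by construction."""
    diff = L @ (x - y)              # equivalent to a Euclidean distance in the projected space
    return float(np.sqrt(diff @ diff))

rng = np.random.default_rng(0)
dim, rank = 128, 16                 # a low-rank L limits the number of independent parameters
L = rng.normal(size=(rank, dim))    # placeholder: in practice L is learned from distance constraints
x_old, x_new = rng.normal(size=dim), rng.normal(size=dim)
print(mahalanobis_like(x_old, x_new, L))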
... The next step is that the user is asked to indicate the semantic order of blocks within the page (flow of the layout). With block and layout flow done, a ViXML [3] document is created and sent to the collector server in conjunction with the web page URL. Thereupon, the capture component receives the URL and gets the rendered source code, the visual cues and the screenshot of the web page. ...
... Web page segmentation is also employed to detect changes in web pages. Vi-Diff (Pehlivan et al., 2010) proposed an approach for change detection in web pages that incorporates web page segmentation. The Vi-Diff model extends the VIPS algorithm for segmentation. ...
Article
Full-text available
Users who visit a web page repeatedly at frequent intervals are more interested in knowing the recent changes that have occurred on the page than in its entire contents. Because of the increased dynamism of web pages, it would be difficult for users to identify the changes manually. This paper proposes an enhanced model for detecting changes in pages, called CaSePer (Change detection based on Segmentation with Personalization). The change detection is micro-managed by introducing web page segmentation. The web page change detection process is made efficient by structuring it as a dual-step process. The proposed method reduces the complexity of change detection by focusing only on the segments in which changes have occurred. User-specific personalized change detection is also incorporated in the proposed model. The model is validated with the help of a prototype implementation. The experiments conducted on the prototype implementation confirm a 77.8% improvement and a 97.45% accuracy rate.
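A rough sketch of a dual-step, segment-focused detection of this kind follows: a cheap fingerprint per segment first flags which segments changed at all, and a finer diff runs only on those. This illustrates the general idea only and is not CaSePer's actual implementation; the segment identifiers and sample pages are invented.

import difflib
import hashlib

def fingerprints(segments):
    """Step 1: a cheap hash per segment to spot which segments changed at all."""
    return {sid: hashlib.sha256(text.encode("utf-8")).hexdigest() for sid, text in segments.items()}

def detect_segment_changes(old, new):
    """Step 2: run a fine-grained diff only on the segments whose fingerprints differ."""
    old_fp, new_fp = fingerprints(old), fingerprints(new)
    report = {}
    for sid in sorted(set(old) | set(new)):
        if old_fp.get(sid) == new_fp.get(sid):
            continue                                    # unchanged segment: skip the costly diff
        diff = difflib.unified_diff(old.get(sid, "").splitlines(),
                                    new.get(sid, "").splitlines(), lineterm="")
        report[sid] = list(diff)
    return report

old_page = {"news": "Markets steady today.", "weather": "Sunny, 20C"}
new_page = {"news": "Markets rally today.", "weather": "Sunny, 20C", "sports": "Final tonight"}
print(detect_segment_changes(old_page, new_page))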