Fig 2: Coherence Example [13]

Source publication
Conference Paper
Full-text available
We point out, in this paper, the issue of improving the coherence of web archives under limited resources (e.g. bandwidth, storage space, etc.). Coherence measures how much a collection of archived page versions reflects the real state (or the snapshot) of a set of related web pages at different points in time. An ideal approach to preserve the co...

Context in source publication

Context 1
... following the capture of the version P_i[t]. As shown in Figure 2, the three versions P_i1[t_1], P_i2[t_2] and P_i3[t_3] are coherent because there is an interval t_coherence that satisfies the coherence constraint (1). However, the three page versions at the right are not coherent because there is no point in time satisfying the coherence constraint ...
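To make the constraint concrete, here is a minimal sketch, assuming each captured version P_ij[t_j] is valid over an interval running from its last modification to its next modification: the capture set is coherent exactly when those validity intervals share a common instant t_coherence. The interval values below are hypothetical.

from datetime import datetime

def coherent(validity_intervals):
    """Check whether a set of captured page versions is mutually coherent.

    Each version is assumed valid from its last modification up to its next
    modification; the capture set is coherent if some instant t_coherence
    lies inside every validity interval (non-empty intersection).
    """
    latest_start = max(start for start, _ in validity_intervals)
    earliest_end = min(end for _, end in validity_intervals)
    return latest_start <= earliest_end

# Hypothetical validity intervals for P_i1[t_1], P_i2[t_2], P_i3[t_3]
versions = [
    (datetime(2011, 5, 1, 10, 0), datetime(2011, 5, 1, 12, 0)),
    (datetime(2011, 5, 1, 10, 30), datetime(2011, 5, 1, 11, 30)),
    (datetime(2011, 5, 1, 11, 0), datetime(2011, 5, 1, 13, 0)),
]
print(coherent(versions))  # True: 11:00-11:30 satisfies the constraint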

Similar publications

Article
Full-text available
Circular RNAs (circRNAs), a new class of regulatory noncoding RNAs, play important roles in human diseases. While a growing number of circRNAs have been characterized with biological functions, it is necessary to integrate all the information to facilitate studies on circRNA functions and regulatory networks in human diseases. Circ2Disease database...
Article
Full-text available
Lexical platform – the first step towards user-centred integration of lexical resources. The paper describes the Lexical Platform - a means for lightweight integration of independent lexical resources. Lexical resources (LRs) are represented as web components th...
Article
Full-text available
Video resources are pervasive nowadays. Since it is time-consuming for the user to download and browse videos, video summarization techniques, which aim at providing a way for the user to grasp the major content of the video without browsing the whole video, have become more and more important. In this paper, we present a feature-based video summa...

Citations

... Conventional web crawling and archiving, as done by applications such as Heritrix [32] and wget [17], downloads an HTML page, extracts the URLs in the page (both links and embedded resources), adds those URLs to a crawl frontier, and then crawls the next URL in the frontier. The frontier can be ordered breadth-first or depth-first, with most crawlers configured for breadth-first crawling in order to improve site-level consistency [10,11,16,37]. For example, all of the pages linked from the top-level page (level 0) are crawled, then all the pages linked from those pages (level 1) are crawled, etc. ...
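As an illustration of that crawl order, the following sketch implements a breadth-first frontier with a FIFO queue; fetch and extract_urls are hypothetical stand-ins for the downloading and link-extraction steps, not Heritrix or wget APIs.

from collections import deque
from urllib.parse import urljoin

def crawl_breadth_first(seed, fetch, extract_urls, max_pages=100):
    """Sketch of a conventional breadth-first archival crawl.

    fetch(url) is assumed to return the page body; extract_urls(body)
    returns the links and embedded-resource URLs found in it.
    """
    frontier = deque([seed])          # FIFO queue gives breadth-first order
    seen = {seed}
    archived = {}
    while frontier and len(archived) < max_pages:
        url = frontier.popleft()
        body = fetch(url)
        archived[url] = body
        for link in extract_urls(body):
            absolute = urljoin(url, link)
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)   # level k pages queued before level k+1
    return archived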
... "800": [{" id ": " homepage -injection -zone -1" , 9 " uri ": " _homepage -zone -injection / index . html "} , 10 {" id ": " homepage1 -zone -1"} , 11 {" id ": " homepage -injection -zone -2" , 12 " uri ": " _homepage -zone -injection / index . html "} , 13 {" id ": " homepage2 -zone -1"} , 14 {" id ": " homepage3 -zone -1"} , 15 {" id ": " homepage4 -zone -1"} , 16 {" id ": " homepage4 -zone -2"} , 17 {" id ": " homepage4 -zone -3"} , 18 {" id ": " homepage4 -zone -4"} , 19 {" id ": " homepage4 -zone -5"} , 20 {" id ": " homepage4 -zone -6"} , 21 {" id ": " homepage4 -zone -7"}] 22 }}}; to the left of each section, and the data-zone-label (found in the HTML) is shown on the right. ...
Preprint
Full-text available
Many web sites are transitioning how they construct their pages. The conventional model is where the content is embedded server-side in the HTML and returned to the client in an HTTP response. Increasingly, sites are moving to a model where the initial HTTP response contains only an HTML skeleton plus JavaScript that makes API calls to a variety of servers for the content (typically in JSON format), and then builds out the DOM client-side, more easily allowing for periodically refreshing the content in a page and allowing dynamic modification of the content. This client-side rendering, now predominant in social media platforms such as Twitter and Instagram, is also being adopted by news outlets, such as CNN.com. When conventional web archiving techniques, such as crawling with Heritrix, are applied to pages that render their content client-side, the JSON responses can become out of sync with the HTML page in which they are to be embedded, resulting in temporal violations on replay. Because the violative JSON is not directly observable in the page (i.e., in the same manner a violative embedded image is), the temporal violations can be difficult to detect. We describe how the top level CNN.com page has used client-side rendering since April 2015 and the impact this has had on web archives. Between April 24, 2015 and July 21, 2016, we found almost 15,000 mementos with a temporal violation of more than 2 days between the base CNN.com HTML and the JSON responses used to deliver the content under the main story. One way to mitigate this problem is to use browser-based crawling instead of conventional crawlers like Heritrix, but browser-based crawling is currently much slower than non-browser-based tools such as Heritrix.
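A minimal sketch of the kind of check implied by this finding: compare the Memento-Datetime of the base HTML with that of the JSON used to fill it on replay, and flag the pair when the spread exceeds a threshold. The 2-day threshold and the datetimes below are illustrative, not the authors' exact procedure.

from datetime import datetime, timedelta

def temporal_violation(root_memento_datetime, json_memento_datetime,
                       threshold=timedelta(days=2)):
    """Flag a replayed page whose client-side JSON content is archived far
    from the base HTML into which it is embedded on replay."""
    spread = abs(json_memento_datetime - root_memento_datetime)
    return spread, spread > threshold

spread, violation = temporal_violation(
    datetime(2015, 6, 1, 12, 0),   # Memento-Datetime of the base HTML (hypothetical)
    datetime(2015, 6, 10, 8, 0),   # Memento-Datetime of the JSON response (hypothetical)
)
print(spread, violation)  # 8 days, 20:00:00  True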
... Nevertheless, certain politeness constraints may force a large number of sites to be processed in parallel. But when it is possible, crawling a site only when it is least likely to undergo changes can guarantee the temporal cohesion of the corpus (Saad et al., 2011). Building on all these considerations, the Internet Archive teams presented in 2004 an open-source crawler, Heritrix (Mohr et al., 2004), capable of adapting to various types of collection: broad crawling, continuous crawling, or focused crawling. ...
Thesis
The Web is an ephemeral environment. While new web sites emerge every day, some communities disappear entirely from the surface of the Web, leaving behind only incomplete or even non-existent traces. Faced with the volatility of the live Web, several archiving initiatives nonetheless seek to preserve the memory of the past Web. Yet today one mystery remains: why, even though they have never been so vast and so numerous, are Web archives not already the subject of numerous historical studies? Although initially built to inscribe the memory of the Web on a durable medium, these archives must not be regarded as a faithful representation of the live Web. They are the direct traces of the collection tools that tear them from their original temporality. Starting from this observation, this thesis aims to give researchers the theoretical and technical means to handle the past Web more readily, by defining a new unit for exploring Web archives: the Web fragment, a coherent and self-sufficient subset of a Web page. To do so, we build on the pioneering work of the e-Diasporas Atlas, which in the 2000s made it possible to map and archive several thousand migrant web sites. As the main data source from which we develop our analysis, it is through the particular lens of the online representations of diasporas that we explore the Web archives of the Atlas.
... Spaniol et al. [12], Denev [5], and Ben Saad et al. [3,2] all introduce strategies to improve the quality of future web crawler-based captures. If implemented, these strategies should reduce spread in new captures. ...
... Pattern 1RB is prima facie coherent ("left" and "right" come from the appearance of the charts). The right pattern, depicted in figure 5(b) and specified by predicate (4), represents an embedded memento, m_i,1, which was both modified and captured after the root memento, m_0, was captured. ...
Article
Full-text available
Most archived HTML pages embed other web resources, such as images and stylesheets. Playback of the archived web pages typically provides only the capture date (or Memento-Datetime) of the root resource and not the Memento-Datetime of the embedded resources. In the course of our research, we have discovered that the Memento-Datetime of embedded resources can be up to several years in the future or past, relative to the Memento-Datetime of the embedding root resource. We introduce a framework for assessing temporal coherence between a root resource and its embedded resource depending on Memento-Datetime, Last-Modified datetime, and entity body.
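The following is only a rough, illustrative approximation of such a classification, based on the capture datetimes and Last-Modified evidence; it is not the exact rule set of the cited framework.

from datetime import datetime

def classify(root_capture, embedded_capture, embedded_last_modified=None):
    """Rough coherence classification for one embedded resource.

    Illustrative approximation only, not the cited framework's exact rules.
    """
    if embedded_capture <= root_capture:
        return "prima facie coherent"        # embedded state existed when root was captured
    if embedded_last_modified is not None and embedded_last_modified <= root_capture:
        return "probably coherent"           # captured later but unchanged since the root capture
    if embedded_last_modified is not None and embedded_last_modified > root_capture:
        return "prima facie violative"       # modified and captured after the root capture
    return "undetermined"                    # no Last-Modified evidence either way

print(classify(datetime(2015, 6, 1), datetime(2015, 6, 3),
               embedded_last_modified=datetime(2015, 5, 20)))
# "probably coherent": captured after the root but not modified since before it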
... Web sites change faster than crawls can acquire their content, which leads to temporal incoherence. Ben Saad et al. [6] note that quality and completeness require different methods and measures a priori or a posteriori, that is, during acquisition or during post-archival access, respectively. ...
... Ben Saad et al. [6] address both a priori and a posteriori quality. Like Denev et al. [10], the a priori solution is designed to optimize the crawling process for archival quality. ...
Article
Full-text available
When a user views an archived page using the archive's user interface (UI), the user selects a datetime to view from a list. The archived web page, if available, is then displayed. From this display, the web archive UI attempts to simulate the web browsing experience by smoothly transitioning between archived pages. During this process, the target datetime changes with each link followed; drifting away from the datetime originally selected. When browsing sparsely-archived pages, this nearly-silent drift can be many years in just a few clicks. We conducted 200,000 acyclic walks of archived pages, following up to 50 links per walk, comparing the results of two target datetime policies. The Sliding Target policy allows the target datetime to change as it does in archive UIs such as the Internet Archive's Wayback Machine. The Sticky Target policy, represented by the Memento API, keeps the target datetime the same throughout the walk. We found that the Sliding Target policy drift increases with the number of walk steps, number of domains visited, and choice (number of links available). However, the Sticky Target policy controls temporal drift, holding it to less than 30 days on average regardless of walk length or number of domains visited. The Sticky Target policy shows some increase as choice increases, but this may be caused by other factors. We conclude that based on walk length, the Sticky Target policy generally produces at least 30 days less drift than the Sliding Target policy.
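A sketch of the two policies over a single walk, assuming hypothetical helpers nearest_memento(url, target) and out_links(url) that stand in for the archive lookup and link extraction; this is not the authors' experimental code.

import random

def walk(start_url, start_datetime, nearest_memento, out_links, steps=50,
         policy="sticky"):
    """Sketch of an archived-page walk under the two target-datetime policies.

    nearest_memento(url, target) is assumed to return the (datetime, url) of
    the memento closest to target; out_links(url) returns its archived links.
    """
    target = start_datetime
    url, drift_log = start_url, []
    for _ in range(steps):
        memento_datetime, memento_url = nearest_memento(url, target)
        drift_log.append(abs(memento_datetime - start_datetime))
        if policy == "sliding":
            target = memento_datetime       # target drifts with each followed link
        # "sticky": target stays at the originally selected datetime
        links = out_links(memento_url)
        if not links:
            break
        url = random.choice(links)
    return drift_log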
... The quality in this case is measured by the coherence metrics. Saad et al. [82,83,11] propose an alternative model which strives for coherence with a single download per page. ...
Article
Web archives offer a rich and plentiful source of information to researchers, analysts, and legal experts. For this purpose, they gather Web sites as the sites change over time. In order to maintain high standards of data quality, Web archives have to collect all versions of the Web sites. Due to limited resources and technical constraints, this is not possible. Therefore, Web archives consist of versions archived at various time points without guarantee for mutual consistency. This thesis presents a model for assessing the data quality in Web archives as well as a family of crawling strategies yielding high-quality captures. We distinguish between single-visit crawling strategies for exploratory and visit-revisit crawling strategies for evidentiary purposes. Single-visit strategies download every page exactly once, aiming for an “undistorted” capture of the ever-changing Web. We express the quality of the resulting capture with the “blur” quality measure. In contrast, visit-revisit strategies download every page twice. The initial downloads of all pages form the visit phase of the crawling strategy. The second downloads are grouped together in the revisit phase. These two phases enable us to check which pages changed during the crawling process. Thus, we can identify the pages that are consistent with each other. The quality of the visit-revisit captures is expressed by the “coherence” measure. Quality-conscious strategies are based on predictions of the change behaviour of individual pages. We model the Web site dynamics by Poisson processes with page-specific change rates. Furthermore, we show that these rates can be statistically predicted. Finally, we propose visualization techniques for exploring the quality of the resulting Web archives. A fully functional prototype demonstrates the practical viability of our approach.
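For the Poisson change model, the probability that a page with change rate λ has changed at least once Δt after its capture is 1 - exp(-λΔt); a small sketch with an illustrative rate:

import math

def change_probability(rate_per_day, days_since_download):
    """Probability that a page with Poisson change rate rate_per_day has
    changed at least once days_since_download after its capture."""
    return 1.0 - math.exp(-rate_per_day * days_since_download)

# A page predicted to change about once every 10 days (rate 0.1/day):
print(change_probability(0.1, 1))   # ~0.095 after one day
print(change_probability(0.1, 30))  # ~0.95 after a month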
Conference Paper
When a user retrieves a page from a web archive, the page is marked with the acquisition datetime of the root resource, which effectively asserts “this is how the page looked at that datetime.” However, embedded resources, such as images, are often archived at different datetimes than the main page. The presentation appears temporally coherent, but is composed from resources acquired over a wide range of datetimes. We examine the completeness and temporal coherence of composite archived resources (composite mementos) under two selection heuristics. The completeness and temporal coherence achieved using a single archive were compared to the results achieved using multiple archives. We found that at most 38.7% of composite mementos are temporally coherent and that at most only 17.9% (roughly 1 in 5) are temporally coherent and 100% complete. Using multiple archives increases mean completeness by 3.1–4.1% but also reduces temporal coherence.
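A toy sketch of how completeness and temporal coherence can be aggregated over one composite memento; the boolean representation of embedded resources is an assumption for illustration, not the paper's measurement pipeline.

def composite_stats(embedded):
    """Aggregate completeness and coherence over a composite memento.

    embedded is a list of (archived, coherent) booleans, one per embedded
    resource referenced by the root page (an illustrative representation).
    """
    total = len(embedded)
    archived = sum(1 for a, _ in embedded if a)
    coherent = sum(1 for a, c in embedded if a and c)
    completeness = archived / total if total else 1.0
    coherence = coherent / archived if archived else 1.0
    return completeness, coherence

print(composite_stats([(True, True), (True, False), (False, False)]))
# (0.666..., 0.5): two of three resources archived, one of those coherent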
Conference Paper
Full-text available
Web archives do not capture every resource on every page that they attempt to archive. This results in archived pages missing a portion of their embedded resources. These embedded resources have varying historic, utility, and importance values. The proportion of missing embedded resources does not provide an accurate measure of their impact on the Web page; some embedded resources are more important to the utility of a page than others. We propose a method to measure the relative value of embedded resources and assign a damage rating to archived pages as a way to evaluate archival success. In this paper, we show that Web users' perceptions of damage are not accurately estimated by the proportion of missing embedded resources. The proportion of missing embedded resources is a less accurate estimate of resource damage than a random selection. We propose a damage rating algorithm that provides closer alignment to Web user perception, providing an overall improved agreement with users on memento damage by 17% and an improvement by 51% if the mementos are not similarly damaged. We use our algorithm to measure damage in the Internet Archive, showing that it is getting better at mitigating damage over time (going from 0.16 in 1998 to 0.13 in 2013). However, we show that a greater number of important embedded resources (2.05 per memento on average) are missing over time.
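A minimal sketch of the idea behind a weighted damage rating: missing embedded resources contribute in proportion to an assumed importance weight rather than as a raw count. The weight function and values below are hypothetical, not the paper's algorithm.

def damage_rating(missing_resources, all_resources, weight):
    """Weighted damage score: missing resources count in proportion to their
    estimated importance rather than as a raw fraction.

    weight(r) is an assumed importance function (e.g. larger for a missing
    stylesheet or a large above-the-fold image than for a tracking pixel).
    """
    total = sum(weight(r) for r in all_resources)
    lost = sum(weight(r) for r in missing_resources)
    return lost / total if total else 0.0

# Hypothetical weights: a stylesheet matters more than a 1x1 pixel image
resources = {"style.css": 0.5, "hero.jpg": 0.4, "pixel.gif": 0.1}
print(damage_rating(["pixel.gif"], resources, resources.get))  # 0.1
print(damage_rating(["style.css"], resources, resources.get))  # 0.5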