Herbert Van de Sompel’s research while affiliated with Data Archiving and Networked Services and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (197)


Transparent and Trustworthy Artifact Life Cycle
  • Conference Paper

September 2024 · 3 Reads

Patrick Hochstenbach · Martin Klein · Herbert Van de Sompel · Ruben Verborgh

Verifying the fixity of a file in a GitHub repository
A third party verifying the fixity of a file in a GitHub repository can directly download the file and access the associated server-generated hash to compare with a locally generated hash.
Verifying the fixity of a composite memento
A third party verifying the fixity of an archived web page cannot directly download the WARC or the server-generated hash, but has to access the resource via replay software. Multiple locally generated hashes can be compared with each other, but there is no single server-generated hash available for comparison.
The representation of the memento https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html
M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
The rewritten HTML of the memento https://web.archive.org/web/20190725212938/https://maturban.github.io/playground/index.html
The code marked in red was added by the archive. The archive also modifies the names of original headers by prefixing them with x-archive-orig. M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
The raw HTML from requesting the memento https://web.archive.org/web/20190725212938id_/https://maturban.github.io/playground/index.html
M. Aturban, A Framework for Verifying the Fixity of Archived Web Resources, PhD dissertation, 2020.
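The contrast drawn in these captions, between a file whose server-generated hash can be fetched directly and a memento that offers no single reference digest, amounts to a simple fixity check. A minimal sketch follows; the function names and the choice of SHA-256 are illustrative assumptions, not the dissertation's exact procedure:

```python
import hashlib

def file_sha256(path: str, chunk_size: int = 8192) -> str:
    """Compute the SHA-256 hex digest of a local file, streaming in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_fixity(path: str, expected_digest: str) -> bool:
    """Fixity holds when the locally generated hash matches a previously
    recorded (e.g. server-generated) hash for the same resource."""
    return file_sha256(path) == expected_digest
```

For a GitHub-hosted file the `expected_digest` can come from the server; for a replayed memento no such authoritative digest exists, which is precisely the gap the figures illustrate.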


Hashes are not suitable to verify fixity of the public archived web
  • Article
  • Full-text available

June 2023 · 105 Reads · 1 Citation

Martin Klein · Herbert Van de Sompel · [...]

Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation to verify the fixity of an archived resource is to periodically compute a cryptographic hash value on a resource and then compare it with a previous hash value. If the hash values generated on the same resource are identical, then the fixity of the resource is verified. We tested this process by conducting a study on 16,627 mementos from 17 public web archives. We replayed and downloaded the mementos 39 times using a headless browser over a period of 442 days and generated a hash for each memento after each download, resulting in 39 hashes per memento. The hash is calculated by including not only the content of the base HTML of a memento but also all embedded resources, such as images and style sheets. We expected to always observe the same hash for a memento regardless of the number of downloads. However, our results indicate that 88.45% of mementos produce more than one unique hash value, and about 16% (or one in six) of those mementos always produce different hash values. We identify and quantify the types of changes that cause the same memento to produce different hashes. These results point to the need for defining an archive-aware hashing function, as conventional hashing functions are not suitable for replayed archived web pages.
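The study's hashing procedure, one digest covering the base HTML plus all embedded resources, might be sketched as follows; `composite_hash` and the sort-by-URI ordering are illustrative assumptions, not the study's exact implementation:

```python
import hashlib

def composite_hash(resources: dict) -> str:
    """Aggregate hash over a memento's base HTML and embedded resources.

    `resources` maps each resource URI to its downloaded bytes. Sorting by
    URI makes the digest independent of download order, so any difference
    in the result reflects a difference in content, not in crawl order.
    """
    h = hashlib.sha256()
    for uri in sorted(resources):
        h.update(uri.encode("utf-8"))
        h.update(resources[uri])
    return h.hexdigest()
```

Under this scheme, a memento replayed twice should yield the same digest; the paper's finding that 88.45% of mementos do not is what motivates an archive-aware hashing function.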


Event Notifications in Value-Adding Networks

September 2022 · 16 Reads · 2 Citations

Lecture Notes in Computer Science

Linkages between research outputs are crucial in the scholarly knowledge graph. They include online citations, but also links between versions that differ according to various dimensions and links to resources that were used to arrive at research results. In current scholarly communication systems this information is only made available post factum and is obtained via elaborate batch processing. In this paper we report on work aimed at making linkages available in real time, in which an alternative, decentralised scholarly communication network is considered that consists of interacting data nodes that host artifacts and service nodes that add value to artifacts. The first result of this work, the “Event Notifications in Value-Adding Networks” specification, details interoperability requirements for the exchange of real-time life-cycle information pertaining to artifacts using Linked Data Notifications. In an experiment, we applied our specification to one particular use case: distributing Scholix data-literature links to a network of Belgian institutional repositories by a national service node. The results of our experiment confirm the potential of our approach and provide a framework to create a network of interacting nodes implementing the core scholarly functions (registration, certification, awareness and archiving) in a decentralized and decoupled way. Keywords: Scholarly communication · Digital libraries · Open science


Fig. 1. Data nodes and service node exchanging notification messages.
Number of artifact URLs resolved for the data-literature network of each Belgian institution and time required to resolve PID-URLs to their landing page.
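A notification exchange of the kind the specification describes could look roughly like this in Python. Activity Streams 2.0 payloads and Linked Data Notifications (LDN) inboxes come from the paper's context, but the payload shape, function names, and example URIs below are illustrative assumptions, not the specification's normative message format:

```python
import json
import urllib.request

def build_notification(artifact_uri: str, actor_uri: str, target_inbox: str) -> dict:
    """Build an Activity Streams 2.0 message announcing a life-cycle event
    for an artifact. All URIs are caller-supplied placeholders."""
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Announce",
        "actor": {"id": actor_uri, "type": "Service"},
        "object": {"id": artifact_uri, "type": "Document"},
        "target": {"id": target_inbox, "type": "Service"},
    }

def send_notification(inbox_url: str, notification: dict) -> int:
    """POST the JSON-LD payload to an LDN inbox; returns the HTTP status."""
    req = urllib.request.Request(
        inbox_url,
        data=json.dumps(notification).encode("utf-8"),
        headers={"Content-Type": "application/ld+json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

A service node distributing Scholix links would build one such message per data-literature link and POST it to each repository's advertised inbox.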
Event Notifications in Value-Adding Networks

August 2022 · 73 Reads

Linkages between research outputs are crucial in the scholarly knowledge graph. They include online citations, but also links between versions that differ according to various dimensions and links to resources that were used to arrive at research results. In current scholarly communication systems this information is only made available post factum and is obtained via elaborate batch processing. In this paper we report on work aimed at making linkages available in real time, in which an alternative, decentralised scholarly communication network is considered that consists of interacting data nodes that host artifacts and service nodes that add value to artifacts. The first result of this work, the "Event Notifications in Value-Adding Networks" specification, details interoperability requirements for the exchange of real-time life-cycle information pertaining to artifacts using Linked Data Notifications. In an experiment, we applied our specification to one particular use case: distributing Scholix data-literature links to a network of Belgian institutional repositories by a national service node. The results of our experiment confirm the potential of our approach and provide a framework to create a network of interacting nodes implementing the core scholarly functions (registration, certification, awareness and archiving) in a decentralized and decoupled way.



Interoperability for Accessing Versions of Web Resources with the Memento Protocol

July 2021 · 33 Reads · 8 Citations

The Internet Archive pioneered web archiving and remains the largest publicly accessible web archive hosting archived copies of web pages (Mementos) going back as far as early 1996. Its holdings have grown steadily since, and it hosts more than 881 billion URIs as of September 2019. However, the landscape of web archiving has changed significantly over the last two decades. Today we can freely access Mementos from more than 20 web archives around the world, operated by for-profit and nonprofit organisations, national libraries and academic institutions, as well as individuals. The resulting diversity improves the odds of the survival of archived records but also requires technical standards to ensure interoperability between archival systems. To date, the Memento Protocol and the WARC file format are the main enablers of interoperability between web archives. We describe a variety of tools and services that leverage the broad adoption of the Memento Protocol and discuss a selection of research efforts that would likely not have been possible without these interoperability standards. In addition, we outline examples of technical specifications that build on the ability of machines to access resource versions on the Web in an automatic, standardised and interoperable manner.
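Datetime negotiation, the core mechanism of the Memento Protocol (RFC 7089), can be sketched as follows. The Time Travel aggregator is a real public service, but the helper functions below are illustrative, not code from the paper:

```python
import urllib.request

# Public Memento aggregator TimeGate; any RFC 7089 TimeGate works the same way.
TIMEGATE = "https://timetravel.mementoweb.org/timegate/"

def timegate_request(url: str, when: str) -> urllib.request.Request:
    """Build a Memento datetime-negotiation request: the Accept-Datetime
    header (RFC 1123 format) asks the TimeGate for the memento of `url`
    temporally closest to `when`."""
    return urllib.request.Request(TIMEGATE + url,
                                  headers={"Accept-datetime": when})

def resolve_memento(url: str, when: str) -> str:
    """Follow the TimeGate's redirect and return the selected memento URI."""
    with urllib.request.urlopen(timegate_request(url, when)) as resp:
        return resp.url
```

For example, `resolve_memento("https://www.ietf.org/", "Thu, 01 Jan 2015 00:00:00 GMT")` would redirect to whichever archive holds the closest capture, illustrating how the protocol lets machines traverse versions across archives uniformly.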


A 25 Year Retrospective on D-Lib Magazine

August 2020 · 633 Reads

In July 1995 the first issue of D-Lib Magazine was published as an on-line, HTML-only, open access magazine, serving as the focal point for the then emerging digital library research community. In 2017 it ceased publication, in part due to the maturity of the community it served as well as the increasing availability of and competition from eprints, institutional repositories, conferences, social media, and online journals -- the very ecosystem that D-Lib Magazine nurtured and enabled. As long-time members of the digital library community and authors with the most contributions to D-Lib Magazine, we reflect on the history of the digital library community and D-Lib Magazine, taking its very first issue as guidance. It contained three articles, which described: the Dublin Core Metadata Element Set, a project status report from the NSF/DARPA/NASA-funded Digital Library Initiative (DLI), and a summary of the Kahn-Wilensky Framework (KWF) which gave us, among other things, Digital Object Identifiers (DOIs). These technologies, as well as many more described in D-Lib Magazine through its 23 years, have had a profound and continuing impact on the digital library and general web communities.


The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving

September 2019 · 20 Reads

Web archiving frameworks are commonly assessed by the quality of their archival records and by their ability to operate at scale. The ubiquity of dynamic web content poses a significant challenge for crawler-based solutions such as the Internet Archive that are optimized for scale. Human-driven services such as the Webrecorder tool provide high-quality archival captures but are not optimized to operate at scale. We introduce the Memento Tracer framework that aims to balance archival quality and scalability. We outline its concept and architecture and evaluate its archival quality and operation at scale. Our findings indicate that quality is on par with or better than that of established archiving frameworks, and that operation at scale comes with a manageable overhead.


The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving

August 2019 · 12 Reads · 12 Citations

Lecture Notes in Computer Science

Web archiving frameworks are commonly assessed by the quality of their archival records and by their ability to operate at scale. The ubiquity of dynamic web content poses a significant challenge for crawler-based solutions such as the Internet Archive that are optimized for scale. Human-driven services such as the Webrecorder tool provide high-quality archival captures but are not optimized to operate at scale. We introduce the Memento Tracer framework that aims to balance archival quality and scalability. We outline its concept and architecture and evaluate its archival quality and operation at scale. Our findings indicate that quality is on par with or better than that of established archiving frameworks, and that operation at scale comes with a manageable overhead.


Signposting for Repositories

July 2019 · 11 Reads

Digital repositories can often easily be navigated by humans but not by machines. We introduce Signposting, a mechanism that shows machines how to navigate a repository's objects and how to interpret their relationships. Signposting is based on standard and widely adopted web technologies: typed links and HTTP Link headers.
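A machine consuming Signposting links only needs to read the typed links out of an HTTP Link header. The sketch below assumes the simple `<uri>; rel="type"` form; it is a minimal illustration, not a full RFC 8288 parser:

```python
import re

def parse_link_header(value: str) -> dict:
    """Minimal parser for an HTTP Link header (RFC 8288): map each link
    relation type to the list of target URIs it annotates. Handles the
    simple `<uri>; rel="type"` pattern used by Signposting."""
    links = {}
    for target, params in re.findall(r'<([^>]+)>([^,]*)', value):
        m = re.search(r'rel="([^"]+)"', params)
        if m:
            # A rel parameter may carry several space-separated relations.
            for rel in m.group(1).split():
                links.setdefault(rel, []).append(target)
    return links
```

Given a landing page's Link header, a crawler could then follow, say, `cite-as` to the object's persistent identifier and `item` to its content resources.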


Citations (64)


... Our requirements assume that archive services can create authentic mementos of Event Logs so that fixity information can be verified. Aturban et al. [23] show that, in general cases, current web archives, such as the Internet Archive, routinely fail to offer authentic mementos to external applications when replaying archived web pages. The mementos that are presented to typical users of web archives are often not the raw data that was archived but a processed version that presents the archive's best effort to create human-interpretable past versions of the web. ...

Reference:

Transparent and Trustworthy Artifact Life Cycle
Hashes are not suitable to verify fixity of the public archived web

... In that future scenario, the overview of research output in Biblio is kept up to date more automatically, which reduces the administrative burden and increases the guarantee of completeness. To make that possible, we are using the renewed Biblio to build the foundation for a more decentralised infrastructure in which data and data enrichments can flow back and forth automatically, among other things with the help of notification-based protocols for institutional repositories (Hochstenbach et al., 2022). These protocols make it possible to discover, in an automated way, which publications and datasets of an author are known worldwide, with the goal of aggregating publication data in a reliable manner. ...

Event Notifications in Value-Adding Networks
  • Citing Chapter
  • September 2022

Lecture Notes in Computer Science

... The openness of a standard, while critical, is arguably not sufficient for widespread adoption. Achieving successful interoperability often depends on having a significant platform, either through commercial influence or through community involvement as articulated by Nelson and Van de Sompel (2022): ...

D-lib magazine pioneered web-based scholarly communication
  • Citing Conference Paper
  • June 2022

... The concept of aggregation goes beyond the Memento specification by leveraging a similar structure to TimeMaps but allowing the URIs contained within the aggregated TimeMap to identify resources at multiple archives instead of a single archive. The Research Library at Los Alamos National Laboratory (LANL) deployed the original Memento aggregator [9,18], currently accessible through a web interface via the Time Travel service at https://timetravel.mementoweb.org/. ...

Interoperability for Accessing Versions of Web Resources with the Memento Protocol
  • Citing Chapter
  • July 2021

... The preservation of digital ads requires not just storing files but also ensuring they can be replayed in future browsers, which may not support older media formats or web standards. Klein et al. (2019) highlight the increasing difficulty in maintaining archival quality due to the proliferation of dynamic web content, particularly content that is only accessible through the activation of JavaScript-based features. [8] Online ads often depend on external scripts, tracking pixels, and third-party services to function correctly. ...

The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
  • Citing Chapter
  • August 2019

Lecture Notes in Computer Science

... For example, investigations by Klein and Balakireva (2022) of DOIs, arguably the most ubiquitous PID type, suggest widespread DOI request failures and inconsistent machine responses from organizations using them. Members of the same research team have also proposed their Memento 'Robust Links' approach as a means of improving the reliability of URL- and URI-based referencing on the web, including with respect to PIDs (Klein et al., 2018). PIDs are therefore only persistent insofar as a PID registration service commits to resolving them, or insofar as a publisher commits to updating a PID registry with the current location of a web resource. ...

Robust Links in Scholarly Communication
  • Citing Conference Paper
  • May 2018

... Klein et al. [28] (2018) undertake a study to determine whether or not it is possible to carry out targeted crawls on the archived web. They are able to efficiently query 22 web archives that contribute to the overall creation of event collections by using the Memento architecture. ...

Focused Crawl of Web Archives to Build Event Collections
  • Citing Conference Paper
  • May 2018

... Most research involving Memento aggregation relates to usage of the aggregator rather than enhancement of the aggregation process. In the same way that prior to MemGator, researchers would state "we requested URIs from the Time Travel Service", this statement was transformed to "we used MemGator to request URIs", indicative that it was useful for researchers to utilize their own aggregator instance [4,14,21]. A facet of this use case is the ability for researchers to customize the set of web archives to be used as the basis for querying, which is performed prior to running MemGator by modifying a configuration file. ...

Impact of URI Canonicalization on Memento Count

... Biological and Biomedical sciences subjects lead with 35% of records in the ORCID system. ORCID is valuable because it maintains machine-readable researcher profiles (Klein & Van de Sompel, 2017) and provides unique identification. However, ORCID has a low number of profiles in comparison with other academic scholarly networking sites, i.e. ...

Discovering Scholarly Orphans Using ORCID
  • Citing Conference Paper
  • June 2017