Lab

Web Science and Digital Libraries Research Group


Featured research (2)

Web archives, such as the Internet Archive, preserve the web and allow access to prior states of web pages. We implicitly trust their versions of archived pages, but as their role moves from preserving curios of the past to facilitating present-day adjudication, we are concerned with verifying the fixity of archived web pages, or mementos, to ensure they have always remained unaltered. A widely used technique in digital preservation for verifying the fixity of an archived resource is to periodically compute a cryptographic hash value on the resource and compare it with a previously computed hash value. If the hash values generated on the same resource are identical, then the fixity of the resource is verified. We tested this process by conducting a study on 16,627 mementos from 17 public web archives. We replayed and downloaded the mementos 39 times using a headless browser over a period of 442 days and generated a hash for each memento after each download, resulting in 39 hashes per memento. Each hash covers not only the content of the base HTML of a memento but also all embedded resources, such as images and style sheets. We expected to always observe the same hash for a memento regardless of the number of downloads. However, our results indicate that 88.45% of mementos produce more than one unique hash value, and about 16% (or one in six) of those mementos always produce different hash values. We identify and quantify the types of changes that cause the same memento to produce different hashes. These results point to the need for defining an archive-aware hashing function, as conventional hashing functions are not suitable for replayed archived web pages.
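The fixity check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in the study: the function names `composite_hash` and `fixity_verified`, the choice of SHA-256, and the way embedded resources are combined (sorted by URI) are all hypothetical.

```python
import hashlib
from typing import Iterable, Mapping


def composite_hash(base_html: bytes, embedded: Mapping[str, bytes]) -> str:
    """Hash the base HTML together with all embedded resources.

    Assumption: resources are combined in sorted URI order so that the
    result does not depend on the order in which they were downloaded.
    """
    h = hashlib.sha256()
    h.update(hashlib.sha256(base_html).digest())
    for uri in sorted(embedded):
        # Fixed-length digests keep the field boundaries unambiguous.
        h.update(hashlib.sha256(uri.encode("utf-8")).digest())
        h.update(hashlib.sha256(embedded[uri]).digest())
    return h.hexdigest()


def fixity_verified(hashes: Iterable[str]) -> bool:
    """Fixity holds only if every download produced the same hash."""
    return len(set(hashes)) == 1
```

Under this sketch, a memento downloaded 39 times would yield 39 calls to `composite_hash`, and `fixity_verified` would return `False` as soon as any two of the resulting values differ.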
HTTP Cookies on certain sites can be a source of content bias in archival crawls. Accommodating Cookies at crawl time but not utilizing them at replay time may cause cookie violations, resulting in defaced composite mementos that never existed on the live web. To address these issues, we propose that crawlers store Cookies with short expiration times and that archival replay systems account for values in the Vary header along with URIs.
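One way to picture the replay-side proposal is an index that keys captures on the URI plus the request-header values named in the Vary header. The sketch below is a simplified illustration, not the authors' design or any replay system's API: `variant_key` and `ReplayIndex` are hypothetical names, and details such as `Vary: *` or header-value normalization are ignored.

```python
from typing import Dict, List, Mapping, Optional, Tuple

Key = Tuple[str, Tuple[Tuple[str, str], ...]]


def variant_key(uri: str, vary_header: str,
                request_headers: Mapping[str, str]) -> Key:
    """Build a lookup key from the URI and the headers named in Vary."""
    names = sorted(n.strip().lower() for n in vary_header.split(",") if n.strip())
    # HTTP header names are case-insensitive, so compare in lower case.
    lowered = {k.lower(): v for k, v in request_headers.items()}
    return (uri, tuple((n, lowered.get(n, "")) for n in names))


class ReplayIndex:
    """Hypothetical in-memory index of captures, one list per URI."""

    def __init__(self) -> None:
        self._captures: Dict[str, List[Tuple[str, Key, bytes]]] = {}

    def store(self, uri: str, vary_header: str,
              request_headers: Mapping[str, str], body: bytes) -> None:
        key = variant_key(uri, vary_header, request_headers)
        self._captures.setdefault(uri, []).append((vary_header, key, body))

    def lookup(self, uri: str,
               request_headers: Mapping[str, str]) -> Optional[bytes]:
        # Return the capture whose Vary-selected headers match the request.
        for vary_header, key, body in self._captures.get(uri, []):
            if variant_key(uri, vary_header, request_headers) == key:
                return body
        return None
```

With an index like this, a capture made with one Cookie value is not served for a replay request carrying a different Cookie value, which is the kind of mismatch that could otherwise produce a composite memento that never existed on the live web.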

Lab head

Michael L. Nelson
Department
  • Department of Computer Science
About Michael L. Nelson
  • Web preservation & archiving, digital libraries, social media, information retrieval, scholarly communication. See: http://www.cs.odu.edu/~mln/

Members (7)

Michele Weigle
  • Old Dominion University
Lyudmila Balakireva
  • Los Alamos National Laboratory
Shawn Morgan Jones
  • Los Alamos National Laboratory
Sawood Alam
  • Old Dominion University
Mohamed Aturban
  • Columbia College
Scott Ainsworth
  • Old Dominion University
Hussam Hallak
  • Old Dominion University
Joan A. Smith
  • Not confirmed yet
Charles L. Cartledge
  • Not confirmed yet

Alumni (9)

Mat Kelly
  • Drexel University
Frank Edward McCown
  • Harding University Main Campus
Alexander Nwala
  • Old Dominion University
Yasmin Alnoamany
  • Old Dominion University