Source publication
The Internet has turned into an important aspect of our information infrastructure and society, with the Web forming a part of our cultural heritage. Several initiatives thus set out to preserve it for the future. The resulting Web archives are by no means only a collection of historic Web pages. They hold a wealth of information that waits to be e...
Similar publications
The absence of benchmarks for Web sites with dynamic content has been a major impediment to research in this area. We describe three benchmarks for evaluating the performance of Web sites with dynamic content. The benchmarks model three common types of dynamic-content Web sites with widely varying application characteristics: an online bookstore, a...
Citations
... In its early days, the Internet Archive (IA) devoted itself only to capturing the greater part of the Internet, giving priority to what was assumed to be most at risk or considered most important. However, although it is not the only Web archive (PANDORA Archive, Austrian On-Line Archive…), it is the largest, and it is considered especially important for scholarly communication, since the growing use of the resources it hosts is centered on teaching and academic research (Masanès, 2002; Rauber et al., 2002). ...
The present essay is a research work on the scientific production in Digital Preservation, based on the Web of Science database. A state of the art is established that collects the most important milestones in Digital Preservation, together with a content analysis of its definitions and their influence in Spain. The scientific production is analysed by processing the scientific articles retrieved from the Web of Science. This analysis is based on the application of bibliometric laws, the Impact Factor, the SJR, the h-index, and trends according to language, country, institutions, research areas, document typology, abstract and descriptors. Finally, a series of conclusions are drawn together with a prospective for future work related to this study.
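The h-index mentioned above has a simple operational definition: the largest h such that at least h of the analysed articles have h or more citations each. A minimal sketch in Python, using hypothetical citation counts rather than the study's actual data:

```python
def h_index(citation_counts):
    """Largest h such that at least h papers have h or more citations each."""
    h = 0
    for rank, citations in enumerate(sorted(citation_counts, reverse=True), start=1):
        if citations >= rank:
            h = rank          # the top `rank` papers all have at least `rank` citations
        else:
            break
    return h

# Hypothetical citation counts for seven retrieved articles.
print(h_index([25, 8, 5, 3, 3, 1, 0]))  # -> 3
```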
... Dirksen, Huizing & Smit (2010) envision a "connective ethnography" sensitive to the imbrications of information resources across systems and texts; Geiger & Ribes (2011) explore the possibilities offered by "trace ethnography" for the study of user interactions and activities on collaborative digital media platforms; Pink & Postill's "social media ethnography" (2012) zooms into the affective intensities of interactions online and offline; Hochman & Manovich's idea of "data ethnography" (2013) extends the dream of big data analytics to the possibility of following individual users throughout large-scale aggregations of content; Collins & Durington's "networked anthropology" (2014) emphasizes a processual engagement with ecologies of actors. The overlaps and similarities between these ethnographic brands and other (not explicitly ethnographic) methodological proposals such as "global technography" (Kien, 2009) or "web archaeology" (Leung et al. 2001; Rauber et al. 2002; Foot & Schneider 2006; Foot & Schneider 2007; Harper & Chen 2012, p. 67) further compound the confusion. ...
This special issue collects the confessions of five digital ethnographers laying bare their methodological failures, disciplinary posturing, and ethical dilemmas. The articles are meant to serve as counseling stations for fellow researchers who are approaching digital media ethnographically. On the one hand, this issue’s contributors acknowledge the rich variety of methodological articulations reflected in the lexicon of “buzzword ethnography”. On the other, they show how doing ethnographic research about, on, and through digital media is most often a messy, personal, highly contextual enterprise fraught with anxieties and discomforts. Through the four “private messages from the field” collected in this issue, we acknowledge the messiness, open-endedness and coarseness of ethnographic research in-the-making. To do this, and as a deliberate editorial choice to sidestep the lexical turf wars and branding exercises of ‘how-to’ methodological literature, we propose to recuperate two forms of ethnographic writing: confessional ethnography (Van Maanen 2011) and self-reflection about the dilemmas of ethnographic work (Fine 1993). Laying bare our fieldwork failures, confessing our troubling epistemological choices and sharing our ways of coping with these issues becomes a precious occasion to remind ourselves of how much digital media, and the ways of researching them, are constantly in the making.
... In summary, anchor texts are related to real queries and to target documents' titles. In addition, anchor text is available not only for pages in the archive, but also for pages that have not been archived, as long as pages in the Web archive point to them [29,33,42]. ...
A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document’s retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries. The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries’ timestamps and the documents’ timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.
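The retrievability score referred to here is, in its common cumulative form, a count of how often a document is returned within a rank cutoff over a large query set; the skew of those counts across the collection is what is read as bias. The sketch below assumes hypothetical inputs (one ranked list of version ids per query and a `doc_url` mapping) and illustrates the cutoff-based score together with the URL-based collapsing of versions described in the abstract; it is not the authors' implementation.

```python
from collections import defaultdict

def retrievability(result_lists, cutoff=100):
    """Cumulative retrievability r(d): number of queries for which the document
    (here: a single archived version) appears within the top `cutoff` results."""
    scores = defaultdict(int)
    for ranked_versions in result_lists:              # one ranked list per query
        for version_id in ranked_versions[:cutoff]:   # contributes 1 if rank <= cutoff
            scores[version_id] += 1
    return scores

def collapse_by_url(scores, doc_url):
    """Aggregate version-level scores onto the originating URL (URL-based collapsing)."""
    per_url = defaultdict(int)
    for version_id, score in scores.items():
        per_url[doc_url[version_id]] += score
    return per_url
```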
... This additional information includes server-side metadata of harvested pages (such as timestamps and HTML response codes) and information embedded in the pages themselves (for instance their hyperlinks and associated anchor text). Rauber et al. [29] have recognized the wealth of additional information contained in web archives which can be used for analytical purposes. Gomes and Silva [10] used data obtained from the domain crawl of the Portuguese web archive to develop criteria for characterizing the Portuguese web. ...
Web archives attempt to preserve the fast changing web, yet they will always be incomplete. Due to restrictions in crawling depth, crawling frequency, and restrictive selection policies, large parts of the Web are unarchived and, therefore, lost to posterity. In this paper, we propose an approach to uncover unarchived web pages and websites and to reconstruct different types of descriptions for these pages and sites, based on links and anchor text in the set of crawled pages. We experiment with this approach on the Dutch Web Archive and evaluate the usefulness of page and host-level representations of unarchived content. Our main findings are the following: First, the crawled web contains evidence of a remarkable number of unarchived pages and websites, potentially dramatically increasing the coverage of a Web archive. Second, the link and anchor text have a highly skewed distribution: popular pages such as home pages have more links pointing to them and more terms in the anchor text, but the richness tapers off quickly. Aggregating web page evidence to the host-level leads to significantly richer representations, but the distribution remains skewed. Third, the succinct representation is generally rich enough to uniquely identify pages on the unarchived web: in a known-item search setting we can retrieve unarchived web pages within the first ranks on average, with host-level representations leading to further improvement of the retrieval effectiveness for websites.
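A rough sketch of the core idea, under assumed inputs (tuples of source URL, target URL, anchor text, and a flag indicating whether the target itself was crawled): anchor text pointing at uncrawled targets is aggregated into page-level and host-level textual representations. The function name and input format are illustrative, not the authors' code.

```python
from collections import defaultdict
from urllib.parse import urlparse

def represent_unarchived(links):
    """Build page-level and host-level anchor-text representations
    for link targets that were never crawled themselves."""
    page_repr = defaultdict(list)   # target URL  -> list of anchor texts
    host_repr = defaultdict(list)   # target host -> list of anchor texts
    for source_url, target_url, anchor_text, target_archived in links:
        if target_archived or not anchor_text:
            continue                # only the unarchived targets need a surrogate
        page_repr[target_url].append(anchor_text)
        host_repr[urlparse(target_url).netloc].append(anchor_text)
    return page_repr, host_repr
```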
... The Archive's objective is to store in perpetuity huge collections of digital information, an important and highly significant mission (Chavez-Demoulin, Roehrl, Roehrl, & Weinberg, 2000; Council on Library and Information Resources, 2002; Featherstone, 2000). Although it is not the only archive on the Web (Masanès, 2002; Rauber, Bruckner, Aschenbrenner, Witvoet & Kaiser, 2002), it is the biggest. The Archive is particularly important for scholarly communication, because of the increasing use of Web resources in teaching and academic research (Kenney, McGovern, Botticelli, Entlich, Lagoze & Payette, 2002). ...
The Internet Archive, an important initiative that maintains a record of the evolving Web, has the promise of being a key resource for historians and those who study the Web itself. The archive's goal is to index the whole Web without making any judgments about which pages are worth saving. The potential importance of the archive for longitudinal and historical Web research leads to the need to evaluate its coverage. This article focuses upon whether there is an international bias in its coverage. The results show that there are indeed large national differences in the archive's coverage of the Web. A subsequent statistical analysis found differing national average site ages and hyperlink structures to be plausible explanations for this uneven coverage. Although the bias is unintentional, researchers using the archive in the future need to be aware of this problem.
This selective bibliography presents over 500 English-language articles, books, and technical reports. It covers digital curation and preservation copyright issues, digital formats (e.g., data, media, and e-journals), metadata, models and policies, national and international efforts, projects and institutional implementations, research studies, services, strategies, and digital repository concerns. Most sources have been published from 2000 through February 2011. It is under a Creative Commons Attribution License. It is also available as a website with a Google Translate link (https://tinyurl.com/24avtyuu). "This tremendous resource is... an excellent place to survey much of the available research on a topic related to data curation." - Julia Flanders and Trevor Muñoz. "An Introduction to Humanities Data Curation." In DH Curation Guide: A Community Resource Guide to Data Curation in the Digital Humanities, 2012.
This article is an attempt to build a quantitative panorama of the Polish country code top-level domain (ccTLD) in the years 1996–2001 on the basis of data generously provided by the Internet Archive. The purpose of analyzing over 72 million captures is to show that these resources have limited potential in reconstructing the early Polish Web. The availability of historical Web resources and tools for their easy exploration in no way determines their potential value and usefulness in research, even if we do not have access to alternative sources.
Archives are evolving. Analog archives are becoming increasingly digitized and linked with other cultural heritage institutions and information sources. Diverse forms of born-digital archives are appearing. This diversity asks for systematic ways to characterize existing archives managing physical or digital records. We conducted a systematic review to identify and understand how archives are characterized. From the 885 identified articles, only 15 were focused on archives’ characterization and, therefore, included in the study. We found several characterization features organized in three main groups: archival materials, provided services, and internal processes.
This bibliography presents over 650 English-language articles, books, and technical reports. It covers digital curation and preservation copyright issues, digital formats (e.g., data, media, and e-journals), metadata, models and policies, national and international efforts, projects and institutional implementations, research studies, services, strategies, and digital repository concerns. Most sources were published from 2000 through 2011. It is available as an EPUB file, a low-cost paperback, a paperback PDF file, a website with a Google Translate link, and a website PDF with live links (http://digital-scholarship.org/dcbw/dcb.htm). It is under a Creative Commons Attribution License. "Librarians and scholars who are concerned with managing digital resources and preserving them for future use will find a crash course on the subject in this bibliography. . . . This book is recommended for librarians working with original digital resources, scholars interested in digital repositories, and students in the field." - Paul M. Blobaum, Journal of the Medical Library Association 101, no. 2 (2013): 158.
Web archives preserve the fast-changing Web by repeatedly crawling its content. The crawling strategy has an influence on the data that is archived. We use link anchor text of two Web crawls created with different crawling strategies in order to compare their coverage of past popular topics. One of our crawls was collected by the National Library of the Netherlands (KB) using a depth-first strategy on manually selected websites from the .nl domain, with the goal of crawling websites as completely as possible. The second crawl was collected by the Common Crawl foundation using a breadth-first strategy on the entire Web; this strategy focuses on discovering as many links as possible. The two crawls differ in their scope of coverage: while the KB dataset covers mainly the Dutch domain, the Common Crawl dataset covers websites from the entire Web. Therefore, we used three different sources to identify topics that were popular on the Web, both at the global level (entire Web) and at the national level (.nl domain): Google Trends, WikiStats, and queries collected from users of the Dutch historic newspaper archive. The two crawls are different in terms of their size and the number of included websites and domains. To allow a fair comparison between the two crawls, we created sub-collections from the Common Crawl dataset based on the .nl domain and the KB seeds. Using simple exact string matching between anchor texts and popular topics from the three different sources, we found that the breadth-first crawl covered more topics than the depth-first crawl. Surprisingly, this is not limited to popular topics from the entire Web but also applies to topics that were popular in the .nl domain.
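One plausible reading of the "simple exact string matching" step is a normalized whole-string comparison between anchor texts and topic labels; the toy function below illustrates that reading with hypothetical inputs and is not the authors' pipeline.

```python
def topic_coverage(anchor_texts, topics):
    """Fraction of popular topics that occur verbatim (after lower-casing and
    whitespace trimming) as an anchor text in the crawl."""
    anchors = {a.strip().lower() for a in anchor_texts}
    hits = [t for t in topics if t.strip().lower() in anchors]
    return len(hits) / len(topics) if topics else 0.0

# Hypothetical anchors from a crawl vs. hypothetical popular topics.
print(topic_coverage(["euro 2000 final", "olympic games", "home"],
                     ["Euro 2000 final", "elections"]))   # -> 0.5
```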