Table 3 - uploaded by Miguel Costa
Source publication
This paper introduces the Portuguese Web Archive initiative, presenting its main objectives and work in progress. Term search over web archive collections is a desirable feature that raises new challenges. The paper discusses how the size of the term index could be reduced without significantly decreasing the quality of search results. The results obtaine...
Similar publications
An image texture was defined in terms of pixel intensities and directionality. However, most current texture representation methods do not consider these two key factors simultaneously. To effectively capture directional and pixel intensity information of texture, in this paper, we propose a novel and robust local descriptor, named locally direct...
When the elastic properties of structured materials become direction-dependent, the number of their descriptors increases. For example, in two dimensions, the anisotropic behavior of materials is described by up to 6 independent elastic stiffness parameters, as opposed to only 2 needed for isotropic materials. Such a high number of parameters expands...
Mel-frequency cepstral coefficients are used as an abstract representation of the spectral envelope of a given signal. Although they have been shown to be a powerful descriptor for speech and music signals, more accurate and easily interpretable options can be devised. In this study, we present and evaluate the shape-based spectral contrast descrip...
... As expected, home pages are very informative, containing most of the base information about the website. In particular, leveraging only home pages, it is possible to derive some structural information about the underlying technologies [17], accomplish large-scale usability tests [18], [19] and perform web classification [20]. ...
... Another issue is performance. Inverted indexes are considered to be an essential technique for the provision of timely search results [15]. Indexes are widely used to reference archive data held in ARC (ARChive) or WARC (Web ARChive) file formats [16]. ...
... Indexes are widely used to reference archive data held in ARC (ARChive) or WARC (Web ARChive) file formats [16]. Indexes are held in memory for performance reasons [15], but the increase in the amount of archived data makes this increasingly difficult. Thus, research has also investigated techniques for reducing the size of the index, such as de-duplication [17]. ...
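The inverted index mentioned in the excerpts above can be illustrated with a minimal in-memory sketch. It is not the implementation used by any of the cited archives: the function names, the whitespace tokenizer and the sample documents are all assumptions made for illustration only.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        # Assumption: naive lowercase whitespace tokenization.
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [set(index.get(term, ())) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical sample collection.
docs = {
    1: "portuguese web archive",
    2: "web archive search",
    3: "full-text search engine",
}
index = build_inverted_index(docs)
```

A lookup touches only the posting lists of the query terms, which is why such indexes give timely results. Every distinct term adds a posting list, so the index grows with the collection, which is the memory pressure the excerpt describes.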
Being able to explore large digital collections effectively is of interest to both academics and practitioners alike. The need to go beyond the provision of keyword-driven functionality to features that support exploration and discovery is widely recognised. In addition, providers are seeking to support more diverse groups of users with varying information needs and tasks. Increasing amounts of cultural heritage are being stored in web archives that present unique challenges as a form of digital cultural heritage. This paper describes a collaboration between the University of Sheffield and the UK National Archives to investigate entity-based methods for exploring the UK Government Web Archive.
... All the institutions managing these web archives are members of the International Internet Preservation Consortium (IIPC), which has the goal of aggregating efforts to produce common tools and standards. This explains the convergence to NutchWAX and was also the reason for us to adopt it in the development of the Portuguese web archive [3]. So far, we have indexed more than 200 million documents. ...
We present the first overview of a web archive user profile and the searching technology that supports it. Most web archives only support URL search and just a few provide full-text search in response to users' expectations. Their technology is essentially based on web search engines, which ignore the temporal dimension of collections. As a consequence, the quality of results is poor. We suggest the creation of an initiative for information retrieval evaluation that meets the needs of web archives. We believe this initiative would foster research in web archives, much as other initiatives have done in their domains.
... Niu limited her study to web archives with an English interface. National libraries have published their web archiving initiatives in various studies, for example, National Library of France [8], Portuguese web archive [9], National Library of the Czech Republic [10], National Taiwan University [11], National Archives of Australia [12], and China Web InfoMall [13]. Memento [14] is an extension for the HTTP protocol to allow the user to browse the past web as the current web. ...
The Memento aggregator currently polls every known public web archive when serving a request for an archived web page, even though some web archives focus on only specific domains and ignore the others. Similar to query routing in distributed search, we investigate the impact on aggregated Memento TimeMaps (lists of when and where a web page was archived) by only sending queries to archives likely to hold the archived page. We profile twelve public web archives using data from a variety of sources (the web, archives' access logs, and full-text queries to archives) and discover that only sending queries to the top three web archives (i.e., a 75% reduction in the number of queries) for any request produces the full TimeMaps in 84% of the cases.
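The routing idea in this abstract can be sketched as follows. This is only an illustration under stated assumptions, not the authors' method: the profile format (a per-archive score for each top-level domain), the archive names and the function names are all hypothetical.

```python
def route_query(url_host, archive_profiles, k=3):
    """Pick the k archives whose profile scores the host's TLD highest."""
    tld = url_host.rsplit(".", 1)[-1]
    ranked = sorted(
        archive_profiles,
        key=lambda name: archive_profiles[name].get(tld, 0.0),
        reverse=True,
    )
    return ranked[:k]


def query_reduction(total_archives, k):
    """Fraction of queries saved by polling k archives instead of all."""
    return 1 - k / total_archives


# Hypothetical profiles: estimated share of each archive's holdings per TLD.
profiles = {
    "archive_a": {"pt": 0.9, "com": 0.1},
    "archive_b": {"com": 0.8},
    "archive_c": {"uk": 0.7, "com": 0.2},
    "archive_d": {"pt": 0.3},
}
selected = route_query("example.pt", profiles, k=2)
```

With twelve archives and `k=3`, `query_reduction(12, 3)` gives 0.75, matching the 75% reduction stated in the abstract.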
... The Portuguese Web Archive (PWA) is based on the WM, but uses NutchWAX as its full-text and URL search engine [15]. NutchWAX was developed by the International Internet Preservation Consortium (IIPC) and is used by many web archives [14]. ...
Web archives already hold more than 282 billion documents and users demand full-text search to explore this historical information. This survey provides an overview of web archive search architectures designed for time-travel search, i.e. full-text search on the web within a user-specified time interval. Performance, scalability and ease of management are important aspects to take into consideration when choosing a system architecture. We compare these aspects and open the discussion of which search architecture is more suitable for a large-scale web archive.
... This article details the steps taken towards resource discovery for the domain names and Web resources, aiming at a delineation of the Honduran Web. Similar work has been done already for Portugal [6] [7], Argentina [8], and Chile [9]. We adapt and develop some of their methods to arrive at a better understanding of the Honduran Web. ...
... We studied the above issues and drew the first profile of how web archive users search. It is based on the quantitative analysis of the Portuguese Web Archive (PWA) search logs [6]. Our results show that users of both types of systems have similar behaviors. ...
Web archives are a huge source of information to mine the past. However, tools to explore web archives are still in their infancy, in part due to the limited knowledge that we have of their users. We contribute to this knowledge by presenting the first search behavior characterization of web archive users. We obtained detailed statistics about the users' sessions, queries, terms and clicks from the analysis of their search logs. The results show that users did not spend much time and effort searching the past. They prefer short sessions, composed of short queries and few clicks. Full-text search is preferred to URL search, but both are frequently used. There is strong evidence that users prefer the oldest documents over the newest, but mostly search without any temporal restriction. We discuss all these findings and their implications for the design of future web archives.
... We faced this problem when we started developing the access functionalities for the Portuguese Web Archive (PWA) [8]. People had great difficulty in suggesting anything without seeing the system working. ...
A complete characterization of web archive users must respond to three questions: why, what and how do users search? This study focuses on the first two: what are the user intents and which topics are most interesting to them? Answers to these questions are essential for guiding the development of web archives towards better user satisfaction. We used three instruments to collect quantitative and qualitative data, namely, search logs, an online questionnaire and a laboratory study. The results obtained are consistent with one another. Users perform mostly navigational searches and do not restrict searches by date. Other findings show that users prefer full-text over URL search and the oldest documents over the newest. We discuss all these findings and their implications for the design of search engines for web archives.
... The amount of information published is expressed in decimal multiples: 1 KB = 10^3 bytes [21]. The Portuguese Web Archive (PWA) project aims to automatically gather and preserve the information published on the Portuguese Web [22]. The most recent Web characterization results analyzed in this study were extracted from a crawl of the Portuguese Web performed by the PWA in 2008, that included all media types, which we named allmedia08 [4]. ...
The Web is permanently changing, with new technologies and publishing behaviors emerging every day. It is important to track trends on the evolution of the Web to develop efficient tools to process its data. For instance, Web trends influence the design of browsers, crawlers and search engines. This study presents trends on the evolution of the Web derived from the analysis of 3 characterizations performed within an interval of 5 years. The Web portion used as a case study was the Portuguese Web. Several metrics regarding site and content characteristics were analyzed. Keywords: Web trends; Web measurements; Web characterization
Building a language model from freely available internet information involves several steps and challenges. This new model aims to be a BERT-based language model for European Portuguese, with no specific context. The corpus was built using a web page archive infrastructure provided by and restricted to .pt domains. This paper will describe the overall process of building the corpus and training a BERT model. Keywords: BERT; Vocabulary; Arquivo.pt; Portuguese European