Figure 1: Multi-topics in a web page
Source publication
Since the late 1990s, there has been a large investment in web archiving. Accessing these huge information sources is getting more and more attention. Web archive user profiles differ from those of casual web users: archive users need to analyze, evaluate and compare information, which requires complex queries with a temporal dimension. These queries...

Context in source publication

Context 1
... organize our paper as follows. In Section 2, we present the features of the query language for web archives. In Section 3, we present related works. In the following section, the conceptual model is presented together with its operators. Before the conclusion, two use cases are explored in Section 5.

The overarching design goal of our approach is to offer a query language for web archive users. In this section, we describe the features of this query language and the rationale behind them.

Block-Based Search: Today, web pages contain various topics. A typical example is the web pages of newspapers or TV channels, like www.bbc.co.uk/news (Figure 1), where multiple blocks with unrelated topics are marked with different colors. A web page whose matched query terms lie in the same region is more relevant than a web page whose matched terms are distributed over the entire page. Besides increasing keyword search performance, segmentation gives the web page a structure that can be used for structural queries. Previous works [20, 14] show that a page can be partitioned into multiple blocks and that, often, the blocks in a page have different importance. Importance weights are assigned to the different blocks of a web page according to their location, area size, content, etc. Using a whole web page as the unit of retrieval does not take these different regions and their importance into account. In our model, we add this notion to enrich our query language. With block-based search and importance, we are able to answer queries like "Find pages mentioning Obama and Sarkozy in the same block" or "Find pages mentioning Obama in their most important blocks", which return more relevant information and also help to reduce the number of results.

Incompleteness and Temporal Coherence: Due to limited resources, all web archives are incomplete (i.e., they do not contain all possible versions of all the pages on the web), and we must query them as they are. If a user asks for a version at time t and the archive has no version at t but has versions at t-2 and t+2, the access model should decide, or support different choices, to get the version closest to t. Temporal coherence is another issue in web archiving. Its main cause is the dynamic structure of the web, which changes continuously in an unpredictable and unorganized manner. Web site politeness constraints and the limited allocated resources do not allow archiving a whole web site at once, at the same moment. For example, in Figure 2, for a temporal navigation starting from the version of p1 at t2, p2 at t1 is coherent, while p2 at t3 is incoherent. The access model should be able to find the most coherent version. Temporal coherence and incompleteness lead to broken or defective links, which prevent complete navigation and bring access to a standstill. An access model for web archives should take these issues into account.

Duplicates: Web archives have much more duplicated content than the web itself. In web archives, duplicates can occur in two different cases: two versions of the same URL crawled at different times, or the same content pointed to by several different URLs. According to [18], 25% of the documents in web archives are exact duplicates. Duplicates complicate search and result visualization.
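To make the block-based search feature more concrete, here is a minimal Python sketch. It is only an illustration under simplified assumptions (a page version modeled as a flat list of importance-weighted text blocks); it is not the paper's actual data model or query language, and all names are hypothetical.

```python
# Hypothetical illustration of block-based matching; the actual WAC model
# and query language are defined in the source article and may differ.
Block = tuple[float, str]          # (importance weight, block text)
Page = list[Block]                 # one archived page version as blocks

def same_block(page: Page, terms: list[str]) -> bool:
    """e.g. "Obama" and "Sarkozy" must appear together in a single block."""
    return any(all(t.lower() in text.lower() for t in terms)
               for _, text in page)

def in_most_important_block(page: Page, term: str) -> bool:
    """e.g. "Obama" must appear in the block with the highest importance."""
    if not page:
        return False
    _, text = max(page, key=lambda b: b[0])
    return term.lower() in text.lower()

# Toy example: a news-style page with unrelated topics in separate blocks.
page = [(0.9, "Obama meets Sarkozy in Paris"), (0.3, "Sports results ...")]
assert same_block(page, ["Obama", "Sarkozy"])
assert in_most_important_block(page, "Obama")
```

A real engine would evaluate such predicates over a block-level index rather than scanning pages, but the intent of the two example queries is the same.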
Temporal Ranking and Grouping: Ranking and grouping are the most common ways to deliver query results over large-scale data. For web archives, the temporal dimension must be included in both the ranking and the grouping process. Ranking also has a dynamic nature in WACs: for example, an archived page mentioning "Obama" in 1999 should not receive the same ranking when the query is evaluated in 2000 and in 2010.

Temporal Logic: Support for temporal logic operators enriches the language. For example, a researcher who wants to analyze the effects of the arrest of Julian Assange can execute the query: "Find pages linked to wikileaks.org after Julian Assange's arrest in London".

User-Friendliness: This query language will be used by researchers as well as casual web users, so it should enable users to find information without long-term training. Simple queries (like keyword search) should be expressible straightforwardly, and complex syntax should be needed only to express more complex queries. We believe that, with an advanced GUI, users can avoid writing "code".
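The temporal ranking and temporal logic features can also be illustrated with a small, self-contained sketch. The exponential freshness decay and the naive "after" filter below are our own simplifications for illustration only, not the ranking function or operators actually defined in the paper.

```python
import math
from datetime import datetime

def temporal_score(text_score: float, crawl_time: datetime,
                   query_time: datetime, half_life_days: float = 365.0) -> float:
    """Combine a text relevance score with a freshness factor that depends on
    when the query is evaluated, so the same 1999 version of a page ranks
    differently for a query run in 2000 and for one run in 2010."""
    age_days = max((query_time - crawl_time).days, 0)
    freshness = math.exp(-math.log(2) * age_days / half_life_days)
    return text_score * freshness

def after(versions: list[tuple[datetime, str]], event: datetime):
    """A naive 'after <event>' temporal filter over (crawl_time, url) pairs,
    e.g. pages crawled after Julian Assange's arrest in London."""
    return [(t, url) for t, url in versions if t > event]

# Example: the same archived version scores lower the later the query runs.
crawl = datetime(1999, 6, 1)
print(temporal_score(1.0, crawl, datetime(2000, 1, 1)))
print(temporal_score(1.0, crawl, datetime(2010, 1, 1)))
```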
In this section, we briefly summarize related works in three different areas concerning access to web archives: web archiving, query languages for the web, and block-based search.

To explore web archives, traditional access methods, i.e., navigation and full-text search, are proposed by web archiving initiatives [3, 4, 5]. In the fall of 2001, the Internet Archive (IA) [3] launched its collaborative project with Alexa Internet called the "Wayback Machine" [27]. It allows users to go back in time and view earlier versions of a web page for a given URL. The inconvenience of this method is the need to know exact URLs. There is another way of navigating web archives: navigation between different versions. Some web archive initiatives, such as the UK Web Archives (UKWAC) [5], propose a navigation tool, as shown in Figure 3, to facilitate navigation between versions. By using the cursor, users can easily browse the different versions.

The increasing number of national web archives and the diversity of existing works led to the establishment of the International Internet Preservation Consortium (IIPC) [2] in Paris in 2003. Its aim is to develop common standards, tools and techniques for web archiving. One of the current IIPC projects, called WERA, is an archive access solution for searching and navigating web archives. It allows full-text search besides Wayback-Machine-style search, and it is based on the NWA (Nordic Web Archive) toolset [19] and the NutchWAX [26] full-text indexer. NutchWAX is an extension of Nutch (an open-source search engine based on Lucene Java for searching and indexing) for searching web archive collections. Today, most web archive initiatives use the Wayback Machine to support URL indexing and search, and NutchWAX to enable full-text indexing and search. Our motivation is focused on enabling complex queries which cannot be performed by existing methods.

A number of web query languages have been developed in the past (e.g., WebSQL, W3QL, WebLog, WebOQL) [17]. All of those languages are intended for online queries on the web. From the perspective of web archiving, their most important shortcoming is the lack of a temporal dimension and of support for web-archive-specific challenges such as temporal coherence and incompleteness.

The WebBase [23] project at Stanford University is a web repository project that aims to manage large collections of web pages and to enable web-based search. A web warehouse is interpreted simultaneously as a document collection, as a directed graph and as a set of relations. For web-based search, a query language with notions of ranking and ordering is proposed. However, only one copy of each page is archived at a time; thus, no temporal dimension is provided in that project.

A web warehousing system called WHOWEDA (Warehouse of Web Data) [10], proposed by the Web Warehousing and Data Mining group at Nanyang Technological University in Singapore, aims to store and manipulate web information. It stores extracted web information as web tables and provides web operators to manipulate those tables. Its data model is based on node (page) and link (hyperlink) objects. Links do not have any time-related attribute, and any change in the last-modified time of a web document results in a new node. The content of a web document is represented as a "node data tree"; for HTML documents, it is the HTML DOM tree. The flexibility of HTML syntax may cause mistakes in the DOM tree structure. In addition, although the DOM tree is powerful for presentation in the browser, it was not designed to describe the semantic structure of a web page. In WHOWEDA, the internal semantic structure of a page is not modeled; thus, queries can only be specified on the whole content of the document. Moreover, WHOWEDA only focuses on web sites of interest to users, which represent a much smaller scale than web archives.

Using semantic blocks of web pages as the unit of information retrieval is an active research area. A vector space model extended with importance and permeability (the indexing of neighboring blocks) is proposed in [11], without a temporal dimension. In [13], after segmenting each page into non-overlapping blocks, an importance value is assigned to each block and used to weight the links in the ranking computation. A block-based language model is proposed in [28]. As far as we know, there is no approach with a temporal dimension in block-based search, ranking or language modeling. Our approach is based on visual page segmentation of web pages [22] and uses the importance model proposed in [25].

In conclusion, the Wayback Machine, full-text search and navigation are the only deployed solutions for accessing web archives. Most web query languages do not include a temporal dimension for querying historical data. Different topics within a web page are not handled in search, except in recent works like [12, 15, 13], which lack the temporal dimension needed for web archives. In our approach, we propose a data model which takes into account the different regions of a web page, with their associated importance, together with the temporal dimension. The details of the WAC query language are described in this section. After briefly introducing the interpretation of the temporal dimension in Section 4.1, we describe our data ...
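Purely as an illustration of the kind of structures such a data model implies (the actual WAC model is defined in Section 4 of the source article and is not reproduced in this excerpt), the sketch below combines importance-weighted blocks, versions with crawl timestamps, and a nearest-version lookup for handling incompleteness. The nearest-in-time choice is a deliberately naive stand-in for the paper's coherence-aware version selection.

```python
from bisect import bisect_left
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical structures, loosely inspired by the description above:
# regions (blocks) of a page with importance weights, attached to
# page versions that carry a crawl timestamp and outgoing links.
@dataclass
class Block:
    text: str
    importance: float            # derived from location, area size, content, ...

@dataclass
class PageVersion:
    url: str
    crawl_time: datetime
    blocks: list[Block] = field(default_factory=list)
    outlinks: list[str] = field(default_factory=list)

def closest_version(versions: list[PageVersion], t: datetime) -> PageVersion:
    """Incompleteness handling: if no version exists exactly at t, return the
    version whose crawl time is nearest to t (e.g. t-2 vs. t+2)."""
    versions = sorted(versions, key=lambda v: v.crawl_time)
    times = [v.crawl_time for v in versions]
    i = bisect_left(times, t)
    if i == 0:
        return versions[0]
    if i == len(versions):
        return versions[-1]
    before, after = versions[i - 1], versions[i]
    return before if t - before.crawl_time <= after.crawl_time - t else after
```

A coherence-aware selection, as motivated by the Figure 2 example, would go further than this nearest-in-time heuristic when following links during temporal navigation.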