Figure 5 - uploaded by Stéphane Gançarski
Collapse/Expand 

Source publication
Since the late 1990s, there has been a large investment in web archiving, and accessing these huge information sources is receiving more and more attention. The profiles of web archive users differ from those of casual web users. Archive users need to analyze, evaluate and compare information, which requires complex queries with a temporal dimension. These queries...

Context in source publication

Context 1
... out_b finds the set of pages pointed to in one step from a set of concrete blocks (CB) by following any of the links valid in a given period. Operator out uses out_b to find the set of pages reachable in one step from a set of pages (P) in a given period. To do so, it finds the set of concrete blocks for each page in P and calls out_b for each CB. Operator in finds the set of concrete blocks that point to a set of pages through links valid for a requested period. Operators jump+ and jump− return the set of pages reachable in the incoming and outgoing direction, respectively, in one to n steps by following links valid in a given period. Each is a combination of the in and out operators with iteration.

Collapse/Expand

COLLAPSE, also referred to in the literature as coalesce, combines tuples that have the same non-temporal values and consecutive or overlapping validities into one tuple whose validity is the union of the constituent validities. EXPAND expands a tuple into several tuples by splitting its validity into consecutive validities at a given scale. In our approach, the scale can be one of the following keywords: YEAR, MONTH, DAY. These operators can be combined with IN period to limit the range. In Figure 5, EXPAND BY YEAR IN [2001,2004) returns the first three tuples.

In this section, to illustrate how the different operators work in our approach, we use two examples and construct the corresponding queries.

Example 1: We extend the example that we gave in the introduction. A social researcher who studies how French media covered the event “earthquake in Haiti in 2010” over the last year wants to know, per month, the number of distinct regions in web pages of the .fr domain that refer to the earthquake, with duplicates eliminated. In that case, if at time t there were two different articles on “lemonde.fr”, they are counted as 2 rather than 1 (the whole page). Figure 6 shows the query graph for this example. First, we need to find all concrete blocks in the .fr domain that are valid in 2010.
LIKE is used to make a string comparison. Then, with the CONTAINS operator, we find the contents that mention the given keywords (Haiti, earthquake). The EXPAND operator is used to group by month over the validity of the content. The COUNT and DISTINCT operators are used to find the number of different regions. By using the GROUP BY operator with url, we can limit the count to web pages.

Example 2: Our second example is based on finding the broken links from a given url X in a given period (2000 in our example). Figure 7 illustrates the query graph for this example. We need to underline the fact that these broken links are not the result of incompleteness but of HTTP 404 errors encountered while crawling.

In this paper, we addressed the problem of accessing information in web archives. We presented a conceptual model as the basis of a query language for web archives. The operators that support queries are also described. In our model, we take into account the different topics in web pages by using visual blocks, each with an assigned importance, as the unit of retrieval. The block-based approach is used for information retrieval on the web; however, as far as we know, it has never been used with a temporal dimension. Navigation operators with a temporal dimension let users execute queries over the temporal hyperlink structure of web archives. The model and operators, enriched with the temporal dimension, allow powerful querying of web archives.

Our approach is at an early stage of development. Our first priority is to express the language in algebraic form. The next step will be the implementation with an appropriate user-friendly syntax. We want to underline the fact that in this paper we formally clarify the requirements of the WAC query language. It can be implemented as a new query language or as an extension of an existing query language. We will also work on ranking functions that take into account the block-based structure and the temporal dimension.
By using the existing temporal indexing [9] and block-based indexing approaches [11], we intend to propose a hybrid indexing model. Once the proposed query language is fully implemented, our attention will focus on query optimization ...
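
To make the COLLAPSE and EXPAND operators described in the Collapse/Expand passage more concrete, the following is a minimal Python sketch of their semantics. It is not the authors' implementation: the names `Versioned`, `collapse` and `expand_by_year`, the representation of a validity as a half-open date interval [start, end), and the sample data are all assumptions made for illustration, and only the YEAR scale is covered.

```python
from datetime import date
from typing import NamedTuple, Optional


class Versioned(NamedTuple):
    """Hypothetical tuple: one non-temporal value plus a validity [start, end)."""
    value: str
    start: date  # inclusive
    end: date    # exclusive


def collapse(rows: list[Versioned]) -> list[Versioned]:
    """Merge rows with the same value and consecutive or overlapping validities."""
    merged: list[Versioned] = []
    for row in sorted(rows, key=lambda r: (r.value, r.start)):
        last = merged[-1] if merged else None
        if last is not None and last.value == row.value and row.start <= last.end:
            # Validities touch or overlap: keep their union.
            merged[-1] = last._replace(end=max(last.end, row.end))
        else:
            merged.append(row)
    return merged


def expand_by_year(
    rows: list[Versioned],
    period: Optional[tuple[date, date]] = None,
) -> list[Versioned]:
    """Split each validity into per-year pieces, optionally clipped to a period.

    `period` plays the role of the IN clause and is also half-open.
    """
    pieces: list[Versioned] = []
    for row in rows:
        start, end = row.start, row.end
        if period is not None:
            start, end = max(start, period[0]), min(end, period[1])
        while start < end:
            next_year = date(start.year + 1, 1, 1)
            piece_end = min(next_year, end)
            pieces.append(Versioned(row.value, start, piece_end))
            start = piece_end
    return pieces


# Hypothetical data, not taken from Figure 5.
rows = [
    Versioned("a", date(2000, 6, 1), date(2002, 1, 1)),
    Versioned("a", date(2002, 1, 1), date(2005, 3, 1)),
    Versioned("b", date(2003, 1, 1), date(2004, 1, 1)),
]
collapsed = collapse(rows)
expanded = expand_by_year(collapsed[:1], (date(2001, 1, 1), date(2004, 1, 1)))
for piece in expanded:
    print(piece.value, piece.start, piece.end)
```

In this sketch, restricting the expansion to the period [2001, 2004) yields one tuple per calendar year inside the period, which mirrors the way the text describes EXPAND BY YEAR IN [2001,2004). MONTH and DAY scales would follow the same loop with a different step.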