Conference Paper

Probe, cluster, and discover: focused extraction of QA-Pagelets from the deep Web

Coll. of Comput., Georgia Inst. of Technol., Atlanta, GA, USA
DOI: 10.1109/ICDE.2004.1319988
In proceedings of: Data Engineering, 2004. Proceedings. 20th International Conference on
Source: IEEE Xplore

ABSTRACT We introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep Web site are grouped into distinct clusters of structurally similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits structural and content similarity at the subtree level to identify the QA-Pagelets.
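The two-phase idea in the abstract can be illustrated with a deliberately simplified sketch. This is not THOR's algorithm: the page representation (a dict mapping a tag path to its text), the exact-match clustering rule, and the function names `cluster_pages` and `candidate_pagelets` are all assumptions made for illustration. Phase one groups pages with the same set of tag paths; phase two keeps the paths present in every page of a cluster whose content changes from page to page, on the assumption that query matches vary while navigation and boilerplate do not.

```python
from collections import defaultdict


def cluster_pages(pages):
    """Phase 1 (simplified): group pages whose sets of tag paths are identical.

    Each page is a hypothetical dict mapping a tag path to its text content.
    """
    clusters = defaultdict(list)
    for page in pages:
        clusters[frozenset(page)].append(page)
    return list(clusters.values())


def candidate_pagelets(cluster):
    """Phase 2 (simplified): paths shared by all pages whose text differs across pages."""
    shared = set.intersection(*(set(page) for page in cluster))
    return sorted(
        path for path in shared
        if len({page[path] for page in cluster}) > 1
    )


# Toy input: two result pages with the same structure and one "no matches" page.
pages = [
    {"html/body/div.nav": "Home", "html/body/table": "result A"},
    {"html/body/div.nav": "Home", "html/body/table": "result B"},
    {"html/body/p": "No matches"},
]
clusters = cluster_pages(pages)
largest = max(clusters, key=len)
pagelets = candidate_pagelets(largest)
```

A real system would need a tolerant structural similarity measure rather than exact equality of tag-path sets, since result pages from the same site rarely share an identical structure.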

  • ABSTRACT: Crawling the deep web often requires selecting an appropriate set of queries that covers most of the documents in the data source at low cost. This can be modeled as a set covering problem, which has been extensively studied. The conventional set covering algorithms, however, do not work well when applied to deep web crawling due to various special features of this application domain. Typically, most set covering algorithms assume a uniform distribution of the elements being covered, while for deep web crawling, neither the sizes of documents nor the document frequencies of the queries are distributed uniformly. Instead, they follow a power law distribution. Hence, we have developed a new set covering algorithm that targets web crawling. Compared to our previous deep web crawling method, which uses a straightforward greedy set covering algorithm, it introduces weights into the greedy strategy. Our experiments, carried out on a variety of corpora, show that this new method consistently outperforms its unweighted version.
    Advanced Data Mining and Applications, 5th International Conference, ADMA 2009, Beijing, China, August 17-19, 2009. Proceedings; 01/2009
  • ABSTRACT: The deep web refers to the hidden portion of the WWW (World Wide Web) that cannot be accessed directly. One of the important issues in the WWW is how to search the hidden Web. Several techniques have been proposed to address this issue. In this paper, we survey the current problems of retrieving information from the hidden Web and propose a solution to these problems using probability, iterative deepening search, and graph theory.
    Computer and Information Technology Workshops, 2008. CIT Workshops 2008. IEEE 8th International Conference on; 08/2008
  • ABSTRACT: Many deep web data sources are ranked data sources, i.e., they rank the matched documents and return at most the top k results even though more than k documents match the query. When estimating the size of such a ranked deep web data source, it is well known that there is a ranking bias: the traditional methods tend to underestimate the size when queries overflow (match more documents than the return limit). Numerous estimation methods have been proposed to overcome the ranking bias, such as avoiding overflowing queries during the sampling process or adjusting the initial estimate using a fixed function. We observe that the overflow rate has a direct impact on the accuracy of the estimation. Under certain conditions, the actual size is close to the estimate obtained by the unranked model multiplied by the overflow rate. Based on this result, this paper proposes a method that allows overflowing queries in the sampling process.
    Data & Knowledge Engineering. 01/2010;
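The first abstract in the list above describes adding weights to a greedy set covering strategy for query selection. The sketch below shows the textbook weighted greedy heuristic only; the weighting scheme of that paper is not given in the abstract, so the `cost` values, the toy data, and the function name `greedy_weighted_cover` are assumptions. At each step it picks the query with the lowest cost per newly covered document.

```python
def greedy_weighted_cover(query_matches, cost):
    """Textbook weighted greedy set cover (not the cited paper's exact weighting).

    query_matches: dict mapping a query to the set of document ids it returns.
    cost: dict mapping a query to a positive cost (an assumed stand-in for weight).
    Returns the queries in the order they were chosen.
    """
    universe = set().union(*query_matches.values())
    covered, chosen = set(), []
    while covered != universe:
        # Consider only queries that still add new documents; pick the cheapest
        # one per newly covered document.
        best = min(
            (q for q in query_matches if query_matches[q] - covered),
            key=lambda q: cost[q] / len(query_matches[q] - covered),
        )
        chosen.append(best)
        covered |= query_matches[best]
    return chosen


# Hypothetical toy data: four queries over four documents.
query_matches = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4}, "d": {1, 2, 3, 4}}
cost = {"a": 2, "b": 2, "c": 1, "d": 10}
selected = greedy_weighted_cover(query_matches, cost)
```

With uniform costs this reduces to the unweighted greedy rule of always taking the query that covers the most new documents.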
