Conference Paper

Probe, cluster, and discover: Focused extraction of QA-Pagelets from the Deep Web

Georgia Institute of Technology, Atlanta, Georgia, United States
DOI: 10.1109/ICDE.2004.1319988 Conference: Data Engineering, 2004. Proceedings. 20th International Conference on
Source: IEEE Xplore


We introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep Web site are grouped into distinct clusters of structurally similar pages. In the second phase, pages from each cluster are examined with a subtree-filtering algorithm that exploits structural and content similarity at the subtree level to identify the QA-Pagelets.

Available from: David Buttler
  • Source
    • "Without knowledge of the data source size, it is difficult to decide when to stop the crawling process and how to evaluate the performance of the data extractors. There has been tremendous research on data source size estimation [4] [5] [8] [9] [11] [13] [34] [36] [40] [42], almost all of it based on the traditional capture-recapture method [1] [14] [16], which was first developed in ecology to estimate wild animal populations. The basic idea is to capture a collection of animals as randomly as possible, mark the captured animals, and release them. "
    ABSTRACT: Many deep web data sources are ranked data sources, i.e., they rank the matched documents and return at most the top k results even when more than k documents match the query. When estimating the size of such ranked deep web data sources, it is well known that there is a ranking bias—the traditional methods tend to underestimate the size when queries overflow (match more documents than the return limit). Numerous estimation methods have been proposed to overcome the ranking bias, such as avoiding overflowing queries during the sampling process, or adjusting the initial estimate using a fixed function. We observe that the overflow rate has a direct impact on the accuracy of the estimation. Under certain conditions, the actual size is close to the estimate obtained by the unranked model multiplied by the overflow rate. Based on this result, this paper proposes a method that allows overflowing queries in the sampling process.
    Article · Aug 2010 · Data & Knowledge Engineering
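The capture-recapture idea quoted above, and the overflow-rate correction described in the abstract, can be sketched in a few lines. This is a toy illustration under stated assumptions: `lincoln_petersen` is the classic two-sample estimator from ecology, and `ranked_source_estimate` is the "unranked estimate times overflow rate" adjustment the cited paper describes as holding under certain conditions; the function names are my own.

```python
def lincoln_petersen(n1, n2, m):
    """Classic capture-recapture estimate of population size.

    n1, n2: sizes of two (ideally random) samples
    m: number of items observed in both samples ("recaptures")
    Estimate: N ~= n1 * n2 / m
    """
    if m == 0:
        raise ValueError("no recaptures; cannot estimate size")
    return n1 * n2 / m

def ranked_source_estimate(unranked_estimate, overflow_rate):
    """Adjustment for ranked (top-k) sources, per the cited abstract:
    the actual size is close to the unranked-model estimate multiplied
    by the overflow rate (fraction of sampling queries that matched
    more than k documents)."""
    return unranked_estimate * overflow_rate
```

For example, two samples of 100 documents sharing 20 documents suggest an unranked size of 500; with an observed overflow rate of 0.8, the adjusted estimate would be 400.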
  • Source
    • "Finally, the idea of using clustering to distinguish between result and error pages comes from [6], although we do not use the same input for the clustering algorithm. In [6], the authors construct the feature vector for a page by extracting the tags from the HTML code and use the cosine similarity measure with tf-idf weighting. In practice, we found that this tag-signature–based clustering does not work very well compared to our scheme of clustering based on the terminal paths in the DOM tree. "
    ABSTRACT: We present an original approach to the automatic induction of wrappers for sources of the hidden Web that does not need any human supervision. Our approach only needs domain knowledge expressed as a set of concept names and concept instances. There are two parts to extracting valuable data from hidden-Web sources: understanding the structure of a given HTML form and relating its fields to concepts of the domain, and understanding how resulting records are represented in an HTML result page. For the former problem, we use a combination of heuristics and of probing with domain instances; for the latter, we use a supervised machine learning technique adapted to tree-like information on an automatic, imperfect, and imprecise annotation using the domain knowledge. We show experiments that demonstrate the validity and potential of the approach.
    Conference Paper · Oct 2008
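The quoted passage contrasts flat tag signatures with terminal-path features, i.e., root-to-text tag paths in the DOM tree. A minimal sketch of extracting such features with Python's standard-library `html.parser` (the feature definition here is an illustrative assumption, not the cited authors' exact implementation):

```python
from collections import Counter
from html.parser import HTMLParser

class TerminalPathExtractor(HTMLParser):
    """Collect root-to-text tag paths such as 'html/body/table/tr/td',
    the kind of terminal-path feature the quoted passage clusters on,
    instead of a flat bag of tags."""

    def __init__(self):
        super().__init__()
        self.stack = []          # currently open tags
        self.paths = Counter()   # terminal path -> count

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerates sloppy HTML).
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if data.strip():
            self.paths["/".join(self.stack)] += 1

def terminal_paths(html):
    """Return the terminal-path feature vector for one HTML page."""
    p = TerminalPathExtractor()
    p.feed(html)
    return p.paths
```

Two template siblings share almost all terminal paths even when their flat tag counts differ, which is one intuition for why path-based features can separate result pages from error pages more cleanly.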
  • Source
    • "Our current prototype includes simple wrappers for submitting queries via HTML forms and screen-scraping the results. Techniques for creating more complex wrappers, and creating wrappers automatically, have been examined by others [34], [11], [2] and can be integrated into our framework. Beacons can be run by libraries, universities, ISPs, corporations or any organization that wants to provide searching services to its user group. "
    ABSTRACT: In the InfoBeacons system, a peer-to-peer network of beacons cooperates to route queries to the best information sources. Many internet sources are unwilling to provide more cooperation than simple searching to aid in the query routing. We adapt techniques from information retrieval to deal with this lack of cooperation. In particular, beacons determine how to route queries based on information cached from sources' responses to queries. In this paper, we examine alternative architectures for routing queries between beacons and to data sources. We also examine how to improve the routing by probing sources in an informed way to learn about their content. Results of experiments using a beacon network to search 2,500 information sources demonstrate the effectiveness of our system; for example, our techniques require contacting up to 71 percent fewer sources than existing peer-to-peer random walk techniques.
    Article · Jan 2008 · IEEE Transactions on Parallel and Distributed Systems
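The routing idea in the abstract above, i.e., scoring sources using term statistics cached from their past responses, can be sketched as a toy beacon. The class and method names are hypothetical, and the raw term-count score is a deliberate simplification of the IR-style weighting the paper describes.

```python
from collections import Counter, defaultdict

class Beacon:
    """Toy sketch of response-cache query routing: a beacon caches
    term counts from each source's past responses and routes a new
    query to the sources whose caches best match it."""

    def __init__(self):
        self.cache = defaultdict(Counter)  # source -> term counts

    def observe(self, source, response_terms):
        """Record terms seen in a response from `source`."""
        self.cache[source].update(response_terms)

    def route(self, query_terms, k=1):
        """Return the top-k sources by cached-term overlap with the query."""
        scores = {src: sum(counts[t] for t in query_terms)
                  for src, counts in self.cache.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]
```

After observing a few responses, a query for "deep web" would be routed to the source whose cached responses mention those terms most, without any cooperation from the sources beyond answering queries.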