Conference Paper

Probe, cluster, and discover: Focused extraction of QA-Pagelets from the Deep Web

Georgia Institute of Technology, Atlanta, Georgia, United States
DOI: 10.1109/ICDE.2004.1319988 · Conference: Proceedings of the 20th International Conference on Data Engineering (ICDE 2004)
Source: IEEE Xplore

ABSTRACT: We introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep Web site are grouped into distinct clusters of structurally similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits structural and content similarity at the subtree level to identify the QA-Pagelets.
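The two-phase framework lends itself to a compact illustration. The following is a minimal sketch, not THOR's actual algorithm: it clusters pages by cosine similarity of simple tag signatures (phase one) and only notes in a comment how phase two, the subtree filtering, would proceed; all names and the 0.9 threshold are illustrative assumptions.

```python
from collections import Counter
from html.parser import HTMLParser

class TagCounter(HTMLParser):
    """Build a bag-of-tags signature for one HTML page."""
    def __init__(self):
        super().__init__()
        self.tags = Counter()
    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

def tag_signature(html):
    parser = TagCounter()
    parser.feed(html)
    return parser.tags

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_pages(pages, threshold=0.9):
    """Phase 1 (sketch): greedily group structurally similar pages.
    Phase 2 would then compare subtrees across the pages of one cluster
    and keep those whose text content varies from page to page, i.e. the
    candidate QA-Pagelets, while template subtrees repeat verbatim."""
    clusters = []  # list of (representative signature, member page indices)
    for i, html in enumerate(pages):
        sig = tag_signature(html)
        for rep, members in clusters:
            if cosine(sig, rep) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((sig, [i]))
    return clusters
```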

Full-text available from: David Buttler, Aug 30, 2015
  • Source
    • "Without the knowledge of the data source size, it is difficult to decide when to stop the crawling process, and how to evaluate the performance of the data extractors. There have been tremendous research on data source size estimation [4] [5] [8] [9] [11] [13] [34] [36] [40] [42], all are more or less based on the traditional capture-recapture method [1] [14] [16] that was first developed in ecology for the estimation of wild animals. The basic idea is to capture a collection of animals as randomly as possible, mark the captured animals and release them. "
    ABSTRACT: Many deep web data sources are ranked data sources, i.e., they rank the matched documents and return at most the top k results even though more than k documents match the query. When estimating the size of such a ranked deep web data source, it is well known that there is a ranking bias: the traditional methods tend to underestimate the size when queries overflow (match more documents than the return limit). Numerous estimation methods have been proposed to overcome the ranking bias, such as avoiding overflowing queries during the sampling process or adjusting the initial estimation using a fixed function. We observe that the overflow rate has a direct impact on the accuracy of the estimation. Under certain conditions, the actual size is close to the estimate obtained by the unranked model multiplied by the overflow rate. Based on this result, this paper proposes a method that allows overflowing queries in the sampling process.
    Data & Knowledge Engineering 08/2010; DOI:10.1016/j.datak.2010.03.007 · 1.49 Impact Factor
  • Source
    • "Finally, the idea of using clustering for distinguishing between result and error pages comes from [6], although we do not use the same input for the clustering algorithm. In [6], the authors construct the feature vector for a page by extracting the tags from HTML code and use the cosine similarity measure with a tf-idf weighting. In practice, we found out that this tag-signature–based clustering does not work very well in comparison to our scheme of clustering based on the terminal paths in the DOM tree. "
    ABSTRACT: We present an original approach to the automatic induction of wrappers for sources of the hidden Web that does not need any human supervision. Our approach only needs domain knowledge expressed as a set of concept names and concept instances. Extracting valuable data from hidden-Web sources has two parts: understanding the structure of a given HTML form and relating its fields to concepts of the domain, and understanding how the resulting records are represented in an HTML result page. For the former problem, we use a combination of heuristics and of probing with domain instances; for the latter, we use a supervised machine learning technique adapted to tree-like information, applied to an automatic, imperfect, and imprecise annotation obtained using the domain knowledge. We present experiments that demonstrate the validity and potential of the approach.
    10th ACM International Workshop on Web Information and Data Management (WIDM 2008), Napa Valley, California, USA, October 30, 2008
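For the contrast drawn in the excerpt above, here is a rough sketch of the terminal-path representation (root-to-leaf tag paths in the DOM tree); the tag-signature alternative is simply a bag of tag names. Class and function names are illustrative, and a production wrapper would use a robust HTML parser rather than this naive one.

```python
from collections import Counter
from html.parser import HTMLParser

class TerminalPathCollector(HTMLParser):
    """Collect root-to-leaf tag paths such as 'html/body/table/tr/td',
    counted once per text node they lead to."""
    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = Counter()
    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
    def handle_endtag(self, tag):
        if self.stack:
            self.stack.pop()  # naive; real HTML needs proper tag balancing
    def handle_data(self, data):
        if data.strip() and self.stack:
            self.paths["/".join(self.stack)] += 1

def terminal_path_vector(html):
    """Feature vector over terminal paths, usable with the same cosine
    similarity that a tag-signature scheme would use over raw tags."""
    collector = TerminalPathCollector()
    collector.feed(html)
    return collector.paths
```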
  • Source
    • "Our current prototype includes simple wrappers for submitting queries via HTML forms and screen-scraping the results. Techniques for creating more complex wrappers, and creating wrappers automatically, have been examined by others [34], [11], [2] and can be integrated into our framework. Beacons can be run by libraries, universities, ISPs, corporations or any organization that wants to provide searching services to its user group. "
    ABSTRACT: In the InfoBeacons system, a peer-to-peer network of beacons cooperates to route queries to the best information sources. Many internet sources are unwilling to provide more cooperation than simple searching to aid in query routing. We adapt techniques from information retrieval to deal with this lack of cooperation. In particular, beacons determine how to route queries based on information cached from sources' responses to queries. In this paper, we examine alternative architectures for routing queries between beacons and to data sources. We also examine how to improve the routing by probing sources in an informed way to learn about their content. Results of experiments using a beacon network to search 2,500 information sources demonstrate the effectiveness of our system; for example, our techniques require contacting up to 71 percent fewer sources than existing peer-to-peer random walk techniques.
    IEEE Transactions on Parallel and Distributed Systems, 01/2008; 18(12):1754-1765. DOI: 10.1109/TPDS.2007.1107 · 2.17 Impact Factor
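A rough illustration of the routing idea in the abstract above: score each source by how well its cached past responses cover the query terms and contact only the top-ranked sources. This is a toy sketch, not the InfoBeacons implementation; the class and method names are invented for illustration.

```python
from collections import Counter, defaultdict

class Beacon:
    """Toy query router: remembers terms seen in each source's past responses."""
    def __init__(self):
        self.cache = defaultdict(Counter)  # source name -> cached term counts

    def record_response(self, source, text):
        """Cache the terms appearing in a source's response to an earlier query."""
        self.cache[source].update(text.lower().split())

    def route(self, query, k=3):
        """Return up to k sources whose cached content best matches the query."""
        terms = query.lower().split()
        scores = {s: sum(counts[t] for t in terms) for s, counts in self.cache.items()}
        return sorted(scores, key=scores.get, reverse=True)[:k]
```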