Conference Paper

Probe, cluster, and discover: focused extraction of QA-Pagelets from the deep Web

College of Computing, Georgia Institute of Technology, Atlanta, GA, USA
DOI: 10.1109/ICDE.2004.1319988
Conference: Proceedings of the 20th International Conference on Data Engineering (ICDE 2004)
Source: IEEE Xplore

ABSTRACT: We introduce the concept of a QA-Pagelet to refer to the content region in a dynamic page that contains query matches. We present THOR, a scalable and efficient mining system for discovering and extracting QA-Pagelets from the deep Web. A unique feature of THOR is its two-phase extraction framework. In the first phase, pages from a deep Web site are grouped into distinct clusters of structurally similar pages. In the second phase, pages from each page cluster are examined through a subtree filtering algorithm that exploits structural and content similarity at the subtree level to identify the QA-Pagelets.
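The abstract only summarizes THOR's first phase; its actual algorithms are not reproduced here. As a minimal illustrative sketch, structural similarity between pages can be approximated by comparing their multisets of root-to-node HTML tag paths, with a greedy pass grouping similar pages. All names, the cosine measure, and the threshold below are assumptions for illustration, not THOR's method.

```python
from collections import Counter
from html.parser import HTMLParser

# Illustrative sketch only: approximate a page's "structure" by the
# multiset of its root-to-node tag paths (not THOR's actual algorithm).
class TagPathCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack, self.paths = [], Counter()

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.paths["/".join(self.stack)] += 1

    def handle_endtag(self, tag):
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return collector.paths

def structural_similarity(page_a, page_b):
    """Cosine similarity between two pages' tag-path count vectors."""
    pa, pb = tag_paths(page_a), tag_paths(page_b)
    dot = sum(pa[k] * pb[k] for k in pa)
    norm = (sum(v * v for v in pa.values()) ** 0.5) * \
           (sum(v * v for v in pb.values()) ** 0.5)
    return dot / norm if norm else 0.0

def cluster_pages(pages, threshold=0.8):
    """Greedy single-pass clustering: put each page into the first cluster
    whose representative (first member) is structurally similar enough."""
    clusters = []
    for page in pages:
        for cluster in clusters:
            if structural_similarity(page, cluster[0]) >= threshold:
                cluster.append(page)
                break
        else:
            clusters.append([page])
    return clusters
```

On template-generated deep Web result pages, two pages answering different queries share almost all tag paths, so they land in one cluster, while navigation or error pages form separate clusters.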

  •
    ABSTRACT: The Web is a steadily evolving resource comprising much more than mere HTML pages. With its ever-growing data sources in a variety of formats, it provides great potential for knowledge discovery. In this article, we shed light on some interesting phenomena of the Web: the deep Web, which surfaces database records as Web pages; the Semantic Web, which defines meaningful data exchange formats; XML, which has established itself as a lingua franca for Web data exchange; and domain-specific markup languages, which are designed based on XML syntax with the goal of preserving semantics in targeted domains. We detail these four developments in Web technology, and explain how they can be used for data mining. Our goal is to show that all these areas can be as useful for knowledge discovery as the HTML-based part of the Web.
    ACM SIGKDD Explorations Newsletter 04/2013; 14(2):63-81.
  •
    ABSTRACT: The hidden Web (also called the deep Web or invisible Web), i.e., the part of the Web that is not directly accessible through hyperlinks but only through HTML forms or Web services, is of great value but difficult to exploit. We present a fully automatic process for the discovery, syntactic and semantic analysis, and querying of hidden-Web services. We propose a general architecture based on a semi-structured warehouse of imprecise (probabilistic) content. We provide a detailed complexity analysis of the underlying probabilistic tree model. We describe how a combination of heuristics and Web probing can be used to understand the structure of an HTML form. We present an original use of conditional random fields (a supervised learning method) in an unsupervised manner, over an automatic, imperfect, and imprecise annotation based on domain knowledge, in order to extract the relevant information from HTML result pages. To obtain semantic relations between the inputs and outputs of a hidden-Web service, we study the complexity of deriving a schema mapping between database instances based solely on the presence of constants in the two instances. Finally, we describe a model for the semantic representation and intensional indexing of hidden-Web sources, and discuss how to answer high-level queries using such descriptions.
  •
    ABSTRACT: This paper describes a new approach to the use of clustering for automatic data detection in semi-structured web pages. Unlike most existing web information extraction approaches, which usually apply wrapper induction techniques to manually labelled web pages, this approach avoids the pattern induction process by using clustering techniques on unlabelled pages. In this approach, a variant Hierarchical Agglomerative Clustering (HAC) algorithm called K-neighbours-HAC is developed, which uses the similarities of the data format (HTML tags) and the data content (text string values) to group similar text tokens into clusters. We also develop a new method to label text tokens to capture the hierarchical structure of HTML pages, and an algorithm for mapping labelled text tokens to XML. The new approach is tested and compared with several common existing wrapper induction systems on three different sets of web pages. The results suggest that the new approach is effective for data record detection and that it outperforms the existing approaches examined on these web sites. Compared with the existing approaches, the new approach does not require training and successfully avoids the explicit pattern induction process, making the entire data detection process simpler.
    Integrated Computer Aided Engineering 12/2008; 15(4):297-311.
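The hidden-Web thesis abstract above mentions deriving a schema mapping between two database instances purely from the constants they share. A minimal sketch of that idea, under the assumption that columns are aligned greedily by value overlap (the column names, the Jaccard score, and the greedy selection are illustrative, not the thesis's actual algorithm):

```python
# Illustrative sketch: align columns of two database instances purely by
# which constants they share, without any schema information.

def jaccard(a, b):
    """Overlap of two value sets, in [0, 1]."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def match_columns(instance_a, instance_b, min_score=0.1):
    """Greedily pair columns (dicts of column name -> list of values)
    in decreasing order of constant overlap."""
    scores = sorted(
        ((jaccard(va, vb), ca, cb)
         for ca, va in instance_a.items()
         for cb, vb in instance_b.items()),
        reverse=True)
    mapping, used_a, used_b = {}, set(), set()
    for score, ca, cb in scores:
        if score >= min_score and ca not in used_a and cb not in used_b:
            mapping[ca] = cb
            used_a.add(ca)
            used_b.add(cb)
    return mapping
```

For a hidden-Web service, `instance_a` could hold the constants submitted to a form and `instance_b` the values scraped from its result pages, so the mapping suggests which output field echoes which input.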
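The clustering paper above groups text tokens by combined format (HTML tag) and content similarity. As a hedged sketch of that idea, not the paper's K-neighbours-HAC algorithm: abstract each token's text to a character-class "shape", score token pairs by a weighted combination of tag-path and shape agreement, and merge clusters agglomeratively with average linkage. The weights, the shape abstraction, and the threshold are all illustrative assumptions.

```python
import re

# Illustrative sketch only: plain average-linkage agglomerative clustering
# over (tag_path, text) tokens, not the paper's K-neighbours-HAC variant.

def content_shape(text):
    """Abstract a string to character classes, e.g. '$12.99' -> '$D.D'."""
    return re.sub(r"[0-9]+", "D", re.sub(r"[A-Za-z]+", "A", text))

def token_similarity(t1, t2):
    tag_sim = 1.0 if t1[0] == t2[0] else 0.0              # same tag path?
    shape_sim = 1.0 if content_shape(t1[1]) == content_shape(t2[1]) else 0.0
    return 0.5 * tag_sim + 0.5 * shape_sim                # assumed weights

def cluster_tokens(tokens, threshold=0.75):
    """Merge the most similar cluster pair (average linkage) until no pair
    of clusters reaches the similarity threshold."""
    clusters = [[t] for t in tokens]
    while True:
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = sum(token_similarity(a, b)
                          for a in clusters[i] for b in clusters[j])
                sim /= len(clusters[i]) * len(clusters[j])
                if sim >= best:
                    best, pair = sim, (i, j)
        if pair is None:
            return clusters
        i, j = pair
        clusters[i].extend(clusters.pop(j))
```

On a product-listing page, for example, price tokens in table cells share both tag path and shape and so fall into one cluster, while a heading token stays apart.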
