Automatically maintaining navigation sequences for querying semi-structured web sources.

Department of Information and Communication Technologies, Facultad de Informatica, Campus de Elviña s/n, University of A Coruña, 15071 A Coruña, Spain; Received 14 November 2006. Revised 22 February 2007. Accepted 13 April 2007. Available online 13 May 2007.
Data & Knowledge Engineering (Impact Factor: 1.49). 12/2007; 63:795-810. DOI: 10.1016/j.datak.2007.04.009
Source: DBLP

ABSTRACT A substantial subset of Web data has an underlying structure. For instance, the pages obtained in response to a query executed through a Web search form are usually generated by a program that accesses structured data in a local database and embeds them into an HTML template. For software programs to gain full benefit from these “semi-structured” Web sources, wrapper programs must be built to provide a “machine-readable” view over them. Since Web sources are autonomous, they may experience changes that invalidate the current wrapper, so automatic maintenance is an important issue. Wrappers must perform two tasks: navigating through Web sites and extracting structured data from HTML pages. While several works have addressed the automatic maintenance of data extraction tasks, the problem of maintaining the navigation sequences remains, to the best of our knowledge, unaddressed. In this paper, we propose a set of novel techniques to fill this gap.

  • ABSTRACT: The increase of audiovisual content in DTV accessible by the audience requires mechanisms for integrating information and for advanced content search oriented to user preferences. Semantic approaches offer promising solutions to this problem. In this work, we present a system based on Semantic Web technologies capable of querying different content resources and integrating the information to present it uniformly to the user.
  • ABSTRACT: With the rapid growth of web information, there is an increasing need to easily and efficiently acquire accurate information from the massive and heterogeneous web. Web information extraction is a research area that addresses this need. In this paper, we analyze the shortcomings of related research and systems and find that, when extracting accurate web information with complex structures, few systems can do so without placing too much of a burden on users. To overcome this kind of pitfall, this paper studies and proposes a comprehensive model and framework that combines automatic web data analysis and extraction with user interaction-based, semi-supervised web data extraction. The new model and framework offers a good trade-off between the automatic generation of extraction rules and their expressive capability for accurate information extraction. Based on this, we further present a multi-functional data extraction rule system that uses a variety of structural and textual extraction rules with different functions to achieve powerful expressive capability. Furthermore, to offer a powerful expression mechanism for data extraction, this paper describes a well-designed, XML-based data extraction language that works well for rule generation based on both automatic web structure analysis and user interaction.
  • ABSTRACT: An increasingly large amount of financial information available in a number of heterogeneous business sources implies that the traditional methods of analysis are no longer applicable. These financial data sources are characterized by the use of disparate data models and their unstructured content with implicit knowledge. In addition, the most up-to-date financial information typically resides in the vast amount of finance-related news that brokers take into account when investing. As Semantic Technologies mature, they provide a consistent and reliable basis for the development of superior, more precise mechanisms to deal with heterogeneous data. In this paper, we present a financial news semantic search engine based on Semantic Web technologies. The search engine is accompanied by an ontology population tool that assists in keeping the financial ontology up-to-date. In addition, a further module has been developed that is capable of crawling the Web in search of financial news and annotating it with knowledge entities from the financial ontology that match the contents of the news. Our contribution is an overall solution based on a fully fledged architecture that has been validated in a use case scenario for the Spanish stock exchange.
    Expert Systems with Applications 11/2011; 12(38):15565-15572. · 1.97 Impact Factor

Alberto Pan