Automatically maintaining navigation sequences for querying semi-structured web sources.

Department of Information and Communication Technologies, Facultad de Informatica, Campus de Elviña s/n, University of A Coruña, 15071 A Coruña, Spain
Received 14 November 2006. Revised 22 February 2007. Accepted 13 April 2007. Available online 13 May 2007.
Data Knowl. Eng. 01/2007; 63:795-810. DOI: 10.1016/j.datak.2007.04.009
Source: DBLP

ABSTRACT A substantial subset of Web data has an underlying structure. For instance, the pages obtained in response to a query executed through a Web search form are usually generated by a program that accesses structured data in a local database and embeds them into an HTML template. For software programs to gain full benefit from these “semi-structured” Web sources, wrapper programs must be built to provide a “machine-readable” view over them. Since Web sources are autonomous, they may experience changes that invalidate the current wrapper; automatic wrapper maintenance is therefore an important issue. Wrappers must perform two tasks: navigating through Web sites and extracting structured data from HTML pages. While several works have addressed the automatic maintenance of data extraction tasks, the problem of maintaining the navigation sequences remains, to the best of our knowledge, unaddressed. In this paper, we propose a set of novel techniques to fill this gap.

  • ABSTRACT: Web automation applications are widely used for purposes such as B2B integration, automated testing of web applications, and technology and business watch. A crucial requirement of web automation applications is the ability to easily generate and reproduce navigation sequences. This problem is especially complicated in the case of the new breed of AJAX-based websites. Although some recent tools have addressed this problem, they show limitations either in usability or in their ability to deal with complex websites. In this paper, we propose a set of new techniques to build an automatic web navigation system able to deal with these complexities. Our main contributions are: a new method for recording navigation sequences that scales to a wider range of events, an algorithm to identify the target element of a user action in a change-resilient manner, and a novel method to detect when the effects caused by a user action (including the effects of scripting code and AJAX requests) have finished. In addition, we have tested our approach with a large number of real web sources and have compared it with other relevant web automation tools, obtaining very good results.
    Data Knowl. Eng. 01/2011; 70:269-283.
  • ABSTRACT: The increasingly large amount of financial information available in a number of heterogeneous business sources implies that the traditional methods of analysis are no longer applicable. These financial data sources are characterized by the use of disparate data models and by unstructured content with implicit knowledge. In addition, the most up-to-date financial information typically resides in the vast amount of financial news that brokers take into account when investing. As Semantic Technologies mature, they provide a consistent and reliable basis for the development of superior, more precise mechanisms to deal with heterogeneous data. In this paper, we present a financial news semantic search engine based on Semantic Web technologies. The search engine is accompanied by an ontology population tool that assists in keeping the financial ontology up-to-date. In addition, a further module has been developed that is capable of crawling the Web in search of financial news and annotating it with knowledge entities from the financial ontology that match the contents of the news. Our contribution is an overall solution based on a fully fledged architecture that has been validated in a use case scenario for the Spanish stock exchange.
    Expert Systems with Applications 01/2011; 12(38):15565-15572. · 1.85 Impact Factor
  • ABSTRACT: With the rapid growth of web information, there is an increasing need to easily and efficiently acquire accurate information from the massive and heterogeneous web. Web information extraction is a research area that aims to meet this need. In this paper, we analyze the shortcomings of related research and systems and find that, when extracting accurate web information with complex structures, few systems can do so without placing too much of a burden on users. To overcome these pitfalls, this paper studies and proposes a comprehensive model and framework that combines automatic web data analysis and extraction with user interaction-based semi-supervised web data extraction. The new model and framework offers a good trade-off between the automatic generation of extraction rules and their expressive capability for accurate information extraction. Based on this, we further present a multi-functional data extraction rule system that uses a variety of structural and textual extraction rules with different functions to achieve powerful expressive capability. Furthermore, to offer a powerful expression mechanism for data extraction, this paper describes a well-designed, XML-based data extraction language that works well for rule generation based on both automatic web structure analysis and user interaction.

Alberto Pan