Automatically maintaining navigation sequences for querying semi-structured web sources.
ABSTRACT: A substantial subset of Web data has an underlying structure. For instance, the pages obtained in response to a query executed through a Web search form are usually generated by a program that accesses structured data in a local database and embeds them into an HTML template. For software programs to gain full benefit from these “semi-structured” Web sources, wrapper programs must be built to provide a “machine-readable” view over them. Since Web sources are autonomous, they may experience changes that invalidate the current wrapper, so automatic maintenance is an important issue. Wrappers must perform two tasks: navigating through Web sites and extracting structured data from HTML pages. While several works have addressed the automatic maintenance of data extraction tasks, the problem of maintaining the navigation sequences remains, to the best of our knowledge, unaddressed. In this paper, we propose a set of novel techniques to fill this gap.
- Source available from: Craig A. Knoblock
ABSTRACT: A critical problem in developing information agents for the Web is accessing data that is formatted for human use. We have developed a set of tools for extracting data from web sites and transforming it into a structured data format, such as XML. The resulting data can then be used to build new applications without having to deal with unstructured data. The advantages of our wrapping technology over previous work are the ability to learn highly accurate extraction rules, to verify the wrapper to ensure that the correct data continues to be extracted, and to automatically adapt to changes in the sites from which the data is being extracted. IEEE Data Eng. Bull. 01/2000; 23:33-41.
ABSTRACT: The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. Journal of Artificial Intelligence Research 01/2003; 18:149-181. · 1.06 Impact Factor
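The verification idea in this abstract can be illustrated with a rough sketch. It is not the algorithm from the paper: it assumes that the "structural information" is simply the sequence of coarse token types in each extracted value, and that a wrapper is suspect when too few new extractions match a sequence seen in the positive examples. The names `signature`, `learn_signatures`, `verify`, the `threshold` parameter, and the sample data are all hypothetical.

```python
import re

def token_type(tok: str) -> str:
    """Map a token to a coarse type (an assumed, simplified type system)."""
    if tok.isdigit():
        return "NUM"
    if tok.isalpha():
        return "CAP" if tok[0].isupper() else "LOWER"
    return "PUNCT"

def signature(value: str) -> tuple:
    """Sequence of token types for one extracted value."""
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", value)
    return tuple(token_type(t) for t in tokens)

def learn_signatures(positive_examples):
    """Collect the signatures observed in known-good extractions."""
    return {signature(v) for v in positive_examples}

def verify(extracted, learned, threshold=0.8):
    """Return True if enough extracted values look like the learned ones."""
    if not extracted:
        return False  # extracting nothing is treated as a failure
    matches = sum(1 for v in extracted if signature(v) in learned)
    return matches / len(extracted) >= threshold

# Hypothetical price field: learned from values extracted before any change.
learned = learn_signatures(["$12.50", "$3.99"])
still_ok = verify(["$7.25", "$100.00"], learned)
after_change = verify(["Add to cart", "Add to cart"], learned)
```

Here `still_ok` is true because the new prices share the learned token pattern, while `after_change` is false because the wrapper is now picking up button text instead of prices.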
Conference Proceeding: Automatically Maintaining Wrappers for Web Sources.
ABSTRACT: A substantial subset of the Web data follows some kind of underlying structure. Nevertheless, HTML does not contain any schema or semantic information about the data it represents. A program able to provide software applications with a structured view of those semi-structured Web sources is usually called a wrapper. Wrappers are able to accept a query against the source and return a set of structured results, thus enabling applications to access Web data in a similar manner to that of information from databases. A significant problem in this approach arises because Web sources may experience changes that invalidate the current wrappers. In this paper, we present novel heuristics and algorithms to address this problem. Our approach is based on collecting some query results during wrapper operation. Then, when the source changes, these results are used to generate a set of labeled examples, which are provided as input to a wrapper induction algorithm able to regenerate the wrapper. We have tested our methods in several real-world Web data extraction domains, obtaining high accuracy in all the steps of the process. Ninth International Database Engineering and Applications Symposium (IDEAS 2005), 25-27 July 2005, Montreal, Canada; 01/2005
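The relabeling step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' heuristics: it assumes that the results stored during wrapper operation are plain strings, that the changed page is available as HTML text, and that exact substring matching is enough to locate them. The function names `strip_tags` and `label_examples` and the sample data are hypothetical.

```python
import re
from html import unescape

def strip_tags(html: str) -> str:
    """Crude tag removal for illustration; a real system would parse the DOM."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", unescape(text)).strip()

def label_examples(stored_results, new_page_html):
    """Locate previously extracted field values in the changed page.

    stored_results: list of dicts mapping field name -> value seen before
    the source changed. Returns (field, value, offset) tuples for every
    stored value that still appears in the page text.
    """
    text = strip_tags(new_page_html)
    labeled = []
    for record in stored_results:
        for field, value in record.items():
            pos = text.find(value)
            if pos != -1:
                labeled.append((field, value, pos))
    return labeled

# Hypothetical data: one result stored before the change, one changed page.
stored = [{"title": "Dune", "price": "9.99"}]
new_page = "<div class='item'><b>Dune</b> <span>Price: 9.99</span></div>"
examples = label_examples(stored, new_page)
```

The resulting tuples play the role of labeled examples that could be handed to a wrapper induction step. Exact matching is the simplest possible choice here; values that were reformatted by the source would need approximate matching.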