Article

Novel Method of Web Database Redundancy Computing for Web Data Sources Selection

Authors:

Abstract

With the rapidly increasing number of Web databases (WDBs), a core issue in Web data integration is selecting the most appropriate combination of databases to query, so that more targeted data can be obtained at lower cost. In this study, to reduce redundant data drawn from different sources, we propose a novel method of Web database redundancy computing that selects proper Web data sources for given keywords. We first propose a Web database feature representation model; then, based on sample data from the sources, we develop a deep Web redundancy computing method that handles three attribute types: text, numeric, and categorical. Experiments show that the method achieves the desired objectives and meets the demands of an integrated system well.
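The abstract distinguishes three attribute types but the paper's exact formulas are not available here. The sketch below shows one plausible per-type redundancy measure between two sources' sample data: token overlap for text attributes, range overlap for numeric attributes, and value-set overlap for categorical attributes. All three functions are illustrative assumptions, not the authors' actual method.

```python
def text_redundancy(values_a, values_b):
    """Jaccard overlap of token sets from two text attributes (assumed measure)."""
    tokens_a = {t.lower() for v in values_a for t in v.split()}
    tokens_b = {t.lower() for v in values_b for t in v.split()}
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def numeric_redundancy(values_a, values_b):
    """Overlap of the two value ranges, normalized by the combined span (assumed measure)."""
    lo = max(min(values_a), min(values_b))
    hi = min(max(values_a), max(values_b))
    span = max(values_a + values_b) - min(values_a + values_b)
    return max(0.0, hi - lo) / span if span else 1.0

def categorical_redundancy(values_a, values_b):
    """Jaccard overlap of the observed category sets (assumed measure)."""
    sa, sb = set(values_a), set(values_b)
    return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0
```

A per-source redundancy score could then be an average of these per-attribute scores, with low-redundancy sources preferred when composing the query plan.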

No full-text available



Conference Paper
In this paper, we propose a new approach to automatically clustering e-commerce search engines (ESEs) on the Web such that ESEs in the same cluster sell similar products. This allows an e-commerce metasearch engine (comparison shopping system) to be built over the ESEs of each cluster. Our approach performs the clustering based on the features available on the interface page (i.e., the Web page containing the search form or interface) of each ESE. Special features that are utilized include the number of links, the number of images, terms appearing in the search form, and normalized price terms. Our experimental results based on nearly 300 ESEs indicate that this approach can achieve good results.
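The clustering approach above can be sketched as follows. The feature vectors (link count, image count, form-term overlap, and so on) and the greedy threshold clustering used here are hypothetical stand-ins; the cited paper does not specify its algorithm at this level of detail.

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_by_threshold(vectors, threshold=0.9):
    """Greedy single-pass clustering (assumed scheme): assign each
    interface-page feature vector to the first cluster whose representative
    is similar enough, otherwise start a new cluster."""
    clusters = []  # list of (representative_vector, member_indices)
    for i, v in enumerate(vectors):
        for rep, members in clusters:
            if cosine(rep, v) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((v, [i]))
    return [members for _, members in clusters]
```

For example, two ESEs with near-identical interface features land in one cluster, while a dissimilar one starts its own, so a comparison-shopping system can be built per cluster.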
Article
The dramatic growth of the Internet has created a new problem for users: location of the relevant sources of documents. This article presents a framework for (and experimentally analyzes a solution to) this problem, which we call the text-source discovery problem. Our approach consists of two phases. First, each text source exports its contents to a centralized service. Second, users present queries to the service, which returns an ordered list of promising text sources. This article describes GlOSS, Glossary of Servers Server, with two versions: bGlOSS, which provides a Boolean query retrieval model, and vGlOSS, which provides a vector-space retrieval model. We also present hGlOSS, which provides a decentralized version of the system. We extensively describe the methodology for measuring the retrieval effectiveness of these systems and provide experimental evidence, based on actual data, that all three systems are highly effective in determining promising text sources for a given query.
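A bGlOSS-style Boolean estimator can be sketched as follows: under an independence assumption, the number of documents in a source matching all query terms is estimated from per-term document frequencies, and sources are ranked by that estimate. This is a simplified illustration of the idea described above, not the article's full system.

```python
def estimate_matches(num_docs, doc_freqs, query_terms):
    """Estimate how many documents match a conjunctive Boolean query,
    assuming terms occur independently across documents."""
    if num_docs == 0:
        return 0.0
    est = float(num_docs)
    for term in query_terms:
        est *= doc_freqs.get(term, 0) / num_docs
    return est

def rank_sources(sources, query_terms):
    """sources: {name: (num_docs, {term: doc_freq})} -> [(name, estimate)]
    ordered from most to least promising."""
    scored = [(name, estimate_matches(n, df, query_terms))
              for name, (n, df) in sources.items()]
    return sorted(scored, key=lambda s: s[1], reverse=True)
```

A small source with high per-term frequencies can thus outrank a much larger source whose frequencies for the query terms are low.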
Article
With the extensive application of WDBs, making full use of their data has become a hot issue in current research. The WDB query interface is an important way to obtain WDB data, and full representation and extraction of the query interface is a prerequisite for gaining access to that data. This paper presents an ontology-based representation of the WDB query interface. The method represents not only the general attributes of the query interface but also its context information and the relationships among interface properties, which lays the foundation for accurate classification and integration of WDBs. The paper also gives an interface-extraction method based on DOM and Watir. Practice shows that the method can express and extract the context, property, and relationship information of WDB query interfaces well.
Article
The Web has become the preferred medium for many database applications, such as e-commerce and digital libraries. These applications store information in huge databases that users access, query, and update through the Web. Database-driven Web sites have their own interfaces and access forms for creating HTML pages on the fly. Web database technologies define the way that these forms can connect to and retrieve data from database servers. The number of database-driven Web sites is increasing exponentially, and each site is creating pages dynamically-pages that are hard for traditional search engines to reach. Such search engines crawl and index static HTML pages; they do not send queries to Web databases. The information hidden inside Web databases is called the "deep Web" in contrast to the "surface Web" that traditional search engines access easily. We expect deep Web search engines and technologies to improve rapidly and to dramatically affect how the Web is used by providing easy access to many more information resources.