Querying Web Data - The WebQA Approach
ABSTRACT: The common paradigm for searching and retrieving information on the Web is keyword-based search via one or more search engines, followed by browsing through the large number of returned URLs. This is significantly weaker than the declarative querying supported by DBMSs. The lack of a schema and the high volatility of the Web make "database-like" querying of Web data difficult. In this paper we report on our work in building a system, called WebQA, that provides a declarative, query-based approach to Web data retrieval, using question-answering technology to extract information from Web sites retrieved by search engines. The approach consists of first using meta-search techniques in an open environment to gather candidate responses from search engines and other on-line databases, and then using information extraction techniques to find the answer to the specific question among these candidates. A prototype system has been developed to test this approach, including an evaluation of its performance as a question-answering system on the well-known TREC-9 benchmark. Its accuracy on TREC-9 data for simple questions is high, and its retrieval performance is good. The system employs an open architecture that allows for on-going improvement of its various components.
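The two-phase approach the abstract describes (meta-search for candidates, then extraction of the answer) can be sketched roughly as follows. This is an illustrative sketch only, not the WebQA implementation; the function names, the toy "engines", and the keyword-overlap scoring heuristic are all assumptions.

```python
# Hypothetical sketch of a two-phase QA pipeline: (1) meta-search gathers
# candidate snippets from several back-ends, (2) extraction picks the
# candidate best matching the question. Names are illustrative, not WebQA's.
import re

def meta_search(question, engines):
    """Phase 1: collect candidate snippets from several search back-ends."""
    candidates = []
    for engine in engines:
        candidates.extend(engine(question))  # each engine returns snippets
    return candidates

def extract_answer(question, candidates):
    """Phase 2: score candidates by keyword overlap with the question and
    return the best-matching snippet (a stand-in for real extraction)."""
    stopwords = {"what", "is", "the", "who", "when"}
    keywords = set(re.findall(r"\w+", question.lower())) - stopwords
    def score(snippet):
        return len(keywords & set(re.findall(r"\w+", snippet.lower())))
    return max(candidates, key=score) if candidates else None

# Toy "engines" standing in for real search-engine wrappers.
engine_a = lambda q: ["Ottawa is the capital of Canada.",
                      "Canada has ten provinces."]
engine_b = lambda q: ["The capital of Canada is Ottawa."]

question = "What is the capital of Canada?"
answer = extract_answer(question, meta_search(question, [engine_a, engine_b]))
```

In a real system each engine wrapper would issue an HTTP query and parse result snippets; the extraction phase would use far richer signals (answer typing, named entities, redundancy) than bare keyword overlap.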
Source: Available from citeseerx.ist.psu.edu
ABSTRACT: The MultiText QA System performs question answering using a two-step passage selection method. In the first step, an arbitrary passage retrieval algorithm efficiently identifies hotspots in a large target corpus where the answer might be located. In the second step, an answer selection algorithm analyzes these hotspots, considering such factors as answer type and candidate redundancy, to extract short answer snippets. This chapter describes both steps in detail, with the goal of providing sufficient information to allow independent implementation. The method is evaluated using the test collection developed for the TREC 2001 question answering track.
12/2005: pages 259-283
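A minimal sketch of the two-step scheme the abstract outlines: step 1 locates hotspot windows with high query-term density, and step 2 ranks candidate terms from those hotspots, rewarding redundancy (appearing in several hotspots). This is not the MultiText implementation; the window scoring and candidate-counting heuristics below are simplified assumptions.

```python
# Illustrative two-step passage selection (not the MultiText algorithm):
# step 1 finds hotspots (high-scoring token windows); step 2 selects an
# answer candidate, using redundancy across hotspots as the ranking signal.
from collections import Counter

def find_hotspots(corpus_tokens, query_terms, window=3, top_k=3):
    """Step 1: return the top-k token windows with the most query-term hits."""
    scores = []
    for i in range(len(corpus_tokens) - window + 1):
        win = corpus_tokens[i:i + window]
        scores.append((sum(t in query_terms for t in win), i))
    scores.sort(reverse=True)
    return [corpus_tokens[i:i + window] for _, i in scores[:top_k]]

def select_answer(hotspots, query_terms):
    """Step 2: candidates are non-query tokens inside hotspots; a candidate
    seen in more hotspots (higher redundancy) scores higher."""
    counts = Counter()
    for hs in hotspots:
        for t in set(hs) - set(query_terms):
            counts[t] += 1
    return counts.most_common(1)[0][0] if counts else None

query = {"wrote", "hamlet"}
corpus = "wrote hamlet shakespeare x wrote hamlet shakespeare y wrote hamlet shakespeare z".split()
hotspots = find_hotspots(corpus, query, window=3, top_k=3)
answer = select_answer(hotspots, query)
```

The real system additionally constrains candidates by expected answer type (e.g. a person for a "who" question), which this sketch omits.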
ABSTRACT: Machine learning is the science of building predictors from data while accounting for the predictor's accuracy on future data. Many machine learning classifiers can make accurate predictions when the data is complete. In the presence of insufficient data, statistical methods can be applied to fill in a few missing items, but these methods rely only on the available data to calculate the missing values and perform poorly when the percentage of missing values exceeds a threshold. An alternative is to fill in the missing data through an automated knowledge discovery process that mines the WWW. This procedure first restores the missing information and then learns the parameters of the classifier from the restored data. Using a Bayesian network as the classifier, the parameters, i.e., the probabilities associated with the causal relationships in the network, are deduced using the knowledge mined from the WWW in conjunction with the data available on hand. The method, when tested on heart disease data sets from the UC Irvine Machine Learning Repository [UCI repository of machine learning databases], gave satisfactory results.
2004 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2004), 20-24 September 2004, Beijing, China; 01/2004
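The fill-then-learn procedure can be sketched as below. The WWW mining step is replaced here by a stand-in lookup table, and the parameter learning is a plain maximum-likelihood count over one parent-child edge of a network; both simplifications, as well as the toy attribute names, are assumptions for illustration.

```python
# Hedged sketch of the fill-then-learn procedure: (1) restore missing
# values from an external knowledge source (a dict stands in for WWW
# mining), (2) learn Bayesian-network parameters, i.e. conditional
# probabilities P(child | parent), by counting over the restored records.
from collections import Counter

def restore(records, lookup):
    """Step 1: fill missing attribute values (None) from a knowledge source."""
    return [{k: (lookup.get(k) if v is None else v) for k, v in rec.items()}
            for rec in records]

def learn_cpt(records, child, parent):
    """Step 2: estimate P(child | parent) by maximum likelihood."""
    joint = Counter((r[parent], r[child]) for r in records)
    marginal = Counter(r[parent] for r in records)
    return {pc: joint[pc] / marginal[pc[0]] for pc in joint}

# Toy records with missing entries (attribute names are hypothetical).
data = [
    {"smoker": "yes", "disease": "present"},
    {"smoker": "yes", "disease": None},       # missing value
    {"smoker": "no",  "disease": "absent"},
    {"smoker": None,  "disease": "absent"},   # missing value
]
filled = restore(data, {"smoker": "no", "disease": "present"})
cpt = learn_cpt(filled, child="disease", parent="smoker")
```

The paper's contribution is in step 1: deriving the fill-in values from knowledge mined off the Web rather than from the incomplete data set itself, which is what lets the approach survive high missing-value percentages.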