Conference Paper

Discovery of Weather Forecast Web Resources Based on Ontology and Content-Driven Hierarchical Classification

Abstract

Monitoring of environmental information is critical both for tracking the evolution of important environmental events and for everyday life activities. In this work, we focus on the discovery of web resources that provide weather forecasts. To this end, we submit domain-specific queries to a general-purpose search engine and post-process the results with a hierarchical two-layer classification scheme. The top layer includes two classification models: (a) the first is trained using ontology concepts as textual features; (b) the second is trained using textual features learned from a training corpus. The bottom layer includes a hybrid classifier that combines the results of the top layer. We evaluate the proposed technique by discovering weather forecast websites for cities of Finland and compare the results with previous works.
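
As a rough, illustrative sketch of such a two-layer scheme (the feature sets, classifier choices and toy documents below are assumptions, not the configuration used in the paper), the top layer could consist of two classifiers, one over ontology-concept features and one over corpus-learned features, with a bottom-layer model combining their scores:

    # Hypothetical sketch of a two-layer classification scheme (not the paper's exact setup).
    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import LinearSVC

    # Toy training data: 1 = weather forecast page, 0 = other (illustrative only).
    docs = ["weather forecast temperature wind humidity for Helsinki",
            "online shop for winter clothes and accessories",
            "7 day forecast rain snow temperature for Tampere",
            "restaurant menu and opening hours in Espoo"]
    labels = np.array([1, 0, 1, 0])

    # (a) Ontology-driven features: a fixed vocabulary of domain concepts (assumed here).
    ontology_concepts = ["weather", "forecast", "temperature", "wind", "humidity", "rain", "snow"]
    onto_vec = CountVectorizer(vocabulary=ontology_concepts)
    onto_clf = LinearSVC().fit(onto_vec.fit_transform(docs), labels)

    # (b) Corpus-driven features: a vocabulary learned from the training corpus.
    corpus_vec = TfidfVectorizer()
    corpus_clf = LinearSVC().fit(corpus_vec.fit_transform(docs), labels)

    # Bottom layer: a hybrid classifier trained on the two top-layer decision scores.
    def top_layer_scores(texts):
        return np.column_stack([onto_clf.decision_function(onto_vec.transform(texts)),
                                corpus_clf.decision_function(corpus_vec.transform(texts))])

    meta_clf = LogisticRegression().fit(top_layer_scores(docs), labels)
    print(meta_clf.predict(top_layer_scores(["hourly weather forecast for Turku"])))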

References

Chapter
It is common practice to present environmental information of a spatial nature (such as atmospheric quality patterns) in the form of pre-processed images. The current paper deals with the harmonization, comparison and reuse of Chemical Weather (CW) forecasts in the form of pre-processed images of varying quality and informational content, without access to the original data. In order to compare, combine and reuse such environmental data, an innovative method for the inverse reconstruction of environmental data from images was developed. The method is based on a new neural adaptive data interpolation algorithm and is tested on CW images from various European providers. Results indicate very good performance, which renders the method appropriate for various image-processing problems that require data reconstruction, retrieval and reuse.
Article
This paper studies how link information can be used to improve classification results for Web collections. We evaluate four different measures of subject similarity, derived from the Web link structure, and determine how accurate they are in predicting document categories. Using a Bayesian network model, we combine these measures with the results obtained by traditional content-based classifiers. Experiments on a Web directory show that best results are achieved when links from pages outside the directory are considered. Link information alone is able to obtain gains of up to 46 points in F1, when compared to a traditional content-based classifier. The combination with content-based methods can further improve the results, but too much noise may be introduced, since the text of Web pages is a much less reliable source of information. This work provides an important insight on which measures derived from links are more appropriate to compare Web documents and how these measures can be combined with content-based algorithms to improve the effectiveness of Web classification.
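
For illustration, two classic link-derived similarity measures (bibliographic coupling and co-citation) can be computed from a link graph as sketched below; the graph, the function names and the simple weighted combination are assumptions, whereas the cited work combines such measures with content-based scores through a Bayesian network model:

    # Hypothetical link graph: page -> set of pages it links to (toy data).
    out_links = {
        "A": {"C", "D"},
        "B": {"C", "E"},
        "C": set(),
        "D": {"A"},
        "E": {"A", "B"},
    }

    def in_links(page):
        # Pages that link to the given page.
        return {p for p, targets in out_links.items() if page in targets}

    def bibliographic_coupling(p, q):
        # Number of pages that both p and q link to.
        return len(out_links[p] & out_links[q])

    def co_citation(p, q):
        # Number of pages that link to both p and q.
        return len(in_links(p) & in_links(q))

    def combined_score(p, q, content_score, alpha=0.5):
        # Naive linear combination; the cited work uses a Bayesian network instead.
        link_score = bibliographic_coupling(p, q) + co_citation(p, q)
        return alpha * link_score + (1 - alpha) * content_score

    print(bibliographic_coupling("A", "B"), co_citation("A", "B"))
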
Conference Paper
Environmental and meteorological conditions are of utmost importance to the population, as they are strongly related to quality of life. Citizens are increasingly aware of this importance, which results in a growing demand for environmental information tailored to their specific needs and background. We present an environmental information platform that supports the submission of user queries related to environmental conditions and orchestrates results from complementary services to generate personalized suggestions. From the technical viewpoint, the system discovers and processes reliable data on the web in order to convert them into knowledge. At run time, this information is transferred into an ontology-structured knowledge base, from which information relevant to the specific user is then deduced and communicated in the language of their preference. The platform is demonstrated with real-world use cases in southern Finland, showing the impact it can have on the quality of everyday life. © 2012 IFIP International Federation for Information Processing.
Conference Paper
Ontology learning has become a major area of research whose goal is to facilitate the construction of ontologies by decreasing the amount of effort required to produce an ontology for a new domain. However, there are few studies that attempt to automate the entire ontology learning process, from the collection of domain-specific literature to text mining for building new ontologies or enriching existing ones. In this paper, we present an ontology learning framework that enables us to retrieve documents from the Web using focused crawling in a biological domain, amphibian morphology. We use an SVM (support vector machine) classifier to identify domain-specific documents and perform text mining in order to extract useful information for the ontology enrichment process. This paper reports on the overall system architecture and our initial experiments on the focused crawler and document classification.
Conference Paper
When a user performs a web search, the first query entered will frequently not return the required information. Thus, one needs to review the initial set of links and then modify the query or construct a new one. This incremental process is particularly frustrating and difficult to manage for a mobile user due to device limitations (e.g. keyboard, display). We present a query formulation architecture that employs the notion of context in order to automatically construct queries, where context refers to the article currently being viewed by the user. The proposed system uses semantic metadata extracted from the web page being consumed to automatically generate candidate queries. Novel methods are proposed to create and validate candidate queries. Furthermore, two variants of query expansion and a post-expansion validation technique are described. Finally, insights into the effectiveness of our system are provided based on evaluation tests of its individual components.
Conference Paper
http://www.it.lth.se/knowlib/ The aim of the study was to determine how significance indicators assigned to different Web page elements (internal metadata, title, headings, and main text) influence automated classification. The data collection used comprised 1000 Web pages in engineering, to which Engineering Information classes had been manually assigned. The significance indicators were derived using several different methods: (total and partial) precision and recall, semantic distance and multiple regression. It was shown that for best results all the elements have to be included in the classification process. The exact way of combining the significance indicators turned out not to be overly important: using the F1 measure, the best combination of significance indicators yielded no more than 3% higher performance results than the baseline.
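
A toy sketch of the underlying idea, weighting terms by the page element they appear in; the element weights and the example page are assumptions, not the significance indicators derived in the study:

    # Illustrative only: weight terms according to the HTML element they occur in.
    from collections import Counter
    from bs4 import BeautifulSoup

    ELEMENT_WEIGHTS = {"title": 3.0, "headings": 2.0, "body": 1.0}  # assumed values

    def weighted_terms(html):
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.get_text() if soup.title else ""
        headings = " ".join(h.get_text() for h in soup.find_all(["h1", "h2", "h3"]))
        body = soup.get_text(separator=" ")  # whole-page text, used as the "main text" here
        scores = Counter()
        for element, text in (("title", title), ("headings", headings), ("body", body)):
            for term in text.lower().split():
                scores[term] += ELEMENT_WEIGHTS[element]
        return scores

    html = ("<html><head><title>Weather forecast</title></head>"
            "<body><h1>Helsinki</h1><p>Rain expected.</p></body></html>")
    print(weighted_terms(html).most_common(3))
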
Article
Domain-specific Web search engines are effective tools for reducing the difficulty experienced when acquiring information from the Web. Existing methods for building domain-specific Web search engines require human expertise or specific facilities. However, we can build a domain-specific search engine simply by adding domain-specific keywords, called "keyword spices," to the user's input query and forwarding it to a general-purpose Web search engine. Keyword spices can be effectively discovered from Web documents using machine learning technologies. The paper describes domain-specific Web search engines that use keyword spices for locating recipes, restaurants, and used cars.
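
A minimal sketch of the keyword-spice idea, assuming the spice terms have already been learned offline; the terms and query below are invented for illustration, and the original work learns a Boolean spice expression rather than a flat term list:

    # Append learned domain-specific terms ("keyword spices") to the user's query
    # before forwarding it to a general-purpose search engine.
    def spice_query(user_query, keyword_spices):
        # Conjoin the user's query with a disjunction of domain-specific terms.
        return f"{user_query} ({' OR '.join(keyword_spices)})"

    recipe_spices = ["ingredients", "tablespoon", "preheat"]  # assumed, learned offline
    print(spice_query("chicken soup", recipe_spices))
    # -> chicken soup (ingredients OR tablespoon OR preheat)
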
Article
Classification of Web page content is essential to many tasks in Web information retrieval such as maintaining Web directories and focused crawling. The uncontrolled nature of Web content presents additional challenges to Web page classification as compared to traditional text classification, but the interconnected nature of hypertext also provides features that can assist the process. As we review work in Web page classification, we note the importance of these Web-specific features and algorithms, describe state-of-the-art practices, and track the underlying assumptions behind the use of information from neighboring pages.
Article
The paper describes an approach to automatic Web-page classification based on the Yahoo hierarchy. Machine learning techniques developed for learning on text data are used here on the hierarchical classification structure. The high number of features is reduced by taking into account the hierarchical structure and using feature subset selection based on a method known from information retrieval. Documents are represented as feature vectors that include n-grams instead of only single words (unigrams), as commonly used when learning on text data. Based on the hierarchical structure, the problem is divided into subproblems, each representing one of the categories included in the Yahoo hierarchy. The result of learning is a set of independent classifiers, each used to predict the probability that a new example is a member of the corresponding category. Experimental evaluation on real-world data shows that the proposed approach gives good results. For more than half of the test examples, the correct category is among the three categories with the highest predicted probability.
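
A small sketch of the general recipe: one independent binary classifier per category over n-gram features, reporting the top three categories by predicted probability. The categories, documents and classifier choice are toy assumptions, not the Yahoo hierarchy setup of the paper:

    import numpy as np
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsRestClassifier

    docs = ["weather forecast rain wind", "stock market shares prices",
            "football match score goal", "temperature humidity forecast",
            "bank interest rate loan", "tennis tournament final set"]
    cats = ["weather", "finance", "sports", "weather", "finance", "sports"]

    vec = CountVectorizer(ngram_range=(1, 2))              # unigrams and bigrams
    X = vec.fit_transform(docs)
    clf = OneVsRestClassifier(LogisticRegression()).fit(X, cats)  # one classifier per category

    probs = clf.predict_proba(vec.transform(["snow and wind forecast for the match"]))[0]
    top3 = sorted(zip(clf.classes_, probs), key=lambda t: t[1], reverse=True)[:3]
    print(top3)
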
Article
Web-page classification is much more difficult than pure-text classification due to the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization for improving accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement over a pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about a 12.9% improvement over pure-text-based methods.
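
An illustrative sketch of the overall idea, classifying a page on a short extractive summary rather than its full, noisy text; the naive term-overlap summarizer and the example page are assumptions, far simpler than the summarization algorithms evaluated in the paper:

    from collections import Counter

    def summarize(text, n_sentences=2):
        # Keep the sentences whose terms overlap most with the whole document.
        sentences = [s.strip() for s in text.split(".") if s.strip()]
        doc_terms = Counter(text.lower().split())
        def score(sentence):
            return sum(doc_terms[t] for t in sentence.lower().split())
        return ". ".join(sorted(sentences, key=score, reverse=True)[:n_sentences])

    page_text = ("Weather forecast for Helsinki. Rain and strong wind expected tomorrow. "
                 "Advertisement: buy umbrellas online. Temperatures stay near five degrees.")
    summary = summarize(page_text)
    print(summary)   # the summary, not the full noisy page, would be fed to the classifier
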
Conference Paper
Analysis and processing of environmental information is considered of utmost importance for humanity. This article addresses the problem of discovering web resources that provide environmental measurements. Towards the solution of this domain-specific search problem, we combine state-of-the-art search techniques with advanced textual processing and supervised machine learning. Specifically, we generate domain-specific queries using empirical information and machine-learning-driven query expansion in order to enhance the initial queries with domain-specific terms. Multiple variations of these queries are submitted to a general-purpose web search engine in order to achieve high recall, and we employ a post-processing module based on supervised machine learning to improve the precision of the final results. In this work, we focus on the discovery of weather forecast websites and we evaluate our technique by discovering weather nodes for southern Finland.
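
A schematic sketch of such a pipeline, generating several domain-specific query variations per city for recall and filtering the collected results with a classifier for precision; the query templates, expansion terms and placeholder search/classifier functions are illustrative assumptions, not the components used in the article:

    def query_variations(city, expansion_terms):
        # Base queries plus expanded variants (templates and terms are assumed).
        base = [f"weather forecast {city}", f"{city} weather"]
        return base + [f"{q} {term}" for q in base for term in expansion_terms]

    def discover(cities, expansion_terms, search, is_forecast_page):
        candidates = set()
        for city in cities:
            for query in query_variations(city, expansion_terms):
                candidates.update(search(query))               # high-recall collection step
        return sorted(url for url in candidates if is_forecast_page(url))  # precision filter

    # Example wiring with placeholder functions (purely illustrative):
    results = discover(["Helsinki"], ["temperature"],
                       search=lambda q: [f"https://site-for/{q.replace(' ', '-')}"],
                       is_forecast_page=lambda url: "forecast" in url)
    print(results)
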
Article
LIBSVM is a library for support vector machines (SVMs). Its goal is to help users easily use SVM as a tool. In this document, we present all its implementation details. For the use of LIBSVM, the README file included in the package and the LIBSVM FAQ provide the information.
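
A minimal usage example: scikit-learn's SVC is implemented on top of LIBSVM, so it offers one convenient way to train a LIBSVM-style RBF-kernel model from Python (the data below are toy values for illustration only):

    from sklearn.svm import SVC

    X_train = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.2], [0.9, 0.8]]
    y_train = [0, 1, 0, 1]

    model = SVC(C=1.0, kernel="rbf", gamma="scale")   # typical LIBSVM-style parameters
    model.fit(X_train, y_train)
    print(model.predict([[0.8, 0.9]]))                # expected: [1]
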
Article
Automatic categorization is a viable method to deal with the scaling problem on the World Wide Web. For Web site classification, this paper proposes using the Web pages linked to the home page, in contrast to previous research that used the home page alone. To implement the proposed method, we derive a scheme for Web site classification based on the k-nearest neighbor (k-NN) approach. It consists of three phases: Web page selection (connectivity analysis), Web page classification, and Web site classification. Given a Web site, the Web page selection step chooses several representative Web pages using connectivity analysis. The k-NN classifier then classifies each of the selected Web pages. Finally, the classified Web pages are extended to a classification of the entire Web site. To improve performance, we supplement the k-NN approach with a feature selection method and a term weighting scheme using markup tags, and also refine its document-document similarity measure. In our experiments on a Korean commercial Web directory, the proposed system, using both a home page and its linked pages, improved the micro-averaged breakeven point by 30.02% compared with an ordinary classification that uses the home page only.
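
A compact sketch of the three-phase idea, classifying a few representative pages of a site with k-NN and extending the page labels to a site label by majority vote; the training pages, the chosen site pages and k are toy assumptions (the cited work selects pages via connectivity analysis and adds feature selection and tag-based term weighting):

    from collections import Counter
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.neighbors import KNeighborsClassifier

    train_pages = ["weather forecast rain temperature", "forecast wind snow humidity",
                   "shopping cart checkout price", "discount sale buy online"]
    train_labels = ["weather", "weather", "shopping", "shopping"]

    vec = TfidfVectorizer()
    knn = KNeighborsClassifier(n_neighbors=3).fit(vec.fit_transform(train_pages), train_labels)

    # Representative pages of one site (in the cited work, chosen by connectivity analysis).
    site_pages = ["10 day forecast and radar", "temperature and wind map", "contact us page"]
    page_labels = knn.predict(vec.transform(site_pages))
    site_label = Counter(page_labels).most_common(1)[0][0]   # majority vote over pages
    print(page_labels, "->", site_label)
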
Article
It has become increasingly difficult to locate relevant information on the Web, even with the help of Web search engines. Two approaches to addressing the low precision and poor presentation of search results of current search tools are studied: meta-search and document categorization. Meta-search engines improve precision by selecting and integrating search results from generic or domain-specific Web search engines or other resources. Document categorization promises better organization and presentation of retrieved results. This article introduces MetaSpider, a meta-search engine that has real-time indexing and categorizing functions. We report in this paper the major components of MetaSpider and discuss related technical approaches. Initial results of a user evaluation study comparing MetaSpider, NorthernLight, and MetaCrawler in terms of clustering performance and of time and effort expended show that MetaSpider performed best in precision rate, but disclose no statistically significant differences in recall rate and time requirements. Our experimental study also reveals that MetaSpider exhibited a higher level of automation than the other two systems and facilitated efficient searching by providing the user with an organized, comprehensive view of the retrieved documents.
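
As an illustration of the meta-search step, ranked result lists from several engines can be interleaved and de-duplicated before indexing and categorization; the engine lists and the round-robin merging policy below are assumptions, not MetaSpider's actual algorithm:

    from itertools import zip_longest

    def merge_results(*ranked_lists):
        # Round-robin over the engines, dropping duplicate URLs.
        merged, seen = [], set()
        for tier in zip_longest(*ranked_lists):
            for url in tier:
                if url is not None and url not in seen:
                    seen.add(url)
                    merged.append(url)
        return merged

    engine_a = ["http://a.example/1", "http://b.example/2", "http://c.example/3"]
    engine_b = ["http://b.example/2", "http://d.example/4"]
    print(merge_results(engine_a, engine_b))
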
An Ontological Framework for Decision Support
  • M Rospocher
  • L Serafini