Conference Paper

Discovery of Environmental Web Resources Based on the Combination of Multimedia Evidence


Abstract

This work proposes a framework for the discovery of environmental Web resources providing air quality measurements and forecasts. Motivated by the frequent occurrence of heatmaps in such Web resources, it exploits multimedia evidence at different stages of the discovery process. Domain-specific queries, generated using empirical information and machine-learning-driven query expansion, are submitted both to the Web and Image search services of a general-purpose search engine. Post-retrieval filtering is performed by combining textual and visual (heatmap-related) evidence in a supervised machine learning framework. Our experimental results indicate improvements in effectiveness when heatmap recognition is performed using SURF and SIFT descriptors with VLAD encoding, and when multimedia evidence is combined in the discovery process.
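For illustration, the sketch below mirrors the pipeline the abstract describes: domain-specific queries are generated, submitted to both Web and Image search, and the retrieved candidates are filtered by combining textual and visual (heatmap) classifier scores. The search wrappers, classifiers, fusion weights and threshold are placeholders, not the paper's actual components.

```python
# Illustrative sketch only; web_search/image_search and the two classifiers
# are assumed to be provided (e.g. a search-API wrapper and pre-trained models).

def generate_queries(seed_terms, expansion_terms):
    """Combine empirical seed terms with machine-learned expansion terms."""
    queries = list(seed_terms)
    for seed in seed_terms:
        for exp in expansion_terms:
            queries.append(f"{seed} {exp}")
    return queries

def discover_resources(seed_terms, expansion_terms,
                       web_search, image_search,
                       text_classifier, heatmap_classifier):
    candidates = set()
    for q in generate_queries(seed_terms, expansion_terms):
        candidates.update(web_search(q))      # URLs from Web search
        candidates.update(image_search(q))    # pages hosting matching images
    accepted = []
    for url in candidates:
        text_score = text_classifier(url)       # supervised, textual evidence
        visual_score = heatmap_classifier(url)  # supervised, heatmap evidence
        if 0.5 * text_score + 0.5 * visual_score > 0.5:  # simple late fusion
            accepted.append(url)
    return accepted
```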


... Visual artwork is also crawled from public databases by deploying advanced retrieval techniques [19] [12] that collect paintings and images, as well as related textual content such as critiques, catalogues and museum guides, retrieved from social media, movie-related free archives and other open sources. Crawled data for the Delphi showcase are depicted in Fig. 2. ...
Conference Paper
Full-text available
are in great need of acquiring, re-using and re-purposing visual and textual data to recreate, renovate or produce a novel target space, building or element. This is in line with the recently observed abrupt increase in the use of immersive VR environments and with the great technological advances in the acquisition and manipulation of digital models. V4Design is a new project that takes these needs into account and intends to build a complete and robust design solution that will help the targeted industries improve their creation process.
Article
Full-text available
In this study we apply two methods for data collection that are relatively new in the field of atmospheric science. The two developed methods are designed to collect essential geo-localized information to be used as input data for a high resolution emission inventory for residential wood combustion (RWC). The first method is a webcrawler that extracts openly available online real estate data in a systematic way, and thereafter structures them for analysis. The webcrawler reads online Norwegian real estate advertisements and collects the geo-position of the dwellings. Dwellings are classified according to the type they belong to (e.g., apartment, detached house) and the heating systems they are equipped with. The second method is a model trained for image recognition and classification based on machine learning techniques. The images from the real estate advertisements are collected and processed to identify wood burning installations, which are automatically classified according to the three classes used in official statistics, i.e., open fireplaces, stoves produced before 1998 and stoves produced after 1998. The model recognizes and classifies the wood appliances with a precision of 81%, 85% and 91% for open fireplaces, old stoves and new stoves, respectively. Emission factors are heavily dependent on technology and this information is therefore essential for determining accurate emissions. The collected data are compared with existing information from the statistical register at county and national level in Norway. The comparison shows good agreement for the proportion of residential heating systems between the webcrawled data and the official statistics. The high resolution and level of detail of the extracted data show the value of open data to improve emission inventories. With the increased amount and availability of data, the techniques presented here add significant value to emission accuracy and potential applications should also be considered across all emission sectors.
Conference Paper
Full-text available
It is common practice to disseminate Chemical Weather (air quality and meteorology) forecasts to the general public, via the internet, in the form of pre-processed images which differ in format, quality and presentation, without other forms of access to the original data. As the number of on-line available Chemical Weather (CW) forecasts is increasing, there are many geographical areas that are covered by different models, and their data could not be combined, compared, or used in any synergetic way by the end user, due to the aforementioned heterogeneity. This paper describes a series of methods for extracting and reconstructing data from heterogeneous air quality forecast images coming from different data providers, to allow for their unified harvesting, processing, transformation, storage and presentation in the Chemical Weather portal.
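As a rough illustration of the image-to-data reconstruction idea, the sketch below recovers a value grid from a heatmap by nearest-colour lookup against its legend; the legend palette is hypothetical, and real forecast images additionally require legend detection and georeferencing.

```python
import numpy as np
from PIL import Image

LEGEND = {  # RGB -> concentration (ug/m3); hypothetical palette for this sketch
    (0, 128, 0): 10.0,
    (255, 255, 0): 25.0,
    (255, 128, 0): 50.0,
    (255, 0, 0): 100.0,
}

def reconstruct_values(image_path):
    pixels = np.asarray(Image.open(image_path).convert("RGB"), dtype=float)
    colours = np.array(list(LEGEND.keys()), dtype=float)   # (K, 3)
    values = np.array(list(LEGEND.values()))                # (K,)
    # Distance of every pixel to every legend colour; pick the closest one.
    dists = np.linalg.norm(pixels[:, :, None, :] - colours[None, None, :, :], axis=-1)
    return values[dists.argmin(axis=-1)]                    # (H, W) grid of values
```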
Article
Full-text available
The EU legislative framework related to air quality, together with national legislation and relevant Declarations of the United Nations (UN) requires an integrated approach concerning air quality management (AQM), and accessibility of related information for the citizens. In the present paper, the main requirements of this legislative framework are discussed and main air quality management and information system characteristics are drawn. The use of information technologies is recommended for the construction of such systems. The WWW is considered a suitable platform for system development and integration and at the same time as a medium for communication and information dissemination.
Conference Paper
Full-text available
In this paper, we present a novel scale- and rotation-invariant interest point detector and descriptor, coined SURF (Speeded Up Robust Features). It approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (in casu, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper presents experimental results on a standard evaluation set, as well as on imagery obtained in the context of a real-life object recognition application. Both show SURF’s strong performance.
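A minimal usage sketch: SURF is distributed in OpenCV's contrib module (cv2.xfeatures2d) and is only present in builds compiled with the non-free algorithms enabled, so availability depends on the installed package.

```python
import cv2

img = cv2.imread("example.png", cv2.IMREAD_GRAYSCALE)
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)   # Hessian-based detector
keypoints, descriptors = surf.detectAndCompute(img, None)  # descriptors: N x 64
print(len(keypoints), None if descriptors is None else descriptors.shape)
```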
Conference Paper
Full-text available
We address the problem of image search on a very large scale, where three constraints have to be considered jointly: the accuracy of the search, its efficiency, and the memory usage of the representation. We first propose a simple yet efficient way of aggregating local image descriptors into a vector of limited dimension, which can be viewed as a simplification of the Fisher kernel representation. We then show how to jointly optimize the dimension reduction and the indexing algorithm, so that it best preserves the quality of vector comparison. The evaluation shows that our approach significantly outperforms the state of the art: the search accuracy is comparable to the bag-of-features approach for an image representation that fits in 20 bytes. Searching a 10 million image dataset takes about 50ms.
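The following is a minimal VLAD encoding sketch (not the paper's optimized implementation): local descriptors are assigned to a small k-means codebook, residuals to the assigned centres are summed, and the result is power- and L2-normalised.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_codebook(all_descriptors, k=64):
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_descriptors)

def vlad_encode(descriptors, codebook):
    centers = codebook.cluster_centers_                    # (k, d)
    assignments = codebook.predict(descriptors)            # nearest centre per descriptor
    k, d = centers.shape
    vlad = np.zeros((k, d))
    for i in range(k):
        assigned = descriptors[assignments == i]
        if len(assigned):
            vlad[i] = (assigned - centers[i]).sum(axis=0)  # residual aggregation
    vlad = vlad.ravel()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))           # power normalisation
    return vlad / (np.linalg.norm(vlad) + 1e-12)           # L2 normalisation
```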
Conference Paper
Full-text available
Ontology learning has become a major area of research whose goal is to facilitate the construction of ontologies by decreasing the amount of effort required to produce an ontology for a new domain. However, there are few studies that attempt to automate the entire ontology learning process from the collection of domain-specific literature, to text mining to build new ontologies or enrich existing ones. In this paper, we present a framework of ontology learning that enables us to retrieve documents from the Web using focused crawling in a biological domain, amphibian morphology. We use a SVM (support vector machine) classifier to identify domain-specific documents and perform text mining in order to extract useful information for the ontology enrichment process. This paper reports on the overall system architecture and our initial experiments on the focused crawler and document classification.
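For the document-classification step, a linear SVM over TF-IDF features is a common realisation; the sketch below uses scikit-learn with placeholder training texts and labels.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

texts = ["amphibian limb morphology and skeletal development",   # placeholder data
         "quarterly stock market report and earnings outlook"]
labels = [1, 0]                                # 1 = in-domain, 0 = out-of-domain

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["frog morphology ontology"]))
```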
Conference Paper
Full-text available
The explosion of image data on the Internet has the potential to foster more sophisticated and robust models and algorithms to index, retrieve, organize and interact with images and multimedia data. But exactly how such data can be harnessed and organized remains a critical problem. We introduce here a new database called "ImageNet", a large-scale ontology of images built upon the backbone of the WordNet structure. ImageNet aims to populate the majority of the 80,000 synsets of WordNet with an average of 500-1000 clean and full resolution images. This will result in tens of millions of annotated images organized by the semantic hierarchy of WordNet. This paper offers a detailed analysis of ImageNet in its current state: 12 subtrees with 5247 synsets and 3.2 million images in total. We show that ImageNet is much larger in scale and diversity and much more accurate than the current image datasets. Constructing such a large-scale database is a challenging task. We describe the data collection scheme with Amazon Mechanical Turk. Lastly, we illustrate the usefulness of ImageNet through three simple applications in object recognition, image classification and automatic object clustering. We hope that the scale, accuracy, diversity and hierarchical structure of ImageNet can offer unparalleled opportunities to researchers in the computer vision community and beyond.
Conference Paper
Full-text available
Subject-specific search facilities on health sites are usually built using manual inclusion and exclusion rules. These can be expensive to maintain and often provide incomplete coverage of Web resources. On the other hand, health information obtained through whole-of-Web search may not be scientifically based and can be potentially harmful. To address problems of cost, coverage and quality, we built a focused crawler for the mental health topic of depression, which was able to selectively fetch higher quality relevant information. We found that the relevance of unfetched pages can be predicted based on link anchor context, but the quality cannot. We therefore estimated quality of the entire linking page, using a learned IR-style query of weighted single words and word pairs, and used this to predict the quality of its links. The overall crawler priority was determined by the product of link relevance and source quality. We evaluated our crawler against baseline crawls using both relevance judgments and objective site quality scores obtained using an evidence-based rating scale. Both a relevance focused crawler and the quality focused crawler retrieved twice as many relevant pages as a breadth-first control. The quality focused crawler was quite effective in reducing the amount of low quality material fetched while crawling more high quality content, relative to the relevance focused crawler. Analysis suggests that quality of content might be improved by post-filtering a very big breadth-first crawl, at the cost of substantially increased network traffic.
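A schematic frontier ordering in the spirit of the crawler described above, where a link's priority is the product of its estimated relevance (from anchor context) and the quality of the linking page; all scoring functions are placeholders.

```python
import heapq

def crawl(seeds, fetch, extract_links, link_relevance, page_quality, budget=1000):
    frontier = [(-1.0, url) for url in seeds]      # max-heap via negated priority
    heapq.heapify(frontier)
    seen, fetched = set(seeds), []
    while frontier and len(fetched) < budget:
        _, url = heapq.heappop(frontier)
        page = fetch(url)
        fetched.append(url)
        quality = page_quality(page)               # quality of the linking page
        for link, anchor_context in extract_links(page):
            if link not in seen:
                seen.add(link)
                priority = link_relevance(anchor_context) * quality
                heapq.heappush(frontier, (-priority, link))
    return fetched
```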
Article
Full-text available
Domain-specific Web search engines are effective tools for reducing the difficulty experienced when acquiring information from the Web. Existing methods for building domain-specific Web search engines require human expertise or specific facilities. However, we can build a domain-specific search engine simply by adding domain-specific keywords, called "keyword spices," to the user's input query and forwarding it to a general-purpose Web search engine. Keyword spices can be effectively discovered from Web documents using machine learning technologies. The paper describes domain-specific Web search engines that use keyword spices for locating recipes, restaurants, and used cars.
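A tiny sketch of the keyword-spice idea: the user query is forwarded to a general-purpose engine with a learned Boolean expression of domain terms appended; the spice expression shown is a made-up example, not one learned by the method.

```python
def spice_query(user_query, spices):
    # A keyword spice is a Boolean expression of domain-specific terms.
    return f"{user_query} AND ({spices})"

print(spice_query("ozone Helsinki",
                  '"air quality" OR forecast OR "hourly concentrations"'))
```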
Conference Paper
We trained a large, deep convolutional neural network to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. On the test data, we achieved top-1 and top-5 error rates of 37.5% and 17.0%, which is considerably better than the previous state-of-the-art. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three fully-connected layers with a final 1000-way softmax. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of the convolution operation. To reduce overfitting in the fully-connected layers we employed a recently-developed regularization method called dropout that proved to be very effective. We also entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
Article
This paper presents a method for extracting distinctive invariant features from images that can be used to perform reliable matching between different views of an object or scene. The features are invariant to image scale and rotation, and are shown to provide robust matching across a substantial range of affine distortion, change in 3D viewpoint, addition of noise, and change in illumination. The features are highly distinctive, in the sense that a single feature can be correctly matched with high probability against a large database of features from many images. This paper also describes an approach to using these features for object recognition. The recognition proceeds by matching individual features to a database of features from known objects using a fast nearest-neighbor algorithm, followed by a Hough transform to identify clusters belonging to a single object, and finally performing verification through least-squares solution for consistent pose parameters. This approach to recognition can robustly identify objects among clutter and occlusion while achieving near real-time performance.
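A short matching sketch with OpenCV's SIFT implementation and Lowe's ratio test (cv2.SIFT_create is available in recent OpenCV releases); the image paths are placeholders.

```python
import cv2

img1 = cv2.imread("query.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("scene.png", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
good = [m for m, n in matcher.knnMatch(des1, des2, k=2)
        if m.distance < 0.75 * n.distance]         # Lowe's ratio test
print(f"{len(good)} putative matches")
```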
Chapter
This chapter takes a look at the existing search landscape to observe that there are a myriad of search systems, covering all possible domains, which share a handful of theoretical retrieval models and are separated by a series of pre- and post-processing steps as well as a finite set of parameters. We also observe that given the infinite variety of real-world search tasks and domains, it is unlikely that any one combination of these steps and parameters would yield a renaissance-engine — a search engine that can answer any questions about art, sciences, law or entertainment. We therefore set forth to analyze the different components of a search system and propose a model for domain specific search, including a definition thereof, as well as a technical framework to build a domain specific search system.
Article
Domain-specific internet portals are growing in popularity because they gather content from the Web and organize it for easy access, retrieval and search. For example, www.campsearch.com allows complex queries by age, location, cost and specialty over summer camps. This functionality is not possible with general, Web-wide search engines. Unfortunately these portals are difficult and time-consuming to maintain. This paper advocates the use of machine learning techniques to greatly automate the creation and maintenance of domain-specific Internet portals. We describe new research in reinforcement learning, information extraction and text classification that enables efficient spidering, the identification of informative text segments, and the population of topic hierarchies. Using these techniques, we have built a demonstration system: a portal for computer science research papers. It already contains over 50,000 papers and is publicly available at www.cora.justresearch.com. These techniques are widely applicable to portal creation in other domains.
Article
LIBSVM is a library for Support Vector Machines (SVMs). We have been actively developing this package since the year 2000. The goal is to help users to easily apply SVM to their applications. LIBSVM has gained wide popularity in machine learning and many other areas. In this article, we present all implementation details of LIBSVM. Issues such as solving SVM optimization problems, theoretical convergence, multiclass classification, probability estimates and parameter selection are discussed in detail.
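scikit-learn's SVC is built on top of LIBSVM, so a typical use of the library looks like the sketch below: RBF-kernel training with probability estimates and a small grid search over C and gamma.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
grid = GridSearchCV(SVC(kernel="rbf", probability=True),
                    {"C": [1, 10, 100], "gamma": ["scale", 0.01, 0.001]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```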
Conference Paper
In this work we deal with the problem of how different local descriptors can be extended, used and combined for improving the effectiveness of video concept detection. The main contributions of this work are: 1) We examine how effectively a binary local descriptor, namely ORB, which was originally proposed for similarity matching between local image patches, can be used in the task of video concept detection. 2) Based on a previously proposed paradigm for introducing color extensions of SIFT, we define in the same way color extensions for two other non-binary or binary local descriptors (SURF, ORB), and we experimentally show that this is a generally applicable paradigm. 3) In order to enable the efficient use and combination of these color extensions within a state-of-the-art concept detection methodology (VLAD), we study and compare two possible approaches for reducing the color descriptor’s dimensionality using PCA. We evaluate the proposed techniques on the dataset of the 2013 Semantic Indexing Task of TRECVID.
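The descriptor dimensionality-reduction step mentioned above can be sketched as follows: PCA is fitted on a sample of (colour-extended) local descriptors and applied before VLAD aggregation; the descriptor dimensionality, target dimension and sample are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
descriptor_sample = rng.normal(size=(10000, 192))   # stand-in for colour SURF/ORB descriptors
pca = PCA(n_components=80, whiten=True).fit(descriptor_sample)

def reduce_descriptors(descriptors):
    return pca.transform(descriptors)               # reduced descriptors, ready for VLAD
```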
Conference Paper
Environmental data are considered of utmost importance for human life, since weather conditions, air quality and pollen are strongly related to health issues and affect everyday activities. This paper addresses the problem of discovery of air quality and pollen forecast Web resources, which are usually presented in the form of heatmaps (i.e. graphical representation of matrix data with colors). Towards the solution of this problem, we propose a discovery methodology, which builds upon a general purpose search engine and a novel post processing heatmap recognition layer. The first step involves generation of domain-specific queries, which are submitted to the search engine, while the second involves an image classification step based on visual low level features to identify Web sites including heatmaps. Experimental results comparing various visual features combinations show that relevant environmental sites can be efficiently recognized and retrieved.
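A hedged sketch of the image-classification layer: here heatmap recognition is approximated with a simple colour-histogram feature and an SVM, which is only a stand-in for the low-level visual features evaluated in the paper.

```python
import numpy as np
import cv2
from sklearn.svm import SVC

def colour_histogram(path, bins=8):
    img = cv2.imread(path)                              # BGR image
    hist = cv2.calcHist([img], [0, 1, 2], None,
                        [bins] * 3, [0, 256] * 3).ravel()
    return hist / (hist.sum() + 1e-12)                  # normalised 3D colour histogram

def train_heatmap_classifier(image_paths, labels):      # labels: 1 = heatmap, 0 = other
    X = np.vstack([colour_histogram(p) for p in image_paths])
    return SVC(kernel="rbf", probability=True).fit(X, labels)
```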
Conference Paper
Focussed crawlers enable the automatic discovery of Web resources about a given topic by automatically navigating the Web link structure and selecting the hyperlinks to follow by estimating their relevance to the topic based on evidence obtained from the already downloaded pages. This work proposes a classifier-guided focussed crawling approach that estimates the relevance of a hyperlink to an unvisited Web resource based on the combination of textual evidence representing its local context, namely the textual content appearing in its vicinity in the parent page, with visual evidence associated with its global context, namely the presence of images relevant to the topic within the parent page. The proposed focussed crawling approach is applied towards the discovery of environmental Web resources that provide air quality measurements and forecasts, since such measurements (and particularly the forecasts) are not only provided in textual form, but are also commonly encoded as multimedia, mainly in the form of heatmaps. Our evaluation experiments indicate the effectiveness of incorporating visual evidence in the link selection process applied by the focussed crawler over the use of textual features alone, particularly in conjunction with hyperlink exploration strategies that allow for the discovery of highly relevant pages that lie behind apparently irrelevant ones.
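The link-scoring idea can be summarised in a few lines: a textual score for the hyperlink's local context is combined with a visual score for its global context, e.g. the strongest heatmap-classifier response among the parent page's images. The weights are illustrative, not tuned values from the paper.

```python
def score_link(anchor_context_text, parent_image_scores,
               text_scorer, w_text=0.7, w_visual=0.3):
    text_score = text_scorer(anchor_context_text)          # local (textual) context
    visual_score = max(parent_image_scores, default=0.0)   # global (visual) context
    return w_text * text_score + w_visual * visual_score   # crawl priority of the link
```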
Article
The latest generation of Convolutional Neural Networks (CNN) have achieved impressive results in challenging benchmarks on image recognition and object detection, significantly raising the interest of the community in these methods. Nevertheless, it is still unclear how different CNN methods compare with each other and with previous state-of-the-art shallow representations such as the Bag-of-Visual-Words and the Improved Fisher Vector. This paper conducts a rigorous evaluation of these new techniques, exploring different deep architectures and comparing them on a common ground, identifying and disclosing important implementation details. We identify several useful properties of CNN-based representations, including the fact that the dimensionality of the CNN output layer can be reduced significantly without having an adverse effect on performance. We also identify aspects of deep and shallow methods that can be successfully shared. A particularly significant one is data augmentation, which achieves a boost in performance in shallow methods analogous to that observed with CNN-based methods. Finally, we are planning to provide the configurations and code that achieve the state-of-the-art performance on the PASCAL VOC Classification challenge, along with alternative configurations trading-off performance, computation speed and compactness.
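A hedged sketch of using a pre-trained CNN as a feature extractor and then reducing the output dimensionality, one of the practices discussed above; the network choice (ResNet-18 via torchvision >= 0.13), batch and target dimension are illustrative.

```python
import torch
import torchvision.models as models
from sklearn.decomposition import PCA

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.fc = torch.nn.Identity()            # expose the 512-d penultimate features
model.eval()

with torch.no_grad():
    feats = model(torch.randn(100, 3, 224, 224)).numpy()   # placeholder image batch

reduced = PCA(n_components=64).fit_transform(feats)         # compact image representation
```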
Conference Paper
Analysis and processing of environmental information is considered of utmost importance for humanity. This article addresses the problem of discovery of web resources that provide environmental measurements. Towards the solution of this domain-specific search problem, we combine state-of-the-art search techniques together with advanced textual processing and supervised machine learning. Specifically, we generate domain-specific queries using empirical information and machine learning driven query expansion in order to enhance the initial queries with domain-specific terms. Multiple variations of these queries are submitted to a general-purpose web search engine in order to achieve a high recall performance and we employ a post processing module based on supervised machine learning to improve the precision of the final results. In this work, we focus on the discovery of weather forecast websites and we evaluate our technique by discovering weather nodes for south Finland.
Conference Paper
Raster map images (e.g., USGS) provide much information in digital form; however, the color assignments and pixel labels leave many serious ambiguities. A color histogram classification scheme is described, followed by the application of a tensor voting method to classify linear features in the map as well as intersections in linear feature networks. The major result is an excellent segmentation of roads, and road intersections are detected with about 93% recall and 66% precision.
Conference Paper
In this paper, we propose two kinds of semi-automatic training-example generation algorithms for rapidly synthesizing a domain-specific Web search engine. We use the keyword spice model, as a basic framework, which is an excellent approach for building a domain-specific search engine with high precision and high recall. The keyword spice model, however, requires a huge amount of training examples which should be classified by hand. For overcoming this problem, we propose two kinds of refinement algorithms based on semi-automatic training-example generation: (i) the sample decision tree based approach, and (ii) the similarity based approach. These approaches make it possible to build a highly accurate domain-specific search engine with a little time and effort. The experimental results show that our approaches are very effective and practical for the personalization of a general-purpose search engine
Conference Paper
The TREC-3 project at Virginia Tech focused on methods for combining the evidence from multiple retrieval runs and queries to improve retrieval performance over any single retrieval method or query. The largest improvements result from the combination of retrieval paradigms rather than from the use of multiple similar queries.
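A generic score-fusion sketch in the spirit of combining evidence from multiple runs (CombSUM-style, with min-max normalisation); this is an illustration, not the exact TREC-3 combination scheme.

```python
from collections import defaultdict

def fuse_runs(runs):
    """runs: list of dicts mapping doc_id -> retrieval score."""
    fused = defaultdict(float)
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        span = (hi - lo) or 1.0
        for doc, score in run.items():
            fused[doc] += (score - lo) / span        # min-max normalised contribution
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)
```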
Conference Paper
The separation of overlapping text and graphics is a challenging problem in document image analysis. This paper proposes a specific method of detecting and extracting characters that are touching graphics. It is based on the observation that the constituent strokes of characters are usually short segments in comparison with those of graphics. It combines line continuation with the feature line width to decompose and reconstruct segments underlying the region of intersection. Experimental results showed that the proposed method improved the percentage of correctly detected text as well as the accuracy of character recognition significantly.
Article
MPEG-7, formally known as the Multimedia Content Description Interface, includes standardized tools (descriptors, description schemes, and language) enabling structural, detailed descriptions of audio-visual information at different granularity levels (region, image, video segment, collection) and in different areas (content description, management, organization, navigation, and user interaction). It aims to support and facilitate a wide range of applications, such as media portals, content broadcasting, and ubiquitous multimedia. We present a high-level overview of the MPEG-7 standard. We first discuss the scope, basic terminology, and potential applications. Next, we discuss the constituent components. Then, we compare the relationship with other standards to highlight its capabilities
Yuan, J., et al.: THU and ICRC at TRECVID.