A model for environmental data extraction from multimedia and its evaluation
against various chemical weather forecasting datasets
Anastasia Moumtzidou a,⁎, Victor Epitropou b, Stefanos Vrochidis a, Kostas Karatzas b, Sascha Voth c, Anastasios Bassoukos b, Jürgen Moßgraber c, Ari Karppinen d, Jaakko Kukkonen d, Ioannis Kompatsiaris a
a Information Technologies Institute, Centre for Research and Technology Hellas, Greece
b Informatics Systems and Applications Group, Aristotle University of Thessaloniki, Greece
c Fraunhofer Institute of Optronics, System Technologies and Image Exploitation, Germany
d Finnish Meteorological Institute, Helsinki, Finland
Article info
Article history:
Received 31 January 2013
Received in revised form 12 July 2013
Accepted 20 August 2013
Available online xxxx
Keywords:
Air quality
Heatmap
Image processing
OCR
Environmental
Multimedia
Abstract
Environmental data analysis and information provision are considered of great importance for people, since environmental conditions are strongly related to health issues and directly affect a variety of everyday activities. Nowadays, there are several free web-based services that provide environmental information in several formats, with map images being the most commonly used to present air quality and pollen forecasts. This format, despite being intuitive for humans, complicates the extraction and processing of the underlying data. Typical examples of this case are the chemical weather forecasts, which are usually encoded as heatmaps (i.e. graphical representations of matrix data with colors), while the forecasted numerical pollutant concentrations are commonly unavailable. This work presents a model for the semi-automatic extraction of such information based on a template configuration tool, on methodologies for data reconstruction from images, as well as on text processing and Optical Character Recognition (OCR). The aforementioned modules are integrated in a standalone framework, which is extensively evaluated by comparing data extracted from a variety of chemical weather heatmaps against the real numerical values produced by chemical weather forecasting models. The results demonstrate a satisfactory performance in terms of data recovery and positional accuracy.
© 2013 Elsevier B.V. All rights reserved.
1. Introduction
The analysis of environmental data and the generation, combination
and reuse of related information, such as air pollutant concentrations, is
of particular interest for people. Environmental status information (in
particular, the concentration of certain pollutants in the air) is consid-
ered to be correlated with a series of health issues, such as cardiovascu-
lar and respiratory diseases, it directly affects several outdoor activities
(e.g. commuting, sports, trip planning, agriculture) and therefore it is
strongly related to the overall quality of life. In addition, the analysis
of environmental information is often a prerequisite for the fulfillment
of legal mandates on the management and preservation of environmen-
tal quality, according to the EU's and other legal frameworks (Karatzas
and Moussiopoulos, 2000). With a view to offering personalized deci-
sion support services for people based on environmental information
regarding their everyday activities (Wanner et al., 2012) and supporting
environmental experts in air quality preservation tasks, there is a need
to extract, combine and compare complementary and competing envi-
ronmental information from several resources in order to generate
more reliable and cross-validated information on the environmental
conditions. One of the main steps towards this goal is the environmental
information extraction from heterogeneous resources.
Environmental observations are automatically performed by spe-
cialized instruments, hosted in stations established by environmental
organizations, while the forecasts, which are used to foretell weather
conditions, the levels of pollution or pollen concentration in areas of
interest, are provided by environmental prediction models, the output
of which are gridded numerical data, henceforth referred to as "actual" or "original" data. In practice only a few of the data providers make
available to the public some means of access to their actual (numerical)
forecast data, while the majority publishes the results in the form of
preprocessed images that address specific environmental pressures (like air pollution concentrations), for specific temporal scales (usually in the order of hours or days), and for specific geographical areas of interest. However, even if the original data values of environmental information had been available, these would commonly be presented in various technical formats, using various coordinates and spatial resolutions, different units, and several other choices (e.g., Kukkonen et al.,
Ecological Informatics xxx (2013) xxx–xxx
⁎ Corresponding author at: Centre for Research and Technology Hellas, Information
Technologies Institute, 6th km Charilaou-Thermi Road, P.O. Box 60361, 57001 Thermi,
Thessaloniki, Greece. Tel.: +30 2311257746.
E-mail addresses: moumtzid@iti.gr (A. Moumtzidou), vepitrop@isag.meng.auth.gr
(V. Epitropou), stefanos@iti.gr (S. Vrochidis), kkara@eng.auth.gr (K. Karatzas),
sascha.voth@iosb.fraunhofer.de (S. Voth), abas@isag.meng.auth.gr (A. Bassoukos),
juergen.mossgraber@iosb.fraunhofer.de (J. Moßgraber), ari.karppinen@fmi.fi (A. Karppinen), jaakko.kukkonen@fmi.fi (J. Kukkonen), ikom@iti.gr (I. Kompatsiaris).
ECOINF-00416; No of Pages 14
1574-9541/$ see front matter © 2013 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.ecoinf.2013.08.003
Contents lists available at ScienceDirect
Ecological Informatics
journal homepage: www.elsevier.com/locate/ecolinf
Please cite this article as: Moumtzidou, A., et al., A model for environmental data extraction from multimedia and its evaluation against various
chemical weather forecasting datasets, Ecological Informatics (2013), http://dx.doi.org/10.1016/j.ecoinf.2013.08.003
2012). It can therefore be a laborious task to convert these data files to
the same harmonized format, for inter-comparison purposes. Conse-
quently, the main sources of environmental information for everyday
use are web portals and sites, which provide a variety of information
of diverse spatial and temporal nature. Although the weather forecasts
are usually presented in textual format (Moumtzidou et al., 2012b),
important environmental information such as the air quality and pollen
forecasts is encoded in multimedia formats (Karatzas, 2005). Specifically, the vast majority of such environmental data are published as static
heatmaps (i.e. graphical representation of matrix data with colors), or
as sequences of heatmaps (time-lapse animations). A characteristic
example of a heatmap is presented in Fig. 1 (generated by the SILAM
model, courtesy of FMI). However, since this information comes from
different providers and is presented in a variety of visual forms that are neither intercomparable nor compatible, it is not possible to directly combine them and compile a synthetic service that takes into account all available data sources. In order to deal with this problem, it is necessary to design and develop a model that is capable of extracting environmental information from heatmaps and translating it into a structured numerical format. The processing of images for their conversion into numerical data
would comprise the core of environmental data recovery techniques,
at least in the air pollution and the pollen concentration domains.
In this context, this paper addresses the extraction of air quality and
pollen forecasts from heatmaps, by proposing a semi-automatic frame-
work, which consists of three main components: an annotation tool
for administrative user intervention used for generating configuration
templates for each heatmap, an Optical Character Recognition (OCR)
and text processing module used for fetching text information em-
bedded in the image and making the necessary corrections, as well
as the AirMerge heatmap processing module (Epitropou et al., 2011)
that allows for the automatic harvesting, annotation, harmonization
and reconversion of heatmaps into numerical data. The framework is
evaluated against the AirMerge system and various chemical weather
forecast datasets. It should be highlighted that the AirMerge system, per se, does not include an automated annotation process; therefore, any heatmap harvesting and parsing procedure must be manually
scripted, even though the programmatic generation of certain types of highly repetitive scripts, e.g. to handle series of images from the same provider, is possible. In contrast, the proposed framework aims to automate this scripting process by generating the configuration scripts required by AirMerge on a per-case basis via optical heatmap
analysis, the use of graphical templates and machine automation. The
outputs of the generated scripts are then compared to those obtained by using the best manually configured AirMerge scripts for a given heatmap template, and the differences in their setup and final data extraction results are discussed.
The contribution of this work is a novel framework that integrates
multimedia annotation and processing modules, in order to allow for
the semi-automatic extraction of air quality and/or pollen forecast
data presented in heatmaps. Specifically, this framework integrates multimedia configuration components (annotation tool), advanced systems for heatmap image processing (AirMerge) and optimized OCR
techniques. These modules are integrated in a standalone, user-based
interface that allows for template-based customization of heatmaps
and thus assists in handling several formats of heatmaps. This paper
substantially extends the works presented in Moumtzidou et al. (2012a) and Vrochidis et al. (2012), which have demonstrated the initial results of this framework, by providing an extensive evaluation, which includes a comparative study of the proposed framework against the manually configured AirMerge system and real numerical data provided by forecast models for a variety of providers.
This paper is structured as follows: Section 2 presents the previous
research on heatmap analysis and content extraction, Section 3 describes the results of studies on the presentation format of environmental
Fig. 1. An example of an air quality heatmap: the forecast of NO2 concentrations (μg/m3) at 8 UTC time of 6 December 2012, using the SILAM chemical transport model.
information, as well as the structure of a typical heatmap. Section 4 presents the problem and its requirements, while Section 5 describes the overall architecture, the involved modules (i.e. annotation tool, text extraction and processing module and heatmap processing module) and a short comparison of the proposed system and AirMerge. The evaluation results are presented in Section 6 and, finally, Section 7 concludes the paper.
2. Previous research
The task of map analysis strongly depends on the map type and
the information we need to extract. Depending on the application, a
straightforward requirement would be to perform semantic image segmentation (e.g. rivers, forests, etc.), while in the case of heatmaps it is to transform color into numerical data. In general, the discriminating factors between map types are reflected in their scale, colorization, quality, accuracy, topology and many other aspects. In the case of air quality
(or chemical weather) maps there are mainly two types of information
covered by the map data:
• Geographical information: points and lines describing country frontiers or other well-known points of interest or structures (e.g. sea, land, capitals) in a given coordinate system. These features can often be used as cues for manually or automatically identifying the geographical registration of a map.
• Feature information: forecasted or measured data of any kind (e.g. average temperature or pollutant concentration), which are coded via a color scale representing the measured values. Single values are referenced geographically by a color value at the corresponding geographical point.
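The color-based value referencing described above can be sketched as a nearest-color lookup against the legend. The legend colors and value ranges below are illustrative assumptions, not those of any actual provider.

```python
# Sketch: classify a heatmap pixel by finding the nearest legend color.
# The legend entries (colors and ug/m3 ranges) are illustrative assumptions.
LEGEND = [
    ((0, 0, 255), (0.0, 10.0)),     # blue   -> 0-10 ug/m3
    ((0, 255, 0), (10.0, 25.0)),    # green  -> 10-25 ug/m3
    ((255, 255, 0), (25.0, 50.0)),  # yellow -> 25-50 ug/m3
    ((255, 0, 0), (50.0, 100.0)),   # red    -> 50-100 ug/m3
]

def classify_pixel(rgb):
    """Return the value range of the legend color nearest to `rgb`
    (squared Euclidean distance in RGB space)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    _, value_range = min(LEGEND, key=lambda entry: dist2(entry[0], rgb))
    return value_range

print(classify_pixel((250, 10, 5)))  # -> (50.0, 100.0), a reddish pixel
```

Note that quantization is inherent here: the lookup recovers only the value range of the color class, never the exact original concentration.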
Chemical weather maps often use raster map images to represent
measured or forecasted data. There are several approaches to extract
and digitalize this image information automatically. Musavi et al. (1988)
describe the process of the vectorization of digital image data. Here, the geographical information, in the form of lines, is extracted and converted into storable digital vector data.
In another work (Desai et al., 2005), the authors describe an approach to efficiently identify street maps among several other images by applying image processing techniques to identify unique patterns, such as street lines, which differentiate them from all other images. For the identification of street maps, Laws' texture classification algorithm is applied in order to recognize unique image patterns such as street lines and street labels. Finally, the authors use GEOPPM,
an algorithm for automatically determining the geocoordinates and
scale of the maps. In another similar work (Henderson and Linton,
2009), the authors use specific knowledge of the known colorization in USGS maps to automatically segment these maps based on their
semantic contents (e.g. roads, rivers). Chiang and Knoblock (2006)
propose an algorithm using 2-D Discrete Cosine Transformation (DCT)
coefficients and Support Vector Machines (SVM) to classify the pixels
of lines and characters on raster maps.
In Michelson et al. (2008), the authors present an automatic ap-
proach to mine collections of maps from the web. This method harvests
images from the web and then classies them as maps or non-maps by
comparing them to previously classied map and non-map images
using methods from Content-Based Image Retrieval (CBIR). Specically,
a voting k-Nearest Neighbor classifier is used, as it allows exploiting image similarities without explicitly modeling them, in contrast to other traditional machine learning techniques such as Support Vector Machines.
Finally, Cao and Tan (2002) improve the segmentation quality of text and graphics in color map images, in order to enhance the results of subsequent analysis processes (e.g. OCR): black or dark pixels are selected from the color maps and cleaned of possible errors or known unwanted structures (e.g. dashed lines), yielding cleaner text structures.
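The dark-pixel selection step just described can be illustrated with a minimal sketch; the luminance threshold of 80 is an illustrative assumption, not a value from Cao and Tan.

```python
# Sketch: keep only black/dark pixels of a color map image as a text mask.
# The threshold value is an illustrative assumption.
DARK_THRESHOLD = 80

def luminance(rgb):
    """Approximate perceived brightness (ITU-R BT.601 weights)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def text_mask(pixels):
    """Map each RGB pixel to True (candidate text/graphics) or False."""
    return [[luminance(p) < DARK_THRESHOLD for p in row] for row in pixels]

row = [(0, 0, 0), (255, 255, 255), (30, 30, 40)]
print(text_mask([row]))  # [[True, False, True]]
```

The resulting binary mask is what a subsequent cleanup pass (e.g. removing dashed lines) and OCR would operate on.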
In addition, a specific attempt at map recognition was realized within the context of the TRECVID workshops (Smeaton et al., 2006). Specifically, the "maps" concept was evaluated in the high level concept feature extraction task of TRECVID 2007 (Kraaij et al., 2007). The best performing system for the map concept was that of Yuan et al. (2007), which is based on
supervised machine learning techniques on several fused visual de-
scriptors. In another approach evaluated in TRECVID 2007 (Ngo et al.,
2007), the authors explore the upper limit of the bag-of-visual-words (BoW) approach based upon local appearance features and evaluate
several factors which could impact their performance. The proposed
system is based on the fusion of Support Vector Machine classifiers
that use BoW, spatial layout of keypoints, edge histogram, grid based
color moment and wavelet texture features. In this context, Chang
et al. (2007) developed a cross-domain SVM (CDSVM) algorithm for
adapting previously learned support vectors from one domain to help
the classification in another domain. However, these algorithms were tested on maps in general, and no testing was performed on heatmaps.
Although research work has been conducted towards the automatic
extraction of information in maps, very few works address the automatic
extraction of information from chemical weather maps or environmental
maps in general. However, such an extraction method has been included
in the European Open Access Chemical Weather Forecasting Portal (ECWFP1), while an overview of the first version of this portal has been presented by Balk et al. (2011). In Epitropou et al. (2011, 2012), a method to reconstruct environmental data out of chemical weather images is described and developed (the AirMerge system). First, the relevant map section is scraped from the chemical weather image. Then, disturbances are
removed and a color classification is used to classify every single data point (pixel), to recover the measured data. With the aid of the known geographical boundaries, given by the coordinate axes and the map projection type, the geographical position of each measured data point can be retrieved. In the case of missing data points, a special interpolation algorithm (based on a novel Artificial Neural Network algorithm developed by the authors) is used to close these gaps. The authors in Moumtzidou
et al. (2012a) and Vrochidis et al. (2012) propose a framework that inte-
grates the system of Epitropou et al. (i.e. AirMerge system) and aims at
automating, and thus facilitating, its use. In both works the proposed system is evaluated only against the AirMerge system (semi-automated versus manual configuration), while in the current work a more extensive evaluation is realized by using the original numerical values that
were generated by the corresponding forecast models as the ground
truth.
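The gap-closing idea mentioned above can be sketched with simple neighbor averaging; note this stands in for, and does not reproduce, the ANN-based interpolation actually used in AirMerge.

```python
# Sketch: fill missing grid points (None) with the mean of their available
# 4-neighbors. A stand-in for AirMerge's ANN-based interpolation, which is
# not reproduced here.
def fill_gaps(grid):
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for i in range(rows):
        for j in range(cols):
            if grid[i][j] is None:
                neigh = [grid[i + di][j + dj]
                         for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))
                         if 0 <= i + di < rows and 0 <= j + dj < cols
                         and grid[i + di][j + dj] is not None]
                if neigh:
                    out[i][j] = sum(neigh) / len(neigh)
    return out

grid = [[1.0, None, 3.0],
        [4.0, 5.0, 6.0]]
print(fill_gaps(grid))  # [[1.0, 3.0, 3.0], [4.0, 5.0, 6.0]]
```

Gaps typically arise where coastlines, borders or watermarks overwrote the data pixels; any such masked pixel is marked `None` before this pass.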
3. Study and description of forecasted chemical weather heatmaps
In this section we present insights into the presentation of environ-
mental information, focusing on air quality and pollen forecasts. The results of a study we have conducted on more than 60 environmental websites (dealing with weather, air quality and pollen), as well as the findings of previous works (Karatzas, 2005), revealed that a considerable share of environmental content, almost 60%, is encoded in images and specifically heatmaps. Overall, it can be said (Balk et al., 2011)
that the chemical weather forecasting information is usually presented
in the form of images representing pollutant concentrations over a geo-
graphically bounded region, typically in terms of maximum or average
concentration values for the time scale of reference, which is usually
the hour or day (Epitropou et al., 2011). These providers present air
quality forecasts almost exclusively in the form of preprocessed images
with a color index scale indicating the concentration of pollutants. In
addition, they individually choose the image resolution and the color
scale employed for visualizing pollution loadings, the covered region,
as well as the geographical map projection. The mode of presentation
varies from simple web images to AJAX, Java or Adobe Flash viewers
(Kukkonen et al., 2009) and while this representation is informative
for the casual user (e.g. compared to a table with numerical values), it
1 http://www.chemicalweather.eu/Domains.
has the drawback that the data are being presented in a wide range of
highly heterogeneous forms, which makes it very complicated to ex-
tract and compare their results. Moreover, some of the images are per-
manently marked with visible watermarks, text, lines etc. that would
make the extraction phase even more challenging.
In general, the heatmaps that contain chemical weather information
are commonly static bitmap images, which represent the coverage data
(e.g. concentrations) in terms of a color-coded scale over a geographical
map. A characteristic example of such a heatmap, obtained from the SILAM FMI2 website, is depicted in Fig. 1.
In general, the information that can be embedded in a heatmap image is the geographical coordinates of the map, the type of environmental aspect (e.g. ozone, birch pollen), the date/time information of the meaningful information and the color scale. After a careful observation of numerous heatmaps, we conclude that the information that is considered of importance, besides the geographical coordinates and the concentrations, is the type of physical property (i.e. concentration of NO2), the date/time information (i.e. 2012-12-6 08:00) and the color scale. Summarizing, the main parts of information that need to be extracted and/or processed from all images are the following:
• Heatmap: map depicting a geographical region with colors representing the values of an environmental quantity.
• Color scale: range indicating the correspondence between value and color.
• Coordinate axes (x, y): indicate the geographical longitude and latitude of every map point for a specific geographic projection. On some heatmaps, the coordinates and their scale are explicit, while for others they must be deduced differently, e.g., by using known landmarks.
• Title: contains information such as the type of measured physical property, the time and date of the forecast, and additional information such as the type of measurement procedure (e.g. hourly average or daily maximum).
• Additional information: watermarks, border and coastal lines, wind fields superimposed on concentration maps and any other information that can be useful for visual interpretation and geographical registration purposes. However, this type of information is categorized as "noise" in terms of influencing the information content and representation value of the specific heatmap.
4. Problem statement and requirements
After having described the format of heatmaps and the type of the
encoded information (i.e. geographical and color information), we will
briefly describe the problem we address and the steps towards its
solution.
The problem can be summarized as follows: retrieval of the numerical values and geographical coordinates of air pollutant concentrations (or of other environmental aspects, such as birch pollen concentration) out of a heatmap, taking into consideration that the original values have been quantized in order to allow their visualization, and thus no one-to-one mapping is possible. The proposed
procedure towards the solution of this problem is a four step process
and is depicted in Fig. 2. The steps that reect the requirements of the
proposed framework are the following:
1) Removal of noisy elements (e.g. border and coastal lines)
2) Retrieval of the heatmap's raster grid's coordinates and mapping
them to actual geographical coordinates
3) Mapping of the heatmap's pixel color to a range of values according
to the color scale
4) Retrieval of the final result, i.e. coordinates and pollutant values.
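Step 2 of the process above amounts to a linear mapping from pixel coordinates of the raster grid to geographic coordinates, given the bounding box recovered from the axes. The bounds and image size below are illustrative assumptions, and a simple equirectangular projection is assumed.

```python
# Sketch: map a pixel (col, row) of the cropped heatmap to (lon, lat),
# assuming an equirectangular projection and a known bounding box.
# All numeric values below are illustrative assumptions.
def pixel_to_geo(col, row, width, height, lon_min, lon_max, lat_min, lat_max):
    lon = lon_min + (col / (width - 1)) * (lon_max - lon_min)
    lat = lat_max - (row / (height - 1)) * (lat_max - lat_min)  # row 0 = top
    return lon, lat

# Illustrative: a 400x300 map covering 10W-40E, 35N-70N
print(pixel_to_geo(0, 0, 400, 300, -10.0, 40.0, 35.0, 70.0))      # (-10.0, 70.0)
print(pixel_to_geo(399, 299, 400, 300, -10.0, 40.0, 35.0, 70.0))  # (40.0, 35.0)
```

Other projections would require replacing this linear mapping with the corresponding inverse projection formula.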
5. Overall architecture of the heatmap processing model
The architecture of the proposed framework draws upon the re-
quirements that were set in the previous section. The idea is to employ
image analysis and processing techniques to map the color variations of the images onto specific categories, which directly correspond to ranges of values, in order to further automate the process supported by the AirMerge system. Normally, the latter relies on manually or programmatically prepared scripts to perform this task, but its modular architecture allows for automating it, making AirMerge suitable for use in
an automated service. Such automation is crucially needed for the use
2 http://silam.fmi.fi/AQ_forecasts/Regional_v4_9/index.html.
Fig. 2. Problem statement and steps involved.
of this system in an open access portal, such as the ECWFP. To this end,
optical character recognition techniques need to be employed for recog-
nizing text encoded in image format, such as image titles, dates, envi-
ronmental information and coordinates, while an annotation tool is
required to support the intervention of an administrative user. Due to
the fact that there is a large variation of images and many different
representations, there is a need for optimizing and configuring the algorithms involved. Specifically, the intervention of an administrative user is required, in order to annotate and manually segment different parts of a new image type (like data, legend, etc.), which need to be processed by the content extraction algorithms. Ideally, the goal is to construct a complete configuration template with metadata for AirMerge in a more automatic way, thus limiting the user input.
The proposed system workflow and the involved modules are depicted in Fig. 3. In order to facilitate this configuration through a graphical user interface, as already discussed, we have implemented the "annotation tool" (AnT), which is tailored for dealing with heatmaps. The output of this tool is a configuration file that holds the static information of the image. The second module is the text extraction and processing, which uses the information of the configuration file to extract data from the corresponding image. More specifically, it retrieves and analyzes the information captured in text format using text processing techniques and OCR. The third module is the heatmap processing, which uses information both from the output of the text processing module and the configuration file to process the heatmap located inside the image.
The input of the framework is an image containing a heatmap and
the output is an XML file, in which each geographical coordinate of the initial heatmap is associated with a value (e.g., pollutant concentration or air quality index).
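The output just described can be sketched with the standard library as follows; the element and attribute names are assumptions for illustration, not the framework's actual schema.

```python
import xml.etree.ElementTree as ET

# Sketch: serialize extracted (lat, lon, value) triples to XML.
# Element/attribute names are illustrative, not the actual schema.
def to_xml(points, pollutant="NO2", unit="ug/m3"):
    root = ET.Element("heatmap", pollutant=pollutant, unit=unit)
    for lat, lon, value in points:
        ET.SubElement(root, "point", lat=str(lat), lon=str(lon),
                      value=str(value))
    return ET.tostring(root, encoding="unicode")

xml = to_xml([(60.2, 24.9, 18.5), (59.4, 24.7, 22.0)])
print(xml)
```

One `point` element per grid cell keeps the output trivially parseable by downstream harmonization services, at the cost of verbosity for dense grids.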
5.1. Annotation tool
To facilitate the annotation process for the user, an annotation tool
(AnT) was developed which can be easily used to annotate heat maps
interactively. The annotation tool was realized in C++ and is based on the Qt framework, which allowed for creating a platform-independent tool. To ensure expandability, the MVC (model/view/controller) pattern was used in the software design. Based on this pattern, two different interaction methods were implemented on two different data views, both of which are derived from one data structure (the loaded XML template). The first data view was implemented as a simple TreeView, which represents the underlying XML data structure and its entries as a traversable tree. The second data view was implemented as a GraphicsView, which is capable of interpreting and viewing the selected datasets graphically. This view is used to draw regions of interest (ROIs) or points of interest (POIs) as overlays over the heatmap. The initial drawing of the data is triggered by the selection of the data element (e.g. the ROI element for the legend) in the TreeView. Fig. 4 shows the AnT tool with an already loaded heatmap from the SILAM FMI website (Fig. 1).
The left section of the AnT user interface shows the loaded heatmap.
The loaded image consists of the following elements: a) the dyed map
(i.e. heatmap), b) the x and y axes of the map, c) the color legend
with its corresponding d) measurement values, and e) the title and de-
scription of the heatmap. The smaller heatmaps on the bottom left and
right are secondary information heatmaps present in this particular in-
stance of a published chemical weather image, which however are not
being considered in this particular example. After selecting a ROI element from the predefined basic template, the ROI is drawn over the heatmap as a red rectangle. Then, the user has the ability to manipulate the ROI directly by moving the rectangle boundaries with the mouse, or alternatively by manipulating the values in the TreeView through direct text input. Both input methods record their changes to the same XML template data structure and update the other data views.
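An XML configuration template of this kind could look as follows; the element names, attributes and pixel values are illustrative assumptions, not AnT's actual schema.

```xml
<!-- Illustrative sketch of a heatmap configuration template;
     element names and values are assumptions, not AnT's actual schema. -->
<template provider="SILAM-FMI">
  <roi name="heatmap" x="40"  y="60"  width="400" height="300"/>
  <roi name="legend"  x="460" y="60"  width="30"  height="300"/>
  <roi name="title"   x="40"  y="10"  width="450" height="30"/>
  <roi name="x-axis"  x="40"  y="365" width="400" height="20"/>
  <roi name="y-axis"  x="10"  y="60"  width="25"  height="300"/>
</template>
```

Each `roi` entry corresponds to one rectangle that the user can drag in the GraphicsView or edit numerically in the TreeView.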
5.2. Text extraction and processing module
This module is driven by the configuration file produced by the AnT tool and focuses on retrieving the textual information captured in the image using text extraction and processing techniques, through a two-step procedure. The first step (i.e. text extraction) includes the application of OCR on the following parts of the input image: title, color scale, and map x and y axes, searching for potential text strings containing information relevant to the heatmap itself. The OCR software used is ABBYY FineReader,3 though in theory any OCR module could be plugged in. It should be noted that the OCR step is not expected to be error-free, and thus a second step (i.e. text processing) for text correction is required. In this step, we apply text processing based on heuristic rules, in order to correct (to a certain extent), extract and understand the semantic information encoded in the aforementioned locations. It should be noted that each of these locations was treated in a different way.
The module produces two output files: the first one is used as input for the heatmap processing and holds information concerning the color scale and the map geographical coordinates, while the second captures general information, such as the date and the type of environmental aspect.
In the sequel, we describe these two steps by applying them on the
characteristic heatmap example of Fig. 1 and present the results. It
should be noted that this example is very demanding, since especially
the resolution of the text that describes the x and y axes is of very low
quality.
5.2.1. OCR on title, color scale, axes
Based on the study on heatmaps (see Section 4), a considerable part of the meaningful information can be extracted from the text surrounding the image. More specifically, the color scale and the map axes are essential elements that provide information about the values and the geographical area covered. On the other hand, the title (usually) contains information about the environmental physical property measured and the corresponding date/time. The location of the aforementioned image parts needs to be captured in the configuration template. Therefore, we apply OCR on the aforementioned parts of the heatmap depicted in Fig. 1. Tables 1, 2, 3 and 4 contain the input and output of
Fig. 3. Overall heatmap content distillation architecture.
3
http://www.abbyy.com.gr/.
5A. Moumtzidou et al. / Ecological Informatics xxx (2013) xxxxxx
Please cite this article as: Moumtzidou, A., et al., A model for environmental data extraction from multimedia and its evaluation against various
chemical weather forecasting datasets, Ecological Informatics (2013), http://dx.doi.org/10.1016/j.ecoinf.2013.08.003
OCR for the title, the color scale and, x and y axes respectively. The
values in bold indicate the errors produced by OCR. It should be noted
that for the cases of the color scale and the x and y axes, we also
retrieved the exact position of the text, in order to relate the latter
with the corresponding colors and geographical coordinates. This is done
on the grounds that it is reasonable to assume that e.g. a number under
a horizontal line in the image most likely represents a longitude value,
while a number located under (or at the side, in case of vertical
scales) the beginning of a color region in the color scale most likely
represents the minimum or starting value for that color.
A careful observation of Tables 1 and 2 shows that the text in the title
and the color scale was identified accurately, compared to that of the
axes. In particular, the results after processing the text on the y axis
contain a lot of errors. This is due to the fact that the resolution of
the figures along the y axis is particularly low, which makes it
difficult even for the human eye to recognize them successfully. We
attempt to correct these errors as much as possible in the second step.
5.2.2. Text processing on OCR results
The next step includes the application of heuristic rules that accrue
from the study of the sites containing heatmaps and aim at correcting
and understanding the semantic information encoded in the aforementioned
locations. Each of these segments is treated in a different way, since
the type of the semantic information included is different.
5.2.2.1. Title. The title usually contains the name of the environmental
aspect, the measurement units and the date/time. Regarding the
measurement units, these are usually standard, depending on the measured
environmental aspect, and therefore we do not attempt to extract them.
The date/time is considered the most complex element, given that it is
presented in several different formats, which need to be handled
separately, using a trial-and-error and maximum likelihood strategy. In
order to correct possible errors in the textual format of the month, day
and aspect, we apply the following procedure:
1) Construct manually three English vocabularies, which are used as
ground truth datasets. These vocabularies hold all the possible values
of the aforementioned elements, that is the month (e.g. January, Jan.),
the day (e.g. Monday, Mon.) and the environmental aspect (e.g. O₃,
ozone);
2) Split the text returned by OCR into words;
3) Compare the words returned by OCR with each one of the manually
constructed English ground truth sets using the Levenshtein distance
metric (Levenshtein, 1966); the Levenshtein distance is a string metric
for measuring the difference (distance) between two sequences (words in
our case); specifically, it is calculated as the minimum number of
single-character edits required to change one word into the other;
4) Correct the initial OCR result by considering the word from the
ground truth dataset that has the minimum distance from it.
In the specific example of Fig. 1, the OCR module recognized correctly
the date/time and aspect parameters and thus no corrections were
required. The information we obtained from the title is the following:
Date/time: 2012-12-06 08:00:00, aspect: NO₂.
Fig. 4. Annotation tool (AnT) user interface.
Table 1
Title input image (top) and OCR output (bottom).
Forecast for NO₂. Last analysis time: 20121206_00
Concentration, μgN/m³, 08Z06DEC2012
Forecast for NO₂. Last analysis time: 20121206_00
Concentration, μgN/m³, 08Z06DEC2012
Table 2
Color scale input image (top) and OCR with text position output (bottom), expressed in
pixel coordinates (horizontal and vertical positions, with upper-left origin (0,0)).
0.1 0.2 0.4 0.8 1.5 2.5 4 7 15 25
Position (left, top, right, bottom): 47, 5, 90, 30 value: 0,1
Position (left, top, right, bottom): 149, 5, 195, 30 value: 0,2
Position (left, top, right, bottom): 248, 5, 294, 30 value: 0.4
Position (left, top, right, bottom): 347, 5, 393, 30 value: 0,8
Position (left, top, right, bottom): 452, 5, 495, 30 value: 1,5
Position (left, top, right, bottom): 548, 5, 594, 30 value: 2,5
Position (left, top, right, bottom): 662, 5, 681, 24 value: 4-
Position (left, top, right, bottom): 764, 5, 780, 30 value: 7
Position (left, top, right, bottom): 857, 5, 891, 31 value: 15
Position (left, top, right, bottom): 953, 5, 990, 30 value: 25
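The vocabulary-matching correction described in steps 3 and 4 above can be sketched in a few lines of Python. The `MONTHS` vocabulary and the `correct` helper below are illustrative assumptions for the sketch, not the framework's actual implementation:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Hypothetical ground-truth vocabulary (a subset, for illustration only)
MONTHS = ["January", "February", "March", "April", "May", "June"]

def correct(ocr_word: str, vocabulary: list) -> str:
    """Replace an OCR'd word with its nearest vocabulary entry."""
    return min(vocabulary, key=lambda v: levenshtein(ocr_word.lower(), v.lower()))
```

For example, an OCR output such as "Januory" is a single substitution away from "January" and would be corrected accordingly.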
5.2.2.2. Color scale. The color scale holds the mapping between color
variations in the map and aspect values. The extraction of information
from the color scale is a two-step procedure. During the first step, the
results of OCR (i.e. values and positions) are corrected, while in the
second we associate values with colors. Regarding the first step, it
should be noted that in case the scale values change in a linear way,
the most common difference among them is calculated and then the scale
values are adapted accordingly. Otherwise, we do not proceed with such
adaptations, since it is possible that the resulting values would not be
correct. The information regarding the linearity of the color scale
values is provided by the administrative user through the AnT tool.
Then, the correlation of values to colors is achieved by taking into
consideration the orientation of the color scale and by using the pixel
coordinates given by OCR.
In the specific example, the values 0.8–1.5 are mapped to the color
found at the (268, 447) coordinates of the initial image. It should be
noted that since the scale values do not increase in a linear way, no
attempt is made to modify them.
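For linear scales, the "most common difference" correction can be sketched as follows. This is a simplified illustration that assumes the first scale value was read correctly; the framework's actual adaptation logic may differ:

```python
from collections import Counter

def correct_linear_scale(values):
    """Snap OCR-recovered scale values onto a uniform grid whose step is
    the most common pairwise difference (assumes values[0] is correct)."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    step = Counter(diffs).most_common(1)[0][0]   # most frequent difference
    return [values[0] + i * step for i in range(len(values))]
```

A mis-read entry such as 36 in the sequence 0, 10, 20, 36, 40 would be snapped back to 30, since the dominant step is 10.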
5.2.2.3. X and Y axes. In order to deal with the x and y axes, similar
processing techniques are applied, since they both represent the
geographical coordinates of the map. Specifically, at least two points
of the map (giving two distinct geographical coordinates), as well as
their position with respect to the map's raster (giving two distinct
pixel coordinates), need to be resolved, in order to successfully
identify all the point coordinates through a geographical bearing
extrapolation procedure. The procedure followed again includes two
steps: a) correction of the errors produced by OCR and b) use of the
coordinates' elements. In a similar way to the color scale processing,
in order to correct the values on both axes we estimate the most common
difference among the axis values and adjust the others accordingly,
since the values in this case change in a linear way.
For the specific example of Fig. 1, after correcting the OCR results, we
associated the geographical coordinates (9°, 70°) and (18°, 68°) with
the image map pixels (162, 130) and (292, 164) respectively. It should
be noted that for this specific site a lot of processing and several
assumptions were required, since the OCR results for the coordinate axes
(especially for the y axis) were not satisfactory.
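Using the two reference points of this example, the extrapolation of coordinates for any pixel can be sketched as a linear (equirectangular) mapping. The helper below is an illustrative assumption, not the framework's exact code:

```python
def make_pixel_to_geo(p1, g1, p2, g2):
    """Build a pixel -> (lon, lat) mapping from two keypoints,
    assuming an equirectangular (linear) projection."""
    (x1, y1), (lon1, lat1) = p1, g1
    (x2, y2), (lon2, lat2) = p2, g2
    lon_per_px = (lon2 - lon1) / (x2 - x1)   # degrees per horizontal pixel
    lat_per_px = (lat2 - lat1) / (y2 - y1)   # degrees per vertical pixel
    def to_geo(x, y):
        return (lon1 + (x - x1) * lon_per_px,
                lat1 + (y - y1) * lat_per_px)
    return to_geo

# Keypoints taken from the Fig. 1 example in the text
to_geo = make_pixel_to_geo((162, 130), (9, 70), (292, 164), (18, 68))
```

Any intermediate pixel is then interpolated between (and extrapolated beyond) the two keypoints.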
5.3. Heatmap processing module
In this section, we present the heatmap processing module that extracts
data from different models and coordinate systems. This is realized by
the AirMerge engine, a complex processing framework whose primary
purpose is the extraction of environmental data from heatmaps, using
image segmentation, scraping and processing algorithms. Even though it
was initially designed for the extraction of chemical weather
forecasting data, its methodology is generalizable to any type of
heatmap, provided that it can be algorithmically processed. In addition,
AirMerge implements auxiliary functionality such as automatic harvesting
of heatmaps, batch processing of large numbers of heatmaps, and
persistence of processing results (database storage).
The most important component of AirMerge, a derivative of which is also
reused (under license) by the proposed framework, is the AirMerge Core
Engine, which performs the conversion from image data (heatmaps) to
numerical gridded data. The functionality and performance of this engine
have been described in Epitropou et al. (2011, 2012) and Karatzas et al.
(2011). The Core Engine performs the extraction of data from heatmaps
using a processing chain that consists of two main procedures: a) the
screen scraping procedure, where raw RGB pixel data are extracted from
heatmaps and classified according to a color scale in order to be mapped
to ranges of numerical values; this procedure concludes with a linear
deprojection phase, where the image's raster is interpreted as a
geographical grid in a specified geographical projection, centered on
reference keypoints; b) the reconstruction of missing values and data
gaps procedure, which deals with noisy elements on heatmaps.
5.3.1. Screen scraping procedure
This step handles the cropping of the original image to a region of
interest and its parsing into a 2D data array directly mapped to the
original image's pixels. It also deals with the association of colors
with minimum/maximum value ranges of the air pollutant concentration
levels, which is often implied by the color scale associated with the
original image. It should be noted that the information about where to
crop, where each color on the legend is, to which index it should
correspond, etc. is provided by the configuration template of the AnT
tool in the proposed system. In this phase, the mapping of the image's
raster to a specific geographical grid is performed, since the images
themselves represent a geographical region. The configuration options of
AirMerge allow for choosing between the most commonly encountered
geographical projections (equirectangular, conical, polar stereographic
etc.) and choosing keypoints in the image to allow for precise
pixel-coordinate mapping. Regarding the pixel-coordinate mapping, while
the selection of keypoints is performed manually when using AirMerge as
a standalone tool, in the proposed work this functionality is realized
in an automatic way with the aid of the text processing and extraction
module.
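The color-to-value classification at the heart of screen scraping can be illustrated with a nearest-color lookup. The tolerance threshold and the data layout of `scale` below are assumptions made for this sketch, not AirMerge's actual parameters:

```python
def classify_pixel(rgb, scale, tolerance=30):
    """Map an RGB pixel to a (min, max) concentration range from the color
    scale, or return None for colors too far from any scale entry."""
    def dist2(c1, c2):
        return sum((a - b) ** 2 for a, b in zip(c1, c2))
    color, value_range = min(scale, key=lambda entry: dist2(rgb, entry[0]))
    return value_range if dist2(rgb, color) <= tolerance ** 2 else None

# Hypothetical two-entry color scale: color -> concentration range
scale = [((0, 0, 255), (0.1, 0.2)), ((255, 0, 0), (15.0, 25.0))]
```

Pixels whose color matches no scale entry within the tolerance are flagged as invalid, which is exactly what the gap-filling procedure of the next step consumes.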
5.3.2. Reconstruction of missing values and data gaps procedure
This step is introduced to deal with unwanted elements such as
legends, text, geomarkings and watermarks, as well as regions that are
Table 3
Coordinates of x axis input image (top) and OCR (with position) output (bottom),
expressed in pixel coordinates (horizontal and vertical positions, with upper-left origin
(0,0)).
6E 9E 12E 15E 18E 21E 24E 27E 30E
Position (left, top, right, bottom): 18, 3, 44, 24 value: 6E
Position (left, top, right, bottom): 147, 3, 173, 24 value: 9E
Position (left, top, right, bottom): 273, 3, 311, 24 value: 1iE
Position (left, top, right, bottom): 401, 3, 441, 24 value: 16E
Position (left, top, right, bottom): 531, 3, 570, 24 value: 1AE
Position (left, top, right, bottom): 659, 2, 684, 24 value: 21
Position (left, top, right, bottom): 689, 2, 702, 24 value: E
Position (left, top, right, bottom): 788, 2, 831, 24 value: 24E
Position (left, top, right, bottom): 918, 2, 960, 24 value: 2?E
Position (left, top, right, bottom): 1050, 3, 1088, 24 value: 3u£
Table 4
Coordinates of y axis input image (left) and OCR with position output (right), expressed
in pixel coordinates (horizontal and vertical positions, with upper-left origin (0,0)).
70N Position (left, top, right, bottom): 36, 69, 77, 86 value: TON
68N Position (left, top, right, bottom): 39, 168, 77, 189 value: WN
66N Position (left, top, right, bottom): 39, 270, 78, 287 value: E4N
64N Position (left, top, right, bottom): 39, 369, 78, 387 value: G4M
62N Position (left, top, right, bottom): 38, 467, 78, 489 value: &2N
60N Position (left, top, right, bottom): 39, 570, 77, 587 value: 60N
58N Position (left, top, right, bottom): 36, 668, 78, 690 value: & N
56N Position (left, top, right, bottom): 35, 770, 78, 788 value: 56N
54N Position (left, top, right, bottom): 36, 885, 77, 891 value: c4
not part of the forecast area, which might be present after the screen
scraping phase. The image pixels are classified into three main
categories: valid data (with colors that satisfy the color scale's
classification), invalid data (with colors not present in the color
scale), and regions containing colors that are explicitly marked for
exclusion, which are considered void during processing. Such marked
regions are not considered part of the forecast, and thus do not undergo
data correction. However, regions containing unmarked invalid data are
considered regions with correctable errors or data gaps, which can be
filled in. This distinction is due to their different appearance
patterns: void regions are usually extended and continuous (e.g. sea
regions not covered by the forecast, but present on the map), while
invalid data regions are usually smaller but more noticeable (e.g.
lines, text, watermarks etc.) and with more noise-like patterns, and
thus it is more compelling to remove them by using gap-filling
techniques. These techniques include traditional grid and pattern-based
interpolation techniques using neural networks.
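A minimal form of such gap filling, replacing each invalid cell with the mean of its valid 4-neighbours, could look as follows. This is a deliberately simple sketch; the framework's actual interpolation, including the neural-network-based variant, is more elaborate:

```python
VOID = object()  # explicitly excluded regions (e.g. sea not in forecast)

def fill_gaps(grid):
    """Fill None cells (invalid data) from valid neighbours; VOID cells
    are not part of the forecast and are left untouched."""
    h, w = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for y in range(h):
        for x in range(w):
            if grid[y][x] is None:
                neigh = [grid[ny][nx]
                         for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                         if 0 <= ny < h and 0 <= nx < w
                         and grid[ny][nx] is not None and grid[ny][nx] is not VOID]
                if neigh:
                    out[y][x] = sum(neigh) / len(neigh)
    return out
```

Note how the void/invalid distinction from the text maps directly to the two sentinel values: `VOID` survives unchanged, `None` is interpolated away.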
In order for the Core Engine module to function, it must be guided
through all the relevant details of the heatmap (position, dimensions,
colors, geographical projection etc.). Normally, this is achieved via an
XML scripting subsystem, which is used as AirMerge's configuration
template. Each distinct type of heatmap needs its own
scripting/configuration file, although similar heatmaps can use the same
configuration with no or only minor variations.
Generally speaking, whenever a new source of environmental heatmaps is
added to AirMerge's list of tasks, a new configuration template/script
(using XML syntax) must be created by hand, though it is possible to
partially customize this template so that a series of templates will be
automatically produced from it. For example, the pattern of the URLs
used by a model provider to publish their own heatmaps can be encoded in
the template, and used to automatically produce variations of the
template only for the parts that vary; e.g. the resolution may be
constant for all images from a given provider, but the color scale may
be different for every available pollutant, and there might be several
different time series available (e.g. 48 or 72 h) for the same pollutant
and region.
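An illustrative configuration template of this kind might look as follows. The element and attribute names are invented for the sake of the example and do not reproduce AirMerge's actual XML schema; only the kinds of information (crop region, keypoints, color scale, URL pattern) are taken from the text:

```xml
<heatmap-template>
  <!-- Hypothetical provider URL pattern; {pollutant} and {hour} vary -->
  <source url="http://provider.example/forecast_{pollutant}_{hour}.png"/>
  <crop left="40" top="60" width="1100" height="900"/>
  <projection type="equirectangular">
    <keypoint pixel="162,130" geo="9E,70N"/>
    <keypoint pixel="292,164" geo="18E,68N"/>
  </projection>
  <color-scale orientation="horizontal">
    <entry color="#0000FF" min="0.1" max="0.2"/>
    <entry color="#FF0000" min="15"  max="25"/>
  </color-scale>
</heatmap-template>
```

It is precisely the manual authoring of files like this that the proposed framework seeks to automate.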
The proposed framework aims at automating the creation of these
configuration scripts, which can be quite time consuming and require
technical skills; thus, any comparisons are drawn primarily between the
accuracy achievable by a technically skilled human operator, who knows
how to classify heatmaps and create appropriate scripts, and a
semi-automated system, which instead relies on cues contained in the
heatmaps themselves and on guidance by environmental operators, who do
not possess technical skills.
5.4. Comparison of the proposed system and AirMerge
Given that both the proposed framework and the AirMerge component can
be employed to perform the same task, it is useful to list their
advantages and disadvantages, in order to make clearer which limitations
of AirMerge the proposed architecture attempts to overcome, and which
errors are introduced when limiting the user intervention.
The advantages of using a manually configured system (i.e. AirMerge)
are that, in general, it is a very accurate system, if spot-on
information (i.e. latitude and longitude lines and their values) is
available, and that it allows a skilled operator to detect optimizations
and cues that are difficult for an automated system to realize, e.g.
template redundancy and reuse (master templates), the use of unusual map
projections, images with little or no geographical cues etc. However,
the main disadvantage is that the manual configuration of the system is
a laborious, time consuming and error prone task, while specific
expertise and technical skills are required.
On the other hand, the proposed framework further automates the data
extraction procedure from heatmaps by relieving human operators from the
tedious task of manual configuration, and allows usage by administrative
users (i.e. environmental experts), who do not have technical skills.
However, this automation does not come without a cost, since it is
possible that error is introduced during the second module, which
includes the OCR and coordinate mapping steps.
Although both systems have pros and cons, they could serve different
application needs. For instance, the proposed framework could be more
useful for administrative environmental experts without technical
skills, while AirMerge could certainly be used by technically qualified
personnel to provide quality measurements. Table 5 contains a brief
overview of the advantages and disadvantages of the manually and the
semi-automatically configured systems.
6. Results
The evaluation of the framework is carried out in two steps with
different focuses. The first step deals with evaluating the text
extraction and processing module (i.e. OCR and text processing using
heuristic rules) by presenting a visual assessment of the output. The
final XML output of the system (i.e. mapping of geographic coordinates
to forecast values) is not provided, since its visual presentation is
not informative, compared to the reconstructed image, which derives from
this representation and is more appropriate for visual inspection. The
second step presents a direct comparison of the results of the proposed
framework with those of the AirMerge system, as well as with the
numerical values obtained from the corresponding forecast models.
6.1. Qualitative evaluation
The tests that have been performed during this step focus on the
recognition of the x and y axes and evaluate the mapping of pixels to
geographical coordinates. Given the fact that in this case we are not
aware of the original forecasted data that were used for constructing
the heatmap values, we assess the results by visual comparison of the
original image and the one produced by the proposed framework.
Table 5
Advantages and disadvantages of manually and semi-automatically configured systems.
Manually configured system (AirMerge):
Advantages: Potentially very accurate, if spot-on information is available. A technically skilled operator can detect optimizations and cues that are difficult for an automated system to realize, e.g. template redundancy and reuse (master templates).
Disadvantages: Creating proper templates is laborious and error prone. Incorrect assumptions on the part of the operator can lead to sub-optimal templates. The template configuration requires technical skills and cannot easily be used by environmental experts.
Semi-automatically configured system (framework):
Advantages: Relieves human operators from a potentially tedious task. A significant step towards the creation of completely automated systems. Can automatically deal with unknown/unlisted types of heatmaps. Usable in a completely automated service.
Disadvantages: Certain types of heatmaps do not contain enough cues for an automated system to completely analyze without manual intervention. Introduction of error during the geographical mapping procedure.
Table 6
OCR error in websites. For each site, the longitude and latitude columns give the original degrees, the estimation and the absolute error.
FMI Pollen: longitude 5, 4.98775, 0.01225; latitude 5, 4.98404, 0.01596
FMI: longitude 5, 4.97516, 0.02484; latitude 5, 4.96523, 0.03477
GEMs: longitude 5, 5.0634, 0.0 125; latitude 5, 4.9776, 0.0 045
LAPS: longitude 2, 1.9924, 0.004; latitude 1, 1.0236, 0.023
AOPG: longitude 2, 1.9937, 0.003; latitude 1, 0.9958, 0.004
The images tested are extracted from the following sites:
FMI Pollen, the Pollen Finnish Meteorological Institute site
(http://pollen.fmi.fi). It contains forecast measurements for several
types of pollen, such as birch and grass, for Europe in general.
FMI, the SILAM Finnish Meteorological Institute site
(http://silam.fmi.fi/). It contains forecasted measurements for several
air pollutants, such as nitrogen oxides and fine particles, for Europe
and for the Northern European countries.
GEMS, the Global and regional Earth-system Monitoring using Satellite
and in-situ data project site (http://gems.ecmwf.int/d/products/raq/).
It contains outputs from several state-of-the-art chemistry and
transport models for Europe.
LAPS, the Laboratory of Atmospheric Physics of the Aristotle University
of Thessaloniki site
(http://lap.physics.auth.gr/forecasting/airquality.htm). It contains
regional air quality forecasts for Greece.
AOPG, the Atmospheric and Oceanic Physics Group site
(http://www.fisica.unige.it/atmosfera/bolchem/MAPS/). It presents the
results of the BOLCHEM numerical model, which simulates the composition
of the atmosphere for Italy.
Fig. 5. Original image retrieved from the Pollen FMI site representing the fraction of birch.
Fig. 6. Reconstructed image.
Fig. 7. Original image retrieved from the FMI site representing the NO₂ forecast
concentration at 500 m using the SILAM model.
Table 6 contains the error introduced by the text extraction and
processing module (called absolute error), during the process of
recognizing the values and the positions of the horizontal (longitude)
and vertical (latitude) axes. This error is introduced mostly due to the
inability of OCR to perfectly identify the position of each coordinate
on the map axes. It is calculated as the average difference of the OCR
estimation (e.g. 4.98775 in the first line) compared to the initial
degrees range (e.g. 5 in the first line) for two consecutive lines, and
represents the error in the latitude and longitude step (i.e. the
difference between two subsequent degrees on the map). It should be
noted that the pixel coordinate matching is based on how well the OCR
recognizes the position of each coordinate axis value and on how well
this value is aligned with the coordinate lines (or ticks). Since
several heatmaps do not include grid lines, this approach relies only on
the position of the coordinates on the heatmap to define the pixel
coordinate matching. In the following, we present in detail the results
for each website.
6.1.1. FMI Pollen website
Fig. 5 is the original image retrieved from the site and represents the
fraction of birch (%). Fig. 6 depicts the reconstructed image produced
by the proposed system after visualizing the XML output. The
reconstructed figure is almost identical and, in addition, any noise
(e.g. black lines) was removed. Moreover, the absolute error both for
the latitude and the longitude step is very low (around 0.3%).
6.1.2. FMI website
In the case of the FMI site, based on visual assessment, the
reconstructed image (Fig. 8) is almost identical to the initial one
(Fig. 7). The original image depicts NO₂ forecast concentrations for a
height of 500 m, as estimated by the SILAM model. The absolute
geo-coordinate error is very low (around 0.6%) for both latitude and
longitude, and thus the error introduced by OCR is not significant.
6.1.3. GEMs website
The images capture O₃ forecast concentrations using the EURAD-IM model.
Figs. 9 and 10 depict the original image and the image reconstructed by
the AirMerge system, which are almost identical; any noise (e.g. black
lines) is removed. In both cases the error is very low.
6.1.4. LAPS site
In a similar way, we present the initial and the reconstructed images of
this website in Figs. 11 and 12. The results are reported in Table 6 and
the error is again very small. The original image was produced using the
Fifth Generation Penn/State Mesoscale Model, MM5, and the Eulerian
photochemical air quality model CAMx, and represents the maximum
concentration of NO₂.
Fig. 8. Reconstructed image.
Fig. 9. Original image from the GEMS site representing the O₃ forecast concentration
using the EURAD-IM model.
Fig. 10. Reconstructed image.
Fig. 11. Original image from the LAPS site representing the maximum forecast
concentration of NO₂ using the Fifth Generation Penn/State Mesoscale model, MM5, and
the Eulerian photochemical air quality model CAMx.
Fig. 12. Reconstructed image.
6.1.5. AOPG site
The results of the last provider report an average error of 0.35%. The
initial and the reconstructed maps are illustrated in Figs. 13 and 14.
It should be noted that the white region in Fig. 13 is treated as void
space in Fig. 14, and considered a distinct case from national border
lines, which are instead treated as unwanted noise and filled in.
Regarding the original image, it represents the concentration of the
PM₁₀ pollutant as predicted by the BOLCHEM model.
6.2. Quantitative evaluation
The quantitative evaluation focuses on comparing the performance of the
AirMerge system and of the proposed framework against the real numerical
data. This is realized by comparing the data reconstructed from the
published images by both systems with the original forecast data as
produced by the forecast model. In this way, we can calculate more
accurately, compared to the first evaluation step, how significant the
error introduced by OCR is, and the quality of the final results.
The tests are performed on a set of 108 images, which are extracted from
the FMI site (http://silam.fmi.fi/AQ_forecasts/Regional_v4_9/index.html),
and the data reconstructed from these images are compared with the
original data provided in a NetCDF format file by FMI. These images are
selected so that multiple air pollutants and times/dates are covered.
The selection of diverse input data aims at retrieving a variety of
images and thus testing the systems with input images that are as
different as possible.
Specifically, the dataset is created using the following restrictions:
6 pollutants are handled (i.e. CO, NO, NO₂, PM₁₀, PM₂.₅ and SO₂)
3 h per day (i.e. 8:00, 16:00, 24:00)
6 days, a weekend and 4 weekdays, were selected
Surface height was used exclusively.
Table 7 contains the following results for each pollutant separately:
a) the number of images, b) the absolute average latitude and longitude
step differences, which indicate the error introduced by the proposed
framework for 5° in each axis, c) the average percentage of pixels with
correct values (i.e. compared with the original numerical values
produced by the SILAM model), d) the average error (er) introduced in
each pixel by AirMerge (AM) during data extraction from the heatmaps
(the mathematical formula for er, which is presented later, is based on
the formula used for estimating the relative error), e) the average
error (er) introduced in each pixel by the framework (FW) due to OCR and
thus misalignment of the coordinates, f) the mean squared error per
pixel for the AirMerge (AM) system and g) the root mean squared errors
of AM and FW respectively, which have the same units as the quantity
being estimated.
Fig. 13. Original image from the AOPG site representing the forecast concentration of
PM₁₀ using the BOLCHEM model.
Fig. 14. Reconstructed image.
The error er is calculated as:
er = (1/n) · Σ_{i=0}^{n} |v_i − ṽ_i| / v_i ,
where n is the total number of pixels, v_i is the original datum at the
specific geographical coordinates, and ṽ_i is the value produced by
AirMerge with manual configuration or by the proposed framework for the
specific coordinates. The mathematical formula for the Mean Squared
Error (MSE) is:
MSE = (1/n) · Σ_{i=0}^{n} (v_i − ṽ_i)² ,
where n, v_i and ṽ_i stand for the same parameters as in the error er.
Finally, the Root Mean Squared Error (RMSE) is defined as the square
root of the MSE:
RMSE = √MSE = √( (1/n) · Σ_{i=0}^{n} (v_i − ṽ_i)² ).
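The three metrics can be computed directly from paired original and reconstructed values; a minimal sketch, assuming flattened per-pixel value lists and the absolute-value (relative error) reading of er:

```python
import math

def extraction_errors(original, estimated):
    """Compute er (mean relative error), MSE and RMSE between the original
    model values and the values reconstructed from the heatmap."""
    n = len(original)
    er = sum(abs(v - e) / v for v, e in zip(original, estimated)) / n
    mse = sum((v - e) ** 2 for v, e in zip(original, estimated)) / n
    return er, mse, math.sqrt(mse)
```

Note that er is dimensionless, while MSE carries squared concentration units and RMSE the same units as the measured quantity.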
Based on Table 7, it is evident that the latitude and longitude errors
are common to all pollutants, since the map types considered are
similar. They are both quite low and thus the proposed framework could
identify rather well the positions of the horizontal and vertical lines
on the map. The percentage of pixels with correct values is satisfactory
for both systems, with that of the framework being slightly lower due to
the OCR error. Regarding the error introduced in each pixel value, it is
in general quite low, except in the case of CO, where the error is
higher. This is probably due to the fact that the values between
sequential pixels varied more coarsely compared to the other cases, a
phenomenon which was also observed in Epitropou et al. (2012) and is
attributable to the use of a linearly spaced, but coarse and sparse,
color scale, as well as to the higher average magnitude of the observed
values. The same applies to the MSE and RMSE errors. In general, it is
evident that the proposed framework introduces an additional error to
the original values compared to AirMerge. However, the error introduced
is not significant and shows that a manually configured extraction
system could be substituted by a semi-automatic one, which could
facilitate the tasks of environmental administrators.
7. Conclusions
In this paper, we have proposed a framework for environmental
information extraction from air quality and pollen forecast heatmaps,
combining image processing, template conguration, as well as textual
recognition components. The proposed framework overcomes the
limitation of not having access to the raw data, since it only considers in-
formation in form of heatmaps that are publicly available on the Inter-
net, and estimates the original numerical forecasted data by using the
reconstructed data of the heatmaps. The evaluation revealed that the
proposed semi-automatically configured system achieves results very
similar to those of the manually configured one (i.e. the estimated
values are close to the original ones), since in most cases no significant
error is introduced by the OCR.
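The core step of estimating numerical data from a heatmap, i.e. inverting the color scale, can be illustrated with a minimal nearest-color lookup. The actual framework relies on the neural adaptive interpolation method of Epitropou et al. (2012), so the sketch below is a simplified stand-in with hypothetical names and toy legend values:

```python
import numpy as np

def invert_color_scale(heatmap_rgb, legend_colors, legend_values):
    """Estimate concentration values from heatmap pixel colors.

    Each pixel is assigned the value of the nearest legend color
    (Euclidean distance in RGB space).
    """
    pixels = np.asarray(heatmap_rgb, dtype=float)    # (H, W, 3)
    legend = np.asarray(legend_colors, dtype=float)  # (K, 3)
    values = np.asarray(legend_values, dtype=float)  # (K,)
    # Squared distance of every pixel to every legend color: (H, W, K)
    d2 = ((pixels[..., None, :] - legend) ** 2).sum(axis=-1)
    return values[np.argmin(d2, axis=-1)]            # (H, W) value grid

# Toy legend: blue -> 0, green -> 10, red -> 50 (e.g. ug/m3)
legend_colors = [(0, 0, 255), (0, 255, 0), (255, 0, 0)]
legend_values = [0.0, 10.0, 50.0]
img = [[(10, 5, 250), (250, 10, 10)]]  # one bluish, one reddish pixel
grid = invert_color_scale(img, legend_colors, legend_values)
```

Such a lookup recovers only the discrete legend levels; interpolation between levels is what the reconstruction method of the framework adds on top.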
Potential uses of the proposed framework include supporting environ-
mental systems that provide air quality information from several
providers for direct comparison or orchestration purposes, or decision
support on everyday issues (e.g. travel planning) (Wanner et al., 2012).
More generally, it provides a way to access sufficiently usable numerical
environmental data for a host of applications that process such data,
without requiring explicit changes in the data publishing policies of
environmental data providers, thus creating more flexibility.
Future work includes evaluation with images in different projections
(such as conical) and an effort to further automate the procedure. This
can be achieved by applying segmentation techniques to the original
image, which would result in the automatic recognition of the bound-
aries of its elements (heatmap, color scale, axes). In this direction, we
plan to investigate and apply segmentation techniques that are based
on rough image features (Hoenes and Lichter, 1994), on Voronoi
diagrams (Kise et al., 1998) and on connected components (Bukhari
et al., 2010).
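As a rough illustration of the connected-component direction, locating candidate element boundaries can be sketched as below. This is a minimal sketch assuming a grayscale image with a uniform background; the function name, background value and area threshold are hypothetical:

```python
import numpy as np
from scipy import ndimage

def find_element_boxes(image_gray, background=255, min_area=4):
    """Return bounding boxes (top, left, bottom, right) of connected
    non-background regions, as candidates for the heatmap, the color
    scale and the axis areas of a forecast image."""
    mask = np.asarray(image_gray) != background
    labels, _ = ndimage.label(mask)                 # 4-connected components
    boxes = []
    for sl in ndimage.find_objects(labels):
        height = sl[0].stop - sl[0].start
        width = sl[1].stop - sl[1].start
        if height * width >= min_area:              # drop speckle noise
            boxes.append((sl[0].start, sl[1].start, sl[0].stop, sl[1].stop))
    return boxes

# Toy grayscale "page": two dark rectangles on a white background
page = np.full((8, 8), 255)
page[1:3, 1:4] = 0    # e.g. the heatmap area
page[5:7, 5:8] = 0    # e.g. the color scale
boxes = find_element_boxes(page)
```

In a real pipeline, the resulting boxes would still need to be classified (heatmap vs. legend vs. axis labels), which is where the cited segmentation approaches come in.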
Acknowledgments
This work was supported by the PESCaDO project (FP7-248594).
References
Balk, T., Kukkonen, J., Karatzas, K., Bassoukos, A., Epitropou, V., 2011. A European open
access chemical weather forecasting portal. Atmos. Environ. 45, 6917–6922.
Bukhari, S., Al Azawi, M.I.A., Shafait, F., Breuel, T.M., 2010. Document image segmentation
using discriminative learning over connected components. Proceedings of the 9th
IAPR International Workshop on Document Analysis Systems (DAS '10). ACM, New
York, NY, USA, pp. 183–190.
Cao, R., Tan, C., 2002. Text/graphics separation in maps. In: Blostein, D., Kwon, Y.-B. (Eds.),
Fourth IAPR Workshop on Graphics Recognition. Lecture Notes in Computer Science,
vol. 2390. Springer, Berlin, pp. 167–177.
Chang, S., Jiang, W., Yanagawa, A., Zavesky, E., 2007. Columbia University TRECVID 2007
high-level feature extraction. Proceedings of TREC Video Retrieval Workshop
(TRECVID '07).
Chiang, Y.Y., Knoblock, C.A., 2006. Classification of line and character pixels on
raster maps using discrete cosine transformation coefficients and support vector
machine. Proceedings of the 18th International Conference on Pattern Recognition,
pp. 1034–1037.
Desai, S., Knoblock, C.A., Chiang, Y.-Y., Desai, K., Chen, C.-C., 2005. Automatically identify-
ing and georeferencing street maps on the web. Proceedings of the 2005 Workshop
on Geographic Information Retrieval (GIR '05). ACM, New York, NY, USA, pp. 35–38.
Epitropou, V., Karatzas, K.D., Bassoukos, A., Kukkonen, J., Balk, T., 2011. A new environmental
image processing method for chemical weather forecasts in Europe. Proceedings of the
5th International Symposium on Information Technologies in Environmental Engineer-
ing, Poznan.
Epitropou, V., Karatzas, K., Kukkonen, J., Vira, J., 2012. Evaluation of the accuracy of an
inverse image-based reconstruction method for chemical weather data. International
Journal of Artificial Intelligence 9 (S12), 152–171.
Henderson, T.C., Linton, T., 2009. Raster map image analysis. Proceedings of the 2009 10th
International Conference on Document Analysis and Recognition (ICDAR '09). IEEE
Computer Society, Washington, DC, USA, pp. 376–380.
Hoenes, F., Lichter, J., 1994. Layout extraction of mixed mode documents. Mach. Vis. Appl.
7, 237–246.
Karatzas, K., 2005. Internet-based management of environmental simulation tasks. In:
Farago, I., Georgiev, K., Havasi, A. (Eds.), Advances in Air Pollution Modelling for
Environmental Security, pp. 253–262 (NATO Reference EST.ARW980503, 406 p.).
Table 7
Results comparing the proposed framework (FW) and the AirMerge system (AM) against
the original numerical values produced by the SILAM model.

Pollutant                                      CO       NO2      NO       PM10     PM2.5    SO2      Total
Number of images                               18       18       18       18       18       18       108
Latitude step difference between AM and FW     8.72 · 10⁻⁴ (common to all pollutants)
Longitude step difference between AM and FW    1.33 · 10⁻⁴ (common to all pollutants)
Average percentage of pixels
without error in value (AM)                    74.9%    83.2%    89.7%    85.6%    86.6%    77.2%    82.9%
Average percentage of pixels
without error in value (FW)                    72.1%    76.4%    89.6%    80.3%    81.5%    69.5%    78.3%
Average error per pixel (AM)                   19.857   0.283    0.025    0.712    0.622    0.188    3.490
Average error per pixel (FW)                   20.566   0.3426   0.029    0.831    0.717    0.238    3.657
RMSE per pixel (AM)                            36.218   0.473    0.219    1.156    0.922    0.454    6.574
RMSE per pixel (FW)                            38.497   0.638    0.250    1.520    1.193    0.618    7.120
A. Moumtzidou et al. / Ecological Informatics xxx (2013) xxx–xxx
Please cite this article as: Moumtzidou, A., et al., A model for environmental data extraction from multimedia and its evaluation against various
chemical weather forecasting datasets, Ecological Informatics (2013), http://dx.doi.org/10.1016/j.ecoinf.2013.08.003
Karatzas, K., Moussiopoulos, N., 2000. Urban air quality management and information
systems in Europe: legal framework and information access. J. Environ. Assess. Policy
Manag. 2 (No. 2), 263–272.
Karatzas, K., Kukkonen, J., Bassoukos, A., Epitropou, V., Balk, T., 2011. A European chemical
weather forecasting portal. In: Steyn, Douw G., Trini Castelli, Silvia (Eds.), 31st ITM -
NATO/SPS International Technical Meeting on Air Pollution Modelling and Its Appli-
cation, Torino, 28 Sept. 2010. Published in Air Pollution Modeling and Its Applications
XXI, Springer, NATO Science for Peace and Security Series C: Environmental Security,
pp. 239–243.
Kise, K., Sato, A., Iwata, M., 1998. Segmentation of page images using the area Voronoi
diagram. Comput. Vis. Image Underst. 70 (3), 370–382.
Kraaij, W., Over, P., Awad, G., 2007. TRECVID-2007 high-level feature task: overview.
Online Proceedings of the TRECVID Video Retrieval Evaluation Workshop.
Kukkonen, J., Klein, T., Karatzas, K., Torseth, K., Fahre Vik, A., San José, R., Balk, T., Sofiev,
M., 2009. COST ES0602: towards a European network on chemical weather forecast-
ing and information systems. Adv. Sci. Res. 1, 1–7.
Kukkonen, J., Olsson, T., Schultz, D.M., Baklanov, A., Klein, T., Miranda, A.I., Monteiro, A.,
Hirtl, M., Tarvainen, V., Boy, M., Peuch, V.-H., Poupkou, A., Kioutsioukis, I., Finardi, S.,
Sofiev, M., Sokhi, R., Lehtinen, K.E.J., Karatzas, K., San José, R., Astitha, M., Kallos, G.,
Schaap, M., Reimer, E., Jakobs, H., Eben, K., 2012. A review of operational, regional-
scale, chemical weather forecasting models in Europe. Atmos. Chem. Phys. 12, 1–87.
Levenshtein, V.I., 1966. Binary codes capable of correcting deletions, insertions, and rever-
sals. Sov. Phys. Dokl. 10, 707–710.
Michelson, M., Goel, A., Knoblock, C.A., 2008. Identifying maps on the World Wide Web.
In: Cova, Thomas J., Miller, Harvey J., Beard, Kate, Frank, Andrew U., Goodchild,
Michael F. (Eds.), Proceedings of the 5th International Conference on Geographic
Information Science (GIScience '08). Springer-Verlag, Berlin, Heidelberg, pp. 249–260.
Moumtzidou, A., Epitropou, V., Vrochidis, S., Voth, S., Bassoukos, A., Karatzas, K., Mossgraber,
J., Kompatsiaris, I., Karppinen, A., Kukkonen, J., 2012a. Environmental data extraction
from multimedia resources. Proceedings of the 1st ACM International Workshop on
Multimedia Analysis for Ecological Data (MAED 2012), November 2, Nara, Japan,
pp. 13–18.
Moumtzidou, A., Vrochidis, S., Tonelli, S., Kompatsiaris, I., Pianta, E., 2012b. Discovery of
environmental nodes in the web. Proceedings of the 5th IRF Conference, Vienna,
Austria, July 2–3.
Musavi, M.T., Shirvaikar, M.V., Ramanathan, E., Nekovei, A.R., 1988. Map processing
methods: an automated alternative. Proceedings of the Twentieth Southeastern Sym-
posium on System Theory. IEEE Computer Society, pp. 300–303.
Ngo, Ch., et al., 2007. Experimenting VIREO-374: bag-of-visual-words and visual-based
ontology for semantic video indexing and search. Proceedings of TREC Video Retrieval
Workshop (TRECVID '07).
Smeaton, A.F., Over, P., Kraaij, W., 2006. Evaluation campaigns and TRECVid. Proceedings
of 8th ACM International Workshop on Multimedia Information Retrieval, California,
USA, pp. 321–330.
Vrochidis, S., Epitropou, V., Bassoukos, A., Voth, S., Karatzas, K., Moumtzidou, A., Mossgraber,
J., Kompatsiaris, I., Karppinen, A., Kukkonen, J., 2012. Extraction of environmental data
from on-line environmental information sources. Artificial Intelligence Applications
and Innovations. IFIP Advances in Information and Communication Technology,
vol. 382, pp. 361–370.
Wanner, L., Rospocher, M., Vrochidis, S., Bosch, H., Bouayad-Agha, N., Bugel, U.,
Casamayor, G., Ertl, T., Hilbring, D., Karppinen, A., Kompatsiaris, I., Koskentalo, T.,
Mille, S., Mossgraber, J., Moumtzidou, A., Myllynen, M., Pianta, E., Saggion, H.,
Serafini, L., Tarvainen, V., Tonelli, S., 2012. Personalized environmental service
configuration and delivery orchestration: the PESCaDO demonstrator. Proceedings
of the 9th Extended Semantic Web Conference (ESWC 2012), Heraklion, Crete,
Greece.
Yuan, Y., et al., 2007. THU and ICRC at TRECVID 2007. Proceedings of TREC Video Retrieval
Workshop (TRECVID '07).