L. Iliadis et al. (Eds.): AIAI 2012 Workshops, IFIP AICT 382, pp. 361–370, 2012.
© IFIP International Federation for Information Processing 2012
Extraction of Environmental Data from On-Line
Environmental Information Sources
Stefanos Vrochidis1, Victor Epitropou2, Anastasios Bassoukos2, Sascha Voth3,
Kostas Karatzas2, Anastasia Moumtzidou1, Jürgen Moßgraber3,
Ioannis Kompatsiaris1, Ari Karppinen4, and Jaakko Kukkonen4
1 Centre for Research and Technology Hellas, Informatics and Telematics Institute
2 Informatics Systems and Applications Group, Aristotle University of Thessaloniki
3 Fraunhofer Institute of Optronics, System Technologies and Image Exploitation
4 Finnish Meteorological Institute, Helsinki
{stefanos,moumtzid,ikom}@iti.gr,
{vepitrop,abas}@isag.meng.auth.gr,
{sascha.voth,juergen.mossgraber}@iosb.fraunhofer.de,
{ari.karppinen,jaakko.kukkonen}@fmi.fi, kkara@eng.auth.gr
Abstract. Analysis of environmental information is considered of utmost
importance for humans, since environmental conditions are strongly related to
health issues and to a variety of everyday activities. Despite the fact that there are
already many free on-line services providing environmental information, there are
several cases, in which the presentation format complicates the extraction and
processing of such data. A very characteristic example is the air quality forecasts,
which are usually encoded in image maps of heterogeneous formats, while the
initial (numerical) pollutant concentrations, calculated and predicted by a relevant
model, remain unavailable. This work addresses the task of semi-automatic
extraction of such information based on a template configuration tool, on
methodologies for data reconstruction from images, as well as on Optical Character
Recognition (OCR) techniques. The framework is tested with a number of air
quality forecast heatmaps demonstrating satisfactory results.
Keywords: Environmental, air quality, heatmap, image processing, OCR, data
reconstruction, template configuration.
1 Introduction
Analysis of environmental information is considered of utmost importance for the human
population, since environmental conditions are strongly related to health issues (e.g.
cardiovascular diseases), as well as to a variety of important activities (e.g. agriculture).
In everyday life, atmospheric conditions, in terms of air quality, weather and pollen
measurements and forecasts, are also of particular interest for outdoor
activities (e.g. trip planning) and therefore strongly affect the quality of life.
Nowadays, the main sources of such information for the everyday user are web
portals and sites. In order to support people in everyday action planning considering
the environmental conditions, we need to provide them with services, which combine
complementary environmental information from several resources, with a view to
generating more reliable environmental measurements. The first step towards this
direction is the extraction of data from environmental resources. In practice only a
few of the data providers make available some means of access to their actual
(numerical) forecast data. In this context, this paper addresses the semi-automatic
extraction of air quality forecasts from heatmap images.
After studying a number of on-line chemical weather forecasts by various providers
[1], it can be said that the air quality information is most usually presented in the form of
images representing forecast pollutant concentrations over a geographically bounded
region, typically in terms of maximum or average air pollution concentration values for
the time scale of reference, which is usually the hour or day [2], [3], [4], [5], [6]. These
providers present their air quality forecasts almost exclusively in the form of
preprocessed images with a color index scale indicating the concentration of pollutants.
In addition, these providers arbitrarily choose the resolution of their images, the color
scale and color depth employed for visualizing pollution loadings, the covered region, as
well as the geographical map projection. The actual mode of presentation varies from
simple web images to more elaborated AJAX, Java or Adobe Flash viewers [7]. While
this representation is informative for the casual user (e.g. compared to a table with
numerical values), it has the drawback that the data are being presented in a wide range
of highly heterogeneous forms, which makes it very complicated to extract and compare
their results. To make matters worse, some of the images are permanently marked with
visible watermarks, text, lines etc., which makes the extraction phase even more challenging.
In order to address this challenge we propose a semi-automatic framework for
extracting air quality information from such images and storing it in a numerical
format. The proposed system is based on an annotation tool, which supports an
administrative user to generate a configuration template for each heatmap, and on
Optical Character Recognition (OCR) techniques for text information extraction. The
basic functionality of the system (i.e. the information extraction from heatmaps), is
based on AirMerge [4], [6], [8], [9], a system that allows for the automatic harvesting,
annotation, harmonization and reverse engineering of heatmaps, in order to come up
with easily deployable numerical values of chemical weather forecasts.
The contribution of this paper is the methodology and the framework for user-
assisted air quality information extraction from heatmaps, which extends previous
works (i.e. AirMerge) by further adding OCR techniques, as well as allowing user
configuration with the aid of a dedicated graphical user interface. More specifically,
we propose a framework, which is based on a novel heatmap Annotation Tool (AnT),
on the application and optimization of OCR techniques for textual information
extraction from heatmap images and on the AirMerge tool [4] for image processing.
This paper is structured as follows: section 2 presents the related work, while
section 3 describes the framework architecture. Section 4 presents the Annotation
Tool, section 5 the OCR techniques and section 6 the AirMerge system. The results
are presented in section 7 and finally, section 8 concludes the paper.
2 Related Work
Existing maps can be grouped into map types based on the placement and
presentation of their information. Discriminating factors between map types can be
found in their scale, colorization, quality, accuracy, topology and many other aspects.
In case of air quality (or chemical weather) maps there are mainly two types of
information covered by the map data: a) Geographical information: points and lines
describing country frontiers or other well-known points of interests or structures (e.g.
sea, land, capitals) in a given coordinate system, b) Color information: measured data
of any kind (e.g. average temperature), which are coded via a color scale representing
the measured values. Single values are referenced geographically by a color value at
the corresponding geographical point. Chemical weather maps often use this type of
map, called a raster map or heatmap image, to represent measured data. There are
several approaches to extracting and digitizing this image information automatically.
First, the authors in [10] describe the process of the vectorization of digital image
data. Hereby the geographical information, in form of lines, is extracted and
converted to digital storable vector data. Only the lines are processed. The work in
[11] makes use of the specific knowledge of the known colorization in USGS maps,
to have the ability to automatically segment these maps based on their semantic
contents (e.g. roads, rivers). In [12] the segmentation quality of text and graphics in
color map images is improved, to enhance the results of the following analysis
processes (e.g. OCR), by selecting black or dark pixels from color maps, cleaning
them up from possible errors or known unwanted structures (e.g. dashed lines), to get
cleaner text structures.
Although research work has been conducted towards the automatic extraction of
information in maps, very few works address the automatic extraction of information
from chemical weather maps. In such works [4], [6], [8], a method to reconstruct
environmental data out of chemical weather images is proposed. In a first step the
relevant map section is scraped from the chemical weather image. After that
disturbances are removed (e.g. country lines) and a color classification is employed to
classify every single data point (pixel), to recover the measured data. With the aid of
the known geographical boundaries, given by the coordinate axis and the map
projection type, the geographical position of the measured data point can be retrieved.
In case of missing data points, a special interpolation algorithm is used to fill these
gaps.
The proposed work goes one step beyond the aforementioned heatmap extraction
methods, since it introduces a configurable user-assisted environment, which
facilitates the application of the framework on new heatmaps without requiring
programming skills and low level configuration on the user’s side.
3 Framework Architecture
The architecture of the proposed framework is illustrated in Figure 1 and includes two
main components: the Annotation Tool and the data extraction service.
The first phase, called “Template Configuration” (1→2→3), includes the manual
annotation of an image with the AnT, and the generation of a configuration file. This
process is controlled by an administrative user with the aid of AnT. The second phase
includes the “Data extraction” (1+3→4→5), which uses the configuration file to
extract data from the specific heatmap. During this phase, the parts of each image are
analyzed using image and text processing techniques. Specifically, the heatmap is
processed with the AirMerge system, while the text information located in the image
is extracted and processed using OCR techniques and text processing.
Fig. 1. Air quality data extraction framework
4 Annotation Tool
The Annotation Tool (AnT) is used to interactively annotate heatmaps and was
developed in order to make the annotation process easier for the user.
To make the tool platform independent, the Qt framework1 was used. The
implementation follows the MVC (Model/View/Controller) pattern, to ensure
its extensibility. To allow for different interaction possibilities, two views were
implemented: a simple Tree View, which represents the XML structure and its
entries as a traversable tree, and a window, which represents the selected tree data
graphically. Regions of Interest (ROIs) and Points of Interest (POIs) are drawn onto
this window.
Figure 2 depicts the AnT tool after a heatmap from the GEMS project2 site is loaded.
The air quality heatmaps contained in the site are typical examples of images used for
representing chemical weather forecasts. The left part of the tool contains the heatmap
as well as the ROIs, which are the following: a) the map itself, b) the x and y axis
related to the heatmap, c) the color scale, d) the numbers corresponding to the color
scale and e) the title of the heatmap. The ROIs are depicted as red bounding boxes
and are defined by the user. Finally, their values are recorded to the right part of the
AnT inside the XML template under the corresponding nodes.
1 http://qt.nokia.com/products/
2 http://gems.ecmwf.int/d/products/raq/
Fig. 2. Screenshot of Image Annotation tool
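The XML template itself is not reproduced in this paper, so the following is only a hypothetical sketch of how the ROIs recorded by AnT might be stored and read back by the extraction service. The element and attribute names (`template`, `roi`, `name`, `x`, `y`, `width`, `height`), the file name and all pixel values are assumptions made for illustration and do not describe the actual AnT schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: node names, attributes and pixel values are
# illustrative assumptions, not the real AnT output format.
TEMPLATE = """
<template image="heatmap.png">
  <roi name="map" x="60" y="40" width="640" height="480"/>
  <roi name="x_axis" x="60" y="520" width="640" height="20"/>
  <roi name="y_axis" x="20" y="40" width="40" height="480"/>
  <roi name="color_scale" x="720" y="40" width="30" height="480"/>
  <roi name="scale_values" x="750" y="40" width="40" height="480"/>
  <roi name="title" x="60" y="5" width="640" height="30"/>
</template>
"""


def load_rois(xml_text):
    """Return {roi name: (x, y, width, height)} parsed from a template string."""
    root = ET.fromstring(xml_text.strip())
    rois = {}
    for node in root.iter("roi"):
        box = tuple(int(node.get(key)) for key in ("x", "y", "width", "height"))
        rois[node.get("name")] = box
    return rois


rois = load_rois(TEMPLATE)
for name, box in rois.items():
    print(name, box)
```

Keeping each ROI as a named bounding box would let the OCR module and the image processing engine each pick up only the regions they need.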
5 OCR Techniques
The OCR module uses the information of the configuration file to extract textual data
from images and improve the results using text processing based on heuristic rules.
The first processing step of this module includes the application of OCR on
specific parts of the input image, such as the title, the color scale and the map x and y
axes. The OCR software that was used is ABBYY FineReader3.
In the second step, we apply text processing based on heuristic rules in order to
correct, extract and understand the semantic information encoded in the
aforementioned locations. Each of these locations was treated in a different way.
The title, if it exists, usually contains the name of the aspect, the measurement
units and the date/time. The measurement units are usually standard depending on the
measured aspect so we do not need to extract them. The date/time is considered as the
element that is the most difficult to extract, given the fact that many different formats
exist. In order to correct possible mistakes in the textual format of the month, day or
aspect we exploited the Levenshtein distance. More specifically, three English ground
truth sets were created for the three aforementioned elements and were compared to
the corresponding OCR result. We then selected the ground truth element with the
smallest Levenshtein distance from the text generated by the OCR.
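As an illustration of this correction step, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and picks the closest ground truth entry. The function names, the lower-casing of both strings and the sample misreadings are assumptions made for the example; only the month ground truth set is shown.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca -> cb
        prev = curr
    return prev[-1]


# Ground truth set for one of the three elements (months).
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]


def closest(token: str, ground_truth: Sequence[str]) -> str:
    """Return the ground truth entry with the smallest distance to the OCR token."""
    return min(ground_truth, key=lambda w: levenshtein(token.lower(), w.lower()))


# Hypothetical OCR misreadings of month names.
for ocr_text in ["Febnjary", "0ctober", "Decemher"]:
    print(ocr_text, "->", closest(ocr_text, MONTHS))
```

The same `closest` helper could be reused with the day and aspect ground truth sets.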
The color scale contains the values that each color of the map corresponds to. The
processing and extraction of information from the color scale element can be divided
into two main parts. In the first, we attempt to check and correct OCR results for the
scale, while in the second we correlate values to colors. In order to correct the OCR
results, we find the most common range among the scale values and adapt
accordingly the mistaken values. The correlation of values to colors is achieved by
pointing at the middle of each color by using the coordinates of the values.
3 http://www.abbyy.com/
[Fig. 2 panel labels: Heatmap and component parts; XML Template]
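One possible reading of the color scale correction is sketched below. It assumes that the legend is evenly spaced, takes the most frequent difference between neighbouring OCR values as the "most common range", and rebuilds the values from a pair of neighbours that already respects it. It also assumes that the value labels mark the boundaries between color bands, so that a color can be sampled halfway between two labels. Both function names and the sample numbers are hypothetical.

```python
from collections import Counter


def correct_scale(values):
    """Rebuild an evenly spaced legend from OCR readings with a few misreads."""
    if len(values) < 3:
        return list(values)
    # Differences between neighbouring labels; rounding guards against float noise.
    diffs = [round(b - a, 6) for a, b in zip(values, values[1:])]
    step = Counter(diffs).most_common(1)[0][0]
    # Anchor on the first label whose successor already respects the step.
    anchor = diffs.index(step)
    return [values[anchor] + (i - anchor) * step for i in range(len(values))]


def band_midpoints(label_positions):
    """Pixel positions halfway between consecutive labels, where colors are sampled."""
    return [(a + b) / 2 for a, b in zip(label_positions, label_positions[1:])]


# Hypothetical OCR output where 60 was misread as 80.
print(correct_scale([0, 20, 40, 80, 80, 100]))
# Hypothetical vertical pixel positions of three labels on the legend.
print(band_midpoints([400, 320, 240]))
```

This sketch would not cope with legends whose ranges are deliberately uneven; such cases would need the template to state the expected values.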
The last two elements that are analyzed with OCR are the x and y axes. In the case
of heatmaps the two axes contain similar information and thus we will apply similar
processing techniques to them. The information that can be obtained by each axis is
the geographical coordinates of the points of the map. In order to realize this, we have
to identify successfully at least two points (x, y coordinates) of the map axes and
define their position in relation to the map. In order to identify these points, we first
correct most of the errors produced by OCR and then use the coordinates of the
elements, as defined by OCR to specify the position of those points.
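The need for at least two recognized points can be made concrete with a minimal sketch: two tick marks with known pixel positions and coordinate values fix a linear mapping from pixels to coordinates. The function name and the sample tick positions are assumptions, and the linear relation only holds for axes with evenly spaced coordinates (e.g. an equirectangular map).

```python
def axis_mapping(p0, c0, p1, c1):
    """Build a pixel-to-coordinate mapping from two recognized tick marks.

    p0, p1: pixel positions of the ticks along the axis.
    c0, c1: coordinate values (e.g. degrees) read by OCR at those ticks.
    Returns (step, to_coord), where step is the coordinate change per pixel.
    """
    if p0 == p1:
        raise ValueError("the two tick marks must be at different pixel positions")
    step = (c1 - c0) / (p1 - p0)

    def to_coord(pixel):
        return c0 + (pixel - p0) * step

    return step, to_coord


# Hypothetical ticks: 0 degrees at pixel 100 and 5 degrees at pixel 164.
step, to_coord = axis_mapping(100, 0.0, 164, 5.0)
print(step)
print(to_coord(132))
```

Any OCR error in the tick position shifts `step`, which is why the accuracy of the axis recognition directly affects where each pixel is placed on the map.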
6 AirMerge
AirMerge is a web-based system that supports harvesting Chemical Weather forecast
images and converting them to numerical data. A derivative of its image processing
engine is used in the Data Extraction phase of the proposed toolchain, and it is already
used for creating harmonized, numerical Chemical Weather data4.
The AirMerge system combines elements of screen scraping and innovative image
processing algorithms [4], [6], [8] in order to produce uniform, indexed data. These
data are then stored in a back-end database and may be recalled for further processing
such as numerical applications, model ensembles, visualization, transformation etc.
The main task of AirMerge is the extraction of data from heatmaps. This is
achieved by using a processing chain that consists of a “screen scraping” phase,
where raw RGB pixel data are extracted from heatmaps, a mapping phase, where
RGB values are classified to a color scale and mapped to ranges of numerical values
and a linear deprojection phase, where the images’ raster is interpreted as a
geographical grid in a specified geographical projection, centered on key points.
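The mapping phase can be pictured with the following minimal sketch, which assigns each pixel to the nearest legend color and returns the associated value range. The nearest-color rule, the Euclidean RGB distance, the tolerance and the legend itself are assumptions chosen for illustration; the text does not specify how AirMerge performs the classification internally.

```python
# Hypothetical legend: (RGB color, (min, max) concentration range).
SCALE = [
    ((0, 0, 255), (0, 20)),
    ((0, 255, 0), (20, 40)),
    ((255, 255, 0), (40, 60)),
    ((255, 0, 0), (60, 80)),
]


def classify_pixel(rgb, scale=SCALE, tolerance=30.0):
    """Return the value range of the closest legend color, or None if too far."""
    best_range, best_dist = None, float("inf")
    for color, value_range in scale:
        # Euclidean distance in RGB space (an assumed similarity measure).
        dist = sum((a - b) ** 2 for a, b in zip(rgb, color)) ** 0.5
        if dist < best_dist:
            best_range, best_dist = value_range, dist
    return best_range if best_dist <= tolerance else None


def classify_image(pixels):
    """Classify a 2D list of RGB tuples into a 2D list of value ranges."""
    return [[classify_pixel(rgb) for rgb in row] for row in pixels]


# Hypothetical 1x3 image: slightly off red, pure green, and a black line pixel.
print(classify_image([[(250, 5, 5), (0, 255, 0), (0, 0, 0)]]))
```

Pixels that return `None` here are those whose color does not belong to the scale.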
Screen scraping procedure: This step handles the cropping of the original image to a
region of interest and parsing of it into a 2D data array directly mapped to the original
images’ pixels. Also, it associates the color to minimum/maximum value ranges of
the air pollutant concentration levels, which is often implied by the color scale
associated with the original images. It should be noted that the information about
where to crop, where each color on the legend is, to which index it should correspond,
etc. are provided by the configuration template of the AnT in the proposed system. In
this phase, the mapping of the images’ raster to a specific geographical grid is
performed, since the images themselves represent a geographical region. The
configuration system allows choosing between the most commonly encountered
geographical projections (equirectangular, conical, polar stereographic etc.) and
choosing key points in the image to allow for precise pixel-coordinate mapping.
“Reconstruction of missing values and data gaps” procedure: This step is introduced
to deal with unwanted elements such as legends, text, geomarkings and watermarks,
as well as regions that are not part of the forecast area, which might be present after
the screen scraping phase. The image pixels are classified into three main categories:
valid data (with colors that satisfy the color scale’s classification), invalid data (with
colors not present in the color scale), and regions containing colors that are explicitly
marked for exclusion, and which are considered void during processing. Such marked
regions are not considered as part of the forecast, and thus do not undergo data
correction. However, regions containing unmarked invalid data are considered as
regions with correctable errors or “data gaps” which can be filled-in. This distinction
is due to their different appearance patterns: void regions are usually extended and
continuous (e.g. sea regions not covered by the forecast, but present on the map),
while invalid data regions are usually smaller but more noticeable (e.g. lines, text,
watermarks etc.) and with more noise-like patterns, and thus it is more compelling to
remove them by using gap-filling techniques. These techniques include traditional
grid and pattern-based interpolation techniques using neural networks.
It should be noted that the AirMerge system functionality is also provided via an
API, which is available as a REST service [9]. Therefore, AirMerge can serve any
request related to the heatmaps of many chemical weather models (e.g. everyday
processing), thus making it suitable for environmental service-oriented applications.
4 http://projects.isag.meng.auth.gr/airmerge/
7 Results
In this section, we present the results of the framework when applied in three air
quality heatmaps from different providers. Since an evaluation of AirMerge is already
provided in [8], we evaluate the results of the OCR and the total system output.
Regarding the OCR, we focus on the recognition of the x and y axes, since this is the
most important information in order to correctly map the air quality index onto the
right coordinates. The following providers are considered for the evaluation: GEMS5,
Laboratory of Atmospheric Physics of the Aristotle University of Thessaloniki6 and
the Atmospheric and Oceanic Physics Group7.
7.1 GEMs Website
Figures 3 and 4 depict the original and reconstructed image by the AirMerge system,
which are almost identical, and any noise (e.g. black lines) was removed [4], [8].
Fig. 3. Original Image
Fig. 4. Image Reconstructed from AirMerge
5 http://gems.ecmwf.int/d/products/raq/
6 http://lap.physics.auth.gr/forecasting/fore_images/
7 http://www.fisica.unige.it/atmosfera/bolchem/MAPS/
Table 1. Results for the GEMs website
Resolution Steps         Correct Value   Estimated Value   Absolute Difference   Error
Longitude step 0.0791    5               5.0634            0.0634                1.25%
Latitude step 0.0777     5               4.9776            0.0224                0.45%
During this process an error is usually introduced mostly due to the inability of
OCR to perfectly identify the position of each coordinate on the map axes. In table 1
we report the longitude and latitude steps (i.e. the coordinate step for each pixel), the
correct step value between two subsequent coordinate marks (e.g. when the marks are
0°, 5°, 10°, etc. the step value is 5), the estimated value, the absolute difference and
finally the introduced error. In both cases the error is very low and acceptable (in
general we assume that an error is acceptable, when it is less than 3%).
7.2 Laboratory of Atmospheric Physics of the AUTH site
In a similar way we present the initial and the reconstructed image of this website in
Figures 5 and 6. The results are reported in table 2 and the error is again very small.
Fig. 5. Original Image
Fig. 6. Image Reconstructed from AirMerge
Table 2. Results for the Laboratory of Atmospheric Physics of the AUTH website
Resolution Steps         Correct Value   Estimated Value   Absolute Difference   Error
Longitude step 0.0311    2               1.9924            0.0076                0.4%
Latitude step 0.0276     1               1.0236            0.0236                2.3%
7.3 Atmospheric and Oceanic Physics Group Site
Finally, in table 3 we present results for the last provider reporting an average error of
0.35%. The initial and the reconstructed map are illustrated in figures 7 and 8. It
should be noted that the white region in figure 7 is treated as “void space” in figure 8,
and considered as a distinct case from national border lines, which are instead treated
as unwanted noise and filled-in.
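The error figures in Tables 1 to 3 can be checked with a short worked example. The reported percentages match the absolute difference divided by the estimated value (for example 0.0236 / 1.0236 ≈ 2.3%) rather than by the correct value; this convention is inferred from the numbers and is not stated in the text. The function name and the row layout below are assumptions made for the sketch.

```python
# (table, axis, correct step, estimated step, reported error in %),
# copied from Tables 1 to 3.
ROWS = [
    ("Table 1", "longitude", 5, 5.0634, 1.25),
    ("Table 1", "latitude", 5, 4.9776, 0.45),
    ("Table 2", "longitude", 2, 1.9924, 0.4),
    ("Table 2", "latitude", 1, 1.0236, 2.3),
    ("Table 3", "longitude", 2, 1.9937, 0.3),
    ("Table 3", "latitude", 1, 0.9958, 0.4),
]

ACCEPTABLE_ERROR = 3.0  # percent, the threshold stated in the text


def relative_error(correct, estimated):
    """Absolute difference as a percentage of the estimated step (assumed convention)."""
    return abs(correct - estimated) / estimated * 100.0


for table, axis, correct, estimated, reported in ROWS:
    error = relative_error(correct, estimated)
    status = "acceptable" if error < ACCEPTABLE_ERROR else "too large"
    print(f"{table} {axis}: computed {error:.2f}% (reported {reported}%), {status}")
```

Under this convention every row stays below the 3% threshold, with the latitude step of Table 2 being the largest deviation.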
Fig. 7. Original Image
Fig. 8. Image Reconstructed from AirMerge
Table 3. Results for the Atmospheric and Oceanic Physics Group website
Resolution Steps         Correct Value   Estimated Value   Absolute Difference   Error
Longitude step 0.0289    2               1.9937            0.0063                0.3%
Latitude step 0.0249     1               0.9958            0.0042                0.4%
8 Conclusions
Despite the fact that the current presentation format of air quality forecasts might be
ideal for casual users, it is not easily accessible by automatic services which would
expect a structured and numerical format of the forecast data. In this context, we have
proposed a framework for air quality information extraction from heatmaps,
combining existing (AirMerge), as well as new (AnT and OCR) components. This
framework could serve as a basis for supporting environmental systems that provide
either air quality information from several providers for comparison or orchestration
purposes or high level suggestions on everyday issues (e.g. travel planning) based on
advanced decision support [13], which could facilitate the quality of life. The
proposed work overcomes the limitation of not having access to the raw data, since it
only considers information being publicly available on the Internet, thus respecting
also data access policies. Although the system has been tested with forecast air quality
heatmaps it could also deal with observed pollutant and pollen concentrations
represented in the same way. Future work includes extensive evaluation with more
images in different projections (e.g. conical) and addressing of pollen heatmaps.
Acknowledgments. This work was supported by PESCaDO project (FP7-248594).
References
1. Balk, T., Kukkonen, J., Karatzas, K., Bassoukos, A., Epitropou, V.: A European open
access chemical weather forecasting portal. Atmospheric Environment 45, 6917–6922
(2011), doi:10.1016/j.atmosenv.2010.09.058
2. Karatzas, K.: Internet-based management of Environmental simulation tasks. In Farago, I.,
Georgiev, K., Havasi, A. (eds) Advances in Air Pollution Modelling for Environmental
Security, Hardcover, NATO Reference EST.ARW980503, 406 p., pp. 253–262. Springer
(2005) ISBN: 1-4020-3349-4
3. San José, R., Baklanov, A., Sokhi, R.S., Karatzas, K., Pérez, J.L.: Computational Air
Quality Modelling. In: Jakeman, A.J., Voinov, A.A., Rizzoli, A.E., Chen, S.H. (eds.)
Developments in Integrated Environmental Assessment. Environmental Modelling,
Software and Decision Support, vol. 3 (2008) ISBN: 9780080568867
4. Epitropou, V., Karatzas, K., Bassoukos, A.: A method for the inverse reconstruction of
environmental data applicable at the Chemical Weather portal. In: Geospatial Crossroads
@GI_Forum 2010, Proceedings of the GeoInformatics Forum Salzburg, pp. 58–68.
Wichmann Verlag, Berlin (2010) ISBN 978-3-87907-496-9
5. Karatzas, K., Kukkonen, J., Bassoukos, A., Epitropou, V., Balk, T.: A European Chemical
Weather forecasting Portal. In: 31st ITM - NATO/SPS International Technical Meeting on
Air Pollution Modelling and its Application, Torino, September 28 (2010); Published in
Steyn, D.G., Trini Castelli, S. (eds.) Air Pollution Modeling and its Applications XXI, 1st
edn., Hardcover. NATO Science for Peace and Security Series C: Environmental Security,
pp. 239–243. Springer (2011) ISBN 978-94-007-1358-1
6. Epitropou, V., Karatzas, K.D., Bassoukos, A., Kukkonen, J., Balk, T.: A new
environmental image processing method for chemical weather forecasts in Europe. In:
Proceedings of the 5th International Symposium on Information Technologies in
Environmental Engineering, Poznan, July 6-8 (2011)
7. Kukkonen, J., Klein, T., Karatzas, K., Torseth, K., Fahre Vik, A., San José, R., Balk, T.,
Sofiev, M.: COST ES0602: Towards a European network on chemical weather forecasting
and information systems. Advances in Science and Research Journal 1, 1–7 (2009)
8. Epitropou, V., Karatzas, K., Kukkonen, J., Vira, J.: Evaluation of the accuracy of an
inverse image-based reconstruction method for chemical weather data. International
Journal of Artificial Intelligence 9/A12, 152–171 (2012)
9. Epitropou, V., Johansson, L., Karatzas, K., Bassoukos, A., Karppinen, A., Kukkonen, J.,
Haakana, M.: Fusion of Environmental Information for the Delivery of Orchestrated
Services for the Atmospheric Environment in the PESCaDO Project. In: Seppelt, R.,
Voinov, A.A., Lange, S., Bankamp, D. (eds.) 2012 International Congress on
Environmental Modelling and Software, Managing Resources of a Limited Planet,
Leipzig, Germany. International Environmental Modelling and Software Society (iEMSs)
(in press, 2012)
10. Musavi, M.T., Shirvaikar, M.V., Ramanathan, E., Nekovei, A.R.: Map processing
methods: an automated alternative. In: Proceedings of the Twentieth Southeastern
Symposium on System Theory, pp. 300–303. IEEE Computer Society (1988)
11. Henderson, T.C., Linton, T.: Raster Map Image Analysis. In: Proceedings of the 2009 10th
International Conference on Document Analysis and Recognition (ICDAR 2009), pp. 376–
380. IEEE Computer Society, Washington, DC (2009)
12. Cao, R., Tan, C.-L.: Text/Graphics Separation in Maps. In: Blostein, D., Kwon, Y.-B.
(eds.) GREC 2001. LNCS, vol. 2390, pp. 167–177. Springer, Heidelberg (2002)
13. Wanner, L., Vrochidis, S., Tonelli, S., Moßgraber, J., Bosch, H., Karppinen, A., Myllynen,
M., Rospocher, M., Bouayad-Agha, N., Bügel, U., Casamayor, G., Ertl, T., Kompatsiaris,
I., Koskentalo, T., Mille, S., Moumtzidou, A., Pianta, E., Saggion, H., Serafini, L.,
Tarvainen, V.: Building an Environmental Information System for Personalized Content
Delivery. In: Hřebíček, J., Schimak, G., Denzer, R. (eds.) ISESS 2011. IFIP AICT,
vol. 359, pp. 169–176. Springer, Heidelberg (2011)