An Environmental Search Engine Based on Interactive
Visual Classification
Stefanos Vrochidis1, Harald Bosch2, Anastasia Moumtzidou1,
Florian Heimerl2, Thomas Ertl2, Ioannis Kompatsiaris1
1Information Technologies Institute
6th Klm Charilaou-Thermi Road
Thessaloniki, GR-5700, Greece
{stefanos, moumtzid, ikom}@iti.gr
2University of Stuttgart
Universitätstraße 38
70569 Stuttgart, Germany
firstname.surname@vis.uni-stuttgart.de
ABSTRACT
Environmental conditions play a very important role in hu-
man life. Nowadays, environmental data and measurements
are freely made available through dedicated web sites, ser-
vices and portals. This work deals with the problem of
discovering such web resources by proposing an interactive
domain-specific search engine, which is built on top of a
general purpose search engine, employing supervised ma-
chine learning and advanced interactive visualization tech-
niques. Our experiments and the evaluation show that in-
teractive classification based on visualization improves the
performance of the system.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous;
D.2.8 [Software Engineering]: Metrics—complexity mea-
sures, performance measures
General Terms
Algorithms, Performance, Experimentation
Keywords
Environmental, search engine, visualization, classification,
interaction, domain-specific search
1. INTRODUCTION
Environmental information plays a very important role
in human life, since it is strongly related to health issues
(e.g. cardiovascular diseases), as well as to everyday out-
door activities. Environmental measurements gathered by
dedicated stations are usually made available through web
portals and services (referred to in this work as environmen-
tal nodes) in multimedia formats including text, images and
tables. The discovery of environmental nodes is very impor-
tant in order to offer direct access to a variety of resources
providing complementary environmental information.
This paper addresses the discovery of environmental nodes,
which is considered a domain-specific search problem. To
address it, we propose a novel inter-
active and relevance feedback-based domain-specific search
engine, which extends the current state of the art by propos-
ing an interactive visualization layer on top of domain-specific
search techniques. The proposed search engine is developed
to support expert users to retrieve websites providing envi-
ronmental measurements. As experts, we consider users
who have a basic understanding of the classification algo-
rithm and basic knowledge of environmental aspects (i.e.
weather, pollen, air quality).
The main methodologies used for the implementation of
a domain-specific search engine can be divided into two cat-
egories: a) post-processing results of existing web search
engines and b) directed crawling. Our implementation falls
into the first category. In a related work [5], the authors at-
tempt to automatically enrich an ontology by post-processing
domain-specific search. In a more recent work [4],
the proposed vertical search engine combines visual and tex-
tual features in order to retrieve images of a certain
category. This work differs from the aforementioned
approaches by introducing an interactive classification layer
for post-processing the initial results of a search engine using
visualization techniques.
Supervised classifier learning methods have been shown to be
very effective for document classification. The visual compo-
nent of the proposed environmental search engine is inspired
by the fully interactive classifier training tool of our previous
work [3]. It was adapted to website data and modified to
work with radial SVMs. Similar to our method, other ap-
proaches attempt to give humans control during
the retrieval process. Interactive decision tree construction
systems are the most commonly described approach to visu-
alizing interactive machine learning concepts (e.g. [9]). In
another relevant work, Fogarty et al. [2] develop an image
retrieval system that helps users in creating classifiers for
certain concepts.
The contribution of this work is the novel environmental-
specific search engine, which extends current approaches of
domain-specific search by offering advanced options of inter-
action and relevance user feedback functionalities based on
visualization.
This paper is structured as follows: Section 2 presents
the framework of the environmental search engine, including
the web search techniques and the interactive classification.
Then, Section 3 describes the experimental setup, Section 4
presents the results and the evaluation, and finally Section
5 concludes the paper.

Figure 1: The framework of the interactive search
engine.
2. ENVIRONMENTAL SEARCH ENGINE
The framework of the interactive environmental-specific
search engine is illustrated in Figure 1 and consists of two
main parts in which the user can interact: the web search and
the interactive classification.
The web search component handles the query submission
to a general purpose search engine. The list of search re-
sults already includes many relevant websites but also many
irrelevant ones. We therefore aim to improve the precision
of the initial results by adding a filtering step, which is
based on SVM classification. To this end, we extract textual
features from the initial results and subsequently train an
initial classifier to create a model that could discriminate
relevant from irrelevant websites. To further improve the
classifier performance we apply an active learning inspired
technique by visualizing the state of the classifier on a 2D
plane and enable users to iteratively correct the labels of mis-
classified websites or to confirm uncertain ones. Then the
SVM model is re-trained by employing the enhanced train-
ing set, which also includes the user annotations in order to
provide more relevant results in the next round.
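The following minimal Python sketch outlines one possible reading of
this train/classify/correct/retrain cycle. It is illustrative only:
scikit-learn's SVC (a wrapper around LIBSVM) stands in for the actual
implementation, and collect_user_labels is a hypothetical placeholder
for the interactive visualization step in which users confirm or
correct labels.

    # Illustrative sketch of the iterative classification loop (not the
    # authors' code). collect_user_labels is a hypothetical stand-in for
    # the interactive visualization in which users confirm/correct labels.
    import numpy as np
    from sklearn.svm import SVC

    def interactive_search_loop(X_train, y_train, X_results,
                                collect_user_labels, rounds=3):
        model = SVC(kernel="rbf", C=1.0)
        for _ in range(rounds):
            model.fit(X_train, y_train)
            scores = model.decision_function(X_results)  # confidence per result page
            corrections = collect_user_labels(scores)    # [(index, label), ...]
            if not corrections:
                break
            for idx, label in corrections:
                X_train = np.vstack([X_train, X_results[idx]])
                y_train = np.append(y_train, label)
        return model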
2.1 Web Search
The web search is realized using the Yahoo! BOSS API1 as
a general purpose search engine. The query formulation is
based on a semi-automatic technique, in which the users
select only the environmental aspect they are interested in
1 http://developer.yahoo.com/search/boss/
(i.e. weather, air quality, pollen) and the geographical area
of interest. To generate the environmental keywords, we
constructed one set of domain-specific terms for each envi-
ronmental aspect based on empirical knowledge. For the
geographical area selection we employ a map-based
interface, where the users can select the geographic area of
interest by drawing a circle.
Finally, the query is produced by combining environmental
terms with geographical locations. An example of such a
query is: “weather+Helsinki”.
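A small sketch of this query formulation step is given below; the
aspect term lists are invented examples and not the actual
domain-specific sets used in the system.

    # Illustrative query formulation: combine aspect-specific terms with
    # the selected locations. The term lists here are invented examples.
    ASPECT_TERMS = {
        "weather":     ["weather", "weather forecast", "temperature"],
        "air quality": ["air quality", "ozone", "PM10"],
        "pollen":      ["pollen", "pollen forecast"],
    }

    def build_queries(aspect, locations):
        return [f"{term}+{loc}" for term in ASPECT_TERMS[aspect]
                for loc in locations]

    print(build_queries("weather", ["Helsinki"]))  # ['weather+Helsinki', ...]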
2.2 Interactive Classification
In the following, we present the automatic classification based
on SVMs and describe how its results can be improved
further in an interactive post-processing step.
2.2.1 Support Vector Machine Classification
For the application of a supervised machine learning method,
labeled training data have to be created and appropriate fea-
tures have to be extracted for training the classifier. The initial
training set is constructed manually by labeling relevant and
non-relevant websites. Given the fact that the general pur-
pose search engine will provide results related to the envi-
ronmental area, we select as negative samples both random
websites and portals discussing generic environmental issues.
Our observation of environmental websites suggests that
the most interesting information is encoded in textual for-
mat; therefore, feature extraction is based on the bag-of-words
model with textual features. Specifically, we extract a code-
book of the most frequent terms from the training set and
then we create a feature vector that represents each webpage
in this feature space. The term extraction is done with KX [7],
which is an unsupervised tool for weighted textual concept
extraction from documents.
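As a rough approximation of this step, the sketch below builds a
frequency-based codebook with scikit-learn's CountVectorizer; the
actual system relies on KX [7] for weighted term extraction, so the
weighting here is only a stand-in.

    # Approximate bag-of-words feature extraction. CountVectorizer stands
    # in for the KX-based weighted term extraction used in the paper.
    from sklearn.feature_extraction.text import CountVectorizer

    def build_feature_space(training_pages, codebook_size=1000):
        vectorizer = CountVectorizer(max_features=codebook_size,
                                     stop_words="english")
        X_train = vectorizer.fit_transform(training_pages)  # one vector per webpage
        return vectorizer, X_train

    # New search results are mapped into the same space with:
    #   X_new = vectorizer.transform(new_pages)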
In the classification process, an SVM classifier is employed,
which performs classification by constructing hyperplanes
in a multidimensional space that separate cases of different
class labels. We trained the SVM model using the afore-
mentioned training set and features. Afterwards, this classi-
fier is capable of deciding whether a new website is relevant
to our environmental search. In this implementation, we
employed the LIBSVM [1] library and we consider a binary
C-Support Vector Classification using the radial basis func-
tion as kernel.
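A minimal training sketch for this step is shown below; it uses
scikit-learn's SVC, which wraps LIBSVM's C-SVC, with the radial basis
function kernel. Parameter values are illustrative, not those used in
the experiments.

    # Binary C-SVC with an RBF kernel via scikit-learn's LIBSVM wrapper.
    # y_train holds +1 for relevant environmental nodes, -1 otherwise.
    from sklearn.svm import SVC

    def train_initial_model(X_train, y_train, C=1.0, gamma="scale"):
        model = SVC(kernel="rbf", C=C, gamma=gamma)
        model.fit(X_train, y_train)
        return model

    # relevance = model.predict(X_new)            # class decision per website
    # margin    = model.decision_function(X_new)  # signed distance to the boundary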
2.2.2 Visualization of Classification Results
To support interactive classification, we provide expert
users with a tool (see Figure 2) that shows the websites’ fea-
ture vectors in relation to a visual abstraction of the classifier
to let them assess the classifier’s performance interactively.
The most prominent component (Main View - Figure 2a)
is the visual abstraction of the decision boundary as a straight
vertical white space separating the non-relevant region on
the left from the relevant region on the right. Each website
belonging to the training data and the web search results is
depicted as a dot in the appropriate region defined by the
classification result. Dots in white represent labeled training
sites, while dots in gray are the unlabeled sites. The cur-
rently selected site is always marked with a darker outline.
Each dot’s distance to the decision boundary corresponds
to the classifier’s confidence value for this site. The vertical
layout of the view is based on a principal component analysis
of the training data and attempts to place each site closer to
its most similar labeled site. In case an already labeled site,
which is not part of the support vectors, is misclassified by
the current classifier, it is marked by a red outline.

Figure 2: Desktop for interactive classification with (a) Main View depicting the websites in relation to the
decision boundary, (b) Cluster View showing website similarity, (c) Website Preview, (d) Important Terms
View with the predominant terms in the support vectors, and (e) controls for labeling.
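One way to approximate the Main View layout is sketched below: the
horizontal coordinate is the classifier's signed confidence and the
vertical coordinate a one-dimensional principal component learned from
the training data. This is only an approximation of the described
view, not the tool's actual layout algorithm.

    # Approximate Main View layout: x = signed SVM confidence (left of the
    # boundary is negative), y = 1D PCA of the training data, so similar
    # sites receive similar vertical positions. Dense feature arrays assumed.
    import numpy as np
    from sklearn.decomposition import PCA

    def main_view_layout(model, X_train_dense, X_sites_dense):
        x = model.decision_function(X_sites_dense)
        pca = PCA(n_components=1).fit(X_train_dense)
        y = pca.transform(X_sites_dense).ravel()
        return np.column_stack([x, y])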
The Cluster View (Figure 2b) considers both dimensions
to map the similarity and ignores the uncertainty. It con-
tains a clustering of the 100 most uncertainly classified web-
sites. The clusters are computed by using the bisecting k-
means algorithm [8] and a subsequent projection into 2D
using the LSP algorithm [6].
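The sketch below approximates this pipeline: it selects the 100
websites closest to the decision boundary, clusters them with
scikit-learn's BisectingKMeans in place of [8], and uses PCA as a
simple stand-in for the LSP projection of [6].

    # Approximate Cluster View: pick the least certain sites, cluster them
    # (BisectingKMeans as a stand-in for [8]), and project to 2D (PCA as a
    # stand-in for LSP [6]). Dense feature arrays assumed.
    import numpy as np
    from sklearn.cluster import BisectingKMeans
    from sklearn.decomposition import PCA

    def cluster_view(model, X_dense, n_uncertain=100, n_clusters=8):
        margin = np.abs(model.decision_function(X_dense))
        uncertain = np.argsort(margin)[:n_uncertain]   # closest to the boundary
        X_u = X_dense[uncertain]
        labels = BisectingKMeans(n_clusters=n_clusters,
                                 random_state=0).fit_predict(X_u)
        coords = PCA(n_components=2).fit_transform(X_u)
        return uncertain, labels, coords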
The Important Terms View (Figure 2d) shows the
user how the classifier “perceives” the websites. We show the
most prominent and distinguishing features of the support
vectors by summing the features of the support vectors of
both classes separately and then calculating the difference
between them.
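A compact sketch of this computation follows; it reads the class of
each support vector from the sign of the dual coefficients and assumes
dense feature vectors, so it is an approximation of the tool's
internals rather than its actual code.

    # Important Terms View: sum the support vectors of each class and take
    # the difference; positive weights pull towards the "relevant" class.
    # The class of a support vector is read from the sign of its dual
    # coefficient. Dense support vectors assumed.
    import numpy as np

    def important_terms(model, vocabulary, top_k=20):
        sv = model.support_vectors_
        positive = model.dual_coef_.ravel() > 0
        diff = sv[positive].sum(axis=0) - sv[~positive].sum(axis=0)
        order = np.argsort(diff)
        return [(vocabulary[i], float(diff[i])) for i in order[-top_k:][::-1]]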
The remaining views can be used to steer the labeling
process, review previously assigned labels, and load earlier
training iterations. All described views are linked and web-
sites that are hovered in one view are highlighted in the
others.
In order to work with the Cluster and Main views, the users
need a way to inspect in detail the actual content of the sites.
If multiple sites are selected or hovered (e.g. through the bar
chart), the Website Preview in the lower left corner (Figure 2c)
shows the titles of the websites as a list.
The aggregated content of multiple sites can be inspected
with the Term Lens (purple circle with terms to its right -
Figure 2f). When it is hovered over websites, at most ten fea-
tures are displayed, sorted according to their document fre-
quency (i.e. the number of websites under the lens that they
occur in). The font size of each term also indicates its relative
document frequency.
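The aggregation behind the Term Lens can be sketched as follows;
variable names are illustrative and the feature matrix rows are
assumed to be term counts for the hovered websites.

    # Term Lens aggregation: document frequency of each codebook term among
    # the websites under the lens, keeping the ten most frequent terms.
    import numpy as np

    def term_lens(X_lens, vocabulary, top_k=10):
        df = np.asarray((X_lens > 0).sum(axis=0)).ravel()  # document frequency
        order = np.argsort(df)[::-1][:top_k]
        return [(vocabulary[i], int(df[i])) for i in order if df[i] > 0]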
2.2.3 Interactive Labeling
Interactive classification of search results is an iterative
cycle of labeling and re-training. Websites are labeled by
expert users by selecting a set of sites in one of the views
and using the appropriate buttons in the top right window.
Newly labeled sites change their shape to triangles pointing
upwards/downwards for relevant/non-relevant labels. They
also change their color accordingly. After each labeling ac-
tion, a preview of a new model that was trained on the new
training set is provided. All unlabeled documents that would
change their class according to the preview model are high-
lighted by color (see blue dots in Figure 2).
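A possible sketch of this preview step is given below: a throwaway
model is trained on the extended training set and the unlabeled sites
whose predicted class would flip are flagged for highlighting. The SVC
stand-in and the names used here are assumptions.

    # Preview after a labeling action: retrain on the extended training set
    # and flag unlabeled sites whose predicted class would change.
    import numpy as np
    from sklearn.svm import SVC

    def preview_changes(X_train, y_train, X_unlabeled, current_pred):
        preview = SVC(kernel="rbf").fit(X_train, y_train)
        new_pred = preview.predict(X_unlabeled)
        changed = np.flatnonzero(new_pred != current_pred)  # flipping dots
        return preview, changed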
After some labeling actions, the users can choose to retrain
the main SVM model, which recomputes the layout in the
Cluster and Main views and updates the bar chart
with the new feature weights. This also creates a new node
in the history of classifiers, which allows the users to go back
to previous classifier states.
3. EXPERIMENTAL EVALUATION
We conducted an experiment with expert users (i.e. with
basic knowledge of classification tasks and environmental
aspects) to demonstrate that the visualization of classified
search results and the interactive training improve the re-
sults of the environmental search engine compared to the ini-
tial classifier model. The initial model (M0) is constructed
using a training set defined by environmental experts, which
comprised 99 positive and 200 negative website examples.
Additionally, we constructed a test data set of 2329 web-
sites with gold labels (1692 non-relevant and 637 relevant),
which resulted from 8 different queries to the general pur-
pose search engine. Due to the highly qualified target audi-
ence and the need to familiarize with the complex visualiza-
tion tool, our user study was limited to six expert users.
3.1 Setup
In order to rule out overfitting effects during performance
testing, we created a cross-validation-inspired experimental
setup. The test data was divided into 8 sets Ti, i = 1..8, de-
rived from different indicative queries (e.g. T1: weather
+ Helsinki). We then recruited 6 users to search and in-
teract with the search engine. Each participant Pj, j = 1..6,
works on an individual sequence of three consecutive test
sets (Tj, Tj+1, Tj+2). Initially, the participant is provided
with the visualized result of M0 for Tj and performs the in-
teractive training to receive a new classifier model M^j_1 (we
write M^j_i for the model of participant j after the i-th la-
beling phase). The resulting model is then used to test the
second test set Tj+1, and after another phase of interactive
labeling the model M^j_2 is generated. The same approach
is repeated until the final model M^j_3 is generated. This
progression of models can then be tested against the remain-
ing sets that were not used for interactive training.

For instance, P1 initially submits a query and retrieves the
result set T1, which is tested by model M0. Then he/she
performs the interactive labeling and the model M^1_1 is gen-
erated. The latter is tested with the second result set T2,
and this procedure goes on until the third model M^1_3 is
created. Finally, we test the remaining test sets with the
third model.

The users had 5 minutes to perform interactive labeling
for each model creation (i.e. 15 minutes in total for each
user). The experimental setup is presented in Table 1.
Test sets /
Users    T1      T2      T3      T4      T5      T6      T7      T8
P1       M0      M^1_1   M^1_2   M^1_3
P2               M0      M^2_1   M^2_2   M^2_3
P3                       M0      M^3_1   M^3_2   M^3_3
P4                               M0      M^4_1   M^4_2   M^4_3
P5                                       M0      M^5_1   M^5_2   M^5_3
P6       M^6_3                                   M0      M^6_1   M^6_2

Table 1: Experimental Setup.
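The bookkeeping behind Table 1 can be sketched as follows: for each
test set (one column of the table), the models applied to it are
evaluated against the gold labels. Data structures and names here are
illustrative only.

    # Evaluation bookkeeping for one column of Table 1: compare each model's
    # predictions on a gold-labeled test set. Names are illustrative.
    from sklearn.metrics import accuracy_score

    def evaluate_column(models, X_test, y_gold):
        # models: e.g. {"M0": m0, "M^3_1": m31, "M^2_2": m22, "M^1_3": m13}
        return {name: accuracy_score(y_gold, m.predict(X_test))
                for name, m in models.items()}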
3.2 Results
After conducting the interactive experiments, we compare
the system performance before and after the interactive
classification. In Figure 3 we present the classifier accuracy
for the 5 test sets T3 to T7. For each test set we consider
three different models created by successive users. This is
clearly represented by the columns of Table 1 that include a
full sequence of models (M0 to M^j_3). It should be noted
that the M0 results are available for all test sets. For in-
stance, the performance for T4 was tested with models M0,
M^3_1, M^2_2 and M^1_3. It can be observed that in almost
all cases the results are improved after the interactive label-
ing is performed. However, it is interesting to notice that in
the case of T7 there is no improvement, which is probably
because the additional examples provided by the users were
not very relevant to the ones included in this test set, and
therefore the classifier performance remained the same.

Figure 3: Classifier accuracy fluctuation for T3 to T7,
considering the interactive labeling.
4. CONCLUSIONS
In this work we have presented a novel environmental-
specific search engine based on interactive visual classifica-
tion. Based on the results, it seems that user interaction
during the classification phase can substantially improve the
results, since the system is optimized towards the require-
ments set by the user.
5. ACKNOWLEDGMENTS
This work is supported by the FP7 project “PESCaDO”.
6. REFERENCES
[1] C.-C. Chang and C.-J. Lin. LIBSVM: a library for
support vector machines. ACM Trans. on Intelligent
Systems and Technology, 2(3):27:1–27:27, 2011.
[2] J. Fogarty, D. Tan, A. Kapoor, and S. Winder.
CueFlik: interactive concept learning in image search.
In Proc. 26th Annual SIGCHI Conf. on Human Factors
in Computing Systems, CHI ’08, pages 29–38, New
York, NY, USA, 2008. ACM.
[3] F. Heimerl, S. Koch, H. Bosch, and T. Ertl. Visual
classifier training for text document retrieval. IEEE
Transactions on Visualization and Computer Graphics
(VisWeek 2012 Special Issue), 18(12), December 2012.
[4] K. Wu, H. Jin, R. Zheng, and Q. Zhang. A vertical
search engine based on visual and textual features. In
Proceedings of Edutainment ’10, Changchun, China,
August 16-18, 2010, pages 476–485. Springer-Verlag,
2010.
[5] H. P. Luong, S. Gauch, and Q. Wang. Ontology-based
focused crawling. In Int. Conf. on Information, Process,
and Knowledge Management, pages 123–128, 2009.
[6] F. V. Paulovich, L. G. Nonato, R. Minghim, and
H. Levkowitz. Least square projection: A fast
high-precision multidimensional projection technique
and its application to document mapping. IEEE Trans.
Vis. Comput. Graph., 14(3):564–575, 2008.
[7] E. Pianta and S. Tonelli. KX: A flexible system for
keyphrase extraction. In Proc. of SemEval 2010, 2010.
[8] M. Steinbach, G. Karypis, and V. Kumar. A
comparison of document clustering techniques. In
KDD Workshop on Text Mining, 2000.
[9] S. van den Elzen and J. van Wijk. BaobabView:
Interactive construction and analysis of decision trees.
In Proc. IEEE Conf. on Visual Analytics Science and
Technology (VAST), pages 151–160, October 2011.