An Environmental Search Engine Based on Interactive
Visual Classification
Stefanos Vrochidis1, Harald Bosch2, Anastasia Moumtzidou1,
Florian Heimerl2, Thomas Ertl2, Ioannis Kompatsiaris1
1Information Technologies Institute
6th Klm Charilaou-Thermi Road
Thessaloniki, GR-5700, Greece
{stefanos, moumtzid, ikom}@iti.gr
2University of Stuttgart
Universitätstraße 38
70569 Stuttgart, Germany
firstname.surname@vis.uni-stuttgart.de
ABSTRACT
Environmental conditions play a very important role in human life. Nowadays, environmental data and measurements are freely made available through dedicated web sites, services and portals. This work deals with the problem of discovering such web resources by proposing an interactive domain-specific search engine, which is built on top of a general purpose search engine, employing supervised machine learning and advanced interactive visualization techniques. Our experiments and the evaluation show that interactive classification based on visualization improves the performance of the system.
Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous;
D.2.8 [Software Engineering]: Metrics—complexity measures, performance measures
General Terms
Algorithms, Performance, Experimentation
Keywords
Environmental, search engine, visualization, classification,
interaction, domain-specific search
1. INTRODUCTION
Environmental information plays a very important role
in human life, since it is strongly related to health issues
(e.g. cardiovascular diseases), as well as in everyday life
outdoor activities.Environmental measurements gathered by
dedicated stations, are usually made available through web
portals and services (mentioned in this work as environmen-
tal nodes) in multimedia formats including text, images and
tables. The discovery of environmental nodes is very impor-
important in order to offer direct access to a variety of resources providing complementary environmental information.
This paper addresses the discovery of environmental nodes, which is considered a domain-specific search problem. Towards the solution of this problem, we propose a novel interactive, relevance-feedback-based domain-specific search engine, which extends the current state of the art by introducing an interactive visualization layer on top of domain-specific search techniques. The proposed search engine is developed to support expert users in retrieving websites that provide environmental measurements. As experts, we consider users who have a basic understanding of the classification algorithm and basic knowledge of environmental aspects (i.e. weather, pollen, air quality).
The main methodologies used for the implementation of a domain-specific search engine can be divided into two categories: a) post-processing the results of existing web search engines and b) directed crawling. Our implementation falls into the first category. In a related work [5], the authors employ ontology-based focused crawling to automatically enrich an ontology with domain-specific web resources. In a more recent work [4], the proposed vertical search engine combines visual and textual features with a view to retrieving images of a certain category. Our work differs from the aforementioned approaches by introducing an interactive classification layer for post-processing the initial results of a search engine using visualization techniques.
Supervised classifier learning methods have been shown to be very effective for document classification. The visual component of the proposed environmental search engine is inspired by the fully interactive classifier training tool of our previous work [3]. It was adapted to website data and modified to work with radial SVMs. Similar to our method, there is additional work that attempts to give humans control during the retrieval process. Interactive decision tree construction systems are the most commonly described variant of visual interactive machine learning (e.g. [9]). In another relevant work, Fogarty et al. [2] develop an image retrieval system that helps users create classifiers for certain concepts.
The contribution of this work is a novel environmental-specific search engine, which extends current approaches to domain-specific search by offering advanced interaction options and visualization-based relevance feedback functionalities.
This paper is structured as follows: Section 2 presents the framework of the environmental search engine, including the web search techniques and the interactive classification. Then, Section 3 describes the experimental setup, Section 4 presents the results and the evaluation, and finally Section 5 concludes the paper.

Figure 1: The framework of the interactive search engine.
2. ENVIRONMENTAL SEARCH ENGINE
The framework of the interactive environmental-specific search engine is illustrated in Figure 1 and consists of two main parts with which the user can interact: the web search and the interactive classification.
The web search component includes the query submission to a general purpose search engine. The list of search results already includes many relevant websites, but also many irrelevant ones. We therefore aim at improving the precision of the initial results by including a filtering step, which is based on SVM classification. To this end, we extract textual features from the initial results and subsequently train an initial classifier to create a model that can discriminate relevant from irrelevant websites. To further improve the classifier performance we apply an active-learning-inspired technique: we visualize the state of the classifier on a 2D plane and enable users to iteratively correct the labels of misclassified websites or to confirm uncertain ones. The SVM model is then re-trained on the enhanced training set, which also includes the user annotations, in order to provide more relevant results in the next round.
2.1 Web Search
The web search is realized using the Yahoo! BOSS API (http://developer.yahoo.com/search/boss/) as a general purpose search engine. The query formulation is based on a semi-automatic technique, in which the users select only the environmental aspect they are interested in (i.e. weather, air quality, pollen) and the geographical area of interest. To generate the environmental keywords, we constructed one set of domain-specific terms for each environmental aspect based on empirical knowledge. For the geographical area selection we employ a map-based interface, where the users can select the geographic area of interest by drawing a circle.
Finally, the query is produced by combining environmental terms with geographical locations. An example of such a query is: “weather+Helsinki”.
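For illustration, the following minimal Python sketch shows how such queries could be assembled; the term lists and the function name are illustrative assumptions, not the actual term sets used by the engine:

ASPECT_TERMS = {                        # assumed example term sets, one per aspect
    "weather": ["weather", "weather forecast"],
    "air quality": ["air quality", "air pollution"],
    "pollen": ["pollen", "pollen forecast"],
}

def build_queries(aspect, locations):
    # Cross each domain-specific term with each location inside the circle.
    return ["%s+%s" % (term, loc)
            for term in ASPECT_TERMS[aspect]
            for loc in locations]

# build_queries("weather", ["Helsinki"]) -> ["weather+Helsinki", "weather forecast+Helsinki"]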
2.2 Interactive Classification
In the following, we present the automatic classification based on SVMs and describe how its results can be improved further in an interactive post-processing step.
2.2.1 Support Vector Machine Classification
For the application of a supervised machine learning method, labeled training data have to be created and appropriate features extracted for training the classifier. The initial training set is constructed manually by labeling relevant and non-relevant websites. Given the fact that the general purpose search engine will provide results related to the environmental area, we select as negative samples both random websites and portals discussing generic environmental issues.
After observing the environmental websites, it appears that the most interesting information is encoded in textual format; therefore feature extraction is based on the bag-of-words model with textual features. Specifically, we extract a codebook of the most frequent terms from the training set and then create a feature vector that represents each webpage in the hyperspace. The term extraction is done with KX [7], which is an unsupervised tool for weighted textual concept extraction from documents.
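A minimal sketch of this step is given below, with scikit-learn's CountVectorizer standing in for KX; it therefore builds the codebook from plain term frequencies instead of weighted keyphrases, and the example pages are placeholders:

from sklearn.feature_extraction.text import CountVectorizer

train_pages = [
    "hourly weather observations and temperature for Helsinki",  # relevant (placeholder)
    "general news portal about environmental policy debates",    # non-relevant (placeholder)
]

vectorizer = CountVectorizer(max_features=1000)  # codebook of the most frequent terms
X_train = vectorizer.fit_transform(train_pages)  # one feature vector per webpage

# New search results are mapped into the same hyperspace:
X_new = vectorizer.transform(["pollen forecast for southern Finland"])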
In the classification process, an SVM classifier is employed, which performs classification by constructing hyperplanes in a multidimensional space that separate cases of different class labels. We trained the SVM model using the aforementioned training set and features. Afterwards, this classifier is capable of deciding whether a new website is relevant to our environmental search. In this implementation, we employed the LIBSVM [1] library and used binary C-Support Vector Classification with the radial basis function as kernel.
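The following sketch illustrates this classification step using scikit-learn's SVC, which is built on LIBSVM; the feature vectors, labels and parameter values are placeholders, not our actual settings:

from sklearn.svm import SVC   # scikit-learn's SVC wraps LIBSVM
import numpy as np

X_train = np.random.rand(40, 200)              # placeholder bag-of-words vectors
y_train = np.array([1] * 20 + [0] * 20)        # 1 = relevant node, 0 = non-relevant

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # binary C-SVC with RBF kernel
clf.fit(X_train, y_train)

X_new = np.random.rand(5, 200)                 # feature vectors of new search results
keep = clf.predict(X_new) == 1                 # filtering step: keep predicted-relevant sites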
2.2.2 Visualization of Classification Results
To support interactive classification, we provide expert users with a tool (see Figure 2) that shows the websites' feature vectors in relation to a visual abstraction of the classifier, letting them assess the classifier's performance interactively.

Figure 2: Desktop for interactive classification with (a) Main View depicting the websites in relation to the decision boundary, (b) Cluster View showing website similarity, (c) Website Preview, (d) Important Terms View with the predominant terms in the support vectors, and (e) controls for labeling.

The most prominent component (Main View - Figure 2a) is the visual abstraction of the decision boundary as a straight vertical white space separating the non-relevant region on the left from the relevant region on the right. Each website belonging to the training data or the web search results is depicted as a dot in the appropriate region defined by the classification result. Dots in white represent labeled training sites, while dots in gray are unlabeled sites. The currently selected site is always marked with a darker outline. Each dot's distance to the decision boundary corresponds to the classifier's confidence value for this site. The vertical layout of the view is based on a principal component analysis of the training data and attempts to place each site closer to its most similar labeled site. In case an already labeled site, which is not part of the support vectors, is misclassified by the current classifier, it is marked by a red outline.
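A sketch of how such a layout could be computed is shown below; it assumes the vertical placement is approximated by a one-component principal component analysis fitted on the training data, whereas the actual view refines this by pulling each site towards its most similar labeled site:

from sklearn.decomposition import PCA
from sklearn.svm import SVC
import numpy as np

X_train = np.random.rand(30, 50)        # placeholder training feature vectors
y_train = np.array([1] * 15 + [0] * 15)

clf = SVC(kernel="rbf").fit(X_train, y_train)
pca = PCA(n_components=1).fit(X_train)  # fitted on the training data only

def main_view_layout(X):
    x = clf.decision_function(X)        # signed distance: left of 0 = non-relevant region
    y = pca.transform(X)[:, 0]          # similarity-preserving vertical position
    return np.column_stack([x, y])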
The Cluster View (Figure 2b) considers both dimensions to map the similarity and ignores the uncertainty. It contains a clustering of the 100 most uncertainly classified websites. The clusters are computed by using the bisecting k-means algorithm [8] and a subsequent projection into 2D using the LSP algorithm [6].
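The following sketch outlines this pipeline under two substitutions: scikit-learn's BisectingKMeans (available from version 1.1) stands in for our bisecting k-means implementation, and MDS replaces the LSP projection, for which no standard library implementation exists; dense feature matrices are assumed:

import numpy as np
from sklearn.cluster import BisectingKMeans
from sklearn.manifold import MDS

def cluster_view(X_unlabeled, decision_values, n_uncertain=100, n_clusters=8):
    # The most uncertain sites are the ones closest to the decision boundary.
    idx = np.argsort(np.abs(decision_values))[:n_uncertain]
    X_u = X_unlabeled[idx]
    labels = BisectingKMeans(n_clusters=n_clusters, random_state=0).fit_predict(X_u)
    coords = MDS(n_components=2, random_state=0).fit_transform(X_u)  # LSP stand-in
    return idx, labels, coords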
The Important Terms View (Figure 2d) shows the user how the classifier “perceives” the websites. We show the most prominent and distinguishing features of the support vectors by summing the features of the support vectors of both classes separately and then calculating the difference between the two sums.
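A sketch of this computation, assuming a fitted scikit-learn SVC clf, the training matrix and labels, and the codebook term list (all names are illustrative):

import numpy as np

def important_terms(clf, X_train, y_train, terms, top_k=15):
    sv_idx = clf.support_                      # indices of the support vectors
    sv, sv_y = X_train[sv_idx], y_train[sv_idx]
    pos = sv[sv_y == 1].sum(axis=0)            # summed features, relevant class
    neg = sv[sv_y == 0].sum(axis=0)            # summed features, non-relevant class
    diff = np.asarray(pos - neg).ravel()       # per-term difference between the sums
    order = np.argsort(diff)[::-1]             # most "relevant-leaning" terms first
    return [(terms[i], float(diff[i])) for i in order[:top_k]]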
The remaining views can be used to steer the labeling process, review previously assigned labels, and load earlier training iterations. All described views are linked, and websites that are hovered in one view are highlighted in the others.
In order to work with the Cluster and Main views, the users need a way to inspect the actual content of the sites in detail. If multiple sites are selected or hovered (e.g. through the bar chart), the Website Preview in the lower left corner (Figure 2c) shows the titles of the websites as a list.
The aggregated content of multiple sites can be inspected with the Term Lens (purple circle with terms to its right - Figure 2f). When it is hovered over websites, at most ten features are displayed, sorted according to their document frequency (i.e. the number of websites under the lens that they occur in). The font size of each term also indicates its relative document frequency.
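A sketch of this aggregation, assuming a document-term count matrix X_hovered for the sites currently under the lens (the matrix and function names are illustrative):

import numpy as np

def term_lens(X_hovered, terms, top_k=10):
    # Document frequency: in how many of the hovered sites each term occurs.
    df = np.asarray((X_hovered > 0).sum(axis=0)).ravel()
    order = np.argsort(df)[::-1][:top_k]
    return [(terms[i], int(df[i])) for i in order if df[i] > 0]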
2.2.3 Interactive Labeling
Interactive classification of search results is an iterative cycle of labeling and re-training. Websites are labeled by expert users by selecting a set of sites in one of the views and using the appropriate buttons in the top right window. Newly labeled sites change their shape to triangles pointing upwards/downwards for relevant/non-relevant labels. They also change their color accordingly. After each labeling action, a preview of a new model that was trained on the new training set is provided. All unlabeled documents that would change their class according to the preview model are highlighted by color (see blue dots in Figure 2).
After some labeling actions, the users can choose to retrain the main SVM model, which recomputes the layout in the Cluster as well as the Main view and updates the bar chart with the new feature weights. This also creates a new node in the history of classifiers, which allows the users to go back to previous classifier states.
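The overall cycle can be summarized by the following sketch, under the assumption that user labels simply extend the training set before each retraining; the class and method names are illustrative, not part of our implementation:

import numpy as np
from sklearn.svm import SVC

class InteractiveTrainer:
    def __init__(self, X, y):
        self.X, self.y = X, y
        self.history = []                  # earlier classifier states, for going back

    def label(self, x_new, is_relevant):
        # A user label (correction or confirmation) extends the training set.
        self.X = np.vstack([self.X, x_new])
        self.y = np.append(self.y, 1 if is_relevant else 0)

    def retrain(self):
        clf = SVC(kernel="rbf").fit(self.X, self.y)
        self.history.append(clf)           # new node in the history of classifiers
        return clf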
3. EXPERIMENTAL EVALUATION
We conducted an experiment with expert users (i.e. users with basic knowledge of classification tasks and environmental aspects) to demonstrate that the visualization of classified search results and the interactive training improve the results of the environmental search engine compared to the initial classifier model. The initial model (M0) is constructed using a training set defined by environmental experts, which comprised 99 positive and 200 negative website examples. Additionally, we constructed a test data set of 2329 websites with gold labels (1692 non-relevant and 637 relevant), which resulted from 8 different queries to the general purpose search engine. Due to the highly qualified target audience and the need to become familiar with the complex visualization tool, our user study was limited to six expert users.
3.1 Setup
In order to rule out overfitting effects during performance testing, we created a cross-validation-inspired experiment setup. The test data was divided into 8 sets T1 to T8, derived from different indicative queries (e.g. T1: weather + Helsinki). We then recruited 6 users to search and interact with the search engine. Each participant Pj (j = 1..6) works on an individual sequence of three consecutive test sets (Tj, Tj+1, Tj+2). Initially, the participant is provided with the visualized result of M0 for Tj and performs the interactive training to receive a new classifier model M^j_1. The resulting model is then used to test the second test set Tj+1, and after another phase of interactive labeling the model M^j_2 is generated. The same approach is repeated until the final model M^j_3 is generated. This progression of models can then be tested against the remaining sets that were not used for interactive training.
For instance, P1 initially submits a query and retrieves the result set T1, which is tested by model M0. Then he/she performs the interactive labeling and the model M^1_1 is generated. The latter is tested with the second result set T2, and this procedure goes on until the third model M^1_3 is created. Finally, we test the remaining test sets with the third model.
The users had 5 minutes to perform interactive labeling for each model creation (i.e. 15 minutes in total for each user). The experimental setup is presented in Table 1.
Users   T1      T2      T3      T4      T5      T6      T7      T8
P1      M0      M^1_1   M^1_2   M^1_3   -       -       -       -
P2      -       M0      M^2_1   M^2_2   M^2_3   -       -       -
P3      -       -       M0      M^3_1   M^3_2   M^3_3   -       -
P4      -       -       -       M0      M^4_1   M^4_2   M^4_3   -
P5      -       -       -       -       M0      M^5_1   M^5_2   M^5_3
P6      M^6_3   -       -       -       -       M0      M^6_1   M^6_2

Table 1: Experimental Setup (model tested on each test set).
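The rotation behind Table 1 can be expressed compactly; the sketch below is illustrative and reproduces the assignment of models to test sets, including the wrap-around for P6:

def rotation_plan(n_sets=8, n_users=6, steps=4):
    # Participant P_j starts on T_j with M0; each retrained model M^j_k is
    # applied to the following set, wrapping around after T8.
    plan = {}
    for j in range(1, n_users + 1):
        for k in range(steps):
            t = (j - 1 + k) % n_sets + 1
            plan[(j, t)] = "M0" if k == 0 else "M^%d_%d" % (j, k)
    return plan

# rotation_plan()[(6, 1)] -> 'M^6_3', matching the wrap-around entry of Table 1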
3.2 Results
After the interactive experiments had taken place, we compared the system performance before and after the interactive classification by the users. In Figure 3 we present the classifier accuracy for the 5 test sets T3 to T7. For each data set we consider three different models created by successive users. This is clearly represented by the columns of Table 1 that include a full sequence of models (M0 to M^j_3). It should be noted that the M0 results are available for all test sets. For instance, the performance for T4 was tested with models M0, M^3_1, M^2_2 and M^1_3. It can be observed that in almost all cases the results are improved after the interactive labeling is performed. However, it is interesting to notice that in the case of T7 there is no improvement, which is probably due to the fact that the additional examples provided by the users were not very relevant to the ones included in this data set, and therefore the classifier performance remained the same.

Figure 3: Classifier accuracy fluctuation for T3 to T7 considering the interactive labeling.
4. CONCLUSIONS
In this work we have presented a novel environmental-specific search engine based on interactive visual classification. Based on the results, it seems that user interaction during the classification phase can substantially improve the retrieval results, since the system is optimized towards the requirements set by the user.
5. ACKNOWLEDGMENTS
This work is supported by the FP7 project “PESCaDO”.
6. REFERENCES
[1] C.-C. Chang and C.-J. Lin. LIBSVM: a library for
support vector machines. ACM Trans. on Intelligent
Systems and Technology, 2(3):27:1–27:27, 2011.
[2] J. Fogarty, D. Tan, A. Kapoor, and S. Winder.
CueFlik: interactive concept learning in image search.
In Proc. 26th Annual SIGCHI Conf. on Human Factors
in Computing Systems, CHI ’08, pages 29–38, New
York, NY, USA, 2008. ACM.
[3] F. Heimerl, S. Koch, H. Bosch, and T. Ertl. Visual
classifier training for text document retrieval. IEEE
Transactions on Visualization and Computer Graphics
(VisWeek 2012 Special Issue), 18(12), December 2012.
[4] K. Wu, H. Jin, R. Z., and Q. Zhang. A vertical search
engine based on visual and textual features. In
Proceedings of Edutainment 2010, Changchun, China,
August 16-18, 2010, pages 476–485. Springer-Verlag, 2010.
[5] H. P. Luong, S. Gauch, and Q. Wang. Ontology-based
focused crawling. In Int. Conf. on Information, Process,
and Knowledge Management, pages 123–128, 2009.
[6] F. V. Paulovich, L. G. Nonato, R. Minghim, and
H. Levkowitz. Least square projection: A fast
high-precision multidimensional projection technique
and its application to document mapping. IEEE Trans.
Vis. Comput. Graph., 14(3):564–575, 2008.
[7] E. Pianta and S. Tonelli. KX: A flexible system for
keyphrase extraction. In Proc. of SemEval 2010, 2010.
[8] M. Steinbach, G. Karypis, and V. Kumar. A
comparison of document clustering techniques. In
KDD Workshop on Text Mining, 2000.
[9] S. van den Elzen and J. van Wijk. BaobabView:
Interactive construction and analysis of decision trees.
In Proc. IEEE Conf. on Visual Analytics Science and
Technology (VAST), 2011, pages 151–160, October 2011.