Nenad Tomašev
  • Ph.D.
  • Jožef Stefan Institute

About

31 Publications
25,484 Reads
645 Citations
Introduction
I am working on my PhD in machine learning, focusing on overcoming the effects of the curse of dimensionality in k-nearest neighbor methods: the imbalanced distribution of influence, the emergence of hubs, and skewed inference. I have been developing new algorithms for clustering, classification, metric learning, ranking, and anomaly detection.
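The hubness phenomenon mentioned above can be illustrated with a minimal sketch on synthetic Gaussian data (a toy example using only NumPy; the data, sample sizes, and brute-force neighbor search are illustrative assumptions, not any specific published method). Hubness is commonly measured as the skewness of the k-occurrence distribution N_k, which tends to grow with intrinsic dimensionality:

```python
import numpy as np

def k_occurrence_skewness(X, k=5):
    """Skewness of N_k: how often each point appears among the
    k nearest neighbors of the other points."""
    n = X.shape[0]
    # pairwise squared Euclidean distances via the Gram-matrix trick
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(d2, np.inf)            # exclude self-neighbors
    knn = np.argsort(d2, axis=1)[:, :k]     # k nearest neighbors per point
    n_k = np.bincount(knn.ravel(), minlength=n)  # k-occurrence counts
    m, s = n_k.mean(), n_k.std()
    return ((n_k - m) ** 3).mean() / s ** 3      # sample skewness

rng = np.random.default_rng(0)
low = k_occurrence_skewness(rng.standard_normal((500, 3)))
high = k_occurrence_skewness(rng.standard_normal((500, 100)))
# The 100-dimensional sample typically shows markedly higher skewness,
# signalling the emergence of hub points.
```

The uneven N_k distribution in the high-dimensional case is exactly the "imbalanced distribution of influence" that hubness-aware methods try to account for.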
Current institution
Jožef Stefan Institute
Additional affiliations
October 2008 - present
Institut ''Jožef Stefan''
Description
  • Development of novel, robust, machine learning methods for high-dimensional data analysis.
October 2003 - July 2008
University of Novi Sad
Description
  • Artificial Life, Stochastic Optimization, Data Mining
Education
October 2008 - September 2013
Jožef Stefan International Postgraduate School
Field of study
  • Computer Science / Machine Learning
October 2003 - July 2008
Department of Mathematics and Computer Science, Novi Sad
Field of study
  • Computer Science

Publications

Publications (31)
Article
Full-text available
High-dimensional data arise naturally in many domains, and have regularly presented a great challenge for traditional data-mining techniques, both in terms of effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data, as well as the increasing difficulty in distinguishing distances between data points. I...
Article
Learning from high-dimensional data is usually quite challenging, as captured by the well-known phrase curse of dimensionality. Data analysis often involves measuring the similarity between different examples. This sometimes becomes a problem, as many widely used metrics tend to concentrate in high-dimensional feature spaces. The reduced contrast m...
Article
Full-text available
In modern information societies, there are information systems that track and log parts of the ongoing political discourse. Due to the sheer volume of the accumulated data, automated tools are required in order to enable citizens to better interpret political statements and promises, as well as evaluate their truthfulness. We propose an approach to...
Chapter
Full-text available
Clustering evaluation plays an important role in unsupervised learning systems, as it is often necessary to automatically quantify the quality of generated cluster configurations. This is especially useful for comparing the performance of different clustering algorithms as well as determining the optimal number of clusters in clustering algorithms...
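Automatic quantification of cluster configuration quality, as described above, can be sketched with a standard internal index. This toy example uses the silhouette index to choose the number of clusters (scikit-learn is assumed to be available; the data, the index, and the cluster-count range are illustrative choices, not the chapter's specific method):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# three well-separated Gaussian blobs in the plane
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 3.0, 6.0)])

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher = better-defined clusters

best_k = max(scores, key=scores.get)  # expected to recover the 3 blobs
```

Comparing index values across candidate configurations is also how different clustering algorithms can be ranked without ground-truth labels.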
Chapter
Full-text available
Hubness has recently been established as a significant property of k-nearest neighbor (k-NN) graphs obtained from high-dimensional data using a distance measure, with traits and effects relevant to the cluster structure of data, as well as clustering algorithms. The hubness property is manifested with increasing (intrinsic) data dimensionality. The...
Article
Learning with label noise is an important issue in classification, since it is not always possible to obtain reliable data labels. In this paper we explore and evaluate a new approach to learning with label noise in intrinsically high-dimensional data, based on using neighbor occurrence models for hubness-aware k-nearest neighbor classification. Hu...
Article
Full-text available
Data reduction is a common pre-processing step for k-nearest neighbor classification (kNN). The existing prototype selection methods implement different criteria for selecting relevant points to use in classification, which constitutes a selection bias. This study examines the nature of the instance selection bias in intrinsically high-dimensional...
Article
We present a novel tool for image data visualization and analysis, Image Hub Explorer. It is aimed at developers and researchers alike and it allows the users to examine various aspects of content-based image retrieval and object recognition under different built-in metrics and models. Image Hub Explorer provides the tools for understanding the dis...
Article
Most machine learning tasks involve learning from high-dimensional data, which is often quite difficult to handle. Hubness is an aspect of the curse of dimensionality that was shown to be highly detrimental to k-nearest neighbor methods in high-dimensional feature spaces. Hubs, very frequent nearest neighbors, emerge as centers of influence within...
Conference Paper
The emergence of hubs in k-nearest neighbor (kNN) topologies of intrinsically high dimensional data has recently been shown to be quite detrimental to many standard machine learning tasks, including classification. Robust hubness-aware learning methods are required in order to overcome the impact of the highly uneven distribution of influence. In t...
Conference Paper
Full-text available
Object recognition is an essential task in content-based image retrieval and classification. This paper deals with object recognition in WIKImage data, a collection of publicly available annotated Wikipedia images. WIKImage comprises a set of 14 binary classification problems with significant class imbalance. Our approach is based on using the loca...
Conference Paper
Full-text available
Bug duplicate detection is an integral part of many bug tracking systems. Most bugs are reported multiple times and detecting the duplicates saves time and valuable resources. We propose a novel approach to potential duplicate report query ranking. Our secondary re-ranking procedure is self-adaptive, as it learns from previous report occurrences. I...
Conference Paper
Information retrieval in multi-lingual document repositories is of high importance in modern text mining applications. Analyzing textual data is, however, not without associated difficulties. Regardless of the particular choice of feature representation, textual data is high-dimensional in its nature and all inference is bound to be somewhat affect...
Article
Full-text available
Most data of interest today in data-mining applications is complex and is usually represented by many different features. Such high-dimensional data is by its very nature often quite difficult to handle by conventional machine-learning algorithms. This is considered to be an aspect of the well known curse of dimensionality. Consequently, high-dimen...
Article
Full-text available
Hubness is a recently described aspect of the curse of dimensionality inherent to nearest-neighbor methods. This paper describes a new approach for exploiting the hubness phenomenon in k-nearest neighbor classification. We argue that some of the neighbor occurrences carry more information than others, by the virtue of being less frequent events. Th...
Conference Paper
Full-text available
Learning from high-dimensional data is usually quite a challenging task, as captured by the well-known phrase curse of dimensionality. Most distance-based methods become impaired due to the distance concentration of many widely used metrics in high-dimensional spaces. One recently proposed approach suggests that using secondary distances based on th...
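A generic shared-neighbor secondary distance of the kind referred to above can be sketched as follows (a minimal NumPy formulation on synthetic data; this is one common SNN construction, not necessarily the paper's exact definition). Two points are considered close when their primary k-nearest-neighbor sets overlap heavily:

```python
import numpy as np

def snn_distance(X, k=10):
    """Secondary distance: k minus the number of shared k-nearest
    neighbors under the primary (Euclidean) metric."""
    n = X.shape[0]
    sq = (X ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T  # primary distances
    np.fill_diagonal(d2, np.inf)                    # exclude self-neighbors
    knn = np.argsort(d2, axis=1)[:, :k]
    # membership matrix: M[i, j] = 1 iff j is among the kNNs of i
    M = np.zeros((n, n))
    M[np.arange(n)[:, None], knn] = 1.0
    overlap = M @ M.T            # shared-neighbor counts, in [0, k]
    return k - overlap           # 0 = identical kNN sets, k = disjoint

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
D = snn_distance(X)              # symmetric secondary distance matrix
```

Because the secondary distance depends only on neighbor-set overlap rather than raw metric values, it is less affected by the distance concentration described in the abstract.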
Conference Paper
Full-text available
Hubness is a recently described aspect of the curse of dimensionality inherent to nearest-neighbor methods. In this paper we present a new approach for exploiting the hubness phenomenon in k-nearest neighbor classification. We argue that some of the neighbor occurrences carry more information than others, by the virtue of being less frequent even...
Conference Paper
Full-text available
Most machine-learning tasks, including classification, involve dealing with high-dimensional data. It was recently shown that the phenomenon of hubness, inherent to high-dimensional data, can be exploited to improve methods based on nearest neighbors (NNs). Hubness refers to the emergence of points (hubs) that appear among the k NNs of many other p...
Conference Paper
Full-text available
This paper presents work towards the creation of free and redistributable datasets of correlated images and text. Collections of free images and related text were extracted from Wikipedia with our new tool WIKImage. An additional tool – WIKImage browser – was introduced to visualize the resulting dataset, and was expanded into a manual labeling too...
Conference Paper
Full-text available
High-dimensional data are by their very nature often difficult to handle by conventional machine-learning algorithms, which is usually characterized as an aspect of the curse of dimensionality. However, it was shown that some of the arising high-dimensional phenomena can be exploited to increase algorithm accuracy. One such phenomenon is hubness, w...
Conference Paper
Full-text available
Object recognition from images is one of the essential problems in automatic image processing. In this paper we focus specifically on nearest neighbor methods, which are widely used in many practical applications, not necessarily related to image data. It has recently come to attention that high dimensional data also exhibit high hubness, which ess...
Conference Paper
Full-text available
OntoGen is a semi-automatic and data-driven ontology editor focusing on editing of topic ontologies. It utilizes text mining tools to make ontology-related tasks simpler for the user. Its restriction to building ontologies from textual data is the gap we are trying to bridge. We have successfully extended OntoGen to work with image data and allow for on...
Article
Full-text available
This paper presents an approach applying social network analysis on collaborative edit log data. Semantic Web Wiki and FAO ontologies are given as case studies. A number of users that are editing the same ontology or the same pages can be viewed as a social network of people interacting via the ontology. We propose to represent the edit log files a...
Conference Paper
Full-text available
This paper presents experiments with applying social network analysis on data about editing of semantic media wiki. As a number of users are editing the same wiki pages, one can view them as a social network of people interacting via wiki pages. We propose representation of the wiki editing log files as a graph either of users that are connected if...
Conference Paper
Full-text available
CoreWar is a computer simulation devised in the 1980s where programs loaded into a virtual memory array compete for control over the virtual machine. These programs are written in a special-purpose assembly language called Redcode and referred to as warriors. A great variety of environments and battle strategies have emerged over the years, leadi...
Conference Paper
Full-text available
CoreWar is a computer simulation where two programs written in an assembly language called Redcode compete in a virtual memory array. These programs are referred to as warriors. Over more than twenty years of development a number of different battle strategies have emerged, making it possible to identify different warrior types. Systems for automati...
Conference Paper
Full-text available
High-dimensional data arise naturally in many domains, and have regularly presented a great challenge for traditional data-mining techniques, both in terms of effectiveness and efficiency. Clustering becomes difficult due to the increasing sparsity of such data, as well as the increasing difficulty in distinguishing distances between data points. I...
Article
Full-text available
In this paper, an optimizer for programs written in an assembly-like language called Redcode is presented. Relevance of code optimization in evolutionary program creation strategies and code categorization is discussed. CoreWar Optimizer is the first user-friendly optimization tool for CoreWar programs offering various optimization methods and a ca...
Article
This paper explores the ways to represent images as bags of SIFT feature clusters. SIFT features themselves are widely used in image analysis because of their properties of scale and rotation invariance. The usual way to group them is to segment the image into regions and then assign features to the corresponding image parts. When images themselves...
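A bag-of-feature-clusters image representation like the one explored above can be sketched as follows (random arrays stand in for real SIFT descriptors, the codebook size is arbitrary, and scikit-learn is assumed; this illustrates the general quantization scheme, not the paper's specific grouping strategy):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# pretend each image yields a variable number of 128-d SIFT-like descriptors
images = [rng.random((rng.integers(40, 80), 128)) for _ in range(10)]

# learn a codebook of 16 visual words over all descriptors pooled together
codebook = KMeans(n_clusters=16, n_init=4, random_state=0)
codebook.fit(np.vstack(images))

def bag_of_words(descriptors):
    """Quantize descriptors against the codebook and return an
    L1-normalized codeword histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=16).astype(float)
    return hist / hist.sum()

vectors = np.array([bag_of_words(d) for d in images])  # one row per image
```

The resulting fixed-length histograms make variable-size descriptor sets comparable, which is what allows standard classifiers and retrieval metrics to operate on images.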

Questions

Question (1)
For crisp-labeled data there are many online repositories (UCI being one such example). I have recently been working on some approaches for fuzzy data classification, yet I lack the data to run the experiments on.
I could, of course, "fuzzify" the labels of crisp-labeled datasets artificially in some way, but such results would probably not be considered very relevant, as the fuzziness would be merely artificially introduced.
I am sure that there are some datasets with inherent class uncertainty out there, but I was unable to locate the proper resources.
Do you know where such benchmark data can be found?

Network

Cited By