
Dan Simovici - University of Massachusetts Boston
14,224 reads, 1,585 citations
Publications (195)
This paper introduces a novel type-based genetic algorithm and its applications to two well-known problems: the N-queens problem and finding the global minimum of the Rosenbrock function. The algorithm offers a new approach to the internal structure of individuals in the populations of genetic algorithms.
We discuss a metric structure on the set of partitions of a finite set induced by the Gini index and two applications of this metric: the identification of determining sets for index functions using techniques that originate in machine learning, and a data compression algorithm.
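As a rough illustration of how a Gini-based distance between partitions can be built (a sketch under stated assumptions, not necessarily the construction used in the paper): partitions are given as lists of block labels, and the distance is taken to be 2G(p ∧ q) − G(p) − G(q), where G is the Gini index and p ∧ q is the common refinement. The function names are hypothetical.

```python
from collections import Counter

def gini_index(labels):
    """Gini index of the partition induced by a list of block labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_distance(p, q):
    """Distance between two partitions of the same set, given as label lists.

    Assumed form: d(p, q) = 2*G(p ^ q) - G(p) - G(q), where p ^ q is the
    common refinement (its blocks are the nonempty pairwise intersections).
    """
    if len(p) != len(q):
        raise ValueError("partitions must label the same elements")
    meet = list(zip(p, q))  # label pairs identify blocks of the common refinement
    return 2 * gini_index(meet) - gini_index(p) - gini_index(q)

p = [0, 0, 1, 1]  # blocks {a, b}, {c, d}
q = [0, 1, 1, 1]  # blocks {a}, {b, c, d}
print(gini_distance(p, q))
```

Under this definition the distance equals the fraction of ordered element pairs that are grouped together by exactly one of the two partitions, so it is zero only when the partitions coincide.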
Understanding the nature of many diseases, including cancer, requires locating somatically acquired rearrangements corresponding to large-scale chromosomal aberrations. Computational methods to detect inter-chromosomal rearrangements based on next generation sequencing platforms face the big challenge of accurately predicting the location of sites...
We introduce a measure of ultrametricity for dissimilarity spaces and examine transformations of dissimilarities that impact this measure. Then, we study the influence of ultrametricity on the behavior of two classes of data mining algorithms (kNN classification and PAM clustering) applied on dissimilarity spaces. We show that there is an inverse v...
The dynamics of player rankings play an important role in team sports. We use Kendall’s \(\tau \) and Spearman’s \(\rho \) distances between rankings to study player scoring ranking dynamics in the NBA over the full 2014 regular season. For each team, we study the distances between sequential games, noting the differences between the two distances...
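The two ranking distances named above can be sketched as follows. This is a minimal illustration that assumes both rankings list the same items without ties, and it takes Spearman's distance to be the sum of squared position differences (one common convention; the paper may normalize differently). The player labels and game rankings are made up.

```python
from itertools import combinations

def kendall_tau_distance(r1, r2):
    """Number of item pairs that the two rankings order differently."""
    pos1 = {item: i for i, item in enumerate(r1)}
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum(
        1
        for a, b in combinations(r1, 2)
        if (pos1[a] - pos1[b]) * (pos2[a] - pos2[b]) < 0
    )

def spearman_rho_distance(r1, r2):
    """Sum of squared differences between each item's two positions."""
    pos2 = {item: i for i, item in enumerate(r2)}
    return sum((i - pos2[item]) ** 2 for i, item in enumerate(r1))

# Hypothetical scoring rankings of four players in three games.
game_1 = ["A", "B", "C", "D"]
game_2 = ["B", "A", "C", "D"]  # one adjacent swap
game_3 = ["D", "C", "B", "A"]  # full reversal

print(kendall_tau_distance(game_1, game_2), spearman_rho_distance(game_1, game_2))
print(kendall_tau_distance(game_1, game_3), spearman_rho_distance(game_1, game_3))
```

Kendall's distance counts inversions and so grows by one per adjacent swap, while the squared-difference form penalizes large position jumps more heavily.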
Evaluation of automatic text summarization is a challenging task due to the difficulty of calculating the similarity of two texts. In this paper, we define a new dissimilarity measure, compression dissimilarity, to compute the dissimilarity between documents. Then we propose a new automatic evaluation method based on compression dissimilarity. The propos...
This paper introduces the use of genetic algorithms to mine binary datasets for obtaining frequent item sets and large bite item sets, two classes of problems that are important for optimal exposure of item sets to customers and for efficient advertising campaigns. Whereas both problems can be approached in a common framework, we highlight specific...
Many datasets from real-world applications have a very high-dimensional or growing feature space. It is a new research problem to learn and maintain a classifier to deal with very high dimensionality or streaming features. In this article, we adapt the well-known emerging-pattern-based classification models and propose a semi-streaming approach. F...
This introductory chapter presents some of the main paradigms of intelligent data analysis provided by machine learning and data mining. After discussing several types of learning (supervised, unsupervised, semi-supervised, active and reinforcement learning) we examine several classes of learning algorithms (naive Bayes classifiers, decision trees,...
Data compression plays an important role in data mining, both in assessing the minability of data and as a means of evaluating similarities between complex objects. We focus on compressibility of strings of symbols and on using compression in computing similarity in text corpora; also we propose a novel approach for assessing the quality of text summariz...
Structural variations (SVs) are deletions, duplications and rearrangements of medium to large segments (>100 base pairs (bp)) of the genome. Such genomic mutations are often described as being the primary cause of many diseases, including cancer. Breakpoint detection using next-generation sequencing (NGS) platforms still remains an open problem sin...
We give single-operations characterizations for submodular and supermodular functions on lattices that have monotonicity properties. We associate to such functions metrics on lattices and we investigate corresponding metrics on the sets of partitions.
Combinatorics is the area of mathematics concerned with counting collections of mathematical objects. We begin by discussing several elementary combinatorial issues such as permutations, the power set of a finite set, the inclusion-exclusion principle, and continue with more involved combinatorial techniques that are relevant for data mining, such...
The existence of directions that are preserved by linear transformations (which are referred to as eigenvectors) was discovered by L. Euler in his study of movements of rigid bodies. This work was continued by Lagrange, Cauchy, Fourier, and Hermite. The study of eigenvectors and eigenvalues acquired increasing significance through its applicat...
The notion of norm is introduced for evaluating the magnitude of vectors and, in turn, allows the definition of certain metrics on linear spaces equipped with norms.
Clustering and classification, two central data mining activities, require the evaluation of degrees of dissimilarity between data objects.
Convex sets and functions have been studied since the nineteenth century; the twentieth century literature on convexity began with Bonnesen and Fenchel’s book [1], subsequently reprinted as [2].
Graphs model relations between elements of sets. The term “graph” is suggested by the fact that these mathematical structures can be graphically represented.
Linear spaces are among the most important and widely used mathematical structures. Linear spaces consist of elements called vectors.
Topology is an area of mathematics that investigates both the local and the global structure of space
Clustering is the process of grouping together objects that are similar. The groups formed by clustering are referred to as clusters
We introduce the notion of a partially ordered set (poset) and define several types of special elements associated with partial orders.
The current paper presents a method to deliver nonlinear projections of a data set that discriminate between existing labeled groups of data items. Inspired by traditional linear Projection Pursuit and Linear Discriminant Analysis, the new method seeks nonlinear combinations of attributes as polynomials that maximize Fisher’s criterion. The sea...
Tolerance relations are useful in soft computing in the treatment of non-disjoint clusterings, in the study of fuzzy automata, etc. After a comparative review of tolerance and equivalences of a set, we evaluate the number of tolerances situated between two equivalences, certain combinatorial aspects of the lattice of tolerances on a finite set and...
Previous research indicates that removing initial strokes from Chinese characters makes them harder to read than removing final or internal ones. In the present study, we examined the contribution of important components to character configuration via singular value decomposition. The results indicated that when the least important segments, which...
This paper introduces entropy quad-trees, which are structures derived from quad-trees by allowing nodes to split only when those correspond to sufficiently complex sub-domains of a data domain. Complexity is evaluated using an information-theoretic measure based on the analysis of the entropy associated to sets of objects designated by nodes. An a...
This paper proposes a new method to identify interesting structures in data based on the projection pursuit methodology. Past work reported in the literature uses projection pursuit methods as a means to visualize high-dimensional data, or to identify linear combinations of attributes that reveal grouping tendencies or outliers. The framework of projecti...
We apply polarities, axialities and the notion of entropy to the task of identifying marketable items and the customers that should be approached in a marketing campaign. An algorithm that computes the criteria for identifying marketable items and the corresponding experimental work are also included.
Building an accurate emerging pattern classifier with a high-dimensional dataset is a challenging issue. The problem becomes even more difficult if the whole feature space is unavailable before learning starts. This paper presents a new technique for mining emerging patterns using streaming feature selection. We model high feature dimensions with st...
In this paper we present a novel parallel coordinate based clustering method using Gaussian mixture distribution models to characterize the conformational space of proteins. We detect highly populated regions which may correspond to intermediate states that are difficult to detect experimentally. The data is represented as feature vectors of N dime...
We apply techniques that originate in the analysis of market basket data sets to the study of frequent trajectories in graphs. Trajectories are defined as simple paths through a directed graph, and we put forth some definitions and observations about the calculation of supports of paths in this context. A simple algorithm for calculating path suppo...
A genetic code, the mapping from trinucleotide codons to amino acids, can be viewed as a partition on the set of 64 codons. A small set of non-standard genetic codes is known, and these codes can be mathematically compared by their partitions of the codon set. To measure distances between set partitions, this study defines a parameterised family of...
We propose a probabilistic greedy algorithm for decomposing partially specified index generation functions. These functions have numerous applications in a variety of circuit design problems. We show that finding an optimal decomposition is an intractable problem, which motivates our approach.
Counting craters is a fundamental task of planetary science because it provides the only tool for measuring relative ages of planetary surfaces. However, advances in surveying craters present in data gathered by planetary probes have not kept up with advances in data collection. One challenge of auto-detecting craters in images is to identify an im...
We propose an approximate computation technique for inter-object distances of binary data sets. Our approach is based on locality sensitive hashing. We randomly select a number of projections of the data set and group objects into buckets based on the hash values of these projections. For each pair of objects, occurrences in the same bucket are cou...
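The bucketing step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that a "projection" is a random subset of bit positions and that the hash of an object is its restriction to those positions. The function name, parameters, and sample rows are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations

def bucket_cooccurrence(rows, num_rounds=50, bits_per_round=3, seed=0):
    """Count, for each pair of rows, how often they land in the same bucket.

    rows: list of equal-length 0/1 tuples.
    Each round samples `bits_per_round` column indices; the hash of a row
    is its restriction to those columns.
    """
    rng = random.Random(seed)
    width = len(rows[0])
    counts = defaultdict(int)
    for _ in range(num_rounds):
        cols = rng.sample(range(width), bits_per_round)
        buckets = defaultdict(list)
        for idx, row in enumerate(rows):
            buckets[tuple(row[c] for c in cols)].append(idx)
        for members in buckets.values():
            for i, j in combinations(members, 2):  # i < j by construction
                counts[(i, j)] += 1
    return counts

rows = [
    (1, 0, 1, 1, 0, 0, 1, 0),
    (1, 0, 1, 1, 0, 0, 1, 1),  # differs from row 0 in one bit
    (0, 1, 0, 0, 1, 1, 0, 1),  # bitwise complement of row 0
]
counts = bucket_cooccurrence(rows)
print(dict(counts))
```

Pairs that agree on most bits collide in many rounds, while pairs that disagree everywhere never collide, so the collision counts can serve as a cheap proxy for similarity without comparing every pair bit by bit.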
We evaluate the extent to which a dissimilarity space differs from a metric space by introducing the notion of metric point and metricity in a dissimilarity space. The effect of triangular inequality violations on medoid-based clustering of objects in a dissimilarity space is examined and the notion of rectifier is introduced to transform a dissimi...
We introduce the notion of entropy on arbitrary lattices and we study several of its properties. Explicit forms for entropy are obtained for entropies on graded lattices that satisfy certain regularity conditions. The relationships between entropies and metrics defined on lattices are also explored.
This paper addresses the clustering problem given the similarity matrix of a dataset. By representing this matrix as a weighted graph we transform this problem to a graph clustering/partitioning problem which aims at identifying groups of strongly inter-connected vertices. We define two distinct criteria with the aim of simultaneously minimizing th...
This paper introduces an algorithm for capturing high complexity regions of a data domain. In this work, we focus on domains in \({\mathbb R}^2\). In particular, we analyze 2-dimensional image domains. Two different methods for mining are considered. The first method performs an information-theoretic analysis based on entropy to find diverse areas. The...
This paper introduces entropic quadtrees, which are structures derived from quadtrees by allowing nodes to split only when nodes point to sufficiently diverse sets of objects. Diversity is evaluated using entropy attached to the histograms of the values of features for sets designated by the nodes. As an application, we used entropic quadtrees to...
This paper describes an algorithm that determines the minimal sets of variables that determine the values of a discrete partial function. The algorithm is based on the notion of entropy of a partition and is able to achieve an optimal solution. A limiting factor is introduced to restrict the search, thereby providing the option to reduce running ti...
We examine the relationship between the Cooper-Herskovits score of a Bayesian network and the conditional entropies of the nodes of the network conditioned on the probability distributions of their parents. We show that minimizing the conditional entropy of each node of the BNS conditioned on its set of parents amounts to maximization of the CH sc...
We develop an axiomatization of a class of metrics on lattices of partitions of finite sets, which leads to a new metric axiomatization of the notion of entropy in an algebraic framework. We point to the application that this type of metrics has in data mining and machine learning.
Using concepts from rough set theory we investigate the existence of approximative descriptions of collections of objects that can be extracted from a data set, a problem of interest for biologists who need to find succinct descriptions of families of taxonomic units. Our algorithm is based on an anti-monotonicity of borders of object set and mak...
We study a discovery framework in which background knowledge on variables and their relations within a discourse area is available in the form of a graphical model. Starting from an initial, hand-crafted or possibly empty graphical model, the network evolves in an interactive process of discovery. We focus on the central step of this process: giv...
We investigate a class of metrics on lattices that are compatible with the partial order defined by the lattice using the ternary relation of betweenness that can be naturally defined on a metric space. The relationships between entropy-like functions and metrics defined on lattices are studied and we show the links that exist between various prope...
We develop an efficient algorithm for detecting frequent patterns that occur in sequence databases under certain constraints. By combining the use of bit vector representations of sequence databases with association graphs we achieve superior time and low memory usage based on a considerable reduction of the number of candidate patterns.
We propose a measure for assessing the degree of influence of a set of edges of a Bayesian network on the overall fitness of the network, starting with probability distributions extracted from a data set. Standard fitness measures such as the Cooper-Herskovits score or the score based on the minimum description length are computationally expen...
In this paper, we investigate the problem of clustering XML documents based on their structure. We represent the paths in an XML document as a multiset and use the symmetric difference operation on multisets to define certain metrics. These metrics are then used to obtain a measure of similarity between any two documents in a collection. Our techni...
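A path-multiset distance of the kind described above can be sketched as follows. This is a minimal illustration under assumptions the abstract does not settle: paths are taken to be root-to-node tag paths, and the distance is the unnormalized size of the multiset symmetric difference. The helper names and sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def path_multiset(xml_text):
    """Multiset of root-to-node tag paths in an XML document."""
    paths = Counter()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths[path] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths

def symmetric_difference_size(m1, m2):
    """Size of the multiset symmetric difference: sum of |m1(x) - m2(x)|."""
    return sum(abs(m1[k] - m2[k]) for k in m1.keys() | m2.keys())

doc_a = "<lib><book><title/><author/></book><book><title/></book></lib>"
doc_b = "<lib><book><title/></book><cd><title/></cd></lib>"

m_a, m_b = path_multiset(doc_a), path_multiset(doc_b)
print(symmetric_difference_size(m_a, m_b))
```

Because only tag paths are compared, two documents with the same structure but different text content are at distance zero, which matches the goal of clustering by structure.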
We propose an approximate computation technique for inter-object distances for binary data sets. Our approach is based on locality-sensitive hashing, scales up with the number of objects, and is much faster than the "brute-force" computation of these distances.
We propose a novel algorithm for extracting the structure of a Bayesian network from a dataset. Our approach is based on generalized conditional entropies, a parametric family of entropies that extends the usual Shannon conditional entropy. Our results indicate that with an appropriate choice of a generalized conditional entropy we obtain Bay...
This chapter presents data mining techniques that make use of metrics defined on the set of partitions of finite sets. Partitions are naturally associated with object attributes, and major data mining problems such as classification, clustering, and data preparation benefit from an algebraic and geometric study of the metric space of partitions....
This chapter presents an introduction to the relational model, which is of paramount importance for data mining. We continue with certain equivalence relations (and partitions) that can be associated to sets of attributes of tables.
In mathematics, as in everyday life, one often speaks about relationships between objects and, in particular, of the idea of two objects being related or associated with each other in some way. In this chapter, we will study relations, a way of making precise the idea of an association between objects. A relation will be defined to be a set of ord...
Subsets of \({\mathbb R}^n\) may have “intrinsic” dimensions that are much lower than \(n\). Consider, for example, two distinct vectors \(\mathbf {a},\mathbf {b}\in {\mathbb R}^n\) and the line \(L = \{\mathbf {a}+ t \mathbf {b}\,\mid \,t \in {\mathbb R}\}\). Intuitively, \(L\) has the intrinsic dimensionality \(1\); however, \(L\) is embedded in...
The study of topological properties of metric spaces allows us to present an introduction to the dimension theory of these spaces, a topic that is relevant for data mining due to its role in understanding the complexity of searching in data sets that have a natural metric structure.
Lattices can be defined either as special partially ordered sets or as algebras. In this chapter, we present both definitions and show their equivalence. We study several special classes of lattices: modular and distributive lattices and complete lattices. The last part of the chapter is dedicated to Boolean algebras and Boolean functions.
Association rules have received lots of attention in data mining due to their many applications in marketing, advertising, inventory control, and many other areas.
Second edition: 2014. First edition: 2008.
The maturing of the field of data mining has brought about an increased level of mathematical sophistication. Disciplines such as topology, combinatorics, partially ordered sets and their associated algebraic structures (lattices and Boolean algebras), and metric spaces are increasingly applied in data mining research. This book presents these ma...
We propose a novel and efficient solution to the problem of clustering XML documents based on their structure. We use operations on multisets of paths of document trees to define certain metrics on multisets. These metrics are used for clustering real and synthesized XML documents to produce high-quality clusterings.
We propose a new technique for clustering of text documents that relies on a biclustering structure constructed on terms and documents. Our approach makes use of a greedy algorithm applied to bit sequences associated with each group of synonym terms. The use of bit sequences allows us to achieve superior time performance. Additionally, our algorith...
Eye movements are certainly the most natural and repetitive movement of a human being. The most mundane activity, such as watching television or reading a newspaper, involves this automatic activity which consists of shifting our gaze from one point to another.
Identification of the components of eye movements (fixations and saccades) is an essenti...
Clustering algorithms for multidimensional numerical data must overcome special difficulties due to the irregularities of data distribution. We present a clustering algorithm for numerical data that combines ideas from random projection techniques and density-based clustering. The algorithm consists of two phases: the first phase that entails the u...
Starting from an axiomatization of a generalization of Shannon entropy we introduce a set of axioms for a parametric family of distances over sets of partitions of finite sets. This family includes some well-known metrics used in data mining and in the study of finite functions.
Clustering is the process of grouping together objects that are similar. The similarity between objects is evaluated by using several types of dissimilarities (particularly, metrics and ultrametrics). After discussing partitions and dissimilarities, two basic mathematical concepts important for clustering, we focus on ultrametric spaces that play...
The identification of frequent item sets and of association rules has received a lot of attention in data mining due to its many applications in marketing, advertising, inventory control, and many other areas. First the notion of frequent item set is introduced and we study in detail the most popular algorithm for item set identification: the Ap...
We introduce the notion of ∧- and ∨-pairs of functions on lattices as an abstraction of the notions of metric and its related entropy for probability distributions. This approach allows us to highlight the relationships that exist between various properties of metrics and entropies and opens the possibility of extending these concepts to other...
We study the extraction of characteristics of user behavior in video sessions encoded as stochastic matrices of finite Markov chains. These behaviors are clustered using a dissimilarity based on the Kullback-Leibler divergence between probability distributions. The center of each cluster is regarded as the model that generates the behaviors ass...
We propose an algorithm that computes an approximation of the set of frequent item sets by using the bit sequence representation of the associations between items and transactions. The algorithm is obtained by modifying a hierarchical agglomerative clustering algorithm and takes advantage of the speed that bit operations afford. The algorit...
Multimedia information retrieval systems are becoming increasingly essential with the emergence of digital technologies. Multidimensional indexing is one of the crucial problems on which the effectiveness of these systems depends. Indeed, the description of multimedia data requires a very great number of descriptors, and consequently a high-dimensional da...
Our demo focuses on eye tracking on web, image, and video data. We use some state-of-the-art measurements, such as scan path, to determine how the user sees web documents, images and videos. Our approach is characterised by automatic eye/gaze tracking of web, image and video documents with non-intrusive sensors, mainly infrared cameras. We analyse e...
We examine a new approach to building decision trees by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users and yields decision trees that have smaller sizes and fewer...
We investigate ranges of ternary algebraic functions in Lukasiewicz-Moisil algebras, where we give a characterization of algebraic functions whose ranges are intervals and we retrieve a canonical form of functions over three-element ternary Lukasiewicz-Moisil algebras, a result due to Gr. C. Moisil, one of the founders of switching theory [Moi57]....
We examine a new approach to building decision trees by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users and yields decision trees that have smaller sizes and fewer le...
The development of logical formalisms is paralleled by the development of their algebraic counterparts and the interplay between logic and algebra often plays an inspiring role for both fields. Notable examples of this interaction are the theory of Lukasiewicz-Moisil algebras that was born as an algebraic analogue of the many-valued logics introd...
Abstract. The analysis of gene expression data in DNA fragments is an important tool in genomic research, whose main objectives range from the study of the functional character of specific genes and their participation in biological processes to the reconstruction of disease conditions and their progno...
We present a clustering algorithm for numerical data that consists of two phases: the first phase that entails the use of random projections to detect clusters, and the second phase that consists of certain post-processing techniques of clusters obtained by several random projections. Experimental results show the potential of our algorithm for ima...
We study an algorithm for feature selection that clusters attributes using a special metric and then makes use of the dendrogram of the resulting cluster hierarchy to choose the most relevant attributes. The main interest of our technique resides in the improved understanding of the structure of the analyzed data and of the relative importance of t...
We present an abstract axiomatization of generalized entropy using the notion of ordinal number and the new concept of systemic set of equivalence relations. The axiomatization applies to arbitrary sets and extends previous results obtained for the finite case.