Boris G Mirkin
National Research University Higher School of Economics | HSE · Department of Data Analysis and Artificial Intelligence
Professor of Computer Science
About
245
Publications
42,490
Reads
6,819
Citations
Introduction
Clustering and interpretation of data and texts.
Additional affiliations
Education
December 1964 - December 1966
September 1959 - December 1964
Publications (245)
The gradient descent has proven to be an effective optimization strategy. The current research proposes a novel clustering methodology using this strategy to recover communities in feature-rich networks. Our adoption of this strategy did not lead to promising results, and thus to improve them, we propose a special “refinement” mechanism, which cull...
Three-Stage Cluster Modeling for the Spatiotemporal Analysis of Coastal Upwelling
In contrast to conventional wisdom that Pearson’s chi-squared at a contingency table is a criterion of statistical independence, rather than a measure of association, this paper establishes an operational meaning of the Pearson’s chi-squared as a measure of association. Its normalised version, phi-squared, is the average change of the probability o...
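As a small illustration of the quantities named in the abstract above, the sketch below computes Pearson's chi-squared and its normalised version, phi-squared (chi-squared divided by the total count N), for a contingency table of counts. It uses only the standard textbook definitions; the function name `pearson_chi2_phi2` and the example table are hypothetical, and the sketch does not attempt to reproduce the paper's interpretation of phi-squared. It assumes every row and column total is positive.

```python
def pearson_chi2_phi2(table):
    """Return (chi2, phi2) for a contingency table given as a list of rows of counts.

    Assumes all row and column totals are positive (hypothetical helper).
    """
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under statistical independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Phi-squared is chi-squared normalised by the total number of observations.
    return chi2, chi2 / n


chi2, phi2 = pearson_chi2_phi2([[30, 10], [10, 30]])
print(chi2, phi2)
```

For a table whose cells are exactly proportional to the products of its margins, both values are zero.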
This paper gives an experimentally supported review and comparison of several indices based on the conventional K-means inertia criterion for determining the number of clusters, K, in datasets, using the popular Silhouette width index as a benchmark. Our experiments involve a novel version of the Elbow index, defined using values of K two or t...
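The Silhouette width used as the benchmark above can be sketched in a few lines. This is the standard definition with Euclidean distance (for each point, a is the mean distance to the other members of its cluster, b is the smallest mean distance to another cluster, and the width is (b - a) / max(a, b), taken as 0 for singletons), not the experimental setup of the paper. The function name `silhouette_width` and the toy data are hypothetical; the sketch assumes at least two clusters and at least two distinct points.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def silhouette_width(points, labels):
    """Average silhouette width of a labelled set of points (hypothetical helper)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a width of 0
        # Mean distance to the other members of the same cluster (dist(p, p) == 0).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(silhouette_width(points, [0, 0, 1, 1]))  # well-separated partition
print(silhouette_width(points, [0, 1, 0, 1]))  # mismatched partition
```

A partition that follows the two visible groups scores close to 1, while one that mixes them scores below 0.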
A comprehensive approach is presented to analyse season's coastal upwelling represented by weekly sea surface temperature (SST) image grids. Our three‐stage data recovery clustering method assumes that the season's upwelling can be divided into shorter periods of stability, ranges, each to be represented by a constant core and variable shell parts....
This work proposes a spatiotemporal clustering approach for the analysis of coastal upwelling from Sea Surface Temperature (SST) grid maps derived from satellite images. The algorithm, Core-Shell clustering, models the upwelling as an evolving cluster whose core points are constant during a certain time window while the shell points move through an...
The problem of community detection in a network with features at its nodes takes into account both the graph structure and node features. The goal is to find relatively dense groups of interconnected entities sharing some features in common. There have been several approaches proposed for that. We apply the so-called data recovery approach to the p...
This paper proposes a meaningful and effective extension of the celebrated K-means algorithm to detect communities in feature-rich networks, due to our assumption of non-summability mode. We least-squares approximate given matrices of inter-node links and feature values, leading to a straightforward extension of the conventional K-means clustering...
We propose an extension of the celebrated K-means algorithm for community detection in feature-rich networks. Our least-squares criterion leads to a straightforward extension of the conventional batch K-means clustering method as an alternating optimization strategy for the criterion. By replacing the innate squared Euclidean distance with cosine d...
We define a most specific generalization of a fuzzy set of topics assigned to leaves of the rooted tree of a domain taxonomy. This generalization lifts the set to its “head subject” node in the higher ranks of the taxonomy tree. The head subject is supposed to “tightly” cover the query set, possibly involving some errors referred to as “gaps” and “...
The main result of this paper is an extension of the K-means algorithm to the issue of community detection in feature-rich networks. This is based on a data-recovery criterion additively combining conventional least-squares criteria for approximation of the network link data and the feature data at network nodes. The dimension of the space at which...
Insufficient audience size is a very common problem in targeted digital advertising. Current approaches to audience extension frequently lead to much diminishing quality metrics. This is the case, for example, for so-called look-alike techniques. We present a novel method for efficient extension of target audiences. Our base is a popular taxonomy o...
A feature-rich network is a network whose nodes are characterized by categorical or quantitative features. We propose a data-driven model for finding a partition of the nodes to approximate both the network link data and the feature data. The model involves summary quantitative characteristics of both network links and features. We distinguish betw...
We explore a doubly-greedy approach to the issue of community detection in feature-rich networks. According to this approach, both the network and feature data are straightforwardly recovered from the underlying unknown non-overlapping communities, supplied with a center in the feature space and intensity weight(s) over the network each. Our least-...
GOT is a Python3 software toolkit for taxonomic content analysis of text collections. The structure of the toolkit follows an in-house methodology for processing a collection of texts using a domain taxonomy. The method includes the following steps: (1) computing matrix of relevance between texts and taxonomy leaf topics using a purely structural s...
The problem of community detection in a network with features at its nodes takes into account both the graph structure and node features. The goal is to find relatively dense groups of interconnected entities sharing some features in common. Algorithms based on probabilistic community models require the node features to be categorical. We use a dat...
The problem of community detection in a network with features at its nodes takes into account both the graph structure and node features. The goal is to find relatively dense groups of interconnected entities sharing some features in common. We apply the so-called data recovery approach to the problem by combining the least-squares recovery criteri...
We define and find a most specific generalization of a fuzzy set of topics assigned to leaves of the rooted tree of a taxonomy. This generalization lifts the set to a “head subject” in the higher ranks of the taxonomy, that is supposed to “tightly” cover the query set, possibly bringing in some errors, both “gaps” and “offshoots”. Our hybrid method...
Companies’ objectives extend beyond mere profitability, to what is generally known as Corporate Social Responsibility (CSR). Empirical research effort of CSR is typically concentrated on a limited number of aspects. We focus on the whole set of CSR activities to identify any structure to that set. In this analysis, we take data from 1850 of the lar...
This paper presents a relatively rare case of an optimization problem in data analysis to admit a globally optimal solution by a recursive algorithm. We are concerned with finding a most specific generalization of a fuzzy set of topics assigned to leaves of domain taxonomy represented by a rooted tree. The idea is to “lift” the set to its “head sub...
We propose a novel method for efficient target audience augmentation in programmatic digital advertising. This method utilizes a novel ParGenFS algorithm for most adequate generalization in taxonomies which was developed by the authors in a joint work. The ParGenFS extends user segments by parsimoniously lifting them off-line as a fuzzy set over IA...
This paper proposes a novel method, referred to as ParGenFS, for finding a most specific generalization of a query set represented by a fuzzy set of topics assigned to leaves of the rooted tree of a taxonomy. The query set is generalized by "lifting" it to one or more "head subjects" in the higher ranks of the taxonomy. The head subjects should cov...
We define a most specific generalization of a fuzzy set of topics assigned to leaves of the rooted tree of a domain taxonomy. This generalization lifts the set to its “head subject” node in the higher ranks of the taxonomy tree. The head subject is supposed to “tightly” cover the query set, possibly bringing in some errors referred to as “gaps” and...
We give a mathematical treatment to the concept of ordinal equivalence defined relative to all m! possible permutations of parallel axes. We prove that the ordinal equivalence is determined by the pair-wise co-monotonicity equivalence relations, thus leading to simple algorithmic procedures for finding the corresponding partition. Each ordinal equi...
This paper presents an algorithm, ParGenFS, for generalizing, or "lifting", a fuzzy set of topics to higher ranks of a hierarchical taxonomy of a research domain. The algorithm ParGenFS finds a globally optimal generalization of the topic set to minimize a penalty function, by balancing the number of introduced "head subjects" and related errors, t...
We define and find a most specific generalization of a fuzzy set of topics assigned to leaves of the rooted tree of a taxonomy. This generalization lifts the set to a “head subject” in the higher ranks of the taxonomy, that is supposed to “tightly” cover the query set, possibly bringing in some errors, both “gaps” and “offshoots”. The method global...
Empirical research effort over Corporate Social Responsibility (CSR) is typically concentrated on a limited number of aspects. We focus on the whole set of CSR activities to find out if there is a structure in those. We take data on the four major dimensions of CSR: environment, social & stakeholder, labor, and governance, from the MSCI database. T...
Ranking is an important part of several areas of contemporary research, including social sciences, decision theory, data analysis, and information retrieval. The goal of this paper is to align developments in quantitative social sciences and decision theory with the current thought in Computer Science, including a few novel results. Specifically, w...
This Chapter is about dividing a dataset or its subset in two parts. If both parts are to be clusters, this is referred to as divisive clustering. If just one part is to be a cluster, this will be referred to as separative clustering. Iterative application of divisive clustering builds a binary hierarchy of which we will be interested at a partitio...
The goals of core data analysis as a tool helping to enhance and augment knowledge of the domain are outlined. Since knowledge is represented by the concepts and statements of relation between them, two main pathways for data analysis are summarization, for developing and augmenting concepts, and correlation, for enhancing and establishing relation...
K-means is arguably the most popular cluster-analysis method. The method’s output is twofold: (1) a partition of the entity set into clusters, and (2) centers representing the clusters. The method is rather intuitive and usually requires just a few pages to get presented. In contrast, this text includes a number of less popular subjects that are mu...
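The twofold output described above, a partition plus one center per cluster, can be seen in a minimal sketch of conventional batch K-means with squared Euclidean distance. This is the generic textbook procedure, not code from the text; the function name `kmeans`, the choice to keep an old center when a cluster becomes empty, and the toy data are assumptions made for illustration. Initial centers are supplied by the caller.

```python
def kmeans(points, centers, max_iter=100):
    """Batch K-means: alternate nearest-center assignment and center update.

    Returns (labels, centers). Hypothetical helper; initial centers are given.
    """
    centers = [tuple(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the center with the smallest
        # squared Euclidean distance.
        new_labels = [
            min(
                range(len(centers)),
                key=lambda k: sum((x - c) ** 2 for x, c in zip(p, centers[k])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # the partition is stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its cluster.
        for k in range(len(centers)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # assumption: an empty cluster keeps its old center
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


labels, centers = kmeans(
    [(0, 0), (0, 1), (10, 10), (10, 11)],
    centers=[(0, 0), (10, 10)],
)
print(labels)   # partition of the entity set
print(centers)  # centers representing the clusters
```

The result depends on the initial centers, which is one reason initialisation methods receive so much attention in the clustering literature.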
linear regression and correlation coefficient for two quantitative variables (Sect. 3.2);
Before going to the thick of the multivariate summarization, this chapter first considers the concept of feature and its summarizations into histograms, density functions and centers. Two perspectives are defined, the probabilistic and vector-space ones, for defining concepts of feature centers and spreads. Also, current views on the types of measu...
Finding an appropriate generalization for a fuzzy thematic set in taxonomy / Frolov Dmitry, Mirkin Boris, Nascimento Susana, Fenner Trevor [Text] : Working paper WP7/2018. This paper proposes a novel method, referred to as ParGenFS, for finding a most specific generalization of a query set, represented by a fuzzy set of topics assigned to leaves of...
We downloaded a collection of 17685 research papers, together with their abstracts, published over the 20 years 1998-2017 in 17 journals that we consider related to Data Science. We take the abstracts of these papers as a representative collection.
We take that part of the ACM-CCS 2012 taxonomy, which is related to Data Science, and add a few leaves...
Real-world data sets often contain mislabelled entities. This can be particularly problematic if the data set is being used by a supervised classification algorithm at its learning phase. In this case the accuracy of this classification algorithm, when applied to unlabelled data, is likely to suffer considerably. In this paper we introduce a cluste...
This text examines the goals of data analysis with respect to enhancing knowledge, and identifies data summarization and correlation analysis as the core issues. Data summarization, both quantitative and categorical, is treated within the encoder-decoder paradigm bringing forward a number of mathematically supported insights into the methods and re...
In this paper, I discuss current developments in cluster analysis to bring forth earlier developments by E. Braverman and his team. Specifically, I begin by recalling their Spectrum clustering method and Matrix diagonalization criterion. These two include a number of user-specified parameters such as the number of clusters and similarity threshold, wh...
The ideal type model by Mirkin and Satarov (1990) expresses data points as convex combinations of some 'ideal type' points. However, this model cannot prevent the ideal type points being far away from the observations and, in fact, requires that. Archetypal analysis by Cutler and Breiman (1994) and proportional membership fuzzy clustering by Nascim...
The intelligent Minkowski and weighted Minkowski K-means are recently developed effective clustering algorithms capable of computing feature weights. Their cluster-specific weights follow the intuitive idea that a feature with a low dispersion in a specific cluster should have a greater weight in this cluster than a feature with a high dispersion....
We present a new method for cluster analysis that finds a composite “supercluster” consisting of two non-overlapping parts: a tight core and a less connected shell. We expand this approach to data that changes over time by assuming that the core is unchangeable, while the shell depends on the time period. We define a data recovery approximation mod...
The concept of anomalous clustering applies to finding individual clusters on a digital geography map supplied with a single feature such as brightness or temperature. An algorithm derived within the individual anomalous cluster framework extends the so-called region growing algorithms. Yet our approach differs in that the algorithm parameter value...
In this paper we make two novel contributions to hierarchical clustering. First, we introduce an anomalous pattern initialisation method for hierarchical clustering algorithms, called A-Ward, capable of substantially reducing the time they take to converge. This method generates an initial partition with a sufficiently large number of clusters. Thi...
The appeal of metric evaluation of research impact has attracted considerable interest in recent times. Although the public at large and administrative bodies are much interested in the idea, scientists and other researchers are much more cautious, insisting that metrics are but an auxiliary instrument to the qualitative peer-based judgement. The g...
Often considered more of an art than a science, books on clustering have been dominated by learning through example with techniques chosen almost through trial and error. Even the two most popular, and most related, clustering methods-K-Means for partitioning and Ward's method for hierarchical clustering-have lacked the theoretical underpinning req...
Research effort has recently focused on designing feature weighting clustering algorithms. These algorithms automatically calculate the weight of each feature, representing their degree of relevance, in a data set. However, since most of these evaluate one feature at a time they may have difficulties to cluster data sets containing features with si...
The paper presents a least squares framework for divisive clustering. Two popular divisive clustering methods, Bisecting K-Means and Principal Direction Division, appear to be versions of the same least squares approach. The PDD recently has been enhanced with a stopping criterion taking into account the minima of the corresponding one-dimensional...
In this paper a novel clustering algorithm is proposed as a version of the seeded region growing (SRG) approach for the automatic recognition of coastal upwelling from sea surface temperature (SST) images.
The new algorithm, one seed expanding cluster (SEC), takes advantage of the concept of approximate clustering due to Mirkin, 1996 and Mirkin, 2...
This paper presents several definitions of “optimal patterns” in triadic data and results of experimental comparison of five triclustering algorithms on real-world and synthetic datasets. The evaluation is carried over such criteria as resource efficiency, noise tolerance and quality scores involving cardinality, density, coverage, and diversity of...
Three different approaches for evaluation of the research impact by a scientist are considered. Two of them are conventional ones, scoring the impact over (a) citation metrics and (b) merit metrics. The third one relates to the level of results. It involves a taxonomy of the research field, that is, a hierarchy representing its composition. The imp...
A two-step approach to taxonomy construction is presented. On the first step the frame of taxonomy is built manually according to some representative educational materials. On the second step, the frame is refined using the Wikipedia category tree and articles. Since the structure of Wikipedia is rather noisy, a procedure to clear the Wikipedia cat...
This paper develops an approach to the problem of multicriteria ranking referred to as multicriteria stratification. The target of stratification is an ordered partition with predefined number of classes rather than a complete ranking of the set of objects. We formulate the problem of multicriteria stratification as a task of minimization of a cost f...
This review provides a historical journey to the roots of bi-clustering and concludes as follows: "Yet in the engineering applications the criteria have clear operational meaning, so that their minimization should be pursued rigorously. In this aspect, the book under review is a very good source (including a set of solved examples, pp.179–194) conti...
Research effort has recently focused on designing feature weighting clustering algorithms. These algorithms automatically calculate the weight of each feature, representing their degree of relevance, in a data set. However, since most of these evaluate one feature at a time they may have difficulties to cluster data sets containing features with s...
Recently, a three-stage version of K-Means has been introduced, at which not only clusters and their centers, but also feature weights are adjusted to minimize the summary p-th power of the Minkowski p-distance between entities and centroids of their clusters. The value of the Minkowski exponent p appears to be instrumental in the ability of the me...
We develop a consensus clustering framework proposed three decades ago in Russia and experimentally demonstrate that our least squares consensus clustering algorithm consistently outperforms several recent consensus clustering methods.
A suffix-tree-based method for measuring similarity of a key phrase to an unstructured text is proposed. The measure involves less computation and it does not depend on the length of the text or the key phrase. This applies to:
1. finding interrelations between key phrases over a set of texts;
2. annotating a research article by topics from a taxon...
A method for conceptual maps construction is presented and applied to Business domains. A conceptual map is a graph, where nodes stand for domain specific concepts and edges connect associated concepts. The conceptual map reveals and visualises logical associations between concepts, which exist in the collection of texts, used to construct the conc...
Recent clustering algorithms have been designed to take into account the degree of relevance of each feature, by automatically calculating their weights. However, as the tendency is to evaluate each feature at a time, these algorithms may have difficulties dealing with features containing similar information. Should this information be relevant, th...
A least-squares data approximation approach to finding individual clusters is advocated. A simple local optimization algorithm leads to suboptimal clusters satisfying some natural tightness criteria. Three versions of an iterative extraction approach are considered, leading to a portrayal of the cluster structure of the data. Of these, probably mos...
This paper presents a further investigation into computational properties of a novel fuzzy additive spectral clustering method, Fuzzy Additive Spectral clustering (FADDIS), recently introduced by authors. Specifically, we extend our analysis to ‘difficult’ data structures from the recent literature and develop two synthetic data generators simulati...
This paper presents a clustering algorithm, namely MFWK-Means, which is a novel extension of K-Means clustering to the case of fuzzy clusters and weighted features. First, the Weighted K-Means criterion utilizing Minkowski metric is adopted to solve the problem of feature selection for high dimensional data. Then, a further extension to the case of...
We develop a consensus clustering framework developed three decades ago in Russia and experimentally demonstrate that our least squares consensus clustering algorithm consistently outperforms several recent consensus clustering methods.
An outline of a few methods in an emerging field of data analysis, "data interpretation", is given as pertaining to medical informatics and being parts of a general interpretation issue. Specifically, the following subjects are covered: Measuring correlation between categories, conceptual clustering, and generalization and interpretation of empiric...
There exists much prejudice against the within-cluster summary similarity criterion which supposedly leads to collecting all the entities in one cluster. This is not so if the similarity matrix is preprocessed by subtraction of "noise", of which two ways, the uniform and modularity, are analyzed in the chapter. Another criterion under consideration...
This paper represents another step in overcoming a drawback of K-Means, its lack of defense against noisy features, using feature weights in the criterion. The Weighted K-Means method by Huang et al. (2008, 2004, 2005) [5–7] is extended to the corresponding Minkowski metric for measuring distances. Under Minkowski metric the feature...
In this paper we describe a new method for EEG signal classification in which the classification of one subject’s EEG signals is based on features learnt from another subject. This method applies to the power spectrum density data and assigns class-dependent information weights to individual features. The informative features appear to be rather si...
An additive spectral method for fuzzy clustering is proposed. The method operates on a clustering model which is an extension of the spectral decomposition of a square matrix. The computation proceeds by extracting clusters one by one, which makes the spectral approach quite natural. The iterative extraction of clusters, also, allows us to draw sev...
We describe a novel method for the analysis of research activities of an organization by mapping that to a taxonomy tree of the field. The method constructs fuzzy membership profiles of the organization members or teams in terms of the taxonomy's leaves (research topics), and then it generalizes them in two steps. These steps are: (i) fuzzy cluste...