Marek Gagolewski
Warsaw University of Technology · Faculty of Mathematics and Information Science
Professor
About
115
Publications
40,890
Reads
1,159
Citations
Introduction
I do not maintain my RG profile actively. Please refer to my homepage at https://www.gagolewski.com for more up-to-date details.
Additional affiliations
April 2024 - present
Systems Research Institute of the Polish Academy of Sciences
Position
- Professor (Associate)
September 2019 - March 2024
February 2012 - March 2018
Education
October 2017 - October 2017
June 2008 - December 2011
September 2004 - June 2008
Publications (115)
The time needed to apply a hierarchical clustering algorithm is most often dominated by the number of computations of a pairwise dissimilarity measure. Such a constraint, for larger data sets, puts at a disadvantage the use of all the classical linkage criteria but the single linkage one. However, it is known that the single linkage clustering algo...
Internal cluster validity measures (such as the Caliński–Harabasz, Dunn, or Davies–Bouldin indices) are frequently used for selecting the appropriate number of partitions a dataset should be split into. In this paper we consider what happens if we treat such indices as objective functions in unsupervised learning activities. Is the optimal grouping...
Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a high-level overview as well as discussing key concepts in detail. We explore methods for cleaning data gathered from different sources, transforming, selecting, and extracting features, performing exploratory data analysis and dimension...
Deep R Programming is a comprehensive and in-depth introductory course on one of the most popular languages for data science. It equips ambitious students, professionals, and researchers with the knowledge and skills to become independent users of this potent environment so that they can tackle any problem related to data wrangling and analytics, n...
Effective processing of character strings is required at various stages of data analysis pipelines: from data cleansing and preparation, through information extraction, to report generation. Pattern searching, string collation and sorting, normalization, transliteration, and formatting are ubiquitous in text mining, natural language processing, and...
Inequality is an inherent part of our lives: we see it in the distribution of incomes, talents, citations, to name a few. However, its intensity varies across environments: there are systems where the available resources are relatively evenly distributed but also where a small group of items or agents controls the majority of assets. Numerous indic...
Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they are meaningful in low-dimensional partitional data clustering tasks. By identifying the upper bounds for the agreement between th...
There is no, nor will there ever be, single best clustering algorithm. Nevertheless, we would still like to be able to distinguish between methods that work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures q...
There is no, nor will there ever be, single best clustering algorithm, but we would still like to be able to distinguish between methods which work well on certain task types and those that systematically underperform. Clustering algorithms are traditionally evaluated using either internal or external validity measures. Internal measures quantify d...
We propose a new generalization of the classical Sugeno integral motivated by the Hirsch, Woeginger, and other geometrically-inspired indices of scientific impact.
The new integral adapts to the rank-size curve better as it allows for putting more emphasis on highly-valued items and/or the tail of the distribution (level measure).
We study its fund...
We study an iterative discrete information production process (IPP) where we can extend ordered normalised vectors by new elements based on a simple affine transformation, while preserving the predefined level of inequality, G, as measured by the Gini index. Then, we derive the family of Lorenz curves of the corresponding vectors and prove that it...
Inequality is an inherent part of our lives: we see it in the distribution of incomes, talents, resources, and citations, amongst many others. Its intensity varies across different environments: from relatively evenly distributed ones, to where a small group of stakeholders controls the majority of the available resources. We would like to understa...
Community detection is a critical challenge in the analysis of real-world graphs and complex networks, including social, transportation, citation, cybersecurity networks, and food webs. Motivated by many similarities between community detection and clustering in Euclidean spaces, we propose three algorithm frameworks to apply hierarchical clusterin...
Minimum spanning trees (MSTs) provide a convenient representation of datasets in numerous pattern recognition activities. Moreover, they are relatively fast to compute. In this paper, we quantify the extent to which they can be meaningful in data clustering tasks. By identifying the upper bounds for the agreement between the best (oracle) algorithm...
Agglomerative hierarchical clustering based on Ordered Weighted Averaging (OWA) operators not only generalises the single, complete, and average linkages, but also includes intercluster distances based on a few nearest or farthest neighbours, trimmed and winsorised means of pairwise point similarities, amongst many others. We explore the relationsh...
This paper aims to find the reasons why some citation models can predict a set of specific bibliometric indices extremely well. We show why fitting a model that preserves the total sum of a vector can be beneficial in the case of heavy-tailed data that are frequently observed in informetrics and similar disciplines. Based on this observation, we in...
The evaluation of clustering algorithms can involve running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, the fact that there can be many equally valid ways to c...
We study an agent-based model for generating citation distributions in complex networks of scientific papers, where a fraction of citations is allotted according to the preferential attachment rule (rich get richer) and the remainder is allocated accidentally (purely at random, uniformly). Previously, we derived and analysed such a process in the c...
The evaluation of clustering algorithms can be performed by running them on a variety of benchmark problems, and comparing their outputs to the reference, ground-truth groupings provided by experts. Unfortunately, many research papers and graduate theses consider only a small number of datasets. Also, rarely the fact that there can be many equally...
We analyse the usefulness of Jain’s fairness measure and the related Prathap’s bibliometric z-index as proxies when estimating the parameters of the 3DSI (three dimensions of scientific impact) model.
Internal cluster validity measures (such as the Calinski-Harabasz, Dunn, or Davies-Bouldin indices) are frequently used for selecting the appropriate number of partitions a dataset should be split into. In this paper we consider what happens if we treat such indices as objective functions in unsupervised learning activities. Is the optimal grouping...
A proper fusion of complex data is of interest to many researchers in diverse fields, including computational statistics, computational geometry, bioinformatics, machine learning, pattern recognition, quality management, engineering, statistics, finance, economics, etc. It plays a crucial role in: synthetic description of data processes or whole do...
We consider a version of D. Price’s model for the growth of a bibliographic network, where in each iteration, a constant number of citations is randomly allocated according to a weighted combination of the accidental (uniformly distributed) and the preferential (rich-get-richer) rule. Instead of relying on the typical master equation approach,...
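The allocation rule described in this abstract can be illustrated with a small simulation. This is only a rough sketch under assumptions that are not taken from the paper: the function name `simulate_citations`, the parameters `m` (citations per iteration) and `rho` (share of preferential citations), the rule that one paper arrives per iteration, and the fallback to a uniform draw while no paper has been cited yet are all hypothetical choices made for illustration.

```python
import numpy as np


def simulate_citations(n_papers, m, rho, seed=None):
    """Toy simulation of a mixed accidental/preferential citation process.

    Assumptions (illustrative only, not the published model's exact setup):
    one paper arrives per iteration, then m citations are handed out among
    the papers that exist so far. Each citation is preferential with
    probability rho (target chosen proportionally to current citation
    counts) and accidental otherwise (target chosen uniformly at random).
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must lie in [0, 1]")
    rng = np.random.default_rng(seed)
    counts = np.zeros(n_papers, dtype=np.int64)
    for t in range(1, n_papers + 1):       # t papers exist in this iteration
        for _ in range(m):
            total = counts[:t].sum()
            if total > 0 and rng.random() < rho:
                # preferential rule: rich get richer
                target = rng.choice(t, p=counts[:t] / total)
            else:
                # accidental rule (also used while nobody has been cited yet)
                target = rng.integers(t)
            counts[target] += 1
    return counts


if __name__ == "__main__":
    c = simulate_citations(n_papers=1000, m=3, rho=0.7, seed=42)
    # Sorting in nonincreasing order gives the usual rank-size view.
    print(np.sort(c)[::-1][:10])
```

Setting `rho` close to 1 should concentrate citations on a few early papers, whereas `rho = 0` reduces the process to purely uniform allocation.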
The discrete Choquet integral with respect to various types of fuzzy measures serves as an important aggregation function which accounts for mutual dependencies between the inputs. The Choquet integral can be used as an objective (or constraint) in optimisation problems, and the type of fuzzy measure used determines its complexity. This paper exami...
There are many approaches to the modelling of citation vectors of individual authors. Models may serve different purposes, but usually they are evaluated with regards to how well they align to citation distributions in large networks of papers. Here we compare a few leading models in terms of their ability to correctly reproduce the values of selec...
Question‐and‐answer (Q&A) sites improve access to information and ease transfer of knowledge. In recent years, they have grown in popularity and importance, enabling research on behavioral patterns of their users. We study the dynamics related to the casting of 7 M votes across a sample of 700 k posts on Stack Overflow, a large community of profess...
We demonstrate that by using a triple of simple numerical summaries: an author’s productivity, their overall impact, and a single other bibliometric index that aims to capture the shape of the citation distribution, we can reconstruct other popular metrics of bibliometric impact with a sufficient degree of precision. We thus conclude that the use o...
We consider a version of D. J. Price's model for the growth of a bibliographic network, where in each iteration a constant number of citations is randomly allocated according to a weighted combination of accidental (uniformly distributed) and preferential (rich-get-richer) rules. Instead of relying on the typical master equation approach, we form...
Making correct decisions as to whether code chunks should be considered similar becomes increasingly important in software design and education and not only can improve the quality of computer programs, but also help assure the integrity of student assessments. In this paper we test numerous source code similarity detection tools on pairs of code f...
genieclust is an open source Python and R package that implements the hierarchical clustering algorithm called Genie. This method frequently outperforms other state-of-the-art approaches in terms of clustering quality and speed, supports various distances over dense, sparse, and string data domains, and can be robustified even further with the buil...
In a recent paper published in this very journal, the “Bounded fuzzy possibilistic method” (BFPM) was proposed. We point out that there are some critical flaws in the said algorithm, which make the results presented therein highly questionable. In particular, the method does not generate meaningful cluster membership degrees and fails to converge...
Compositional data naturally appear in many fields of application. For instance, in chemistry, the relative contributions of different chemical substances to a product are typically described in terms of a compositional data vector. Although the aggregation of compositional data frequently arises in practice, the functions formalizing this process...
The use of the Choquet integral in data fusion processes allows for the effective modelling of interactions and dependencies between data features or criteria. Its application requires identification of the defining capacity (also known as fuzzy measure) values. The main limiting factor is the complexity of the underlying parameter learning problem...
We introduce several new sports team rating models based on the gradient descent algorithm. More precisely, the models can be formulated by maximising the likelihood of match results observed using a single step of this optimisation heuristic. The proposed framework is inspired by the prominent Elo rating system, and yields an iterative version of...
In the paper "An inherent difficulty in the aggregation of multidimensional data" recently accepted for publication in this very journal, the property of orthomonotonicity is introduced. This property is proved to be weaker than monotonicity and orthogonal equivariance together and to reduce the family of idempotent functions to the family of weigh...
The growing popularity of bibliometric indexes (whose most famous example is the h index by J. E. Hirsch [J. E. Hirsch, Proc. Natl. Acad. Sci. U.S.A. 102, 16569–16572 (2005)]) is opposed by those claiming that one’s scientific impact cannot be reduced to a single number. Some even believe that our complex reality fails to submit to any quantitative...
We investigate the application of the Ordered Weighted Averaging (OWA) data fusion operator in agglomerative hierarchical clustering. The examined setting generalises the well-known single, complete, and average linkage schemes. It allows expert knowledge to be embodied in the cluster merge process and provides a much wider range of possible linkages....
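To make the claim about generalising the classical linkages concrete, here is a minimal sketch of an OWA-based intercluster dissimilarity. The helper names (`owa`, `owa_linkage`, `w_single`, `w_complete`, `w_average`) and the use of Euclidean distances are assumptions made for illustration; they are not taken from the paper or from any particular package.

```python
import numpy as np


def owa(values, weights):
    """Ordered weighted average: weights apply to values sorted nonincreasingly."""
    v = np.sort(np.asarray(values, dtype=float))[::-1]
    w = np.asarray(weights, dtype=float)
    if v.shape != w.shape:
        raise ValueError("values and weights must have the same length")
    if np.any(w < 0) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must be nonnegative and sum to 1")
    return float(np.dot(w, v))


def owa_linkage(A, B, weight_fn):
    """OWA of all pairwise Euclidean distances between clusters A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2).ravel()
    return owa(d, weight_fn(d.size))


def w_single(n):
    """All weight on the smallest distance: single linkage."""
    w = np.zeros(n)
    w[-1] = 1.0
    return w


def w_complete(n):
    """All weight on the largest distance: complete linkage."""
    w = np.zeros(n)
    w[0] = 1.0
    return w


def w_average(n):
    """Equal weights: average linkage."""
    return np.full(n, 1.0 / n)


if __name__ == "__main__":
    A = [[0.0, 0.0], [0.0, 1.0]]
    B = [[3.0, 0.0], [4.0, 0.0]]
    for name, fn in [("single", w_single), ("complete", w_complete),
                     ("average", w_average)]:
        print(name, owa_linkage(A, B, fn))
```

Other weight vectors, for example ones that spread the mass over a few of the smallest distances, yield linkages that lie between these three classical cases.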
Defined solely by means of order-theoretic operations meet (min) and join (max), weighted lattice polynomial functions are particularly useful for modelling data on an ordinal scale. A special case, the discrete Sugeno integral, defined with respect to a nonadditive measure (a capacity), enables accounting for the interdependencies between input va...
The Sugeno integral is an expressive aggregation function with potential applications across a range of decision contexts. Its calculation requires only the lattice minimum and maximum operations, making it particularly suited to ordinal data and robust to scale transformations. However, for practical use in data analysis and prediction, we require...
The problem of the piecewise linear approximation of fuzzy numbers giving outputs nearest to the inputs with respect to the Euclidean metric is discussed. The results given in Coroianu et al. (Fuzzy Sets Syst 233:26–51, 2013) for the 1-knot fuzzy numbers are generalized for arbitrary n-knot (\(n\ge 2\)) piecewise linear fuzzy numbers. Some results...
The Sugeno integral is a function particularly suited to the aggregation of ordinal inputs. Defined with respect to a fuzzy measure, its ability to account for complementary and redundant relationships between variables brings much potential to the field of biomedicine, where it is common for measurements and patient information to be expressed qua...
The constrained ordered weighted averaging (OWA) aggregation problem arises when we aim to maximize or minimize a convex combination of order statistics under linear inequality constraints that act on the variables with respect to their original sources. The standalone approach to optimizing the OWA under constraints is to consider all permutations...
This book collects the contributions presented at AGOP 2019, the 10th International Summer School on Aggregation Operators, which took place in Olomouc (Czech Republic) in July 2019. It includes contributions on topics ranging from the theory and foundations of aggregation functions to their various applications.
Aggregation functions have numerou...
The problem of penalty-based data aggregation in generic real normed vector spaces is studied. Some existence and uniqueness results are indicated. Moreover, various properties of the aggregation functions are considered.
The property of monotonicity, which requires a function to preserve a given order, has been considered the standard in the aggregation of real numbers for decades. In this paper, we argue that, for the case of multidimensional data, an order-based definition of monotonicity is far too restrictive. We propose several meaningful alternatives to this...
There is a mutual resemblance between the behavior of users of the Stack Exchange and the dynamics of the citations accumulation process in the scientific community, which enabled us to tackle the outwardly intractable problem of assessing the impact of introducing “negative” citations. Although the most frequent reason to cite an article is to hig...
In the field of information fusion, the problem of data aggregation has been formalized as an order-preserving process that builds upon the property of monotonicity. However, fields such as computational statistics, data analysis and geometry, usually emphasize the role of equivariances to various geometrical transformations in aggregation processe...
The problem of learning symmetric capacities (or fuzzy measures) from data is investigated toward applications in data analysis and prediction as well as decision making. Theoretical results regarding the solution minimizing the mean absolute error are exploited to develop an exact branch-refine-and-bound-type algorithm for fitting Sugeno integrals...
The efficacy of different league formats in ranking teams according to their true latent strength is analysed. To this end, a new approach for estimating attacking and defensive strengths based on the Poisson regression for modelling match outcomes is proposed. Various performance metrics are estimated reflecting the agreement between latent teams’...
The Sugeno integral has numerous successful applications, including but not limited to the areas of decision making, preference modeling, and bibliometrics. Despite this, the current state of the development of usable algorithms for numerically fitting the underlying discrete fuzzy measure based on a sample of prototypical values – even in the simp...
Research in aggregation theory is nowadays still mostly focused on algorithms to summarize tuples consisting of observations in some real interval or of diverse general ordered structures. Of course, in practice of information processing many other data types between these two extreme cases are worth inspecting. This contribution deals with the agg...
As cities increase in size, governments and councils face the problem of designing infrastructure and approaches to traffic management that alleviate congestion. The problem of objectively measuring congestion involves taking into account not only the volume of traffic moving throughout a network, but also the inequality or spread of this traffic o...
This proceedings volume is a collection of peer reviewed papers presented at the 8th International Conference on Soft Methods in Probability and Statistics (SMPS 2016) held in Rome (Italy).
The book is dedicated to Data science which aims at developing automated methods to analyze massive amounts of data and to extract knowledge from them. It show...
The famous Hirsch index was introduced only about ten years ago. Despite that, it is already widely used in many decision-making tasks, such as the evaluation of individual scientists, research grant allocation, or even production planning. It is known that the h-index is related to the discrete Sugeno integral and the Ky Fan metric introduced in th...
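As a small illustration of the index discussed here, the sketch below computes the h-index in two ways: by the usual counting definition and by a max-min formula over the nonincreasingly sorted citation counts, which is the form that resembles a Sugeno integral with respect to the counting measure. The function names are hypothetical, and the two forms are only claimed to agree for nonnegative integer inputs.

```python
def h_index(citations):
    """Largest h such that at least h papers have at least h citations each."""
    xs = sorted(citations, reverse=True)
    h = 0
    for i, x in enumerate(xs, start=1):
        if x >= i:
            h = i
        else:
            break
    return h


def h_index_maxmin(citations):
    """Max-min form: max over ranks i of min(i, i-th largest citation count).

    Agrees with h_index for nonnegative integer citation counts.
    """
    xs = sorted(citations, reverse=True)
    return max((min(i, x) for i, x in enumerate(xs, start=1)), default=0)


if __name__ == "__main__":
    record = [10, 8, 5, 4, 3]
    print(h_index(record), h_index_maxmin(record))
```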
Economic inequality measures are employed as a key component in various socio-demographic indices to capture the disparity between the wealthy and poor. Since their inception, they have also been used as a basis for modelling spread and disparity in other contexts. While recent research has identified that a number of classical inequality and welfa...
**see also: https://datawranglingpy.gagolewski.com**
The authors' aim is to prepare the Reader to carry out the entire data analysis process independently, from downloading and loading a data set, through its preprocessing and cleaning, to the analysis itself, the visualisation of the results, and their interpretation. We know that certain solutions, ...
The open and freely available R environment has gained enormous popularity in recent years. The R language is one of the basic tools in the workshop of many data analysts, statisticians, data scientists, opinion and market researchers, business intelligence specialists, and scientists.
Most publications available on the Polish and foreign publishing markets ...
The paper discusses a generalization of the nearest centroid hierarchical clustering algorithm. A first extension deals with the incorporation of generic distance-based penalty minimizers instead of the classical aggregation by means of centroids. As a result, the presented algorithm can be applied in spaces equipped with an arbitrary dissimilarity...
We discuss a generalization of the fuzzy (weighted) k-means clustering procedure and point out its relationships with data aggregation in spaces equipped with arbitrary dissimilarity measures. In the proposed setting, a data set partitioning is performed based on the notion of points’ proximity to generic distance-based penalty minimizers. Moreover...
The use of supervised learning techniques for fitting weights and/or generator functions of weighted quasi-arithmetic means – a special class of idempotent and nondecreasing aggregation functions – to empirical data has already been considered in a number of papers. Nevertheless, there are still some important issues that have not been discussed in...
In this paper, we study the efficacy of the official ranking for international football teams compiled by FIFA, the body governing football competition around the globe. We present strategies for improving a team's position in the ranking. By combining several statistical techniques, we derive an objective function in a decision problem of optimal...
Hirsch's h-index is perhaps the most popular citation-based measure of scientific excellence. In 2013, G. Ionescu and B. Chopard proposed an agent-based model for this index to describe a publication and citation generation process in an abstract scientific community. With such an approach one can simulate a single scientist's activity, an...
In this study we investigate the recently introduced competition format for the top association football division in Poland (similar to one used in, e.g., Belgium and Kazakhstan). We compare it to the double round-robin tournament which is the most prevalent league format among European leagues. In a simulation study we show that the new league for...
Aggregation theory often deals with measures of central tendency of quantitative data. As sometimes a different kind of information fusion is needed, an axiomatization of spread measures was introduced recently. In this contribution we explore the properties of WDpWAM and WDpOWA operators, which are defined as weighted Lp-distances to weighted ari...
In the field of informetrics, agents are often represented by numeric sequences of not necessarily conforming lengths. There are numerous aggregation techniques of such sequences, e.g., the g-index, the h-index, that may be used to compare the output of pairs of agents. In this paper we address the question of whether such impact indices may be used to...
In this paper we describe recent advances in our R code similarity detection algorithm. We propose a modification of the Program Dependence Graph (PDG) procedure used in the GPLAG system that better fits the nature of functional programming languages like R. The major strength of our approach lies in a proper aggregation of outputs of multiple plag...
The K-means algorithm is one of the most often used clustering techniques. However, when it comes to discovering clusters in informetric data sets that consist of non-increasingly ordered vectors of not necessarily conforming lengths, such a method cannot be applied directly. Hence, in this paper, we propose a K-means-like algorithm to determine gr...
Classically, unsupervised machine learning techniques are applied to data sets with a fixed number of attributes (variables). However, many problems encountered in the field of informetrics require extending these kinds of methods so that they may be computed over a set of nonincreasingly ordered vectors of unequal lengths. T...
https://github.com/Rexamine/FuzzyNumbers/raw/master/devel/tutorial/FuzzyNumbers-Tutorial_current.pdf
The recently-introduced OM3 aggregation operators fulfill three appealing properties: they are simultaneously minitive, maxitive, and modular. Among the instances of OM3 operators we find e.g. OWMax and OWMin operators, the famous Hirsch's h-index and all its natural generalizations.
In this paper the basic axiomatic and probabilistic properties of...
Sugeno integral-based confidence intervals for the theoretical h-index of a fixed-length sequence of i.i.d. random variables are derived. They are compared with other estimators of such a distribution characteristic in a Pareto i.i.d. model. It turns out that in the first case we obtain much wider intervals. It seems to be due to the fact that a Su...
The producers assessment problem has many important practical instances: it is an abstract model for intelligent systems evaluating e.g. the quality of computer software repositories, web resources, social networking services, and digital libraries. Each producer’s performance is determined according not only to the overall quality of the items he/...
The theory of aggregation most often deals with measures of central tendency. However, sometimes a very different kind of a numeric vector's synthesis into a single number is required. In this paper we introduce a class of mathematical functions which aim to measure spread or scatter of one-dimensional quantitative data. The proposed definition ser...
R is a programming language and software environment for performing statistical computations and applying data analysis that is increasingly gaining popularity among practitioners and scientists. In this paper we present a preliminary version of a system to detect pairs of similar R code blocks among a given set of routines, which is based on a proper aggr...
A reasonable approximation of a fuzzy number should have a simple membership function, be close to the input fuzzy number, and should preserve some of its important characteristics. In this paper we suggest approximating a fuzzy number by a piecewise linear 1-knot fuzzy number which is the closest one to the input fuzzy number among all piecewise...
**see also: https://deepr.gagolewski.com**
The open, freely available, and free-of-charge R environment has been gaining more and more popularity in recent years and is becoming a very serious alternative to tools such as SAS, STATA, or SPSS, which are used by many commercial institutions.
In the book, we place particular emphasis on explaining the most ...
The Choquet, Sugeno, and Shilkret integrals with respect to monotone measures, as well as their generalization – the universal integral, stand for a useful tool in decision support systems. In this paper we propose a general construction method for aggregation operators that may be used in assessing output of scientists. We show that the most often...
See http://agop.rexamine.com
Recently, a very interesting relation between symmetric minitive, maxitive, and modular aggregation operators has been shown. It turns out that the intersection between any pair of the mentioned classes is the same. This result introduces what we here propose to call the OM3 operators. In the first part of our contribution on the analysis of the OM...
This article is a second part of the contribution on the analysis of the recently-proposed class of symmetric maxitive, minitive and modular aggregation operators. Recent results [M. Gagolewski and R. Mesiar, “Aggregating different paper quality measures with a generalized h-index“, J. Informetr. 6, No. 4, 566–579 (2012)] indicated some unstable be...
In this paper the relationship between symmetric minitive, maxitive, and modular aggregation operators is considered. It is shown that the intersection between any two of the three discussed classes is the same. Moreover, the intersection is explicitly characterized. It turns out that the intersection contains families of aggregation operators such...
In this paper we deal with the problem of aggregating numeric sequences of arbitrary length that represent e.g. citation records of scientists. Impact functions are the aggregation operators that express as a single number not only the quality of individual publications, but also their author's productivity.
We examine some fundamental properties of...