# Szymon Jaroszewicz

Polish Academy of Sciences | PAN · Institute of Computer Science

Professor

## About

62

Publications

32,349

Reads

1,316

Citations

## Publications

We present an approach to efficiently construct stepwise regression models in a very high dimensional setting using a multidimensional index. The approach is based on an observation that the collections of available predictor variables often remain relatively stable and many models are built based on the same predictors. Example scenarios include d...

Uplift modeling is an approach to machine learning which allows for predicting the net effect of an action (with respect to not taking the action). To achieve this, the training population is divided into two parts: the treatment group, which is subjected to the action, and the control group, on which the action is not taken. Our task is to constru...

Uplift models support decision-making in marketing campaign planning. Estimating the causal effect of a marketing treatment, an uplift model facilitates targeting communication to responsive customers and efficient allocation of marketing budgets. Research into uplift models focuses on conversion models to maximize incremental sales. The paper intr...

Uplift models support decision-making in marketing campaign planning. Estimating the causal effect of a marketing treatment, an uplift model facilitates targeting communication to responsive customers and an efficient allocation of marketing budgets. Research into uplift models focuses on conversion models to maximize incremental sales. The paper i...

The purpose of statistical modeling is to select targets for some action, such as a medical treatment or a marketing campaign. Unfortunately, classical machine learning algorithms are not well suited to this task since they predict the results after the action, and not its causal impact. The answer to this problem is uplift modeling, which, in addi...

Uplift modeling is an area of machine learning which aims at predicting the causal effect of some action on a given individual. The action may be a medical procedure, marketing campaign, or any other circumstance controlled by the experimenter. Building an uplift model requires two training sets: the treatment group, where individuals have been sub...

Context: Better methods of evaluating process performance of OSS projects can benefit decision makers who consider adoption of OSS software in a company. This article studies the closure of issues (bugs and features) in GitHub projects, which is an important measure of OSS development process performance and quality of support that project users re...

Uplift modeling is a branch of machine learning which aims to predict not the class itself, but the difference between the class variable behavior in two groups: treatment and control. Objects in the treatment group have been subjected to some action, while objects in the control group have not. By including the control group, it is possible to bui...
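The idea of predicting a difference between treatment and control behavior, rather than the class itself, can be illustrated with a minimal sketch. It is not the method of any of the listed papers: it simply estimates uplift per segment as the difference of response rates in the two groups. The function name `segment_uplift`, the record layout, and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def segment_uplift(records):
    """Estimate uplift per segment from (segment, group, responded) records.

    group is "treatment" or "control". The uplift of a segment is its
    response rate in the treatment group minus that in the control group.
    """
    # For each segment and group keep [number of responders, group size].
    counts = defaultdict(lambda: {"treatment": [0, 0], "control": [0, 0]})
    for segment, group, responded in records:
        cell = counts[segment][group]
        cell[0] += int(responded)
        cell[1] += 1

    uplift = {}
    for segment, groups in counts.items():
        (resp_t, n_t), (resp_c, n_c) = groups["treatment"], groups["control"]
        if n_t and n_c:  # both groups are needed to form a difference
            uplift[segment] = resp_t / n_t - resp_c / n_c
    return uplift


# Hypothetical toy data: the action helps "young" customers only.
records = (
    [("young", "treatment", True)] * 30 + [("young", "treatment", False)] * 70
    + [("young", "control", True)] * 10 + [("young", "control", False)] * 90
    + [("old", "treatment", True)] * 20 + [("old", "treatment", False)] * 80
    + [("old", "control", True)] * 20 + [("old", "control", False)] * 80
)
uplift = segment_uplift(records)
```

In this toy data a plain response model would rank both segments as similar responders in the treatment group, while the uplift estimate separates the segment where the action changes behavior from the one where it does not.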

Uplift modeling is a subfield of machine learning concerned with predicting the causal effect of an action at the level of individuals. This is achieved by using two training sets: treatment, containing objects which have been subjected to an action, and control, containing objects on which the action has not been performed. An uplift model then pred...

The Wikipedia project has created one of the largest and best-known open knowledge communities. This community is a model for several similar efforts, both public and commercial, and even for the knowledge economy of the future e-society. For these reasons, issues of quality, social processes, and motivation within the Wikipedia knowledge community...

Uplift modeling is a machine learning technique which aims at predicting, on the level of individuals, the gain from performing a given action with respect to refraining from taking it. Examples include medical treatments and direct marketing campaigns where the rate of spontaneous recovery and the background purchase rate need to be taken into acc...

Uplift modeling is a branch of machine learning which aims at predicting the causal effect of an action such as a marketing campaign or a medical treatment on a given individual by taking into account responses in a treatment group, containing individuals subject to the action, and a control group serving as a background. The resulting model can th...

Nowadays, open-source software is developed mostly by decentralized teams of developers cooperating online. The GitHub portal is an online social network that supports development of software by virtual teams of programmers. Since there is no central mechanism that governs the process of team formation, it is interesting to investigate if there are any...

In this paper we present PaCAL, a Python package for arithmetical computations on random variables. The package is capable of performing the four arithmetic operations: addition, subtraction, multiplication and division, as well as computing many standard functions of random variables. Summary statistics, random number generation, plots, and histog...

The main focus of research in machine learning and statistics is on building more advanced and complex models. However, in practice it is often much more important to use the right variables. One may hope that the recent popularity of open data would allow researchers to easily find relevant variables. However, current linked data methodology is not sui...

The paper studies the quality of teams of Wikipedia authors using a statistical approach. We report the preparation of a dataset containing numerous behavioural and structural attributes, its subsequent analysis, and its use to predict team quality. We have performed exploratory analysis using partial regression to remove the influence of attribut...

Uplift modeling is a branch of Machine Learning which aims to predict not the class itself, but the difference between the class variable behavior in two groups: treatment and control. Objects in the treatment group have been subject to some action, while objects in the control group have not. By including the control group it is possible to build...

A generalization of the commonly used Maximum Likelihood based learning algorithm for the logistic regression model is considered. It is well known that using the Laplace prior (L1 penalty) on model coefficients leads to a variable selection effect, when most of the coefficients vanish. It is argued that variable selection is not always desirable...

Most classification approaches aim at achieving high prediction accuracy on a given dataset. However, in most practical cases, some action such as mailing an offer or treating a patient is to be taken on the classified objects, and we should model not the class probabilities themselves, but instead, the change in class probabilities caused by the a...

Dealing with imprecise quantities is an important problem in scientific computation. Model parameters are known only approximately, typically in the form of probability density functions. Unfortunately there are currently no methods of taking uncertain parameters into account which would at the same time be easy to apply and highly accurate. An imp...

The paper presents a method of interactive construction of global Hidden Markov Models based on local patterns discovered in sequence data. The method works by finding interesting sequences whose probability in data differs from that predicted by the model. The patterns are then presented to the user who updates the model using their understand...

Marketing campaigns directed to randomly selected customers often generate huge costs and a weak response. Moreover, such campaigns tend to unnecessarily annoy customers and make them less likely to respond to future communications. Precise targeting of marketing actions can potentially result in a greater return on investment. Usually, response mo...

Most classification approaches aim at achieving high prediction accuracy on a given dataset. However, in most practical cases, some action, such as mailing an offer or treating a patient, is to be taken on the classified objects and we should model not the class probabilities themselves, but instead, the change in class probabilities caused by the...

The paper presents a method of interactive construction of global Hidden Markov Models (HMMs) based on local sequence patterns discovered in data. The method is based on finding interesting sequences whose frequency in the database differs from that predicted by the model. The patterns are then presented to the user who updates the model using thei...

Information-Theoretical and Combinatorial Methods in Data Mining. December 2003. Szymon Jaroszewicz, M.Sc., Technical University of Szczecin; Ph.D., University of Massachusetts Boston. Directed by Professor Dan A. Simovici. Various applications of information-theoretical and combinatorial methods in data mining are presented.

We study a discovery framework in which background knowledge on variables and their relations within a discourse area is available in the form of a graphical model. Starting from an initial, hand-crafted or possibly empty graphical model, the network evolves in an interactive process of discovery. We focus on the central step of this process: giv...

The paper presents an approach to mining patterns in numerical data without the need for discretization. The proposed method allows for discovery of arbitrary nonlinear relationships. The approach is based on finding a function of a set of attributes whose values are close to zero in the data. Intuitively such functions correspond to equations desc...

Given a medical data set containing genetic description of sodium-sensitive and non-sensitive patients, we examine it using several techniques: induction of decision rules, naive Bayes classifier, voting perceptron classifier, decision trees, SVM classifier. We specifically focus on induction of decision rules and so called Pareto-optimal rules, w...

We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, exact calculation of these similarities requires processing of all database records...

The paper presents minimum variance patterns: a new class of itemsets and rules for numerical data, which capture arbitrary continuous relationships between numerical attributes without the need for discretization. The approach is based on finding polynomials over sets of attributes whose variance, in a given dataset, is close to zero. Sets of attr...

In this paper we show an efficient method for inducing classifiers that directly optimize the area under the ROC curve. Recently, AUC gained importance in the classification community as a means to compare the performance of classifiers. Because most classification methods do not optimize this measure directly, several classification learning me...

We examine a new approach to building decision trees by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users and yields decision trees that have smaller sizes and fewer...

The paper discusses the properties of an attribute selection criterion for building rough set reducts based on discernibility matrix and compares it with Shannon entropy and Gini index used for building decision trees. It has been shown theoretically and experimentally that entropy and Gini index tend to work better if the reduct is later used for...

We examine a new approach to building decision trees by introducing a geometric splitting criterion, based on the properties of a family of metrics on the space of partitions of a finite set. This criterion can be adapted to the characteristics of the data sets and the needs of the users and yields decision trees that have smaller sizes and fewer le...

A new class of associations (polynomial itemsets and polynomial association rules) is presented which allows for discovering nonlinear relationships between numeric attributes without discretization. For binary attributes, proposed associations reduce to classic itemsets and association rules. Many standard association rule mining algorithm...

We study the mining of interesting patterns in the presence of numerical attributes. Instead of the usual discretization methods, we propose the use of rank based measures to score the similarity of sets of numerical attributes. New support measures for numerical data are introduced, based on extensions of Kendall's tau, and Spearman's Footrule a...

We present algorithms for fast generation of short reducts which avoid building the discernibility matrix explicitly. We show how information obtained from this matrix can be obtained based only on the distributions of attribute values. Since the size of the discernibility matrix is quadratic in the number of data records, not building the matrix expli...

We consider a model in which background knowledge on a given domain of interest is available in terms of a Bayesian network, in addition to a large database. The mining problem is to discover unexpected patterns: our goal is to find the strongest discrepancies between network and database. This problem is intrinsically difficult because it requires...

We characterize measures on free Boolean algebras and we examine the relationships that exist between measures and binary tables in relational databases. It is shown that these measures are completely defined by their values on positive conjunctions, and a formula that yields this value is obtained using the method of indicators. An extension of th...

The paper presents a method for pruning frequent itemsets based on background knowledge represented by a Bayesian network. The interestingness of an itemset is defined as the absolute difference between its support estimated from data and from the Bayesian network. Efficient algorithms are presented for finding interestingness of a collection of frequ...
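The interestingness measure described in this abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's algorithm: the Bayesian network is replaced by a hypothetical stand-in, `independent_model`, which treats all items as independent (a network with no edges), and the transaction data and marginals are invented.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def interestingness(transactions, itemset, model_probability):
    """Absolute difference between data support and model-predicted support."""
    return abs(support(transactions, itemset) - model_probability(itemset))


# Assumed stand-in for background knowledge: every item is independent,
# which corresponds to a Bayesian network without edges.
marginals = {"bread": 0.5, "butter": 0.5}


def independent_model(itemset):
    p = 1.0
    for item in itemset:
        p *= marginals[item]
    return p


# Hypothetical data: bread and butter always occur together.
transactions = [{"bread", "butter"}] * 4 + [set()] * 4

pair_score = interestingness(transactions, {"bread", "butter"}, independent_model)
single_score = interestingness(transactions, {"bread"}, independent_model)
```

Here the single item agrees with the model while the pair does not, so only the pair would survive pruning against this background knowledge.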

Increasing interest in new pattern recognition methods has been motivated by bioinformatics research. The analysis of gene expression data originating from microarrays constitutes an important application area for classification algorithms and illustrates the need for identifying important predictors. We show that the Goodman-Kruskal coefficient can...

We introduce a numerical measure on sets of partitions of finite sets that is linked to the Goodman-Kruskal association index commonly used in statistics. This measure allows us to define a metric on such partitions used for constructing decision trees. Experimental results suggest that by replacing the usual splitting criterion used in C4.5 by a met...

We introduce an extension of the notion of Shannon conditional entropy to a more general form of conditional entropy that captures both the conditional Shannon entropy and a similar notion related to the Gini index. The proposed family of conditional entropies generates a collection of metrics over the set of partitions of finite sets, which can be...

The purpose of this paper is to examine the usability of Bonferroni-type combinatorial inequalities for estimating the support of itemsets as well as general Boolean expressions. Families of inequalities for various types of Boolean expressions are presented and evaluated experimentally.

Data mining algorithms produce huge sets of rules, practically impossible to analyze manually. It is thus important to develop methods for removing redundant rules from those sets. We present a solution to the problem using the Maximum Entropy approach. The problem of efficiency of Maximum Entropy computations is addressed by using closed form solutio...

The paper presents a new general measure of rule interestingness. Many known measures such as chi-square, Gini gain or entropy gain can be obtained from this measure by setting some numerical parameters, including the amount of trust we have in the estimation of the probability distribution of the data. Moreover, we show that there is a continuum o...

The aim of this article is to present an axiomatization of a generalization of Shannon's entropy starting from partitions of finite sets. The proposed axiomatization defines a family of entropies depending on a real positive parameter that contains as a special case the Havrda-Charvat (1967) entropy, and thus, provides axiomatizations for the Shann...

We characterize measures on free Boolean algebras and we examine the relationships that exist between measures and binary tables in relational databases. It is shown that these measures are completely defined by their values on positive conjunctions and an algorithm that leads to the construction of measures starting from its values on a positive...

The aim of this paper is to present an axiomatization of a generalization of Shannon's entropy. The newly proposed axiomatization yields as a special case the Havrda-Charvat entropy, and thus, provides axiomatizations for the Shannon entropy, the Gini index, and for other types of entropy used in classification and data mining.
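How one parametric family can cover both the Shannon entropy and the Gini index can be seen in a short numerical sketch. It assumes the common parametrization of the Havrda-Charvat entropy, H_β(p) = (1 - Σ p_i^β) / (1 - 2^(1-β)); whether the paper uses exactly this normalization is an assumption, and the function name `generalized_entropy` is hypothetical.

```python
import math


def generalized_entropy(probs, beta):
    """Havrda-Charvat entropy of a probability vector for parameter beta > 0.

    Assumed form: (1 - sum(p_i ** beta)) / (1 - 2 ** (1 - beta)).
    At beta == 1 the formula is 0/0, so its limit, the Shannon entropy
    in bits, is returned instead.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    if math.isclose(beta, 1.0):
        return -sum(p * math.log2(p) for p in probs if p > 0)
    return (1 - sum(p ** beta for p in probs)) / (1 - 2 ** (1 - beta))


probs = [0.5, 0.25, 0.25]

shannon = generalized_entropy(probs, 1.0)       # Shannon entropy in bits
near_one = generalized_entropy(probs, 1.0001)   # approaches the Shannon value
at_two = generalized_entropy(probs, 2.0)        # proportional to the Gini index
gini = 1 - sum(p ** 2 for p in probs)
```

With this normalization the value at β = 2 is twice the Gini index 1 - Σ p_i², and values for β close to 1 converge to the Shannon entropy, which is why a single axiomatization of the family covers both measures.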

A weak decomposition of an incompletely specified function f is a decomposition of some completion of f. Using a graph-theoretical characterization of functions that admit such decompositions, we present a technique derived from the Apriori algorithm that allows a data mining approach to identifying these decompositions.

We introduce the notion of entropy for a set of attributes of a table in a relational database starting from the notion of entropy for finite functions. We examine the connections that exist between conditional entropies of attribute sets and lossy decompositions of tables and explore the properties of the entropy of sets of attributes regarded as...

In this paper we present a new axiomatization of the notion of entropy of functions between finite sets and we introduce and axiomatize the notion of conditional entropy between functions. The results can be directly applied to logic functions, which can be regarded as functions between finite sets. Our axiomatizations are based on properties of en...

We address the problem of matching imperfectly documented schemas of data streams and large databases. Instance-level schema matching algorithms identify likely correspondences between attributes by quantifying the similarity of their corresponding values. However, exact calculation of these similarities requires processing of all database recor...

Cross-selling is a strategy of selling new products to a customer who has made other purchases earlier. Except for the obvious profit from extra products sold, it also increases the dependence of the customer on the vendor and therefore reduces churn. This is especially important in the area of telecommunications, characterized by high volatility...

The paper presents a method for pruning frequent itemsets based on background knowledge represented by a Bayesian network. The interestingness of an itemset is defined as the absolute difference between its support estimated from data and from the Bayesian network. Efficient algorithms are presented for finding interestingness of a collection of frequent...
