Marco Riani

Università di Parma | UNIPR · Department of Economics and Interdepartmental Centre for Robust Statistics

Ph.D.

About

153
Publications
26,175
Reads
2,854
Citations
Introduction
Full Professor of Statistics at the University of Parma since 2006. Director of the interdepartmental center of robust statistics for big data (Ro.S.A.) of the University of Parma since 2016. He is currently Associate Editor of “Statistical Methods and Applications” and “Metron”. He is the author of more than 150 publications and two books published by Springer Verlag New York. His main research interests are in robust statistics (regression, multivariate analysis, classification and time series).
Additional affiliations
November 1998 - present
Università di Parma
Position
  • Professor (Full)
October 1997 - present
Università di Parma
Position
  • Professor (Full)
Education
January 1992 - October 1995
University of Florence
Field of study
  • Statistics

Publications (153)
Article
Full-text available
Artificial Intelligence relies on the application of machine learning models which, while reaching high predictive accuracy, lack explainability and robustness. This is a problem in regulated industries, as authorities aimed at monitoring the risks arising from the application of Artificial Intelligence methods may not validate them. No measurement...
Conference Paper
Full-text available
Model-based approaches to cluster analysis and mixture modeling often involve maximizing classification and mixture likelihoods. Without appropriate constraints on the scatter matrices of the components, these maximizations result in ill-posed problems. Moreover, without constraints, non-interesting or “spurious” clusters are often detected by the EM...
Article
Full-text available
The research presented in this paper aims at providing a statistical model that is capable of estimating soil water content based on weather data. The model was tested using a long-time series of field experimental data from continuous monitoring at a test site in Oltrepò Pavese (northern Italy). An innovative statistical function was developed in...
Preprint
Full-text available
Artificial Intelligence relies on the application of machine learning models which, while reaching high predictive accuracy, lack explainability and robustness. This is a problem in regulated industries, as authorities aimed at monitoring the risks arising from the application of AI methods may not validate them. No measurement methodologies are y...
Chapter
The AVAS (Additivity And Variance Stabilization) algorithm of Tibshirani provides a non-parametric transformation of the response in a linear model to approximately constant variance. It is thus a generalization of the much used Box-Cox transformation. However, AVAS is not robust. Outliers can have a major effect on the estimated transformations bo...
Article
In this article, extensions to the recently introduced concept of pairwise overlap between mixture components are proposed. The notion of overlap is useful for studying the systematic performance of clustering algorithms. Existing methods can be used for simulating elliptical data according to pre-specified overlap characteristics. First, an approa...
Article
Outliers can have a major effect on the estimated transformation of the response in linear regression models, as they can on the estimates of the coefficients of the fitted model. The effect is more extreme in the Generalized Additive Models (GAMs) that are the subject of this paper, as the forms of terms in the model can also be affected. We devel...
Chapter
Model-based approaches to cluster analysis and mixture modelling often involve maximizing classification and mixture likelihoods. Robust clustering and mixture modelling procedures that can resist a certain amount of contaminating data can be introduced by considering trimmed versions of those classification and mixture likelihoods. Without appropr...
Article
Full-text available
Correspondence analysis is a method for the visual display of information from two‐way contingency tables. We introduce a robust form of correspondence analysis based on minimum covariance determinant estimation. This leads to the systematic deletion of outlying rows of the table and to plots of greatly increased informativeness. Our examples are t...
Article
Full-text available
The paper introduces an automatic procedure for the parametric transformation of the response in regression models to approximate normality. We consider the Box–Cox transformation and its generalization to the extended Yeo–Johnson transformation which allows for both positive and negative responses. A simulation study illuminates the superior compa...
Article
Full-text available
A new methodology for constrained parsimonious model-based clustering is introduced, in which a tuning parameter controls the strength of the constraints. The methodology includes the 14 parsimonious models that are often applied in model-based clustering when assuming normal components as limit cases. This is done in a natural way by fi...
Article
Information criteria for model choice are extended to the detection of outliers in regression models. For deletion of observations (hard trimming) the family of models is generated by monitoring properties of the fitted models as the trimming level is varied. For soft trimming (downweighting of observations), some properties are monitored as the ef...
Article
Mycotoxins are secondary metabolites produced by pathogenic fungi. They are found in a variety of different products, such as spices, cocoa, and cereals, and they can contaminate fields before and/or after harvest and during storage. Mycotoxins negatively impact human and animal health, causing a variety of adverse effects, ranging from acute poiso...
Article
According to Eurostat, the EU production of chemicals hazardous to health reached 211 million tonnes in 2019. Thus, the possibility that some of these chemical compounds interact negatively with the human endocrine system has received, especially in the last decade, considerable attention from the scientific community. It is obvious that given the...
Article
Asymptotic properties of robust regression estimators are well known. However, it is not always clear what is the best strategy for confidence intervals and hypothesis testing when the sample size is not very large, since the distribution of residuals coming from robust estimates has unknown properties for small samples. In the present work we prop...
Chapter
Unlike the Box-Cox transformation, that of Yeo and Johnson for the response of a linear model can be applied when the observations are not constrained to be positive. We study the extended Yeo–Johnson transformation in which positive and negative observations can be transformed with different parameter values. The procedure is illustrated for data...
Article
Full-text available
Starting with the 2020 volume, the journal Metron has decided to celebrate the centenary of its foundation with three special issues. This volume is dedicated to robust statistics. A striking feature of most applied statistical analyses is the use of methods that are well known to be sensitive to outliers or to other departures from the postulated m...
Preprint
Full-text available
We propose a general approach to handle data contaminations that might disrupt the performance of feature selection and estimation procedures for high-dimensional linear models. Specifically, we consider the co-occurrence of mean-shift and variance-inflation outliers, which can be modeled as additional fixed and random components, respectively, and...
Article
Full-text available
The purpose of this paper is to show, in regression clustering, how to choose the most relevant solutions, analyze their stability, and provide information about the best combinations of the optimal number of groups, the restriction factor for the error variances across groups, and the level of trimming. The procedure is based on two steps. First we generalize the i...
Chapter
Full-text available
We consider a classical regression model contaminated by multiple outliers arising simultaneously from mean-shift and variance-inflation mechanisms, which are generally considered as alternatives. Identifying multiple outliers leads to computational challenges in the usual variance-inflation framework. We propose the use of robust estimation techniqu...
Article
Full-text available
Model-based approaches to cluster analysis and mixture modeling often involve maximizing classification and mixture likelihoods. Without appropriate constraints on the scatter matrices of the components, these maximizations result in ill-posed problems. Moreover, without constraints, non-interesting or “spurious” clusters are often detected by the EM...
Article
Full-text available
We review the sampling and results of the radiocarbon dating of the archaeological cloth known as the Shroud of Turin, in the light of recent statistical analyses of both published and raw data. The statistical analyses highlight an inter-laboratory heterogeneity of the means and a monotone spatial variation of the ages of subsamples that suggest t...
Article
Full-text available
Minimum density power divergence estimation provides a general framework for robust statistics, depending on a parameter α, which determines the robustness properties of the method. The usual estimation method is numerical minimization of the power divergence. The paper considers the special case of linear regression. We developed an alternative e...
Technical Report
Full-text available
In this report we review the history, the sampling and the results of the radiocarbon dating of the archaeological cloth known as the ‘Shroud of Turin’, in the light of recent ‘robust’ statistical analyses of the published data and of the raw data, the latter made available only in 2017 following a European legal request in...
Article
We analyse data on the performance of investment funds, 99 out of 309 of which report a loss, and on the profitability of 1405 firms, 407 of which report losses. The problem in both cases is to use regression to predict performance from sets of explanatory variables. In one case, it is clear from scatter plots of the data that the negative response...
Chapter
This is a graphical procedure for determining the effect of one or more observations on the transformation parameter in the Box–Cox family of power transformations of the response variable in regression. It is based on a “Forward Search” through the data to fit subsets of increasing numbers of observations, with outliers being included toward the e...
Chapter
This is a line‐printer graphic for exhibiting outliers in multivariate data. It is based on a robust procedure, the forward search. The article contrasts the stalactite plot with more modern plots for the same purpose that use the power and versatility of contemporary computer graphics.
Article
Full-text available
Time series often contain outliers and level shifts or structural changes. These unexpected events are of the utmost importance in fraud detection, as they may pinpoint suspicious transactions. The presence of such unusual events can easily mislead conventional time series analysis and yield erroneous conclusions. In this paper we provide a unified...
Article
Monitoring the properties of single sample robust analyses of multivariate data as a function of breakdown point or efficiency leads to the adaptive choice of the best values of these parameters, eliminating arbitrary decisions about their values and so increasing the quality of estimators. Monitoring the trimming proportion in robust cluster analy...
Article
Full-text available
We assess the performance of state-of-the-art robust clustering tools for regression structures under a variety of different data configurations. We focus on two methodologies that use trimming and restrictions on group scatters as their main ingredients. We also give particular care to the data generation process through the development of a flexi...
Article
Trimming principles play an important role in robust statistics. However, their use for clustering typically requires some preliminary information about the contamination rate and the number of groups. We suggest a fresh approach to trimming that does not rely on this knowledge and that proves to be particularly suited for solving problems in robus...
Article
Full-text available
Misinvoicing is a major tool in fraud including money laundering. We develop a method of detecting the patterns of outliers that indicate systematic mis-pricing. As the data only become available year by year, we develop a combination of very robust regression and the use of ‘cleaned’ prior information from earlier years, which leads to early and s...
Article
Diagnostic tools must rely on robust high-breakdown methodologies to avoid distortion in the presence of contamination by outliers. However, a disadvantage of having a single, even if robust, summary of the data is that important choices concerning parameters of the robust method, such as breakdown point, have to be made prior to the analysis. The...
Article
Deciding the number of clusters k is one of the most difficult problems in cluster analysis. For this purpose, complexity-penalized likelihood approaches have been introduced in model-based clustering, such as the well known BIC and ICL criteria. However, the classification/mixture likelihoods considered in these approaches are unbounded without an...
Article
Full-text available
The frequentist forward search yields a flexible and informative form of robust regression. The device of fictitious observations provides a natural way to include prior information in the search. However, this extension is not straightforward, requiring weighted regression. Bayesian versions of forward plots are used to exhibit the presence of mul...
Article
The forward search is a method of robust data analysis in which outlier free subsets of the data of increasing size are used in model fitting; the data are then ordered by closeness to the model. Here the forward search, with many random starts, is used to cluster multivariate data. These random starts lead to the diagnostic identification of tenta...
Chapter
Full-text available
A striking feature of most applied statistical analyses is the use of methods that are well known to be sensitive to outliers or to other departures from the postulated model. Since data contamination is often the rule, rather than the exception, we investigate the reasons for this contradictory (and perhaps unintended) choice. We also provide empi...
Article
Heteroskedastic regression data are modelled using a parameterized variance function. This procedure is robustified using a method with high breakdown point and high efficiency, which provides a direct link between observations and the weights used in model fitting. This feature is vital for the application, the analysis of international trade data...
Article
The objective of this paper was to test whether investing activity in the futures markets of different commodities (grains, sugar, coffee, cotton, cocoa, livestock) could be identified as a source of the increasing level and volatility of agricultural commodity prices. The causal link between trading activity and market factors (returns, volatility...
Chapter
The forward search provides a flexible and informative form of robust regression. We describe the introduction of prior information into the regression model used in the search through the device of fictitious observations. The extension to the forward search is not entirely straightforward, requiring weighted regression. Forward plots are used to...
Conference Paper
Full-text available
This contribution examines a causal link between trading activity and market factors such as returns and volatility. The ratio of volume to open interest in futures contracts performs better as a trading activity indicator than other parameters widely adopted in the literature. This is probably because the parameter is available at daily frequency while...
Article
Full-text available
We extend the capabilities of MixSim, a framework which is useful for evaluating the performance of clustering algorithms, on the basis of measures of agreement between data partitioning and flexible generation methods for data, outliers and noise. The peculiarity of the method is that data are simulated from normal mixture distributions on the bas...
Article
Full-text available
The identification of atypical observations and the immunization of data analysis against both outliers and failures of modeling are important aspects of modern statistics. The forward search is a graphics rich approach that leads to the formal detection of outliers and to the detection of model inadequacy combined with suggestions for model enhanc...
Article
Full-text available
Motivated by the requirement of controlling the number of false discoveries that arises in several application fields, we study the behaviour of diagnostic procedures obtained from popular high-breakdown regression estimators when no outlier is present in the data. We find that the empirical error rates for many of the available techniques are surp...
Chapter
The Forward Search is used in an exploratory manner, with many random starts, to indicate the number of clusters and their membership in continuous data. The prospective clusters can readily be distinguished from background noise and from other forms of outliers. A confirmatory Forward Search, involving control on the sizes of statistical tests, es...
Article
Full-text available
We tackle the problem of obtaining the consistency factors of robust S-estimators of location and scale both in regression and multivariate analysis. We provide theoretical results, proving new formulae for their calculation and shedding light on the relationship between these factors and important statistical properties of S-estimators. We also gi...
Article
Full-text available
There are several methods for obtaining very robust estimates of regression parameters that asymptotically resist 50% of outliers in the data. Differences in the behaviour of these algorithms depend on the distance between the regression data and the outliers. We introduce a parameter $\lambda$ that defines a parametric path in the space of models...
Article
The Forward Search is a powerful general method for detecting anomalies in structured data, whose diagnostic power has been shown in many statistical contexts. However, despite the wealth of empirical evidence in favour of the method, only a few theoretical properties have been established regarding the resulting estimators. We show that the Forward...
Article
Full-text available
Robust methods are little applied (although much studied by statisticians). We monitor very robust regression by looking at the behaviour of residuals and test statistics as we smoothly change the robustness of parameter estimation from a breakdown point of 50% to non-robust least squares. The resulting procedure provides insight into the structure...
Article
Full-text available
This contribution addresses clustering issues in the presence of densely populated data points with a high degree of overlap. In order to avoid the disturbing effects of high-density areas we suggest a technique that selects a point in each cell of a grid defined along the Principal Component axes of the data. The selected sub-sample removes the high d...
Chapter
Multivariate outliers are usually identified by means of robust distances. A statistically principled method for accurate outlier detection requires both availability of a good approximation to the finite-sample distribution of the robust distances and correction for the multiplicity implied by repeated testing of all the observations for outlyingn...
Article
Full-text available
The twelve results from the 1988 radio carbon dating of the Shroud of Turin show surprising heterogeneity. We try to explain this lack of homogeneity by regression on spatial coordinates. However, although the locations of the samples sent to the three laboratories involved are known, the locations of the 12 subsamples within these samples are not....
Article
We extend the Forward Search approach for robust data analysis to address problems in text mining. In this domain, datasets are collections of an arbitrary number of documents, which are represented as vectors of thousands of elements according to the vector space model. When the number of variables v is so large and the dataset size n is smaller b...
Chapter
This paper summarizes results in the use of the Forward Search in the analysis of corrupted datasets, and those with mixtures of populations. We discuss new challenges that arise in the analysis of large, complex datasets. Methods developed for regression and clustering are described.
Article
We develop a Cp statistic for the selection of regression models with stationary and nonstationary ARIMA error term. We derive the asymptotic theory of the maximum likelihood estimators and show they are consistent and asymptotically Gaussian. We also prove that the distribution of the sum of squares of one step ahead standardized prediction errors...
Article
The methods of very robust regression resist up to 50% of outliers. The algorithms for very robust regression rely on selecting numerous subsamples of the data. New algorithms for LMS and LTS estimators that have increased computational efficiency due to improved combinatorial sampling are proposed. These and other publicly available algorithms are...
Article
We present the FSDA (Forward Search for Data Analysis) toolbox, a new software library that extends MATLAB and its Statistics Toolbox to support a robust and efficient analysis of complex datasets, affected by different sources of heterogeneity. As the name of the library indicates, the project was born around the Forward Search approach, but it has...
Article
Robust distances are mainly used for the purpose of detecting multivariate outliers. The precise definition of cut-off values for formal outlier testing assumes that the “good” part of the data comes from a multivariate normal population. Robust distances also provide valuable information on the units not declared to be outliers and, under mild reg...
Article
Full-text available
This article extends the analysis of multivariate transformations to linear and quadratic discriminant analysis. It shows that the standard application of deletion diagnostic techniques for validating a particular transformation suffers from masking and so may fail if several outliers are present. We therefore suggest a simple and powerful method w...
Chapter
Full-text available
This chapter tackles the topics of robustness and multivariate outlier detection for ordinal data. First, it reviews outlier detection methods in regression for continuous data, and gives an example which shows that graphical tools of data analysis or traditional diagnostic measures based on all the observations are not sufficient to detect multiva...
Chapter
In this chapter we exemplify some of the theory of Chapter 2 for four sets of data. We start with some synthetic data that were designed to contain masked outliers and so provide difficulties for least squares diagnostics based on backwards deletion. We show that the data do indeed present such problems, but that our procedure finds the hidden stru...
Article
Full-text available
One of the biggest challenges for statisticians is to be able to effectively present and communicate statistical results (Tufte 1983, Spence 2001). This happens in particular when working on applied problems with non-statisticians. For example, in the Joint Research Centre (JRC) of the European Commission the authors and other statisticians coopera...
Article
The problem of robust estimation and multivariate outlier detection of the term structure of default intensity is considered. Both the multivariate Vasicek and CIR models, embedding the Kalman filter algorithm in a forward search context, are used to estimate default intensity. The focus is not on the estimation of credit models including jumps, bu...
Chapter
Full-text available
We provide a selective view of some key statistical concepts that underlie the different approaches to multivariate outlier detection. Our hope is that appreciation of these concepts will help to establish a unified and widely accepted framework for outlier detection.
Article
The forward search provides data-driven flexible trimming of a Cp statistic for the choice of regression models that reveals the effect of outliers on model selection. An informed robust model choice follows. Even in small samples, the statistic has a null distribution indistinguishable from an F distribution. Limits on acceptable values of the Cp...
Chapter
Full-text available
The evaluation of the effectiveness of organisations can be aided by the use of cluster analysis, suggesting and clarifying differences in structure between successful and failing organisations. Unfortunately, traditional methods of cluster analysis are highly sensitive to the presence of atypical observations and departures from normality. We desc...
Article
Full-text available
It is now widely recognized that most statistical techniques are not resistant to outliers or other deviations from the classical assumptions. Therefore, the development of highly robust and efficient statistical methods has become a goal of paramount importance in both theoretical and applied statistics. The demand for such methods has been drive...
Article
The Forward Search is a powerful general method, incorporating flexible data-driven trimming, for the detection of outliers and unsuspected structure in data and so for building robust models. Starting from small subsets of data, observations that are close to the fitted model are added to the observations used in parameter estimation. As this subs...
Article
Full-text available
We develop a Cp statistic with a known distribution for the selection of models for stationary and non-stationary time series, including ARIMA and structural models, that may include regressors. To provide a unified framework we use the state-space approach which readily incorporates explanatory variables. We exemplify use of the statistic throu...
Article
Full-text available
The twelve results from the 1988 radio carbon dating of the Shroud of Turin show surprising heterogeneity. We try to explain this lack of homogeneity by regression on spatial coordinates. However, although the locations of the samples sent to the three laboratories involved are known, the locations of the 12 subsamples within these samples are not....
Article
Full-text available
Using the 12 published results from the 1988 radiocarbon dating of the TS (Turin Shroud), a robust statistical analysis has been performed in order to test the conclusion by Damon et al. (1998) that the TS is mediaeval. The 12 datings, furnished by the three laboratories, show a lack of homogeneity. We used the partial information about the locatio...
Article
Full-text available
The forward search is a powerful general method for detecting multiple masked outliers and for determining their effect on inferences about models fitted to data. From the monitoring of a series of statistics based on subsets of data of increasing size we obtain multiple views of any hidden structure. One of the problems of the forward search has...
Article
Full-text available
Multivariate outlier detection requires computation of robust distances to be compared with appropriate cut-off points. In this paper we propose a new calibration method for obtaining reliable cut-off points of distances derived from the MCD estimator of scatter. These cut-off points are based on a more accurate estimate of the extreme tail of the...
Article
We use the forward search to provide robust Mahalanobis distances to detect the presence of outliers in a sample of multivariate normal data. Theoretical results on order statistics and on estimation in truncated samples provide the distribution of our test statistic. We also introduce several new robust distances with associated distributional res...
Article
Full-text available
We address the problem of seasonal adjustment of a nonlinear transformation of the original time series, measured on a ratio scale, which aims at enforcing two essential features: additivity and orthogonality of the components. The posterior mean and variance of the seasonally adjusted series admit an analytic finite representation only for particu...