Content uploaded by Peter Rousseeuw

Author content

All content in this area was uploaded by Peter Rousseeuw on Jun 21, 2018

Content may be subject to copyright.

We propose the bagplot, a bivariate generalization of the univariate boxplot. The key notion is the halfspace location depth of a point relative to a bivariate dataset, which extends the univariate concept of rank. The “depth median” is the deepest location, and it is surrounded by a “bag” containing the n/2 observations with largest depth. Magnifying the bag by a factor of 3 yields the “fence” (which is not plotted). Observations between the bag and the fence are marked by a light gray loop, whereas observations outside the fence are flagged as outliers. The bagplot visualizes the location, spread, correlation, skewness, and tails of the data. It is equivariant for linear transformations, and not limited to elliptical distributions. Software for drawing the bagplot is made available for the S-Plus and MATLAB environments. The bagplot is illustrated on several datasets, for example in a scatterplot matrix of multivariate data.
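
As a rough formalization of the construction described in the abstract, the depth, the depth median, and the fence could be written as follows. The notation is assumed here for illustration ($X_n = \{x_1, \dots, x_n\}$ for the data, $T^*$ for the depth median, $B$ for the bag, $F$ for the fence) and may differ from the paper's. The sketch also assumes that the magnification is taken relative to the depth median.

$$
\operatorname{ldepth}(\theta; X_n) = \min_{\lVert u \rVert = 1} \#\{\, i : u^\top x_i \ge u^\top \theta \,\},
\qquad
T^* \in \arg\max_{\theta} \operatorname{ldepth}(\theta; X_n),
\qquad
F = \{\, T^* + 3\,(x - T^*) : x \in B \,\}.
$$

Under this reading, observations outside $F$ are the ones flagged as outliers, and the loop covers the observations that lie outside $B$ but inside $F$.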

... The dark blue area inside is called a bag, while the light-colored area is called a fence. The bagplot consists of a bag containing 50% of the data points, a fence separating inliers from outliers, and a loop indicating the points outside the bag but inside the fence [155]. The fence contains 75% of the data [153]. ...

... Types of the boxplot formed for the x and y-axis. The direction of the polygon formed in the Bag plot graph gives information about the correlation of the relevant variables [155]. Figure 7 shows bagplots of Concept 3 against Concept 2. A few words that are closest to the median, according to Fig. 7, are "windmil", "forc", "stator", "mean", "plural". ...

Renewable energy management is critical for obtaining a significant number of practical benefits. Wind energy is one of the most important sources of renewable energy. It is extremely valuable to manage this type of energy well and monitor its development. Data-driven analysis of wind energy technology provides essential clues for energy management. Patent documents are extensively used to follow technology development and find exciting patterns. Patent analysis is an excellent way to conduct a data-driven analysis of the technology under concern. This study aims to define concepts related to wind energy technologies and cluster these concepts to manage wind energy well in practice. Although many efforts have been made in the literature on wind energy, no study defines the concepts related to wind energy technologies and clusters these concepts. This study proposes a text mining and clustering-based patent analysis approach to overcome the limitations of previous studies. Data-driven analysis collects and assesses patent documents related to wind energy technologies. Patent documents are collected from the United States Patent and Trademark Office. Text mining is applied to the abstracts of patent documents, and the k-means clustering algorithm is utilized to determine the distribution of the keywords among the clusters. The results of this study show that the contents of the patent documents are mostly related to the tower, and the propeller blades placed at the top of the tower should rotate smoothly with the wind speed for better energy production.

... Outlier probabilities were quantified using a separate bootstrap procedure, specifically by resampling one dataset 100 times with replacement and detecting the outliers in each resampled dataset. For the outlier detection itself, we used first the bagplot algorithm [34], which is a bivariate extension of the boxplot and is suitable for molecular high-throughput data after dimension reduction [23]. The bagplot was applied separately to the point cloud of each study group in the two-dimensional space of principal components. ...
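
The bootstrap procedure described in this excerpt can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the excerpt does not say how the probability is computed, so the sketch assumes it is the share of resamples containing a sample in which that sample was flagged. A simple univariate median/MAD rule (`mad_outliers`, a hypothetical helper) stands in for the bagplot detector, and all names and the toy data are illustrative.

```python
import random
from statistics import median


def mad_outliers(values, k=3.0):
    """Stand-in detector (NOT the bagplot): flag values more than
    k scaled MADs away from the median."""
    med = median(values)
    mad = 1.4826 * median(abs(v - med) for v in values)
    return [abs(v - med) > k * mad for v in values]


def bootstrap_outlier_probability(values, detector, n_boot=100, seed=0):
    """Resample with replacement n_boot times, run the detector on each
    resample, and return for every original sample the fraction of
    resamples containing it in which it was flagged (assumed definition)."""
    rng = random.Random(seed)
    n = len(values)
    drawn = [0] * n    # resamples in which sample i appears
    flagged = [0] * n  # resamples in which sample i is flagged
    for _ in range(n_boot):
        idx = rng.choices(range(n), k=n)
        flags = detector([values[i] for i in idx])
        for i in set(idx):
            drawn[i] += 1
        for i in {i for i, is_out in zip(idx, flags) if is_out}:
            flagged[i] += 1
    return [f / d if d else float("nan") for f, d in zip(flagged, drawn)]


# Toy data: nine similar measurements and one gross outlier (last value).
data = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 10.4, 10.0, 25.0]
probs = bootstrap_outlier_probability(data, mad_outliers)
```

Passing the detector as a function keeps the resampling loop independent of the detection rule, so a bivariate detector could be swapped in without changing the loop.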

Outliers in the training or test set used to fit and evaluate a classifier on transcriptomics data can considerably change the estimated performance of the model. As a result, the reported accuracy is either too pessimistic or too optimistic, and the estimated model performance cannot be reproduced on independent data. It is then also doubtful whether a classifier qualifies for clinical usage. We estimate classifier performances in simulated gene expression data with artificial outliers and in two real-world datasets. As a new approach, we use two outlier detection methods within a bootstrap procedure to estimate the outlier probability for each sample and evaluate classifiers before and after outlier removal by means of cross-validation. We found that the removal of outliers changed the classification performance notably. For the most part, removing outliers improved the classification results. Taking into account the fact that there are various, sometimes unclear reasons for a sample to be an outlier, we strongly advocate always reporting the performance of a transcriptomics classifier with and without outliers in training and test data. This provides a more diverse picture of a classifier’s performance and prevents reporting models that later turn out not to be applicable for clinical diagnoses.

... They are used to describe and compare multivariate distributions w.r.t. location, dispersion, and shape, e.g. by a bagplot (Rousseeuw, Ruts and Tukey, 1999), to identify outliers of a distribution (Chen et al., 2009), to classify and cluster data (Hoberg, 2000; Lange, Mosler and Mozharovskyi, 2014a), to test for multivariate scale and symmetry (Dyckerhoff, 2002; Dyckerhoff, Ley and Paindaveine, 2015). Also, to measure multidimensional risk (Cascos and Molchanov, 2007), to handle constraints in stochastic optimization (Mosler and Bazovkin, 2014), and for quantile regression (Chakraborty, 2003; Hallin, Paindaveine and Šiman, 2010), among others. ...

... A classic idea to describe data is the "depth" concept of J. Tukey [27], who used it as a visualization technique for bivariate data distribution. This was connected to outlyingness by Barnett [2], and later Donoho and Gasko [8], and also forms the base for techniques such as the Bagplot [19]. While "convex hull peeling" [2] works well as a visualization technique for data from a single multivariate, but correlated distribution, it makes much less sense on data that contains multiple clusters. ...
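
To make the "convex hull peeling" mentioned in this excerpt concrete, here is a small sketch. It assumes the usual reading of the term: repeatedly remove the vertices of the convex hull and record the layer in which each point is removed. It illustrates hull peeling only, not halfspace depth contours, and the function names and toy data are illustrative.

```python
def convex_hull(points):
    """Vertices of the convex hull (Andrew's monotone chain).
    Points lying strictly inside a hull edge are not returned."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half_hull(seq):
        chain = []
        for p in seq:
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain[:-1]  # last point is the first point of the other half

    return half_hull(pts) + half_hull(reversed(pts))


def peel_layers(points):
    """Split the points into successive convex hull layers, outermost first."""
    remaining = list(points)
    layers = []
    while remaining:
        hull = set(convex_hull(remaining))
        layers.append([p for p in remaining if p in hull])
        remaining = [p for p in remaining if p not in hull]
    return layers


# Toy data: an outer square, an inner square, and a center point.
cloud = [(0, 0), (4, 0), (4, 4), (0, 4),
         (1, 1), (3, 1), (3, 3), (1, 3),
         (2, 2)]
layers = peel_layers(cloud)
```

On a single roughly convex cloud the layer index behaves like a crude depth. With several clusters, the outer layers span the gaps between clusters, which is the limitation the excerpt points out.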

Support Vector Machines have been successfully used for one-class classification (OCSVM, SVDD) when trained on clean data, but they work much worse on dirty data: outliers present in the training data tend to become support vectors, and are hence considered "normal". In this article, we improve the effectiveness to detect outliers in dirty training data with a leave-out strategy: by temporarily omitting one candidate at a time, this point can be judged using the remaining data only. We show that this is more effective at scoring the outlierness of points than using the slack term of existing SVM-based approaches. Identified outliers can then be removed from the data, such that outliers hidden by other outliers can be identified, to reduce the problem of masking. Naively, this approach would require training N individual SVMs (and training $O(N^2)$ SVMs when iteratively removing the worst outliers one at a time), which is prohibitively expensive. We will discuss that only support vectors need to be considered in each step and that by reusing SVM parameters and weights, this incremental retraining can be accelerated substantially. By removing candidates in batches, we can further improve the processing time, although it obviously remains more costly than training a single SVM.

... Assuming that these locations corresponded to larval release sites, a protocol was developed to define the boundary containing 95% probability of larval release. First, outliers were identified and excluded using a statistical method known as a "bagplot," or bivariate boxplot (Rousseeuw et al., 1990; Verboven & Hubert, 2010). The kernel density estimation function "ksdensity" of MATLAB was applied to the remaining data to produce a probability density function (PDF) representing the spatial distribution of female lobsters bearing late-stage and spent eggs. ...

During the 1990s, coastal habitat off southeastern Massachusetts (SEMA) supported commercially viable fisheries for American lobster (Homarus americanus). Over the past two decades, landings and post‐larval settlement of lobsters in this region, which is near the southern edge of the species' range, have declined substantially, concurrent with a period of significant warming of the coastal waters off southern New England. Previous work has suggested that rising ocean temperatures may adversely impact the survival of larval and early benthic phase (EBP) lobsters and may cause adult lobsters to seek cooler offshore waters during the critical time of larval release. To investigate the manner in which the observed decline in lobster abundance may be linked to warming coastal waters, a high‐resolution hydrodynamic model was used to quantify the increase in water temperature experienced by EBP lobster off SEMA and to supply input to an individual‐based model of lobster larval transport from release areas delineated using fishery‐dependent data of late‐stage egg‐bearing lobsters. The results indicate that rising coastal water temperatures may have adversely impacted EBP lobster recruitment off SEMA by (1) causing an offshore shift in the area of larval release that resulted in a reduction in the delivery of larvae to suitable nearshore EBP habitat and (2) dramatically increasing thermal stress experienced by recently settled EBP lobsters. These findings highlight the implications of warming coastal waters on southern New England lobster population connectivity and provide insight to an understudied mechanism by which climate change affects marine species recruitment.

... Studies often calculate descriptive statistics (e.g., mean, standard deviation) of individual farm characteristics, often after clustering farms into multiple groups (Todde et al. 2016). To explore bivariate relations in a dataset, bagplots, developed as an extension of univariate boxplots (Rousseeuw et al. 1999), are a useful tool. A bagplot separates points inside a loop (i.e., inliers) from points outside (i.e., outliers) and distinguishes a region (the "bag") within the loop that contains 50% of the points (Fig. 3). ...

A variety of statistical methods have been developed for multivariate analysis of agricultural systems. Some statistical methods are rarely used to study these systems, although they can contribute to issues such as identifying atypical farms, modeling relations among variables and describing farms with common characteristics. To address these issues, we reviewed studies that applied kernel density estimation (KDE), copula modeling and extreme value theory (EVT) to French dairy farm data. KDE helped identify joint value ranges of forage production and milk production or greenhouse gas emissions that most farms in a specific French region were likely to have. Copula modeling formalized the shapes of relations among farm characteristics, while EVT distinguished production strategies and management practices of farms that produced extreme amounts of forage. The present study reviews studies that applied these three methods, recommends when to use each of them and discusses their contribution to improving the understanding of dairy farms.

... The picture points out that this step converts the ICCs shape outlier problem into a magnitude outlier detection problem based on T(θ). Figure 2d presents the functional box-plot of the slopes distribution based on MBD. It is worth noticing that this phase can capture all the outliers that were not recognized in the previous step. Figure 2e, f exhibit all the outliers that were detected (in green) using SLOD in the transformed and original version of ICCs. Figure 3 shows the (non-functional) bivariate bagplot (Rousseeuw et al., 1999) of S1 for the difficulty and discrimination parameters of the parametric logistic model in Eq. (8). A bagplot is a bivariate generalization of the well-known box-plot. ...

The quality of assessment tests plays a fundamental role in decision-making problems in various fields such as education, psychology, and behavioural medicine. The first phase in the questionnaires' validation process is outliers' recognition. The latter can be identified at different levels, such as subject responses, individuals, and items. This paper focuses on item outliers and proposes a blended use of functional data analysis and item response theory for identifying outliers in assessment tests. The basic idea is that item characteristics curves derived from test responses can be treated as functions, and functional tools can be exploited to discover anomalies in item behaviour. For this purpose, this research suggests a multi-step strategy to catch magnitude and shape outliers employing a suitable transformation of item characteristics curves and their first derivatives. A simulation study emphasises the effectiveness of the proposed technique and exhibits exciting results in discovering outliers that classical functional methods do not detect. Moreover, the applicability of the method is shown with a real dataset. The final aim is to offer a methodology for improving the questionnaires' quality.

... Thus, the upper end of the whiskers should be Q3 + 1.5 IQR and the lower end of the whiskers should be Q1 − 1.5 IQR. If the maximum value of the dataset exceeds Q3 + 1.5 IQR or the minimum value falls below Q1 − 1.5 IQR, we call these data outliers, which indicates that they have exceeded the normal range [68,69]. ...
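
The univariate boxplot rule in this excerpt, with limits at Q1 − 1.5·IQR and Q3 + 1.5·IQR, can be sketched as follows. The quartile convention is an assumption (the "inclusive" method of Python's `statistics.quantiles`), since the excerpt does not state one. The function name and the toy data are illustrative.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return (lower limit, upper limit, outliers) for the k * IQR rule."""
    # Quartile convention is an assumption; other conventions shift the limits.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers


data = [2, 3, 4, 5, 6, 7, 8, 9, 30]
lower, upper, outliers = boxplot_outliers(data)
# With this convention Q1 = 4 and Q3 = 8, so the limits are -2 and 14
# and only the value 30 is flagged.
```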

To study global and regional environment protection and sustainable development and also to optimize mapping methods, it is of great significance to compare three existing 10 m resolution global land cover products in terms of accuracy: FROM-GLC10, the ESRI 2020 land cover product (ESRI2020), and the European Space Agency world cover 2020 product (ESA2020). However, most previous validations lack field collection points in large regions, especially in Southeast Asia, which has a cloudy and rainy climate, creating many difficulties in land cover mapping. In 2018 and 2019, we conducted a 56-day field investigation in Southeast Asia and collected 3326 points from different places. By combining these points and 14,808 other manual densification points in a stratified random sampling, we assessed the accuracy of the three land cover products in Southeast Asia. We also compared the impacts of the different classification standards, the different sample methods, and the different spatial distributions of the sample points. The results show that in Southeast Asia, (1) the mean overall accuracies of the FROM-GLC10, ESRI2020, and ESA2020 products are 75.43%, 79.99%, and 81.11%, respectively; (2) all three products perform well in croplands, forests, and built-up areas; ESRI2020 and ESA2020 perform well in water, but only ESA2020 performs well in grasslands; and (3) all three products perform badly in shrublands, wetlands, or bare land, as both the PA and the UA are lower than 50%. We recommend ESA2020 as the first choice for Southeast Asia’s land cover because of its high overall accuracy. FROM-GLC10 also has an advantage over the other two in some classes, such as croplands and water in the UA aspect and the built-up area in the PA aspect. Extracting the individual classes from the three products according to the research goals would be the best practice.

We estimate the influence of various factors on the life satisfaction of Czech seniors in a large survey sample. We find that good health, more education and awareness of voluntary work participation, employee satisfaction and being currently employed are the main factors that contribute to being satisfied with the current quality of life in the group of Czech seniors. Surprisingly, increasing self-reported financial sufficiency is negatively associated with the quality of life. The main factors contributing to life dissatisfaction are associated with being socially separated. The worst outcomes are recorded for those living in social homes and living alone. Any reported expectations of life changes (both positive and negative) are associated with a lower probability of life satisfaction.

Anomaly detection is a branch of machine learning and data analysis which aims at identifying observations that exhibit abnormal behaviour. Be it measurement errors, disease development, severe weather, production quality default(s) (items) or failed equipment, financial frauds or crisis events, their on-time identification, isolation and explanation constitute an important task in almost any branch of industry and science. By providing a robust ordering, data depth, a statistical function that measures the belongingness of any point of the space to a data set, becomes a particularly useful tool for the detection of anomalies. Already known for its theoretical properties, data depth has undergone substantial computational developments in the last decade and particularly in recent years, which has made it applicable to contemporary-sized problems of data analysis and machine learning. In this article, data depth is studied as an efficient anomaly detection tool, assigning abnormality labels to observations with lower depth values, in a multivariate setting. Practical questions of the necessity and reasonability of invariances and the shape of the depth function, its robustness and computational complexity, and the choice of the threshold are discussed. Illustrations include use-cases that underline the advantageous behaviour of data depth in various settings.

Classical least squares regression consists of minimizing the sum of the squared residuals. Many authors have produced more robust versions of this estimator by replacing the square by something else, such as the absolute value. In this article a different approach is introduced in which the sum is replaced by the median of the squared residuals. The resulting estimator can resist the effect of nearly 50% of contamination in the data. In the special case of simple regression, it corresponds to finding the narrowest strip covering half of the observations. Generalizations are possible to multivariate location, orthogonal regression, and hypothesis testing in linear models.
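
For simple regression, the criterion in this abstract (minimize the median, rather than the sum, of the squared residuals) can be illustrated with a brute-force sketch. It assumes a search over all lines through two data points, which only approximates the exact minimizer and is not necessarily the algorithm used in the article. The name `lms_line` and the toy data are illustrative.

```python
from itertools import combinations
from statistics import median


def lms_line(points):
    """Approximate least-median-of-squares line.

    Tries every line through two data points and keeps the one with the
    smallest median squared residual. Returns (criterion, slope, intercept).
    """
    best = None
    for (x1, y1), (x2, y2) in combinations(points, 2):
        if x1 == x2:
            continue  # vertical line, not representable as y = a*x + b
        slope = (y2 - y1) / (x2 - x1)
        intercept = y1 - slope * x1
        criterion = median((y - (slope * x + intercept)) ** 2
                           for x, y in points)
        if best is None or criterion < best[0]:
            best = (criterion, slope, intercept)
    return best


# Seven points on y = 2x + 1 plus three gross outliers (30% contamination).
clean = [(x, 2 * x + 1) for x in range(7)]
contaminated = clean + [(7, 0), (8, -5), (9, 40)]
criterion, slope, intercept = lms_line(contaminated)
```

Because more than half of the points lie exactly on one line, the median squared residual of that line is zero, so the three outliers do not affect the fit.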

The half-space depth of a point $\theta$ relative to a bivariate data set $\{x_1, \dots, x_n\}$ is given by the smallest number of data points contained in a closed half-plane of which the boundary line passes through $\theta$. A straightforward algorithm for the half-space depth needs $O(n^2)$ steps. The simplicial depth of $\theta$ relative to $\{x_1, \dots, x_n\}$ is given by the number of data triangles $\Delta(x_i, x_j, x_k)$ that contain $\theta$; this appears to require $O(n^3)$ steps. The algorithm proposed here computes both depths in $O(n \log n)$ time, by combining geometric properties with certain sorting and updating mechanisms. Both types of depth can be used for data description, bivariate confidence regions, p-values, quality indices and control charts. Moreover, the algorithm can be extended to the computation of depth contours and bivariate sign test statistics.
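
The two definitions above can be turned into the straightforward quadratic and cubic procedures that the abstract mentions as baselines. The sketch below is a naive illustration, not the fast algorithm proposed in the paper. Function names are illustrative, and the simplicial count assumes the data are in general position (no three collinear points).

```python
from itertools import combinations


def halfspace_depth(theta, points):
    """Naive O(n^2) half-space depth of theta relative to points.

    For every line through theta and a data point, count the points strictly
    on each side and on each ray of the line; rotating the line slightly
    sends each ray to one side, which covers all closed half-planes.
    """
    tx, ty = theta
    shifted = [(x - tx, y - ty) for x, y in points]
    at_theta = sum(1 for d in shifted if d == (0, 0))
    others = [d for d in shifted if d != (0, 0)]
    if not others:
        return at_theta
    best = len(others)
    for ux, uy in others:
        left = right = forward = backward = 0
        for vx, vy in others:
            cross = ux * vy - uy * vx
            if cross > 0:
                left += 1
            elif cross < 0:
                right += 1
            elif ux * vx + uy * vy > 0:
                forward += 1
            else:
                backward += 1
        best = min(best, min(left, right) + min(forward, backward))
    return best + at_theta  # points equal to theta lie in every half-plane


def simplicial_depth(theta, points):
    """Naive O(n^3) count of closed data triangles containing theta."""
    def orient(a, b, c):
        return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])

    count = 0
    for a, b, c in combinations(points, 3):
        signs = (orient(a, b, theta), orient(b, c, theta), orient(c, a, theta))
        if not (any(s < 0 for s in signs) and any(s > 0 for s in signs)):
            count += 1
    return count


corners = [(0, 0), (2, 0), (2, 2), (0, 2)]
square = corners + [(1, 1)]
```

For example, `halfspace_depth((1, 1), square)` counts the center plus two corners for any generic line through the center, while a hull vertex such as `(0, 0)` has depth 1 and a point outside the hull has depth 0.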

In this paper we construct an exact algorithm for computing depth contours of a bivariate data set. For this we use the half-space depth introduced by Tukey. The depth contours form a nested collection of convex sets. The deeper the contour, the more robust it is with respect to outliers in the point cloud. The proposed algorithm has been implemented in a program called ISODEPTH, which needs little computation time and is illustrated on some real data examples. Finally, it is shown how depth contours can be used to construct robustified versions of classification techniques based on convex hulls.

Many statistical methods involve summarizing a probability distribution by a region of the sample space covering a specified probability. One method of selecting such a region is to require it to contain points of relatively high density. Highest density regions are particularly useful for displaying multimodal distributions and, in such cases, may consist of several disjoint subsets - one for each local mode. In this paper, I propose a simple method for computing a highest density region from any given (possibly multivariate) density f(x) that is bounded and continuous in x. Several examples of the use of highest density regions in statistical graphics are also given. A new form of boxplot is proposed based on highest density regions; versions in one and two dimensions are given. Highest density regions in higher dimensions are also discussed and plotted.
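
A highest density region can be illustrated in one dimension with a simple grid sketch: keep the grid cells with the highest density until they hold the requested probability. This is an assumed, simplified construction for illustration and not necessarily the method proposed in the paper. The bimodal test density and all names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def bimodal(x):
    """Illustrative density: equal mixture of N(-2, 0.5^2) and N(2, 0.5^2)."""
    return 0.5 * normal_pdf(x, -2.0, 0.5) + 0.5 * normal_pdf(x, 2.0, 0.5)


def hdr_intervals(density, lo, hi, coverage=0.5, steps=4000):
    """Grid approximation of a highest density region on [lo, hi].

    Cells are added in order of decreasing density until their total
    probability reaches `coverage`; adjacent cells are merged into intervals.
    """
    dx = (hi - lo) / steps
    heights = [density(lo + (i + 0.5) * dx) for i in range(steps)]
    inside = [False] * steps
    mass = 0.0
    for i in sorted(range(steps), key=heights.__getitem__, reverse=True):
        if mass >= coverage:
            break
        inside[i] = True
        mass += heights[i] * dx
    intervals, start = [], None
    for i, flag in enumerate(inside + [False]):  # sentinel closes the last run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            intervals.append((lo + start * dx, lo + i * dx))
            start = None
    return intervals


regions = hdr_intervals(bimodal, -6.0, 6.0, coverage=0.5)
```

For this bimodal density the 50% region splits into two disjoint intervals, one around each mode, which is the behaviour the abstract describes for multimodal distributions.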

The boxplot has proven to be a very useful tool for summarizing univariate data. Several options of bivariate boxplot-type constructions are discussed. These include both elliptic and asymmetric plots. An inner region contains 50% of the data, and a fence identifies potential outliers. Such a robust plot shows location, scale, correlation, and a resistant regression line. Alternative constructions are compared in terms of efficiency of the relevant parameters. Additional properties are given and recommendations made. Emphasis is given to the bivariate biweight M estimator. Several practical examples illustrate that standard least squares ellipsoids can give graphically misleading summaries.

This note introduces the rangefinder box plot, an extension of the familiar box plot.

Box plots display batches of data. Five values from a set of data are conventionally used; the extremes, the upper and lower hinges (quartiles), and the median. Such plots are becoming a widely used tool in exploratory data analysis and in preparing visual summaries for statisticians and nonstatisticians alike. Three variants of the basic display, devised by the authors, are described. The first visually incorporates a measure of group size; the second incorporates an indication of rough significance of differences between medians; the third combines the features of the first two. These techniques are displayed by examples.