European Journal of Scientific Research
ISSN 1450-216X Vol.75 No.3 (2012), pp. 327-339
© EuroJournals Publishing, Inc. 2012
http://www.europeanjournalofscientificresearch.com
A Five Step Procedure for Outlier Analysis in Data Mining
V. Ilango
Department of Computer Applications
New Horizon College of Engineering, Bangalore-560103, India
E-mail: banalysist@yahoo.com
Tel: +91-080-6629777
R. Subramanian
Department of Computer Science, Pondicherry University, Pondicherry, India
E-mail: rsmanian.csc@pondiuni.edu.in
V. Vasudevan
Department of Information Technology, Kalasalingam University, Srivilliputtur, India
E-mail: v.vasudevan@klu.ac.in
Abstract
Nowadays, outlier detection is primarily studied as an independent knowledge discovery process, largely because outliers may indicate interesting events that were previously unknown. Despite the advances made, many issues in outlier detection remain open or not yet completely resolved. Outlier detection is an important data mining task and deserves more attention from the data mining community. There are “good” outliers that provide useful information leading to the discovery of new knowledge, and “bad” outliers that are merely noisy data points. Distinguishing between different types of outliers is an important issue in many applications; it requires not only an understanding of the mathematical properties of the data but also relevant knowledge of the domain context in which the outliers occur. We propose a novel five step procedure for outlier analysis along with a comprehensive review of existing outlier detection techniques. The paper ends by addressing some important issues and open questions that can be the subject of future research. This paper should be helpful in guiding the choice of outlier analysis techniques for unsupervised machine learning research.
Keywords: Univariate, Multivariate, Parametric, Nonparametric, Detection rate, False alarm rate, ROC curve
1. Introduction
Outliers are present in virtually every data set in any application domain, and the identification of
outliers has a hundred-year-long history. A number of definitions have been compiled in [104]; the most important are quoted here. “An observation (or subset of observations) which appears to be inconsistent with the remainder of that set of data” [7]. “An outlying observation, or ‘outlier’, is one that appears to deviate markedly from other members of the sample in which it occurs” [32]. “An outlier is an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [38]. Outlier detection methods have been suggested for numerous
applications, such as credit card fraud detection, clinical trials, voting irregularity analysis, data cleansing, network intrusion detection, severe weather prediction, geographic information systems, athlete performance analysis, medicine and public health, sports statistics, measurement error detection, loan application processing, activity monitoring, network performance monitoring, fault diagnosis, structural defect detection, satellite image analysis, time-series data analysis, medical condition monitoring, pharmaceutical research, and other data-mining tasks [43]. Outliers
may lead to the discovery of unexpected knowledge. In the 1880s when the English physicist Rayleigh
measured nitrogen from different sources, he found that there were small discrepancies among the
density measurements. After closer examination, he discovered that the density of nitrogen obtained
from the atmosphere was always greater than the nitrogen derived from its chemical compounds by a
small but definite margin. He reasoned from this anomaly that the aerial nitrogen must contain a small
amount of a denser gas. This discovery eventually led to the successful isolation of the gas argon, for
which he was awarded the Nobel Prize in Physics in 1904; the discrepancy he pursued was, in effect, an outlier that produced a good outcome [30]. [9] [17] [23] [92] [105] [34] [60] provide extensive surveys of outlier
detection techniques developed in machine learning and statistical domains. Our survey tries to provide
a structured and comprehensive overview of outlier analysis techniques. Outliers arise from the natural variability of a data set, measurement error, recording error by users, and execution error [45]. Robust estimation in the presence of outliers in a given sample is a critical problem [100].
This study addresses the following questions: a) how to define abnormality detection; b) how to minimize computational cost (processing time, storage and I/O); and c) how to eliminate or minimize the impact of outliers on the performance of an information system while discovering new knowledge from hidden data. The important issues associated with outliers are detecting the outlier and deciding what to do once it has been detected. Outlier detection involves identifying the time of occurrence, which may not be known, as well as recognizing the type of outlier [57]. A key challenge in outlier detection is that it involves exploring the unseen space: it is hard to enumerate all possible normal behaviors in an application. Handling noise is another challenge. Noise may distort the normal objects and blur the distinction between normal objects and outliers; it may help to hide outliers and reduce the effectiveness of outlier detection [87]. The scope of this paper is modest: to provide a bird’s-eye view of outlier analysis techniques with a focus on unsupervised learning methods. The contributions of this paper are listed below. Section two describes the five step procedure for outlier analysis. Part (a) briefly explains different data sets and data cleaning measures. Part (b) broadly describes various techniques for outlier detection based on the unsupervised approach. Part (c) discusses methods of outlier representation. Part (d) narrates techniques for profiling and describing the detected outliers. Part (e) explores measures for the evaluation of outliers. Finally, the paper concludes with suggestions for future research. To the best of our knowledge, this survey is an attempt to provide a structured and comprehensive overview of outlier analysis techniques.
Figure 1: Five Step Procedure of Outlier Analysis (Data Sets → Data Cleaning → Outlier Detection → Representation of Outlier → Profiling of Detected Outlier → Outlier Handling/Evaluation)
2. Outlier Analysis Procedure
Figure 1 presents the proposed five step outlier analysis procedure, covering data sets, data cleaning, outlier detection, representation, profiling, and handling/evaluation. Each step is explained in detail as follows. a). Data Sets: Data sets are important for outlier analysis. They come in different types, such as nominal, ordinal, interval, ratio, binary, continuous, discrete, transaction, spatial, spatio-temporal, sequence, and time series data [70]. Data Cleaning: Identifying missing values is one of the data cleaning processes. Missing values create difficulties for data analysis. They can be handled by ignoring the record, filling missing values manually, filling them with a global constant, filling them with the attribute mean, or filling them with the attribute mean of all samples belonging to the same class as the given tuple [69] [76]. b). Outlier Detection Techniques: In the last decade numerous outlier detection methods have been proposed. The layout of outlier mining techniques is shown in Figure 2; the main focus here is on unsupervised outlier detection methods. Some outlier detection techniques are generic while others serve a specific purpose [11] [41]. Outlier detection approaches can be classified into three categories: supervised, semi-supervised and unsupervised. Techniques trained in supervised mode assume the availability of a training data set with labeled instances for both the normal and the anomaly class; the typical approach is to build a predictive model for normal vs. anomaly classes [66]. The unsupervised approach [86] does not require training data: it takes a set of unlabeled data as input and attempts to find outliers within it. Many semi-supervised techniques [10][85], which assume that the training data has labeled instances for only the normal class, can be adapted to operate in an unsupervised mode by using a sample of the unlabeled data set as training data. Table 1 summarizes the advantages, drawbacks, techniques and tools for these outlier detection methods.
Unsupervised learning approaches can be further grouped into parametric and non-parametric methods [28]. Parametric methods assume that the whole data set can be modeled by one standard statistical (e.g., normal) distribution; a point that deviates significantly from the model is declared an outlier. Non-parametric methods make no assumption about the statistical properties of the data and instead identify outliers based on full-dimensional distance measures between points.
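To make the distinction concrete, the following sketch (our illustration, not from the paper; the thresholds are arbitrary) contrasts a parametric z-score rule with a non-parametric nearest-neighbour rule on one-dimensional data:

```python
import numpy as np

def zscore_outliers(x, k=3.0):
    """Parametric: assume a single normal model and flag points whose
    absolute z-score exceeds k."""
    z = np.abs(x - x.mean()) / x.std()
    return x[z > k]

def nn_distance_outliers(x, threshold):
    """Non-parametric: no distributional assumption; flag points whose
    distance to their nearest neighbour exceeds a threshold."""
    d = np.abs(x[:, None] - x[None, :])
    np.fill_diagonal(d, np.inf)        # ignore self-distances
    return x[d.min(axis=1) > threshold]

data = np.concatenate([np.full(19, 10.0), [100.0]])
print(zscore_outliers(data))           # -> [100.]
print(nn_distance_outliers(data, 5.0)) # -> [100.]
```

Both rules agree here, but they can diverge: the z-score rule breaks down when the normality assumption fails, while the nearest-neighbour rule depends only on the chosen distance threshold.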
Figure 2: Hierarchical Structure of Outlier Detection Methods
A Five Step Procedure for Outlier Analysis in Data Mining 330
Table 1: Machine Learning Outlier Techniques

Supervised
  Definition: Requires knowledge of both the normal and anomaly classes; builds a classifier to distinguish normal data from known anomalies.
  Advantages: Models that can be easily understood; high accuracy in detecting many kinds of known anomalies.
  Drawbacks: Requires labels for both the normal and anomaly classes; cannot detect unknown and emerging anomalies.
  Techniques: Artificial neural networks, Bayesian statistics, rule-based models, RBF, Ripper, SOM, decision tree learning.

Semi-supervised
  Definition: Requires knowledge of the normal class only; uses a modified classification model to learn normal behavior and then detects any deviation from normal behavior as anomalous.
  Advantages: Models that can be easily understood; normal behavior can be accurately learned.
  Drawbacks: Requires labels for the normal class; possibly high false alarm rate, since previously unseen (yet legitimate) data records may be recognized as anomalies.
  Techniques: One-class SVM, hidden Markov models.

Unsupervised
  Definition: Assumes the normal objects are somewhat “clustered” into multiple groups, each having some distinct features; an outlier is expected to be far away from any group of normal objects.
  Drawbacks: Typically suffers from a higher false alarm rate, because the underlying assumptions often do not hold; cannot detect collective outliers effectively, since normal objects may not share any strong patterns while the collective outliers may share high similarity in a small area.
  Techniques: K-means, EM, WaveCluster, CLAD, CBLOF, LOF, COF, DBSCAN, SNN.

Software tools: Weka, SPSS, SAS, Tanagra, SYSTAT, MATLAB, Minitab, LISREL, MedCalc, R
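As a minimal illustration of the unsupervised column (our sketch, loosely in the spirit of CBLOF; the cluster labels are assumed to come from any clustering algorithm such as k-means or DBSCAN):

```python
import numpy as np

def cluster_distance_scores(X, labels, min_size=2):
    """Score each point by its distance to the nearest centroid of a
    'large' cluster (>= min_size members); points far from every group
    of normal objects receive large scores."""
    centroids = np.array([X[labels == c].mean(axis=0)
                          for c in np.unique(labels)
                          if np.sum(labels == c) >= min_size])
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1)

# Two tight groups plus one isolated point (assigned its own tiny
# cluster, label 2, by the hypothetical clusterer).
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10],
              [50, 50]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1, 2])
scores = cluster_distance_scores(X, labels)
print(scores.argmax())  # -> 6 (the isolated point)
```

Ignoring tiny clusters when building centroids reflects the intuition in the table: members of small or sparse clusters are themselves outlier candidates, so they are scored against the large groups only.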
Table 2: Comparison of Outlier Detection Techniques for Simple Data Sets (technique basis, data type and reference)

Distribution-based: Numeric [32] [7] [27]; Mixed-type [51]
Depth-based: Numeric [80]
Graph-based: Mixed-type [59]
Clustering-based: Numeric [22] [61] [102] [21]
Distance-based: Numeric [52] [77] [8]
Density-based: Numeric [14] [20] [50] [96] [95] [83] [81] [78] [28] [53]
Neural network based: Numeric [37] [29]
SVM-based: Numeric [63] [62]
Table 5: Comparison of Univariate, Bivariate and Multivariate Data

Definition
  Univariate: Describes a case in terms of a single variable, i.e., the distribution of the attribute; does not deal with causes or relationships.
  Bivariate: Analysis of two variables simultaneously; focus is on the variables and their empirical relationships; deals with causes or relationships.
  Multivariate: Analysis of two or more variables simultaneously.

Graphical representation
  Univariate: Bar graph, histogram, pie chart, line graph, box-and-whisker plot, stem-and-leaf plot, Q-Q plot, violin plot.
  Bivariate: Scatter plot.
  Multivariate: Multivariate profile, Andrews Fourier transformation, Chernoff faces, scatter plot, contour plot.

Methods
  Univariate: Single and sequential procedures, inward and outward procedures, univariate robust measures.
  Bivariate: -
  Multivariate: Masking effect, swamping effect, multivariate robust measures.

Computational measures
  Univariate: Measures of central tendency, dispersion and skewness; one-way ANOVA; index numbers.
  Bivariate: Simple correlation and regression, two-way ANOVA, association of attributes.
  Multivariate: Multiple regression and correlation, multiple discriminant analysis, MANOVA, canonical analysis, factor analysis, cluster analysis, PCA.
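As a worked multivariate example (our sketch; the classical, non-robust sample covariance is used), the Mahalanobis distance flags a point that is unremarkable on each axis alone but violates the joint correlation structure, something no univariate measure in the table above can do:

```python
import numpy as np

def mahalanobis_scores(X):
    """Multivariate outlier score: Mahalanobis distance of each row
    from the sample mean, using the sample covariance matrix."""
    mu = X.mean(axis=0)
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, inv_cov, diff))

# Points roughly on the line y = x; the last point is moderate on each
# axis separately but breaks the correlation.
X = np.array([[0, 0.1], [1, 0.9], [2, 2.1], [3, 2.9], [4, 4.0], [4, 0.0]])
scores = mahalanobis_scores(X)
print(scores.argmax())  # -> 5
```

Note that masking and swamping (listed under multivariate methods) afflict exactly this estimator: outliers inflate the sample covariance, which is why robust alternatives are listed alongside it.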
Outliers are then the points that are distant from their neighbors in the data set. Compared to parametric methods, non-parametric methods are more flexible and autonomous because they require no knowledge of the data distribution [5][107]. Some computational measures for parametric and nonparametric tests are discussed in Table 4.
Table 4: Parametric and Nonparametric Tests for Outlier Analysis

Compare the mean value of two samples for some variable of interest:
  Parametric: t-test for independent samples.
  Nonparametric: Wald-Wolfowitz runs test, Mann-Whitney U test, Kolmogorov-Smirnov two-sample test.
Multiple groups:
  Parametric: Analysis of variance (ANOVA/MANOVA).
  Nonparametric: Kruskal-Wallis analysis of ranks, median test.
Compare two variables measured in the same sample:
  Parametric: t-test for dependent samples.
  Nonparametric: Sign test, Wilcoxon matched-pairs test.
More than two variables measured in the same sample:
  Parametric: Repeated-measures ANOVA.
  Nonparametric: Friedman two-way analysis of variance, Cochran Q.
Two variables of interest are categorical:
  Parametric: Correlation coefficient.
  Nonparametric: Spearman R, Kendall tau, coefficient gamma, chi-square, phi coefficient, Fisher exact test, Kendall coefficient of concordance.
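The contrast between the two families in Table 4 can be sketched in plain Python (our illustration; only the test statistics are computed, significance thresholds are omitted):

```python
import math

def t_statistic(a, b):
    """Parametric two-sample t statistic (pooled, equal-variance form);
    assumes the data are roughly normal."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / (sp * math.sqrt(1 / na + 1 / nb))

def mann_whitney_u(a, b):
    """Nonparametric Mann-Whitney U statistic: counts pairs where a
    value from the first sample exceeds one from the second (ties count
    one half); only order matters, so one extreme outlier cannot
    dominate it the way it can dominate a mean and variance."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in a for y in b)

print(round(t_statistic([1, 2, 3], [4, 5, 6]), 3))  # -> -3.674
print(mann_whitney_u([3, 4, 5], [1, 2]))            # -> 6.0
```

This rank-based robustness is exactly why the nonparametric column is preferred when outliers or non-normality are suspected.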
Table 3: Comparison of Outlier Detection Techniques for Complex Data Sets (technique basis, data type and reference)

Subspace-based: Numeric [3] [106] [65]
Distance-based: Numeric [4] [31] [18]
Graph-based: Mixed-type [25] [58] [39]
Clustering-based: Sequence [15]
Tree-based: Sequence [73]
Distribution-based: Spatial [84] [16] [54] [94] [47]
Model-based: Streams [102] [101]
Graph-based: Streams [82]
Density-based: Streams [75] [93]
Clustering & distribution: Spatio-temporal [19] [12]
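A minimal distance-based scorer in the spirit of the k-nearest-neighbour entries above (our sketch; the brute-force pairwise distance matrix is fine for small data but quadratic in memory, which is why the indexed and grid-based variants cited in the tables exist):

```python
import numpy as np

def knn_outlier_scores(X, k=2):
    """Distance-based outlier score: distance to the k-th nearest
    neighbour; points in sparse regions receive large scores."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    d_sorted = np.sort(d, axis=1)  # column 0 holds the zero self-distance
    return d_sorted[:, k]

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [9, 9]], dtype=float)
scores = knn_outlier_scores(X, k=2)
print(scores.argmax())  # -> 4
```

Because a single global distance scale is used, this scorer inherits the weakness noted later for distance-based approaches: it struggles with local outliers in data of diverse densities.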
iii). Univariate, Bivariate and Multivariate Data Sets: The technical properties of the different data sets are described in Table 2, Table 3 and Table 5. Univariate analysis is the simplest form of quantitative (statistical) analysis. A basic way of presenting univariate data is to create a frequency distribution of the individual cases, presenting the number of occurrences of each value of the variable studied in the sample; this can be done in a table, with a bar chart, or in a similar form of graphical representation [44]. Bivariate analysis examines two variables to determine the empirical relationship between them; common forms involve creating a percentage table, a scatter plot, or computing a simple correlation coefficient [42]. Multivariate analysis (MVA) is based on the analysis of two or more variables simultaneously [1][35]. Table 6 summarizes the classification of outlier detection methods, their strengths and weaknesses, and related algorithms [2][6][33][40][46][48][49][55][56][64][67][68][71][72][74][88][89][90][98][99]. c). Outlier Representation Stage: Once an outlier is detected, it must be represented in an understandable form, typically as a visual, graphical display [26]. The goal of visualization is the interpretation of the visualized information. [108] explains the following principles for effective graphical display: apprehension, clarity, consistency, efficiency, necessity and truthfulness. [109] adds principles for graphical excellence: graphical excellence consists of complex ideas communicated with clarity, precision and efficiency, and it requires telling the truth about the data. Identified outliers can be represented using these principles. d). Profiling and Outlier Description: Examples of outliers abound in social as well as scientific contexts. Outliers can also indicate interesting events that have never been known before; hence, detecting outliers may lead to the discovery of critical information contained in the data. In
such cases, uncovering the underlying cause(s) is necessary. For example, if one happens to find a UFO (Unidentified Flying Object), throwing it away is obviously not a good idea; studying its structure to understand its flying mechanism is much more interesting and beneficial [79]. Once the outliers have been identified, the analyst should generate a profile of each outlier observation and carefully examine the data for the variables responsible for its being an outlier. In addition, the analyst can apply statistical and mathematical methods to characterize the differences between the outliers and the other observations. Retention and Deletion: After the outliers have been identified, profiled and categorized, the analyst must decide on the retention or deletion of each one. Whether an outlier is deleted or accommodated depends on the application domain, the type of data set and the researcher, and requires strong domain knowledge. To reduce the weight of outliers, there are several options: first, if there are only a few outliers, we may simply delete those values, so they become blank or missing values; second, if there are too many outliers in a variable, or if the variable is not needed, we can delete the variable; third, we can transform the values or variables. After dealing with the outliers, we re-run the outlier analysis procedure to determine whether the data are outlier free. Sometimes new outliers emerge because they were masked by the old outliers; the data are different after removing the old outliers, so existing extreme data points may now qualify as outliers [110]. If new outliers emerge and we want to reduce their influence, we choose one of the above options again, re-run the outlier analysis, and repeat until the data are outlier free [91]. e). Evaluation of the Outlier: Evaluating detected outliers is an important task in data analysis. Several measures exist; some are discussed below. Detection Rate, False Alarm Rate and ROC Curves: intuitively, the detection rate gives the fraction of outliers that are correctly identified, while the false alarm rate gives the fraction of normal data records misclassified as outliers. The most widely used tool to assess the accuracy of detection techniques is the ROC (Receiver Operating Characteristic) curve. Computational Complexity: the efficiency of outlier detection techniques can be evaluated by their computational cost, i.e., time and space complexity. In addition, the amount of memory required to execute an outlier detection technique is an important performance evaluation metric [24][36][60].
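These two rates can be computed directly from labeled evaluation data (our sketch; 1 marks an outlier and 0 a normal record; sweeping a score threshold and plotting the resulting pairs traces the ROC curve):

```python
def detection_and_false_alarm(y_true, y_pred):
    """Detection rate = flagged true outliers / all true outliers;
    false alarm rate = normal records flagged as outliers / all normals."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    n_out = sum(y_true)
    n_norm = len(y_true) - n_out
    return tp / n_out, fp / n_norm

# Two true outliers, three normals; the detector catches one outlier
# and raises one false alarm.
dr, far = detection_and_false_alarm([1, 1, 0, 0, 0], [1, 0, 1, 0, 0])
print(dr, far)  # -> 0.5 0.3333333333333333
```

A perfect detector sits at the top-left corner of the ROC plane (detection rate 1, false alarm rate 0); random guessing lies on the diagonal.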
Table 6: Outlier Detection Approaches

Statistical tests
  Definition: Statistical methods assume that normal data follow some statistical model; data not following the model are outliers.
  Strength: Utilize existing statistical modeling techniques to model various types of distributions.
  Weakness: Most tests are for a single attribute; with high dimensions it is difficult to estimate distributions; parametric assumptions often do not hold for real data sets.
  Applications: Fraud detection, intrusion detection, medical and health.
  Methods/Algorithms: Parametric and non-parametric tests.

Depth-based approaches
  Definition: Search for outliers at the border of the data space, independent of statistical distributions; organize data objects in convex hull layers, with outliers on the outer layers.
  Strength: Avoid the problem of fitting a data distribution; no assumption of a probability distribution and no distance function required.
  Weakness: Inefficient for large, high-dimensional data sets, where the convex hull is harder to discern and computationally more expensive.
  Applications: Environmental monitoring, localization and tracking, logistics and transportation.
  Methods/Algorithms: ISODEPTH, FDC, Minimum Volume Ellipsoid (MVE), convex peeling.

Distance-based approaches
  Definition: The concept of a distance-based outlier relies on the notion of the neighborhood of a point, typically its k nearest neighbors.
  Strength: Avoid the excessive computation associated with fitting the observed distribution to some standard distribution and with selecting discordancy tests.
  Weakness: Suffer when detecting local outliers in a data set with diverse densities.
  Applications: Intrusion detection, environmental monitoring, medical and health care data.
  Methods/Algorithms: Index-based, nested-loop based, grid-based.

Density-based approaches
  Definition: Estimate the density distribution of the data and identify outliers as points lying in low-density regions.
  Strength: Can detect outliers that would be missed by techniques with a single, global criterion.
  Weakness: Parameter selection for the upper and lower bounds is difficult.
  Applications: Intrusion detection, environmental monitoring, medical and health care, localization and tracking.
  Methods/Algorithms: Local outlier factor, k-distance, k-distance neighborhood, reachability distance.

Cluster-based approaches
  Definition: Find groups of strongly related objects; an object is an outlier if it does not belong to any cluster, if there is a large distance between the object and its closest cluster, or if it belongs to a small or sparse cluster.
  Strength: Detect outliers without requiring labeled data; work for many types of data; clusters can be regarded as summaries of the data, so once clusters are obtained each object need only be compared against them (fast).
  Weakness: Effectiveness depends highly on the clustering method used, which may not be optimized for outlier detection; high computational cost.
  Applications: Fraud detection, intrusion detection, medical and health, environmental monitoring, localization and tracking.
  Methods/Algorithms: BIRCH, CLARANS, DBSCAN, GDBSCAN, OPTICS, PROCLUS, CBLOF.
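To make the density-based row concrete, here is a compact, simplified Local Outlier Factor in NumPy (our sketch of the standard formulation; production code would use spatial indexing rather than a full distance matrix):

```python
import numpy as np

def lof_scores(X, k=2):
    """Simplified Local Outlier Factor: ratio of the average local
    reachability density of a point's k nearest neighbours to its own;
    scores well above 1 mark points in locally sparse regions."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    knn = np.argsort(d, axis=1)[:, :k]       # indices of k nearest neighbours
    k_dist = np.sort(d, axis=1)[:, k - 1]    # k-distance of every point
    # Reachability distance of p w.r.t. neighbour o: max(k-dist(o), d(p, o)).
    reach = np.maximum(k_dist[knn], d[np.arange(n)[:, None], knn])
    lrd = 1.0 / reach.mean(axis=1)           # local reachability density
    return lrd[knn].mean(axis=1) / lrd

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [5, 5]], dtype=float)
scores = lof_scores(X, k=2)
print(scores.argmax())  # -> 4
```

Points inside the tight cluster score close to 1 (their density matches their neighbours'), while the isolated point scores well above 1, illustrating why LOF catches local outliers that a single global distance threshold would miss.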
3. Conclusions
In this paper we have proposed a five step procedure for outlier analysis and tried to provide a broad view of the latest techniques associated with each step, although we obviously cannot describe all approaches in a single paper. The study focuses mainly on outlier detection techniques for low-dimensional, simple static data, followed by some recent advances in outlier detection for high-dimensional data. Based on our review, we observe that the notion of an outlier differs across application domains. Outlier detection is an extremely important problem that involves exploring unseen spaces. Some outlier detection techniques are developed generically and can be ported to various application domains, while others directly target a particular domain. Outlier detection is an important research problem in data mining that aims to discover useful abnormal and irregular patterns hidden in large data sets. Most existing outlier detection methods deal only with static data of relatively low dimensionality. Recently, outlier detection for high-dimensional stream data, ensemble outlier detection, and subspace outlier mining, which address concept drift, dimension reduction, and detection result visualization, have emerged as new research problems. Little research has been done on categorical data, and robust methods for exploring interesting patterns remain to be developed; this necessitates relevant new approaches to handle the issue. These points show that outlier detection is a very active field of data mining research, and extensive study will bring many benefits to the various practical applications mentioned above.
References
[1] A.C. Atkinson et al., “Exploring Multivariate Data with the Forward Search”. Springer-Verlag, New York. (2004).
[2] Agarwal, D., “Detecting anomalies in cross-classified streams: a Bayesian approach,” Knowl.
Inf. Syst., vol. 11, no. 1, pp. 29–44, 2006.
[3] Aggarwal and P. S. Yu. An effective and efficient algorithm for high-dimensional outlier detection. International Journal on Very Large Data Bases, 14(2):211-221, 2005.
[4] Angiulli and C. Pizzuti. Outlier mining in large high-dimensional data sets. IEEE Transactions on Knowledge and Data Engineering, 17(2):203-215, 2005.
[5] Angela Hebel, “Parametric vs Nonparametric Statistics: when to use them and which is more powerful?”, University of Maryland, April 2002.
[6] Arning, R. Agrawal, and P. Raghavan, “A Linear Method for Deviation Detection in Large Databases,” Proc. Int'l Conf. Knowledge Discovery and Data Mining, 1996, pp. 164-169.
[7] Barnett V, Lewis T. Outliers in Statistical Data. New York, NY: John Wiley & Sons; 1994.
[8] Bay and M. Schwabacher (2003) Mining distance-based outliers in near linear time with
randomization and a simple pruning rule. In: Proceedings of ACM SIGKDD, pp 29-38.
[9] Ben-Gal I. Outlier detection. In: Maimon O, Rockach L, eds. Data Mining and Data Discovery
Handbook: A Complete Guidance for Practitioners and Researchers. Springer, US; 2005,
I:131–146.
[10] Bennett K, Demiriz A (1998) Semi-supervised support vector machines. Adv Neural Inf
Process Syst 12:368–374.
[11] Bhuyan, M. H., Bhattacharyya, D. K., and Kalita,J. K. (2011) Rodd: An effective reference
based outlier detection technique for large datasets. LNCS-CCIS, 133, Part I, 76–84.
[12] Birant, A. Kut (2006) Spatio-temporal outlier detection in large databases. In: Proceedings of ITI.
[13] Borgwardt, K. M., Ong, C. S., Schönauer, S., Vishwanathan, S. V. N., Smola, A. J., and Kriegel, H.-P., “Protein function prediction via graph kernels,” Bioinformatics, vol. 21, no. 1, pp. 47-56, 2005.
[14] Breunig MM, Kriegel H-P, Ng RT, Sander J (2000) LOF: identifying density-based local
outliers. In: Proceedings of ACM SIGMOD, pp 93-104.
[15] Budalakoti, S. Cruz, A. N. Srivastava, R. Akella, E. Turkov (2006) Anomaly detection in large
sets of high-dimensional symbol sequences. NASA TM.
[16] C.T. Lu, D. Chen, and Y. Kou (2003) Detecting spatial outliers with multiple attributes.In:
Proceedings of ICTAI, pp 122-128.
[17] Chandola V, Banerjee A, Kumar V. Anomaly detection: a survey, ACM Comput Surv 2009, 41,
Article 15:1–58.
[18] Chaudhary, A. S. Szalay, and A. W. Moore (2002) Very fast outlier detection in large
multidimensional data sets. In: Proceedings of ACM SIGMOD Workshop on DMKD.
[19] Cheng and Z. Li (2006) A multiscale approach for spatio-temporal outlier detection.
Transactions in GIS, vol. 10, no. 2, pp 253-263.
[20] Chiu, A. W. Fu (2003) Enhancements on local outlier detection. In: Proceedings of IDEAS.
[21] D. Ren, I. Rahal, W. Perrizo (2004) A vertical outlier detection algorithm with clusters as by-
product. In: Proceedings of ICTAI.
[22] D. Yu, G. Sheikholeslami, and A. Zhang (2002) FindOut: finding outliers in very large datasets. In: Journal of Knowledge and Information Systems, vol. 4, no. 3, pp. 387-412.
A Five Step Procedure for Outlier Analysis in Data Mining 336
[23] David Taniar, “Research and Trends in Data Mining Technologies and Applications”, Monash University, Australia, Idea Group Publishing, 2007.
[24] E. Achtert, H.-P. Kriegel, L. Reichert, E. Schubert, R. Wojdanowski, and A. Zimek. Visual evaluation of outlier detection models. In Proc. DASFAA, 2010.
[25] E. Otey, A. Ghoting, S. Parthasarathy (2006) Fast distributed outlier detection in mixed-
attribute data sets. Data Mining and Knowledge Discovery, vol. 12, no. 2-3, pp 203-228.
[26] Eberle, W. and Holder, L., “Anomaly detection in data represented as graphs,” Intell. Data
Anal., vol. 11, no. 6, pp. 663–689, 2007.
[27] Eskin (2000) Anomaly detection over noisy data using learned probability distributions.In:
Proceedings of Machine Learning.
[28] Fan, H., Zaïane, O. R., Foss, A., and Wu, J., “A nonparametric outlier detection for effectively discovering top-n outliers from engineering data,” in PAKDD, pp. 557-566, 2006.
[29] Fu. J, X. Yu (2006) Rotorcraft acoustic noise estimation and outlier detection. In: Proceedings
of IJCNN, pp 4401-4405.
[30] Gongxian Cheng, “Outlier Management In Intelligent Data Analysis”, Thesis report, Birkbeck
College, University of London, 2000, p.17.
[31] Ghoting, S. Parthasarathy, and M. Otey. Fast mining of distance-based outliers in high-
dimensional datasets. Data Mining and Knowledge Discovery, 16(3):349–364, June 2008.
[32] Grubbs, Frank (1969) Procedures for detecting outlying observations in samples. Technometrics, vol. 11, no. 1, pp. 1-21.
[33] H. V. Nguyen and V. Gopalkrishnan. Feature extraction for outlier detection in high-
dimensional spaces. In Proceedings of the 4th International Workshop on Feature Selection in
Data Mining, pages 64-73, 2010.
[34] Hadi AS, Imon A, Werner M. Detection of outliers. Wiley Interdiscip Rev Comput Stat 2009, 1:57-70.
[35] Hair et al., “Multivariate data analysis”, Pearson Education Pte Ltd, Fifth edition, ISBN 81-297-0021-2.
[36] Hans-Peter Kriegel, Peer Kröger, Erich Schubert, and Arthur Zimek, “Interpreting and Unifying
Outlier Scores”, 11th SIAM International Conference on Data Mining (SDM), Mesa, AZ, 2011.
[37] Hawkins, H. He, G. J. Williams, R. A. Baxter (2002) Outlier detection using replicator neural
networks. In: Proceedings of DaWaK, pp 170-180.
[38] Hawkins DM. Identification of outliers. New York, NY: Chapman and Hall; 1980.
[39] He, S. Deng, X. Xu (2005) An optimization model for outlier detection in categorical data. In:
Proceedings of ICIC, pp 400-409.
[40] Hewahi, N., Saad, M.: Class Outliers Mining: Distance Based-Approach, International Journal
of Intelligent Systems and Technologies, Vol. 2, No. 1, pp 55-68, 2007.
[41] Shaun Burke, “Missing Values, Outliers, Robust Statistics & Non-parametric Methods”, RHM
Technology Ltd, High Wycombe, Buckinghamshire, UK.
[42] Ho, Robert, “Handbook of univariate and multivariate data analysis and interpretation with
SPSS”, ISBN 1-58488-602-1, 2006 by Taylor & Francis Group.
[43] Hodge VJ, Austin J. A survey of outlier detection methodologies. Artif Intell Rev 2004,
22:85–126.
[44] http://en.wikipedia.org/wiki/Univariate_analysis
[45] Huang, H.Y., Lin, J.X., Chen, C.C and Fan, M.H.: Review of Outlier Detection. In: Application
Research of Computers, 8–13. (2006).
[46] J. X. Yu, W. Qian, H. Lu, A. Zhou (2006) Finding centric local outliers in
categorical/numerical spaces. Knowledge Information System, vol. 9, no. 3, pp 309-338.
[47] J. Zhao, C.-T. Lu, and Y. Kou (2003) Detecting region outliers in meteorological data. In:
proceedings of ACM GIS, pp 49-55.
[48] Janeja VP, Atluri V. Spatial outlier detection in heterogeneous neighborhoods. Intell Data Anal
2009, 13:85–
337 V. Ilango, R. Subramanian and V. Vasudevan
[49] Jiang, F., Sui, Y., and Cao, C. (2008) A rough set approach to outlier detection.
International Journal of General Systems, 37, 519–536.
[50] Jin, A.K.H. Tung, and J. Han (2001) Mining top-n local outliers in large databases. In:
Proceedings of ACM SIGKDD, pp. 293-298.
[51] K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne (2000) On-line unsupervised learning
outlier detection using finite mixtures with discounting learning algorithms. In: Proceedings of
KDD, pp 250-254.
[52] Knorr E, Ng R (1998) Algorithms for mining distance-based outliers in large data sets.
In: Proceedings of VLDB, pp 392-403.
[53] Kollios, D. Gunopulos, N. Koudas, S. Berchtold (2003) Efficient biased sampling for
approximate clustering and outlier detection in large data sets. Knowledge and Data
Engineering, vol. 15, no. 5, pp 1170-1187.
[54] Kou, C. Lu, D. Chen (2006) Spatial weighted outlier detection. In: Proceeding of SDM.
[55] Koufakou, M. Georgiopoulos, and G. Anagnostopoulos. Detecting outliers in high-dimensional
datasets with mixed attributes. In International Conference on Data Mining (DMIN 2008),
Las Vegas, NV, 14-17 2008.
[56] Kriegel, H.-P., Schubert, M., and Zimek, A., “Angle-based outlier detection in high-
dimensional data,” in KDD ’08: Proceeding of the 14th ACM SIGKDD international
conference on Knowledge discovery and data mining, (New York, NY, USA), pp. 444–452,
ACM, 2008.
[57] Kriegel, H.-P., Kröger, P., Schubert, E., and Zimek, A., “ Outlier detection techniques”, In Proc
of SIAM International Conference on Data Mining, 2010.
[58] L. Wei, W. Qian, A. Zhou, W. Jin, J. X. Yu (2003) HOT: hypergraph-based outlier test for
categorical data. In: Proceedings of PAKDD, pp 399-410.
[59] Laurikkala, M. Juhola, E. Kentala (2000) Informal identification of outliers in medical data. In:
Proceedings of IDAMAP.
[60] Lazarevic, A., Ozgur, A., Ertoz, L., Srivastava, J. and Kumar, V. (2003) 'A comparative study
of anomaly detection schemes in network intrusion detection', SIAM Conference on Data
Mining.
[61] M. F. Jiang, S. S. Tseng, C. M. Su (2001) Two-phase clustering process for outliers detection.
Pattern Recognition Letters, 22 (6-7): 691-700.
[62] M. I. Petrovskiy (2003) Outlier detection algorithms in data mining systems. Programming and
Computer Software, vol. 29, no. 4, pp 228-237.
[63] D. M. J. Tax and R. P. W. Duin (1999) Support vector domain description. Pattern Recognition
Letters, vol. 20, pp 1191-1199.
[64] M. Knorr, R. T. Ng, and V. Tucakov. Distance-based outliers: Algorithms and applications.
International Journal on Very Large Data Bases, 8(3-4):237, 2000.
[65] M. Shyu, S. Chen, K. Sarinnapakorn, L. W. Chang (2003) A novel anomaly detection scheme
based on principal component classifier. In: Proceedings of ICDM, pp 172-179.
[66] M. V. Joshi, R. C. Agarwal, and V. Kumar, “Mining needle in a haystack: classifying rare
classes via two-phase rule induction,” in Proceedings of the 7th ACM SIGKDD international
conference on Knowledge discovery and data mining. ACM Press, 2001, pp. 293–298.
[67] Mörchen, F., “Unsupervised pattern mining from symbolic temporal data,” SIGKDD Explor.
Newsl., vol. 9, no. 1, pp. 41–55, 2007.
[68] McQuarrie A, Tsai CL. Outlier detections in AR models. J Comput Graph Stat 2003,
12:450–471.
[69] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann
Publishers, 2 ed., Mar. 2006.
[70] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2005.
[71] Ng, B. (2006) Survey of anomaly detection methods. Technical Report UCRL-TR-225264.
Lawrence Livermore National Laboratory, University of California, California USA.
[72] NIST/SEMATECH e-Handbook of Statistical Methods,
http://www.itl.nist.gov/div898/handbook
[73] P. Sun, S. Chawla, B. Arunasalam (2006) Mining for outliers in sequential databases. In:
Proceedings of SIAM, pp 94-105.
[74] Papadimitriou, P., Dasdan, A., and Garcia-Molina, H., “Web Graph Similarity for Anomaly
Detection,” technical report, Stanford, 22 Jan. 2008.
[75] Pokrajac, A. Lazarevic, L. J. Latecki (2007) Incremental local outlier detection for data streams.
In: Proceedings of CIDM.
[76] Little, R.J.A. and Rubin, D.B., “Statistical Analysis with Missing Data”, Wiley-Interscience,
2002.
[77] Ramaswamy S, Rastogi R, Shim K (2000) Efficient algorithms for mining outliers from large
data sets. In: Proceedings of ACM SIGMOD, pp 427-438.
[78] Ren, B. Wang, W. Perrizo (2004) RDF: a density-based outlier detection method using vertical
data representation. In: Proceedings of ICDM, pp 503-506.
[79] Nguyen Hoang Vu, “Outlier Detection Based on Neighborhood Proximity”, Dissertation report,
Nanyang Technological University, June, 2010, p.1.
[80] Rousseeuw and A. Leroy. Robust Regression and Outlier Detection, 3rd edn. John Wiley and
Sons, 1996.
[81] S. Kim, S. Cho (2006) Prototype based outlier detection. In: Proceedings of IJCNN, pp 820-
826.
[82] S. Muthukrishnan, R. Shah, J. S. Vitter (2004) Mining deviants in time series data streams. In:
Proceedings of SSDBM.
[83] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, and C. Faloutsos. (2003) LOCI: fast outlier
detection using the local correlation integral. In: Proceedings of ICDE, pp 315-326.
[84] S. Shekhar, C.-T. Lu, and P. Zhang (2001) A unified approach to spatial outliers detection.
GeoInformatica, 7(2): 139-166.
[85] S. Y. Jiang, X. Song, H. Wang, J. J. Han, and Q. H. Li. A clustering-based method for
unsupervised intrusion detections. Pattern Recognition Letters, 27:802–810, 2006.
[86] S. Zanero and S. M. Savaresi, “Unsupervised learning techniques for an intrusion detection
system,” in Proceedings of the 2004 ACM symposium on Applied computing, 2004, pp. 412 –
419.
[87] Sabyasachi Basu and Martin Meckesheimer, “Automatic outlier detection for time series: an
application to sensor data”, Knowl Inf Syst (2007) 11(2): 137–154, DOI
10.1007/s10115-006-0026-6.
[88] Salvador, S. and Chan, P., “Learning states and rules for detecting anomalies in time series,”
Applied Intelligence, vol. 23, no. 3, pp. 241–255, 2005.
[89] Serneels, S. and Verdonck, T., “Principal component regression for data containing outliers and
missing elements,” Comput. Stat. Data Anal., vol. 53, no. 11, pp. 3855–3863, 2009.
[90] Shaft, U. and Ramakrishnan, R., “Theory of nearest neighbors indexability,” ACM Trans.
Database Syst., vol. 31, no. 3, pp. 814–838, 2006.
[91] http://www.psychwiki.com/images/7/79/lab1datascreening.doc
[92] Steinwart, I., Hush, D., and Scovel, C., “A classification framework for anomaly detection,” J.
Mach. Learn. Res., vol. 6, pp. 211–232, 2005.
[93] Subramaniam, S., Palpanas, T., Papadopoulos, D., Kalogeraki, V., and Gunopulos, D., “Online
outlier detection in sensor data using nonparametric models,” in VLDB ’06: Proceedings of the
32nd international conference on Very large data bases, pp. 187–198, VLDB Endowment,
2006.
[94] Sun, S. Chawla (2004) On local spatial outliers. In: Proceedings of ICDM, pp 209-216.
[95] T. Hu, S. Y. Sung (2003) Detecting pattern-based outliers. Pattern Recognition Letters, 24 (16):
3059-3068.
[96] Tang, J., Chen, Z., Fu, A. W., and Cheung, D. W. (2006) Capabilities of outlier detection
schemes in large datasets, framework and methodologies. Knowledge and Information Systems,
11, 45–84.
[97] Vinueza, A. and Grudic, G. 2004. Unsupervised outlier detection and semi-supervised learning.
Tech. Rep. CU-CS-976-04, Univ. of Colorado at Boulder. May.
[98] Witten and Frank, “Data Mining: Practical Machine Learning Tools and Techniques”, Morgan
Kaufmann Publishers, Second Edition, 2005, ISBN: 0-12-088407-0.
[99] Y. Tao, X. Xiao, and S. Zhou. Mining distance-based outliers from large databases in any
metric space. In Proceedings of the 12th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, pages 394-403, 2006.
[100] Walter R. Mebane and Jasjeet S. Sekhon, “Robust Estimation and Outlier Detection for
Overdispersed Multinomial Models of Count Data”, American Journal of Political Science,
Vol. 48, No. 2, April 2004, pp. 392–411.
[101] Yamanishi, J. Takeuchi (2006) A unifying framework for detecting outliers and change points
from non-stationary time series data. Knowledge and Data Engineering, vol. 18, no. 4, pp
482-492.
[102] Z. He, X. Xu, S. Deng (2003) Discovering cluster based local outliers. Pattern Recognition
Letters, 24 (9-10): 1651-1660.
[103] Zhang, K., Shi, S., Gao, H. and Li, J. (2007) 'Unsupervised outlier detection in sensor networks
using aggregation tree', Proceedings of ADMA.
[104] Zhang, Y., Meratnia, N. and Havinga, P. J. M. (2007) 'A taxonomy framework for unsupervised
outlier detection techniques for multi-type data sets', Technical Report, University of Twente.
[105] Zhang, Y., Meratnia, N., and Havinga, P. (2010) Outlier detection techniques for wireless
sensor networks: A survey. IEEE Communications Surveys & Tutorials, 12(2), 159–170.
[106] Zhu, H. Kitagawa, C. Faloutsos (2005) Example-based robust outlier detection in high
dimensional datasets. In: Proceedings of ICDM, pp 829-832.
[107] http://www.biomedcentral.com/1471-2288/5/35
[108] http://www.datavis.ca/gallery/accent.php.
[109] http://idt.stanford.edu/idt1999/students/mzuno/portfolio/work/reports/tufte
[110] http://216.22.10.76/wiki/Dealing_with_Outliers