Geoplanning
Journal of Geomatics and Planning, Vol 6, No 1, 2019, 21-30
E-ISSN: 2355-6544
http://ejournal.undip.ac.id/index.php/geoplanning
doi: 10.14710/geoplanning.6.1.21-30
Monte Carlo Simulation for Outlier Identification Studies in
Geodetic Network: An Example in A Levelling Network Using
Iterative Data Snooping
M.T. Matsuoka a,c , V.F. Rofatto a,c, I. Klein b, A. F. S. Gomes a , M.P. Guzatto b
a Institute of Geography, Federal University of Uberlândia (UFU), Monte Carmelo, Brazil
b Land Surveying Program, Federal Institute of Santa Catarina (IFSC), Florianopolis, Brazil
c Graduate Program of Remote Sensing, Federal University of Rio Grande do Sul (UFRGS), Porto Alegre, Brazil
Article Info:
Received: 13 April 2018
Received in revised form: 30 March 2019
Accepted: 30 May 2019
Available Online: 30 August 2019

Keywords: Geodetic Network, Outlier, Monte Carlo Simulation

Corresponding Author:
Marcelo Tomio Matsuoka
Universidade Federal de Uberlândia, Monte Carmelo, Brazil
Email: tomio@ufu.br
Abstract: Today, with fast and powerful computers, large data storage systems and modern software, the probability distributions and the efficiency of statistical testing algorithms can be estimated by computerized simulation. Here, we use Monte Carlo simulation (MCS) to investigate the power of the test and the error probabilities of Baarda's iterative data snooping procedure as a test statistic for outlier identification in the Gauss-Markov model. The MCS discards the use of the observation vector of the Gauss-Markov model: to perform the analysis, all that is needed are the Jacobian matrix, the uncertainty of the observations, and the magnitude intervals of the outliers. The random errors (or residuals) are generated artificially from the normal distribution, while the size of the outliers is randomly selected using the standard uniform distribution. Results for a simulated closed levelling network reveal that data snooping can locate an outlier of the order of magnitude 5σ with a high success rate. The lower the magnitude of the outliers, the lower the efficiency of data snooping in the simulated network. In general, considering the simulated network, the data snooping procedure was most efficient for α = 0.01 (1%), with an 82.8% success rate.
Copyright © 2019 GJGP-UNDIP
This open access article is distributed under a
Creative Commons Attribution (CC-BY-NC-SA) 4.0 International license.
How to Cite (APA 6th Style):
Matsuoka, M.T., Rofatto, V.F., Klein, I., Gomes, A.F.S., & Guzatto, M.P. (2019). Monte Carlo Simulation for Outlier Identification Studies in Geodetic Network: An Example in a Levelling Network Using Iterative Data Snooping. Geoplanning: Journal of Geomatics and Planning, 6(1), 21-30. doi:10.14710/geoplanning.6.1.21-30.
1. INTRODUCTION
Data snooping is the best-established method for the identification of gross errors in geodetic data analysis. The method is due to Baarda (1968). Here, it is assumed that outliers are observations contaminated by gross errors (blunders), following the statement of Lehmann (2012) that in geodesy 'outliers are most often caused by gross errors and gross errors most often cause outliers'. In practice, the data snooping procedure is applied iteratively, identifying and removing one outlier at a time. The method is applied until no further observations are identified. Here, this procedure is called Iterative Data Snooping. Since data snooping is based on statistical hypothesis testing, it may lead to false decisions, as follows:
Type I error or false alert (probability level α) – the probability of detecting an outlier when there is none;
Type II error or missed detection (probability level β) – the probability of failing to detect an outlier when there is at least one; and
Type III error or wrong exclusion (probability level κ) – the probability of misidentifying a non-outlying observation as an outlier, instead of the outlying one.
Lehmann & Voß-Böhme (2017) mention that while the rate of the type I decision error can be selected by the user, the rate of the type II decision error cannot. They also point out that a test statistic with a low rate of type II error is said to be powerful. However, without considering the type III error, there is a high risk of over-estimating the probability of successful identification. Besides that, we highlight that the Iterative Data Snooping procedure can identify more observations than the real number of outliers (here called "over-identification"). Thus, we consider a statistical test powerful when the rates of type II and type III errors, as well as over-identification, are simultaneously minimized for a given probability level α.
From this point, we pose the following problem: how can the probability levels above be computed? Unlike Baarda, we have fast computers at our disposal. In this paper, we show that these statistical quantities can be determined from the frequency distributions of computer random experiments performed using random numbers. This is known as Monte Carlo simulation (MCS). MCS methods are used whenever the functional relationships are not analytically tractable, as is the case for the Iterative Data Snooping procedure (Lehmann, 2012b). MCS has already been applied to outlier detection (Lehmann & Scheffler, 2011; Lehmann, 2012; Klein et al., 2012; Klein et al., 2015; Erdogan, 2014; Niemeier & Tengen, 2017).
The studies presented in this paper are a continuation of the first experiments presented by Rofatto et al. (2017). However, unlike Rofatto et al. (2017), here we evaluate the proposed method in a geodetic network with uncorrelated observations, and we also analyze the power of the test of the Iterative Data Snooping procedure when outliers of magnitude equal to the MDB (Minimal Detectable Bias) are inserted into the geodetic network.
The outline of the paper is as follows: first, we present the theoretical background of the Iterative Data Snooping procedure in the Gauss-Markov model. Next, the MCS approach is introduced as a tool to analyze the power of the test and the probabilities of decision errors (type II, type III and over-identification) of the Iterative Data Snooping procedure. Then, the efficiency of Iterative Data Snooping is demonstrated by means of the Monte Carlo method on the example of a simulated closed levelling network. The mathematical model generally adopted in geodetic data analysis is the linear(ized) Gauss-Markov model, given by Koch (1999):
$e = y - Ax$ .......(1)

where $e$ is the $n \times 1$ random error vector, $A$ is the $n \times u$ design (or Jacobian) matrix with full column rank, $x$ is the $u \times 1$ vector of unknown parameters, and $y$ is the $n \times 1$ vector of observations. The most commonly employed solution for a redundant system of equations ($n > u$) is the weighted least-squares estimator (WLSE) of the vector of unknowns $\hat{x}$:

$\hat{x} = (A^T W A)^{-1} A^T W y$ ........(2)

in which $W$ is the $n \times n$ weight matrix of the observations, taken as $W = \sigma_0^2 \Sigma_y^{-1}$, where $\sigma_0^2$ is the variance factor (here assumed known) and $\Sigma_y$ is the covariance matrix of the observations; if $\Sigma_y$ is diagonal, one speaks of weighted LSE (WLSE); if it is full, of generalized LSE (GLSE). More details about LSE can be found in Ghilani (2017). A geometric interpretation of the LSE can be found in Teunissen (2003) and Klein et al. (2011).
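For illustration, a minimal numerical sketch of the WLSE of equation 2 is given below (Python with NumPy; the function name and variable names are our own choices, not notation from the paper):

```python
import numpy as np

def wlse(A, y, cov_y, sigma0_sq=1.0):
    """Weighted least-squares estimate (eq. 2): x_hat = (A^T W A)^-1 A^T W y,
    with weight matrix W = sigma0^2 * inv(cov_y) (eq. of W above).
    Returns x_hat and the estimated random error vector e_hat = y - A x_hat."""
    W = sigma0_sq * np.linalg.inv(cov_y)      # weight matrix of the observations
    N = A.T @ W @ A                           # normal-equations matrix (full rank assumed)
    x_hat = np.linalg.solve(N, A.T @ W @ y)   # solve N x_hat = A^T W y
    e_hat = y - A @ x_hat                     # estimated random errors under H0 (eq. 1)
    return x_hat, e_hat
```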
The least-squares method is the Best Linear Unbiased Estimator (BLUE) of the unknown parameters, and it is also the maximum likelihood solution when the observation errors follow a central Gaussian distribution (Teunissen, 2003). However, least squares is no longer optimal in the presence of grossly erroneous observations (Baarda, 1968). In other words, despite its optimal properties, least squares lacks robustness, i.e. insensitivity to outliers in the observations (Huber, 1992; Rousseeuw & Leroy, 1987; Lehmann, 2013). In recent years, two categories of advanced techniques for the treatment of observations contaminated by outliers have been developed: robust adjustment procedures (Wilcox, 2011; Klein et al., 2015) and outlier detection based on statistical tests (Klein et al., 2016). The first is outside the scope of this paper. Besides the undoubted advantages of robust adjustment, outlier tests are also widely used. The following advantages of outlier analysis are mentioned by Lehmann (2013):
Detected outliers provide the opportunity to investigate the causes of gross measurement errors;
Detected outliers can be re-measured; and
If the outliers are discarded from the observations, then standard adjustment software, which operates according to the least-squares principle, can be used.
The data snooping procedure is a particular case of the likelihood ratio test when only one outlier (i.e. q = 1) is present in the data set at a time (Baarda, 1968; Berber & Hekimoglu, 2003; Lehmann, 2012). Thus, it is formulated by the following test hypotheses (Baarda, 1968; Teunissen, 2006):

$H_0: E\{y\} = Ax$ vs. $H_A: E\{y\} = Ax + c_y \nabla; \ \nabla \neq 0$ .......... (3)

where $c_y$ is the outlier model for q = 1, i.e. the $n \times 1$ unit vector with 1 in its ith entry and zeros in the remaining entries (e.g. $c_y = [0 \cdots 0 \ 1 \ 0 \cdots 0]^T_{n \times 1}$), and $\nabla$ is a scalar containing the gross error (outlier) of the ith observation being tested. Therefore, under the null hypothesis ($H_0$) it is assumed that there are no outliers in the observations, while under the alternative hypothesis ($H_A$) it is assumed that the ith observation being tested (for $i = 1, \dots, n$) is contaminated by a gross error of magnitude $\nabla$.
If we consider one outlying observation at a certain known location (q = 1), then the likelihood ratio test statistic for data snooping ($T_{q=1}$) is given by Teunissen (2006):

$T_{q=1} = \hat{e}_0^T \Sigma_y^{-1} c_y \left( c_y^T \Sigma_y^{-1} \Sigma_{\hat{e}_0} \Sigma_y^{-1} c_y \right)^{-1} c_y^T \Sigma_y^{-1} \hat{e}_0$ .......... (4)

where $\hat{e}_0$ and $\Sigma_{\hat{e}_0}$ are, respectively, the estimated random error vector and the a posteriori covariance matrix of the estimated random errors, computed by LSE under $H_0$. Under $H_0$, the observation errors are zero-mean (multivariate) normally distributed. The null hypothesis is rejected if the test statistic $T_{q=1}$ of the ith observation being tested exceeds a given critical value $K_\alpha$, i.e.:

Reject $H_0$ if $T_{q=1} > K_\alpha$, with $H_0: T_{q=1} \sim \chi^2(1, 0)$ and $H_A: T_{q=1} \sim \chi^2(1, \lambda)$, where $\lambda = \nabla^2 c_y^T \Sigma_y^{-1} \Sigma_{\hat{e}_0} \Sigma_y^{-1} c_y$ ........ (5)
It is important to mention that the critical value follows from a chi-squared distribution with one degree of freedom at a significance level α in a one-tailed test. Baarda (1968) and Teunissen (2006) demonstrate that if q = 1, the test statistic (equation 4) can also be formulated based on the standard normal distribution in a two-tailed test (the so-called w-test). The chi-squared and normal distribution tests are equivalent. Usually in geodesy, the value of α is set between 0.1% and 1% (Kavouras, 1982; Aydin & Demirel, 2004; Lehmann, 2013). Furthermore, data snooping involves multiple alternative hypotheses, since each observation is tested individually. Therefore, the only observation considered contaminated by an outlier is the one whose test statistic satisfies the inequality $T_{q=1} > K_\alpha$. In the case that two or more observations exceed the critical value $K_\alpha$, only the observation with the largest $T_{q=1}$ is flagged as an outlier. After the observation most suspected of being an outlier has been identified (at a given α), it is usually excluded from the model, and the WLSE and data snooping procedure are applied iteratively until no further outliers are identified in the observations (Berber & Hekimoglu, 2003). A minimal sketch of one such round is given below.
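To make the rejection rule concrete, the following sketch evaluates one data snooping round according to equations 4 and 5 (Python with NumPy/SciPy; the function name and its arguments are our own choices, not notation from the paper):

```python
import numpy as np
from scipy.stats import chi2

def snoop_once(e_hat, cov_y, cov_e_hat, alpha):
    """One data snooping round (q = 1): evaluate T_{q=1} of eq. 4 for every
    observation and flag the largest statistic if it exceeds the critical
    value K_alpha of the chi-squared distribution with one degree of
    freedom (eq. 5). Returns the index of the suspected outlier, or None."""
    Wy = np.linalg.inv(cov_y)                 # Sigma_y^{-1}
    n = e_hat.size
    T = np.empty(n)
    for i in range(n):
        c = np.zeros(n)
        c[i] = 1.0                            # outlier model c_y for the i-th observation
        T[i] = (c @ Wy @ e_hat) ** 2 / (c @ Wy @ cov_e_hat @ Wy @ c)
    k_alpha = chi2.ppf(1.0 - alpha, df=1)     # critical value K_alpha
    i_max = int(np.argmax(T))
    return i_max if T[i_max] > k_alpha else None
```

Here `cov_e_hat` stands for $\Sigma_{\hat{e}_0} = \Sigma_y - A(A^T W A)^{-1} A^T$ (with $\sigma_0^2 = 1$), the a posteriori covariance matrix of the estimated random errors under $H_0$.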
The power of the test (γ) is the probability of correctly identifying the outliers. In the case of a single round of data snooping, the power of the test depends on the type II and type III errors for a given level of significance (α), i.e. γ = 1 − (β + κ). Considering Iterative Data Snooping, the power of the test also depends on the over-identification error, and it is given by γ = 1 − (β + κ + over-identification). Baarda's conventional reliability theory considers only a single alternative hypothesis (Baarda, 1968) and, therefore, is based only on type I and type II errors. The type III error is addressed by Förstner (1983) considering two alternative hypotheses. Yang et al. (2013) extended the solution given by Förstner (1983) and presented an analytical solution for the type III error considering multiple alternative hypotheses and the presence of one outlier (i.e. for a single round of data snooping). Examples of the efficiency of the analytical solution presented by Yang et al. (2013) can be found in Klein et al. (2015).
The focus of this paper is Iterative Data Snooping. An analytical solution for the probabilities of decision errors and the power of the test of Iterative Data Snooping has not yet been developed and would be rather difficult to derive. A well-established procedure to compute these probability levels is Monte Carlo simulation (MCS). As pointed out by Lehmann (2012), in essence MCS replaces random variables by computer-generated pseudo-random numbers, probabilities by relative frequencies, and expectations by arithmetic means over large sets of such numbers. The computation of one set of pseudo-random numbers is a Monte Carlo experiment. In geodesy, Monte Carlo simulation has been applied in several studies since the pioneering work of Hekimoglu & Koch (1999). For example, Lehmann & Scheffler (2011) have already applied MCS to data snooping to determine the optimal level of the error probability α (type I error). Here, on the other hand, we propose to use MCS in Iterative Data Snooping to compute the following probability levels: the power of the test, the type II error and the type III error. In addition to these probabilities, we also compute the rate of experiments in which the Iterative Data Snooping procedure identified more outliers than simulated (which we call "over-identification", i.e. q > 1). In the next section we show how to obtain these probability levels experimentally. Thus, we can analyze the efficiency of the Iterative Data Snooping testing procedure based on MCS, as promised by the title of this paper.
2. DATA AND METHODS
In order to analyze the Iterative Data Snooping procedure, MCS was applied to compute the probability levels. To do so, a sequence of m random error vectors $e^{(k)}, k = 1, \dots, m$, of a desired statistical distribution is generated. The number m is known as the number of Monte Carlo experiments. Usually, one assumes that the random errors of the good measurements are normally distributed with zero expectation. Thus, we generate the random errors from the multivariate normal distribution, since the assumed stochastic model for the random errors is based on the covariance matrix of the observations, i.e. $e \sim N(0, \sigma_0^2 \Sigma_y)$.
On the other hand, an outlier (q = 1) is selected based on magnitude intervals of the outliers in each of the m Monte Carlo experiments. Positive and negative outliers are clipped between 3σ and 3.5σ; 3.5σ and 4σ; 4σ and 4.5σ; 4.5σ and 5σ; 5σ and 5.5σ; 5.5σ and 6σ; 6σ and 6.5σ; 6.5σ and 7σ; 7σ and 7.5σ; 7.5σ and 8σ; 8σ and 8.5σ; and 8.5σ and 9σ in each experiment (σ is the standard deviation of the observation). Here, we use the standard uniform distribution to select the outlier magnitude. The uniform distribution is a rectangular distribution with constant probability: every range of values of the same length on the distribution's support has the same probability of occurrence (Lehmann & Scheffler, 2011). For example, for 10,000 Monte Carlo experiments, if one chooses an outlier magnitude interval of |3σ to 9σ|, the probability of a 3σ error occurring is virtually the same as that of a −3σ error, and so on. At each iteration of the simulation, a specific observation is chosen to receive a gross error based on the discrete uniform distribution (i.e., all observations have the same probability of being selected). Random and gross errors are assumed to be independent (by definition), and both are combined into the total error as follows (Kavouras, 1982):
$\varepsilon = e + c_y \nabla, \ \nabla \neq 0$ ...... (6)

where $\varepsilon$ is the $n \times 1$ total error vector, $e$ is the $n \times 1$ random error vector, $c_y$ is the outlier model for q = 1 (see expression 3), and $\nabla$ is a scalar containing the outlier of the ith observation being tested. We assume that $\nabla > e$. Before computing the test statistic $T_{q=1}$ (expression 4), it is necessary to relate the random error vector $e$ and the total error vector $\varepsilon$, since this test statistic depends on the estimated random error vector $\hat{e}_0$. In the sense of LSE, this relationship is given by Kavouras (1982):

$\hat{e}_0 = R\varepsilon$ ....... (7)

$R = I - A(A^T W A)^{-1} A^T W$ .......... (8)

in which $R$ is the $n \times n$ redundancy matrix and $I$ is the $n \times n$ identity matrix. In equation 7, the multiplication of the redundancy matrix $R$ by the total error vector $\varepsilon$ provides the estimated random error vector $\hat{e}_0$. Now $\hat{e}_0$ is composed not only of random errors, but also has one of its elements contaminated by an outlier. It then becomes possible to compute the test statistic $T_{q=1}$ using the relation given by equation 4.
The significance level is varied, taken as α = 0.001 (0.1%), α = 0.01 (1%), α = 0.025 (2.5%), α = 0.05 (5%) and α = 0.1 (10%). Each simulation has a unique combination of significance level and interval of outlier magnitude. We ran 10,000 experiments for each simulation and computed the probability levels of the type II error, the type III error, the power of the test and the number of over-identifications (more outliers identified than simulated) in Iterative Data Snooping, totalling 12 × 5 × 10,000 = 600,000 Monte Carlo experiments. It is important to emphasize that the proposed method does not depend on the unknown parameter vector or on the vector of observations. A sketch of one complete Monte Carlo experiment is given below.
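Putting the pieces together, one Monte Carlo experiment of the kind just described can be sketched as follows. This is an illustrative outline under our assumptions, reusing `snoop_once` from the sketch above: the helper names and outcome labels are ours, and the single interval `low`–`high` stands for whichever of the twelve magnitude sub-intervals the current simulation uses.

```python
import numpy as np
from numpy.random import default_rng

rng = default_rng(2019)  # fixed seed only to make this sketch reproducible

def redundancy(A, cov_y):
    """Redundancy matrix R = I - A (A^T W A)^-1 A^T W (eq. 8), with W = inv(cov_y)."""
    W = np.linalg.inv(cov_y)
    return np.eye(A.shape[0]) - A @ np.linalg.inv(A.T @ W @ A) @ A.T @ W

def monte_carlo_experiment(A, cov_y, alpha, low=3.0, high=9.0):
    """One experiment: draw random errors, contaminate one observation with an
    outlier of |size| ~ U(low*sigma_i, high*sigma_i) and a random sign (eq. 6),
    run Iterative Data Snooping on the total error vector (eqs. 7-8 and 4-5),
    and classify the outcome. No observation vector y is needed."""
    n = A.shape[0]
    e = rng.multivariate_normal(np.zeros(n), cov_y)     # e ~ N(0, Sigma_y)
    i_out = int(rng.integers(n))                        # discrete-uniform pick of the observation
    sigma_i = np.sqrt(cov_y[i_out, i_out])
    grad = rng.uniform(low, high) * sigma_i * rng.choice([-1.0, 1.0])
    eps = e.copy()
    eps[i_out] += grad                                  # total error: eps = e + c_y * grad (eq. 6)
    flagged, keep = [], list(range(n))
    while len(keep) > A.shape[1]:                       # keep the model redundant
        Ak, Ck = A[keep, :], cov_y[np.ix_(keep, keep)]
        e_hat = redundancy(Ak, Ck) @ eps[keep]          # eq. 7: e_hat = R * eps
        cov_e_hat = Ck - Ak @ np.linalg.inv(Ak.T @ np.linalg.inv(Ck) @ Ak) @ Ak.T
        j = snoop_once(e_hat, Ck, cov_e_hat, alpha)     # one data snooping round (eqs. 4-5)
        if j is None:
            break
        flagged.append(keep.pop(j))                     # exclude the flagged obs. and re-adjust
    if not flagged:
        return "type II"                                # nothing identified
    if len(flagged) > 1:
        return "over-identification"                    # more outliers identified than simulated
    return "success" if flagged[0] == i_out else "type III"
```

Counting the relative frequencies of the four outcomes over the m = 10,000 experiments of a simulation yields the estimates of γ, β, κ and the over-identification rate.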
3. RESULTS AND DISCUSSION
In order to demonstrate the analysis of the efficiency of data snooping, we simulated a closed levelling network. The goal is to illustrate how the MCS approach can be used to compute statistical quantities numerically; further considerations about levelling networks are outside the scope of this study.
We consider a closed levelling network with one control station (benchmark) and 4 points with unknown heights (A, B, C and D), totalling four minimally constrained points, as shown in Figure 1. The benchmark is fixed, and the distances between adjacent and non-adjacent stations are approximately 240 m and 400 m, respectively. The equipment used is a spirit level with a nominal standard deviation for a single staff reading of 0.02 mm/m. Sight distances are kept at 40 m. Thus, each total height difference $\Delta h_i$ between adjacent or non-adjacent stations is made up of, respectively, three or five partial height differences ($p_i$). Each partial height difference, in turn, involves one instrument setup and two sightings: forward and back. The standard deviation of each $\Delta h_i$ therefore equals $\sigma_{\Delta h_i} = \sqrt{2 p_i} \times 40\,\text{m} \times 0.02\,\text{mm/m} = \sqrt{2 p_i} \times 0.8\,\text{mm}$, where $p_i$ is 3 or 5. The readings are assumed uncorrelated and $\sigma_0^2 = 1$.
Figure 1. Simulated leveling network
For each unknown point, there are four height difference measurements. Thus, there are n = 10 observations, u = 4 unknowns and n − u = 6 degrees of freedom in this simulation. The design matrix A has dimension 10 × 4 and the covariance matrix of the observations has dimension 10 × 10. Each station is involved in four height differences, so there are three redundant observations for the determination of each unknown. A sketch of one possible construction of this configuration is given below.
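The paper does not tabulate the ten height differences individually, but the description (five stations, each involved in four height differences, with adjacent and non-adjacent pairs) is consistent with measuring every pair of stations along a ring BM–A–B–C–D. The sketch below builds the design and covariance matrices under that assumed, hypothetical geometry:

```python
import numpy as np
from itertools import combinations

stations = ["BM", "A", "B", "C", "D"]      # ring order assumed (hypothetical layout)
col = {"A": 0, "B": 1, "C": 2, "D": 3}     # columns of A; the benchmark BM is held fixed
pairs = list(combinations(range(5), 2))    # all 10 station pairs -> n = 10 observations

A = np.zeros((10, 4))
variances = np.zeros(10)
for row, (i, j) in enumerate(pairs):
    if stations[j] in col:
        A[row, col[stations[j]]] = +1.0    # height difference runs "to" station j ...
    if stations[i] in col:
        A[row, col[stations[i]]] = -1.0    # ... "from" station i
    ring_gap = min(j - i, 5 - (j - i))     # 1 = adjacent (240 m), 2 = non-adjacent (400 m)
    p = 3 if ring_gap == 1 else 5          # number of instrument setups on the line
    sigma_mm = np.sqrt(2 * p) * 40 * 0.02  # sqrt(2p) * 40 m * 0.02 mm/m ~= 1.96 or 2.53 mm
    variances[row] = sigma_mm ** 2

cov_y = np.diag(variances)                 # uncorrelated observations (10 x 10)
```

With these matrices, the redundancy numbers follow from the diagonal of R in equation 8. Note that the layout assumed here is only one configuration compatible with the text; the values 0.46 and 0.75 quoted next refer to the authors' actual network.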
In the sense of reliability, the minimum and maximum redundancy numbers of the observations in the network are 0.46 and 0.75, respectively. This means that the ability to detect outliers is not uniform in every part of the network. We also computed the Minimal Detectable Bias (MDB) as an indicator of the internal reliability of the system. The MDB is derived from the local test proposed by Baarda (1968), which makes a decision between the null and a single alternative hypothesis. By definition, the MDB is based on the type I (false alert) and type II (missed detection) errors. The conventional MDBs range from 4.7σ to 6σ for α = 0.001; from 3.9σ for α = 0.01; from 3.5σ to 4.5σ for α = 0.025; from 3.2σ for α = 0.05; and, finally, from 2.8σ to 3.6σ for α = 0.1. The computation of these MDBs was based on a probability of type II error of 0.2 (Baarda, 1968). In addition, the maximum positive and negative correlations between the test statistics are 0.61 (between $\Delta h_4$ and $\Delta h_5$) and −0.58 (between $\Delta h_2$ and $\Delta h_3$), respectively. The correlation coefficient is presented by Förstner (1983).
Applying the method presented in section 2, the success and error probabilities of Iterative Data Snooping were estimated. Figure 2 shows the success rate (the number of experiments in which only the outlying observation was identified), i.e. the power of the Iterative Data Snooping testing procedure for one simulated outlier (γ). The misidentification rates are shown in Figures 3 and 4. Two classes of misidentification are counted in the simulations: the number of experiments in which the procedure identified no observation (type II error, β), and the number of experiments in which the procedure identified a single observation, but the wrong one (type III error, κ). In addition to these classes, we consider "over-identification", i.e. the number of experiments in which the procedure detects more outliers than simulated (Figure 5).
Figure 2. Success rate (power of the test) of the iterative data snooping testing procedure for simulated
leveling network vs. magnitude intervals of the outliers for each probability level α.
Figure 3. Type II error for simulated leveling network vs. magnitude intervals of the outliers for each
probability level α.
Figure 4. Type III error for simulated leveling network vs. magnitude intervals of the outliers for each
probability level α.
Figure 5. Over-identification vs. magnitude intervals of the outliers for each probability level α.
Figures 3–5 show that, in general, the lower the magnitude of the outliers, the lower the efficiency of data snooping in the simulated network. This is expected. However, there is no direct relation between the power of the test and the significance level (α) for the network analyzed. Here, we do not recommend using α = 0.1 (10%), because many good observations are eliminated. In this case, as shown in Figure 5, the over-identification rate stands out in relation to the other types of errors. Furthermore, in this case (α = 0.1), the power of the test is virtually independent of the outlier size (see Figure 2).
It appears that higher values of α are not recommended for outliers of greater magnitude and that lower values of α are not recommended for outliers of smaller magnitude. Therefore, these results show the importance of a correct choice of α, as pointed out by Lehmann (2012); they also highlight the challenges in controlling the error rate in multiple hypothesis tests. Regarding the three classes of misidentification rates, in general, an increase in the magnitude interval of the outliers leads to a slight increase in the over-identification rate (more outliers being identified than simulated) and a reduction of the type II error. This fact is due to the error propagation among all residuals. The rate of cases with the correct number of outliers but a wrong identification (type III error) also decreases with an increasing magnitude interval of the outliers.
We can observe in Figure 2 that, from the outlier magnitude interval (5σ–5.5σ) onwards, the power of the test (success rate) is practically stable for all significance levels (except for α = 0.001): approximately 50% (α = 0.1), 70% (α = 0.05), 83% (α = 0.025) and 92% (α = 0.01). For α = 0.001, the success rate is greater than 90% from the magnitude interval (5.5σ–6σ) onwards.
In general, considering the simulated network, the Iterative Data Snooping procedure is efficient for outliers greater than 5σ, with a mean success rate of 76.4% for α = 0.001 (0.1%); 82.8% for α = 0.01 (1%); 78% for α = 0.025 (2.5%); and 67.9% for α = 0.05 (5%). Therefore, with an appropriate choice of α, the results show that data snooping can locate an outlier of the order of magnitude 5σ with a high success rate. However, the number of outliers to be considered, which also affects the efficiency of Iterative Data Snooping, requires further investigation.
In order to compare the power of the test of Iterative Data Snooping with the conventional power of the test (80%), the method described in section 2 was also applied considering outliers with the size of the MDBs for each significance level. As pointed out above, the MDBs range from 4.7σ to 6σ for α = 0.001; from 3.9σ for α = 0.01; from 3.5σ to 4.5σ for α = 0.025; from 3.2σ for α = 0.05; and, finally, from 2.8σ to 3.6σ for α = 0.1. These MDBs were based on the conventional power of the test of 0.8 (80%). The probabilities of committing the different types of errors and the power of Iterative Data Snooping considering these MDBs for each significance level are shown in Table 1.
Table 1. Probabilities of Iterative Data Snooping (%) considering the size of the conventional MDBs

Significance level α   Power of the test (%)   Type II error (%)   Type III error (%)   Over-identification (%)
0.001                  77.09                   19.97               2.5                  0.45
0.01                   70.72                   17.66               6.76                 4.86
0.025                  62.94                   15.7                10.35                11.01
0.05                   53.05                   12.66               12.82                21.47
0.1                    38.41                   9.15                15.52                36.92
In Table 1, it is noticeable that the higher the significance level, the greater the divergence between the power of Iterative Data Snooping and the power used to compute the MDB (i.e. 80%). The explanation for this difference is that the computation of the MDB depends only on the type I and type II errors, whereas Iterative Data Snooping also involves the probability of a type III error and over-identification. In future research, we intend to investigate a function that relates the power of the test considering the MDB to the power of Iterative Data Snooping.
To conclude, it is important to mention that the outlying observation can be present among the detected observations in the over-identification case. If all detected observations are wrong, the over-identification case could be classified as a type III error. The over-identification case will be investigated in more detail in future studies.
4. CONCLUSIONS
Many methods of quality control for geodetic data analysis have been developed and investigated since the pioneering work of Baarda (1968). However, these methods still deserve further investigation. Thus, the goal of this paper was to analyze the data snooping testing procedure for locating an outlier by means of MCS. The MCS discards the use of the observation vector of the Gauss-Markov model. In fact, to perform the analysis, all that is needed are the geometrical network configuration (given by the Jacobian matrix), the uncertainty of the observations (given by the nominal standard deviation of the equipment), and the magnitude intervals of the outliers. The random errors (or residuals) are generated artificially from the normal distribution, while the size of the outliers is selected using the standard uniform distribution.
Iterative Data Snooping showed high success rates in the experiments with the simulated levelling network for a single outlier of magnitude above five standard deviations. However, the efficiency of Iterative Data Snooping decreases significantly for outliers smaller than five standard deviations. The efficiency of data snooping also depends on the significance level (α). Here, the optimal value of the significance level was 0.01 (1%) for the simulated network. When we inserted the MDB as an outlier in the geodetic network, we verified that the higher the significance level, the greater the difference between the power of the test of Iterative Data Snooping and the power of the test used for the MDB computation. In future research, we intend to investigate a function that relates the power of the test considering the MDB to the power of Iterative Data Snooping.
Finally, we have shown that Monte Carlo simulation is a feasible method for computing the probability levels associated with a statistical testing procedure without recourse to statistical tables. Future studies should consider various issues: the performance of data snooping in the case of linearized (originally non-linear) models; geodetic networks in the sense of multiple outliers; the development of reliability measures; and the performance of the method in different networks with various geometries and varying redundancy. There are other approaches for identifying multiple outliers in the observations, such as the recent proposal of Lehmann & Lösler (2016) using the p-value concept and the Sequential Likelihood Ratio Tests for Multiple Outliers (SLRTMO) presented by Klein et al. (2016). A suggestion for future work is to increase the power of the test (success rate) of the Iterative Data Snooping procedure by means of a unifying testing procedure relating Iterative Data Snooping (a single outlier at a time) to approaches for multiple outlier identification such as those presented in Lehmann & Lösler (2016) and Klein et al. (2016).
5. ACKNOWLEDGMENTS
The authors thank CNPq for the financial support provided to the first author (proc. n. 305599/2015-1) and Fapemig for the scientific initiation fellowship of the fourth author.
6. REFERENCES
Aydin, C., & Demirel, H. (2004). Computation of Baarda's lower bound of the non-centrality parameter. Journal of Geodesy, 78(78), 437–441.
Baarda, W. (1968). A testing procedure for use in geodetic networks. Netherlands Geodetic Commission, Publications on Geodesy, 2(5).
Berber, M., & Hekimoglu, S. (2003). What is the reliability of conventional outlier detection and robust estimation in trilateration networks? Survey Review, 37(290), 308–318.
Erdogan, B. (2014). An outlier detection method in geodetic networks based on the original observations. Boletim de Ciências Geodésicas, 20(3), 578–589.
Förstner, W. (1983). Reliability and discernability of extended Gauss-Markov models. Deutsche Geodätische Kommission, Seminar on Mathematical Models of Geodetic/Photogrammetric Point Determination with Regard to Outliers and Systematic Errors, 79–104.
Ghilani, C. D. (2017). Adjustment computations: spatial data analysis. John Wiley & Sons.
Hekimoglu, S., & Koch, K. R. (1999). How can reliability of the robust methods be measured? Proceedings of the Third Turkish-German Joint Geodetic Days, Istanbul, 179–196.
Huber, P. J. (1992). Robust estimation of a location parameter. In Springer Series in Statistics (pp. 492–518).
Kavouras, M. (1982). On the detection of outliers and the determination of reliability in geodetic networks. M.Sc.E. thesis, Department of Geodesy and Geomatics Engineering, University of New Brunswick.
Klein, I., Matsuoka, M. T., Souza, S. F. de, & Veronez, M. R. (2011). Adjustment of observations: a geometric interpretation for the least squares method. Boletim de Ciências Geodésicas, 17(2), 272–294.
Klein, I., Matsuoka, M. T., Souza, S. F. de, & Collischonn, C. (2012). Design of geodetic networks reliable against multiple outliers. Boletim de Ciências Geodésicas, 18(3), 480–507.
Klein, I., Matsuoka, M. T., Guzatto, M. P., de Souza, S. F., & Veronez, M. R. (2014). On evaluation of different methods for quality control of correlated observations. Survey Review, 47(340), 28–35.
Klein, I., Matsuoka, M. T., & Guzatto, M. P. (2015). How to estimate the minimum power of the test and bound values for the confidence interval of data snooping procedure. Boletim de Ciências Geodésicas, 21(1), 26–42.
Klein, I., Matsuoka, M. T., Guzatto, M. P., & Nievinski, F. G. (2016). An approach to identify multiple outliers based on sequential likelihood ratio tests. Survey Review, 49(357), 449–457.
Koch, K.-R. (1999). Parameter estimation and hypothesis testing in linear models. Springer.
Lehmann, R. (2012a). Improved critical values for extreme normalized and studentized residuals in Gauss-Markov models. Journal of Geodesy, 86(12), 1137–1146.
Lehmann, R. (2012b). On the formulation of the alternative hypothesis for geodetic outlier detection. Journal of Geodesy, 87(4), 373–386.
Lehmann, R., & Lösler, M. (2016). Multiple outlier detection: hypothesis tests versus model selection by information criteria. Journal of Surveying Engineering, 142(4).
Lehmann, R., & Scheffler, T. (2011). Monte Carlo-based data snooping with application to a geodetic network. Journal of Applied Geodesy, 5(3–4).
Lehmann, R., & Voß-Böhme, A. (2017). On the statistical power of Baarda's outlier test and some alternative. Journal of Geodetic Science, 7(1), 68–78.
Niemeier, W., & Tengen, D. (2017). Uncertainty assessment in geodetic network adjustment by combining GUM and Monte-Carlo-simulations. Journal of Applied Geodesy, 11(2), 67–76.
Rofatto, V. F., Matsuoka, M. T., & Klein, I. (2017). An attempt to analyse Baarda's iterative data snooping procedure based on Monte Carlo simulation. South African Journal of Geomatics, 6(3), 416.
Rousseeuw, P. J., & Leroy, A. M. (1987). Robust regression and outlier detection. John Wiley & Sons.
Teunissen, P. J. G. (2003). Adjustment theory. VSSD.
Teunissen, P. J. G. (2006). Testing theory. VSSD.
Wilcox, R. R. (2011). Introduction to robust estimation and hypothesis testing. Academic Press.
Yang, L., Wang, J., Knight, N. L., & Shen, Y. (2013). Outlier separability analysis with a multiple alternative hypotheses test. Journal of Geodesy, 87(6), 591–604.