Pei Sun’s research while affiliated with The University of Sydney and other places


Publications (8)


Figure 12: False positive reduction after repeating Phase 1 and intersecting the results. The lines correspond to datasets of varying size, given in the legend (e.g. 10k means the dataset contains 10,000 points). In all cases the reduction after each run and intersection is of exponential magnitude.
Figure 13: False positive reduction after repeating Phase 1 and intersecting the results. The lines correspond to datasets of varying dimensionality, given in the legend. The reduction in the number of non-outliers remains of exponential magnitude independent of the dimensionality.
Figure 14: Runtime for varying dataset size. Phase 2 is not included in these results. Our results for two and three runs of Phase 1 plus the intersection of the candidate sets are shown against the Bay-Schwabacher algorithm.
Figure 15: Runtime for varying dataset dimensionality. Phase 2 is not included in these results. Our results for two and three runs of Phase 1 plus the intersection of the candidate sets are shown against the Bay-Schwabacher algorithm.
Disk-Based Sampling for Outlier Detection in High Dimensional Data
  • Conference Paper
  • Full-text available

January 2008 · 48 Reads · 2 Citations

Timothy de Vries · Pei Sun

We propose an efficient sampling-based outlier detection method for large high-dimensional data. Our method consists of two phases. In the first phase, we combine a “sampling” strategy with a simple randomized partitioning technique to generate a candidate set of outliers. This phase requires one full data scan and the running time has linear complexity with respect to the size and dimensionality of the data set. An additional data scan, which constitutes the second phase, extracts the actual outliers from the candidate set. The running time for this phase has complexity O(CN) where C and N are the size of the candidate set and the data set respectively. The major strengths of the proposed approach are that (1) no partitioning of the dimensions is required, making it particularly suitable for high-dimensional data, and (2) both continuous and categorical attributes can be handled.
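
To make the two-phase structure concrete, below is a minimal sketch in Python. It is not the paper's algorithm: the sample size, candidate-set size, and the use of Euclidean nearest-neighbour distance as the outlier score are illustrative assumptions, and the paper's first phase additionally relies on a successive-sampling strategy with randomized partitioning and supports categorical attributes. The sketch only shows the shape of the computation: one linear pass over the data to build a candidate set, followed by an O(CN) verification pass restricted to the candidates.

```python
import numpy as np

def two_phase_outliers(data, sample_size=64, n_candidates=50, n_outliers=10, seed=0):
    """Simplified two-phase sketch (illustrative, not the paper's algorithm):
    Phase 1 scores every point against a small random sample in a single pass
    and keeps the most isolated points as candidates; Phase 2 rescans the full
    data to rank only the candidates exactly (O(C*N) work for C candidates)."""
    rng = np.random.default_rng(seed)
    n = len(data)

    # Phase 1: one full scan, comparing each point to a fixed random sample.
    sample = data[rng.choice(n, size=min(sample_size, n), replace=False)]
    approx_score = np.array([np.linalg.norm(sample - x, axis=1).min() for x in data])
    candidates = np.argsort(approx_score)[-n_candidates:]   # points far from every sampled point

    # Phase 2: second scan, exact nearest-neighbour distance for the candidates only.
    exact_score = []
    for idx in candidates:
        d = np.linalg.norm(data - data[idx], axis=1)
        d[idx] = np.inf                                      # ignore the self-distance
        exact_score.append(d.min())
    order = np.argsort(exact_score)[::-1][:n_outliers]       # largest distances first
    return candidates[order]
```

In this simplified form, Phase 1 costs one pass over the N points and Phase 2 costs O(CN) distance computations, mirroring the complexity stated in the abstract.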


Mining for Outliers in Sequential Databases

April 2006 · 93 Reads · 132 Citations

The mining of outliers (or anomaly detection) in large databases continues to remain an active area of research with many potential applications. Over the last several years many novel methods have been proposed to efficiently and accurately mine for outliers. In this paper we propose a unique approach to mine for sequential outliers using Probabilistic Suffix Trees (PST). The key insight that underpins our work is that we can distinguish outliers from non-outliers by only examining the nodes close to the root of the PST. Thus, if the goal is to just mine outliers, then we can drastically reduce the size of the PST and reduce its construction and query time. In our experiments, we show that on a real data set consisting of protein sequences, by retaining less than 5% of the original PST we can retrieve all the outliers that were reported by the full-sized PST. We also carry out a detailed comparison between two measures of sequence similarity, the normalized probability and the odds, and show that while the current research literature in PST favours the odds, for outlier detection it is the normalized probability which gives far superior results. We provide an information theoretic argument based on entropy to explain the success of the normalized probability measure. Finally, we describe a more efficient implementation of the PST algorithm, which dramatically reduces its construction time compared to the implementation of Bejerano (3).
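
The two ideas that drive this paper, keeping only PST nodes close to the root and scoring sequences by a length-normalized probability, can be illustrated with a depth-limited context model. This is a hedged sketch rather than the PST construction of Bejerano: `max_depth`, the `alpha` smoothing constant, and the simple back-off rule are assumptions chosen for brevity.

```python
from collections import defaultdict
import math

def train_shallow_pst(sequences, max_depth=2, alpha=0.01):
    """Count symbols following every context of length <= max_depth.
    Keeping only short contexts mimics retaining just the PST nodes close
    to the root (max_depth and alpha are illustrative choices)."""
    counts = defaultdict(lambda: defaultdict(int))
    alphabet = set()
    for seq in sequences:
        for i, sym in enumerate(seq):
            alphabet.add(sym)
            for d in range(max_depth + 1):
                if i - d >= 0:
                    counts[seq[i - d:i]][sym] += 1
    return counts, alphabet, alpha, max_depth

def normalized_log_prob(seq, model):
    """Length-normalized log-likelihood (the geometric-mean per-symbol
    probability), the score the paper reports as better than the odds."""
    counts, alphabet, alpha, max_depth = model
    total = 0.0
    for i, sym in enumerate(seq):
        # back off to the longest context that was seen during training
        dist = {}
        for d in range(min(i, max_depth), -1, -1):
            if seq[i - d:i] in counts:
                dist = counts[seq[i - d:i]]
                break
        denom = sum(dist.values()) + alpha * len(alphabet)
        total += math.log((dist.get(sym, 0) + alpha) / denom)
    return total / len(seq)   # lower value = more anomalous sequence
```

Sequences whose normalized log-probability falls in the lowest tail of the training set would be reported as outliers, which is the role the normalized probability plays in the paper.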


SLOM: A new measure for local spatial outliers

April 2006 · 213 Reads · 115 Citations

Knowledge and Information Systems

We propose a measure, the spatial local outlier measure (SLOM), which captures the local behaviour of data in their spatial neighbourhood. With the help of SLOM, we are able to discern local spatial outliers that are usually missed by global techniques, like “three standard deviations away from the mean”. Furthermore, the measure takes into account the local stability around a data point and suppresses the reporting of outliers in highly unstable areas, where data are too heterogeneous and the notion of outliers is not meaningful. We prove several properties of SLOM and report experiments on synthetic and real data sets that show that our approach is novel and scalable to large datasets.
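
As a rough illustration of what a local spatial outlier measure computes, the sketch below scores each point by a trimmed mean of attribute differences to its spatial neighbours, damped by a stability factor so that points in highly heterogeneous regions are not all flagged. This is not the SLOM formula from the paper; the neighbourhood size `k`, the trimming rule, and the form of the stability term are assumptions made for the example.

```python
import numpy as np

def slom_like_scores(coords, values, k=8):
    """Illustrative local spatial outlier score (not the paper's exact SLOM):
    coords is an (n, 2) array of locations, values an (n,) array of attributes."""
    n = len(coords)
    # pairwise spatial distances, used to find the k nearest neighbours of each point
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    scores = np.empty(n)
    for i in range(n):
        nbrs = np.argsort(dists[i])[1:k + 1]                 # skip the point itself
        diffs = np.abs(values[nbrs] - values[i])
        trimmed = np.sort(diffs)[:-1].mean() if len(diffs) > 1 else diffs.mean()  # drop the largest difference
        # stability: small in unstable areas where the neighbours disagree with each other
        stability = 1.0 / (1.0 + np.std(values[nbrs]))
        scores[i] = trimmed * stability
    return scores   # higher score = stronger local spatial outlier
```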


Striking Two Birds With One Stone: Simultaneous Mining of Positive and Negative Spatial Patterns

April 2005 · 45 Reads · 15 Citations

We propose an efficient algorithm to mine positive and negative patterns in large spatial databases. The algorithm is based on exploiting a complementarity property for a certain support-like measure. This property guarantees that if a positive k-pattern is "frequent" then O(k) related negative patterns will be infrequent. For the traditional support measure this complementarity property holds true only when the minimum support is over fifty percent. We also confirm the correctness of our approach using Ripley's K-function, a standard tool in spatial statistics for analyzing point patterns. Extensive experimentation on data extracted from the Sloan Digital Sky Survey (SDSS) database demonstrates the utility of our approach to large scale data exploration.
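
The complementarity property for the traditional support measure is easy to see in a toy example: sup(A) = sup(A, B) + sup(A, ¬B), so once a positive pattern exceeds a minimum support greater than fifty percent, the related negative pattern cannot also be frequent. The feature names, the neighbourhood "transactions", and the minsup value below are invented for illustration and are not taken from the paper.

```python
def support(transactions, present, absent=frozenset()):
    """Fraction of transactions containing every feature in `present`
    and none of the features in `absent`."""
    hits = sum(1 for t in transactions if present <= t and not (absent & t))
    return hits / len(transactions)

# Toy neighbourhood "transactions": the set of spatial features observed around each point.
transactions = [frozenset(t) for t in (
    {"galaxy", "quasar"}, {"galaxy", "quasar"}, {"galaxy", "quasar"},
    {"galaxy"}, {"star"},
)]

minsup = 0.55   # illustrative threshold above 50%

# sup(galaxy) = sup(galaxy, quasar) + sup(galaxy, not quasar), so if the positive
# pattern is frequent (>= minsup > 0.5), the negative pattern's support is at most
# 1 - minsup < minsup, i.e. it is guaranteed to be infrequent.
print(support(transactions, frozenset({"galaxy", "quasar"})))               # 0.6 (frequent)
print(support(transactions, frozenset({"galaxy"}), frozenset({"quasar"})))  # 0.2 (infrequent)
```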


On local spatial outliers

December 2004 · 49 Reads · 122 Citations

We propose a measure, the spatial local outlier measure (SLOM), which captures the local behaviour of data in their spatial neighborhood. With the help of SLOM, we are able to discern local spatial outliers which are usually missed by global techniques like "three standard deviations away from the mean". Furthermore, the measure takes into account the local stability around a data point and suppresses the reporting of outliers in highly unstable areas, where data are too heterogeneous and the notion of outliers is not meaningful. We prove several properties of SLOM and report experiments on synthetic and real data sets which show that our approach is scalable to large data sets.


Complex Spatial Relationships

September 2003 · 40 Reads · 41 Citations

This paper describes the need for mining complex relationships in spatial data. Complex relationships are defined as those involving two or more of: multi-feature co-location, self-co-location, one-to-many relationships, self-exclusion and multi-feature exclusion. We demonstrate that even in the mining of simple relationships, knowledge of complex relationships is necessary to accurately calculate the significance of results. We implement a representation of spatial data such that it contains 'weak-monotonic' properties, which are exploited for the efficient mining of complex relationships, and discuss the strengths and limitations of this representation.
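
As a small illustration of the relationship types listed above, the sketch below computes a naive co-location ratio between two feature types and a self-co-location ratio for a single feature; values near one suggest co-location and values near zero suggest exclusion. The radius parameter and the ratio itself are illustrative assumptions; the paper argues for a significance calculation rather than a fixed threshold on such a ratio.

```python
import numpy as np

def colocation_ratio(points_a, points_b, radius):
    """Fraction of feature-A instances with at least one feature-B instance
    within `radius` (points_a, points_b are (n, 2) coordinate arrays)."""
    if len(points_a) == 0 or len(points_b) == 0:
        return 0.0
    d = np.linalg.norm(points_a[:, None, :] - points_b[None, :, :], axis=2)
    return float(np.mean((d <= radius).any(axis=1)))

def self_colocation_ratio(points, radius):
    """Same idea for a single feature type: a point must not count as its own neighbour."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    return float(np.mean((d <= radius).any(axis=1)))
```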



Disk-Based Successive Sampling for Outlier Detection in High Dimensional Data

7 Reads

We propose a sampling-based outlier detection method for large high-dimensional data. Our method consists of two phases. In the first phase, we combine a “successive sampling” strategy with a simple randomized partitioning technique to generate a candidate set of outliers. This phase requires one full data scan and the running time has linear complexity with respect to the size and dimensionality of the data set. An additional data scan, which constitutes the second phase, extracts the actual outliers from the candidate set. The running time for this phase has complexity O(CN) where C and N are the size of the candidate set and the data set respectively. A major strength of the proposed approach is that no partitioning of the dimensions is required, thus making it particularly suitable for high-dimensional data. Furthermore, our method can handle both continuous and categorical attributes. We also present a detailed experimental evaluation of our proposed method on real and synthetic data sets.

Citations (7)


... A consequence is the continued appearance of new models and new methods based on a diversity of schemes and approaches to the problem at hand. One of these new propositions is the application of Rough Set Theory to outlier detection, where previous studies [20] and the results and achievements already attained in this line of research [45] serve as the main precedents. ...

Reference:

Algorithm for the detection of outliers based on the theory of rough sets
Outlier Detection: Principles, Techniques and Applications
  • Citing Article

... A spatial-temporal approach has been used in medical science to differentiate epidemic risk patterns [2]. An outlier detection approach has also been used for PCB testing in industry [3]. An application of such techniques is discussed in [4] for outlier detection in high dimensional data. Another application was proposed by [5] for temporal outlier detection in vehicle traffic data. ...

Disk-Based Sampling for Outlier Detection in High Dimensional Data

... Arunasalam et al. [3] classify spatial relationships into four different types: Positive, Negative, Self-Co-location, and Complex. To discover SCPs based upon complex relationships, Verhein et al. [15] propose an approach based on a non-Apriori algorithm. ...

Striking Two Birds With One Stone: Simultaneous Mining of Positive and Negative Spatial Patterns
  • Citing Conference Paper
  • April 2005

... However, in practice, PCA leads to information loss due to the number of eigenvalues selected to reduce the dimension of the data set. Apart from PCA, other methods such as feature or variable selection (Sun et al., 2006) and a host of other dimension reduction techniques have been proposed. In this study, a unique transpose procedure is proposed to solve the dimensionality problems for Pearson correlation (PC). ...

Mining for Outliers in Sequential Databases
  • Citing Conference Paper
  • April 2006

... The goal of the density-based anomaly or local anomaly detection strategy is to locate observations that deviate from the norm in their immediate surroundings [167]. Local anomalies differ from global anomalies, which take into account not only observations made locally but also those made globally [166][167][168][169]. Global anomalies are incongruous with the pattern given by the majority of all other observations. ...

SLOM: A new measure for local spatial outliers
  • Citing Article
  • April 2006

Knowledge and Information Systems

... There are algorithms that follow this assumption, for example the Cluster-Based Local Outlier Factor (CBLOF), which was proposed by Pires and Santos-Pereira (2005), He, Xu and Deng (2003), and Jiang, Tseng and Su (2001). Improvements to this algorithm have appeared in the literature, where extensions to the algorithm have been developed (Sun and Chawla, 2004). ...

On local spatial outliers
  • Citing Conference Paper
  • December 2004

... In terms of spatial colocation rules, several scholars have proposed methods for mining local (Celik, Kang, and Shekhar 2007; Yue et al. 2017; Yao et al. 2018) and multilevel (Deng et al. 2017) colocation patterns to address the possible deficiencies found in existing models. Based on the definition by Munro (Munro, Chawla, and Sun 2003), scholars have also proposed analytical models to detect positive and negative colocation patterns (Wan, Zhou, and Bian 2008; Jiang et al. 2010). When it comes to data type, the majority of the remaining methods support only point data, although the colocation pattern detection method based on spatial events proposed by Koperski and Han (1995) supports point, line, and polygonal data. ...

Complex Spatial Relationships
  • Citing Article
  • September 2003