International Journal of Machine Learning and Cybernetics (2021) 12:2185–2197
https://doi.org/10.1007/s13042-021-01300-0
ORIGINAL ARTICLE
Online domain description ofbig data based onhyperellipsoid models
ZengshuaiQiu1
Received: 13 June 2020 / Accepted: 8 March 2021 / Published online: 13 April 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021
Abstract
Big data is usually massive, diverse, time-varying, and high-dimensional. The focus of this paper is the domain description of big data, which is the basis for addressing these challenges. This paper makes three main contributions. First, a hyperellipsoid model is proposed for the domain description of big data. The parameters of the hyperellipsoid model are adjusted adaptively according to the proposed objective function, without relying on manual parameter selection, which widens the range of applications of the model. Second, an improved FDPC algorithm is proposed to generate multiple hyperellipsoid models that approximate the spatial distribution of big data, thus improving the accuracy of the domain description. Multiple hyperellipsoid models not only largely eliminate the spatial redundancy of a domain description based on a single hyperellipsoid model, but also provide a feasible way to describe complex spatial distributions. Third, an online domain description algorithm based on hyperellipsoid models is proposed, which improves the robustness of the models on time-varying data; a parallel processing flow for the algorithm is also given. In the experiments, synthetic instances and real-world datasets were used to test the performance of the hyperellipsoid models. Compared with LOF, one-class SVM, SVDD, and isolation forest, the performance of the proposed method is competitive and promising.
Keywords: Big data · Hyperellipsoid model · Online domain description
1 Introduction
Due to the rapid development of the mobile Internet, the Internet of Things, cloud computing, and other technologies, the era of big data has arrived. Big data brings many challenges, such as the representation and storage of big data, effective information mining, big data retrieval, real-time analysis of big data, and so on [1]. The domain description of big data is the basis for solving these problems. Domain description differs from the classification and regression problems of pattern recognition: its purpose is to determine the spatial distribution of big data, which can be understood as a one-class problem from the perspective of pattern recognition. The domain description of big data is usually evaluated through outlier detection and novelty detection.
Many researchers have proposed different data domain description methods. In [3], an evidence-support method is proposed to solve the data domain description problem under a multivariate positive distribution. The method can produce more contrasting outlier scores in high-dimensional datasets; however, it is limited by the joint probability density of high-dimensional data. In [4], a concept called the local projection score (LPS) is introduced to indicate the degree of deviation of an observation from its neighborhood. The LPS is obtained from neighborhood information by a low-rank approximation technique, and observations with high LPS are promising outlier candidates. However, the method relies on manual parameter selection, which greatly limits its scope of application.
To improve the robustness of the model under different data distributions, a novel convolutional neural network (CNN) approach was proposed in [5]. Because the method is based on convolutional neural networks, the model is strongly nonlinear and can describe the diversity of the data more accurately. An important drawback of this method is that training the model takes a long time, making it difficult to ensure real-time model updates. Considering the prior information of negative samples in practical applications, a novel outlier detection approach is presented in [6]. This method introduces a likelihood
References
Article
In this paper, we propose a real-time (50 fps) image superpixel segmentation method using the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm. To decrease the computational cost of superpixel algorithms, we adopt a fast two-step framework. In the first, clustering stage, the DBSCAN algorithm with color-similarity and geometric restrictions rapidly clusters the pixels; in the second, merging stage, small clusters are merged into superpixels through their neighborhoods using a distance measure defined by color and spatial features. A robust and simple distance function is defined to obtain better superpixels in these two steps. The experimental results demonstrate that our real-time (50 fps) DBSCAN-based superpixel algorithm outperforms state-of-the-art superpixel segmentation methods in terms of both accuracy and efficiency.
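A minimal sketch of the first (clustering) stage in the spirit of this cited method: pixels are clustered with DBSCAN over joint color-and-position features, where the spatial scaling factor plays the role of the geometric restriction. The toy image, the spatial weight, and the DBSCAN parameters below are illustrative assumptions; the actual paper uses a specialized distance function and a second merging stage.

import numpy as np
from sklearn.cluster import DBSCAN

def dbscan_superpixels(image, spatial_weight=0.5, eps=6.0, min_samples=8):
    # Per-pixel features: RGB color plus (down-weighted) x, y coordinates.
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    feats = np.column_stack([
        image.reshape(-1, 3).astype(float),   # color similarity
        spatial_weight * xs.ravel(),          # geometric restriction
        spatial_weight * ys.ravel(),
    ])
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
    return labels.reshape(h, w)

# Toy image with two flat color regions; DBSCAN recovers the two regions.
img = np.zeros((40, 40, 3))
img[:, :20] = [200, 30, 30]
img[:, 20:] = [30, 30, 200]
print(np.unique(dbscan_superpixels(img)))     # expected: [0 1]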
Article
As one of the efficient approaches to dealing with big data, divide-and-conquer distributed algorithms, such as distributed kernel regression, bootstrap, and structured perception training algorithms, have been proposed and are broadly used in learning systems. Learning theories have been built to analyze the feasibility, approximation, and convergence bounds of these distributed learning algorithms; however, less work has studied their stability. In this paper, we discuss the generalization bounds of distributed learning algorithms from the view of algorithmic stability. First, we introduce a definition of uniform distributed stability for distributed algorithms and study their generalization risk bounds. Then, we analyze the stability properties and generalization risk bounds of a class of regularization-based distributed algorithms. The two generalization distributed risk bounds obtained show that the difference between the generalization distributed risk and the empirical distributed/leave-one-computer-out risk is closely related to the sample size n and the number of working computers m, scaling as O(m/√n). Furthermore, the results indicate that, for good generalization of a regularized distributed kernel algorithm, the regularization parameter λ should be adjusted with the change of the term m/√n. These theoretical discoveries provide useful guidance when deploying distributed algorithms on practical big data platforms. We explore our theoretical analyses through two simulation experiments. Finally, we discuss problems concerning the sufficient number of working computers, nonequivalence, and generalization for distributed learning, and show that rules for computation on a single computer may not always hold for distributed learning.
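A minimal sketch of the divide-and-conquer pattern analyzed above: the data are split across m simulated workers, a regularized kernel model is trained on each shard, and the local predictions are averaged. The dataset, kernel, and regularization value below are illustrative assumptions; the cited analysis suggests the regularization parameter should be tuned jointly with the term m/√n.

import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)
n, m = 3000, 6                                   # sample size and worker count
X = rng.uniform(-3, 3, size=(n, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=n)

# Divide: shard the data across m "workers" and train a local model on each.
models = [
    KernelRidge(kernel="rbf", alpha=0.1).fit(Xs, ys)
    for Xs, ys in zip(np.array_split(X, m), np.array_split(y, m))
]

# Conquer: average the local predictions.
X_test = np.linspace(-3, 3, 5).reshape(-1, 1)
y_hat = np.mean([mdl.predict(X_test) for mdl in models], axis=0)
print(np.round(y_hat, 3))                        # should roughly track sin(x)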
Article
Being one of the most multifaceted cyber-physical systems, smart grids are arguably more prone to cyber-threats. A covert data integrity assault on a communications network may be lethal to the reliability and safety of smart grid operations. Intelligently designed to sidestep the traditional bad data detector in power control centers, this type of assault can compromise the integrity of the data, causing a false state estimation that severely disrupts the entire power system operation. In this paper, we propose an unsupervised machine learning-based scheme to detect covert data integrity assaults in smart grid communications networks using non-labeled data. The proposed scheme employs a state-of-the-art algorithm, called isolation forest, and detects covert data integrity assaults based on the hypothesis that an assault has the shortest average path length in a constructed random forest. To tackle the dimensionality issue arising from the growth of power systems, we use a principal component analysis-based feature extraction technique. The proposed scheme is evaluated on standard IEEE 14-bus, 39-bus, 57-bus, and 118-bus systems. Simulation results show that the proposed scheme is proficient at handling non-labeled historical measurement datasets and yields a significant improvement in attack detection accuracy.
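The detection pipeline described above can be approximated in a few lines: PCA reduces the dimensionality of the unlabeled measurement vectors, and an isolation forest scores each sample by how easily it is isolated. The synthetic measurements and all parameter values are assumptions for illustration; the cited paper evaluates on real IEEE bus systems.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(2000, 50))              # unlabeled measurements
attacked = normal[:20] + rng.normal(0, 4, size=(20, 50))  # injected false data

detector = make_pipeline(
    PCA(n_components=10),                 # tackle dimensionality growth
    IsolationForest(n_estimators=100, contamination=0.01, random_state=0),
)
detector.fit(normal)
# -1 flags a suspected assault (short average path length), +1 normal data.
print(detector.predict(attacked))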
Article
We present a novel convolutional neural network (CNN)-based approach for one-class classification. The idea is to use zero-centered Gaussian noise in the latent space as the pseudo-negative class and train the network with the cross-entropy loss to learn both a good representation and the decision boundary for the given class. A key feature of the proposed approach is that any pre-trained CNN can be used as the base network for one-class classification. The proposed One-Class CNN (OC-CNN) is evaluated on the UMDAA-02 Face, Abnormality-1001, and FounderType-200 datasets, which relate to a variety of one-class application problems such as user authentication, abnormality detection, and novelty detection.
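The key trick described above can be sketched without the full CNN: latent features of the target class are paired with zero-centered Gaussian noise acting as the pseudo-negative class, and a binary classifier is trained on the mixture. Here a logistic regression stands in for the network head and the feature matrix is simulated; both are assumptions, since the actual method uses a pre-trained CNN to produce the features.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Stand-in for latent features of the one known class (e.g. CNN embeddings).
feats = rng.normal(loc=2.0, scale=0.5, size=(500, 64))
# Pseudo-negative class: zero-centered Gaussian noise in the same latent space.
noise = rng.normal(loc=0.0, scale=0.5, size=(500, 64))

X = np.vstack([feats, noise])
y = np.concatenate([np.ones(500), np.zeros(500)])
clf = LogisticRegression(max_iter=1000).fit(X, y)   # cross-entropy objective

# Score new samples: probability of belonging to the known class.
print(clf.predict_proba(feats[:1])[0, 1])           # close to 1
print(clf.predict_proba(noise[:1])[0, 1])           # close to 0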
Article
We introduce an online outlier detection algorithm to detect outliers in a sequentially observed data stream. For this purpose, we use a two-stage filtering and hedging approach. In the first stage, we construct a multi-modal probability density function to model the normal samples. In the second stage, given a new observation, we label it as an anomaly if the value of this density function at the newly observed point is below a specified threshold. To construct our multi-modal density function, we use an incremental decision tree to build a set of subspaces of the observation space. We train a single-component density function of the exponential family on the observations that fall inside each subspace represented on the tree. These single-component density functions are then adaptively combined to produce our multi-modal density function, which is shown to achieve the performance of the best convex combination of the density functions defined on the subspaces. As we observe more samples, our tree grows and produces more subspaces; as a result, our modeling power increases over time while mitigating overfitting. To choose the threshold level for labeling observations, we use an adaptive thresholding scheme, which we show achieves the performance of the optimal pre-fixed threshold level that knows the observation labels in hindsight. Our algorithm provides significant performance improvements over the state of the art in a wide set of experiments involving both synthetic and real data.
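A drastically simplified version of the two-stage idea: a single diagonal-Gaussian density is maintained incrementally over the stream (a stand-in for the tree-based multi-modal density), and each new point is labeled anomalous when its log-density falls below a threshold that adapts to recent scores. The forgetting factor, window size, and threshold quantile are illustrative assumptions.

import numpy as np

class StreamingDetector:
    def __init__(self, dim, alpha=0.01, q=0.05, window=200):
        self.mu = np.zeros(dim)
        self.var = np.ones(dim)
        self.alpha = alpha            # forgetting factor for the running stats
        self.q = q                    # quantile for the adaptive threshold
        self.window = window
        self.scores = []

    def log_density(self, x):
        # Diagonal-Gaussian log density under the current running estimate.
        return -0.5 * np.sum(np.log(2 * np.pi * self.var)
                             + (x - self.mu) ** 2 / self.var)

    def update(self, x):
        score = self.log_density(x)
        self.scores = (self.scores + [score])[-self.window:]
        # Stage two: adaptive threshold as a low quantile of recent scores.
        is_anomaly = score < np.quantile(self.scores, self.q)
        # Stage one: incrementally refresh the density model.
        self.mu = (1 - self.alpha) * self.mu + self.alpha * x
        self.var = (1 - self.alpha) * self.var + self.alpha * (x - self.mu) ** 2
        return is_anomaly

det = StreamingDetector(dim=2)
rng = np.random.default_rng(0)
for x in rng.normal(size=(500, 2)):
    det.update(x)
print(det.update(np.array([10.0, 10.0])))   # far from the stream: True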
Article
How to tackle the high dimensionality of data effectively and efficiently is still a challenging issue in machine learning. Identifying anomalous objects in given data has a broad range of real-world applications. Although many classical outlier detection or ranking algorithms have appeared over the past years, the high-dimensionality problem, as well as the size of the neighborhood, has not yet attracted sufficient attention in outlier detection. The former may trigger the distance concentration problem, whereby the distances between observations in high-dimensional space tend to be indiscernible, whereas the latter requires appropriate parameter values, making models highly complex and more sensitive. To partially circumvent these problems, especially the high dimensionality, we introduce a concept called the local projection score (LPS) to represent the degree of deviation of an observation from its neighbors. The LPS is obtained from neighborhood information by the technique of low-rank approximation. An observation with a high LPS is, with high probability, a promising outlier candidate. Based on this notion, we propose an efficient and effective outlier detection algorithm that is also robust to the parameter k of the k nearest neighbors. Extensive evaluation experiments conducted on twelve public real-world data sets with five popular outlier detection algorithms show that the performance of the proposed method is competitive and promising.
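A sketch of the low-rank-neighborhood idea behind LPS: for each observation, take its k nearest neighbors, form a rank-r basis of the centered neighborhood via truncated SVD, and score the point by how poorly that low-rank subspace reconstructs it. The exact scoring in [4] differs; k, r, and the planted-outlier data below are assumptions for illustration.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def lps_like_scores(X, k=15, r=2):
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)              # idx[:, 0] is the point itself
    scores = np.empty(len(X))
    for i, neighbors in enumerate(idx):
        N = X[neighbors[1:]]               # k-neighborhood of point i
        mu = N.mean(axis=0)
        # Rank-r basis of the centered neighborhood (low-rank approximation).
        _, _, Vt = np.linalg.svd(N - mu, full_matrices=False)
        V = Vt[:r].T
        resid = (X[i] - mu) - V @ (V.T @ (X[i] - mu))
        scores[i] = np.linalg.norm(resid)  # high score: deviates from neighbors
    return scores

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 5)), [[6, 6, 6, 6, 6]]])  # planted outlier
print(np.argmax(lps_like_scores(X)))       # expected: index 200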
Article
In this paper, we propose a general model to address the overfitting problem in online similarity learning for big data, which is generally caused by two kinds of redundancy: (1) feature redundancy, that is, redundant (irrelevant) features exist in the training data; and (2) rank redundancy, that is, the non-redundant (relevant) features lie in a low-rank space. To overcome these, our model is designed to obtain a simple and robust metric matrix by detecting the redundant rows and columns of the metric matrix and constraining the remaining matrix to a low-rank space. To reduce feature redundancy, we employ group-sparsity regularization, i.e., the ℓ2,1 norm, to encourage a sparse feature set. To address rank redundancy, we adopt low-rank regularization via the max norm, instead of calculating the SVD as in traditional models based on the nuclear norm. Our model can therefore not only generate a low-rank metric matrix to avoid overfitting, but also achieve feature selection simultaneously. For model optimization, an online algorithm based on the stochastic proximal method is derived to solve the problem efficiently, with complexity O(d²). To validate the effectiveness and efficiency of our algorithms, we apply our model to online scene categorization and synthesized data and conduct experiments on various benchmark datasets with comparisons to several state-of-the-art methods. Our model is as efficient as the fastest online similarity learning model, OASIS, while generally performing as well as the accurate model OMLLR. Moreover, our model can simultaneously exclude irrelevant/redundant feature dimensions.
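The group-sparsity mechanism described above can be illustrated with the proximal operator of the ℓ2,1 norm, which shrinks whole rows of the metric matrix toward zero and thereby discards redundant feature dimensions. This is only the feature-redundancy half of the model; the max-norm (low-rank) step and the stochastic proximal loop are omitted, and the example matrix and step size are assumptions.

import numpy as np

def prox_l21(M, t):
    # Proximal operator of t * ||M||_{2,1}: row-wise soft thresholding.
    # Rows with small Euclidean norm are zeroed out (feature selection).
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    scale = np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))
    return scale * M

rng = np.random.default_rng(0)
M = rng.normal(size=(6, 6)) * np.array([[3, 3, 0.1, 3, 0.1, 3]]).T  # two weak rows
M_sparse = prox_l21(M, t=1.0)
print(np.linalg.norm(M_sparse, axis=1))   # weak rows collapse to exactly zero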
Article
Recognizing the samples belonging to one class in a heterogeneous data set is a very interesting but tough machine learning task. Some samples of the data set can be actual outliers or members of other classes for which training examples are lacking. In contrast to other kernel approaches in the literature, this work faces the problem by defining a one-class kernel machine that delivers the probability of a sample belonging to the support of the distribution and that can be trained efficiently by a hybrid sequential minimal optimization-expectation maximization (SMO-EM) algorithm. Owing to the analogy to the import vector machine and to the one-class approach, we named the method import vector domain description (IVDD). IVDD was tested on a toy 2-D data set to characterize its behavior, on a set of widely used UCI benchmark data sets, and, lastly, on a real-world outlier detection data set. All results were compared against closely related state-of-the-art methods such as one-class SVM and support vector domain description, showing that the algorithm is equally accurate, with the additional advantage of delivering a probability estimate for each sample. Finally, a few variants aimed at providing memory savings and/or computational speed-ups for big data analysis are briefly sketched.
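IVDD's distinguishing output, a probability that a sample lies in the support of the distribution, can be mimicked crudely by squashing a one-class SVM decision score through a sigmoid. This stand-in does not reproduce the SMO-EM training of the actual method; the dataset, kernel parameters, and sigmoid scaling are assumptions for illustration only.

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 2))

ocsvm = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

def support_probability(x, scale=2.0):
    # Sigmoid-calibrated decision score as a pseudo-probability (assumption).
    s = ocsvm.decision_function(x.reshape(1, -1))[0]
    return 1.0 / (1.0 + np.exp(-scale * s))

print(round(support_probability(np.array([0.0, 0.0])), 3))   # near the mode: high
print(round(support_probability(np.array([5.0, 5.0])), 3))   # far away: low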