International Journal of Machine Learning and Cybernetics (2021) 12:2185–2197
https://doi.org/10.1007/s13042-021-01300-0
ORIGINAL ARTICLE
Online domain description of big data based on hyperellipsoid models
Zengshuai Qiu1
Received: 13 June 2020 / Accepted: 8 March 2021 / Published online: 13 April 2021
© The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2021
Abstract
Big data is usually massive, diverse, time-varying, and high-dimensional. The focus of this paper is on the domain description
of big data, which is the basis for dealing with these characteristics. This paper has three main contributions. Firstly, a
hyperellipsoid model is proposed for the domain description of big data. The parameters of the hyperellipsoid model
can be adaptively adjusted according to the proposed objective function without relying on manual parameter selection,
which expands the application range of the model. Secondly, an improved FDPC algorithm is proposed to generate multiple
hyperellipsoid models to approximate the spatial distribution of big data, thus improving the accuracy of domain description.
Multiple hyperellipsoid models can not only greatly eliminate the spatial redundancy of the domain description based on
one hyperellipsoid model, but also provide a feasible method for describing complex spatial distribution. Thirdly, an online
domain description algorithm based on hyperellipsoid models is proposed, which improves the robustness of hyperellipsoid
models on time-varying data. The parallel processing flow of the algorithm is given. In the experiments, synthetic instances
and real-world datasets were used to test the performance of hyperellipsoid models. Compared with LOF, OneClassSVM,
SVDD and isolation forest, the proposed method is competitive and promising.
Keywords Big data · Hyperellipsoid model · Online domain description
1 Introduction
Due to the rapid development of mobile Internet, Internet
of things, cloud computing and other technologies, the era
of big data has arrived. Big data brings many challenges,
such as the representation and storage of big data, effective
information mining, big data retrieval, real-time analysis of
big data, and so on [1]. However, the domain description
of big data is the basis for solving the above problems. The
domain description of big data is different from the classification
and regression problems in the field of pattern recognition.
Its purpose is to determine the spatial distribution of
big data, which can be understood as a one-class problem from
the perspective of pattern recognition. The domain description
of big data is usually evaluated through outlier detection
and novelty detection.
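To make the one-class view concrete, the following is a minimal sketch of a single-hyperellipsoid domain description used for outlier detection. It is not the objective function or algorithm proposed in this paper: the sample mean, the sample covariance, the fixed quantile used as the boundary, and the names fit_hyperellipsoid, squared_distance and inside are all illustrative assumptions.

```python
import numpy as np


def squared_distance(X, center, inv_shape):
    """Squared Mahalanobis-type distance of each row of X to the center."""
    diff = X - center
    return np.einsum("ij,jk,ik->i", diff, inv_shape, diff)


def fit_hyperellipsoid(X, quantile=0.95):
    """Fit one hyperellipsoid to the rows of X.

    Assumption: center = sample mean, shape = sample covariance, and the
    boundary is the given quantile of the training distances.
    """
    center = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    # Small ridge keeps the covariance invertible (arbitrary illustrative value).
    cov = cov + 1e-9 * np.eye(X.shape[1])
    inv_shape = np.linalg.inv(cov)
    radius2 = np.quantile(squared_distance(X, center, inv_shape), quantile)
    return center, inv_shape, radius2


def inside(X, model):
    """True for rows of X that fall inside the fitted hyperellipsoid."""
    center, inv_shape, radius2 = model
    return squared_distance(X, center, inv_shape) <= radius2


# Usage: points outside the hyperellipsoid are flagged as outliers.
rng = np.random.default_rng(0)
train = rng.normal(size=(500, 2)) @ np.array([[2.0, 0.0], [0.5, 0.5]])
model = fit_hyperellipsoid(train)
queries = np.array([[0.0, 0.0], [50.0, 50.0]])
print(inside(queries, model))
```

A single hyperellipsoid of this kind covers empty regions when the data are not elliptically distributed, which is the spatial redundancy that motivates using multiple hyperellipsoid models.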
Different data domain description methods have been
proposed by many researchers. In [3], an evidence support
method is proposed to solve the problem of data domain
description under multivariate positive distribution. The
method can produce more contrasting outlier scores in
high-dimensional datasets; however, it is limited by the
joint probability density of high-dimensional data. In [4],
a concept called Local Projection Scoring (LPS) is introduced
to indicate the degree of deviation of an observation
from its neighborhood. The LPS is obtained from
the neighborhood information by a low-rank approximation
technique. Observations with a high LPS are likely
candidates for outliers. However, the proposed method
relies on manual parameter selection, which greatly limits
its scope of application.
In order to improve the robustness of the model under
different data distributions, a novel convolutional neural
network (CNN) was proposed in [5]. Since the method is
based on convolutional neural networks, the model is
strongly nonlinear and can describe the diversity
of data more accurately. An important drawback of this method is that
it takes a long time to train the model, and it is difficult
to ensure that the model can be updated in real time.
Considering the prior information of negative samples in
practical applications, a novel outlier detection approach
is presented in [6]. This method introduces a likelihood
* Zengshuai Qiu
qzengshuai@jiangnan.edu.cn
1 Jiangnan University, Wuxi, Jiangsu, China