A Survey on Deep Learning in Big Data
Mehdi Gheisari, Guojun Wang∗† , Md Zakirul Alam Bhuiyan†‡
School of Computer Science and Educational Software, Guangzhou University, Guangzhou, China, 510006
Department of Computer and Information Sciences, Fordham University, New York, NY, 10458
Correspondence to: csgjwang@gzhu.edu.cn
Abstract—Big Data refers to extremely large data sets that can be analyzed to find patterns and trends. Deep Learning is one technique that can be used for such analysis and can help us find abstract patterns in Big Data. By applying Deep Learning to Big Data, we can discover unknown and useful patterns that were previously out of reach. With the help of Deep Learning, AI is becoming smarter; one hypothesis in this regard is that the more data we have, the more abstract knowledge we can obtain. A concise survey of Big Data, Deep Learning, and the application of Deep Learning in Big Data is therefore needed. In this paper, we first provide a comprehensive survey of Big Data, comparing its methods, research problems, and trends. We then survey Deep Learning, its methods, and a comparison of frameworks and algorithms. Finally, we present the application of Deep Learning in Big Data, together with its challenges, open research problems, and future trends.
Index Terms—Big Data; Deep Learning; Deep Learning Challenges; Machine Learning; Deep Learning Methods; Big Data Challenges.
I. INTRODUCTION
A glance at the history of data generation since 1960 shows the following overall trend: 1960-1990, relational databases; 1990-2000, OLAP technology; 2000-2010, column-based data stores and cloud computing; and 2010-2016, Big Data applications. These days, knowledge plays a key role in achieving success, and many companies need more abstract knowledge. This need can be satisfied by combining two major domains: Big Data and Deep Learning. Every device can generate data, and the situation becomes even more demanding when devices are connected to one another and use each other's information. In other words, with the emergence of the Internet of Things we face huge amounts of data that must be stored and managed, which is one example of Big Data. In brief, with the advances in digital devices such as digital sensors, large amounts of data are generated at high speed, giving rise to an area named Big Data. Big Data is not only about data produced by sensors; it can also come from humans, texts, images, and so on. Big Data has a great impact on technologies and computing: we now have more data than current methods can deal with. In simple terms, Big Data means collecting, processing, and presenting the results of huge amounts of data that arrive at high speed and in a variety of formats. Traditional Machine Learning tools fall short when they face Big Data and try to solve Big Data problems [1]. Fig. 1 shows a comparison of ML techniques and their drawbacks.
For example, we can apply Deep Learning, a tool for extracting higher-level abstract knowledge, in most steps of Big Data problems; however, it preferably requires high volumes of data. To succeed in this competitive area, we need to find abstract patterns: the more patterns, the more success. We therefore need to understand the application of Deep Learning in Big Data and how to use it, which is the aim of this paper. The authors' contributions are:
1) A concise introduction to Big Data and a comparison of its methods.
2) A concise introduction to Deep Learning, with a comparison of algorithms and frameworks.
3) A description of the application of Deep Learning in Big Data.
Section 2 presents Big Data steps, challenges, and future trends. Section 3 covers Machine Learning and Deep Learning tools and frameworks. Section 4 describes the application of Deep Learning in Big Data, future trends, and open research problems. The conclusion and future work are presented in Section 5.
II. BIG DATA
The rise of Big Data has been driven by increased data storage capacity, increased computational power, and greater accessibility of large data volumes. Most current technologies for handling Big Data challenges focus on six main issues: Volume, Velocity, Variety, Veracity, Validity, and Volatility. The first, Volume, means we are facing amounts of data so large that most traditional algorithms cannot cope with them. For example, about 15 hours of video are uploaded to Facebook every minute, so that more than 50 TB are collected per day. Given the amount of data generated each day, we can predict the growth rate of data in the coming years [2]: data volume grows by about 40 percent per year, and around 1.2 ZB of data are produced annually. Large companies such as Twitter, Facebook, and Yahoo have recently begun tapping into the benefits of high-volume data. There is no predefined threshold for what counts as high volume; it is a relative measure that depends on the current situation of the enterprise [3]. The second challenge is Variety, which means we face a variety of file formats, including unstructured ones such as PDFs, emails, and audio; these data must be unified for further processing [4]. The third V is Velocity: data arrive at a striking rate that can easily overwhelm the system, which shows the need for real-time algorithms.
Fig. 1. Comparison between machine learning techniques.
The next two Vs, Veracity and Validity, are closely related: data must be as clean, trustworthy, and useful as possible, and the resulting data must be valid for the later processing phases. The more data sources and types there are, the more difficult it is to sustain trust [5]. The last V is Volatility, which concerns how long data should remain in the system in order to stay useful. McKinsey added Value as a seventh V, meaning the amount of hidden knowledge inside Big Data [6].
Open research problems can also be considered from another viewpoint, in terms of six parameters: Availability, Scalability, Integrity, Heterogeneity, Resource Optimization, and Velocity (related to stream processing). Labrinidis and Jagadish [7] described challenges and research problems concerning the Scalability and Heterogeneity aspects of Big Data management. Other parameters such as availability and integrity are covered in [8]. These parameters are defined as follows:
-Availability: Data should be accessible and available whenever and wherever a user requests them, even in the case of failures. Data analysis methods should provide availability for large amounts of data along with high-speed data streams [9].
-Scalability: Refers to whether a system can support growing amounts of data efficiently. Scalability has been an important issue, especially since 2011, for industrial applications that must scale within limited memory.
-Data Integrity: Refers to data accuracy. The situation becomes harder when different users with different privileges change data in the cloud. The cloud is in charge of managing the databases, so users have to obey the cloud's policy for data integrity [10].
-Heterogeneity: refers to different types of data such as
structured, unstructured and semi-structured [11].
-Resource Optimization: means using existing resources ef-
ficiently. A precise policy for resource optimization is needed
for guaranteeing distributed access to Big Data.
-Velocity: The speed of data creation and data analysis. The growing number of digital devices such as smartphones and tablets has increased the speed of data generation, so real-time analysis is obligatory. These requirements are highly application dependent and can differ from one application to another. From the point of view of processing steps, the Big Data area can be divided into three main phases: Big Data preprocessing, which performs preliminary actions such as data cleansing to prepare the data; Big Data storage, which concerns how data should be stored; and Big Data management, which concerns how data should be managed to obtain the best results, for example through clustering and classification [12].
A. Preprocessing
For better decision-making, we should provide quality data to the data analysis step; in other words, the quality of the data is critical to the quality of the decision, and data should be verified before decisions are made. Preprocessing means transforming inconsistent, incomplete, and error-prone data into an appropriate format for further analysis. In other words, data must be structured prior to the analysis stage [13]. For example, one database may use the field STUDENTID while another uses Student Identifier. Preprocessing prepares data for further processing and analysis. The steps for achieving this goal are described as follows:
1. Data cleansing: Removing inaccuracies, incompleteness,
and inconsistencies of data.
2. Data transformation: Performing additional processes such as aggregation or format conversion. This step has a striking influence on the steps that follow.
3. Data integration: It provides a single view over distributed
data from different sources.
4. Data transmission: Defines a method for transferring raw
data to storage system such as object storage, data center or
distributed cloud storage.
5. Data reduction: reducing the size of large databases for
real-time applications [14].
6. Data discretization: A notable step for the decision tree learning process. It converts attribute values into intervals so that the number of distinct values is reduced [10].
The following sub-sections present more detail about some
preprocessing steps:
1) Data Transmission: Data transmission is one step of the preprocessing phase; it means sending raw data to data storage. One proposed method in this area is sending data through a high-capacity pipe from the data source to the data center. This type of transmission requires knowledge of the network architecture and the transport protocol.
2) Data Cleansing: In simple terms, data cleansing means detecting incomplete and irrational data, which can then be modified or deleted in order to improve quality for the subsequent processing steps. Maletic and Marcus considered five stages for achieving clean data: 1) recognizing the types of errors; 2) finding error instances; 3) correcting error instances and error types; 4) updating the data input procedure to reduce future errors; 5) checking data properties such as limits, formats, and rationality. Data cleansing is an indispensable and principal part of the data analysis step. In brief, there are two main problems in the data cleansing step: i) data are imprecise; ii) data are incomplete (there are missing parts in the dataset); and we should address these problems as far as possible.
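As a minimal, hedged sketch of the imprecise/incomplete distinction (assuming pandas and a hypothetical sensor_readings.csv file with temperature, sensor_id, and timestamp columns; none of these names appear in the original paper), a cleansing pass might look like:

```python
import pandas as pd

# Hypothetical input file and column names, used only for illustration.
df = pd.read_csv("sensor_readings.csv")

# i) Imprecise data: drop readings outside a plausible physical range.
plausible = (df["temperature"] > -50) & (df["temperature"] < 60)
df = df[plausible]

# ii) Incomplete data: report missing values, then impute or drop them.
print("Missing values per column:\n", df.isna().sum())
df["temperature"] = df["temperature"].fillna(df["temperature"].mean())
df = df.dropna(subset=["sensor_id", "timestamp"])  # key fields must be present

df.to_csv("sensor_readings_clean.csv", index=False)
```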
3) Stream Processing: Processing stream data is a challenge that researchers face in the Big Data area. Stream requirements are completely different from those of traditional batch processing. In more detail, some emerging applications send large amounts of data to servers for real-time processing; one example is stock trading, where real-time processing is needed to reach better decisions. When large volumes of data are received by servers for processing, traditional centralized techniques cannot be used; there are systems for this purpose called Distributed Stream Processing Systems (DSPS) [15]. However, for lack of tools, most people still use traditional centralized databases to analyze such huge amounts of data. As mentioned earlier, there are many open research topics in stream processing, described as follows:
1-Data Mobility: The number of steps required to obtain the final result.
2-Data Division or Partitioning: The algorithms used for partitioning data; in brief, partitioning strategies should be chosen to achieve better data parallelism.
3-Data Availability: Techniques are needed that guarantee data availability in case of failures.
4-Query Processing: A query processor is needed that processes distributed data streams efficiently. One option is deterministic processing (always producing the same answer); another is non-deterministic processing (where the output depends on the current situation).
5-Data Storage: Another open research problem in Big Data is how to store data for future use.
6-Stream Imperfections: Techniques for dealing with data stream imperfections such as delayed or out-of-order messages.
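To contrast stream and batch processing at the smallest possible scale, the following plain-Python sketch (not a DSPS; all function and variable names are illustrative) processes records one at a time over a bounded sliding window, so a result is available as soon as each element arrives:

```python
from collections import deque

def sliding_window_average(stream, window_size=100):
    """Process an unbounded stream one element at a time, keeping only
    the last `window_size` values in memory."""
    window = deque(maxlen=window_size)
    for value in stream:
        window.append(value)
        # A decision can be made here as each element arrives,
        # without waiting for the whole dataset as in batch processing.
        yield sum(window) / len(window)

# A finite stand-in for an unbounded stock-price stream.
prices = [101.2, 101.5, 100.9, 102.3, 103.0]
for avg in sliding_window_average(prices, window_size=3):
    print(f"moving average: {avg:.2f}")
```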
B. Data Storage
Storing data at the petabyte scale is a challenge not only for researchers but also for internet organizations. Existing databases can hardly be adapted to Big Data usage. Although Cloud Computing represents a shift to a new computing paradigm, it cannot easily guarantee consistency when Big Data is stored in cloud storage. Discarding data is not a good option, since it may contribute to better decision-making. It is therefore critical to have a storage management system that provides sufficient data storage and optimized information retrieval [13].
1) Replication: Replication makes data available and accessible whenever a user asks for them. When data are variable, keeping every replicated copy accurate is much more challenging. The two factors to consider in replication are the placement of replicas and consistency. These factors play an even more important role in a Big Data environment, since managing such huge amounts of data is more difficult than usual [16].
2) Indexing: For large databases, it is not wise to retrieve and search stored data sequentially, as in an unordered array [17]. Indexing data improves the performance of the storage manager, so proposing a suitable indexing mechanism is challenging. There are three challenges in the indexing area: 1) multi-variable and multi-site searching; 2) performing different types of queries; 3) searching numerical data. The authors of [18] proposed a new method for keyword searching in a data stream environment; it uses a tree-based index structure and shares a single list of event indices to speed up query responses. Index load time is now a challenge, as is space consumption [19]. A Support Vector Machine indexing algorithm was introduced in [20] for video data with the aim of modeling human behavior. It changes the transition probability calculation mechanism and applies different states to determine the score of the input data. While it produces relatively accurate query results in minimal time, its learning process is time-consuming. A fuzzy-based method can be used for indexing moving objects, where index images are captured during the objects' movements. It provides a trade-off between query response time and index regeneration; the index supports variable data and is scalable, and experiments show it performs better than previous moving-object index techniques.
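As a simple illustration of why indexing beats sequential scanning, the sketch below builds an inverted index over a few toy documents; this is only the generic idea of precomputing keyword lookups, not the tree-based structure of [18] or the SVM and fuzzy indexes discussed above.

```python
from collections import defaultdict

documents = {
    1: "deep learning for big data analytics",
    2: "stream processing of sensor data",
    3: "indexing moving objects with index images",
}

# Build phase: map each keyword to the set of document ids containing it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        inverted_index[word].add(doc_id)

# Query phase: intersect posting lists instead of scanning every document.
def search(*keywords):
    postings = [inverted_index.get(word, set()) for word in keywords]
    return set.intersection(*postings) if postings else set()

print(search("data"))          # {1, 2}
print(search("deep", "data"))  # {1}
```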
C. Big Data Management and Processing
There are four types of data models in the Big Data area: (1) data that can be stored in relational form; (2) semi-structured data such as XML; (3) graph data, such as that used for social media; and (4) unstructured data such as free text and hand-written articles [21]. One important question is why traditional databases, such as relational databases, cannot be used for Big Data. One basic answer is that most relational databases are not designed to scale to thousands of loosely coupled machines [22]. Companies tended to leave traditional databases for two reasons: first, they are not scalable, and second, it is very expensive to use non-distributed traditional databases and add layers on top of them. Companies therefore decided to implement their own file systems (HDFS), distributed storage systems (Google Bigtable [23]), distributed programming frameworks (MapReduce), and even distributed database management systems (Apache Cassandra) [24]. Furthermore, Big Data management is a complex process, especially when data are gathered from heterogeneous sources to be used for decision-making and scoping out a strategy. The authors of [25] noted that about 75 percent of organizations apply at least one form of Big Data. The Big Data management area has brought new challenges in terms of data fusion complexity, data storage, analytical tools, and shortage of governance. Big Data management processes can also be grouped into two main categories, as reported in [26]: (1) Big Data science and (2) Big Data infrastructure. Big Data science studies techniques for data acquisition, conditioning, and evaluation, while Big Data infrastructure focuses on improving existing technologies for managing, analyzing, and visualizing data [27].
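To make the idea of a distributed programming framework such as MapReduce concrete, the toy sketch below expresses a word count as independent map tasks followed by a reduce step; a real framework distributes these phases across many machines, which this single-process example only imitates.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_task(chunk):
    # Map phase: each worker counts words in its own chunk of the input.
    return Counter(chunk.split())

def reduce_task(left, right):
    # Reduce phase: partial counts are merged into one global result.
    return left + right

if __name__ == "__main__":
    chunks = [
        "big data needs scalable storage",
        "deep learning needs big data",
    ]
    with Pool(processes=2) as pool:
        partial_counts = pool.map(map_task, chunks)
    total = reduce(reduce_task, partial_counts, Counter())
    print(total.most_common(3))
```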
Table 1 describes well-known methods for the storage, preprocessing, and processing steps of Big Data and compares them across the six features described above. For example, in an application where heterogeneity is less important than velocity, the SOHAC algorithm (the first row of the table) can be used as part of the method.
1) Classification and Clustering:
a) Classification: Unstructured data are stored in a distributed database such as SimpleDB, Cassandra, or HBase. After storage, the data are processed using data mining techniques. Mathematical methods can also be involved in the analysis step, such as classification (assigning objects to predefined groups) using decision trees, statistics, linear programming, or Neural Networks [28].
b) Clustering: Another method is clustering, which means creating groups of objects based on their meaningful attributes so that large data sets can be represented by a few groups; gathered data are summarized into groups in which data with similar features are close to each other. Clustering reduces the need for large storage resources by accommodating large amounts of data in limited storage space, which is still a challenging task [29]. One proposed solution for this challenge is Storage-Optimizing Hierarchical Agglomerative Clustering (SOHAC) [30], which introduces a new storage structure that requires less space than usual. A basic version of this algorithm existed previously, but it was limited when computing high-dimensional data. A single matrix containing the data is decomposed into sub-matrices; based on the features, the new matrices reduce the values stored in each row [27]. Useful data are recorded while redundant data are discarded in order to save space. The algorithm follows a hierarchical agglomerative strategy: at an abstract level, it first defines a cluster for each object, and clusters are then merged with the aim of forming the K clusters defined at the initialization of the process [31].
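The generic hierarchical agglomerative strategy (one cluster per object, then repeated merging until K clusters remain) can be sketched with scikit-learn as follows; note that this is the plain algorithm, not the storage-optimizing SOHAC variant of [30], and the data are toy values.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Toy data: each row is one object described by two features.
X = np.array([[1.0, 2.0], [1.1, 1.9], [0.9, 2.1],
              [8.0, 8.2], [7.9, 8.1], [8.2, 7.8]])

# Start with one cluster per object and merge until K clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="average")
labels = model.fit_predict(X)

print(labels)  # similar objects receive the same cluster label
```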
III. MACHINE LEARNING AND DEEP LEARNING
a) Machine Learning: In general, there are two types of learning. The first is shallow learning (classical Machine Learning, i.e., learning without explicit programming), such as decision trees and Support Vector Machines (SVMs); it is likely to fall short when we want to extract useful information from huge amounts of data, and even when it does not fail outright, its accuracy is often unsatisfactory.
An important question is: with all the different algorithms in ML, how can we choose the best one for our purpose? If we want to predict or forecast a target value, we should use supervised learning techniques, such as Neural Networks (NNs), for which the correct answers are known in advance. Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we try to predict a continuous output, meaning that we map input variables to some continuous function. In a classification problem, we instead try to predict discrete outputs [32]; in other words, we map input variables into discrete categories [33]. In brief, we first need to consider our goal: what are we trying to get out of the data? (Do we want the probability that it might rain tomorrow, or do we want to find groups of voters with similar interests?) What data do we have, or what data can we collect? If we are trying to predict or forecast a target value, we need to look into supervised learning; if not, unsupervised learning is the place to be. If we have chosen supervised learning, what is the target value? If it is a discrete value such as Yes/No, 1/2/3, A/B/C, or Red/Yellow/Black, then we should look into classification. If the target value can take on a range of values, say any value from 0.00 to 100.00 or from -999 to 999, then we need to look into regression. If we are not trying to predict a target value, we need to look into unsupervised learning. Are we trying to fit the data into some discrete groups? If so, and that is all we need, we should look into clustering. Do we need a numerical estimate of how strongly each item fits into its group? If so, we should probably look into a density estimation algorithm [4].
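This decision walk-through maps directly onto standard library choices; a minimal scikit-learn sketch with toy data (all values illustrative) is:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier    # discrete target -> classification
from sklearn.linear_model import LinearRegression  # continuous target -> regression
from sklearn.cluster import KMeans                 # no target -> clustering

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y_discrete = np.array([0, 0, 0, 1, 1, 1])                  # Yes/No-style labels
y_continuous = np.array([1.1, 2.1, 2.9, 10.2, 11.1, 11.8])

clf = DecisionTreeClassifier().fit(X, y_discrete)  # supervised classification
reg = LinearRegression().fit(X, y_continuous)      # supervised regression
clu = KMeans(n_clusters=2, n_init=10).fit(X)       # unsupervised clustering

print(clf.predict([[2.5]]), reg.predict([[2.5]]), clu.labels_)
```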
b) Deep Learning: New insights from data are needed, not only for top-level executives but also to provide better services to customers. One tool for reaching this aim is Deep Learning (DL). Deep Learning is a promising avenue of research into the automated extraction of complex features at a high level of abstraction. It is about learning multiple levels of representations and abstractions that help make sense of data such as images, sound, and text. One of the unique characteristics of deep learning algorithms is their ability to utilize unlabeled data during training [34].
Fig. 2. Categorizing Big Data management problems and current research.
TABLE I. A comparison of Big Data methods.
Intermediate or abstract representations can be discovered using unsupervised learning in a hierarchical fashion, one level at a time, with higher-level features defined in terms of lower-level features. This can improve classification results, and it has a major capability for the
generalization of learning. One example of DL is extracting invariant features of a person from an image. In simple terms, DL produces more meaningful knowledge from raw observations in all their variety. It generally learns data features in a greedy, layer-wise manner. In addition, it implements a layered, hierarchical learning architecture that leads to richer data representations. It stacks non-linear feature extractors with the aim of obtaining better machine learning results, such as a better classification model or invariance properties of the data representation. It has achieved outstanding results in a variety of applications such as speech recognition, computer vision, and Natural Language Processing (NLP), as well as predicting debate winners in elections based on public opinion, analyzing and predicting traffic jams faster during congestion, and finding new mechanisms that affect complex traffic systems. Most traditional machine learning algorithms cannot extract non-linear patterns; DL learns such patterns and also relationships beyond immediate neighbors. DL not only provides complex representations of data, but also makes machines more independent of humans [35]: it extracts useful information (representations, features) from unsupervised data without human intervention. In simple terms, DL consists of consecutive layers, each of which provides a local abstraction at its output. Each layer applies a non-linear transformation to its input, so the output of the last layer is a complex, abstract representation of the data. The more layers the data pass through, the more complex and abstract the representation we obtain; the final representation is a highly non-linear transformation of the data. DL does not try to extract predefined representations; instead, it tries to disentangle the factors of variation in the data to find invariant patterns. For learning compact representations, Deep Learning models are better than shallow learning models, and compact representations are efficient because they need less computation. DL makes it possible to learn non-linear representations of huge amounts of data [11].
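The layered, non-linear stacking described above can be sketched in a few lines; the hedged PyTorch example below stacks three fully connected layers, each applying a non-linear transformation to the representation produced by the layer beneath it (the layer sizes are arbitrary).

```python
import torch
import torch.nn as nn

# Each Linear + ReLU pair is one level of representation; deeper layers
# transform the (already transformed) output of the layers below them.
model = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),   # low-level features
    nn.Linear(256, 64), nn.ReLU(),    # intermediate abstractions
    nn.Linear(64, 10),                # high-level representation
)

x = torch.randn(32, 784)              # a batch of 32 raw input vectors
representation = model(x)
print(representation.shape)           # torch.Size([32, 10])
```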
IV. APPLICATION OF DEEP LEARNING IN BIG DATA
Looking at the application of Deep Learning in Big Data, DL deals mainly with two of the V's of Big Data: Volume and Variety. That is, DL is suited to analyzing and extracting useful knowledge both from huge amounts of data and from data collected from different sources [36]. One example of the application of Deep Learning in Big Data is Microsoft's speech recognition system (MAVIS), which uses DL to enable searching audio and video files by human voice and speech [20], [37]. Another use of DL in a Big Data environment is Google's image search service: DL is used to understand images so that they can be annotated and tagged, which is useful for image search engines, image retrieval, and even image indexing. When applying DL, we face several challenges that need to be addressed, such as:
1) Deep Learning for High Volumes of Data
1.1. The first question is whether the entire Big Data input should be used or not. In general, DL algorithms are applied to a portion of the available Big Data for training, and the rest of the data is used for extracting abstract representations; from another point of view, the question is how much data is needed for training.
1.2. Another open problem is domain adaptation, for applications in which the distribution of the training data differs from that of the test data. Viewed from another angle, the question is how to increase the generalization capacity of DL, i.e., how to generalize learnt patterns when there is a shift between the input domain and the target domain.
1.3. Another problem is defining criteria that ensure data representations provide useful semantic meaning. In simple terms, not every extracted data representation carries useful meaning, so criteria are needed for obtaining better data representations.
1.4. Most DL algorithms need a specified loss function, which requires knowing what we aim to extract; in a Big Data environment this is sometimes very difficult to define.
1.5. Most DL models do not provide analytical results that are easy to understand; because of their complexity, the procedure cannot be analyzed easily. This situation becomes worse in a Big Data environment.
1.6. Deep Learning seems suitable for the integration of heterogeneous data with multiple modalities due to its capability of learning abstract representations.
1.7. Last but not least, most DL algorithms need labeled data; if labeled data cannot be provided, they perform poorly. One possible solution is reinforcement learning, in which the system gathers data by itself and we only need to give rewards to the system.
2) Deep Learning for High Variety of Data
These days, data come in all kinds of formats from a variety of sources, probably with different distributions. For example, the rapidly growing multimedia data coming from the web and mobile devices include huge collections of images, videos, audio streams, graphics, animations, and unstructured text, each with different characteristics. There are open questions in this regard that need to be addressed, some of which are:
2.1. Given that different sources may offer conflicting information, how can we resolve the conflicts and fuse data from different sources effectively and efficiently?
2.2. Does system performance benefit from significantly enlarged modalities?
2.3. At which level are deep learning architectures appropriate for feature fusion of heterogeneous data?
3) Deep Learning for High Velocity of Data
Data are generated at extremely high speed and need to be processed quickly. One solution for learning from such high-velocity data is online learning, which can be combined with deep learning, although only limited progress in online deep learning has been made in recent years. Challenges in this area include:
3.1. Data are often non-stationary, i.e., data distributions change over time.
3.2. The big question is whether Big Data combined with deep architectures can benefit transfer learning.
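Returning to the online-learning idea mentioned above, a minimal illustration of the incremental-update pattern for high-velocity data (using a shallow scikit-learn model rather than a deep architecture, purely to show updates on arriving mini-batches) is:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()
classes = np.array([0, 1])

# Data arrive in mini-batches; the model is updated as each batch arrives
# instead of being retrained on the full (and possibly unbounded) dataset.
for _ in range(100):
    X_batch = np.random.randn(32, 5)
    y_batch = (X_batch[:, 0] > 0).astype(int)  # a toy, possibly drifting concept
    model.partial_fit(X_batch, y_batch, classes=classes)

print(model.predict(np.random.randn(3, 5)))
```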
V. FUTURE TRENDS AND OPEN RESEARCH PROBLEMS
Some future research topics can be categorized as follows:
A. Big Data Preprocessing
One challenge is data integration, which means sharing data among users efficiently, even though the definition of data integration is not very clear in most applications. For example, the authors of one study pointed out that, in order to use a single system for two different companies with different products, we need to find out how the combined data can operate in a single, integrated environment. They believed that data integration is much harder than Artificial Intelligence, so the two challenging research topics in this field are generating and facilitating integration tools. The quality of data is not predetermined: only after using the data can we assess its quality, and the higher the data quality, the better the results. Data providers demand error-free data, but it is practically impossible to achieve the best data quality with a single cleaning method; to obtain quality data, different cleansing methods must be combined to meet the organization's needs. With the increasing volume and velocity of data, the reputation of collected data depends on the quality and availability of the information it provides [38]. Traditional methods, however, were designed to provide equal access to resources: in the traditional era, network administrators had to inspect network traffic directly, whereas with the emergence of Big Data, analysts must analyze the data without going through every detail.
B. Big Data Analytics
Big Data analytics relates to database searching, mining, and analysis. By using data mining in the Big Data area, a business can enhance its services. Big Data mining is a challenge because of data complexity and scalability. The two salient goals of data analysis are, first, detecting relationships between data features and, second, predicting future instances; in other words, it means searching a vast space in order to offer guidelines to users. Steed and Ricchiuto proposed a visual analytics system named the Exploratory Data Analysis Environment (EDAE) for analyzing complex earth-system simulations. Previously, data had mostly been analyzed by trial-and-error methods, which are very difficult in complicated situations with vast amounts of heterogeneous data [39]. The authors of [40] argued that obtaining useful information from large amounts of data needs scalable algorithms, and additional applications and cloud infrastructures are needed to exploit data parallelism. The order of many algorithms increases exponentially with data size [41]. In simple terms, there are four types of analysis:
Descriptive Analysis: What is happening in the data now.
Predictive Analysis: What will happen in the future.
Discovery Analysis: Discovering an existing rule from existing data.
Perceptive Analysis: What we should do in the future based on current data.
C. Semantic Indexing
Another use of DL, and an open challenge, is semantic indexing with the aim of better information retrieval. Because of the massive amounts of data and limited storage capacity, it is preferable to store semantic indexes rather than raw data bytes. DL generates high-level data representations, and these abstract representations can be used to build better indexing methods [42].
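A hedged sketch of this idea, in which random dense vectors stand in for the representations a trained deep model would produce, stores those semantic representations and answers queries by nearest-neighbour search instead of scanning raw bytes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder for representations produced by a trained deep model:
# 1000 documents, each mapped to a 128-dimensional semantic vector.
doc_embeddings = rng.normal(size=(1000, 128))
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)

def semantic_search(query_embedding, top_k=5):
    """Return indices of the documents closest to the query in embedding space."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = doc_embeddings @ q   # cosine similarity against every stored entry
    return np.argsort(scores)[::-1][:top_k]

query = rng.normal(size=128)
print(semantic_search(query))
```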
D. Data Governance
Data governance is another important core of Big Data management; it means defining rules and laws for data and exercising control over it. For example, if Big Data is to be stored in the cloud, policies are needed covering which types of data should be stored, how quickly data should be accessible, and rules such as transparency, integrity, checks and balances, and, last but not least, change management. Open research topics in this field include the best decision-making mechanisms and the reduction of operational friction [43].
E. Big Data Integration
It means collecting data from multiple sources and storing
them with the aim of providing a unified view. Integrating dif-
ferent types of data is a complex issue that can be even worse
when we have different applications [39]. Many open research
topics are associated with data integration like real-time data
access, the performance of the system, and overlapping of the
same data [44].
VI. CONCLUSION
Nowadays, it is necessary to grapple with Big Data with the aim of extracting better abstract knowledge. One technique applicable to this aim is Deep Learning (hierarchical learning), which provides higher-level data abstraction. Deep Learning can also be used in the Big Data environment and has its own advantages and disadvantages; in general, the more data, the higher-level the abstraction we obtain, but many challenges remain. This paper first surveys Big Data and its steps, then Machine Learning and Deep Learning, and finally the application of Deep Learning in Big Data, together with future trends and open research problems. In future work, we plan to examine the above areas in more detail, investigate Big Data problems in industry, and survey Big Data security and privacy issues; we also intend to address other problems such as semantic indexing and data tagging.
ACKNOWLEDGMENT
This work was supported in part by the National Natural
Science Foundation of China under Grant Numbers 61632009,
61472451, and 61402543 and in part by the High Level Talents
Program of Higher Education in Guangdong Province under
Grant 2016ZJ01.
REFERENCES
[1] Yoshua Bengio et al. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.
[2] Jinchuan Chen, Yueguo Chen, Xiaoyong Du, Cuiping Li, Jiaheng Lu,
Suyun Zhao, and Xuan Zhou. Big data challenge: a data management
perspective. Frontiers of Computer Science, 7(2):157–164, 2013.
[3] Dylan Maltby. Big data analytics. In 74th Annual Meeting of the
Association for Information Science and Technology (ASIST), pages 1–6,
2011.
[4] Han Hu, Yonggang Wen, Tat-Seng Chua, and Xuelong Li. Toward
scalable systems for big data analytics: A technology tutorial. IEEE
Access, 2:652–687, 2014.
[5] Aisha Siddiqa, Ibrahim Abaker Targio Hashem, Ibrar Yaqoob, Mohsen
Marjani, Shahabuddin Shamshirband, Abdullah Gani, and Fariza
Nasaruddin. A survey of big data management: taxonomy and state-of-
the-art. Journal of Network and Computer Applications, 71:151–166,
2016.
[6] Min Chen, Shiwen Mao, and Yunhao Liu. Big data: A survey. Mobile
Networks and Applications, 19(2):171–209, 2014.
[7] Alexandros Labrinidis and Hosagrahar V Jagadish. Challenges and
opportunities with big data. Proceedings of the VLDB Endowment,
5(12):2032–2033, 2012.
[8] Chang Liu, Chi Yang, Xuyun Zhang, and Jinjun Chen. External integrity
verification for outsourced big data in cloud and iot: A big picture.
Future Generation Computer Systems, 49:58–67, 2015.
[9] Katina Michael and Keith W Miller. Big data: New opportunities and
new challenges [guest editors’ introduction]. Computer, 46(6):22–24,
2013.
[10] Ian H Witten, Eibe Frank, Mark A Hall, and Christopher J Pal. Data
Mining: Practical machine learning tools and techniques. Morgan
Kaufmann, 2016.
[11] Xue-Wen Chen and Xiaotong Lin. Big data deep learning: challenges
and perspectives. IEEE Access, 2:514–525, 2014.
[12] CL Philip Chen and Chun-Yang Zhang. Data-intensive applications,
challenges, techniques and technologies: A survey on big data. Infor-
mation Sciences, 275:314–347, 2014.
[13] Andrew McAfee, Erik Brynjolfsson, Thomas H Davenport, DJ Patil, and
Dominic Barton. Big data. The management revolution. Harvard Bus
Rev, 90(10):61–67, 2012.
[14] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimen-
sionality of data with neural networks. science, 313(5786):504–507,
2006.
[15] Anton Riabov and Zhen Liu. Scalable planning for distributed stream
processing systems. In ICAPS, pages 31–41, 2006.
[16] Jürgen Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[17] Jens Dittrich, Lukas Blunschi, and Marcos Antonio Vaz Salles. Movies:
indexing moving objects by shooting index images. Geoinformatica,
15(4):727–767, 2011.
[18] Guoliang Li, Beng Chin Ooi, Jianhua Feng, Jianyong Wang, and Lizhu
Zhou. Ease: an effective 3-in-1 keyword search method for unstructured,
semi-structured and structured data. In Proceedings of the 2008 ACM
SIGMOD international conference on Management of data, pages 903–
914. ACM, 2008.
[19] Li Deng, Dong Yu, et al. Deep learning: methods and applications. Foundations and Trends in Signal Processing, 7(3–4):197–387, 2014.
[20] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine
Manzagol. Extracting and composing robust features with denoising
autoencoders. In Proceedings of the 25th international conference on
Machine learning, pages 1096–1103. ACM, 2008.
[21] Divyakant Agrawal, Amr El Abbadi, Shyam Antony, and Sudipto
Das. Data management challenges in cloud computing infrastructures.
In International Workshop on Databases in Networked Information
Systems, pages 1–10. Springer, 2010.
[22] GNU Octave. [Online]. Available: http://www.gnu.org/software/octave, 2012.
[23] Xiao Chen. Google big table. 2010.
[24] Da-Wei Sun, Gui-Ran Chang, Shang Gao, Li-Zhong Jin, and Xing-Wei
Wang. Modeling a dynamic data replication strategy to increase system
availability in cloud computing environments. Journal of computer
science and technology, 27(2):256–272, 2012.
[25] Daniel E O’Leary. Artificial intelligence and big data. IEEE Intelligent
Systems, 28(2):96–99, 2013.
[26] Vivien Marx. Biology: The big challenges of big data. Nature,
498(7453):255–260, 2013.
[27] P. Porkar. Sensor networks challenges. In 11th International Conference on Data Networks (DNCOCO '12), 7–9 September 2012.
[28] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Domain adaptation
for large-scale sentiment classification: A deep learning approach. In
Proceedings of the 28th international conference on machine learning
(ICML-11), pages 513–520, 2011.
[29] Amir Shahabi. Clustering algorithm in wireless sensor networks. In Sustainable Interdependent Networks: From Theory to Application. Springer, accepted for publication (2018).
[30] Krisztian Buza, Gábor I. Nagy, and Alexandros Nanopoulos. Storage-optimizing clustering algorithms for high-dimensional tick data. Expert Systems with Applications, 41:4148–4157, 2014.
[31] Mehdi Jafari, Jing Wang, Yongrui Qin, Mehdi Gheisari, Amir Shahab
Shahabi, and Xiaohui Tao. Automatic text summarization using fuzzy
inference. In Automation and Computing (ICAC), 2016 22nd Interna-
tional Conference on, pages 256–260. IEEE, 2016.
[32] Dervis Karaboga and Celal Ozturk. A novel clustering approach: Artifi-
cial bee colony (abc) algorithm. Applied soft computing, 11(1):652–657,
2011.
[33] Mehdi Gheisari, Ali Akbar Movassagh, Yongrui Qin, Jianming Yong,
Xiaohui Tao, Ji Zhang, and Haifeng Shen. Nsssd: A new semantic
hierarchical storage for sensor data. In Computer Supported Cooperative
Work in Design (CSCWD), 2016 IEEE 20th International Conference on,
pages 174–179. IEEE, 2016.
[34] Surajit Chaudhuri, Umeshwar Dayal, and Vivek Narasayya. An overview
of business intelligence technology. Communications of the ACM,
54(8):88–98, 2011.
[35] T. Tran, M. Rahman, M. Z. A. Bhuiyan, A. Kubota, S. Kiyomoto, and
K. Omote. Optimizing share size in efficient and robust secret sharing
scheme for big data. IEEE Transactions on Big Data, PP(99):1–1, 2017.
[36] Richard S Sutton and Andrew G Barto. Introduction to reinforcement
learning, volume 135. MIT Press Cambridge, 1998.
[37] Steve Lohr. The age of big data. New York Times, 11(2012), 2012.
[38] M. Z. A. Bhuiyan and J. Wu. Event detection through differential pattern
mining in cyber-physical systems. Jun 2017.
[39] W. Yu, J. Li, M. Z. A. Bhuiyan, R. Zhang, and J. Huai. Ring: Real-
time emerging anomaly monitoring system over text streams. IEEE
Transactions on Big Data, PP(99):1–1, 2017.
[40] Katherine G Herbert and Jason TL Wang. Biological data cleaning: a
case study. International Journal of Information Quality, 1(1):60–82,
2007.
[41] Maryam M Najafabadi, Flavio Villanustre, Taghi M Khoshgoftaar,
Naeem Seliya, Randall Wald, and Edin Muharemagic. Deep learning
applications and challenges in big data analytics. Journal of Big Data,
2(1):1, 2015.
[42] Todd A Letsche and Michael W Berry. Large-scale information retrieval
with latent semantic indexing. Information sciences, 100(1-4):105–137,
1997.
[43] Vijay Khatri and Carol V Brown. Designing data governance. Commu-
nications of the ACM, 53(1):148–152, 2010.
[44] Xin Luna Dong and Divesh Srivastava. Big data integration. In Data
Engineering (ICDE), 2013 IEEE 29th International Conference on,
pages 1245–1248. IEEE, 2013.