Cite This Paper: W. J. Alzyadat, A. Alhroob, I. H. Almukahel, and R. Atan (2019), "Fuzzy Map Approach for Accruing Velocity of Big Data", COMPUSOFT, An International Journal of Advanced Computer Technology, 8(4).
This work is licensed under a Creative Commons Attribution 4.0 International License.
FUZZY MAP APPROACH FOR ACCRUING VELOCITY OF BIG DATA
Wael Jumah Alzyadat1*, Aysh AlHroob2, Ikhlas Hassan Almukahel3, Rodziah Atan4
1,2,3Department of Software Engineering, Faculty of Information Technology, Isra University, Amman, Jordan
4Department of Software Engineering and Information Systems, Faculty of Computer Science and Information
Technology, University Putra Malaysia, Selangor, Malaysia.
waael.alzyadat@iu.edu.jo
Abstract: Each characteristic of Big Data (volume, velocity, variety, and value) presents a unique challenge to Big Data analytics. The velocity characteristic in particular raises time-complexity challenges for processing across dissimilar frameworks, ranging from batch-oriented, MapReduce-based frameworks to real-time and stream-processing frameworks such as Spark and Storm. We propose an approach that combines a fuzzy logic controller with the MapReduce framework to handle vehicle analysis by comparing the driving data with the resulting vehicle trajectory. The proposed approach is evaluated by comparing the raw data from the original source with the dataset produced by the approach, using ANOVA to estimate and analyze the differences. The comparison before and after applying the approach shows a positive impact at several stages in terms of dataset volume, variances, and P-value, which is significant and contributes to two aspects: accuracy and performance.
Keywords: Big Data; Velocity; Fuzzy Logic Controller; MapReduce.
I. INTRODUCTION
Big Data has become a hot topic in both academia and
industry. It is defined as datasets whose size is beyond the
ability of typical database software tools to capture, store,
manage and analyze [1]. This technical complication is due
to the characteristics of big data, which mainly include the 4Vs (Volume, Variety, Value, and Velocity) [2]. Velocity refers to the rate at which data are generated, processed, and analyzed. The proliferation of digital devices such as smartphones and sensors has led to an unprecedented rate of data creation, driving a growing need for real-time analysis and evidence-based planning [3]. For many applications, the speed of data processing is even more important than volume: real-time information enables more agile decision making. The usefulness of big data techniques lies in their power to optimize outcomes, improve processing efficiency, and reduce costs [4].
The velocity of data in driving has increased due to
improved technologies, increased processing power, and
speed of monitoring and processing. Many authors have proposed approaches and techniques to solve the big data problem, including a focus on fuzzy clustering algorithms applied to clustering approaches for accuracy. In [5], the authors investigate the parallelization and scalability of a common and effective fuzzy clustering algorithm, the Fuzzy C-Means (FCM) algorithm; the algorithm is parallelized using the MapReduce paradigm, outlining how the Map and Reduce primitives are implemented [5].
Data filtering is done using ANOVA, which is used to evaluate the differences between datasets and can be applied to the recorded process. The datasets need not be equal in size and can be small or infinitely large sets of numbers. The advantage of using the filtering
method is that it reduces statistical complexity and provides an independent criterion for feature evaluation.
This article focuses on fuzzy logic, a technique in which computing is based upon "degrees of truth", used here to handle the random and imbalanced relations of the MapReduce mapping function (peer to peer). The relations are used at runtime: fuzzy logic determines the path while MapReduce speeds up the process. This is important for the velocity of big data and ensures that a qualified output value is achieved [6].
MapReduce is the programming concept underlying Hadoop. It is a programming paradigm that allows for massive scalability across hundreds or thousands of servers in a Hadoop cluster. The MapReduce concept is fairly simple to understand because it consists of a mapper function and a reducer function.
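To make the pattern concrete, the minimal sketch below (illustrative only, not the authors' implementation) expresses a mapper and a reducer as plain R functions: the mapper emits key-value pairs and the reducer aggregates the values that share a key, which is the pattern that frameworks such as HadoopStreaming distribute across servers. The record fields (vehicle_id, speed) are hypothetical.

# Minimal MapReduce pattern in plain R (illustrative sketch only).
# The mapper emits (key, value) pairs; the reducer aggregates values per key.
mapper <- function(records) {
  lapply(records, function(r) list(key = r$vehicle_id, value = r$speed))
}

reducer <- function(pairs) {
  keys <- sapply(pairs, `[[`, "key")
  vals <- sapply(pairs, `[[`, "value")
  tapply(vals, keys, mean)   # aggregate: mean speed per vehicle id
}

# Hypothetical records:
records <- list(list(vehicle_id = "V1", speed = 55),
                list(vehicle_id = "V2", speed = 40),
                list(vehicle_id = "V1", speed = 65))
reducer(mapper(records))     # named vector of mean speed keyed by vehicle id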
A number of previous theoretical and technological works have addressed the accuracy and scalability aspects of the Big Data challenge [7], as shown in Table I. One vital direction for handling the Big Data challenge focuses on the clustering concept; fuzzy set techniques can overcome the issues of accuracy and scalability.
Table I Comparison among Fuzzy Techniques

Authors | Nature of problem | Advantages of using fuzzy
del Río et al. (2017) [5] | Classification | Descriptive model with good accuracy
Mahmud et al. (2016) [8] | Health-shocks prediction | Provides interpretable linguistic rules to explain the causal factors
He, Q., et al. (2015) [9] | Parallel sampling | Represents granules by a fuzzy boundary; the algorithm maintains an identical distribution
In spite of effective fuzzy set techniques such as the Fuzzy C-Means (FCM) algorithm, there is still debate as to what types of uncertainty are captured by fuzzy logic and what the best practice is for applying fuzzy methods on parallel platforms. The MapReduce paradigm is the consensus approach to parallelization [5] and consists of Mapper and Reduce functions. The mapper function deals with sources from the platform distributed across servers and allocates them to the Reduce functions, in which the filtering operation resides. The advantage of using the filtering method is that it reduces statistical complexity and provides an independent criterion for feature evaluation [6].
The remainder of this paper is organized as follows. Section 2 presents the concept and components of the proposed Fuzzy Map approach for accruing the velocity of Big Data. Section 3 describes the implementation and the experimental analysis. Section 4 presents the experimental results. Finally, Section 5 summarizes the paper.
II. RESEARCH METHOD
In this section, we present a fuzzy map approach for accruing big data velocity through a filter technique. The approach consists of six components. The first component is data collection and preprocessing. The second is data fuzzification, which converts the data according to membership functions built for each attribute. The third extracts fuzzy if-then rules according to the dataset. The fourth component involves the map function, which is important for acquiring relations among the dataset, the separated attributes, and the content. The fifth component is defuzzification, which produces the new data as records. The last component applies a filter technique to calculate the matching percentage between the raw and the new dataset. Figure 1 illustrates the data flow through the approach.
Figure 1: Hybrid approach of fuzzy logic controller and MapReduce
The components of the Fuzzy Map approach for accruing the velocity of Big Data interact as follows:
A. Component One: Collecting
Component one collects and preprocesses the data. The data used are vehicle trajectory and kinematics data, collected as detailed vehicle trajectories on southbound US 101 and Lankershim Boulevard in Los Angeles, CA, eastbound I-80 in Emeryville, CA, and Peachtree Street in Atlanta, Georgia. The data were collected through a network of synchronized digital video cameras and a customized software application developed for the NGSIM program; the data source is https://www.kaggle.com/zhaopengyun/driving-data/home
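As an illustrative sketch only (the file name and the column handling below are assumptions, not the authors' code), the collection and preprocessing step can be expressed with the R packages listed later in Table II:

# Sketch of data collection and preprocessing (file name is hypothetical).
library(readr)
library(dplyr)
library(tidyr)

raw <- read_csv("driving-data.csv")   # assumed local copy of the Kaggle dataset
dim(raw)                              # expected on the order of 17897 records x 82 attributes

clean <- raw %>%
  drop_na() %>%                       # remove records with missing values (tidyr)
  select(where(is.numeric))           # keep numeric attributes for fuzzification (dplyr)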
B. Component Two: Fuzzification
The fuzzification component converts numeric data to categorical data according to the membership function of each attribute, turning it into fuzzy categories matching the dataset. In this case, fuzzy regions refer to intervals for the linguistic terms; therefore, we can construct triangular membership functions.
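A minimal sketch of a triangular membership function in R follows, with illustrative interval parameters for the speed attribute (the actual intervals used in the paper are not reported):

# Triangular membership: left foot a, peak b, right foot c (parameters assumed).
trimf <- function(x, a, b, c) pmax(pmin((x - a) / (b - a), (c - x) / (c - b)), 0)

speed_slow  <- function(x) trimf(x, 0, 20, 50)
speed_limit <- function(x) trimf(x, 30, 60, 90)
speed_fast  <- function(x) trimf(x, 70, 110, 150)

speed_limit(55)   # degree to which a speed of 55 is "limit" (about 0.83)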
C. Component Three: Fuzzy
Fuzzy if-then rules are built after calculating the degree of each linguistic value, which is determined as the linguistic term whose membership function is maximal in this case. The process is then repeated for all instances in the data to construct fuzzy rules covering the data.
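A sketch of this step, using the same assumed speed intervals as above: for each instance, the linguistic term with the highest membership degree is selected.

# Pick the linguistic term whose membership degree is maximal (intervals assumed).
trimf <- function(x, a, b, c) pmax(pmin((x - a) / (b - a), (c - x) / (c - b)), 0)

fuzzify_speed <- function(x) {
  degrees <- c(slow  = trimf(x, 0, 20, 50),
               limit = trimf(x, 30, 60, 90),
               fast  = trimf(x, 70, 110, 150))
  names(which.max(degrees))
}

sapply(c(15, 55, 120), fuzzify_speed)   # "slow" "limit" "fast"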
D. Component Four: Map Function
The data are divided by applying the map function, which takes the output of the fuzzy controller (if-then rules) to reduce the random grouping generated by MapReduce. The output of this component is a key-value pair describing the degree to which the fuzzy rule maps to the data content.
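A sketch of the resulting key-value view (the keys and degrees below are hypothetical): each record is emitted with the fired rule outcome as key and its degree as value, and the reduce step can then drop the content associated with the bad key.

# Emitted (key, value) pairs: key = fired rule outcome, value = rule degree (hypothetical).
emitted <- list(list(key = "ok",      value = 0.83),
                list(key = "bad",     value = 0.75),
                list(key = "perfect", value = 0.60),
                list(key = "ok",      value = 0.91))

keys <- sapply(emitted, `[[`, "key")
vals <- sapply(emitted, `[[`, "value")

grouped <- tapply(vals, keys, mean)     # reduce: aggregate degree per key
grouped[names(grouped) != "bad"]        # drop content associated with the bad key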
E. Component Five: Defuzzification
Defuzzification is responsible for producing a quantifiable result, given fuzzy sets and their membership degrees. It is the process that maps a fuzzy set to a crisp set and is typically needed in fuzzy control systems.
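One common way to perform this mapping is centroid defuzzification; the sketch below is illustrative and is not necessarily the exact method used by the authors.

# Centroid defuzzification: map membership degrees over a discretized universe
# to one crisp value (a triangular fuzzy set around 60 is assumed for illustration).
universe   <- seq(0, 150, by = 1)
membership <- pmax(1 - abs(universe - 60) / 30, 0)
crisp <- sum(universe * membership) / sum(membership)
crisp   # centroid of the fuzzy set, i.e. the crisp output (60 here)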
F. Component Six: Filtering
The filtering technique applies ANOVA, where the computation is based on the distance of differentiation. The output is a match or a mismatch between the old and the new dataset.
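A minimal sketch of this comparison in R (the vectors below are hypothetical; in the experiment the matched variables come from the raw and the new dataset): a single-factor ANOVA reports the DF, F-ratio, and P-value used to judge match or mismatch.

# Single-factor ANOVA sketch comparing one variable before and after the approach.
raw_speed <- c(52, 61, 48, 90, 73, 55)   # hypothetical values from the raw dataset
new_speed <- c(54, 60, 50, 70, 68, 57)   # hypothetical values from the new dataset

values <- c(raw_speed, new_speed)
group  <- factor(rep(c("raw", "new"), each = length(raw_speed)))

fit <- aov(values ~ group)   # single-factor ANOVA
summary(fit)                 # reports DF, SS, MS, F and the P-value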
In summary, the components of the Fuzzy Map approach for accruing the velocity of Big Data transform the original dataset into one with modified content governed by two factors: attributes and index. The data collection and preprocessing component fetches the dataset from its sources and keeps the original structure and index, which are later used by the filter component to find matches and mismatches via ANOVA. The fuzzification component treats the attributes via membership functions to acquire the linguistic terms. Once the second component is done, the fuzzy rules interact with the categorical data to calculate the degree of each linguistic value, which feeds the fourth component. The defuzzification component combines the fuzzy sets and membership degrees into a crisp set and forwards it to the filter component, which computes the distance via ANOVA.
III. EXPERIMENT AND ANALYSIS
The Fuzzy Map approach for accruing the velocity of Big Data is implemented using the R language and Microsoft Excel 2016. The R packages used are listed in Table II.
Table II R Packages Used in the Experiment

R Package | Purpose
readr (Read Rectangular Text Data) [10] | Read rectangular data
dplyr [11] | Data manipulation
tidyr [12] | Work with attributes (columns) and rows (observations)
caret (Classification and Regression Training) [13] | Preprocess the dataset
HadoopStreaming [14] | Provides a framework for writing Map/Reduce
HiveR [15] | Function map, manager, and plots
FuzzyR [16] | Design and simulate fuzzy logic
The data collection and preprocessing step fetches the raw dataset; it includes 82 attributes and 17897 records of different types, namely string, numeric, and Boolean. Furthermore, the quintile description of each attribute is used in various ways and can be observed through the mean and standard deviation. After the variables are defined, the next step is defining the fuzzy rules; the vehicle report is the final state. The fuzzy rules are the links between the "non-final" variables distance travelled, speed, road status, and vehicle, as shown in Table III.
Table III Variables Used in the Fuzzy Rules

Variable | Values
Vehicle | bad, ok, and perfect
Distance travelled | near, middle, and far
Speed | slow, limit, and fast
Road status | crowded, normal, and open
The most effective rules are six in number; Figure 2 shows these rules. The rule structure must include variables and values to control the direction of the interval data as well as to define the map through membership. The longest rules, which involve all variables, consist of four rules; they differ in their values and operators (and, or). The shortest rule involves only two variables, speed and vehicle.
Figure 2: Effective Rules to Vehicle Analysis
The roadmap of the rule set in the fuzzy controller includes six pathways, as shown in Figure 2. All rules involve the speed variable: the fast value is used in three rules, the limit value appears twice, and the slow value is used only once. The road status variable is covered in five rules, of which three use the normal value and one each uses open and crowded. Meanwhile, the vehicle variable appears in all rules, with the value ok three times, bad twice, and perfect once.
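The exact rule texts appear only in Figure 2; the hypothetical rule set below merely mirrors the shape described above (four rules over all variables, one assumed to cover three variables, and the shortest rule over speed and vehicle), written as plain if-then strings for illustration.

# Hypothetical rules mirroring the structure described in the text (Figure 2
# holds the actual rule set); illustrative only.
rules <- c(
  "IF speed is fast  AND distance is far    AND road is normal  THEN vehicle is ok",
  "IF speed is fast  OR  distance is near   AND road is crowded THEN vehicle is bad",
  "IF speed is limit AND distance is middle AND road is normal  THEN vehicle is ok",
  "IF speed is limit AND distance is far    AND road is open    THEN vehicle is perfect",
  "IF speed is fast  AND road is normal     THEN vehicle is bad",
  "IF speed is slow  THEN vehicle is ok"
)
length(rules)   # six pathways, as described above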
Defuzzification is the final process in the approach: by removing all values associated with the bad key, the raw dataset is reduced in volume, which directly affects the velocity of big data, and the data are then converted to crisp values. ANOVA is then employed to analyze the variance of both the raw and the new dataset by first determining the sum, average, and variance of the four variables, as shown in Table IV.
Table IV Comparison among Sources of Variation

Source of Variation | SS | DF | MS | F | P-value
Raw Data | 156595 | 1467554 | 17663554 | 1.46E-09 | 2.76
New Data | 169329 | 480381 | 550180 | 0.9 | 1
The Table above is interpreted as follows:
I. The Degrees of Freedom (DF) are computed using two mechanisms, vertically by the number of columns and horizontally by the number of rows, as in equation (1):
DF = (columns − 1) × (rows − 1)  (1)
e.g., (2 × 1) = 2.
II. The F-ratio determines how far the data are scattered from the mean of the raw data.
III. The P-value refers to the probability value.
IV. RESULTS
The results show evidence that the fuzzy logic controller plays an important role in enhancing the performance of Big Data in terms of velocity by reducing the random relations resulting from MapReduce. Table V shows the main differences between the two datasets, before and after applying the Fuzzy Map approach.
Table V: Difference between the Two Datasets

Step | Raw Data | New Data (after applying the approach)
Preprocessing | 17897 rows × 82 attributes | 16867 rows × 33 attributes (all missing values removed)
Fuzzification | 17897 rows × 82 attributes | 14557 rows × 33 attributes (content associated with the bad key removed)
Filtering | 17897 rows × 82 attributes | Single-factor ANOVA applied
We applied ANOVA to analyze the variance of the raw dataset; the result is shown in Figure 3.
Figure 3: ANOVA Applied to Raw Dataset
Applying ANOVA to the new data shows that the F-ratio and the variance values of the variables decreased, which is evidence that the Fuzzy Map approach removes unnecessary relations and focuses on the effective relations, as shown in Figure 4.
Figure 4: ANOVA Applied to the New Dataset
Based on the previous results, we compare the variances of the raw and the new data, as shown in Table VI.
Table VI Comparing the Variances between Raw and New Datasets

Variable | Raw variance | New variance | Difference
Distance | 827682.5 | 620768.5 | 206914
Travelled time | 897190.1 | 767453.1 | 129737
Road | 6132315 | 4136715 | 1995600
Speed | 3299.542 | 1355.542 | 1944
The distance variance is 827682.5 in the raw dataset and 620768.5 in the new dataset; the difference of 206914 means the new dataset reduces the distance as a result of removing all content of the bad key. The travelled-time variance is 897190.1 in the raw dataset and 767453.1 in the new dataset; the difference of 129737 means the new dataset performs better than the raw one on travelled time. The road variance is 6132315 in the raw dataset and 4136715 in the new dataset; the difference of 1995600 means the new dataset chose the shorter road as a result of the fuzzy rule controller. The speed variance is 3299.542 in the raw dataset and 1355.542 in the new dataset; the difference of 1944 means the new dataset optimizes speed toward ideal driving behavior.
Figure 5 Variance Value Difference in Raw and New Dataset
Figure 5 illustrates that the variance of the new dataset is lower, which is more efficient in reducing the processing cost in autonomous vehicles. This reduction is due to the convergent relations in the data after applying the map to the fuel used and the distance; the travelling time is also reduced.
V. CONCLUSION
This research addresses the velocity challenge of the big data era by combining a big data technique with an artificial intelligence technique. The approach handles velocity and
assures the performance of big data analysis by reducing the processing time. This research presents an efficient approach that uses fuzzy logic and MapReduce to achieve data performance: a fuzzy controller combined with MapReduce creates an optimal dataset. The approach proposed in this research can be applied to other real-world applications to verify its merits and to discover and address any shortcomings.
VI. REFERENCES
[1] S. Ramírez-Gallego, A. Fernández, S. García, M. Chen, and F. Herrera, "Big Data: Tutorial and guidelines on information and process fusion for analytics algorithms with MapReduce," Inf. Fusion, vol. 42, pp. 51-61, 2018.
[2] R. Kune, P. K. Konugurthi, A. Agarwal, R. R. Chillarige, and R. Buyya, "The anatomy of big data computing," Softw. Pract. Exp., vol. 46, no. 1, pp. 79-105, 2016.
[3] Jin, Xiaolong, Benjamin W Wah, Xueqi Cheng, and
Yuanzhuo Wang. 2015. 'Significance and challenges of big
data research', Big Data Research, 2: 59-64.
[4] Kaisler, S.H., Armour, F., Espinosa, J.A., & Money, W.H.
(2013). Big Data: Issues and Challenges Moving Forward.
2013 46th Hawaii International Conference on System
Sciences, 995-1004.
[5] Fernández, Alberto, Sara del Río, Abdullah Bawakid, and Francisco Herrera. 2017. "Fuzzy rule based classification systems for big data with MapReduce: granularity analysis," Advances in Data Analysis and Classification, 11: 711-730.
[6] Faisal Y. Alzyoud and Wa'el Jum'ah Al_Zyadat, "The classification filter techniques by field of application and the results of output," Aust. J. Basic & Appl. Sci., 10(15): 68-77, 2016.
[7] J. A. Benediktsson, Y. Zhu, M. Chi, Z. Sun, A. Plaza, and J. Shen, "Big Data for Remote Sensing: Challenges and Opportunities," Proc. IEEE, vol. 104, no. 11, pp. 2207-2219, 2016.
[8] S. Mahmud, R. Iqbal, and F. Doctor, "Cloud enabled data analytics and visualization framework for health-shocks prediction," Futur. Gener. Comput. Syst., vol. 65, pp. 169-181, 2016.
[9] Q. He, H. Wang, F. Zhuang, T. Shang, and Z. Shi, "Parallel sampling from big data with uncertainty distribution," Fuzzy Sets Syst., vol. 258, pp. 117-133, 2015.
[10] H. Wickham, J. Hester, R. Francois, J. Jylänki, and M. Jørgensen, "readr: Read Rectangular Text Data. R package version 1.1.1," R Foundation for Statistical Computing, 2017.
[11] Wickham, H. and Francois, R. (2015) dplyr: A Grammar of
Data Manipulation. R Package Version 0.4.3.
http://CRAN.R-project.org/package=dplyr.
[12] Wickham, H. (2017), tidyr: Easily Tidy Data with spread and
gather Functions. R package version 0.6.1. URL:
https://CRAN.R-project.org/package=tidyr
[13] M. Kuhn, “Classification and Regression Training (Caret),”
R Program. Lang. Packag., 2015.
[14] Rosenberg DS (2012). HadoopStreaming: Utilities for Using
R Scripts in Hadoop Streaming. R package version 0.2, URL
http://CRAN.R-project.org/package=HadoopStreaming.
[15] J. Chung and M. B. A. Hanson, "Package 'HiveR'," 2017.
[16] T. R., Jon Garibaldi, Chao Chen, "Package 'FuzzyR'," https://cran.r-project.org/web/packages/FuzzyR/FuzzyR.pdf.
Fernández, Alberto, ara del Río, Abdullah Bawakid, and Francisco Herrera. 2017. 'Fuzzy rule based classification systems for big data with MapReduce: granularity analysis', Advances in Data Analysis and Classification, 11: 711-30