Big data poses challenges that are commonly described through its characteristics: volume, velocity, variety, and value. Preprocessing and analyzing big data are essential for obtaining quality information and accurate values that support correct decision making. The quality data taxonomy points to two basic requirements: that data be meaningful and predictive. Accordingly, a hybrid approach using fuzzy logic and MapReduce is proposed, producing a new version of MapReduce that consists of four layers. Data collection is performed in the first layer. The second layer consists of data preprocessing, where semi-structured data is cleaned and prepared for the map function, which acquires relationships among the data. The third layer applies a fuzzy controller together with classification to generate rules. Finally, in the fourth and last layer, data reduction and classification are carried out to achieve a meaningful and predictive outcome. The results show the efficiency of the approach, with precision = 80%, recall = 86% and F-measure = 2.5, validated against the measures published on the TREC conference website. The hybrid approach treats the 4Vs to achieve meaningful data, which supports doctors in making the right decision.
Hybrid Approach Using Fuzzy Logic and MapReduce
to Achieve Meaningful Used Big Data
Ikhlas Almukahel 1 *, Wael Alzyadat 2, Mohamad Alfayomi 3
1 Teacher Assistant, Software Engineering Department, Faculty of Information Technology, Isra University, Amman, Jordan
2 Assistant prof., Software Engineering Department, Faculty of Information Technology, Isra University, Amman, Jordan
3 Professor, Computer Science Department, Faculty of Information Technology, Isra University, Amman, Jordan
Keywords: Big Data; MapReduce; Meaningful; Predictive; Fuzzy Logic Controller.
1. Introduction
Massive amounts of data are acquired from many sources in different domains, such as industry, business, social networks, the internet, health, finance, economics, and transportation. Flexible tools and techniques are needed to lead the big data era. The term big data refers to deriving, collecting, and processing massive amounts of data [1]; because of its characteristics, big data has become difficult to analyze and manage with traditional data processing tools [1, 2]. The main challenge of big data is extracting value in order to make decisions, predict, and improve services [3]. Traditional data mining techniques are largely unable to handle big data, so many artificial intelligence techniques are applied within big data frameworks. The objective of this work is to combine a fuzzy logic controller with MapReduce to extract quality data, and to evaluate the approach using the PIMA Indian dataset for diabetic patients [4].
2. Related works
The term big data is still a misleading label for the concept itself. Many researchers and organizations define big data as datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze [5]. The issues of big data relate to the 4V's, which are the characteristics of volume, velocity, variety, and value [6]. Many frameworks apply MapReduce, which was originally produced by Google to solve the web search index creation problem. It has been established as a de facto solution for big data scalability [7]. MapReduce is a programming framework whose main functions are automatically parallelized and executed over big data. Apache Hadoop is one of the most popular open-source implementations of the MapReduce paradigm [8]. Publications in recent years have focused on approaches to the issues of big data related to the 4V's [9], and on approaches that improve the extraction of value from big data within an acceptable response time. Table 1 shows a comparison between similar approaches to the big data classification problem.
Abdrabo et al. (2018) introduced a framework that enhances the reduction of big data dimensionality. Selecting the more important features helps to improve classification performance: decision tree accuracy was 86.4%, while its precision was 84.3%. In the data preprocessing step, they tried to overcome two main problems: heterogeneous data, using peer-to-peer transformation, and incomplete data, by assigning a fixed number. In the map step, a fuzzy-rough set was used for feature selection. In the reduce step, fuzzy means clustering was applied to identify similar features and assign them the same key [10].
In MapReduce with dynamic fuzzy inference/interpolation for big data applications, inference and interpolation methods work together to produce a final output. The average classification accuracy of these two methods was compared in experimental research covering six different big data problems [11].
del Río, López, Benítez & Herrera (2015) observed that, in big data classification problems, the extraction of information is affected by the uncertainty associated with the noise inherent in the available data. Classification was evaluated using the accuracy obtained and the runtime spent by the models, with the aim of analyzing the quality of the ChiFRBCS-BigData algorithm in the big data scenario [12].
Classification problems involve extracting information under the ambiguity associated with the noise inherent in the available data [13]. Furthermore, evaluation is carried out using the accuracy obtained and the runtime spent by the models [4].
Fuzzy logic is used to handle the random and imbalanced relation between the MapReduce mapping function (peer to peer) and the runtime. Fuzzy logic determines the path and MapReduce speeds up the process, which is significant for velocity and value. The point of using a fuzzy technique is to process a large volume of big data efficiently within limited run times.
Del Río et al. (2015) proposed the Chi-FRBCS-BigDataCS algorithm, a fuzzy rule-based classification system designed to deal effectively with big data. It handles the uncertainty introduced by large volumes of data by using the MapReduce framework to distribute the computational operations of the fuzzy model, and it includes cost-sensitive learning techniques in its design to address the imbalance of big data [12].
Mahmud et al. (2016) focused on a fuzzy rule summarization technique, which can provide stakeholders with interpretable linguistic rules to explain the causal factors affecting health-shocks [14]. He et al. (2015) presented a Parallel Sampling method based on HyperSurface (PSHS) for big data with uncertainty distribution, which obtains the Minimal Consistent Subset (MCS) of the original sample set whose inherent structure is uncertain [15].
Note that the classifiers used in these works are not understandable (black-box type), although interpretability is often vital in medical diagnosis, and there is no explanation or discussion of the fuzzy rules. The table below shows a comparison between fuzzy techniques.
3. Research method
The approach consists of four layers. The first is the data collection layer. The second is the data preprocessing layer, which converts semi-structured data into a structured format and prepares the dataset as input for the subsequent layers; these involve the map function, whose duty is to acquire relations among the dataset, its separated attributes, and their content. The third layer applies the Fuzzy Logic Controller (FLC) and starts the first round of classification. The fourth layer is responsible for the reduce function and the second round of classification, producing an outcome that is evaluated with two equations, for precision and recall, from which the F-measure is calculated to assess accuracy. Each layer involves components that address characteristics of big data, so the whole process covers the 4V's as follows: the first layer addresses volume, the second and third layers address variety and velocity, and the fourth layer addresses value. The figure below illustrates the process of the hybrid approach with the fuzzy logic controller and MapReduce.
Fig. 1: Hybrid Approach Fuzzy Logic Controller and Mapreduce.
3.1. Layer 1: data collection (volume)
The input dataset was obtained from the University of California Irvine (UCI) machine learning repository. The dataset originates from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of instances from a larger database: all patients are female, at least 21 years of age, and of PIMA Indian heritage. The dataset contains 8 attributes and 786 records.
3.2. Layer 2: data preprocessing (variety)
After the data are inspected, all values that affect the results, such as nulls, outliers, and missing values, are identified so that the data can be prepared for the map function. The preprocessing output addresses the variety and veracity of big data, which introduce a degree of uncertainty that has to be handled, and it also reduces volume and eases velocity requirements. Preprocessing includes two main steps: removal of all missing values and data conversion.
3.3. Remove missing value
If (attribute.value == NULL || attribute.value == 0)
Then delete record;
1) Normalization
Normalization is applied to convert all string data to numerical values; this is important to make later processing more efficient and of higher quality. Data preprocessing is carried out through rules applied to the raw data.
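The two preprocessing steps can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the sample records and the helper names `to_number` and `preprocess` are assumptions, and only the rule itself (delete a record when an attribute is NULL or 0, then work with numerical values) comes from the text above.

```python
from typing import Optional

# Hypothetical records; attribute names follow Table 3, values are made up.
RAW = [
    {"Glu": "148", "BP": "72", "BMI": "33.6"},
    {"Glu": "85", "BP": "0", "BMI": "26.6"},    # zero reading, dropped
    {"Glu": "183", "BP": None, "BMI": "23.3"},  # missing value, dropped
]


def to_number(value) -> Optional[float]:
    """Convert a raw string to float; return None when it cannot be parsed."""
    if value is None:
        return None
    try:
        return float(value)
    except ValueError:
        return None


def preprocess(records):
    """Convert strings to numbers and drop records with a NULL or zero attribute."""
    cleaned = []
    for record in records:
        numeric = {name: to_number(raw) for name, raw in record.items()}
        if any(v is None or v == 0 for v in numeric.values()):
            continue  # rule from the text: attribute is NULL or 0 -> delete record
        cleaned.append(numeric)
    return cleaned


CLEAN = preprocess(RAW)
```

Only the first record survives in this example; the other two are deleted by the missing-value rule.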
3.4. Layer 3: map function and fuzzy rules (velocity)
The map function is applied, taking the structured output of the preprocessing component as input. This step maps attributes to values. Fuzzy controller (IF-THEN) rules are then used to avoid the random grouping generated by MapReduce; their second role is associated with the classifier rate.
Fuzzy rules are implemented in the following steps:
Step 1: numeric data are converted to categorical data according to the normal reading, as shown in Table 3. Each attribute in the fuzzy categorical dataset refers to an interval for the linguistic terms, which are defined as "low" and "high". A triangular membership function is also constructed; e.g., in the first case, the corner points are a = 1, b = 2 and c = 3, where b is a normal reading whose degree in the membership function equals one.
Step 2: fuzzy IF-THEN rules covering the training data are generated from the dataset of Step 1. First, the degrees of the membership functions are calculated for all values in the data. For each instance and each variable, the linguistic value whose membership function is maximal is selected, and the process is repeated for all instances to construct fuzzy rules covering all the data.
Step 3: a degree is assigned to each rule, and the degrees of the membership functions are then aggregated.
Step 4: the final rule base is obtained after deleting redundant rules. Considering the degrees of the rules, redundant rules and those with lower degrees are deleted. The full fuzzy rule base contains 2^8 = 256 possible rules (8 attributes with 2 linguistic terms each); the approach focuses on 12 rules that cover 90% of the diabetic patients' dataset.
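Steps 1 to 4 can be sketched in a few lines. The sketch rests on several assumptions that are not in the paper: the corner points in `TERMS`, the sample records, the use of only two attributes, and the product as the way membership degrees are aggregated into a rule degree. Only the triangular shape with corner points a, b, c and the idea of keeping the strongest rule per antecedent are taken from the steps above.

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership: 0 outside (a, c), rising to 1 at the peak b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# attribute -> {linguistic term: (a, b, c)}; corner points are illustrative only
TERMS = {
    "Glu": {"Low": (40, 90, 140), "High": (100, 160, 220)},
    "BMI": {"Low": (10, 22, 32), "High": (26, 38, 50)},
}


def fuzzify(record):
    """Step 2: pick the term with maximal membership for each attribute."""
    antecedent, degree = [], 1.0
    for attr, terms in TERMS.items():
        term, mu = max(
            ((t, triangular(record[attr], *abc)) for t, abc in terms.items()),
            key=lambda pair: pair[1],
        )
        antecedent.append((attr, term))
        degree *= mu  # Step 3: aggregation by product (an assumption)
    return tuple(antecedent), degree


def build_rule_base(records):
    """Step 4: keep one rule per antecedent, retaining the highest degree."""
    rules = {}
    for record in records:
        antecedent, degree = fuzzify(record)
        best = rules.get(antecedent)
        if best is None or degree > best[1]:
            rules[antecedent] = (record["label"], degree)
    return rules


# Hypothetical training records
RECORDS = [
    {"Glu": 90, "BMI": 22, "label": "Non-Diabetic"},
    {"Glu": 160, "BMI": 38, "label": "Diabetic"},
    {"Glu": 115, "BMI": 22, "label": "Non-Diabetic"},  # redundant, lower degree
]
RULE_BASE = build_rule_base(RECORDS)
```

With the corner points a = 1, b = 2, c = 3 from Step 1, `triangular(2, 1, 2, 3)` evaluates to 1 at the normal reading b. In the sample data the third record yields the same antecedent as the first with a lower degree, so it is discarded as redundant.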
3.5. Layer 4: reduce (value)
The reduce function lists the data and puts it into groups of values, where each group is produced as a (key, value) pair. The output is a list of attributes (keys) and all their associated values. Measurement is then carried out to compare predicted and actual results, in order to achieve meaningful data that is useful for doctors and patients. The classification measures used are accuracy, precision, recall, and F-measure, which evaluate the classifier and verify its effectiveness based on the confusion matrix. The hybrid approach is evaluated using the measures of the Text Retrieval Conference (TREC) [16], [17].
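The grouping behaviour described for the map and reduce functions can be illustrated with a small in-memory sketch. It is plain Python rather than Hadoop, and the function names `map_record` and `reduce_pairs` as well as the sample records are hypothetical; the sketch only shows how (key, value) pairs emitted by a map step end up as a list of attributes with their associated values.

```python
from collections import defaultdict


def map_record(record):
    """Map step: emit one (attribute, value) pair per field of a record."""
    for attribute, value in record.items():
        yield attribute, value


def reduce_pairs(pairs):
    """Reduce step: collect all values emitted under the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


# Hypothetical fuzzified records
RECORDS = [
    {"Glu": "Low", "BMI": "High"},
    {"Glu": "High", "BMI": "High"},
]

pairs = [pair for record in RECORDS for pair in map_record(record)]
GROUPED = reduce_pairs(pairs)
```

In a distributed setting the map calls would run on separate chunks of the dataset and the framework would route pairs with the same key to the same reducer; the grouping logic itself stays the same.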
4. Experiment and analysis
The hybrid approach using fuzzy logic and MapReduce is applied to achieve meaningful data with both R packages [18] and WEKA [19], run independently on a computer with the same properties. R is an integrated suite of software facilities for data manipulation, used here through separate packages: the Readr package [20] reads rectangular text data; the Dplyr package [21] is intended for data manipulation; the Tidyr package [22] works with attributes (features) and rows (observations); the caret package [23] provides Classification and Regression Training; the PreProcess package [24] prepares data; HadoopStreaming [24, 25] and HiveR [26] provide a framework for writing, managing, and plotting map/reduce functions; and FuzzyR [27] is used to design and simulate fuzzy logic. WEKA, on the other hand, is a collection of learning algorithms and data preprocessing tools applied through built-in functions. The experiment steps are as follows:
4.1. Layer 1: data collection (volume)
In WEKA, the dataset is imported using the import data function, which retrieves data from a file; this step determines how the whole dataset is viewed and understood. An advantage of the R packages is that they can handle both in-memory and out-of-memory storage, which covers the volume characteristic of big data.
4.2. Layer 2: data preprocessing (variety)
Once a dataset has been read, various data preprocessing tools are applied. In Weka, built-in functions called filters are used for preprocessing, while in R the PreProcess package prepares the data and makes it suitable for analysis.
4.3. Layer 3: map function and fuzzy rules (velocity)
Map instructions divide the data into keys and values. In R, the map function is provided by two packages, HadoopStreaming and hive, to deal with the scalability of big data. With fuzzy logic, rules are applied to predict diabetes depending on the relations among the attributes and the outcome. Table 3 lists the main rules involved.
Table 3: Main Effected Rules That Define Diabetic and Non-Diabetic
R1: If (Npreg is High) and (Glu is Low) and (BP is High) and (Skin is Low) and (Insulin is Low) and (BMI is High) and (PED is High) and (Age is Low)
R2: If (Npreg is Low) and (Glu is Low) and (BP is Low) and (Skin is Low) and (Insulin is Low) and (BMI is Low) and (PED is High) and (Age is Low)
R3: If (Npreg is High) and (Glu is Low) and (BP is High) and (Skin is Low) and (Insulin is Low) and (BMI is High) and (PED is Low) and (Age is Low)
R4: If (Npreg is Low) and (Glu is Low) and (BP is Low) and (Skin is High) and (Insulin is Low) and (BMI is Low) and (PED is Low) and (Age is Low)
R5: If (Npreg is High) and (Glu is Low) and (BP is High) and (Skin is Low) and (Insulin is High) and (BMI is Low) and (PED is Low) and (Age is Low)
R6: If (Npreg is Low) and (Glu is Low) and (BP is High) and (Skin is Low) and (Insulin is Low) and (BMI is Low) and (PED is Low) and (Age is Low)
R7: If (Npreg is Low) and (Glu is Low) and (BP is Low) and (Skin is Low) and (Insulin is Low) and (BMI is Low) and (PED is Low) and (Age is Low)
R8: If (Npreg is High) and (Glu is High) and (BP is High) and (Skin is Low) and (Insulin is High) and (BMI is Low) and (PED is High) and (Age is Low)
R9: If (Npreg is Low) and (Glu is High) and (BP is Low) and (Skin is Low) and (Insulin is High) and (BMI is High) and (PED is High) and (Age is High)
R10: If (Npreg is Low) and (Glu is High) and (BP is Low) and (Skin is High) and (Insulin is Low) and (BMI is Low) and (PED is Low) and (Age is Low)
R11: If (Npreg is High) and (Glu is High) and (BP is High) and (Skin is High) and (Insulin is High) and (BMI is High) and (PED is High) and (Age is High)
R12: If (Npreg is High) and (Glu is High) and (BP is High) and (Skin is High) and (Insulin is Low) and (BMI is High) and (PED is Low) and (Age is High)
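One possible way to hold the antecedents of Table 3 in code and look up which rule a fuzzified record satisfies is sketched below. Only three of the twelve rules (R6, R7 and R11) are encoded for brevity; the helper names `rule` and `matching_rules` are hypothetical, and exact matching of all eight terms is an assumption about how the rules are applied.

```python
ATTRIBUTES = ("Npreg", "Glu", "BP", "Skin", "Insulin", "BMI", "PED", "Age")


def rule(*terms):
    """Build an antecedent from eight terms given in ATTRIBUTES order."""
    if len(terms) != len(ATTRIBUTES):
        raise ValueError("expected one term per attribute")
    return dict(zip(ATTRIBUTES, terms))


# Antecedents copied from Table 3 (subset only)
RULES = {
    "R6": rule("Low", "Low", "High", "Low", "Low", "Low", "Low", "Low"),
    "R7": rule(*["Low"] * 8),
    "R11": rule(*["High"] * 8),
}


def matching_rules(record):
    """Return the names of rules whose antecedent equals the record's terms."""
    return [name for name, antecedent in RULES.items() if antecedent == record]


# Eight attributes with two terms each give 2 ** 8 = 256 possible antecedents.
POSSIBLE_ANTECEDENTS = 2 ** len(ATTRIBUTES)
```

A record whose eight attributes are all fuzzified to Low matches R7 only, while a record that differs from every stored antecedent returns an empty list.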
In R, the Hive and HadoopStreaming packages are used to address big data characteristics. In particular, MapReduce is a parallel distributed system; reducing the random factor that affects the map step is where the significance of MapReduce for obtaining velocity appears. In WEKA, the classification and regression algorithms applicable to the preprocessed data are accessed through the Classify panel.
4.4. Layer 4: reduce function (value)
The reduce function performs calculations on small chunks of data in parallel and then combines the sub-results from each reduced chunk. Patients are classified into two classes: diabetic and non-diabetic. Value is extracted from big data by classifying the PIMA dataset and calculating precision (positive predictive value, Eq. 1), recall (sensitivity, Eq. 2) and F-measure (Eq. 3). The results are illustrated in Figure 2.
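\[ \mathrm{Precision} = \frac{TP}{TP + FP} \tag{1} \] [28]
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \tag{2} \] [28]
\[ F\text{-measure} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{3} \] [28]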
where TP, TN, FP, and FN denote, respectively:
True positives (TP): Diabetic predicted as Diabetic.
True negatives (TN): Non-Diabetic predicted as Non-Diabetic.
False positives (FP): Non-Diabetic predicted as Diabetic.
False negatives (FN): Diabetic predicted as Non-Diabetic.
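Eqs. 1 to 3 translate directly into code. The sketch below also includes accuracy, using its usual definition (correct predictions over all predictions), which the text names as a measure but does not write out. The confusion-matrix counts are hypothetical and are not taken from the experiments.

```python
def precision(tp: int, fp: int) -> float:
    """Eq. 1: share of predicted Diabetic cases that are truly Diabetic."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Eq. 2: share of truly Diabetic cases that are predicted Diabetic."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p: float, r: float) -> float:
    """Eq. 3: harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Share of all predictions that are correct (standard definition)."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


# Hypothetical confusion-matrix counts
TP, TN, FP, FN = 30, 40, 10, 20
P = precision(TP, FP)          # 30 / 40
R = recall(TP, FN)             # 30 / 50
F = f_measure(P, R)
A = accuracy(TP, TN, FP, FN)   # 70 / 100
```

The guards against a zero denominator return 0.0 as a convention for degenerate cases, such as a classifier that never predicts the Diabetic class.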
Fig. 2: F-Measure and Accuracy in R and WEKA. Values and Periods for F-Measure Are Shown in Graphs.
5. Results
The results obtained with the hybrid approach of fuzzy logic and MapReduce on big data are promising and can be used to support decision making and achieve meaningful data. The results show scalability: the approach is able to extract, process and manipulate data while preserving classification accuracy at a satisfactory level. This is measured through the confusion matrices, which contain information about the actual and predicted classifications produced by the approach, as shown in the figure below.
Fig. 3: Confusion Matrices for R Packages and Weka.
The evaluation results are as follows:
Precision: in Weka, the result is 0.822, and with the R packages (Readr, Dplyr, Tidyr, PreProcess, HadoopStreaming, HiveR, and FuzzyR) the result is 0.803; the difference between the two results is 0.01905. This means Weka is more effective than R in positive predictive value.
Recall: in Weka, the result is 0.676, and with the R packages (Readr, Dplyr, Tidyr, PreProcess, HadoopStreaming, HiveR, and FuzzyR) the result is 0.866; the difference between the two results is 0.19. This means R is more effective than Weka in the sensitivity measure.
F-measure: the Weka result is 2.028, while the R result is 2.598, which means R is more harmonic than Weka, with a difference of 0.57 between the two results.
Accuracy: in Weka, the result is 0.701, while in R the result is 0.774. The difference between the two results is 0.073, which means the R packages (Readr, Dplyr, Tidyr, PreProcess, HadoopStreaming, HiveR, and FuzzyR) improve on Weka in the accuracy measure. The results are summarized in Fig. 4.
Fig. 4: Result Differences between R Packages and Weka.
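The comparison between the two tools can be recomputed from the reported precision, recall and accuracy. The sketch below takes those three figures from the list above and applies Eq. 3 directly to the reported precision and recall; the dictionary layout and the helper name `harmonic_f` are assumptions. Because Eq. 3 is a harmonic mean, its result always lies between the precision and the recall, so the values it produces are on a 0 to 1 scale, which differs from the scale of the F-measure figures quoted in the list.

```python
# Precision, recall and accuracy as reported for each tool
REPORTED = {
    "Weka": {"precision": 0.822, "recall": 0.676, "accuracy": 0.701},
    "R": {"precision": 0.803, "recall": 0.866, "accuracy": 0.774},
}


def harmonic_f(p: float, r: float) -> float:
    """Eq. 3 applied as written: 2 * P * R / (P + R)."""
    return 2 * p * r / (p + r)


# Difference R minus Weka for each reported measure
DIFF = {
    measure: round(REPORTED["R"][measure] - REPORTED["Weka"][measure], 3)
    for measure in ("precision", "recall", "accuracy")
}

# F-measure on the 0..1 scale implied by Eq. 3
F_SCORES = {
    tool: round(harmonic_f(v["precision"], v["recall"]), 3)
    for tool, v in REPORTED.items()
}
```

On this scale R still comes out ahead of Weka on the F-measure, which agrees with the direction of the comparison stated in the list.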
6. Conclusion
This paper addressed two main challenges: the first is extracting meaningful data from big data, and the second is combining big data techniques with fuzzy logic, an artificial intelligence technique, through a hybrid approach consisting of four layers that treat the 4V's of big data. The experiments with R packages and Weka implemented MapReduce and applied it to classify and predict meaningful data. The effectiveness of the approach has been demonstrated on the PIMA Indian diabetes dataset. The main aspects handled are the management and acquisition of big data and acting on it.
The management of big data uses the fuzzy logic controller and MapReduce dynamically to manipulate data, such as data updating, adding, deletion, and insertion, which is reflected in the value of the recall measure. Furthermore, the main impact in terms of big data is the significance of the approach, based on the F-measure, in acquiring meaningful data by means of MapReduce. The contribution of this research, based on the experimental results, is that the approach supports the healthcare domain in making accurate decisions. On the other hand, the precision measure is negatively affected by the random values generated in MapReduce, which will be addressed in future work.
