Copyright ©2018 Ikhlas Almukahel et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
International Journal of Engineering &Technology, 7 (4) (2018) 6997-7001
International Journal of Engineering & Technology
Website: www.sciencepubco.com/index.php/IJET
doi: 10.14419/ijet.v7i4.28772
Research paper
Hybrid Approach Using Fuzzy Logic and MapReduce
to Achieve Meaningful Used Big Data
Ikhlas Almukahel 1 *, Wael Alzyadat 2, Mohamad Alfayomi 3
1 Teacher Assistant, Software Engineering Department, Faculty of Information Technology, Isra University, Amman, Jordan
2 Assistant prof., Software Engineering Department, Faculty of Information Technology, Isra University, Amman, Jordan
3 Professor, Computer Science Department, Faculty of Information Technology, Isra University, Amman, Jordan
*Corresponding author E-mail: Ikhlas.almukahel@iu.edu.jo
Abstract
Big data faces many challenges from different aspects; these challenges are represented in its characteristics, such as volume, velocity, variety, and value. Preprocessing and analyzing big data are important steps in acquiring quality information that leads to accurate values for correct decision making. The quality-data taxonomy points to two basic actions to ensure that data is meaningful and predictive. Consequently, a hybrid approach using fuzzy logic and MapReduce is utilized to produce a new version of MapReduce that consists of four layers. Data collection is carried out in the first layer. The second layer performs data preprocessing, where semi-structured data is cleaned and prepared for the map function to acquire relationships. The third layer applies a fuzzy controller as well as classification to generate rules. Finally, in the fourth layer, data reduction and classification are carried out to achieve a meaningful and predictive outcome. The results showed the efficiency of the approach, with Sensitivity = 80%, Specificity = 86%, and F-measure = 2.5, validated against the TREC conference website. By treating the 4Vs, the hybrid approach achieves meaningful data that supports doctors in making the right decision.
Keywords: Big Data; MapReduce; Meaningful; Predictive; Fuzzy Logic Controller.
1. Introduction
Massive amounts of data are acquired from many sources in different domains, such as industry, business, social networks, the internet, health, finance, economics, and transportation. Flexible tools and techniques are needed to lead the big data era. The term big data refers to deriving, collecting, and processing massive amounts of data [1]; because of its characteristics, big data has become difficult to analyze and manage with traditional data processing tools [1, 2]. The main challenge of big data is extracting value in order to make decisions, predict, and improve services [3]. Traditional data mining techniques are largely unable to handle big data, so many artificial intelligence techniques are applied within big data frameworks. The objective of this work is to combine a fuzzy logic controller with MapReduce to extract quality data, and to evaluate the approach using the PIMA India dataset for diabetic patients [4].
2. Related works
Big data still lacks a precise definition. Many researchers and organizations define big data as datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze [5]. The issues of big data refer to the 4V's, which are the characteristics of volume, velocity, variety, and value [6]. Many frameworks apply MapReduce, which was originally produced by Google to solve the web search index creation problem; it has been established as a de facto solution for big data scalability [7]. MapReduce is a programming framework whose main functions are automatically parallelized and executed over big data. Apache Hadoop is one of the most popular open-source implementations of the MapReduce paradigm [8]. Recent publications have focused on approaches to the 4V issues of big data [9], as well as on approaches to improving the extraction of value from big data within an acceptable response time. Table 1 shows a comparison between similar approaches to solving the big data classification problem.
Table 1: Comparison Among Similar Approaches

Authors | Approach | Case studies | Tools
Abdrabo, M., et al. (2018) [10] | MapReduce parallel processing and fuzzy rough sets | Diabetes dataset; electroencephalography | WEKA
Jin, S., J. Peng, and D. Xie (2017) [11] | MapReduce approach with dynamic fuzzy inference | Six UCI datasets | JAVA
del Río, S., et al. (2015) [12] | The Chi et al. algorithm for classification | Six problems from the UCI dataset repository | JAVA
Abdrabo et al. (2018) introduced a framework that enhances the reduction of big data dimensionality; selecting the most important features helps to improve classification performance, with decision tree accuracy of 86.4% and precision of 84.3%. The data preprocessing step tried to overcome two main problems: heterogeneous data, using peer-to-peer transformation, and incomplete data, by assigning a fixed number. In the map step, a fuzzy-rough set was applied for feature selection. In the reduce step, fuzzy means clustering was applied to identify similar features and assign them the same key [10].
In MapReduce with dynamic fuzzy inference/interpolation for big data applications, inference and interpolation methods work together to produce a final output. The average classification accuracy of these two methods was contrasted in experiments covering six different big data problems [11].
del Río, López, Benítez & Herrera (2015) addressed big data classification problems, especially the uncertainty in information extraction associated with the noise inherent in the available data. Classification was evaluated using the accuracy obtained and the runtime spent by the models; the research aimed to analyze the quality of the ChiFRBCS-BigData algorithm in the big data scenario [12].
Classification problems involve extracting information under the ambiguity associated with the noise inherent in the available data [13]. Furthermore, evaluation is carried out using the accuracy obtained and the runtime spent by the models [4].
Fuzzy logic is used to handle the random and imbalanced relation of the MapReduce mapping function (peer to peer) with the runtime. Fuzzy logic determines the path while MapReduce speeds up the process, which is significant for velocity and value. The point of using the fuzzy technique is to efficiently process a large volume of big data within limited run times.
del Río et al. (2015) proposed the Chi-FRBCS-BigDataCS algorithm, a fuzzy rule-based classification system able to deal with the uncertainty introduced by large volumes of data. It uses the MapReduce framework to distribute the computational operations of the fuzzy model, and it includes cost-sensitive learning techniques in its design to address the imbalance of big data [12].
Mahmud, et al. (2016) focused on a fuzzy rule summarization technique that can provide stakeholders with interpretable linguistic rules to explain the causal factors affecting health-shocks [14]. He, Q., et al. (2015) presented a Parallel Sampling method based on HyperSurface (PSHS) for big data with uncertainty distribution, which obtains the Minimal Consistent Subset (MCS) of an original sample set whose inherent structure is uncertain [15].
Notice that the classifiers used to solve these problems are not interpretable (black-box type), which is often vital in medical diagnosis, and there is no explanation or discussion of the fuzzy rules. Table 2 shows a comparison between fuzzy techniques.
Table 2: Comparison Among Fuzzy Techniques

Authors | Nature of problem | Role of the fuzzy set technique | Advantages of using fuzzy techniques
del Río, S., et al. (2015) [12] | Classification | Linguistic fuzzy rule-based classification | A descriptive model with good accuracy
Mahmud, et al. (2016) [14] | Health-shocks prediction | Fuzzy linguistic summarization | Provides interpretable linguistic rules to explain the causal factors
He, Q., et al. (2015) [15] | Parallel sampling | Handling uncertainties of the boundaries of hypersurface granules by a fuzzy boundary | The algorithm maintains an identical distribution
3. Research method
The approach consists of four layers. First comes the data collection layer. The second is the data preprocessing layer, which converts semi-structured data into a structured format and prepares the dataset to be input to the later layers; it involves the map function with the duty of acquiring relations among the dataset, separated attributes, and content. The function of the third layer is applying the Fuzzy Logic Controller (FLC) and starting the first-round classification.
The fourth layer is responsible for the reduce function with second-round classification to find an outcome, using the precision and recall equations to calculate the F-measure for evaluating accuracy. Each layer involves components that treat the characteristics of big data; the whole process achieves the 4V's as follows: the first layer achieves volume, the second and third layers achieve variety and velocity, and the fourth layer achieves value. Figure 1 illustrates the process of the hybrid approach with fuzzy logic controller and MapReduce.
Fig. 1: Hybrid Approach with Fuzzy Logic Controller and MapReduce.
3.1. Layer1: data collection (volume)
The input dataset was obtained from the University of California Irvine (UCI) machine learning repository. The dataset originates from the National Institute of Diabetes and Digestive and Kidney Diseases. Several constraints were placed on the selection of instances from a larger database. All patients are female, at least 21 years old, and of PIMA Indian heritage; the dataset contains 8 attributes and 768 records.
3.2. Layer 2: data preprocessing (variety)
After monitoring the data, all values that affect the results, such as nulls, outliers, and missing values, are defined in order to detect them and prepare the data for the map function. The preprocessing output addresses the variety and veracity of big data, which introduce a degree of uncertainty that has to be handled, in addition to reducing the volume and enhancing the velocity requirements. Preprocessing includes two main steps: removal of all missing values and data conversion.
3.3. Removing missing values and normalization
1) Removal of missing values:
If attribute.value == NULL || attribute.value == 0
Then delete record;
2) Normalization:
Normalization is applied to convert all string data to numerical values; this is important to make the later processing more efficient and of higher quality. Data preprocessing is carried out through rules applied to the raw data.
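The two preprocessing steps above, record removal and normalization, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the attribute names and sample records are hypothetical, and min-max scaling is assumed as the normalization rule.

```python
# Illustrative preprocessing sketch: record removal and min-max normalization.
# Attribute names and the sample records are hypothetical, not the paper's data.

def remove_missing(records, attributes):
    """Drop any record in which an attribute is None or 0, mirroring the rule:
    If attribute.value == NULL || 0 Then delete record."""
    return [r for r in records if all(r.get(a) not in (None, 0) for a in attributes)]

def min_max_normalize(records, attribute):
    """Scale one numeric attribute into [0, 1]."""
    values = [r[attribute] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1  # avoid division by zero for constant columns
    for r in records:
        r[attribute] = (r[attribute] - lo) / span
    return records

records = [
    {"Glu": 148, "BMI": 33.6},
    {"Glu": 0,   "BMI": 28.1},   # zero glucose -> treated as missing, dropped
    {"Glu": 89,  "BMI": 31.0},
]
clean = remove_missing(records, ["Glu", "BMI"])
clean = min_max_normalize(clean, "Glu")
print(len(clean))        # 2 records survive
print(clean[0]["Glu"])   # 1.0 (148 is the maximum)
```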
3.4. Layer 3: map function and fuzzy rules (velocity)
The map function is applied, taking the output of the preprocessing component as structured data; this step maps attributes to values. Fuzzy controller (IF-THEN) rules are then used to avoid the random grouping generated by MapReduce; their second role is associated with the classifier rate.
Fuzzy rules are implemented in the following steps:
Step 1: convert numeric data to categorical data according to the normal reading, as shown in Table 3. Each attribute in the fuzzy categorical dataset refers to an interval for the linguistic terms; the fuzzy linguistic terms are defined as "low" and "high". A triangular membership function is also constructed; e.g., in the first case, the corner points are a = 1, b = 2, and c = 3, where b is the normal reading whose degree in the membership function equals one.
Step 2: generate fuzzy IF-THEN rules covering the training data, using the dataset from Step 1. First, the degrees of the membership functions are calculated for all values in the data. For each instance and each variable, the linguistic value whose membership function is maximal is determined; the process is repeated for all instances to construct fuzzy rules covering all the data.
Step 3: adjust a degree for each rule; the degrees of the membership functions are then aggregated.
Step 4: obtain the final rule base after deleting redundant rules. Considering the degrees of the rules, redundant rules and those with lower degrees are deleted. The fuzzy rule base contains 2^8 candidate rules, focusing on 12 rules that cover 90% of the diabetic patients' dataset.
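Steps 1-3 can be sketched as follows, under stated assumptions: triangular memberships with corner points (a, b, c), only two attributes with two linguistic terms each, and a rule degree taken as the product of memberships. The corner points and training records are hypothetical, not the paper's.

```python
# Sketch of fuzzy rule generation (Steps 1-3). Term intervals and training
# data are illustrative assumptions, not the paper's actual values.

def triangular(x, a, b, c):
    """Triangular membership: rises on [a, b], falls on [b, c], peak 1 at x == b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)

# Hypothetical Low/High term definitions for two attributes.
terms = {
    "Glu": {"Low": (0, 70, 140), "High": (110, 180, 250)},
    "BMI": {"Low": (0, 18, 30),  "High": (22, 35, 60)},
}

def best_terms(record):
    """Step 2: pick, per attribute, the linguistic term with maximal membership;
    the rule degree is assumed to be the product of those memberships."""
    rule, degree = {}, 1.0
    for attr, value in record.items():
        term, mu = max(((t, triangular(value, *p)) for t, p in terms[attr].items()),
                       key=lambda kv: kv[1])
        rule[attr] = term
        degree *= mu
    return rule, degree

def generate_rules(training):
    """Step 3: one candidate rule per instance; keep only the highest-degree
    rule for each antecedent, discarding redundant lower-degree duplicates."""
    best = {}
    for record, label in training:
        rule, degree = best_terms(record)
        key = tuple(sorted(rule.items()))
        if key not in best or degree > best[key][1]:
            best[key] = (label, degree)
    return best

training = [({"Glu": 165, "BMI": 36}, "Diabetic"),
            ({"Glu": 85,  "BMI": 24}, "Non-diabetic"),
            ({"Glu": 170, "BMI": 34}, "Diabetic")]   # redundant with the first
rules = generate_rules(training)
print(len(rules))  # 2 distinct rules survive
```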
3.5. Layer 4: reduce (value)
The reduce function lists the data and puts it into groups of values, where each group is produced as a (key, value) pair; the output is a list of attributes (keys) and all their associated values. Measurement is carried out for convergence between productive and perspective results to achieve meaningful data useful for doctors and patients. The classification measures used are accuracy, precision, recall, and F-measure, which evaluate the classifier and verify its effectiveness based on the confusion matrix. The hybrid approach is evaluated using the Text Retrieval Conference (TREC) [16], [17].
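The grouping the reduce step performs on (key, value) pairs can be sketched as follows; the pairs and the combine function are illustrative, not the paper's actual keys.

```python
# Minimal sketch of MapReduce grouping: mappers emit (key, value) pairs,
# the shuffle groups them per key, and reduce combines each group.
from collections import defaultdict

def shuffle(pairs):
    """Group the emitted values by key, as the MapReduce shuffle phase does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_groups(groups, combine):
    """Apply a combine function to each key's list of values."""
    return {key: combine(values) for key, values in groups.items()}

# e.g. counting predicted classes emitted by mappers (illustrative pairs)
pairs = [("Diabetic", 1), ("Non-diabetic", 1), ("Diabetic", 1)]
counts = reduce_groups(shuffle(pairs), sum)
print(counts)  # {'Diabetic': 2, 'Non-diabetic': 1}
```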
4. Experiment and analysis
The hybrid approach using fuzzy logic and MapReduce is applied to achieve meaningful data in both R packages [18] and WEKA [19], each run independently on the same computer. R is an integrated suite of software facilities for data manipulation using separate packages: the readr package [20] to read rectangular text data; the dplyr package [21] for data manipulation; the tidyr package [22], which is important for working with attributes (features) and rows (observations); the caret package [23] for classification and regression training; its preProcess function [24], whose role is preparing the data; HadoopStreaming [24, 25] and HiveR [26], which provide a framework for writing and managing map/reduce functions and plotting them; and FuzzyR [27] to design and simulate fuzzy logic. On the other hand, WEKA is a collection of learning algorithms and data preprocessing tools applied as built-in functions. The experiment steps are as follows:
4.1. Layer1: data collection (volume)
In WEKA, the dataset is imported using the import data function, which retrieves data from a file; this determines how the whole data is viewed and understood. An advantage of the R packages is that they are able to treat both in-memory and out-of-memory storage, which covers the volume characteristic of big data.
4.2. Layer 2: data preprocessing (variety)
Once a dataset has been read, various data preprocessing tools are applied. In WEKA, built-in functions called filters are used for preprocessing, while in R the preProcess function prepares the data and makes it suitable for analysis.
4.3. Layer 3: map function and fuzzy rules (velocity)
The map instructions divide the data into keys and values. The map function in R is involved in two packages, HadoopStreaming and HiveR, to deal with the scalability of big data. In fuzzy logic, rules are applied to predict diabetes depending on the relations among the attributes and the outcome. Table 3 lists the main affecting rules.
Table 3: Main Affecting Rules That Define Diabetic and Non-Diabetic

R1: If (Npreg is High) and (Glu is Low) and (BP is High) and (Skin is Low) and (Insulin is Low) and (BMI is High) and (PED is High) and (Age is Low) Then Non-diabetic
R2: If (Npreg is Low) and (Glu is Low) and (BP is Low) and (Skin is Low) and (Insulin is Low) and (BMI is Low) and (PED is High) and (Age is Low) Then Non-diabetic
R3: If (Npreg is High) and (Glu is Low) and (BP is High) and (Skin is Low) and (Insulin is Low) and (BMI is High) and (PED is Low) and (Age is Low) Then Non-diabetic
R4: If (Npreg is Low) and (Glu is Low) and (BP is Low) and (Skin is High) and (Insulin is Low) and (BMI is Low) and (PED is Low) and (Age is Low) Then Non-diabetic
R5: If (Npreg is High) and (Glu is Low) and (BP is High) and (Skin is Low) and (Insulin is High) and (BMI is Low) and (PED is Low) and (Age is Low) Then Non-diabetic
R6: If (Npreg is Low) and (Glu is Low) and (BP is High) and (Skin is Low) and (Insulin is Low) and (BMI is Low) and (PED is Low) and (Age is Low) Then Non-diabetic
R7: If (Npreg is Low) and (Glu is Low) and (BP is Low) and (Skin is Low) and (Insulin is Low) and (BMI is Low) and (PED is Low) and (Age is Low) Then Non-diabetic
R8: If (Npreg is High) and (Glu is High) and (BP is High) and (Skin is Low) and (Insulin is High) and (BMI is Low) and (PED is High) and (Age is Low) Then Diabetic
R9: If (Npreg is Low) and (Glu is High) and (BP is Low) and (Skin is Low) and (Insulin is High) and (BMI is High) and (PED is High) and (Age is High) Then Diabetic
R10: If (Npreg is Low) and (Glu is High) and (BP is Low) and (Skin is High) and (Insulin is Low) and (BMI is Low) and (PED is Low) and (Age is Low) Then Diabetic
R11: If (Npreg is High) and (Glu is High) and (BP is High) and (Skin is High) and (Insulin is High) and (BMI is High) and (PED is High) and (Age is High) Then Diabetic
R12: If (Npreg is High) and (Glu is High) and (BP is High) and (Skin is High) and (Insulin is Low) and (BMI is High) and (PED is Low) and (Age is High) Then Diabetic
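Applying rules of this form amounts to matching a fuzzified record against the rule base. The sketch below transcribes two rules from Table 3 (the all-Low rule and the all-High rule) and classifies a hypothetical patient record; exact antecedent matching is an assumption for illustration.

```python
# Rule-matching sketch: the two antecedents are transcribed from Table 3;
# the patient record and the exact-match strategy are illustrative assumptions.

R7 = ({"Npreg": "Low", "Glu": "Low", "BP": "Low", "Skin": "Low",
       "Insulin": "Low", "BMI": "Low", "PED": "Low", "Age": "Low"},
      "Non-diabetic")
R11 = ({"Npreg": "High", "Glu": "High", "BP": "High", "Skin": "High",
        "Insulin": "High", "BMI": "High", "PED": "High", "Age": "High"},
       "Diabetic")
rule_base = [R7, R11]

def classify(fuzzified):
    """Return the class of the first rule whose antecedent matches the
    record's Low/High terms exactly; None when no rule covers it."""
    for antecedent, label in rule_base:
        if all(fuzzified.get(attr) == term for attr, term in antecedent.items()):
            return label
    return None

patient = {"Npreg": "High", "Glu": "High", "BP": "High", "Skin": "High",
           "Insulin": "High", "BMI": "High", "PED": "High", "Age": "High"}
print(classify(patient))  # Diabetic
```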
In R, the HiveR and HadoopStreaming packages are used to derive the big data characteristics. In particular, MapReduce is a parallel distributed system, and reducing the random factor affects the map, where the significance of MapReduce appears in obtaining velocity. In WEKA, classification and regression algorithms are applied to the preprocessed data using the Classify panel.
4.4. Layer 4: reduce function (value)
The reduce function performs calculations on small chunks of data in parallel and then combines the sub-results from each reduced chunk. Patients are classified into two classes: diabetic and non-diabetic. The value extracted from big data is achieved by classifying the PIMA dataset and calculating Precision-Sensitivity (Eq. 1), Recall-Specificity (Eq. 2), and F-measure (Eq. 3). The results are illustrated in Figure 2.
Precision = TP / (TP + FP) [28] (1)

Recall = TP / (TP + FN) [28] (2)

F-measure = (2 × Precision × Recall) / (Precision + Recall) [28] (3)
where TP, TN, FP, and FN indicate, in the following order:
True positives: Diabetic predicted as Diabetic.
True negatives: Non-Diabetic predicted as Non-Diabetic.
False positives: Non-Diabetic predicted as Diabetic.
False negatives: Diabetic predicted as Non-Diabetic.
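Under these definitions, Eqs. (1)-(3) can be computed directly from confusion-matrix counts. The counts below are illustrative, not the paper's reported matrix; note that with this standard formula the F-measure cannot exceed 1.

```python
# Eqs. (1)-(3) from confusion-matrix counts; the counts are illustrative.

def metrics(tp, fp, fn):
    """Precision, recall, and their harmonic mean (F-measure)."""
    precision = tp / (tp + fp)           # Eq. (1)
    recall = tp / (tp + fn)              # Eq. (2)
    f_measure = 2 * precision * recall / (precision + recall)  # Eq. (3)
    return precision, recall, f_measure

# e.g. 80 diabetics correctly flagged, 20 non-diabetics wrongly flagged,
# 20 diabetics missed (hypothetical counts)
p, r, f = metrics(tp=80, fp=20, fn=20)
print(p, r, f)
```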
Fig. 2: F-Measure and Accuracy in R and WEKA; Values and Periods for the F-Measure are Shown in the Graphs.
5. Results
The results obtained by the hybrid fuzzy logic and MapReduce approach on big data are promising and can be used to support decision making and achieve meaningful data. The results demonstrate scalability, that is, the ability to extract, process, and manipulate the data while preserving classification accuracy at a satisfactory level, measured through the confusion matrices, which contain information about actual and predicted classifications, as shown in Figure 3.
Fig. 3: Confusion Matrices for R Packages and Weka.
The evaluation results are observed as follows:
Precision: in WEKA the result is 0.822 and in R (readr, dplyr, tidyr, preProcess, HadoopStreaming, HiveR, and FuzzyR) it is 0.803, a difference of 0.019; WEKA is therefore more effective than R in positive predictive value.
Recall: in WEKA the result is 0.676 and in R it is 0.866, a difference of 0.19; R is therefore more effective than WEKA in the sensitivity measure.
F-measure: the WEKA result is 2.028, while the R result is 2.598; R scores higher on this harmonic measure, with a difference of 0.57 between the two results.
Accuracy: in WEKA the result is 0.701, while in R it is 0.774; the difference of 0.073 means R improved on WEKA in the accuracy measure. The results are summarized in Figure 4.
Fig. 4: Result Differences between R Packages and Weka.
6. Conclusion
This paper addressed two main challenges: the first is extracting meaningful data from big data, and the second is combining a big data technique with fuzzy logic artificial intelligence techniques through a hybrid approach consisting of four layers that treat the 4V's of big data. The experiments with R packages and WEKA implement MapReduce and apply it to classifying and predicting meaningful data. The effectiveness of the approach has been demonstrated on the PIMA Indian diabetes dataset. The key aspects handled are the management and acquisition of, and acting upon, big data.
For the management of big data, the approach uses the fuzzy logic controller and MapReduce dynamically to handle data manipulation, such as updating, adding, deletion, and insertion, which is reflected in the value of the recall measure; this is also its main impact in terms of big data. The significance of the approach is based on the F-measure for acquiring meaningful data by means of MapReduce. The contribution of this research, based on the experimental results, is an approach that supports the healthcare domain in making accurate decisions. On the other hand, the precision measure is negatively affected by the random values generated in MapReduce, which will be addressed in future work.
References
[1] Manyika, J., et al., Big data: The next frontier for innovation,
competition, and productivity. 2011.
[2] Xiaofeng, M., C.J.J.o.c.r. Xiang, and development, Big data
management: concepts, techniques, and challenges [J]. 2013. 1(98):
p. 146-169.
[3] Jin, X., et al., Significance and challenges of big data research.
2015. 2(2): p. 59-64. https://doi.org/10.1016/j.bdr.2015.01.006.
[4] Fernández, A., et al., Fuzzy rule-based classification systems for
big data with MapReduce: granularity analysis. Advances in Data
Analysis and Classification, 2017. 11(4): p. 711-730.
https://doi.org/10.1007/s11634-016-0260-z.
[5] Chen, C.P. and C.-Y.J.I.S. Zhang, Data-intensive applications,
challenges, techniques, and technologies: A survey on Big Data.
2014. 275: p. 314-347. https://doi.org/10.1016/j.ins.2014.01.015.
[6] Tidke, B. and R. Mehta, A Comprehensive Review and Open
Challenges of Stream Big Data, in Soft Computing: Theories and
Applications. 2018, Springer. p. 89-99. https://doi.org/10.1007/978-
981-10-5699-4_10.
[7] del Río, S., et al., A MapReduce approach to address big data
classification problems based on the fusion of linguistic fuzzy rules.
2015. 8(3): p. 422-437.
https://doi.org/10.1080/18756891.2015.1017377.
[8] Hashem, I.A.T., et al., MapReduce: Review and open challenges.
2016. 109(1): p. 389-422. https://doi.org/10.1007/s11192-016-
1945-y.
[9] Jovanovič, U., et al., Big-data analytics: a critical review and some
future directions. 2015. 10(4): p. 337-355.
https://doi.org/10.1504/IJBIDM.2015.072211.
[10] ABDRABO, M., et al., A Framework For Handling Big Data
Dimensionality Based on Fuzzy-Rough Technique. Journal of
Theoretical & Applied Information Technology, 2018. 96(4).
[11] Jin, S., J. Peng, and D. Xie. Towards MapReduce approach with
dynamic fuzzy inference/interpolation for big data classification
problems. in 2017 IEEE 16th International Conference on
Cognitive Informatics & Cognitive Computing (ICCI* CC). 2017.
IEEE. https://doi.org/10.1109/ICCI-CC.2017.8109781.
[12] del Río, S., et al., A MapReduce approach to address big data
classification problems based on the fusion of linguistic fuzzy rules.
International Journal of Computational Intelligence Systems, 2015.
8(3): p. 422-437. https://doi.org/10.1080/18756891.2015.1017377.
[13] Al_Zyadat, W.J. and F. Y.Alzyoud, The classification filter
techniques by field of application and the results of output.
Australian Journal of Basic and Applied Sciences (AJBAS), 2016.
10(15): p. 10.
[14] Mahmud, S., R. Iqbal, and F. Doctor, Cloud-enabled data analytics
and visualization framework for health-shocks prediction. Future
Generation Computer Systems, 2016. 65: p. 169-181.
https://doi.org/10.1016/j.future.2015.10.014.
[15] He, Q., et al., Parallel sampling from big data with uncertainty
distribution. Fuzzy Sets and Systems, 2015. 258: p. 117-133.
https://doi.org/10.1016/j.fss.2014.01.016.
[16] Haruna, K. and M.A. Ismail. Evaluation Datasets for Research
Paper Recommendation Systems. in Data Science Research
Symposium 2018. 2018.
[17] The Text Retrieval Conference (TREC). 2018; Available from
trec.nist.gov/evals.html.
[18] Venables, W.N., D.M. Smith, and R.C. Team, An introduction to
R-Notes on R: A programming environment for data analysis and
graphics. 2018.
[19] Holmes, G., A. Donkin, and I.H. Witten, Weka: A machine learning
workbench. 1994.
[20] Wickham, H., J. Hester, and R.J.U.h.C.R.-p.o.p.r.R.p.v. Francois,
readr: Read Rectangular Text Data, 2017. 1(0).
[21] Wickham, H., et al., dplyr: A grammar of data manipulation. 2015.
3.
[22] Wickham, H.J.U.h.C.R.-p.o.p.t.R.p.v., tidyr: Easily Tidy Data
with’spread ()and gather ()’Functions, 2017. 2017. 1: p. 248.
[23] Denniston, K.J., J.J. Topping, and R.L. Caret, General, organic, and
biochemistry. 2004: McGraw-Hill New York.
[24] Coombes, K.R., K.A. Baggerly, and J.S. Morris, Pre-processing
mass spectrometry data, in Fundamentals of Data Mining in
Genomics and Proteomics. 2007, Springer. p. 79-102.
https://doi.org/10.1007/978-0-387-47509-7_4.
[25] Verma, C. and R. Pandey, Statistical Visualization of Big Data
Through Hadoop Streaming in RStudio, in Handbook of Research
on Big Data Storage and Visualization Techniques. 2018, IGI
Global. p. 549-577. https://doi.org/10.4018/978-1-5225-3142-
5.ch019.
[26] Sadhana, S.S., S.J.I.J.o.E.T. Shetty, and A. Engineering, Analysis
of diabetic data set using hive and R. 2014. 4(7): p. 626-9.
[27] Bondarenko, I., et al., IDAS: a Windows-based software package
for cluster analysis. 1996. 51(4): p. 441-456.
https://doi.org/10.1016/0584-8547(95)01448-9.
[28] Powers, D.M., Evaluation: from precision, recall and F-measure to
ROC, informedness, markedness, and correlation. 2011.
... A few of them adapt the current fuzzy setbased algorithms to big data environment, whereas others work to create new methods that are appropriate for tackling certain big data problems. [19], A novel version of MapReduce with four layers is created by combining MapReduce and fuzzy logic in a hybrid method. The first layer is used to collect data. ...
Article
Full-text available
There is a real increasing in the generating of data from different sources. Data mining (DM) is a useful method to elicit valuable information. Association rule mining (ARM) can assist in finding patterns and trends in big data. Also fuzzy logic plays a main role as assistance technique in handling big data issues. This review paper produces recent literature on hybridization regarding the association rule mining or other DM methods such as classification and clustering and fuzzy logic techniques in big data. Whereas a hybrid model of association rule and fuzzy logic is suggested to get a valuable knowledge for big data applications at good accuracy and less time, with the aid of distributed framework for big data handling (Hadoop, Spark and MapReduce). Different techniques and algorithms were used in these works and evaluated according to accuracy, sensitivity, recall and run time with a various result as Specificity = 86%, Sensitivity = 80% and F-measure = 2.5, or achieving high accuracy and shorter runtime compared to other methods and 98.5accuraccy of fitness function in pruning redundant rules. At the end of the paper we present the most used and prominent techniques that assist in providing a useful and valuable knowledge in different domains from a huge, unstructured and even heterogeneous data. The paper will be beneficial to the researches who interesting in the field of mining big data.
... Additionally, a hybrid method is used for extracting rules and improving the accuracy of important data using machine learning. Apriori algorithms are used in [18] to improve the reduction of time as consumed through 67.38% compared with the original Apriori [19]. An Approach for documentation of dissimilar instructions linked to the transactional datasets is shown in [11]. ...
Article
Full-text available
In recent years, big data has become an important branch of computer science. However, without AI, it is difficult to dive into the context of data as a prediction term, relying on a large feature of improving the process of prediction is connected with big data modelling, which appears to be a significant aspect of improving the process of prediction. Accordingly, one of the basic constructions of the big data model is the rule-based method. Rule-based method is used to discover and utilize a set of association rules that collectively represent the relationships identified by the system. This work focused on the use of the Apriori algorithm for the investigations of constraints from panel data using the discretization preprocess technique. The statistical outcomes are associated with the improved preprocess that can be applied over the transaction and it can illustrate interesting rules with confidence approximately equal to one. The minimum support provided to the present rule considers constraint as a milestone for the prediction model. The model makes an effective and accurate decision. In nowadays business, several guidelines have been produced. Moreover, the generation method was upgraded because of an association data algorithm that works for dissimilar principles of the structures compared with fewer breaks that are delivered by the discretization technique.
... the proposed calculation caters in taking care of confirmation of understudies to different colleges by ordering them into three groups conceded, dismissed and the individuals who most likely would get the affirmation. Ikhlas A. et al [5] have used a hybrid approach utilizing fuzzy rationale and MapReduce to deliver another form of MapReduce which comprise of four layers. Golnar and Shahriar [6] have utilized information (data) anonymization method for affiliation rule stowing away, while parallelization and adaptability highlights are likewise inserted in the proposed model, so as to accelerate enormous information mining process. ...
Article
Owing to the huge volume of uncertain data being collected, mankind faces a colossal amount of fast-arriving data with complex structure. These data come from web transactions, manual exchanges, social networks, and everyday activities. Designing appropriate big data solutions is highly beneficial in areas such as medical services, healthcare, and management. Although new research perspectives on managing big data efficiently have been established, big data still demands long-term technological investment. Fuzzy logic is applied to big data here because of its capacity to handle vagueness and uncertainty in information. Several innovative approaches to big data processing have been proposed previously; we summarize the existing contributions and present a perspective on further enhancements. We survey the relevant studies and find that only a limited number of fuzzy systems have been adopted in big data processing. We also examine the various advantages of fuzzy logic for big data problems. Therefore, in this paper a fuzzy object-oriented database is designed for big data, and several fuzzy queries are executed to evaluate the performance of the designed database. We observe that the continuously advancing extensions of fuzzy sets, combined with various tools, could offer a promising processing environment.
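The core of a fuzzy query is scoring records by a membership degree rather than a crisp predicate. A minimal sketch of that idea follows; the `mu_young` breakpoints, the alpha-cut threshold, and the toy records are illustrative assumptions, not the paper's actual database design:

```python
def mu_young(age):
    """Triangular membership for the fuzzy term 'young': fully young at
    age <= 25, not young at all at age >= 45, linear in between."""
    if age <= 25:
        return 1.0
    if age >= 45:
        return 0.0
    return (45 - age) / 20

def fuzzy_select(records, mu, attr, alpha=0.5):
    """Alpha-cut fuzzy query: keep records whose membership degree is at
    least alpha, returned with their degrees, best matches first."""
    hits = [(r, mu(r[attr])) for r in records]
    return sorted([(r, round(d, 2)) for r, d in hits if d >= alpha],
                  key=lambda p: -p[1])

patients = [{"name": "A", "age": 22}, {"name": "B", "age": 35},
            {"name": "C", "age": 50}]
print(fuzzy_select(patients, mu_young, "age", alpha=0.4))
```

Unlike a crisp `WHERE age < 30`, the query above returns B (age 35) with degree 0.5, which is the graded behaviour that makes fuzzy querying useful on uncertain data.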
... Traffic data is growing rapidly as cities begin to save pictures and videos in a central place, and city security requires extending the retention period of huge traffic data to support public safety and decision-makers. The amount of traffic data keeps increasing [2], [3], and traffic management is challenging because the signal counter cycles through the colours (red, yellow, and green); these are the initial parameters, and although the colours matter, time is the major scale, since the traffic counter in its configuration mode aggregates movement from more than one source. The challenge can thus be approached from different aspects. Approaching it from the big data aspect led to the Global Positioning System (GPS) [4], [5], since the GPS device is compatible and useful in both hardware and software [6]. ...
Article
Full-text available
Streaming big data is one of the shifting trends in technology, moving the data cycle from external hardware behaviour at the low machine level to the digital level, through virtual memory, to be materialized as datasets. The model proposed in this preliminary study aims to track data from different platforms and considers the parameters that are important in regulating the storage machinery, the storage format, and the preprocessing tools. Moreover, the source data type is unstructured data, which must be organized at several levels before it can be forwarded from the machine level to metadata. The storage type is virtual memory, whose capacity is, importantly, limited. The tuple is the third-level context: the digital data from each sensor, passed through a template, are recorded against a time counter.
Conference Paper
A critical step in the creation of software is the elicitation of requirements. According to most of the research, nonfunctional requirements receive less attention than functional requirements, although they are also necessary for the creation of every new application. A poor choice of elicitation technique causes the system to malfunction, and without an elicitation approach the needs and requirements of users cannot be ascertained. Ensuring efficient communication between analysts and users during the elicitation process is the biggest challenge for analysts. This study's major artifact is a proposed model that automatically creates a conceptual model from a series of agile requirements given as user stories. One of our case study's positive outcomes was the accuracy achieved, especially when user stories are succinct assertions that identify the issue to be handled. The objective was to prove that artificial intelligence can be used to elicit software requirements for software systems, and the findings support this claim. This work proposes an elicitation model for NFRs in agile methodology. The approach will help the software business identify and collect requirements for all kinds of software, as well as guide both users and developers during development. Because it covers the elicitation of both FRs and NFRs in the initial phase of agile projects, where NFRs receive less attention, this study reduces time, effort, and risk.
Article
Full-text available
Big Data is a huge amount of high-dimensional data produced from different sources, and its dimensionality is a considerable challenge in data processing applications. In this paper, we propose a framework for handling Big Data dimensionality based on MapReduce parallel processing and FuzzyRough feature selection, introducing a new method for selecting features based on fuzzy similarity relations. Initial experimentation shows that it reduces dimensionality and enhances classification accuracy. The proposed framework consists of three main steps. The first is the data preprocessing step. The next two are the map and reduce steps of the MapReduce paradigm: in the map step, FuzzyRough is utilized for selecting features, and in the reduce step, fuzzy similarity is applied to reduce the extracted features. In our experimental results, the proposed framework achieved 86.4% accuracy using a decision tree technique, while previous frameworks evaluated on the same data set achieved accuracies between 70% and 80%.
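The map/reduce split described above can be sketched with two small Python functions. This is a simplified stand-in, not the paper's FuzzyRough method: the similarity measure (`1 - |x - y|` on normalised values), the relevance threshold, and the toy rows are all illustrative assumptions:

```python
def fuzzy_similarity(x, y):
    """A simple fuzzy similarity on [0, 1] values: 1 - |x - y|."""
    return 1.0 - abs(x - y)

def map_step(rows, n_features):
    """Map: emit (feature_index, mean fuzzy similarity to the class label).
    Assumes features and labels are already normalised to [0, 1]."""
    out = []
    for j in range(n_features):
        sims = [fuzzy_similarity(r["x"][j], r["y"]) for r in rows]
        out.append((j, sum(sims) / len(sims)))
    return out

def reduce_step(mapped, threshold=0.8):
    """Reduce: keep only feature indices whose relevance score clears
    the threshold, i.e. discard weakly relevant dimensions."""
    return sorted(j for j, score in mapped if score >= threshold)

rows = [{"x": [0.9, 0.1, 0.8], "y": 0.9},
        {"x": [0.2, 0.7, 0.1], "y": 0.2},
        {"x": [0.6, 0.3, 0.7], "y": 0.6}]
selected = reduce_step(map_step(rows, n_features=3), threshold=0.8)
print(selected)
```

On a real cluster each map task would score a shard of the rows and the reduce task would merge the partial scores; the single-process version keeps only the logic of the two phases.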
Article
Full-text available
The continuous increase in computational capacity over the past years has produced an overwhelming flow of data, or big data, which exceeds the capabilities of conventional processing tools. Big data signifies a new era in data exploration and utilization. The MapReduce computational paradigm is a major enabler underlying numerous big data platforms. MapReduce is a popular tool for the distributed and scalable processing of big data, increasingly used in different applications primarily because of its important features, including scalability, fault tolerance, ease of programming, and flexibility. Thus, a bibliometric analysis and review were conducted to evaluate the trend of MapReduce research assessment publications indexed in Scopus from 2006 to 2015. This trend includes the use of the MapReduce framework for big data processing and its development. The study analyzed the distribution of published articles, countries, authors, keywords, and authorship patterns. For data visualization, the VOSviewer program was used to produce distance- and graph-based maps. The top 10 most cited articles were also identified based on the citation count of publications. The study utilized productivity measures, domain visualization techniques, and co-word analysis to explore papers related to MapReduce in the field of big data. Moreover, the study discussed the most influential articles contributing to improvements in MapReduce and reviewed the corresponding solutions. Finally, it presented several open challenges in big data processing with MapReduce as future research directions.
Chapter
Data visualization enables a visual representation of a data set so that it can be interpreted meaningfully from a human perspective. Statistical visualization calls for various tools, algorithms, and techniques that can support and render graphical modeling. This chapter explores the detailed features of R and RStudio. The combination of Hadoop and R for big data analytics and its data visualization is demonstrated through appropriate code snippets. The integration of R and Hadoop is explained in detail with the help of a utility called the Hadoop streaming jar. The various R packages and their integration with Hadoop operations in the R environment are explained through suitable examples, and the process of data streaming is shown using different readers of the Hadoop streaming package. A case-based statistical project is considered in which the data set is visualized after dual execution using Hadoop MapReduce and an R script.
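The Hadoop streaming jar mentioned above accepts any executable that reads records from stdin and writes tab-separated key/value pairs to stdout; the chapter demonstrates this with R, but the same contract can be sketched in Python. The word-count mapper/reducer pair below is an illustrative example of that contract, not code from the chapter:

```python
from itertools import groupby

def mapper(lines):
    """Map phase: emit one tab-separated (word, 1) pair per token,
    exactly as a streaming mapper would write to stdout."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reducer(pairs):
    """Reduce phase: input arrives sorted by key (Hadoop's shuffle
    guarantees this); sum the counts for each word."""
    keyed = (p.split("\t") for p in pairs)
    for word, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(c) for _, c in group)}"

if __name__ == "__main__":
    # Local simulation of map -> sort -> reduce; on a cluster each phase
    # would read sys.stdin and write sys.stdout instead.
    for out in reducer(sorted(mapper(["big data big", "data streaming"]))):
        print(out)
```

On a cluster this would be submitted with something like `hadoop jar hadoop-streaming*.jar -mapper mapper.py -reducer reducer.py -input ... -output ...` (jar path illustrative); the R scripts in the chapter plug into the same `-mapper`/`-reducer` slots.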
Article
Due to the vast amount of information available nowadays, and the advantages related to processing this data, the topics of big data and data science have acquired great importance in current research. Big data applications are mainly about scalability, which can be achieved via the MapReduce programming model. It is designed to divide the data into several chunks or groups that are processed in parallel, and whose results are "assembled" to provide a single solution. Among the different classification paradigms adapted to this new framework, fuzzy rule based classification systems have shown interesting results with a MapReduce approach for big data. It is well known that the performance of these types of systems depends strongly on the selection of a good granularity level for the Data Base. However, in the context of MapReduce this parameter is even harder to determine, as it can also be related to the number of Maps chosen for the processing stage. In this paper, we aim at analyzing the interrelation between the number of labels of the fuzzy variables and the scarcity of the data due to the data sampling in MapReduce. Specifically, we consider that as the partitioning of the initial instance set grows, the level of granularity necessary to achieve a good performance also becomes higher. The experimental results, carried out for several big data problems using the Chi-FRBCS-BigData algorithms, support our claims.
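The granularity level discussed above is the number of linguistic labels per fuzzy variable. A minimal sketch of the standard evenly spaced triangular partition (the kind used by Chi-style fuzzy rule based classifiers) shows how the same input spreads its membership across more labels as granularity grows; the specific values below are illustrative, not from the paper's experiments:

```python
def triangular_partition(n_labels):
    """Build n_labels evenly spaced triangular membership functions on
    [0, 1], forming a strong fuzzy partition (degrees sum to 1)."""
    step = 1.0 / (n_labels - 1)
    centers = [i * step for i in range(n_labels)]

    def memberships(x):
        # Degree of x in each label's triangular fuzzy set.
        return [round(max(0.0, 1.0 - abs(x - c) / step), 2)
                for c in centers]
    return memberships

coarse = triangular_partition(3)   # low granularity: 3 labels
fine = triangular_partition(5)     # higher granularity: 5 labels
print(coarse(0.6), fine(0.6))
```

With more labels, each rule covers a narrower slice of the input space, so when MapReduce sampling leaves fewer instances per map, each label sees less data, which is exactly the granularity/partitioning trade-off the paper analyzes.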
Article
The type and amount of data in human society are growing at an amazing speed, driven by emerging new services such as cloud computing, the Internet of Things, and social networks; the era of big data has come. Data has turned from a simple object of processing into a fundamental resource, and how to manage and utilize big data better has attracted much attention. Whether big data requires evolution or revolution in database research remains an open problem. This paper discusses the concept of big data and surveys its state of the art. The framework of big data is described and key techniques are studied. Finally, some new challenges for the future are summarized.
Article
In this paper, we present a data analytics and visualization framework for health-shock prediction based on a large-scale health informatics dataset. The framework is developed using cloud computing services based on Amazon Web Services (AWS) integrated with geographical information systems (GIS) to facilitate big data capture, storage, indexing, and visualization through smart devices for different stakeholders. In order to develop a predictive model for health-shocks, we have collected a unique dataset from 1000 households in rural and remotely accessible regions of Pakistan, focusing on factors such as health, social, economic, and environmental conditions and accessibility to healthcare facilities. We have used the collected data to generate a predictive model of health-shock using a fuzzy rule summarization technique, which can provide stakeholders with interpretable linguistic rules explaining the causal factors affecting health-shocks. The evaluation of the proposed system in terms of the interpretability and accuracy of the generated data models for classifying health-shock shows promising results. The prediction accuracy of the fuzzy model, based on a k-fold cross-validation of the data samples, is above 89% in predicting health-shocks from the given factors.
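The interpretable linguistic rules described above can be illustrated with a tiny fuzzy inference sketch: min as the AND operator and the strongest-firing rule deciding the class. The two rules, the membership functions, and the variable names (`income`, `access`) are hypothetical stand-ins for the paper's actual rule base:

```python
def mu_low(v):
    """Membership of v in the fuzzy set 'low' on [0, 1]."""
    return max(0.0, 1.0 - 2.0 * v) if v < 0.5 else 0.0

def mu_high(v):
    """Membership of v in the fuzzy set 'high' on [0, 1]."""
    return max(0.0, 2.0 * v - 1.0) if v > 0.5 else 0.0

# Illustrative linguistic rules (not the paper's rule base):
#   IF income IS low AND clinic access IS low THEN health_shock
#   IF income IS high                         THEN no_health_shock
def predict(household):
    risk = min(mu_low(household["income"]), mu_low(household["access"]))
    safe = mu_high(household["income"])
    label = "health_shock" if risk > safe else "no_health_shock"
    return label, round(max(risk, safe), 2)

print(predict({"income": 0.1, "access": 0.2}))
```

The returned degree makes the prediction interpretable: a stakeholder can read off which rule fired and how strongly, which is the property the fuzzy summarization approach is chosen for.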