ArticlePDF Available

Abstract and Figures

Big data of different types, such as texts and images, are rapidly generated from the internet and other applications. Dealing with this data using traditional methods is not practical since it is available in various sizes, types, and processing speed requirements. Therefore, data analytics has become an important tool because only meaningful information is analyzed and extracted, which makes it essential for big data applications to analyze and extract useful information. This paper presents several innovative methods that use data analytics techniques to improve the analysis process and data management. Furthermore, this paper discusses how the revolution of data analytics based on artificial intelligence algorithms might provide improvements for many applications. In addition, critical challenges and research issues were provided based on published paper limitations to help researchers distinguish between various analytics techniques to develop highly consistent, logical, and information-rich analyses based on valuable features. Furthermore, the findings of this paper may be used to identify the best methods in each sector used in these publications, assist future researchers in their studies for more systematic and comprehensive analysis and identify areas for developing a unique or hybrid technique for data analysis.
Content may be subject to copyright.
ARO p-ISSN: 2410-9355, e-ISSN: 2307-549X 45
Data Analytics and Techniques: A Review
Safa S. Abdul-Jabbar1 and Alaa K. Farhan2
1Department of Computer Science, College of Science for Women, University of Baghdad, Baghdad, Iraq
2Department of Computer Science, University of Technology, Baghdad, Iraq
Abstract—Big data of dierent types, such as texts and images,
are rapidly generated from the internet and other applications.
Dealing with this data using traditional methods is not practical
since it is available in various sizes, types, and processing speed
requirements. Therefore, data analytics has become an important
tool because only meaningful information is analyzed and extracted,
which makes it essential for big data applications to analyze and
extract useful information. This paper presents several innovative
methods that use data analytics techniques to improve the analysis
process and data management. Furthermore, this paper discusses
how the revolution of data analytics based on articial intelligence
algorithms might provide improvements for many applications.
In addition, critical challenges and research issues were provided
based on published paper limitations to help researchers distinguish
between various analytics techniques to develop highly consistent,
logical, and information-rich analyses based on valuable features.
Furthermore, the ndings of this paper may be used to identify the
best methods in each sector used in these publications, assist future
researchers in their studies for more systematic and comprehensive
analysis and identify areas for developing a unique or hybrid
technique for data analysis.
Index Terms—Big data analysis, Data analytics, Data
analysis, Data management, Machine learning
Every company collects a considerable amount of data from
various sources. So two prerequisites are needed to secure this
data and use techniques to extract useful information from
this data (Khoshbakht, Shiranzaei and Quadri, 2021; Farhan
and Ali, 2017). The use of big data has rapidly progressed
from a theory to a reality with the rapid progression of data
resources and the creation of companies specializing in big
data (Zheng and Guo, 2020; Do Nascimento, et al., 2021;
Mariani and Baggio, 2022). For example, clients struggle to
  
because the amount of data on the internet is constantly
rising. When a customer submits a query for information
or data to an Internet search engine, the result is typically
many pages. Hence, he faces the repetitious task of locating
         
describing this problem is called “Data Overloading” (Kan
and Klavans, 2002). Hence, the primary objective of this
decade of electronic revolution is to construct and ensure a
better manner of managing, collaborating, and developing
via the use of computer and information technology-based
knowledge and information-oriented services (Rajon, Shamim
and Arif, 2011; Russell and Norvig, 2020). The process of
analyzing and discovering hidden patterns, undiscovered
correlations, and other valuable business information from a
vast volume of data is known as big data analytics (Patel,
Singh and Kazi, 2017; Faizan, et al., 2020). Therefore, data
analytics is a crucial subject for many systems, such as those
that work with strings or information retrieval operations
(Abdul-jabbar and George, 2017). Furthermore, data analytics
can be used to check the privacy issues in social media, such
as tags and image uploading, as we can see on Flickr and
Facebook (Smith, et al., 2013; Abkenar, et al., 2020). Besides
social media applications, data analytics can provide many
 
video (Verma and Agrawal, 2016).
This paper has three overarching goals:
1. It will provide a brief history of data analytics techniques
and methods for documents and describe how data analytics
tools utilize the knowledge from all input documents.
2. Presents how the previous studies are based on multi-
algorithms and multi objective to optimize the traditional
methods and explain the researcher with a comprehensive
overview that helps him choose the suitable algorithms and
integrate them into a model according to the task at hand.
3. Finally, this paper also illustrates the limitations of each
proposed method to present new directions in future works.
The paper is structured as follows, Section II introduces
the proposed data analytics techniques and methods, and
        
         
present a compelling discussion and conclusions that inform
researchers on what they can learn from published research
papers mentioned in this research.
Data analysis primarily entails big data analytical
methodologies, systematic architecture, data mining, and
analysis tools. The most crucial phase in big data is data
Vol. X, No. 2 (2022), Article ID: ARO.10975. 11 pages
DOI: 10.14500/aro.10975
Received 01 May 2022; Accepted: 28 August 2022
Regular research paper: Published: 08 October 2022
Corresponding author’s e-mail:
Copyright © 2022 Safa S. Abdul-Jabbar and Alaa K. Farhan. This
is an open access article distributed under the Creative Commons
Attribution License.
ARO p-ISSN: 2410-9355, e-ISSN: 2307-549X
     
making recommendations, and making judgments and
decision support tools that have gained popularity, such
as executive information systems and online analytical
processing. Therefore, data analysis and interpretation
complexity encourage researchers and companies to use
algorithms that process real-time data, analyze it, and
produce highly accurate analytics results. In addition, data
analysis can be used to investigate potential values where
this information can be used for business development and
performance enhancement, such as predictive analytics
that can make future predictions. Data analytics is a wide,
        
analysis varies depending on the type of application required
(Schwarz, Schwarz and Black, 2014; Harfouchi, et al., 2017;
Rajaraman, 2016). Hence, data analytics aims to answer three
categories of questions in general. As shown in Fig. 1, these
elucidate what happened in the past, what is happening now,
and what is anticipated (Ghavami, 2020).
As a result, processing and obtaining the necessary
information from an extensive database cost a lot of time
and processing power (Abdul Majeed, Kadhim and Subhi
Ali, 2017). Moreover, interdisciplinary investigation makes it
   
to conduct a large-scale reality check. Therefore, viable
research provides critical features for completing this activity
and overcoming the inaccessibility of analytical capabilities
(Kashyap, 2019).
          
science used to break data into individual components for
personal inspection and integrate these components to create
knowledge. Informally, Oracle and Cloudera have proposed a
seven-step “value-chain” approach for extracting value using
data analytics; these steps are as follows (Ghavami, 2020):
 
 
3. Data collecting.
4. Data cleaning.
5. Data modeling.
6. Data science team creating (i.e., building solid teams).
7. Optimize and repeat.
On the other hand, Dr. Carol Anne Hargreaves proposed
another seven steps for the business analytics process in
her data science process model, which also can be listed as
follows (Ghavami, 2020):
 
2. Explore the data.
3. Analyze the data.
4. Predict what is likely to happen.
 
6. Make a decision and measure the outcome.
7. Update the system with the results of the decision.
All kinds of data analytics processes, including the
traditional Knowledge Discovery in Databases (KDD)
process, and others such as (Mishra and Sharma, 2014), who
proposed six steps for data analytics and (Chen, Mao and Liu,
2014) suggested three primary steps only and many others.
These proposed systems depend on big data analytics tools
that provide valuable knowledge for enhancing business.
        
models that can be divided into the following types:
A. Advanced analytics and predictive modeling
Machine learning, data science, and predictive modeling
have grown widespread in every area where data analysis
plays a key role (Butcher and Smith, 2020). Data mining
is a sophisticated technique for evaluating large amounts
of data. There are two forms of data analytics: Supervised/
       
previous research studies, predictive modeling works in
       
results when used with unsupervised analytics than with
supervised analytics (Fan, et al., 2018). A prediction
model is built by learning a dataset with a known outcome
       
       
Martens and Provost, showed in 2013 that when predictive
models are created depending on varied and accurate
data, they can provide a performance improvement even
on a large amount of data. In this study, the researchers
trained models and made predictions on sparse datasets
       
      
test the proposed method. The proposed method can
conclude that the system with big data might be more
     
organizations with more data and better understanding may
      
Provost, 2013). The data analytics technologies can be used
in health-care systems as in 2014 when Pourhomayoun,
et al., 2014 proposed a new system for remote health
monitoring (Pourhomayoun, et al., 2014).
On the other hand, machine learning is one of the most
      
of knowledge-intensive automation that can be used in many
applications (Mishra and Sharma, 2014; Cearley, et al., 2018).
      
such as medical, roads and many other applications with
the risk of facing many problems in robustness, monitoring,
alignment and systemic safety that should be handled
(Rajpurkar, et al., 2017; Hendrycks, et al., 2021).Fig. 1. Big data analytics' temporal questions (Peter Ghavami, 2020).
ARO p-ISSN: 2410-9355, e-ISSN: 2307-549X 47
As an example of using neural networks for advanced data
analytics systems, in 2017, Jain presented the implementation
details for detecting telecommunication fraud using Data
     
Data Mining. The proposed method depends on Microsoft
Azure’s Event Hub and Stream Analytics components for
fraud detection using a self-coded algorithm and a Data
Mining Neural Network Pattern Recognition tool. The
 
         
systems and provide a foundation for big data analytics and
mining (Jain, 2017). Furthermore, Talasila, et al. (2020)
presented a novel neural network-based method for medical
data analytics and disease prediction in 2020. They employed
    
and then fed them into a Recurrent Neural Network for
disease forecasting. As a result, the new technique had a
98.57% accuracy, more than the current accuracy presented
by the existing methods for the heart disease dataset (Talasila,
et al., 2020). Furthermore, dealing with big data can be
aided by deep learning, which has the potential to extract
complicated abstractions (Vu, et al., 2021). A new analytics
model for distant physiological data was proposed based on
 
The proposed model is decomposed into several steps. The
        
which collect the data and send it to the analytical system.
Then the data preprocessing and feature extraction step
should be done to the received data. Followed by data sample
      
          
model. The proposed model was evaluated using a subset
of data acquired from 600 heart failure patients through
a remote health monitoring system. The proposed model
dramatically improved prediction accuracy and performance
(Pourhomayoun, et al., 2014). Whereas in 2019, Corizzo,
Ceci and Malerba, 2019 were inspired by the goals of
  
several national governments (Corizzo, Ceci and Malerba,
2019). They employed recommended methodologies based
on distributed architectures, big data analytics, and predictive
modeling research domains. The results of the proposed
system give accurate predictions (temporal and geographical)
that are scalable in big data. While in 2021(Hamarashid, Saeed
and Rashid, 2021), a new paper was published to present a
novel model for predicting the next word depending on the
         
N-grams used to reduce the time for predicting the next word
in Kurdish dataset. The proposed model achieved results with
accuracy up to 96.3%. Also, in 2021 another research was
presented to produce a prediction model for healthcare centers
based on machine learning algorithms and analysis methods
(Moharram, Altamimi and Alshammari, 2021). In this paper,
they analyze the input data to reduce the number of training
data. Then, three machine learning algorithms were applied
       
the results and select the best one for the proposed system.
       
appointment no-shows in pediatric outpatient clinics with
 
year, Rocha, et al. (2021) used Principal Component Analysis
techniques and unsupervised algorithms to perform better
clustering. As a result, K-mean clustering algorithm shows the
best results for clustering operation (Rocha, et al., 2021).
B. Model accuracy and optimization
There are several optimization strategies available by
multiobjective optimization approaches (Zarchi and Attaran,
2019), (Wang, et al., 2011), (Jaouadi, et al., 2020). In 2020,
Castellanos, et al. showed how to specify, deploy and track
performance metrics in big data analytics applications
       
design process methodology based on the Attribute-Driven
and Architecture analysis method technique (Castellanos,
et al., 2020). Furthermore, many researchers employ the
approximation model instead of the accurate numerical
        
multiobjective optimization approaches in dealing with
complicated engineering issues (Choi, Cho and Kim, 2018).
Therefore, employing optimization techniques is the best
method for identifying suitable model parameters (Kumar,
et al., 2018). A data-driven predictive modeling strategy for
forecasting surface roughness in additive manufacturing is
developed to optimize the integrity of fabricated components.
Various sensors of various sorts are used to collect data on
temperature and vibration. An ensemble learning approach is
used to train the surface roughness prediction model. A subset
of these characteristics is chosen to enhance computational
complexity and accuracy rate. As a result, the proposed
model can provide accurate predicting results. At the same
time, the frequency amplitude of the build plate vibrations,
       
outcome (Li, et al., 2019). Whereas in the education sector,
Tran, et al. (2019) have published a paper that described the
        
by establishing Federated Learning over a wireless network.
       
communication latencies caused by learning accuracy level,
Federated Learning time, and energy consumption of mobile
user equipment. They found the globally optimal solution
        
solution provides exciting insights into design issues through
the ideal Federated Learning over wireless network learning
(time, accuracy, and user equipment’s energy cost) obtained
through numerical and theoretical analysis (Tran, et al.,
2019). In 2019 Zou, et al., proposed a new vehicle evaluation
prediction model (Zou, et al., 2019). This model is used to
optimize the traditional logistic regression algorithm by
studying the logistic mathematical model, designing the error
function, using the gradient descent method to discover the
      
   
enhanced, and the accuracy is maintained.
In 2020, Liu, et al. developed a new adaptive model for
     
ARO p-ISSN: 2410-9355, e-ISSN: 2307-549X
on micro multi objective genetics to improve performance.
The optimization results further demonstrate the proposed
model’s usefulness in real-life applications. However, this
model needs more samples and local-densifying iterations to
provide reliable optimization results (Liu, et al., 2020). Also,
   
Intelligence model that aims to create a hybrid framework
for predicting and analyzing stress intensity factors. This
framework was built by building an adaptive neuro-fuzzy
inference system, tuned using two meta-heuristic algorithms:
genetic algorithm and particle swarm optimization. The
proposed model outperformed the other AI models for
accurate prediction, with R2 = 0.9913, RMSE = 23.6, and
MAE = 18.07. However, increasing the datasets generated
         
computations that include a variety of ranges and materials
might enhance prediction performance (Seghier, et al., 2020).
C. Natural language processing
Researchers in the discipline can leverage techniques
developed to appropriately and accurately analyze language.
For example, natural language techniques have computational
  
and deep learning techniques such as CNN are widely used in
this area (McNamara, et al., 2017; Shamsaldin, et al., 2019).
Hence, the Natural Processing techniques allow researchers
to collect and analyze data to extract the information (Rajput,
        
categorization is the optimization problem. This problem can
consider an analytical issue for document summarization,
prompting a group of academics to create a nature-inspired
optimization technique based on a multi-criteria optimization
  
       
increases of 31.09% (8.43%) and 18.63% (6.09%) in
ROUGE-2 (ROUGE-L) compared to the best single-objective
  
Gomez, Vega-Rodríguez and Pérez, 2018). In the same year
(Rashid, Mustafa and Saeed, 2018), Rashid, Mustafa and
Saeed (2018) applied a stemmer to Kurdish text documents
(KDC-4007 dataset). They used three algorithms: Support
Vector Machine, Naïve Bays, and Decision Tree, to classify
Kurdish text. After the preprocessing phase, they found
that the support vector machine achieved the best accuracy
among all the applied algorithms. In 2019, researcher
Sanchez-Gomez, Vega-Rodríguez and Pérez, 2018 continued
developing the research proposed in the previous year by
        
Bee Colony. The developed system was tested on several
datasets (the same datasets used in their previous research)
and evaluated the results using a variety of measures.
Consequently, the results for ROUGE-2 and ROUGE-L have
improved to between 7.37% and 40.76% and 2.59% and
11.24%, respectively (Sanchez-Gomez, et al., 2019).
On the other hand, Yadav and Chatterjee (2016) describe
       
the meaning of essential words in the content for text
summarization. Sentiment analysis is constantly utilized for
large-scale text data analysis and subjectivity analysis. This
study demonstrates that sentiment analysis may be used
        
to summarize the content, particularly for 50% (Yadav and
Chatterjee, 2016). Furthermore, researchers can employ a
lexicon-based technique to examine students’ responses.
A new algorithm has been suggested to establish teachers’
opinion results by extracting semantic meaning from students’
      
amount of positive or negative thoughts. This method displays
the instructors’ opinion results, categorized according to the
strength of the positive or negative sentences. However,
utilizing a lexicon approach to sentiment analysis is not
optimal because some crucial details might be lost (Aung and
Myo, 2017).
A summarization system can be designed depending on the
dataset’s similarity or dissimilarity measures. The research
        
summarization for text as a binary analysis issue. They
      
    
encode a potential subcategory of sentences to be included
in the summary, and then assessed using objective functions
such as the sentence’s location in the document. The results
show that good improvements were obtained depending on
the dataset used and the objective function (Saini, et al.,
2019). In 2021, another paper was proposed to perform
data analysis using state-of-the-art techniques. Using syntax
analysis, they developed a method capable of extracting
the recent Toolkit for ATM Incidence investigative process
taxonomy factors from free-text safety reports. Finally, they
modify a Data-Driven Method capable of automatically
determining the cause of the aircraft accident. The results
demonstrate that when merely elevated predictions are
considered, the model provided pilots’ contribution is around
97% accurate and 94% for ATCo (Buselli, et al., 2021). In
the same year, Vargas-Calderón, et al. (2021) presented a
model used in healthcare applications to evaluate the quality
of service in hospitals depending on client reviews. After the
text extraction and cleaning step, the model was designed
depending on multi-ML algorithms (Vargas-Calderón, et al.,
2021). Furthermore, in 2021 Hryshchenko and Yaremenko
      
and neural networks to categorize a batch of text data and
determine both disadvantages and advantages of each method
(Hryshchenko and Yaremenko, 2021). At the same time,
Yaremenko, Rogoza and Spitkovskyi (2021) developed a
neural network architecture that can process a large amount
of data in real-time systems and handle the determined
limitation of the applied mathematical models of the standard
Neural Networks and Naive Bays (Yaremenko, Rogoza and
Spitkovskyi, 2021).
D. Quantitative analysis (prediction and prognostics)
Quantitative analysis is concerned with quantifying and
analyzing variables to arrive at conclusions. It entails using
ARO p-ISSN: 2410-9355, e-ISSN: 2307-549X 49
statistical tools to analyze numerical data for answering
questions such as who and when. Apuke published his
work on predictor measurement heterogeneity by altering
the degree of measurement error across derivation and
validation scenarios. Hence, he generated hybrid predictor
measurements using measurement error models (Apuke,
2017; Pajouheshnia, et al., 2019; Luijken, et al., 2019; and
Luijken, et al., 2020). In 2021, Admiraal, et al. used 12
quantitative features gathered from various patient situations
to train several types of machine learning algorithms. The
research results show that machine learning employing
quantitative features derived from collected data has a
better precision than visual data analysis in predicting poor
prognosis following cardiac arrest, making it a potential
alternative to visual analysis (Admiraal, et al., 2021). In
2022, Luijken, Song and Groenwold proposed a paper
to analyze the expected predictor measurement diversity
impact. In period outcome data, simulation research was
variation across validation and implementation settings. The
application of quantitative prediction error analysis was
demonstrated with an illustration of forecasting the 6-year
probability of acquiring second type diabetes with variability
in the predictor body mass-index measurement. As a result of
this paper, all situations of predictor measurement variability
resulted in the poor measurement of prediction models, and
overall accuracy was lowered. Furthermore, it increased
random predictor measurement variability (Luijken, Song
and Groenwold, 2022).
     
representations from large-scale data, particularly unlabeled
data, which is plentiful in Big Data (Chen and Lin, 2014),
(Najafabadi, et al., 2015). In 2020, Zhong, Yu and Ai
proposed the big data-based hierarchical deep learning
system in the context of employing deep learning for data
analytics. This system uses behavioral and content features
       
in the payload. When several machines are deployed, the
        
boost the detection rate of intrusive attacks and reduce the
 
E. Ensemble of models (data analytics prediction
In many real-world applications, the availability of
        
and eliminate duplicate and unnecessary variables from the
feature-set, particularly in high-dimensional applications.
This circumstance naturally happens in many real-life
situations when a large amount of data can be acquired
      
of samples is time-intensive and cannot be assumed. Many
approaches were suggested to improve accuracy in machine
learning; one of these approaches is to aggregate the output
of several learners. Ensemble Learning is a term used to
describe this approach. Bagging, boosting, stacking, and
error-correcting output are the four methods for merging
several models (Wang, et al., 2014). The learning under
       
individual classes. It is employed when there is a reference
value and training set with the variables to cluster (Dean,
2014). Whereas in unsupervised learning, feature selection
seeks to locate meaningful subsets of features that yield best
groupings clustering by clustering “similar” items together
using any similarity metric (Nag and Mitra, 2002), (Dy and
Brodley, 2004), (Hong, et al., 2008), (Elghazel and Aussem,
2010). In 2011, Rajon, Shamim and Arif proposed a complete
framework that designed and implemented a generic product-
independent e-market model for emerging economies. This
paper’s fundamental contribution is creating and executing a
generic e-marketplace model for emerging economies where
agriculture is widely practiced and a thriving manufacturing
sector. A comprehensive examination of the utility and
     
services has also been presented (Rajon, Shamim and Arif,
2011). Rajon, Shamim and Arif proposed a method based on
the random sample partition model that retains the statistical
features of the data set in each data block in 2018. They
presented the Alpha framework, which consists of three
primary layers for data administration, batch management,
         
analysis with Random Sample Partition blocks. The results
show that the proposed method can provide approximate
results for data analysis tasks such as data summarization and
the Alpha framework for Big Data Analysis tasks (Salloum,
et al., 2018). In the same year, Yu, et al. (2018) design a
model to demonstrate how boosting and bagging approaches
can be compared to produce better explanatory models to
prove that the ensamble approaches are more suitable for
some problems than other approaches (Yu, et al., 2018).
On the other hand, Kumar, Singh and Buyya proposed a
new ensemble learning-based workload prediction model in
2020, which makes use of excessive learning machines and
weights their estimates with a voting engine. The optimized
weights are chosen using a metaheuristic algorithm
motivated by the black hole theory. The results demonstrate
the approach’s superiority over conventional methods, with
a reduction in RMSE of up to 99.20% (Kumar, Singh and
Buyya, 2020). Whereas in 2021 a new framework based on
features modeling and ensemble learning to predict query
performance was proposed by Zaghloul, Salem and Ali-
Eldin (2021) using Machine learning algorithm attempting to
predict a performance metric based on the amount of time
elapsed and ensemble learning (Zaghloul, Salem and Ali-
Eldin, 2021).
Data analytics methods and techniques have more and
more applications in life, and performance enhancement
solutions are widely applied. It is essential to improve the
        
by enhancing the result accuracy and processing time; this
will be done by analyzing the input data and extracting
ARO p-ISSN: 2410-9355, e-ISSN: 2307-549X
only the relevant information that the application needs.
Depending on this principle, many types of research papers
  
that can be used to enhance the analysis results. This paper
provides a survey of these research papers as summarized in
Table I. The research papers that proposed new methods in
the literature review section were presented in this table; to
     
was developed, the main research issues: Used to show
the problems that the research tries to solve, the research
techniques used to describe the method and tools that used to
   
to describe the most important results obtained or concluded
from the published research.
It can be seen from the summary of the research papers
presented in Table I that most researchers suggested methods
that used machine learning algorithms. Accordingly, this
paper discusses the proposed methods’ details and their
impact on the results, as shown in Table II. This table focuses
References Research scope Main research issues Main research techniques 
De Fortuny, Martens, and
Provost, 2013
Big data Critical prediction jobs Multivariate Bernoulli
Naive Bayes
Organizations with more data
selection provide a better
understanding of improving the
system performance
Pourhomayoun, et al., 2014 Healthcare systems Remote health monitoring Multi-model approach Improved the prediction accuracy
and performance
Corizzo, Ceci and Malerba,
Energy sector Accurate big data analytics and
predictive modeling
Multi-model approach Accurate results of predictions
with big data
Moharram, Altamimi, and
Alshammari, 2021
Healthcare systems Reduce the number of training
ML algorithms The best algorithm used was
Rocha, et al., 2021 Human development
Classify the departments of
peru according to their human
development index using
clustering techniques
Multi-model approach
Li, et al., 2019 Extrusion-based additive
Optimize the integrity of
fabricated components
ML algorithms Providing accurately predicted
Tran, et al., 2019 Education sector 
(computation vs. communication
latencies) (Learning time vs.
energy consumption)
Zou, et al., 2019 Vehicle evaluation Many iterations and training
vast amounts of data take a long
ML algorithms The training time is reduced, the
and the accuracy is maintained
Liu, et al., 2020 Multiobjective
High computational cost Multi-model approach The proposed model’s useful in
real-life applications
El and Ben, 2020 Stress intensity factor Prediction of stress intensity
ML algorithms Provide accurate predictions
Vega-Rodríguez and
Pérez, 2018
Text summarization Find the essential information
from a document collection
Optimization algorithms Provide essential improvements
compared to the traditional
Sanchez-Gomez, et al., 2019 Text summarization Find the essential information
from a document collection
Optimization algorithms Provide improved performance
and accurate results compared
to the previous version of this
proposed model
Yadav and Chatterjee, 2016 Text summarization 
summarize texts
Sentiment analysis 
summarizing approach based on
Aung and Myo, 2017 Education system Analysis of students’ comment Sentiment analysis Provide good results but not
optimal because some crucial
details might be lost
Saini, et al., 2019 Text summarization Design an automatic
ML algorithms Provide good improvements to
the traditional approaches
Buselli, et al., 2021  Automation and digitization to
maintain safety aviation
Multi-model approach The model provides pilots’
contribution is around 97%
accurate and 94% for ATCo
Admiraal, et al., 2021 Healthcare systems electroencephalography
reactivity quantitative analysis
of neurological prognostication
following cardiac arrest
ML algorithms Provide better precision than
visual data analysis in predicting
poor prognosis following cardiac
ARO p-ISSN: 2410-9355, e-ISSN: 2307-549X 51
the proposed methods. It has been formulated to include: the
ML algorithms used to design the proposed method and the
techniques used to describe how these algorithms are used to
build the proposed method. Furthermore, this table illustrates
 
         
are also presented in this table to describe the challenges
and problems in the proposed methods. As we have seen
in (Admiraal, et al., 2021) and (Zhong, Yu, and Ai, 2020),
there are limitations in the number of models used to test
the proposed method, so it may have some implementation
issues when used with a large amount of data. Furthermore,
(Kumar, Singh, and Buyya, 2020), there is another issue:
people must determine the number of networks and nodes
in hidden networks. This causes the system to need human
intervention at the data entry stage and is not entirely
automated. While in (Jain, 2017), the proposed model was
designed without any dynamic implementation, which causes
the possibility of facing several problems when applied to
real-time data.
Data analytics aims to provide meaningful and relevant
information. However, many users are uncertain about
the sort of analysis to perform on data collection and
which kinds of visual data presentation are appropriate.
This paper presented a comprehensive review of data
analytics techniques to help researchers construct an
        
in the intentional system. This leads to utilizing these
analytics tools in the best way to provide better privacy,
       
rather than relying only on standard analytics tools to solve
       
Table I, ML algorithms have shown the highest usability
in data analysis systems than other algorithms because they
provide good accuracy with higher performance capacity
in multiple areas of industrial, commercial, agricultural,
health, education, and mining activity. As a result, these
algorithms contribute to the development, increased
employment, the contribution of mining techniques
and increased business investment. Furthermore, Fig. 2
illustrates the turnout percentage, which shows that ML
algorithms are superior to other methods. Therefore, the
ML improvements were presented in Table II to show the
Finally, it is worth mentioning that many challenges can
be faced when adopting ML algorithms in data analytics
systems. For example, processing time, computation power…
etc. However, these challenges can be addressed using
References Research scope Main research issues Main research techniques 
Luijken, Song and
Groenwold, 2022
Healthcare systems 
measurement heterogeneity
Quantitative prediction
error analysis
Proves that the increasing
random predictor measurement
heterogeneity will decrease the
model discrimination
Rajon, Shamim and Arif,
Electronic commerce Design a general prototype for
the e-marketplace
General framework
designing tools
They are creating and developing
an e-market prototype that is
Salloum, et al., 2018 Big data Making the extensive data
analysis is feasible when the
volume of data exceeds the
available computation power
ML algorithms 
extensive data analysis
Kumar, Singh, and Buyya,
Cloud systems These systems must allocate and
deallocate resources with low
operational cost and maintain
the quality of services
ML algorithms Provide accurate predictions by
reducing the error prediction
Zaghloul, Salem, and
Ali-Eldin, 2021
Query optimizer Attempt to predict a
performance metric
ML algorithms 
performance metric
Jain, 2017 Fraud detection 
for fraud detection using a
ML algorithms 
cloud analytics
Talasila, et al., 2020 Healthcare systems Provide an accurate method for
diseases prediction
ML algorithms Provide prediction accuracy up
to 98.57%
Zhong, Yu, and Ai, 2020 Intrusion detection Intrusion detection depends on
data analytics
ML algorithms Boost the detection rate of
intrusive attacks and reduce the
ML: Machine learning
Fig. 2. The percentage of turnout for each method.
ARO p-ISSN: 2410-9355, e-ISSN: 2307-549X
parallel and distributed frameworks and choosing appropriate
algorithms to implement for each system.
algorithms used for designing each system. Furthermore, it
       
big data analytics methodologies, evaluating them in terms of
     
        
applications has been adjusted to construct how it can be used
to provide a high-quality performance. It should be noted that
this paper recognizes that the core function of machine learning
          
the behavior of previous data models. As a result, this paper
intends to provide simple research examining several proposed
gaps clearly in unknown information. From this examination,
we can conclude that when the ML algorithms and data
analytics self-tuning system feature selections have been used,
they will improve performance compared to other approaches
and techniques. Several paper investigations in many sectors
have demonstrated the possibility of using machine learning
algorithms in data analytics systems to improve performance
speed and accuracy.
Abdul Majeed, G., Kadhim, A. and Subhi Ali, R. (2017). Retrieving encrypted
query from encrypted database depending on symmetric encrypted cipher system
method. Diyala Journal For Pure Science, 13(1), pp.183-207.
References The used ML
The design techniques  Research limitation
Time Accuracy Processing
Moharram, Altamimi
and Alshammari, 2021
logistic regression,
The optimum machine-learning method was discovered
by comparing three ML algorithms by computing recall
and prescience for each
* * * -
Li, et al., 2019 RF, AdaBoost, CART,
Data was collected depending on multiple sensors,
extracted and selected features from these data, then
applied selective algorithms to these features
* -
Zou, et al., 2019 Logistic regression Using the logistic regression algorithm based on the
gradient descent method and optimizing the sigmoid
* * * -
Seghier, et al., 2020 Genetic and particle
swarm algorithms
Hybrid AI model (GA and PSO) for accurate prediction
method calculations
* * * -
Saini, et al., 2019 Genetic operation
inspired form genetic
self-organizing map (based on genetic operations)
to select a subset of sentences, then evaluate these
sentences using the objective functions
* * -
Admiraal, et al., 2021 LR, SVM, RF, NN, and
GTB algorithms
quantitative variables derived from data of 134 adult
cardiac arrest patients
* A limited number of
samples for training
Salloum, et al., 2018 Hadoop MapReduce
and data mining and
machine learning
The data is assigned to the blocks sampling algorithm,
followed by blocks analysis and ensemble estimates
techniques. Finally, an ensemble model evaluation was
* * * -
Kumar, Singh, and
Buyya, 2020
Extreme learning
They were using extreme learning machines and a
voting engine to optimize the weights of the prediction
hole-inspired metaheuristic algorithm
* * Heuristics are used to
number of networks
and the number of
hidden nodes in each
Zaghloul, Salem, and
Ali-Eldin, 2021
XGBoost algorithm They used real-world datasets to predict query
performance using feature modelling and ensemble
learning based on the duration of time elapsed
* * * -
Jain, 2017 Neural network The event hub and Stream analytics of microsoft azure
with the neural network and self-coded algorithms were
employed to detect fraud telecommunication
* * It cannot be applied
to real-time dynamic
Talasila, et al., 2020 Recurrent Neural
rough set and then fed into a recurrent neural network to
predicate the system results
* * -
Zhong, Yu, and Ai,
Deep learning The hierarchical deep learning system used the
behavioural and content features to interpret network
* * * 
applied to a small
number of samples
ML: Machine learning
ARO p-ISSN: 2410-9355, e-ISSN: 2307-549X 53
Abdul-jabbar, S.S. and George, L.E. (2017). Fast text analysis using symbol
enumeration and Hashing methodology. Fast Strings Search Process, 58(1),
Abkenar, S.B. Kashani, M.H., Mahdipour, E. and Jameii, S.M. (2020). Big data
analytics meets social media: A systematic review of techniques, open issues,
and future directions. Telematics and Informatics, 57, 101517.
Admiraal, M.M., Ramos, L.A., Delgado Olabarriaga, S., Marquering, H.A.,
Horn, J. and van Rootselaar, A.F. (2021). Quantitative analysis of EEG reactivity
for neurological prognostication after cardiac arrest. Clinical Neurophysiology,
132(9), pp.2240-2247.
Apuke, O.D. (2017). Quantitative research methods : A synopsis approach.
Kuwait Chapter of Arabian Journal of Business and Management Review,
6(11), pp.40-47.
Aung, K.Z. Myo, N.N. (2017). Sentiment Analysis of Students’Comment Using
Lexicon Based Approach. IEEE/ACIS 16th International Conference on Computer
and Information Science, pp.149-154.
Ben Seghier, M., Carvalho, H., Keshtegar, B. and Correia, J.A.F. (2020). Novel
hybridized adaptive neuro-fuzzy inference system models based particle swarm
optimization and genetic algorithms for accurate prediction of stress intensity
factor. FFEMS, 43(11), pp.2653-2667.
Buselli, I., Oneto, L., Dambra, C., Gallego, C.V., Martínez, M.G., Smoker, A. and
Martino, P.R. (2021). Natural Language Processing and Data-Driven Methods for
Aviation Safety and Resilience : From Extant Knowledge to Potential Precursors.
Open Research Europe.
Butcher, B. and Smith, B.J. (2020). Feature engineering and selection: A practical
approach for predictive models. The American Statistician, 74(3), pp.308-309.
Castellanos, C., Pérez, B., Varela, C.A. and Correal, D. (2020). A Model-Driven
Architectural Design Method for Big Data Analytics Applications. Proceedings
2020 IEEE International Conference on Software Architecture Companion,
ICSA-C 2020, pp.89-94.
Cearley, D.W., Natis, Y., Walker, M. and Burke, B. (2018). Top 10 Strategic
Technology Trends for 2018. Gartner, Stamford. Available fromt: https://www.
strategic-technology-trends-for-2018.pdf [Last accessed on 2017 Oct 3].
Chen, M., Mao, S. and Liu, Y. (2014). Big data: A survey. Mobile Networks and
Applications, 19(2), pp.171-209.
Chen, X.W. and Lin, X. (2014). Big data deep learning: Challenges and
perspectives. IEEE Access, 2, pp.514-525.
Choi, B.C., Cho, S. and Kim, C.W. (2018). Kriging Model Based Optimization
of MacPherson Strut Suspension for Minimizing Side Load using Flexible
Multi-Body Dynamics. International Journal of Precision Engineering and
Manufacturing, 19(6), pp. 873-879.
Corizzo, R., Ceci, M. and Malerba, D. (2019). Big Data Analytics and Predictive
Modeling Approaches for the Energy Sector. 2019 IEEE International Congress
on Big Data (BigDataCongress), pp.55-63.
De Fortuny, E.J., Martens, D. and Provost, F. (2013). Predictive modeling with
big data: Is bigger really better? Big Data, 1(4), pp.215-226.
Dean, J. (2014). Big Data, Data Mining, and Machine Learning: Value Creation
for Business Leaders and Practitioners. John Wiley and Sons, Hoboken. Available
from: Mining/Big Data, Data Mining, and
Machine Learning_Value Creation for Business Leaders and Practitioners
%5BDean 2014-05-27%5D.pdf. [Last accessed on 2022 Apr 01].
Do Nascimento, I.J.B., Marcolino, MS., Abdulazeem, H.M, Weerasekara, I.,
Azzopardi-Muscat, N., Gonçalves, M.A. and Novillo-Ortiz D. (2021). Impact
of big data analytics on people’s health: Overview of systematic reviews and
recommendations for future studies. Journal of Medical Internet Research,
23(4), p.e27275.
Dy, J.G. and Brodley, C.E. (2004). Feature selection for unsupervised learning.
Journal of Machine Learning Research, 5, pp. 848-889.
Elghazel, H. and Aussem, A. (2010). Feature selection for unsupervised learning
using random cluster ensembles. Proceedings IEEE International Conference on
Data Mining ICDM, pp.168-175. Doi: 10.1109/ICDM.2010.137.
Faizan, M. Zuhairi, M.F., Ismail, S. and Sultan, S. (2020). Applications of
Clustering Techniques in Data Mining: A Comparative Study. International
Journal of Advanced Computer Science and Applications, 11(12), pp.146-153.
Fan, C., Xiao, F., Li, Z. and Wang, J. (2018). Unsupervised data analytics
         
A review. Energy and Buildings, 159, pp.296-308.
Farhan, K.A. and Ali, M.A. (2017). Database Protection System Depend on
2nd International Conference of Cihan University-Erbil
on Communication Engineering and Computer Science, p.2520-4777.
Ghavami P. (2020). Big Data Analytics Methods. 2nd ed. De Gruyter, Berlin.
Hamarashid, H.K., Saeed, S.A. and Rashid, T.A. (2021). Next word prediction
based on the N-gram model for Kurdish Sorani and Kurmanji. Neural Computing
and Applications, 33(9), pp. 4547-4566.
         
optimization with robustness analysis. Soft Computing A Fusion of Foundations
Methodologies and Applications, 22(19), pp.6371-6394.
Hendrycks, D. Carlini, N., Schulman, J., Steinhardt, J. (2021) Unsolved Problems
in ML Safety. ArXiv, Cornell Tech, pp.1-28. Available from:
Hong, Z., Smart, G., Dawood, M., Kaita, K., Wen, S.W., Gomes, J. and Wu, J.
(2008). Hepatitis C Infection and Survivals of Liver Transplant Patients in
Canada, 1997-2003. Transplantation Proceedings, 40(5), pp.1466-1470.
Hryshchenko, O. and Yaremenko, V. (2021). A comparative analysis of text data
Bayes. Technology Audit and Production Reserves, 5(2(61), pp.6-8.
Jain, V. (2017). Perspective analysis of telecommunication fraud detection
International Journal of Information Technology, 9(3), p.1-8.
Jaouadi, Z., Abbas, T., Morgenthal, G. and Lahemer, T. (2020). Single and
multi-objective shape optimization of streamlined bridge decks. Structural and
Multidisciplinary Optimization, 61(4), pp.1495-1514.
Kan, M.Y. and Klavans, J.L., (2002). Using Librarian Techniques in Automatic
Text Summarization for Information Retrieval. Proceedings of the ACM
International Conference on Digital Libraries, Wuhan, pp.36-45.
Kashyap, R., (2019). Big data analytics challenges and solutions. In: Big Data
Analytics for Intelligent Healthcare Management, Academic Press, Cambridge,
Khoshbakht, F., Shiranzaei, A. and Quadri, S.M.K., (2021). Role of the big data
Turkish Journal of Computer and Mathematics Education, 12(10), pp.560-566.
Kumar, D.U., Soon, T.K., Saad, M., Idna, I.M.Y., Mehdi, S. and Bend, H. (2018).
Forecasting of photovoltaic power generation and model optimization : A review.
Renewable and Sustainable Energy Reviews, 81, pp.912-928.
Kumar, J., Singh, A.K. and Buyya, R. (2020). Ensemble learning based predictive
framework for virtual machine resource request prediction. Neurocomputing,
397, p.20-30.
Li, Z., Zhang, Z., Shi, J. and Wu, D., (2019). Prediction of surface roughness in
extrusion-based additive manufacturing with machine learning. Robotics and
Computer Integrated Manufacturing, 57, pp.488-495.
           
optimization method based on the adaptive approximation model of the radial
basis function. Structural and Multidisciplinary Optimization, 63(4), p.1-19.
Luijken, K., Groenwold, R.H.H., Van Calster, B.E.W., Steyerberg, E.W. and
Van Smeden, M. (2019). Impact of predictor measurement heterogeneity
ARO p-ISSN: 2410-9355, e-ISSN: 2307-549X
across settings on the performance of prediction models: A measurement error
perspective. Statistics in Medicine, 38(18), pp.3444-3459.
Luijken, K., Song, J. and Groenwold, R.H.H., (2022). Quantitative prediction
error analysis to investigate predictive performance under predictor measurement
heterogeneity at model implementation. Diagnostic and Prognostic Research,
1, pp.1-11.
Luijken, K., Wynants, L., van Smeden, M., Van Calster, B., Steyerberg, E.W.
and Groenwold, R.H.H., (2020). Changing predictor measurement procedures
Journal of
Clinical Epidemiology, 119, pp.7-18.
Mariani, M. and Baggio, R. (2022). Big data and analytics in hospitality and
tourism: A systematic literature review. International Journal of Contemporary
Hospitality Management, 34(1), pp.231-278.
McNamara, D.S., Allen, L.K., Crossley, S.A., Dascalu,M. and Perret, C.A.,
(2017). Natural language processing and learning analytics. In: Handbook of
Learning Analytics. Ch. 8. Society for Learning Analytics Research, Alberta,
Mishra, R. and Sharma, R. (2014). Big data opportunities and challenges:
Discussions from data analytics perspectives. International Journal of Computer
Science and Mobile Computing, 46(6), pp.27-35.
Moharram, A., Altamimi, S. and Alshammari, R., (2021). Data Analytics and
Predictive Modeling for Appointments No-show at a Tertiary Care Hospital.
2021 1st,
CAIDA 2021, pp.275-277.
Nag, A.K. and Mitra, A., (2002). Forecasting daily foreign exchange rates
using genetically optimized neural networks. Journal of Forecasting, 21(7),
Najafabadi, M.M., Villanustre, F., Khoshgoftaar, T.M., Seliya, N., Wald, R. and
Muharemagic, E., (2015). Deep learning applications and challenges in big data
analytics. Journal of Big Data, 2(1), pp.1-21.
Pajouheshnia, R., van Smeden, M., Peelen, L.M., Groenwold, R.H.H., (2019).
 
transportability of a prediction model. Journal of Clinical Epidemiology, 105,
Patel, A., Singh, N.M. and Kazi, F. (2017). Internet of Things and Big Data
Technologies for Next Generation Healthcare. Springer Cham, Berlin.
Pourhomayoun, M., Alshurafa, N., Mortazavi, B., Ghasemzadeh, H., Sideris, K.,
Sadeghi, B., Ong, M., Evangelista, L., Romano, P., Auerbach, A., Kimchi, A. and
Sarrafzadeh, M. (2014). Multiple model analytics for adverse event prediction
in remote health monitoring systems. In: 2014 IEEE Healthcare Innovation
Conference, HIC 2014, pp.106-110.
Rajaraman, V., (2016). Big data analytics. Resonance, 21(8), pp.695-716.
Rajon, S.A.A., Shamim, A. and Arif, M., (2011). A Generic Framework for
Implementing Electronic Commerce in Developing Countries. International
Journal of Computer and Information Technology, 1(2).
Rajpurkar, P., Irvin, J., Zhu, K., Yang, B., Mehta, H., Duan, T., Ding, D.,
Bagul, A., Langlotz, C., Shpanskaya, K., Lungren, M.P. and Na, A.Y., (2017).
CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep
Learning. ArXiv.
Rajput, A., (2019). Natural language processing, sentiment analysis, and clinical
analytics. In: Innovation in Health Informatics: A Smart Healthcare Primer,
Academic Press, USA, pp.79-97.
Rashid, T.A., Mustafa, A.M. and Saeed, A.M. (2018). Automatic kurdish text
and Web Technologies, Barolli, L., Zhang, M. and Wang, Z., editors. Lecture
Notes on Data Engineering and Communications Technologies. Vol. 6, Springer,
Cham, Berlin, pp.187-198.
Rocha, J.L.M., Zela, M.A.C., Torres, N.I.V. and Medina, G.S. (2021). Analogy
of the application of clustering and K-means techniques for the approximation
of values of human development indicators. International Journal of Advanced
Computer Science and Applications, 12(9), pp.526-532.
Russell, S. and Norvig, P. (2020).   .
4th ed. Prentice Hall, Hoboken.
Saini, N., Saha, S., Chakraborty, D. and Bhattacharyya, B., (2019). Extractive
   PLoS One, 14(11), p.e0223477. [Last
accessed on 2022 Apr 01].
Salloum, S., Huang, J.Z., He, Y. and Chen, X. (2018). An asymptotic ensemble
learning framework for big data analysis. IEEE Access, 7(c), pp.3675-3693.
Sanchez-Gomez, J.M., Vega-Rodriguez, M.A. and C., Perez, C.J. (2019). An
Indicator-based Multi-objective optimizationapproach applied to extractive
multi-documenttext summarization. IEEE Latin America Transactions, 17(8),
Sanchez-Gomez, J.M., Vega-Rodríguez, M.A. and Pérez, C.J. (2018). Extractive
optimization approach. Knowledge-Based Systems, 159, pp.1-8.
Schwarz, C., Schwarz, A. and Black, W.C., (2014). Tutorial: Big data analytics:
Concepts, technologies, and applications. Communications of the Association
for Information Systems, 34(1), pp.1191-1208.
Shamsaldin, A., Rashid, T.A., Fattah, P. and Al-Salihi, N.K., (2019). A study of
the convolutional neural networks applications. UKH Journal of Science and
Engineering, 3(2), pp.31-40.
Shouval, R., Bondi, O., Mishan, H., Shimoni, A., Unger, R. and Nagler, A.,
(2014). Application of machine learning algorithms for clinical predictive
modeling: A data-mining approach in SCT. Bone Marrow Transplantation,
49(3), pp.332-337.
Smith, M., Szongott, C., Henne, B. and von Voigt, G., (2013). Big Data Privacy
Issues in Public Social Media. IEEE International Conference on Digital
Ecosystems and Technologies [Preprint].
Talasila, V., Madhubabu, K., Mahadasyam, M.C., Atchala, N.J. and Kande, L.S.,
(2020). The prediction of diseases using rough set theory with recurrent neural
network in big data analytics. International Journal of Intelligent Engineering
and Systems, 13(5), pp.10-18.
Tran, N.H., Bao, W., Zomaya, A., Nguyen Minh, N.H. and Hong, C.S., (2019).
Federated Learning over Wireless Networks: Optimization Model Design and
Analysis. Proceedings-IEEE INFOCOM, 2019-April(1), pp.1387-1395.
Vargas-Calderón, V., Ochoa, A.M., Nieto, G.Y.C. and Camargo, J.E., (2021).
Machine learning for assessing quality of service in the hospitality sector based
on customer reviews. Information Technology and Tourism, 23(3), pp.351-379.
Verma, J.P. and Agrawal, S., Patel, P. and Patel, A., (2016). Big data analytics:
challenges and applications for text, audio, video, and social media data.
5(1), pp.41-51.
Vu, T., Belussi, A., Migliorini, S. and Eldway, A., (2021). Using deep learning
for big spatial data partitioning. ACM Transactions on Spatial Algorithms and
Systems (TSAS), 7(1), p.1-37.
Wang, J., Tang, Y., Nguyen, M. and Altintas, I., (2014). A Scalable Data Science
Proceedings of
the 2014 International Symposium on Big Data Computing, BDC 2014, pp.16-25.
Wang, X.D., Hirsch, C., Kang, S. and Lacor, C., (2011). Multi-objective
optimization of turbomachinery using improved NSGA-II and approximation
model. Computer Methods in Applied Mechanics and Engineering, 200(9-12),
Yadav, N. and Chatterjee, N., (2016). Text Summarization using Sentiment Analysis
for DUC Data. International Conference on Information Technology, pp.5.
Yaremenko, V.S., Rogoza, W.S. and Spitkovskyi, V.I., (2021). Application of
    Journal of
ARO p-ISSN: 2410-9355, e-ISSN: 2307-549X 55
Theoretical and Applied Information Technology, 99(1), pp.125-134.
Yu, C.H. Lee, H.S., Lara, E. and Gan, S., (2018). The ensemble and model
comparison approaches for big data analytics in social sciences. Practical
Assessment Research and Evaluation, 23(17).
Zaghloul, M., Salem, M. and Ali-Eldin, A., (2021). A new framework based on
features modeling and ensemble learning to predict query performance. PLoS
One, 16(10), pp.1-18.
Zarchi, M. and Attaran, B., (2019). Improved design of an active landing gear
for a passenger aircraft using multi-objective optimization technique. Structural
and Multidisciplinary Optimization, 59(5), pp.1813-1833.
Zheng, L. and Guo, L., (2020). Application of big data technology in insurance
innovation. Journal of Physics Conference Series, 1682(1), pp.285-294.
Zhong, W., Yu, N. and Ai, C., (2020). Applying big data based deep
learning system to intrusion detection. Big Data Mining and Analytices,
3(3), pp.181-195.
Zou, X., et al. (2019). Logistic Regression Model Optimization and Case
Analysis, Proceedings of IEEE 7th International Conference on Computer Science
and Network Technology, ICCSNT 2019, pp.135-139.
... to predict of physical phenomena and optimization [41e43]. According to the agreement of the predicted data with the actual data in a set of reports, the artificial neural networks have excellent and strong performance to work as alternative models [44,45]. The neural network used in this work is a nonlinear regression model that consists of several layers: input layer, output layer and several hidden layers. ...
The whole world is affected by climate change and renewable energy plays an important role in combating climate change. To add to the existing precarious situation, the current political events such as the war in Ukraine mean that fossil raw materials such as oil and gas are becoming more and more expensive in the raw material markets. This paper presents the current state of renewable energies in Germany and Europe. Using data from the past 56 years, the predictive models ARIMA and Prophet are used to find out if the conversion to renewable energies and the elimination of fossil raw materials in the energy sector can be achieved in the EU. The results are compared with the target of the EU in 2030 and a long-term outlook until 2050 will be provided.
Full-text available
Background When a predictor variable is measured in similar ways at the derivation and validation setting of a prognostic prediction model, yet both differ from the intended use of the model in practice (i.e., “predictor measurement heterogeneity”), performance of the model at implementation needs to be inferred. This study proposed an analysis to quantify the impact of anticipated predictor measurement heterogeneity. Methods A simulation study was conducted to assess the impact of predictor measurement heterogeneity across validation and implementation setting in time-to-event outcome data. The use of the quantitative prediction error analysis was illustrated using an example of predicting the 6-year risk of developing type 2 diabetes with heterogeneity in measurement of the predictor body mass index. Results In the simulation study, calibration-in-the-large of prediction models was poor and overall accuracy was reduced in all scenarios of predictor measurement heterogeneity. Model discrimination decreased with increasing random predictor measurement heterogeneity. Conclusions Heterogeneity of predictor measurements across settings of validation and implementation reduced predictive performance at implementation of prognostic models with a time-to-event outcome. When validating a prognostic model, the targeted clinical setting needs to be considered and analyses can be conducted to quantify the impact of anticipated predictor measurement heterogeneity on model performance at implementation.
Full-text available
The object of research is the methods of fast classification for solving text data classification problems. The need for this study is due to the rapid growth of textual data, both in digital and printed forms. Thus, there is a need to process such data using software, since human resources are not able to process such an amount of data in full. A large number of data classification approaches have been developed. The conducted research is based on the application of the following methods of classification of text data: Bloom filter, naive Bayesian classifier and neural networks to a set of text data in order to classify them into categories. Each method has both disadvantages and advantages. This paper will reflect the strengths and weaknesses of each method on a specific example. These algorithms were comparatively among themselves in terms of speed and efficiency, that is, the accuracy of determining the belonging of a text to a certain class of classification. The work of each method was considered on the same data sets with a change in the amount of training and test data, as well as with a change in the number of classification groups. The dataset used contains the following classes: world, business, sports, and science and technology. In real conditions of the classification of such data, the number of categories is much larger than that considered in the work, and may have subcategories in its composition. In the course of this study, each method was analyzed using different parameter values to obtain the best result. Analyzing the results obtained, the best results for the classification of text data were obtained using a neural network.
Full-text available
A query optimizer attempts to predict a performance metric based on the amount of time elapsed. Theoretically, this would necessitate the creation of a significant overhead on the core engine to provide the necessary query optimizing statistics. Machine learning is increasingly being used to improve query performance by incorporating regression models. To predict the response time for a query, most query performance approaches rely on DBMS optimizing statistics and the cost estimation of each operator in the query execution plan, which also focuses on resource utilization (CPU, I/O). Modeling query features is thus a critical step in developing a robust query performance prediction model. In this paper, we propose a new framework based on query feature modeling and ensemble learning to predict query performance and use this framework as a query performance predictor simulator to optimize the query features that influence query performance. In query feature modeling, we propose five dimensions used to model query features. The query features dimensions are syntax, hardware, software, data architecture, and historical performance logs. These features will be based on developing training datasets for the performance prediction model that employs the ensemble learning model. As a result, ensemble learning leverages the query performance prediction problem to deal with missing values. Handling overfitting via regularization. The section on experimental work will go over how to use the proposed framework in experimental work. The training dataset in this paper is made up of performance data logs from various real-world environments. The outcomes were compared to show the difference between the actual and expected performance of the proposed prediction model. Empirical work shows the effectiveness of the proposed approach compared to related work.
Full-text available
The objective of this study was to apply Clustering and K-Means' techniques to classify the departments of Peru according to their Human Development Index. In this article, the elbow method was used to determine the optimal number of clusters, applying the classification algorithms to group the departments of Peru according to their similarities, in addition to the Principal Component Analysis (PCA) technique for a better display of clusters. After applying the unsupervised algorithms, the results were more relevant in clusters 2 and 4 according to their HDI, made up of the departments of Arequipa, the Constitutional Province of Callao, Ica, Lima, Moquegua and Tacna, where the most notable is the life expectancy at birth, the population with full secondary education, the number of years of education, the average per capita income, and the state's density index. The results obtained by the K-Means algorithm show more cohesive results than the Clustering algorithm.
Full-text available
The increasing use of online hospitality platforms provides firsthand information about clients preferences, which are essential to improve hotel services and increase the quality of service perception. Customer reviews can be used to automatically extract the most relevant aspects of the quality of service for hospitality clientele. This paper proposes a framework for the assessment of the quality of service in the hospitality sector based on the exploitation of customer reviews through natural language processing and machine learning methods. The proposed framework automatically discovers the quality of service aspects relevant to hotel customers. Hotel reviews from Bogotá and Madrid are automatically scrapped from Semantic information is inferred through Latent Dirichlet Allocation and FastText, which allow representing text reviews as vectors. A dimensionality reduction technique is applied to visualise and interpret large amounts of customer reviews. Visualisations of the most important quality of service aspects are generated, allowing to qualitatively and quantitatively assess the quality of service. Results show that it is possible to automatically extract the main quality of service aspects perceived by customers from large customer review datasets. These findings could be used by hospitality managers to understand clients better and to improve the quality of service.
Full-text available
Background Although the potential of big data analytics for health care is well recognized, evidence is lacking on its effects on public health. Objective The aim of this study was to assess the impact of the use of big data analytics on people’s health based on the health indicators and core priorities in the World Health Organization (WHO) General Programme of Work 2019/2023 and the European Programme of Work (EPW), approved and adopted by its Member States, in addition to SARS-CoV-2–related studies. Furthermore, we sought to identify the most relevant challenges and opportunities of these tools with respect to people’s health. Methods Six databases (MEDLINE, Embase, Cochrane Database of Systematic Reviews via Cochrane Library, Web of Science, Scopus, and Epistemonikos) were searched from the inception date to September 21, 2020. Systematic reviews assessing the effects of big data analytics on health indicators were included. Two authors independently performed screening, selection, data extraction, and quality assessment using the AMSTAR-2 (A Measurement Tool to Assess Systematic Reviews 2) checklist. Results The literature search initially yielded 185 records, 35 of which met the inclusion criteria, involving more than 5,000,000 patients. Most of the included studies used patient data collected from electronic health records, hospital information systems, private patient databases, and imaging datasets, and involved the use of big data analytics for noncommunicable diseases. “Probability of dying from any of cardiovascular, cancer, diabetes or chronic renal disease” and “suicide mortality rate” were the most commonly assessed health indicators and core priorities within the WHO General Programme of Work 2019/2023 and the EPW 2020/2025. Big data analytics have shown moderate to high accuracy for the diagnosis and prediction of complications of diabetes mellitus as well as for the diagnosis and classification of mental disorders; prediction of suicide attempts and behaviors; and the diagnosis, treatment, and prediction of important clinical outcomes of several chronic diseases. Confidence in the results was rated as “critically low” for 25 reviews, as “low” for 7 reviews, and as “moderate” for 3 reviews. The most frequently identified challenges were establishment of a well-designed and structured data source, and a secure, transparent, and standardized database for patient data. Conclusions Although the overall quality of included studies was limited, big data analytics has shown moderate to high accuracy for the diagnosis of certain diseases, improvement in managing chronic diseases, and support for prompt and real-time analyses of large sets of varied input data to diagnose and predict disease outcomes. Trial Registration International Prospective Register of Systematic Reviews (PROSPERO) CRD42020214048;
Purpose The purpose of this work is to survey the body of research revolving around big data (BD) and analytics in hospitality and tourism, by detecting macro topical areas, research streams and gaps and to develop an agenda for future research. Design/methodology/approach This research is based on a systematic literature review of academic papers indexed in the Scopus and Web of Science databases published up to 31 December 2020. The outputs were analyzed using bibliometric techniques, network analysis and topic modeling. Findings The number of scientific outputs in research with hospitality and tourism settings has been expanding over the period 2015–2020, with a substantial stability of the areas examined. The vast majority are published in academic journals where the main reference area is neither hospitality nor tourism. The body of research is rather fragmented and studies on relevant aspects, such as BD analytics capabilities, are virtually missing. Most of the outputs are empirical. Moreover, many of the articles collected relatively small quantities of records and, regardless of the time period considered, only a handful of articles mix a number of different techniques. Originality/value This work sheds new light on the emergence of a body of research at the intersection of hospitality and tourism management and data science. It enriches and complements extant literature reviews on BD and analytics, combining these two interconnected topics.
Objective To test whether 1) quantitative analysis of EEG reactivity (EEG-R) using machine learning (ML) is superior to visual analysis, and 2) combining quantitative analyses of EEG-R and EEG background pattern increases prognostic value for prediction of poor outcome after cardiac arrest (CA). Methods Several types of ML models were trained with twelve quantitative features derived from EEG-R and EEG background data of 134 adult CA patients. Poor outcome was a Cerebral Performance Category score of 3-5 within 6 months. Results The Random Forest (RF) trained on EEG-R showed the highest AUC of 83% (95-CI 80-86) of tested ML classifiers, predicting poor outcome with 46% sensitivity (95%-CI 40-51) and 89% specificity (95%-CI 86-92). Visual analysis of EEG-R had 80% sensitivity and 65% specificity. The RF was also the best classifier for EEG background (AUC 85%, 95%-CI 83-88) at 24h after CA, with 62% sensitivity (95%-CI 57-67) and 84% specificity (95%-CI 79-88). Combining EEG-R and EEG background RF classifiers reduced the number of false positives. Conclusions Quantitative EEG-R using ML predicts poor outcome with higher specificity, but lower sensitivity compared to visual analysis of EEG-R, and is of some additional value to ML on EEG background data. Significance Quantitative EEG-R using ML is a promising alternative to visual analysis and of some added value to ML on EEG background data.