Procedia Computer Science 94 (2016) 491–496
doi: 10.1016/j.procs.2016.08.076
International Workshop on Applications of Software-Defined Networking in Cloud Computing
(SDNCC 2016)
A system architecture for real-time anomaly detection in large-scale
NFV systems
Anton Gulenko a,*, Marcel Wallschläger a, Florian Schmidt a, Odej Kao a, Feng Liu b
a Technische Universität Berlin (TU Berlin), Complex and Distributed IT Systems (CIT), 10587 Berlin, Germany
b Huawei European Research Center, Huawei Technologies Co., Ltd., 80992 Munich, Germany
Abstract
Virtualization as a key IT technology has developed to a predominant model in data centers in recent years. The flexibility
regarding scaling-out and migration of virtual machines for seamless maintenance has enabled a new level of continuous
operation and changed service provisioning significantly. Meanwhile, services from domains striving for highest possible
availability, e.g. from the telecommunications domain, are adopting this approach as well and are investing significant efforts
into the development of Network Function Virtualization (NFV). However, the availability requirements for such infrastructures
are much higher than typical for IT services built upon standard software with off-the-shelf hardware. They require sophisticated
methods and mechanisms for fast detection and recovery of failures. This paper presents a set of methods and an implemented
prototype for anomaly detection in cloud-based infrastructures with specific focus on the deployment of virtualized network
functions. The framework is built upon OpenStack, which is the current de-facto standard of open-source cloud software and
aims at increasing the availability and fault tolerance level by providing an extensive monitoring and analysis pipeline able to
detect failures or degraded performance in real-time. The indicators for anomalies are created using supervised and non-
supervised classification methods and preliminary experimental measurements showed a high percentage of correctly identified
anomaly situations. After a successful failure detection, a set of pre-defined countermeasures is activated in order to mask or
repair outages or situations with degraded performance.
© 2016 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license
(http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the Conference Program Chairs.
Keywords: Anomaly detection; Cloud; OpenStack; Fault tolerance; NFV
* Anton Gulenko. Tel.: +49 (30) 314-25286; fax: +49 (30) 314-21114.
E-mail address: anton.gulenko@tu-berlin.de
1. Introduction
Virtualization as a key IT technology has reached a high level of maturity in recent years, making the
implementation of IT services on virtualized platforms the predominant model in data centers. The flexibility
regarding scaling-out and migration of virtual machines for seamless maintenance has enabled a new level of
continuous operation and changed service provisioning significantly. Meanwhile, not only standard IT services are
deployed on virtualized infrastructures; services from domains striving for the highest possible availability, such as
the telecommunications sector, are adopting this approach as well and are investing significant effort into adapting
standard products such as OpenStack1 to the requirements of modern Network Function Virtualization (NFV)
scenarios.
These efforts elevate the virtualization infrastructure to the next level of complexity. Critical services, e.g. in
telecommunication scenarios, require an availability of 99.9999% as a standard, which is expected from virtualized
solutions as well. The availability requirements for the infrastructures are even increasing, as telecommunication
providers widely deploy NFV setups on platforms such as OpenStack and thus run essential parts of their critical
infrastructure on off-the-shelf components. Such platforms lack the availability features engineered over decades by
the telecommunication community and are still far away from the desired degree of availability. This
paper describes an approach and an implemented framework to enable reliable deployment and operation on
unreliable components. The basic approach is well-known from other domains, e.g. implementing secure channels
over insecure public networks or building redundant storage arrays from standard hard disks. Analogously, we aim
at developing and deploying methods for extensive and in-depth monitoring of the vital system data for failure
detection and activation of fault tolerance mechanisms utilizing redundant components.
The monitoring system collects and analyses data on the fly from several hardware components and the core
operating and virtualization processes in order to detect anomalies that can lead to overall performance degradation,
violate the promised quality of service, and eventually cause crashes of parts of the system. Detecting and
handling anomalies is a non-trivial issue which needs to span over a multitude of layers and components composing
an NFV system. Due to their large scale, such systems must be designed to automatically detect and handle
anomalies.
The developed system framework facilitates real-time anomaly detection in large-scale NFV deployments.
Anomalies are detected by performing deep cross-layer data collection and mining the collected data through a
variety of data analysis techniques. The results are used to select and execute appropriate restoration routines. The
data collection implements a continuous feedback loop notifying the restoration engine about the success of the
executed routines. The developed methods and tools are integrated into a productive OpenStack installation, but can
be generalized to other cloud and resource management systems.
The remainder of the paper is organized as follows. The next section lists related work on fault tolerance for
virtualized environments, in particular related to continuous operation of NFV-based systems. Section 3 describes
the global architecture and the functionality of the deployed components. Subsequently, in Section 4 we describe the
system prototype and give preliminary performance evaluation results. Finally, Section 5 concludes with an outline
of future extensions to the presented system.
2. Related work
The importance of anomaly detection in NFV-based systems running on a cloud management system is reflected
in a wide range of research publications. Gaikwad et al., for example, developed an integrated approach for finding
anomalies in the execution of scientific workflows and applications on cloud-based infrastructures. The authors used
an auto-regression approach based on statistical methods applied to online monitoring data in order to find anomalies
in the collected data2. F²PM is a machine learning-based framework that predicts the Remaining Time to Failure
(RTTF) of software services3. The F²PM system collects time series of system-level features such as CPU, memory,
and swap usage, using Lasso regularization as a feature selection method. The authors compared failure prediction
approaches on all parameters using an e-commerce application use case.
Alonso et al. also apply Lasso regularization in an e-commerce environment4 and compare machine learning
classifiers for detecting dynamic and non-deterministic software anomalies. The applied method monitors the
system and reduces the recorded features by around 60% through Lasso regularization, using Random Forest for
classification. The achieved validation errors are less than 1%. In addition, alternative methods based on Decision
Trees, LDA/QDA, Naive Bayes, Support Vector Machines, and K-nearest neighbors were compared as well.
Since cloud infrastructures and NFV services are distributed and produce a large amount of logs, such logs can
be used for root-cause analysis of error situations in the running services5. The tool LOGAN inspects the divergence
of current logs from a reference model and highlights logs likely to contain hints to the root cause. The paper
presents the designed reference model for problem diagnosis, which serves as a basis for detecting crucial log
messages for the running system. Furthermore, LOGAN is able to analyze a large volume of logs and helps operators
save time during problem diagnosis by automatically highlighting high-value log entries.
3. Architecture for real-time anomaly detection
The suggested architecture is an extension of a typical cloud infrastructure utilized by telecommunication
providers for NFV deployments. Its three pillars, cloud management, cloud resources, and the anomaly detection
pipeline, are described in the following.
A continuous operation of telecommunication services can only be guaranteed if the underlying cloud platform
itself is highly available. We assume OpenStack as the underlying infrastructure, as it is currently viewed as the de-
facto standard for open-source cloud software and is deployed in many data centers, which allowed us to monitor and
process real-world systems and information. OpenStack consists of a multitude of components that work together in
order to deliver the desired Cloud management services. A full description of all services and their interfaces would
go beyond the scope of this paper and the general architecture of OpenStack is well known1. To achieve high
availability, every OpenStack controller service must run in a replicated fashion. At the top of the stack we place a
pair of load balancers. A virtual IP address serves as an entry point for all services of the stack. The load balancers
are running in an active-passive replication mode and upon failure of the active node the passive node will
automatically take its place.
The replicated OpenStack installation consists of multiple instances of the controller node and some additional
instances of the network node, each running on a separate physical machine. For load balancing, the OpenStack
controller services rely on the native clustering capability of the deployed message queue server and the stateless
property of the OpenStack controller services. The relational database used by OpenStack must be replicated as well
in order to avoid data loss. The Glance image service requires shared storage, which can be provided by a distributed
file system like Ceph (http://ceph.com/). Figure 1 presents the OpenStack services distributed on multiple physical
machines and an example NFV infrastructure deployed on top of OpenStack.
Figure 1: OpenStack service distribution over multiple physical machines
We extended the replicated OpenStack installation through additional components for monitoring the vital service
parameters, for anomaly detection, and with a self-stabilization framework that contains a selection of
countermeasures to mask and/or repair failures. The current setup including the cloud management component, the
execution of VNF services, and the self-stabilization pipeline is depicted in Figure 2. In the following, the individual
components of the self-stabilization pipeline are described in more detail.
Figure 2: System architecture for anomaly detection in an NFV system
The first step in this process is to collect monitoring data in all hosts and other components in the system, so the
framework can identify, isolate, and evaluate indicators for the detection of anomalies. The basic data to collect
represents the usage of different resources in each host. This includes CPU and RAM usage, I/O operations of
different partitions and mount points, and network I/O metrics of different network interfaces and protocols. Such
resource usage metrics are collected both on the physical nodes and inside of every virtual machine. In addition, the
virtualization stack is queried for information. For example, the virtualization library Libvirt offers an API to
retrieve resource usage information of VMs. Open vSwitch6 is a widespread software switch used for implementing
virtual networks and offers remote APIs to retrieve network I/O metrics about its virtual bridges and ports. This list
has to be extended to all subsystems and components involved in the NFV stack. An important non-functional
requirement for the data collection, as well as for the entire anomaly detection pipeline, is to consume only a small
portion of the system's resources. In particular, the operation of the virtualized services should not be impacted by the
anomaly detection.
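To illustrate the kind of low-overhead host-level collection described above, the following Go sketch reads the
per-interface receive and transmit byte counters from /proc/net/dev and reports the deltas between samples. It is only
an illustration under simplifying assumptions, not the collector used in the prototype; the sampling interval and the
plain-text output are placeholders.

package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// readNetDev parses /proc/net/dev and returns the received and transmitted
// byte counters per network interface.
func readNetDev() (map[string][2]uint64, error) {
	f, err := os.Open("/proc/net/dev")
	if err != nil {
		return nil, err
	}
	defer f.Close()

	counters := make(map[string][2]uint64)
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		parts := strings.SplitN(scanner.Text(), ":", 2)
		if len(parts) != 2 {
			continue // skip the two header lines
		}
		iface := strings.TrimSpace(parts[0])
		fields := strings.Fields(parts[1])
		if len(fields) < 9 {
			continue
		}
		rx, _ := strconv.ParseUint(fields[0], 10, 64) // received bytes
		tx, _ := strconv.ParseUint(fields[8], 10, 64) // transmitted bytes
		counters[iface] = [2]uint64{rx, tx}
	}
	return counters, scanner.Err()
}

func main() {
	// Sample the counters periodically; the interval is illustrative.
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	prev, _ := readNetDev()
	for range ticker.C {
		cur, err := readNetDev()
		if err != nil {
			fmt.Fprintln(os.Stderr, "collection error:", err)
			continue
		}
		for iface, c := range cur {
			if p, ok := prev[iface]; ok {
				fmt.Printf("%s rx=%d tx=%d bytes/interval\n", iface, c[0]-p[0], c[1]-p[1])
			}
		}
		prev = cur
	}
}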
The monitoring data is collected in a data sink which is responsible for converting the data from all data sources
into a common internal format. Information about the system architecture is used to associate incoming data streams
with the correct system components.
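A minimal sketch of what such a common internal format might look like is given below; the Sample type, its
fields, and the Normalize helper are our own illustration and not the schema actually used by the data sink.

package sink

import "time"

// Sample is a minimal internal format for one metric observation.
// The field set is illustrative; the paper does not specify the exact schema.
type Sample struct {
	Timestamp time.Time
	Metric    string            // e.g. "cpu/user" or "net/eth0/rx_bytes"
	Value     float64
	Tags      map[string]string // e.g. host, VM name, component
}

// Normalize attaches topology information to an incoming raw value so that
// downstream analysis can associate the stream with the correct component.
func Normalize(host, component, metric string, value float64) Sample {
	return Sample{
		Timestamp: time.Now(),
		Metric:    metric,
		Value:     value,
		Tags: map[string]string{
			"host":      host,
			"component": component,
		},
	}
}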
The data analysis step receives the uniform data and condenses it into higher-level information. A variety of data
analysis or data mining techniques are used here. The simplest approach is to use a complex event processor and
manually define thresholds for the metrics that are known to indicate anomalies. However, this requires detailed
expert knowledge to define such thresholds and even experienced administrators can miss certain patterns or
dependencies hidden in the data. The major challenge for the failure prediction and failure recovery is related to the
definition of sets of characteristic values that indicate the presence of an anomaly or failure. Such feature vectors can
be engineered top-down based on the anticipated behavior of the system in case of failures. However, such a
systematic approach is hardly feasible for large, productive environments, so usually a bottom-up approach of injecting failures and
measuring the system parameters is applied. The collection of such feature vectors serves as a starting point for a
similarity search in order to detect and then recover from failures in a running server. Thus, the main challenge of the data
analysis step is to develop and deploy corresponding retrieval mechanisms, which can separate the noise in a running
infrastructure from the current processing state and detect anomalies as precisely as possible.
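The following sketch illustrates the basic retrieval idea in its simplest form: a current feature vector is matched
against a catalog of known anomaly signatures using a nearest-neighbor search with Euclidean distance. It is a
deliberately simplified stand-in for the mechanisms described above; the type names and the threshold are illustrative.

package analysis

import "math"

// FeatureVector holds a fixed-order set of metric values describing the
// current state of a host or service. All vectors share the same length.
type FeatureVector []float64

// KnownAnomaly pairs a labeled failure situation with its characteristic
// feature vector, as collected bottom-up by injecting failures.
type KnownAnomaly struct {
	Label    string
	Features FeatureVector
}

func euclidean(a, b FeatureVector) float64 {
	sum := 0.0
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return math.Sqrt(sum)
}

// Classify returns the label of the closest known anomaly if its distance is
// below the given threshold, otherwise "normal".
func Classify(current FeatureVector, catalog []KnownAnomaly, threshold float64) string {
	best, bestDist := "normal", math.MaxFloat64
	for _, known := range catalog {
		if len(known.Features) != len(current) {
			continue // skip incompatible signatures
		}
		if d := euclidean(current, known.Features); d < bestDist {
			best, bestDist = known.Label, d
		}
	}
	if bestDist > threshold {
		return "normal"
	}
	return best
}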
Therefore, we developed and implemented advanced techniques based on online unsupervised clustering and
classification algorithms capable of handling continuous data streams. Multiple analysis steps can be chained
together and executed on different hosts to achieve scalability. For example, every virtual and physical host performs
an analysis of all locally collected data and forwards the intermediate results to a higher-level analysis, which
combines the information for a group of hosts and in turn forwards its own results to a global analysis step that
performs root cause analysis using the distilled information about the entire NFV system. All data analysis steps
have access to a catalog of known anomalies to improve the analysis results.
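As a simplified illustration of such a lightweight, per-host streaming analysis step, the sketch below maintains
running statistics for one metric using Welford's algorithm and flags observations that deviate strongly from the
running mean. The actual system relies on more advanced online clustering and classification algorithms; this
example only conveys the stateful, single-pass nature of the analysis.

package analysis

import "math"

// OnlineStats maintains the running mean and variance of a metric stream
// using Welford's algorithm, so no samples need to be stored.
type OnlineStats struct {
	n    float64
	mean float64
	m2   float64
}

// Update incorporates one observation and reports whether it deviates more
// than k standard deviations from the running mean.
func (s *OnlineStats) Update(x, k float64) (anomalous bool) {
	s.n++
	delta := x - s.mean
	s.mean += delta / s.n
	s.m2 += delta * (x - s.mean)
	if s.n < 10 { // wait for a minimal warm-up period
		return false
	}
	std := math.Sqrt(s.m2 / (s.n - 1))
	return std > 0 && math.Abs(x-s.mean) > k*std
}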
The output of the data analysis is a stream of system state events sent to the self-stabilization engine. The purpose of this
engine is to execute recovery actions in the cloud management platform or directly within the NFV resources. Pre-
defined actions from the recovery action catalog are used to guide the system back to a “healthy” state. The recovery
action catalog can include recipes for an orchestration tool to run actions on the service layer or the underlying cloud
system. Possible actions include migrating a VM to another hypervisor or changing the configuration of an
OpenStack service such as a DHCP agent in Neutron. Recovery actions can be combined and executed on different
layers simultaneously. Through the continuous data monitoring and analysis, the decision engine will receive
continuous feedback on the success of the implemented recovery routines. Based on that it can execute additional
recovery or mitigation routines or contact an administrator as a last resort.
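The control flow of such a self-stabilization engine could be structured as sketched below: state events arrive on a
channel, a matching recovery action is looked up in the catalog and executed, and success is observed indirectly
through the next monitoring cycle. All names and the catalog contents are hypothetical and only illustrate the
feedback loop described above.

package stabilization

import "log"

// StateEvent is emitted by the data analysis when an anomaly is detected.
type StateEvent struct {
	Component string // e.g. "compute-07" or "neutron-dhcp-agent"
	Anomaly   string // label from the anomaly catalog
}

// RecoveryAction is one entry of the recovery action catalog. The concrete
// actions (migrate a VM, restart an agent, ...) are illustrative here.
type RecoveryAction func(component string) error

// Run consumes state events and executes the matching recovery action.
// Success or failure is observed through the next monitoring cycle, which
// closes the feedback loop.
func Run(events <-chan StateEvent, catalog map[string]RecoveryAction) {
	for ev := range events {
		action, ok := catalog[ev.Anomaly]
		if !ok {
			log.Printf("no recovery action for %q on %s, escalating to operator", ev.Anomaly, ev.Component)
			continue
		}
		if err := action(ev.Component); err != nil {
			log.Printf("recovery for %s failed: %v", ev.Component, err)
		}
	}
}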
4. Prototype
Several parts of the architecture are already implemented on a dedicated testbed of 20 physical machines. The
basis is an installation of OpenStack Liberty with a three-fold replicated controller node and a two-fold replicated
network node. A pair of replicated load balancers provides reliable access to all OpenStack services. The
remaining nodes are used as compute nodes hosting the virtual machines. To simulate the workload of an NFV
installation, the open source IMS core implementation Project Clearwater (http://www.projectclearwater.org/) is
executed on top of OpenStack. OpenStack's own orchestration module Heat is used to deploy the virtual
infrastructure, and the lightweight configuration management tool Ansible (https://www.ansible.com/) configures all
virtual machines to run their respective VNFs.
The data collection is implemented in the Go programming language and collects between 130 and 180 metrics
on a typical Linux machine. The data is mainly obtained by parsing the /proc file system in short time intervals. On
hypervisor nodes, an additional 22 metrics are collected for every hosted virtual machine by querying the Libvirt API,
and 7 metrics for every virtual network interface inside Open vSwitch. This adds up to about 500 metrics on a small-
sized compute node. When collecting the data at 300 ms intervals and sending it over the network in a dense binary
format, the CPU usage does not exceed 3% of a single 3.3 GHz processor. Since many of the metrics do not change
frequently, the sampling rate could be selectively lowered to further reduce the resource overhead.
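The general sampling approach can be illustrated with the following Go sketch, which computes the aggregate
CPU utilization from /proc/stat every 300 ms. It is a minimal example, not the prototype's collector, which gathers
far more metrics and ships them over the network in a dense binary format.

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"time"
)

// cpuTimes returns the total and idle jiffies from the aggregate "cpu" line
// of /proc/stat. All reported time columns are summed into the total.
func cpuTimes() (total, idle uint64, err error) {
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		return 0, 0, err
	}
	// First line: "cpu user nice system idle iowait irq softirq steal ..."
	fields := strings.Fields(strings.SplitN(string(data), "\n", 2)[0])
	for i, f := range fields[1:] {
		v, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return 0, 0, err
		}
		total += v
		if i == 3 || i == 4 { // idle and iowait columns
			idle += v
		}
	}
	return total, idle, nil
}

func main() {
	prevTotal, prevIdle, _ := cpuTimes()
	for range time.Tick(300 * time.Millisecond) {
		total, idle, err := cpuTimes()
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		dTotal, dIdle := total-prevTotal, idle-prevIdle
		if dTotal > 0 {
			fmt.Printf("cpu busy: %.1f%%\n", 100*float64(dTotal-dIdle)/float64(dTotal))
		}
		prevTotal, prevIdle = total, idle
	}
}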
Preliminary evaluation of the collected data indicates reliable recognition of pre-defined failure scenarios, with
detection rates exceeding 95%. These results need to be further investigated with respect to the execution environment
and the impact of production “noise” (concurrently running processes, user interaction, and varying load) in order to
make a reliable statement on the efficiency and the precision of the framework. This is part of the future work
described in the following.
5. Conclusion and future work
This paper presents a set of methods, an implemented prototype, and preliminary evaluation for anomaly
detection in cloud-based infrastructures with a specific focus on the deployment of virtualized network functions (VNFs).
The framework is built upon OpenStack, which is the current de-facto standard in open-source data center cloud
software. Our system architecture targets the NFV use case as the demands for fault tolerance are especially high in
the case of telecommunication providers, which are accustomed to executing services on dedicated and specialized hardware.
However, the flexibility and the cost effectiveness of virtualized solutions motivate service providers to deploy
virtualized solutions and simultaneously research software-based methods for increasing their fault tolerance. The
presented research aims at solving this problem by providing an extensive monitoring solution that collects a
significant set of data and analyzes it with supervised and non-supervised machine learning techniques. The
computed indicators are used to activate pre-defined countermeasures to mask or even repair outages or situations
with degraded performance.
In the next steps the testing environment will be extended by components for injecting general fault scenarios like
overload or memory leaks, together with fault scenarios specific to NFV environments. Based on the collected data,
additional unsupervised machine learning algorithms will be implemented inside the data analysis framework.
Further, a visualization of the current system state will help to evaluate the success of the anomaly detection and to
find correlations between different system layers. New cloud-related optimization technologies introduce layer
violations, which further exacerbate anomaly detection in NFV infrastructures. Such technologies, like the DPDK
(http://www.dpdk.org/), will be taken into consideration and further set our solution apart from conventional
monitoring and fault detection systems.
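A simple fault injector for such scenarios could look like the sketch below (our illustration, not an existing
component of the testbed): it emulates a memory leak and a CPU overload that can be triggered inside a VNF virtual
machine to generate labeled failure data.

package faultinject

import (
	"runtime"
	"time"
)

// LeakMemory allocates the given number of megabytes per second and keeps
// the allocations reachable, emulating a memory leak in a VNF process.
func LeakMemory(mbPerSecond int, stop <-chan struct{}) {
	var leaked [][]byte
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			leaked = append(leaked, make([]byte, mbPerSecond*1024*1024))
		}
	}
}

// BurnCPU spins busy loops on every available core until stop is closed,
// emulating an overload situation.
func BurnCPU(stop <-chan struct{}) {
	for i := 0; i < runtime.NumCPU(); i++ {
		go func() {
			for {
				select {
				case <-stop:
					return
				default:
				}
			}
		}()
	}
}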
References
1. OpenStack [Online]. Available: https://www.openstack.org/.
2. Prathamesh Gaikwad: "Anomaly Detection for Scientific Workflow Applications on Networked Clouds," in 2016 International Conference on
High Performance Computing & Simulation (HPCS), Innsbruck, Austria.
3. Alessandro Pellegrini, Pierangelo Di Sanzo, Dimiter R. Avresky: "A Machine Learning-Based Framework for Building Application Failure
Prediction Models," in 2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW), Hyderabad, India.
4. Javier Alonso, Lluis Belanche, Dimiter R. Avresky: "Predicting Software Anomalies Using Machine Learning Techniques," in 2011 10th IEEE
International Symposium on Network Computing and Applications (NCA), pp. 163-170.
5. B. C. Tak, S. Tao, L. Yang, C. Zhu and Y. Ruan: "LOGAN: Problem Diagnosis in the Cloud Using Log-based Reference Models," in 2016
International Conference on Cloud Engineering (IC2E ), Berlin, Germany.
6. Ben Pfaff, Justin Pettit, Teemu Koponen, Ethan Jackson, Andy Zhou, Jarno Rajahalme, Jesse Gross, Alex Wang, Joe Stringer, and Pravin
Shelar: "The Design and Implementation of Open vSwitch," in 12th USENIX Symposium on Networked Systems Design and Implementation,
Oakland, CA, USENIX Association, 2015, pp. 117-130.