Kishor S Trivedi
Duke University | DU · Department of Electrical and Computer Engineering (ECE)
PhD, UIUC; MS, UIUC; B. Tech (EE), IIT-Mumbai
An interview on reliability (both hardware and software) that I did: https://www.youtube.com/watch?v=cz_iw5PgyiM
About
883
Publications
211,360
Reads
42,533
Citations
Introduction
Current research:
1) Software failures and their mitigation focusing on environment-dependent bugs
2) Parametric uncertainty propagation through stochastic models
3) I conduct a weekly online Zoom session on the Upanishads
4) My h-index is 113.
5) My Academic tree:
https://www.genealogy.math.ndsu.nodak.edu/id.php?id=76227
6) Latest News Item:
https://pratt.duke.edu/about/news/trivedi-ieee--lifetime-award
Additional affiliations
April 2018 - August 2021
July 2014 - August 2014
Education
August 1970 - July 1974
July 1963 - June 1968
Publications (883)
Smart grids are fostering a paradigm shift in the realm of power distribution systems. Whereas traditionally different components of the power distribution system have been provided and analyzed by different teams through different lenses, smart grids require a unified and holistic approach that takes into consideration the interplay of communicati...
With software systems becoming increasingly large and complex, many difficulties in coping with software bugs arise for developers. Despite good development practices, thorough testing, and proper maintenance policies, a non-negligible number of bugs remain in the released software. Understanding the type of residual bugs is fundamental for adoptin...
The growing complexity of mission-critical space mission software makes it prone to suffer failures during operations. The success of space missions depends on the ability of the systems to deal with software failures, or to avoid them in the first place. In order to develop more effective mitigation techniques, it is necessary to understand the na...
As space mission software becomes more complex, the ability to effectively deal with faults is increasingly important. The strategies that can be employed for fighting a software bug depend on its fault type. Bohrbugs are easily isolated and removed during software testing. Mandelbugs appear to behave chaotically. While it is more difficult to dete...
Even if software developers don't fully understand the faults or know their location in the code, software rejuvenation can help avoid failures in the presence of aging-related bugs. This is good news because reproducing and isolating an aging-related bug can be quite involved, similar to other Mandelbugs. Moreover, monitoring for signs of software...
This article discusses model-driven methods with analytic-numeric solutions. In addition to traditional non-state-space and state-space methods, multilevel methods are explored using real case studies. Challenges met while developing and solving dependability models of real systems are listed, and some solutions are outlined.
Given heavy dependence on man-made systems in our daily lives, reliability and availability of these systems clearly gain great importance. Together with methods of enhancing reliability and availability of systems, methods of quantitative assessment of these attributes thus need attention. Quantification of these attributes via analytic-numeric s...
Traditional software fault tolerance makes use of design-diversity-based redundancy. While proven to be effective, the independent development of multiple versions of a program or component is connected with high costs. This paper shows that failures caused by so-called Mandelbugs (i.e., software faults whose activation and/or error propagation dep...
This paper aims to use an analytical modeling technique to quantitatively study the dependability of a Vehicle Platooning Application, which consists of Multiple Sub-Services (VPP-MSS) to achieve its functionality. Each sub-service (SS), based on network function virtualization technology, is executed in a container. Both SSes and OSes which SSes run on...
In NFV networks, service functions (SFs) can be deployed on virtual machines (VMs) across multiple domains and then form a service function chain (MSFC) for end-to-end network service provision. However, any software component in a VM-based MSFC will experience software aging issues after a long period of operation. This paper quantitatively investi...
Container technology, as the key enabler behind microservice architectures, is widely applied in Cloud and Edge Computing. A long, continuous run of an operating system (OS) hosting container-based services can encounter software aging that leads to performance deterioration and even causes system failures. OS rejuvenation techniques can miti...
Multi-access edge computing (MEC)-enabled Internet of Things (IoT) is considered as a promising paradigm to deliver computation-intensive and delay-sensitive services to users. IoT service requests can be served by multiple microservices (MSs) that form a chain, called a microservice chain (MSC). However, the high complexity of MSs a...
As software plays an increasingly important role in our lives, it is essential to maintain its reliability, and generally dependability. Software bugs can cause huge financial losses and dangerous accidents; the safety risks from software are underscored these days to even the non-technical public by the emergence of autonomous software-based syste...
Since the publication of the first paper on software aging and rejuvenation by Huang et al. in 1995 [1], considerable research has been devoted to this topic. It deals with the phenomenon that continuously-running software systems may show an increasing failure rate and/or a degrading performance, either because error conditions accumulate inside t...
This chapter introduces the moment-based epistemic uncertainty propagation in Markov models. The epistemic uncertainty in Markov models introduces the uncertainty of model parameters, and it can be propagated by regarding parameters as random variables. The idea behind the moment-based approach is to approximate the multiple integration with a seri...
Given heavy dependence on man-made systems in our daily lives, reliability and availability of these systems clearly gain great importance. Together with methods of enhancing reliability and availability of systems, methods of quantitative assessment of these attributes thus need attention. Quantification of these attributes via analytic-numeric s...
Unmanned aerial vehicle (UAV) and network function virtualization (NFV) facilitate the deployment of multi-access edge computing (MEC). In the UAV-based MEC (UMEC) network, virtualized network function (VNF) can be implemented as a lightweight container running on UMEC host operating system (OS). However, UMEC network is vulnerable to attack, which...
Network function virtualization (NFV) has been explored to be integrated with multi-access edge computing (MEC) to facilitate the development of 5G (fifth-generation) network. Latency-sensitive applications can be deployed as serial-parallel hybrid service function chains (SP-SFCs) in the MEC-NFV environment. SP-SFCs are deployed on resource-limite...
Software can show symptoms of two different types of aging. Sometimes, it is even subject to both types.
The Multi-access Edge Computing (MEC) and Network Function Virtualization (NFV) integrated architecture is a key enabling platform for 5G to run multiple customized services in the form of service function chain (SFC) configured as an ordered set of service functions (SFs). However, memory-related software aging in the SF that can be exploited by a...
A fundamental aspect of software reliability engineering is to understand how software failures manifest, identifying and comprehending their causes and effects. In this paper, we perform ex-post analyses of field software failure data, looking for characterizing their causes. The failures analyzed were collected from hundreds of computer systems l...
Understanding and predicting types of bugs are of practical importance for developers to improve the testing efficiency and take appropriate steps to address bugs in software releases. However, due to the complex conditions under which faults manifest and the complexity of the classification rules, the automatic classification of Mandelbugs is a di...
This is part 3 of chapter 1 of this Upanishad. There will be 3 more such parts.
Vehicle platooning can be applied to cooperative downloading and uploading (CDU) services through the cooperation between lead vehicle and non-lead vehicles. CDU service can be completed cooperatively by containers constructed in vehicles of vehicle platooning system. Containers in vehicles may suffer from potential attacks which can lead to resour...
The safety-critical applications of vehicular ad hoc networks (VANETs) require high reliability and low transmission latency. IEEE 802.11p and IEEE 802.11bd are two standards proposed for such vehicular communication systems. In this paper, we propose an effective SINR-based model to conduct the QoS analysis of IEEE 802.11p/bd driven VANETs for saf...
Kathopanishad Chapter 1 Valli 2
Tutorial at the 26th Asia and South Pacific Design Automation Conference (ASP-DAC 2021)
As multi-hop wireless networks are attracting more attention, the need to evaluate their performance becomes essential. In order to evaluate the performance metrics of multi-hop wireless networks, including sending and receiving rates of a node as well as the collision probability, a model based on Stochastic Reward Nets (SRNs) is proposed. The pro...
Empirical studies have shown robust evidence of OS failure patterns characterized by multiple combinations of failure events composed of the same or different failure types. In this paper, we present a statistical approach to predict OS failures based on multiple failures association. Once we identify systematic failure associations in field data,...
Mandelbug-caused software failures are significant threats to system availability, especially in the context of mission-critical and safety-critical systems. However, there is still no systematic method for keeping the software free from Mandelbugs before release. To guarantee the availability of systems suffering from Mandelbugs, environmental-div...
I have put together the Kath Upanishad with commentary. This is Chapter 1, section 1. The material is collected from many sources and made uniform. Furthermore, my own ideas are added. This was prepared for a weekly Zoom discussion session on the topic. Comments and criticism are most welcome.
The recent trend of network softwarization suggests a radical shift in the implementation of traditional network intelligence. In Software Defined Networking (SDN), for instance, the control plane functions of forwarding devices are outsourced to the controller. Softwarized network components are expected to provide uninterrupted service during lon...
Intrusion tolerance is an ability to keep the correct service by masking the intrusion based on fault-tolerant techniques. With the rapid development of virtualization, the virtual machine (VM)-based intrusion tolerance scheme has been developed according to the concept of state machine replication with Byzantine fault tolerant technique. In this a...
With the rapid and wide development and deployment of system virtualization, service availability analysis has become increasingly important in a virtualized system (VS) which suffers from software aging. Software rejuvenation techniques can be applied to improve service availability but its effectiveness depends on the rejuvenation policy, which d...
Any comments/corrections/suggestions are welcome.
A brief intro to Hinduism. Any comments/corrections/suggestions are most welcome.
Software is crucial in the provision of communication services. Most functions related to control, management and operation are realized in software. With the ongoing virtualization and shift of network functions to new software platforms, the role and criticality of software for ordinary operations as well as handling of disasters increase signifi...
Migration-based Dynamic Platform (MDP) technique, a type of Moving Target Defense (MTD) techniques, defends against sophisticated cyber-attacks by randomly and dynamically selecting a platform for executing service/job. Security defense mechanisms protect service/job usually at the cost of degrading its performance. Therefore, it is valuable to mak...
In Software Defined Networking (SDN), network programmability is enabled through a logically centralized control plane. Production networks deploy multiple controllers for scalability and reliability reasons, which in turn rely on distributed consensus protocols to operate in a logically centralized manner. However, bugs in distributed control plan...
The book "Reliability and Availability Engineering: Modeling, Analysis, and Applications" by Kishor S. Trivedi and Andrea Bobbio (1st edition), Cambridge University Press, 2017, covers the analytical and modeling techniques currently in use for evaluating the reliability/availability of engineered systems. The book was recommended to me when I was...
A fundamental need for software reliability engineering is to comprehend how software systems fail, which means understanding the dynamics that govern different types of failure manifestation. In this paper, we present an exploratory study on multiple-event failures, looking for systematic patterns of sequences of failures in logs of a commodity op...
High reliability and availability are requirements for most technical systems including computer and communication systems. Reliability and availability assurance methods based on probabilistic models are the topic addressed in this talk. Non-state-space solution methods are often used to solve models based on reliability block diagrams, fault trees...
Software aging, which is caused by Aging-Related Bugs (ARBs), tends to occur in long-running systems and may lead to performance degradation and increasing failure rate during software execution. ARB prediction can help developers discover and remove ARBs, thus alleviating the impact of software aging. However, ARB-prone files occupy a small percen...
The extent of epistemic uncertainty in modeling and analysis of complex systems is ever growing, mainly due to increasing levels of the openness, heterogeneity and versatility in cloud-based applications that are being adopted in critical sectors, like banking and finance. State-of-the-art approaches for model-based performance assessment do not em...
This paper presents an empirical study of 5741 bug reports for the Linux kernel from an evolutionary perspective, with the aim of obtaining a deep understanding of bug characteristics in the Linux operating system. Bug classification is performed based on the fault triggering conditions, followed by an analysis of the proportions and evolution of t...
In some applicable scenarios such as community patrolling, mobile nodes are restricted to move only in their own communities. Exploiting the meetings of the nodes within the same community and the nodes within the neighboring communities, a Delay Tolerant Network (DTN) can provide communication between any two nodes. In this paper, two analytical m...
Software failures caused by data race bugs have always been major concerns in parallel and distributed systems, despite significant efforts spent in software testing. Due to their nondeterministic and hard-to-reproduce features, when evaluating systems’ operational reliability, a rather long period of experimental execution time is expected to be s...
The recovery and repair durations of large fault-tolerant systems generally span several orders of magnitude. The distributions also violate the common modeling assumption of an exponential distribution for the recovery and repair time. A reward-based semi-Markov model is presented that can be used to predict the steady-state availability of such s...
Modern systems implement multiple and complex operations to manage the user demand, thereby ensuring adequate quality levels. They are usually made of a collection of interconnected (autonomous) subsystems, with a common goal to be pursued, that are perceived as a whole, single, integrated facility.
The Android operating system (OS) is a sophisticated man-made system and is the dominant OS in the current smartphone market. Due to the accumulation of errors in the system internal state and the incremental consumption of resources, such as the Dalvik heap memory of software applications and the physical memory, software aging is observed frequen...
Malicious lateral movement-based attacks have become a potential risk for many systems, posing likely threats to critical infrastructures and national security. When launching this kind of attack, adversaries first compromise a fraction of the targeted system and then move laterally to the rest of the system until the whole system is infe...
In this paper, the performance of a grid resource is modeled and evaluated using stochastic reward nets (SRNs), wherein the failure–repair behavior of its processors is taken into account. The proposed SRN is used to compute the blocking probability and service time of a resource for two different types of tasks: grid and local tasks. After modelin...
In long running systems, software tends to encounter performance degradation and increasing failure rate during execution. This phenomenon has been named software aging, which is caused by aging-related bugs (ARBs). Testing resource allocation can be optimized by identifying ARB-prone modules with ARB prediction. However, due to the low presence an...
Due to the increasing need for computational power, the market has shifted towards big centralized data centers. Understanding the nature of the dynamics of these data centers from machine and job/task perspective is critical to design efficient data center management policies like optimal resource/power utilization, capacity planning and optimal (...
Software aging often affects the performance of software systems and may eventually cause them to fail. A complementary approach to handle transient software failures due to software aging is called software rejuvenation. It is a preventive and proactive solution that is particularly useful for counteracting the phenomenon of software aging. In...
Outpatient centers comprised of many concurrent clinics increasingly see higher patient volumes. In these centers, decisions to improve clinic flow must account for the high degree of interdependence when critical personnel or equipment is shared between clinics. Discrete event simulation models have provided clinical decision support, but rarely a...
In Software Defined Networking (SDN) critical control plane functions are offloaded to a software entity known as the SDN controller. Today’s SDN controllers are complex software systems, owing to heterogeneity of networks and forwarding devices they support, and are inherently prone to bugs. Our previous work showed that Software Reliability Growt...
Data-centers have recently experienced a fast growth in energy demand, mainly due to cloud computing, a paradigm that lets the users access shared computing resources (e.g., servers, storage, etc.). Several techniques have been proposed in order to alleviate this problem, and numerous power models have been adopted to predict the servers' power con...
The increasing shift of various critical services towards Infrastructure-as-a-Service (IaaS) cloud data centers (CDCs) creates a need for analyzing CDCs’ availability, which is affected by various factors including repair policy and system parameters. This paper aims to apply analytical modeling and sensitivity analysis techniques to investigate th...
Details and video of my talk at Alibaba on Dec. 8, 2017:
https://102.alibaba.com/detail/?id=23
As enterprises continue to move their workloads from traditional server-room environments to private cloud-based systems, there is an increasing desire and ability for companies like IBM to centrally monitor the systems on behalf of their customers to proactively help to mitigate any potential failure scenarios. In this paper, we investigate failur...
Infrastructure as a Service (IaaS) is one of the most significant and fastest growing fields in cloud computing. To efficiently use the resources of an IaaS cloud, several important factors such as performance, availability, and power consumption need to be considered and evaluated carefully. Evaluation of these metrics is essential for cost-benefi...
The Linux operating system is a complex system that is prone to failures during usage, and its complexity increases the difficulty of fixing bugs. Different testing strategies and fault mitigation methods can be developed and applied based on different types of bugs, which makes a deep understanding of the nature of bugs in Linux necessary. In thi...
The Internet world is moving toward a scenario where users and applications have very diverse service expectations, making the current best-effort model inadequate and limiting. To be able to design high-availability service systems, it is essential to consider not only the actual failure and recovery behavior of the service infrastructure, but also...
Transient performance analysis of a power distribution network (PDN) after a failure occurrence could facilitate better design of smart grids. Researchers have proposed analytical models and numerical solutions to analyze the PDN's transient behaviors by applying a homogeneous continuous-time Markov chain (CTMC). However, the PDN system may be t...
While blockchain networks bring tremendous benefits, there are concerns about whether their performance would match up with mainstream IT systems. This paper aims to investigate whether the consensus process using Practical Byzantine Fault Tolerance (PBFT) could be a performance bottleneck for networks with a large number of peers. We model the PBFT...
Do you need to know what technique to use to evaluate the reliability of an engineered system? This self-contained guide provides comprehensive coverage of all the analytical and modeling techniques currently in use, from classical non-state and state space approaches, to newer and more advanced methods such as binary decision diagrams, dynamic fau...