Exploiting Stand-in Redundancy to Improve Resilience in a System-of-Systems (SoS)

Procedia Computer Science 16 (2013) 532–541
1877-0509 © 2013 The Authors. Published by Elsevier B.V.
Selection and/or peer-review under responsibility of Georgia Institute of Technology
doi: 10.1016/j.procs.2013.01.056
Conference on Systems Engineering Research (CSER’13)
Eds.: C.J.J. Paredis, C. Bishop, D. Bodner, Georgia Institute of Technology, Atlanta, GA, March 19-22, 2013.
Payuna Uday^a,*, Karen Marais^a
^a Purdue University, 401 W. Stadium Avenue, West Lafayette, IN-47906, USA
Abstract
Resilience is the ability of a system or organization to react to and recover from disturbances with minimal effect on its dynamic
stability. While the resilience of system-of-systems (SoSs) depends on the reliability of their constituent systems, traditional
reliability approaches cannot adequately quantify their resilience. Given the heterogeneity and often wide geographic distribution
of SoS constituent systems, inclusion of backup redundant systems for a SoS is usually impractical and costly. In this paper, we
quantitatively assess the impact of compensating for a loss of performance in one constituent system by re-tasking the remaining
systems. We call this “stand-in redundancy”, and we develop two concepts to implement stand-in redundancy in a SoS. First,
reactive resilience deals with performance recovery after a system failure has occurred. We provide a method to determine
feasible alternative SoS configurations based on performance level recovery and cost of implementation. Second, proactive
resilience takes into account the gradual degradation of systems over time. The corresponding reduction in SoS performance
could initiate a forcible transition to a different SoS configuration before actual failure of the system. These concepts, and their
resulting upstream effects on development costs and risks, can be used by decision-makers to quantitatively assess the impact on
resilience of different SoS architectures and their inherent ability to resist failures throughout the SoS lifecycle.
Keywords: System-of-systems; resilience; stand-in redundancy; functional reconfigurability
1. Introduction
The emergence of complex systems over the past few decades has led to increased interest in exploring methods
to incorporate high levels of inherent resilience within them. A complex system can be defined as “an open system
with continually cooperating and competing elements – a system that continually evolves and changes according to
its own condition and external environment” [1]. Examples of complex systems include satellites, aircraft, and the
space shuttle. These systems are expensive to design and build, they operate in harsh or remote environments, and
any failure of these systems is typically a high-publicity event. In some cases, such as satellites, maintenance and
repair are difficult or impossible because the systems operate in physically inaccessible environments.
* Corresponding author. Tel.: +1-765-464-4966; fax: +1-765-494-0307.
E-mail address: puday@purdue.edu.
In recent years, networks of complex systems, known as system-of-systems (SoS), have garnered increased
attention [2,3]. Specifically, the term system-of-systems is used to denote networks that are formed from the
integration of independently operating complex systems that interact with one another to provide an overall
capability, which cannot be achieved by the individual systems alone [1]. Examples of SoSs include the National
Airspace System (NAS) and the military’s ballistic missile defense system. These meta-systems are characterized by the
operational and managerial independence of the constituent systems, the evolutionary nature and emergent behavior
of the larger SoS, and the geographic distribution of the sub-systems [4]. High levels of interdependency add to the
overall complexity of the SoS. As a result, designing and operating a SoS is challenging from both an engineering and
a managerial perspective. In particular, resilience in a SoS, though as vital as in the case of complex
systems, is hard to capture and design through traditional means.
Resilience is the ability of a system or organization to react to and recover from disturbances at an early stage
with minimal effect on its dynamic stability [5]. Typically, in large complex systems, redundancy features are used
to increase the resilience of the system to perturbations. For instance, commercial satellites are fitted with multiple
backup systems to limit performance loss in the event of failures. While resilience and redundancy are sometimes
thought of as analogous, they are significantly different concepts. Redundancy is essentially the inclusion of
secondary components or systems to provide operability when a primary system fails. It can, therefore, be thought of
as an input to the design process, which ultimately provides some level of resilience (output) to the overall system.
In other words, redundancy is just one way to achieve resilience in a system.
In this paper, we show how the inherent structure and traits of an SoS can be leveraged to improve the resilience
of the overall network. We develop a method that combines the ability of the constituent systems to function
independently with the evolutionary nature of the SoS to maintain dynamic stability in the event of a failure. That is,
if one system fails, we compensate for this loss by re-tasking all or a subset of the remaining systems. We call this
approach “stand-in redundancy”, and propose two perspectives to address it:
1) Reactive resilience: Once a failure has occurred, we study different feasible configurations of the remaining
systems that can be set up to meet the immediate needs of the SoS.
2) Proactive resilience: Given gradual performance degradation over time, we investigate the benefits of
forcibly transitioning the SoS to a new configuration before actual failure of the system.
The remainder of this paper is organized as follows: Section 2 highlights the motivation behind this study,
Section 3 describes the method developed, and Section 4 discusses the results of the analysis using an illustrative
example. Section 5 concludes the paper.
2. Motivation
Traditional systems engineering practices try to anticipate and resist disruptions through classical reliability
methods, such as inclusion of redundancy at the component level and use of preventive maintenance at the system
level. Reliability analysis techniques, such as fault trees and event trees, are used to determine the level and types of
redundancy to be included in the system design. Similar methods are used to develop maintenance plans to reduce
the likelihood of failures at the system level.
However, these approaches do not adequately satisfy the resilience needs of a SoS. Given the heterogeneity and
often wide geographic distribution of the constituent systems, redundant systems in a SoS are impractical and
costly. Additionally, high levels of interdependency between the systems imply increased risks of failures cascading
throughout the SoS. These hurdles offer the opportunity to improve the resilience of the overarching system through
unconventional means. This view echoes that of researchers who raise the need for a different perspective of
resilience in the context of SoSs [6,7]. A recent paper succinctly sums up the need for greater emphasis on resilient
systems by stating that “systems should be made resilient, rather than merely reliable” as they “need to be able to
recover from unexpected perturbations, disruptions, and degradations of the operational environment” [8].
Here, we study a way to compensate for a loss of performance in one constituent system by re-tasking the
remaining systems. Specifically, as one entity, or node in a SoS, experiences degraded performance or a failure
mode, other entities can alter their operations to compensate for this loss. We call this “stand-in redundancy”. This
concept raises several interesting questions, such as: (1) given the failure of a specific system, what is the best
configuration to compensate for the loss?; (2) what level of performance can be recovered with the new
configuration?; and (3) what is the upstream effect of stand-in redundancy on development costs and risks?
To answer these questions, we develop two concepts: (1) reactive resilience, and (2) proactive resilience.
Reactive resilience deals with performance recovery after a failure has occurred. In this case, for a specific
capability, we study the reduction in overall SoS performance given various nodal failures, and then determine the
level of performance that can be recovered by reconfiguring the rest of the SoS.
Having studied the impact of total nodal failures on overall SoS performance, we expand the method to track the
impact of gradual degradation of nodes on the same overall performance. As the nodes degrade over time, the
corresponding reduction in SoS performance could result in a situation where a different configuration might
perform better than the current one. This implies that one might proactively transition the SoS to a different
configuration before actual failure of the node. We call this proactive resilience, as this transition to a new
configuration before nodal failure occurs improves the robustness of the overall SoS.
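The proactive transition described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the LoP of the current configuration degrades linearly with time while an alternative configuration offers a constant LoP; the function name `first_switch_time` and all numerical values are hypothetical and are not taken from this study.

```python
from typing import Callable, Optional


def first_switch_time(
    lop_current: Callable[[float], float],
    lop_alternative: Callable[[float], float],
    horizon: int,
) -> Optional[int]:
    """Return the first integer time step at which the alternative
    configuration outperforms the current one, or None if it never does
    within the horizon."""
    for t in range(horizon + 1):
        if lop_alternative(t) > lop_current(t):
            return t
    return None


# Hypothetical profiles (assumptions): the current configuration starts at
# LoP = 0.95 and loses 0.03 per time step; the alternative stays at 0.70.
def current(t: float) -> float:
    return max(0.0, 0.95 - 0.03 * t)


def alternative(t: float) -> float:
    return 0.70


switch_at = first_switch_time(current, alternative, horizon=20)
print(switch_at)
```

In this toy setting the transition would be triggered as soon as the degraded configuration falls below the alternative, which is before the degrading node actually fails.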
Figure 1 summarizes the above discussion and illustrates the different ways resilience in an SoS can be achieved.
Failures can be addressed after they occur, through repair; or they can be anticipated and addressed before they
occur, through preventive maintenance. Typically, exogenous methods are required to provide these services. Here,
we emphasize designing systems with the inherent ability to react to failures, without causing any downtime for
maintenance or repair (as highlighted by the shaded region of the figure). This internal ability to resist
failures can be achieved through means endogenous to the SoS. This alternative to traditional reliability engineering
practices helps design systems-of-systems with inherent resilience by taking advantage of fundamental properties
such as diversity, adaptability, and evolutionary behavior. These different kinds of resilience can be used to study
various SoS architectures and evaluate their inherent ability to resist failures, thus in effect providing information for
decision-makers to help identify “optimally” resilient SoS structures. Additionally, consideration of these resilience
improvement techniques enables designers to better target risk resolution resources.
Figure 1. Classification of SoS resilience based on when and how failures are addressed
3. Analytic framework
3.1. Representation of an SoS
We consider a system-of-systems that consists of n systems. Typical representations of a SoS involve networked
combinations of the constituent systems that ultimately provide SoS-level capabilities, as shown in Figure 2a. In this
work, we represent a SoS as a hierarchical structure of multi-system capabilities and single-system functions, as
shown in Figure 2b. At the highest level, the SoS is essentially a collection of the capabilities that it is designed to
provide. Subsequently, each capability emerges from a collection of system-level functions.
At the lowest level, the functions are performed by the individual systems. For example, geosynchronous
satellites can image large swathes of area for extended periods of time, and weapon-fitted UAVs can be used to
strike targets in hostile environments. Consideration of these functions rather than the individual systems as the base
level of SoS capabilities allows us to analyze the resilience of the larger system. As mentioned previously, the
traditional practice of incorporating redundancy in complex engineered systems may not fit well with respect to
evolving SoSs. Thus, this view of recapturing lost or deteriorating functions plays a key role in improving the
resilience of SoSs. In the above example, if surveillance is a key requirement of the mission and if the satellite fails,
then with appropriate imaging capabilities, the UAV can be re-tasked to perform surveillance functions and to
network with systems that were originally linked to the satellite.
Figure 2. (a) Physical [9] and (b) functional representation of system-of-systems
At the middle level, groups of functions work together to provide a higher-level capability ($C_j$). For example, the
satellite could collaborate with a reconnaissance UAV to not only provide surveillance of a large area but also to
provide high-definition imaging capabilities for target identification. Performance and reliability are two important
metrics with respect to any operational system. In this work, we consider these measures at the capability level and
refer to them as the Level of Performance ($LoP_{C_j}$) and Level of Reliability ($LoR_{C_j}$) respectively.
Determination of the level of performance of SoS capabilities is challenging when compared to calculating these
metrics for simpler systems. The level of performance achieved by a particular system function can be expressed by
a direct measure of how well it performs its task. For instance, a direct measure of performance might be the area,
say 1000 sq. mi., that a satellite can image with a given level of resolution. In contrast, determination of the level of
performance for a particular capability is context and SoS dependent. As shown in Equation (1), computing this
metric relies on a number of factors such as the architecture of the constituent systems, the availability of these
systems, the performance capabilities of each system, and the functions achievable by each system.
$$LoP_{C_j} = f(\text{SoS architecture},\ \text{system availability},\ \text{system performance},\ \text{system functions}) \qquad (1)$$
The level of reliability provided by a capability is relatively simpler to determine. There are well-established
methods that help determine the reliability of systems [10]. Using these traditional methods and tabulated failure
rate information, the reliability of a system ($R_{x_i}$), and by extension the reliability of the function it provides, at any
time after it has been deployed can be computed. For this work, we define the level of reliability ($LoR_{C_j}$) of a
capability as the probability that all the $n_c$ constituent systems contributing to that particular capability are
operational at a particular time.
$$LoR_{C_j}(t) = \prod_{i=1}^{n_c} R_{x_i}(t) \qquad (2)$$
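As a rough illustration of the level of reliability defined above, the sketch below multiplies individual system reliabilities to obtain the probability that all contributing systems are operational. It rests on two assumptions that are not stated in the text: system failures are independent, and each system has a constant failure rate (exponential reliability model). The function names and failure rates are hypothetical.

```python
import math
from typing import Iterable


def system_reliability(failure_rate: float, t: float) -> float:
    """Reliability of one system at time t, assuming a constant failure
    rate (exponential model; an assumption made for this sketch)."""
    return math.exp(-failure_rate * t)


def level_of_reliability(failure_rates: Iterable[float], t: float) -> float:
    """Probability that every contributing system is operational at time t,
    assuming the systems fail independently of one another."""
    return math.prod(system_reliability(lam, t) for lam in failure_rates)


# Hypothetical failure rates (per hour) for two systems supporting one capability.
lor = level_of_reliability([1e-4, 2e-4], t=1000.0)
print(round(lor, 4))
```

Because the capability needs every contributing system, its level of reliability can never exceed that of its least reliable system under these assumptions.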
3.2. Framework
Figure 3 provides a graphical illustration of SoS operations using the metrics of Level of Performance (LoP) and
Level of Reliability (LoR) described above. The desirable region of operation is in the top right of the graph, that is,
the high-performance, high-reliability portion (as indicated by the blue circle). After the initial deployment, the
systems gradually degrade with time, and the SoS region of operation moves to the left of the graph. In some cases,
this degradation can result in a simultaneous reduction in performance levels as well as reliability (not shown in
figure). If a system fails, the immediate loss in its functionality leads to a decrease in the overall performance level
of the SoS (as shown by the red circle). Given the inherent characteristics of SoSs, a single system loss typically
does not lead to a complete failure of the larger SoS and hence, the overall LoP does not fall to zero. However, we
propose that incorporating a certain level of stand-in redundancy will allow the SoS to minimize this performance
loss without relying on external agents to either maintain or replace the failed system. This idea is illustrated in
Figure 4. In the event of a system failure, by allowing multiple systems in the SoS to perform the same functions,
the remaining systems in the SoS can be re-tasked to perform the lost functions, even if to a lesser degree of
performance. This is represented by a sequence of green circles, indicating that different levels of performance can
be regained depending on the functional reconfigurability of the SoS.
Figure 3. Notional LoP-LoR graph showing impact of system failure on SoS performance
The framework to improve SoS resilience using stand-in redundancy is based on a combinatorial optimization
approach. This method aims to choose the SoS configuration that minimizes the operational cost of the SoS, while
achieving a certain level of capability performance, as well as a threshold level of reliability. For a SoS with m
capabilities, the optimization formulation thus becomes:
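Minimize:
$$Cost_{SoS} \qquad (3)$$
Subject to:
$$LoP_{C_j} \geq LoP_{C_j}^{target}, \quad j = 1, \ldots, m \qquad (4)$$
$$LoR_{C_j} \geq LoR_{C_j}^{target}, \quad j = 1, \ldots, m \qquad (5)$$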
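One simple way to explore a formulation of this kind is exhaustive enumeration, sketched below. This is not necessarily the solution procedure used in this study; it is a brute-force illustration that assumes a small set of optional modifications and a user-supplied `evaluate` function returning the cost, LoP, and LoR of each configuration. The option names and the lookup table `TOY` are hypothetical.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, Optional, Sequence, Tuple

# (cost, LoP per capability, LoR per capability) for one configuration
Metrics = Tuple[float, Dict[str, float], Dict[str, float]]


def cheapest_feasible(
    options: Sequence[str],
    evaluate: Callable[[FrozenSet[str]], Metrics],
    lop_target: Dict[str, float],
    lor_target: Dict[str, float],
) -> Optional[Tuple[FrozenSet[str], float]]:
    """Enumerate every subset of options and return the cheapest one that
    meets all LoP and LoR targets, or None if no subset is feasible."""
    best: Optional[Tuple[FrozenSet[str], float]] = None
    for k in range(len(options) + 1):
        for subset in combinations(options, k):
            config = frozenset(subset)
            cost, lop, lor = evaluate(config)
            feasible = all(lop[c] >= v for c, v in lop_target.items()) and all(
                lor[c] >= v for c, v in lor_target.items()
            )
            if feasible and (best is None or cost < best[1]):
                best = (config, cost)
    return best


# Hypothetical data: cost and surveillance LoP for each upgrade combination.
TOY = {
    frozenset(): (10.0, 0.40),
    frozenset({"camera_uav1"}): (12.0, 0.55),
    frozenset({"camera_uav2"}): (13.0, 0.50),
    frozenset({"camera_uav1", "camera_uav2"}): (15.0, 0.725),
}


def toy_evaluate(config: FrozenSet[str]) -> Metrics:
    cost, lop = TOY[config]
    return cost, {"surveillance": lop}, {"surveillance": 0.90}


best = cheapest_feasible(
    ["camera_uav1", "camera_uav2"],
    toy_evaluate,
    lop_target={"surveillance": 0.70},
    lor_target={"surveillance": 0.80},
)
print(best)
```

Enumeration grows exponentially with the number of options, so it is only practical for small examples; larger problems would call for a dedicated combinatorial optimization method.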
3.2.1. SoS operations cost
The SoS cost depends on a variety of factors and is highly context dependent. For this work, we divide SoS
operations into three broad situations:
Fully functional state: The SoS is in its original configuration with no system failures. The SoS cost depends on
the number of systems contributing to the overall SoS capabilities as well as the cost of operating each system.
System loss state: The SoS has suffered either single or multiple system failures and is now operating at a much
lower level of performance. Here, the SoS cost depends on the operating costs of the remaining systems as well
as costs that may be incurred in repairing and/or replacing the failed system.
Re-tasked state: The remaining systems are re-tasked to recover some of the lost functionality after a system
failure. In this situation, the SoS cost depends on the operating costs of the remaining systems, the acquisition
costs to include additional features that enable this functional redundancy, and the marginal costs to
reconfigure the SoS so that systems can take on their new tasks.
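The three cost situations above can be written as simple sums. The sketch below is only an illustrative bookkeeping of the cost components named in the text; the function names, the additive form, and all cost figures are assumptions rather than the cost model of this study.

```python
from typing import Dict, Iterable


def remaining_operating_cost(operating: Dict[str, float], failed: Iterable[str]) -> float:
    """Sum the operating costs of the systems that have not failed."""
    failed_set = set(failed)
    return sum(cost for name, cost in operating.items() if name not in failed_set)


def fully_functional_cost(operating: Dict[str, float]) -> float:
    """Original configuration: every system operates."""
    return sum(operating.values())


def system_loss_cost(operating: Dict[str, float], failed: Iterable[str],
                     repair_or_replace: float) -> float:
    """Remaining operating costs plus the cost to repair or replace failures."""
    return remaining_operating_cost(operating, failed) + repair_or_replace


def retasked_cost(operating: Dict[str, float], failed: Iterable[str],
                  feature_acquisition: float, reconfiguration: float) -> float:
    """Remaining operating costs plus the cost of the added features and
    the marginal cost of reconfiguring the SoS."""
    return (remaining_operating_cost(operating, failed)
            + feature_acquisition + reconfiguration)


# Hypothetical operating costs in arbitrary units.
operating = {"satellite": 5.0, "uav1": 2.0, "uav2": 3.0, "uav3": 3.0}
full = fully_functional_cost(operating)
loss = system_loss_cost(operating, ["satellite"], repair_or_replace=4.0)
retask = retasked_cost(operating, ["satellite"], feature_acquisition=1.5, reconfiguration=0.5)
print(full, loss, retask)
```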
Figure 4. Notional LoP-LoR graph showing impact of functional reconfigurability on SoS performance
3.2.2. Level of Performance ($LoP_{C_j}$) and Level of Reliability ($LoR_{C_j}$)
These metrics are defined for each capability as described in Section 3.1.
3.2.3. Target values for Level of Performance ($LoP_{C_j}^{target}$) and Level of Reliability ($LoR_{C_j}^{target}$)
The constraints force the optimal solution to meet the target values of performance and reliability for each
capability, defined by the decision-maker. In this study, these target values are defined under two broad categories:
Desired target values ($LoP_{C_j}^{desired}$, $LoR_{C_j}^{desired}$): These values are used to determine the region of operation (in the LoP-LoR
graph) of the original, fully functional SoS. They represent the level of performance and reliability the SoS
is expected to satisfy. For example, for a fully functional and newly deployed SoS, the systems can be chosen to
provide relatively high target values for each capability.
Acceptable target values ($LoP_{C_j}^{acceptable}$, $LoR_{C_j}^{acceptable}$): These values are used to determine the region of operation (in the
LoP-LoR graph) of the re-tasked SoS. They represent the minimum acceptable level of performance that the
remaining systems must provide. As these values of $LoP_{C_j}^{acceptable}$ and $LoR_{C_j}^{acceptable}$ are varied, the costs to achieve the
corresponding value of stand-in redundancy vary accordingly.
4. Illustrative example
Consider the need for a SoS that provides the ability to detect and eliminate targets in hostile environments as
well as provides large-scale surveillance in both hostile and non-hostile situations. The SoS consists of five systems:
a geosynchronous satellite, three UAVs, and a ground station (see Figure 5). UAV-1 is primarily a surveillance
drone with no weapons on board. On the other hand, UAV-2 and UAV-3 are fitted with weapons for target
elimination tasks and they carry basic cameras on board to provide confirmation of the strikes that have been carried
out. Combinations of these system functions yield three primary SoS capabilities: (1) surveillance, (2) target
identification, and (3) target elimination. Using the representation developed in Section 3.1, we decompose the SoS-
level need into capabilities as shown in Figure 5. For example, to provide surveillance capabilities, the
corresponding systems need to be able to image large areas of land with high revisit rates. Figure 6 indicates the
functions that each system can provide. With this setup, the systems that contribute to the SoS capabilities are:
1. Surveillance provided by the satellite
2. Target identification provided by a collaboration between the satellite and UAV-1
3. Target elimination provided by UAV-2 and UAV-3
The main focus of this work is to assess the impact of incremental modifications to existing SoSs such that the
overall architecture becomes more resilient to disruptions. With this focus in mind, we identify a few key ways this
notional SoS can be modified. Due to challenges related to its accessibility, the features on the satellite cannot be
changed. In contrast, it is relatively easier to retrofit the UAVs with higher performance devices, such as
sophisticated imaging and communication equipment. Additionally, UAVs may also be reprogrammed for higher
revisit rates. The grey arrows in Figure 6 indicate the systems whose corresponding features can be changed.
Figure 5. Five-system SoS and the corresponding decomposition of capabilities
Next, we determine the Level of Performance (LoP) and Level of Reliability (LoR) associated with each of the
three capabilities. Both these metrics depend on the needs of the particular SoS, the systems available to perform
tasks, and the specific functions that can be provided by these systems. In this study, we assess the LoP for each
capability as the probability of achieving the particular capability. We know that combinations of system-level
functions result in the corresponding capability. Let $F_{C_j}$ denote the set of system functions that contribute to a
capability ($C_j$), and let $S_{C_j}$ denote all possible combinations of these functions. Assuming binary states of the
systems, that is, each system is either fully functional or completely degraded (failed), we can determine the
probability of achieving a particular capability by applying the law of total probability:
$$LoP_{C_j} = P(C_j) = \sum_{s \in S_{C_j}} P(C_j \mid s)\, P(s) \qquad (6)$$
Equation (6) means that the probability of achieving a certain capability depends on (a) the conditional
probability that the capability is achievable given the operational status of the systems and the functions they
provide, as well as (b) the probability that the systems, and by extension, their functions, are operational. The latter
probabilities can be determined based on the reliability of the systems at the time of interest, as described below. On the other
hand, calculation of the conditional probability of success is relatively more complex since it depends on a variety of
factors, such as the actual number of operational systems, the strength of the links between the systems, and the
availability of sub-systems within the systems. The reliability of each of these systems can be calculated by
applying historical failure rate information to Equation (2). We now perform the optimization to determine the least
costly configuration under each of the following scenarios:
1. When nothing fails, that is, the original SoS performs as designed.
2. When one system fails.
3. When one or more systems are used to recover functionality after a system has failed.
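The law of total probability used for the LoP can be evaluated by enumerating every combination of operational and failed systems. The sketch below assumes independent system failures and binary system states; the conditional probabilities of achieving the capability are supplied by a hypothetical function, since the text notes that they depend on several context-specific factors. All names and numbers are illustrative assumptions.

```python
from itertools import product
from typing import Callable, Dict


def level_of_performance(
    reliabilities: Dict[str, float],
    p_capability_given_state: Callable[[Dict[str, bool]], float],
) -> float:
    """Sum P(capability | state) * P(state) over all binary system states,
    assuming the systems fail independently of one another."""
    names = list(reliabilities)
    total = 0.0
    for status in product([True, False], repeat=len(names)):
        state = dict(zip(names, status))
        p_state = 1.0
        for name, is_up in state.items():
            r = reliabilities[name]
            p_state *= r if is_up else (1.0 - r)
        total += p_capability_given_state(state) * p_state
    return total


# Hypothetical conditional probabilities: the capability is fully achieved
# with both systems, half achieved with exactly one, and lost with none.
def toy_conditional(state: Dict[str, bool]) -> float:
    return sum(state.values()) / 2.0


lop = level_of_performance({"satellite": 0.9, "uav1": 0.8}, toy_conditional)
print(round(lop, 4))
```

With n contributing systems the enumeration visits 2^n states, which is manageable for the handful of systems behind a single capability in an example of this size.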
[Figure 6 matrix. Rows, systems ($x_i$): Satellite ($x_1$), UAV-1 ($x_2$), UAV-2 ($x_3$), UAV-3 ($x_4$). Columns, functions ($fn_i$): Area imaged ($fn_1$), Imaging resolution ($fn_2$), Revisit rate ($fn_3$), Target strike rate ($fn_4$). One marker indicates a feature present in the system in the original SoS; a second marker indicates a feature that can be upgraded/modified in the system.]
Figure 6. Systems and the functions they provide
The results are shown in Figure 7. The horizontal axis denotes the system that has failed; the vertical axis
represents the LoP for the capability of interest. For each failed system, the first bar indicates the original fully
functional SoS (the baseline case for further comparisons), the second bar represents the impact of the system failure
on a particular capability, and the last bar shows the effect of re-tasking the remaining systems in the SoS to achieve
a minimum acceptable level of performance. For example, consider capability $C_1$ (surveillance) in Figure 7a. For this
capability, the satellite alone provides the maximum LoP (100%) in the original SoS. If, however, the satellite fails,
UAV-1 (with high definition imaging capabilities) can provide some surveillance capability (55%). On the other
hand, if the systems were retrofitted with better imaging devices and reprogrammed for increased revisit rates, a
combination of UAV-1 and UAV-2 could “stand in” for the satellite, providing a higher LoP (72.5%) for the
same capability.
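The surveillance figures quoted above can be turned into a single recovery ratio. The measure below, the share of the lost LoP that re-tasking wins back, is an illustrative metric introduced here for convenience and is not defined in this study; only the three LoP values (100%, 55%, 72.5%) come from the text.

```python
def recovered_fraction(lop_original: float, lop_failed: float, lop_retasked: float) -> float:
    """Share of the LoP lost to a failure that is regained by re-tasking.

    Illustrative measure (an assumption of this sketch):
    (LoP after re-tasking - LoP after failure) / (original LoP - LoP after failure).
    """
    loss = lop_original - lop_failed
    if loss <= 0:
        raise ValueError("the failure must reduce the level of performance")
    return (lop_retasked - lop_failed) / loss


# Surveillance after a satellite failure: 100% originally, 55% with UAV-1
# alone, 72.5% with the retrofitted UAV-1 and UAV-2 standing in.
fraction = recovered_fraction(100.0, 55.0, 72.5)
print(f"{fraction:.1%}")
```

Under this measure, stand-in redundancy regains 17.5 of the 45 percentage points lost when the satellite fails.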
In the case of capability $C_2$ (target identification) (see Figure 7b), loss of either the satellite or UAV-1 has a significant
impact on this capability. Additionally, given the reliance of $C_3$ (target elimination) on the ability to accurately track
a target, it is vital that drastic performance decrements, especially in urgent hostile situations, do not occur. When
the satellite fails, assuming improved imaging capabilities on UAV-2, it can collaborate with UAV-1 to provide a
marginally higher LoP. On the other hand, if UAV-1 were to fail, providing better imaging equipment on board
UAV-2 (despite its primary role as an attack drone) raises the LoP by a significant amount (from 56% to 76%).
The impact of stand-in redundancy on capability $C_3$ (target elimination) leads to some interesting observations (see Figure 7c). While this
capability directly stems from the targeting capabilities of the attack drones, namely UAV-2 and UAV-3, it relies
heavily on accurate target identification information provided by $C_2$. Intuitively, the loss of either of the attack
UAVs results in performance losses. For example, failure of UAV-2 alone leads to a direct loss of half the weapons
striking capability of the SoS, and this loss cannot be recovered by any other system in the SoS as none of the
non-attack systems is equipped to carry and launch weapons. In contrast, failure of the satellite and/or UAV-1 adversely impacts the
target identification ability of the SoS, thereby hindering its target elimination ability, despite the full functionality
of UAVs 2 and 3. This highlights the fact that system failures can have significant impacts on capabilities that they
are not directly designed to satisfy. In such situations, allowing other systems to take over some of the lost
functionality helps maintain key capabilities at acceptable (as determined by decision-makers or operators) levels.
For example, if UAV-1 fails in the midst of a raid, a properly equipped UAV-2 can provide target identification
capability so that UAV-3 can carry out the actual target elimination tasks. Although this implies that UAV-2 may
not be available to carry out its own attack function, depending on the criticality of the situation, this marginal loss in
firepower may be traded for improved reconnaissance. As a result, it is important to keep in mind the immediate
needs of the mission when deciding which configuration to transition to.
Cost considerations are highlighted in Figure 7a. The re-tasking cost is the cost associated with re-tasking the
remaining systems, and it depends on: (a) the features that are either modified or added to the existing systems (such
as inclusion of high performance cameras in UAV-1 and UAV-2), and (b) the operating costs of these systems. This
metric is relatively easier to compute than the corresponding cost for the system loss state. Calculating the latter
cost is challenging as it depends on: (a) the operating costs of the systems remaining in the SoS after a nodal failure,
(b) the costs to replace the failed system (for example, deploy a new UAV to replace the failed one), as well as (c)
the costs accrued in the downtime between system failure and system replacement.
Figure 7. Impact of system failure and stand-in redundancy on (a) LoP of Capability 1 (Surveillance), (b) LoP of Capability 2 (Target
identification), and (c) LoP of Capability 3 (Target elimination)
These results indicate that resilience in SoSs can be improved without having to use traditional practices of
backing up systems. Instead, systems can be designed to contribute to SoS-level capabilities in the ideal case, and to
“stand-in” for failed functions in the event of a failure. It is also important to note that there is a limit to the level of
stand-in redundancy that can be incorporated in such systems. Returning to the notional example used in this study,
re-tasking of some systems is possible because the UAVs can be retrofitted with better cameras and imaging
mechanisms, and/or can be reprogrammed for higher revisit rates. On the other hand, UAVs used for purely
surveillance purposes cannot be easily fitted with weapons deploying capabilities, nor can the designed imaging
capabilities of the satellite be changed. This further leads to interesting implications regarding the balance between
improving the resilience of an SoS, the costs associated with these improvements, and the need to incorporate
features that not only improve the resilience but also ensure performance levels as the context of operations of the
SoS changes with time. This is especially true of large-scale SoSs, such as multi-modal transportation networks, that
are designed for long lifetimes with modifications and upgrades being incorporated in a gradual manner. Stand-in
redundancy has the potential to improve the resilience of these systems, however, features and technological
modifications that bring about stand-in redundancy need to be chosen keeping in mind the trade-offs between costs,
resilience in the face of current and future threats, and adaptability of the SoS in an uncertain future.
5. Summary and future work
Traditionally, systems have been designed to be resilient through over-design. The emergence of large-scale
systems-of-systems (SoSs) has made it hard to incorporate resilience in this manner, because the SoS itself
evolves with time along with its changing environment. Our approach indicates that incremental enhancements
and/or modifications to existing systems in these SoSs can provide inherent resilience. The concept of stand-in
redundancy allows cost-effective re-tasking and reconfiguration of SoSs. The resulting resilience capability
minimizes performance loss at the SoS level in the event of an unanticipated system failure, and thus enables
improved operability of the SoS through uncertain futures. The next step of this study is to expand this static
model into a dynamic one, using stochastic tools to design for resilience under uncertainty. Additionally, we
will track the degradation of systems over time in order to assess whether it might be beneficial to force a transition
to a different configuration before actual failure of a system (proactive resilience). While we limited the scope of
this study to single system failures, ongoing research aims at investigating the implications of stand-in redundancy
in the case of multi-system failures. These concepts, and their resulting upstream effects on development costs and
risks, can be used by decision-makers to quantitatively assess the impact on resilience of different SoS architectures
and their inherent ability to resist failures throughout the SoS lifecycle.
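The proactive-resilience idea mentioned above (forcing a transition before a system actually fails) can be sketched as a simple threshold rule. The Python fragment below is only a minimal illustration under stated assumptions: performance is assumed to decay geometrically at a fixed rate per time step, and the function name, parameters, and threshold are hypothetical rather than part of the planned model.

```python
from typing import Optional


def first_transition_step(
    initial_performance: float,
    decay_rate: float,
    threshold: float,
    max_steps: int,
) -> Optional[int]:
    """Return the first time step at which performance falls below the threshold.

    Assumes (hypothetically) geometric degradation: each step multiplies
    performance by (1 - decay_rate). Returns None if the threshold is not
    crossed within max_steps.
    """
    performance = initial_performance
    for step in range(max_steps + 1):
        if performance < threshold:
            return step  # trigger a transition to another configuration here
        performance *= 1.0 - decay_rate
    return None
```

A stochastic degradation model, as envisaged for future work, would replace the fixed decay rate with a random process and weigh the cost of an early transition against the expected loss from waiting.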
Acknowledgements
This material is based on work supported, in whole or in part, by the U.S. Department of Defense through the
Systems Engineering Research Center (SERC) under Contract H98230-08-D-0171. SERC is a federally funded
University Affiliated Research Center managed by Stevens Institute of Technology. The authors would like to thank
Dr. William Crossley for insightful discussions that enhanced this work.
References
1. B. E. White, “Fostering Intra-Organizational Communication of Enterprise Systems Engineering Practices”, National Defense Industrial
Association (NDIA), 9th Annual Systems Engineering Conference, San Diego CA, October 23-26, 2006.
2. D. DeLaurentis, W. Crossley, and M. Mane, “Taxonomy to Guide Systems-of-Systems Decision-Making in Air Transportation Problems”,
Journal of Aircraft, Vol. 48, No. 3, pp 760-770, May-June 2011.
3. B.G. McCarter and B.E. White, "Emergence of SoS, Socio-Cognitive Aspects", Chapter 3 in M. Jamshidi's book, "System of Systems
Engineering- Principles and Applications", 2007.
4. M. W. Maier, “Architecting Principles for System-of-systems”, Journal of Systems Engineering, Vol.1, No. 4, pp. 267-284, 1998.
5. E. Hollnagel, D. W. Woods, and N. Leveson, Resilience Engineering: Concepts and Precepts. Ashgate, 2010.
6. R. Neches and A. Madni, “Towards Affordably Adaptable and Effective Systems”, Journal of Systems Engineering,
doi: 10.1002/sys.21234, October 2012
7. S. Sheard and A. Mostashari, “A Framework for System Resilience Discussions”, 18
th
Annual International Symposium of INCOSE,
Utrecht, Netherlands, June 15-19, 2008.
8. A. Madni and S. Jackson, “Towards a Conceptual Framework for Resilience Engineering”, IEEE Systems Journal, Vol. 3, No. 2, pp. 181-
191, 2009.
9. S. Y. Han, K. Marais, and D. DeLaurentis, “Evaluating System of Systems Resilience using Interdependency Analysis”, IEEE International
Conference on Systems, Man, and Cybernetics, Seoul, Korea, October 14-17, 2012.
10. M. Rausand and A. Hoyland, System Reliability Theory: Models, Statistical Methods, and Applications. Second edition. New Jersey:
Wiley – Interscience, 2004.