Mean Time Between Failure: Explanation and Standards

White Paper 78, Revision 1
by Wendy Torell and Victor Avelar
Contents
Introduction
What is a failure? What are the assumptions?
Reliability, availability, MTBF, and MTTR defined
Methods of predicting and estimating MTBF
Conclusion
Resources
Executive summary

Mean time between failure (MTBF) is a reliability term used loosely throughout many industries and has become widely abused in some. Over the years the original meaning of the term has been altered, which has led to confusion and cynicism. MTBF depends heavily on assumptions and on the definition of failure, and attention to these details is paramount to proper interpretation. This paper explains the underlying complexities and misconceptions of MTBF and the methods available for estimating it.
Introduction
Mean time between failure (MTBF) has been used for over 60 years as a basis for various decisions. Over the years, more than 20 methods and procedures for lifecycle predictions have been developed. It is therefore no wonder that MTBF has been the daunting subject of endless debate. One area in particular where this is evident is the design of mission critical facilities that house IT and telecommunications equipment. When minutes of downtime can negatively impact the market value of a business, it is crucial that the physical infrastructure supporting this networking environment be reliable. The business reliability target may not be achieved without a solid understanding of MTBF. This paper explains every aspect of MTBF, using examples throughout, in an effort to simplify complexity and clarify misconceptions.
What is a failure? What are the assumptions?

These questions should be asked immediately upon reviewing any MTBF value. Without the answers to these questions, the discussion holds little value. MTBF is often quoted without providing a definition of failure. This practice is not only misleading but completely useless. A similar practice would be to advertise the fuel efficiency of an automobile as "miles per tank" without defining the capacity of the tank in liters or gallons. To address this ambiguity, one could argue there are two basic definitions of a failure:

1. The termination of the ability of the product as a whole to perform its required function [1].
2. The termination of the ability of any individual component to perform its required function, but not the termination of the ability of the product as a whole to perform [2].
The following two examples illustrate how a particular failure mode in a product may or may
not be classified as a failure, depending on the definition chosen.
Example 1:
If a redundant disk in a RAID array fails, the failure does not prevent the RAID array from
performing its required function of supplying critical data at any time. However, the disk
failure does prevent a component of the disk array from performing its required function of
supplying storage capacity. Therefore, according to definition 1, this is not a failure, but
according to definition 2, it is a failure.
Example 2:
If the inverter of a UPS fails and the UPS switches to static bypass, the failure does not prevent the UPS from performing its required function, which is supplying power to the critical
load. However, the inverter failure does prevent a component of the UPS from performing its
required function of supplying conditioned power. Similar to the previous example, this is
only a failure by the second definition.
If there existed only two definitions, then defining a failure would seem rather simple. Unfortunately, when product reputation is on the line, the matter becomes almost as complicated as MTBF itself. In reality there are more than two definitions of failure; in fact, they are infinite. Depending on the type of product, manufacturers may have numerous definitions of failure. Manufacturers that are quality driven track all modes of failure for the purpose of process control which, among other benefits, drives out product defects. Therefore, additional questions are needed to accurately define a failure.
[1] IEC-50
[2] IEC-50
Is customer misapplication considered a failure? There may have been human factors that
designers overlooked, leading to the propensity for users to misapply the product. Are load
drops caused by a vendor’s service technician counted as a failure? Is it possible that the
product design itself increases the failure probability of an already risky procedure? If a light-emitting diode (LED) on a computer were to fail, is it considered a failure even though it hasn't impacted the operation of the computer? Is the expected wear-out of a consumable item
such as a battery considered a failure if it failed prematurely? Are shipping damages
considered failures? This could indicate a poor packaging design. Clearly, the importance of
defining a failure should be evident and must be understood before attempting to interpret
any MTBF value. Questions like those above provide the bedrock upon which reliability
decisions can be made.
It is said that engineers are never wrong; they just make bad assumptions. The same can be
said for those who estimate MTBF values. Assumptions are required to simplify the process
of estimating MTBF. It would be nearly impossible to collect the data required to calculate an
exact number. However, all assumptions must be realistic. Throughout the paper common
assumptions used in estimating MTBF are described.
Reliability, availability, MTBF, and MTTR defined

MTBF impacts both reliability and availability. Before MTBF methods can be explained, it is important to have a solid foundation in these concepts. The difference between reliability and availability is often unknown or misunderstood. High availability and high reliability often go hand in hand, but they are not interchangeable terms.
Reliability is the ability of a system or component to perform its required functions under
stated conditions for a specified period of time [IEEE 90].
In other words, it is the likelihood that the system or component will succeed within its
identified mission time, with no failures. An aircraft mission is the perfect example to illustrate
this concept. When an aircraft takes off for its mission, there is one goal in mind: complete
the flight, as intended, safely (with no catastrophic failures).
Availability, on the other hand, is the degree to which a system or component is operational
and accessible when required for use [IEEE 90].
It can be viewed as the likelihood that the system or component is in a state to perform its
required function under given conditions at a given instant in time. Availability is determined
by a system’s reliability, as well as its recovery time when a failure does occur. When
systems have long continuous operating times (for example, a 10-year data center), failures
are inevitable. Availability is often looked at because, when a failure does occur, the critical
variable now becomes how quickly the system can be recovered. In the data center example, having a reliable system design is the most critical variable, but when a failure occurs, the most important consideration must be getting the IT equipment and business processes up and running as fast as possible to keep downtime to a minimum.
MTBF, or mean time between failure, is a basic measure of a system’s reliability. It is
typically represented in units of hours. The higher the MTBF number is, the higher the
reliability of the product. Equation 1 illustrates this relationship.
Equation 1:
=MTBF
Time
eyReliabilit
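As a rough illustration of Equation 1 (the MTBF and mission time below are arbitrary examples, not data for any particular product), the following sketch estimates the probability of surviving a one-year mission under a constant failure rate:

```python
import math

def reliability(mtbf_hours: float, mission_hours: float) -> float:
    """Probability of operating for mission_hours without failure,
    assuming a constant failure rate (Equation 1)."""
    return math.exp(-mission_hours / mtbf_hours)

# Hypothetical product with a 500,000-hour MTBF running continuously
# for one year (8,760 hours).
print(reliability(500_000, 8_760))   # ~0.9826, i.e., ~98.3% chance of no failure
```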
A common misconception about MTBF is that it is equivalent to the expected number of operating hours before a system fails, or the "service life". It is not uncommon, however, to see an MTBF number on the order of 1 million hours, and it would be unrealistic to think the system could actually operate continuously for over 100 years without a failure. The reason these numbers are often so high is that they are based on the product's rate of failure while still in its "useful life" or "normal life", and it is assumed that the product will continue to fail at this rate indefinitely. While in this phase of its life, the product is experiencing its lowest (and constant) rate of failure. In reality, wear-out modes of the product would limit its life much earlier than its MTBF figure. Therefore, no direct correlation should be made between the service life of a product and its failure rate or MTBF. It is quite feasible to have a product with extremely high reliability (MTBF) but a low expected service life. Take, for example, a human being:
- There are 500,000 25-year-old humans in the sample population.
- Over the course of a year, data is collected on failures (deaths) for this population.
- The operational life of the population is 500,000 x 1 year = 500,000 people-years.
- Throughout the year, 625 people failed (died).
- The failure rate is 625 failures / 500,000 people-years = 0.00125 failures per person-year, or 0.125% per year.
- The MTBF is the inverse of the failure rate, or 1 / 0.00125 = 800 years.
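The arithmetic above can be reproduced directly; the sketch below simply restates the calculation with the same figures from the example:

```python
population = 500_000      # 25-year-old humans in the sample
observed_years = 1        # length of the observation period
failures = 625            # deaths recorded during that year

operational_life = population * observed_years    # 500,000 people-years
failure_rate = failures / operational_life        # 0.00125 per year (0.125% / year)
mtbf_years = 1 / failure_rate                     # 800 years

print(failure_rate, mtbf_years)                   # 0.00125  800.0
```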
So, even though 25-year-old humans have high MTBF values, their life expectancy (service
life) is much shorter and does not correlate.
The reality is that human beings do not exhibit constant failure rates. As people get older, more failures occur (they wear out). Therefore, the only true way to compute an MTBF that would equate to service life would be to wait for the entire sample population of 25-year-old humans to reach their end of life. Then the average of these life spans could be computed. Most would agree that this number would be on the order of 75-80 years.
So, what is the MTBF of 25-year-old humans, 80 or 800? It's both! But how can the same population end up with two such drastically different MTBF values? It's all about assumptions!
If the MTBF of 80 years more accurately reflects the life of the product (humans in this case),
is this the better method? Clearly, it’s more intuitive. However, there are many variables that
limit the practicality of using this method with commercial products such as UPS systems.
The biggest limitation is time. In order to do this, the entire sample population would have to
fail, and for many products this is on the order of 10-15 years. In addition, even if it were
sensible to wait this long before calculating the MTBF, problems would be encountered in tracking products. For example, how would a manufacturer know whether products were still in service if units taken out of service are never reported?
Lastly, even if all of the above were possible, technology is changing so fast that by the time the number was available, it would be useless. Who would want the MTBF value of a product that has been superseded by several generations of technology updates?
MTTR, or mean time to repair (or recover), is the expected time to recover a system from a
failure. This may include the time it takes to diagnose the problem, the time it takes to get a
repair technician onsite, and the time it takes to physically repair the system. Similar to
MTBF, MTTR is represented in units of hours. As Equation 2 shows, MTTR impacts availability and not reliability. The longer the MTTR, the worse off a system is. Simply put, if it takes longer to recover a system from a failure, the system is going to have a lower availability.
The formula below illustrates how both MTBF and MTTR impact the overall availability of a
system. As the MTBF goes up, availability goes up. As the MTTR goes up, availability goes
down.
Equation 2:
Availability = MTBF / (MTBF + MTTR)
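To make the relationship concrete, the sketch below applies Equation 2 with arbitrary (hypothetical) numbers and converts the result into expected downtime per year:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from Equation 2."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical system: 100,000-hour MTBF, 8-hour MTTR.
a = availability(100_000, 8)
print(a)                    # ~0.99992
print((1 - a) * 8_760)      # expected downtime: ~0.7 hours per year
```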
For Equation 1 and Equation 2 above to be valid, a basic assumption must be made when analyzing the MTBF of a system. Unlike mechanical systems, most electronic systems don't have moving parts. As a result, it is generally accepted that electronic systems or components exhibit constant failure rates during their useful operating life. Figure 1, referred to as the failure rate "bathtub curve", illustrates the origin of this constant failure rate assumption mentioned previously. The "normal operating period" or "useful life period" of this curve is the stage at which a product is in use in the field. This is where product quality has leveled off to a constant failure rate with respect to time. The sources of failures at this stage could include undetectable defects, low design safety factors, higher random stress than expected, human factors, and natural failures. Ample burn-in periods for components by the manufacturers, proper maintenance, and proactive replacement of worn parts should prevent the type of rapid decay curve shown in the "wear out period".

The discussion above provides some background on the concepts and differences of reliability and availability, allowing for the proper interpretation of MTBF. The next section discusses the various MTBF prediction methods.
[Figure 1 - Bathtub curve illustrating the rate of failure over time: an early failure period, a normal operating (useful life) period with a constant failure rate, and a wear-out period.]
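The three regions of the bathtub curve are commonly modeled with Weibull hazard functions; this is a standard textbook device rather than something prescribed by this paper, and the shape and scale parameters below are arbitrary. A shape parameter less than 1 gives a decreasing (early failure) rate, a shape of 1 gives the constant rate of the useful life period, and a shape greater than 1 gives the increasing (wear-out) rate:

```python
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Instantaneous failure rate h(t) for a Weibull distribution
    with shape beta and scale eta (hours)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

# Decreasing, constant, and increasing failure rates at a few points in time.
for beta, label in [(0.5, "early failure"), (1.0, "useful life"), (3.0, "wear out")]:
    rates = [weibull_hazard(t, beta, eta=50_000) for t in (1_000, 10_000, 40_000)]
    print(label, [f"{r:.2e}" for r in rates])
```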
Methods of predicting and estimating MTBF

Oftentimes the terms "prediction" and "estimation" are used interchangeably; however, this is not correct. Methods that predict MTBF calculate a value based only on a system design, and are usually applied early in the product lifecycle. Prediction methods are useful when field data is scarce or non-existent, as in the case of the space shuttle or new product designs. When sufficient field data exists, prediction methods should not be used. Rather, methods that estimate MTBF should be used, because they represent actual measurements of failures. Methods that estimate MTBF calculate a value based on an observed sample of similar systems, and are usually applied after a large population has been deployed in the field. MTBF estimation is by far the most widely used method of calculating MTBF, mainly because it is based on real products experiencing actual usage in the field.
All of these methods are statistical in nature, which means they provide only an approxima-
tion of the actual MTBF. No one method is standardized across an industry. It is, therefore,
critical that the manufacturer understands and chooses the best method for the given
application. The methods presented below, although not a complete list, illustrate the
breadth of ways MTBF can be derived.
Reliability prediction methods
The earliest methods of reliability prediction came about in the 1940s with the German scientist Wernher von Braun and the German mathematician Eric Pieruschka. While trying to resolve numerous reliability problems with the V-1 rocket, Pieruschka assisted von Braun in modeling the reliability of his rocket, thereby creating the first documented modern predictive reliability model. Subsequently, NASA programs, along with the growth of the nuclear industry, prompted additional maturation in the field of reliability analysis. Today, there are numerous methods for predicting MTBF.
MIL-HDBK 217
Published by the U.S. military in 1965, Military Handbook 217 was created to provide a standard for estimating the reliability of electronic military equipment and systems, so as to increase the reliability of the equipment being designed. It sets common ground for comparing the reliability of two or more similar designs. Military Handbook 217 is also referred to as Mil Standard 217, or simply 217. There are two ways that reliability is predicted under 217: Parts Count Prediction and Parts Stress Analysis Prediction.
Parts Count Prediction is generally used early in the product development cycle to obtain a rough reliability estimate relative to the reliability goal or specification. A failure rate is calculated by literally counting similar components of a product (e.g., capacitors) and grouping them into the various component types (e.g., film capacitors). The number of components in each group is then multiplied by a generic failure rate and quality factor found in 217. Lastly, the failure rates of all the different part groups are added together for the final failure rate. By definition, Parts Count assumes all components are in series and requires that failure rates for non-series components be calculated separately.
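A minimal sketch of the Parts Count bookkeeping is shown below. The component groups, generic failure rates, and quality factors are invented placeholders, not values taken from MIL-HDBK-217; the point is only the mechanics of multiplying each group count by a generic rate and quality factor and summing the results:

```python
# (quantity, generic failure rate in failures per 1e6 hours, quality factor)
# All numbers are illustrative placeholders, not MIL-HDBK-217 data.
part_groups = {
    "film capacitor":      (120, 0.002, 1.0),
    "metal film resistor": (300, 0.001, 1.0),
    "power MOSFET":        (24,  0.012, 2.0),
}

# Parts Count assumes a series system: the product fails if any part fails,
# so the group failure rates simply add.
total_rate = sum(qty * lam * pi_q for qty, lam, pi_q in part_groups.values())
mtbf_hours = 1e6 / total_rate    # convert failures per million hours back to hours

print(f"{total_rate:.3f} failures per million hours, MTBF ~ {mtbf_hours:,.0f} h")
```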
Parts Stress Analysis Prediction is usually used much later in the product development cycle, when the design of the actual circuits and hardware is nearing production. It is similar to Parts Count in the way the failure rates are summed together. However, with Parts Stress, the failure rate for each and every component is individually calculated based on the specific stress levels the component is subjected to (e.g., humidity, temperature, vibration, voltage). In order to assign the proper stress levels to each component, a product design and its expected environment must be well documented and understood. The Parts Stress method usually yields a lower failure rate than the Parts Count method. Due to the level of analysis required, this method is extremely time consuming compared to other methods.
Today, 217 is rarely used. In 1996 the U.S. Army announced that the use of MIL-HDBK-217 should be discontinued because it "has been shown to be unreliable, and its use can lead to erroneous and misleading reliability predictions" [3]. 217 has been cast off for many reasons, most of which have to do with the fact that component reliability has improved greatly over the years, to the point where it is no longer the main driver in product failures. The failure rates given in 217 are more conservative (higher) than those of the electronic components available today. A thorough investigation of the failures in today's electronic products would reveal that failures are most likely caused by misapplication (human error), process control issues, or product design.
[3] Cushing, M., Krolewski, J., Stadterman, T., and Hum, B., "U.S. Army Reliability Standardization Improvement Policy and Its Impact", IEEE Transactions on Components, Packaging, and Manufacturing Technology, Part A, Vol. 19, No. 2, pp. 277-278, 1996
Telcordia
The Telcordia reliability prediction model evolved from the telecom industry and has gone through a series of changes over the years. It was first developed by Bellcore Communications Research under the name Bellcore as a means to estimate telecom equipment reliability. Although Bellcore was based on 217, its reliability models (equations) were changed in 1985 to reflect field experience with telecom equipment. The last revision under the Bellcore name was TR-332 Issue 6, dated December 1997. SAIC subsequently bought Bellcore in 1997 and renamed it Telcordia. The latest revision of the Telcordia prediction model is SR-332 Issue 1, released in May 2001, which offers various calculation methods in addition to those of 217. Today, Telcordia continues to be applied as a product design tool within this industry.
HRD5
HRD5 is the Handbook for Reliability Data for Electronic Components used in telecommunication systems. It was developed by British Telecom and is used mainly in the United Kingdom. It is similar to 217 but does not cover as many environmental variables, and it provides a reliability prediction model that covers a wider array of electronic components, including telecom parts.
RBD (Reliability Block Diagram)
The Reliability Block Diagram, or RBD, is a representative drawing and calculation tool used to model system availability and reliability. The structure of a reliability block diagram defines the logical interaction of failures within a system, not necessarily the logical or physical connections between components. Each block can represent an individual component, subsystem, or other representative failure. The diagram can represent an entire system or any subset or combination of that system which requires failure, reliability, or availability analysis. It also serves as an analysis tool to show how each element of a system functions, and how each element can affect the system operation as a whole.
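The sketch below shows the two building blocks of most RBD calculations, series and parallel (redundant) paths, using arbitrary block reliabilities. Real RBD tools handle far more complex structures; this only illustrates how the logical block structure, rather than the physical wiring, drives the math:

```python
from functools import reduce

def series(blocks):
    """All blocks must work: multiply block reliabilities."""
    return reduce(lambda acc, r: acc * r, blocks, 1.0)

def parallel(blocks):
    """At least one block must work: 1 minus the product of unreliabilities."""
    return 1.0 - reduce(lambda acc, r: acc * (1.0 - r), blocks, 1.0)

# Hypothetical example: two redundant UPS paths (0.999 each) feeding a
# single distribution path (0.9995) in series.
ups_pair = parallel([0.999, 0.999])      # 0.999999
print(series([ups_pair, 0.9995]))        # ~0.99950
```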
Markov Model
Markov modeling provides the ability to analyze complex systems such as electrical architectures. Markov models are also known as state space diagrams or state graphs. State space is defined as the collection of all of the states a system can be in. Unlike block diagrams, state graphs provide a more accurate representation of a system. The use of state graphs accounts for component failure dependencies as well as various states that block diagrams cannot represent, such as the state of a UPS being on battery. In addition to MTBF, Markov models provide various other measures of a system, including availability, MTTR, the probability of being in a given state at a given time, and many others.
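A minimal two-state Markov (state space) sketch is shown below: the system is either up or down, with a constant failure rate of 1/MTBF and a repair rate of 1/MTTR, both arbitrary example values. Solving the steady-state balance equations reproduces Equation 2; real Markov models of UPS architectures add many more states (on battery, on bypass, and so on):

```python
import numpy as np

mtbf, mttr = 100_000.0, 8.0           # hours (hypothetical)
lam, mu = 1.0 / mtbf, 1.0 / mttr      # failure and repair rates

# Generator matrix Q for states [up, down]; each row sums to zero.
Q = np.array([[-lam,  lam],
              [  mu,  -mu]])

# Steady-state probabilities pi solve pi @ Q = 0 with sum(pi) = 1.
A = np.vstack([Q.T, np.ones(2)])
b = np.array([0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi[0])    # availability, ~0.99992 (matches Equation 2 for these rates)
```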
FMEA / FMECA
Failure Mode and Effects Analysis (FMEA) is a process used for analyzing the failure modes of a product. This information is then used to determine the impact each failure would have on the product, thereby leading to an improved product design. The analysis can go a step further by assigning a severity level to each of the failure modes, in which case it is called an FMECA (Failure Mode, Effects and Criticality Analysis). FMEA uses a bottom-up approach. For instance, in the case of a UPS, the analysis starts with circuit-board-level components and works its way up to the entire system. Apart from being used as a product design tool, it can be used to calculate the reliability of the overall system. The probability data needed for the calculations can be difficult to obtain for various pieces of equipment, especially if they have multiple states or modes of operation.
Fault Tree
Fault tree analysis is a technique that was developed by Bell Telephone Laboratories to perform safety assessments of the Minuteman Launch Control System. It was later applied to reliability analysis. Fault trees can help detail the path of events, both normal and fault related, that lead down to the component-level fault or undesired event being investigated (a top-down approach). Reliability is calculated by converting a completed fault tree into an equivalent set of equations. This is done using the algebra of events, also referred to as Boolean algebra. Like FMEA, the probability data needed for the calculations can be difficult to obtain.
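As a rough illustration of the Boolean algebra step (with invented probabilities, not data for any real system), the sketch below evaluates a tiny fault tree: the top event occurs if both redundant supplies fail (AND gate) or if a single control failure occurs (OR gate), assuming independent basic events:

```python
def and_gate(probs):
    """All inputs must fail: multiply probabilities (independent events)."""
    p = 1.0
    for x in probs:
        p *= x
    return p

def or_gate(probs):
    """Any one input failing triggers the gate: 1 - product of (1 - p)."""
    p = 1.0
    for x in probs:
        p *= (1.0 - x)
    return 1.0 - p

# Hypothetical basic-event probabilities over the mission time.
supply_a, supply_b, control = 0.01, 0.01, 0.001
top_event = or_gate([and_gate([supply_a, supply_b]), control])
print(top_event)   # ~0.0011: dominated by the single, non-redundant control failure
```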
HALT
Highly Accelerated Life Testing (HALT) is a method used to increase the overall reliability of a product design. HALT is used to establish how long it takes to reach the literal breaking point of a product by subjecting it to carefully measured and controlled stresses such as temperature and vibration. A mathematical model is used to estimate the actual amount of time it would have taken the product to fail in the field. Although HALT can estimate MTBF, its main function is to improve product design reliability.
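The specific models used in accelerated testing programs vary by manufacturer and stress type. One widely used temperature-acceleration model, shown here purely as an illustration (not necessarily the model referenced above, and with hypothetical parameter values), is the Arrhenius relationship, which scales time-to-failure observed at an elevated test temperature back to field conditions:

```python
import math

BOLTZMANN_EV = 8.617e-5    # Boltzmann constant in eV per kelvin

def arrhenius_af(ea_ev: float, t_use_c: float, t_stress_c: float) -> float:
    """Acceleration factor between field and stress temperatures."""
    t_use = t_use_c + 273.15
    t_stress = t_stress_c + 273.15
    return math.exp((ea_ev / BOLTZMANN_EV) * (1.0 / t_use - 1.0 / t_stress))

# Hypothetical: 0.7 eV activation energy, 40 C field, 85 C test chamber.
af = arrhenius_af(0.7, 40.0, 85.0)
print(af)    # each test hour corresponds to roughly 26 field hours here
```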
Reliability estimation methods
Similar Item Prediction Method
This method provides a quick means of estimating reliability based on historical reliability
data of a similar item. The effectiveness of this method is mostly dependent on how similar
the new equipment is to the existing equipment for which field data is available. Similarity
should exist between manufacturing processes, operating environments, product functions
and designs. For products that follow an evolutionary path, this prediction method is especially useful since it takes advantage of past field experience. However, differences in new designs should be carefully investigated and accounted for in the final prediction.
Field Data Measurement Method
The field data measurement method is based on the actual field experience of products. This method is perhaps the one most used by manufacturers, since it is an integral part of their quality control programs. These programs are often referred to as Reliability Growth Management. By tracking the failure rate of products in the field, a manufacturer can quickly identify and address problems, thereby driving out product defects. Because it is based on real field failures, this method accounts for failure modes that prediction methods sometimes miss. The method consists of tracking a sample population of new products and gathering the failure data. Once the data is gathered, the failure rate and MTBF are calculated. The failure rate is the percentage of a population of units that are expected to "fail" in a calendar year. In addition to using this data for quality control, manufacturers also use it to provide customers and partners with information about their product reliability and quality processes. Given that this method is so widely used by manufacturers, it provides a common ground for comparing MTBF values. These comparisons allow users to evaluate relative reliability differences between products, which provides a tool for making specification or purchasing decisions. As in any comparison, it is imperative that critical variables be the same for all systems being compared. When this is not the case, wrong decisions are likely to be made, which can result in a negative financial impact. For more information on comparing relative MTBF values, see White Paper 112, Performing Effective MTBF Comparisons for Data Center Infrastructure.
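The conversion from an observed annual failure rate to an MTBF figure is straightforward; the sketch below uses made-up fleet numbers to show why modest annual failure percentages translate into MTBF values of a million hours or more, echoing the misconception discussed earlier:

```python
HOURS_PER_YEAR = 8_760

def field_mtbf_hours(units_in_service: int, years_observed: float, failures: int) -> float:
    """MTBF estimated from field data, assuming a constant failure rate."""
    unit_years = units_in_service * years_observed
    annual_failure_rate = failures / unit_years     # failures per unit-year
    return HOURS_PER_YEAR / annual_failure_rate

# Hypothetical fleet: 10,000 units tracked for one year, 50 failures recorded.
print(field_mtbf_hours(10_000, 1.0, 50))   # 0.5% per year -> MTBF of ~1.75 million hours
```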
Conclusion
MTBF is a "buzz word" commonly used in the IT industry. Numbers are thrown around without an understanding of what they truly represent. While MTBF is an indication of reliability, it does not represent the expected service life of the product. Ultimately, an MTBF value is meaningless if failure is undefined and assumptions are unrealistic or altogether missing.
About the authors
Wendy Torell is a Strategic Research Analyst at Schneider Electric in West Kingston, RI. She consults with clients on availability science approaches and design practices to optimize the availability of their data center environments. She received her Bachelor's degree in Mechanical Engineering from Union College in Schenectady, NY, and her MBA from the University of Rhode Island. Wendy is an ASQ Certified Reliability Engineer.
Victor Avelar is a Senior Research Analyst at Schneider Electric. He is responsible for data
center design and operations research, and consults with clients on risk assessment and
design practices to optimize the availability and efficiency of their data center environments.
Victor holds a Bachelor’s degree in Mechanical Engineering from Rensselaer Polytechnic
Institute and an MBA from Babson College. He is a member of AFCOM and the American
Society for Quality.
Resources
Performing Effective MTBF Comparisons for Data Center Infrastructure - White Paper 112
White Paper Library - whitepapers.apc.com
TradeOff Tools™ - tools.apc.com
References

1. Pecht, M.G., Nash, F.R., "Predicting the Reliability of Electronic Equipment", Proceedings of the IEEE, Vol. 82, No. 7, July 1994
2. Leonard, C., "MIL-HDBK-217: It's Time To Rethink It", Electronic Design, October 24, 1991
3. http://www.markov-model.com
4. MIL-HDBK-338B, Electronic Reliability Design Handbook, October 1, 1998
5. IEEE 90 - Institute of Electrical and Electronics Engineers, IEEE Standard Computer Dictionary: A Compilation of IEEE Standard Computer Glossaries, New York, NY: 1990
Contact us
For feedback and comments about the content of this white paper:
Data Center Science Center, DCSC@Schneider-Electric.com
If you are a customer and have questions specific to your data center project:
Contact your Schneider Electric representative