Article

Redundanz für verfügbare Systeme (Redundancy for Available Systems)

Abstract

Availability is an aspect of automation that cannot be neglected in the design and operation of systems. Failures can lead to unforeseen problems and usually incur high costs. Redundancy concepts are therefore frequently applied in industrial applications and systems. To allow such concepts to be designed and implemented efficiently and effectively, the authors provide guidelines, based on hierarchically structured design elements, for defining requirements and for selecting and designing a suitable redundancy pattern. Using software-based standby redundancy as an example, the article also presents existing implementation alternatives and evaluates them analytically. This evaluation likewise yields guidelines for selecting a suitable alternative.

Conference Paper
Due to the reduction of structure sizes in modern embedded systems, tolerating soft errors that manifest as bit flips has become a mandatory task even for moderately critical applications. Accordingly, software-based fault tolerance mechanisms have recently gained in popularity, and a multitude of approaches have been proposed that differ in the number and frequency of tolerated errors as well as in their associated overhead. As a consequence, an application- and environment-tailored selection of mechanisms is required to balance protection and costs. To account for this diverse solution space, we propose to make software-based fault tolerance a matter of configuration that should be transparent to the applications. While this would be cumbersome when using an unsafe programming language, we show that in the context of KESO, a JVM for deeply embedded systems, this can be achieved by utilizing the Java type system and static code analysis. As an initial technique we decided to add redundant execution to KESO, which enables us to selectively and transparently replicate an application. This is a first step towards a JVM that offers reliable execution of components as demanded by the system configuration.
Conference Paper
As modern supercomputing systems reach the peta-flop performance range, they grow in both size and complexity. This makes them increasingly vulnerable to failures from a variety of causes. Checkpointing is a popular technique for tolerating such failures in that it allows applications to periodically save their state and restart the computation after a failure. Although a variety of automated system-level checkpointing solutions are currently available to HPC users, manual application-level checkpointing remains by far the most popular approach because of its superior performance. This paper focuses on improving the performance of automated checkpointing via a compiler analysis for incremental checkpointing. This analysis is shown to significantly reduce checkpoint sizes (up to 78%) and to enable asynchronous checkpointing.
Conference Paper
Given the scale of massively parallel systems, the occurrence of faults is no longer an exception but a regular event. Periodic checkpointing is becoming increasingly important in these systems. However, the huge memory footprints of parallel applications place severe limitations on the scalability of normal checkpointing techniques. Incremental checkpointing is a well-researched technique that addresses scalability concerns, but most implementations require paging support from the hardware and the underlying operating system, which may not always be available. In this paper, we propose a software-based adaptive incremental checkpoint technique which uses a secure hash function to uniquely identify changed blocks in memory. Our algorithm is the first self-optimizing algorithm that dynamically computes the optimal block boundaries, based on the history of changed blocks. This provides better opportunities for minimizing checkpoint file size. Since the hash is computed in software, we do not need any system support for this. We have implemented and tested this mechanism on the BlueGene/L system. Our results on several well-known benchmarks are encouraging, both in terms of reduction in average checkpoint file size and adaptivity towards the application's memory access patterns.
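The core idea of hash-based incremental checkpointing described above can be sketched as follows. This is a minimal illustration, not the paper's algorithm: it assumes fixed-size blocks (the paper computes block boundaries adaptively), uses SHA-256 as the secure hash, and the names `BLOCK_SIZE`, `block_hashes`, and `incremental_checkpoint` are hypothetical.

```python
import hashlib
from typing import Dict, List, Tuple

# Assumption: a fixed block size. The paper adapts block boundaries dynamically.
BLOCK_SIZE = 4096


def block_hashes(memory: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Return one SHA-256 digest per fixed-size block of the memory image."""
    return [
        hashlib.sha256(memory[start:start + block_size]).digest()
        for start in range(0, len(memory), block_size)
    ]


def incremental_checkpoint(
    memory: bytes,
    previous_hashes: List[bytes],
    block_size: int = BLOCK_SIZE,
) -> Tuple[Dict[int, bytes], List[bytes]]:
    """Return the blocks whose hash changed since the last checkpoint.

    The result is (changed_blocks, new_hashes), where changed_blocks maps a
    block index to the block's current contents.
    """
    new_hashes = block_hashes(memory, block_size)
    changed: Dict[int, bytes] = {}
    for index, digest in enumerate(new_hashes):
        is_new = index >= len(previous_hashes)
        if is_new or previous_hashes[index] != digest:
            start = index * block_size
            changed[index] = memory[start:start + block_size]
    return changed, new_hashes
```

The first call, with an empty hash list, yields a full checkpoint; later calls only return blocks that were modified in between, which is what keeps the checkpoint file small without relying on paging support.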
Article
Developers of early distributed systems took a simplistic approach to providing fault tolerance: They just used another copy of the same hardware as a backup. Later, others developed replication software to work on off-the-shelf hardware. Since neither of these methods is especially economical, a logical course is to take it one step further and eliminate the extra hardware altogether. Fully software-based replication relies on sophisticated techniques to keep track of server communications and ensure the consistency of information across several server replicas. How do you know that each server shares the same view of the data or program semantics? What happens if a server replica crashes? How do you make sure that a system processes invocations in the correct order? These are all problems that a replication technique has to handle. The authors describe two fundamental techniques, primary backup and active replication, and illustrate how they handle these problems. At this point, both have advantages and disadvantages that depend on the application. The authors also propose that group communication provides a sufficient framework for implementing software-based replication. The concept of static and dynamic groups proves useful in thinking about how to implement replication techniques. Replication techniques can also use total-order and view-synchronous multicast primitives from group communication.
Article
Over the last few years, embedded systems have been increasingly used in safety-critical applications where failure can have serious consequences. The design of these systems is a complex process that requires the integration of common design methods in both hardware and software to fulfill the functional and non-functional requirements of these safety-critical applications. Design patterns, which give abstract solutions to commonly recurring design problems, have been widely used in the software and hardware domain. In this thesis, the concept of design patterns is adopted in the design of safety-critical embedded systems. A catalog of design patterns was constructed to support the design of safety-critical embedded systems. This catalog includes a set of hardware and software design patterns which cover common design problems such as handling of random and systematic faults, safety monitoring, and sequence control. Furthermore, the catalog provides a decision support component that supports the process of choosing a suitable pattern for a particular problem based on the available resources and the requirements of the applicable patterns. As non-functional requirements are an important aspect in the design of safety-critical embedded systems, this work focuses on the integration of implications on non-functional properties in the existing design pattern concept. A pattern representation is proposed for safety-critical embedded application design methods by including fields for the implications and side effects of the represented design pattern on the non-functional requirements of the systems. The considered requirements include safety, reliability, modifiability, cost, and execution time. Safety and reliability represent the main non-functional requirements that should be provided in the design of safety-critical applications. Thus, reliability and safety assessment methods are proposed to show the relative safety and reliability improvement which can be achieved when using the design patterns under consideration. Moreover, a Monte Carlo based simulation method is used to illustrate the proposed assessment method, which allows comparing different design patterns with respect to their impact on safety and reliability.
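To illustrate how a Monte Carlo simulation can compare design patterns with respect to reliability, the following sketch estimates the mission reliability of a single (simplex) channel against a two-out-of-three voted arrangement. It is not the thesis's assessment method: it assumes independent channel failures, a perfect voter, and a hypothetical per-channel reliability passed in as `channel_reliability`; the function name `simulate` is likewise an assumption.

```python
import random
from typing import Tuple


def simulate(channel_reliability: float, trials: int, seed: int = 0) -> Tuple[float, float]:
    """Estimate (simplex, two-out-of-three) reliability by random sampling.

    Assumptions: channels fail independently and the voter never fails.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    simplex_ok = 0
    voted_ok = 0
    for _ in range(trials):
        # True means the channel survived the mission in this trial.
        working = [rng.random() < channel_reliability for _ in range(3)]
        simplex_ok += working[0]
        voted_ok += sum(working) >= 2
    return simplex_ok / trials, voted_ok / trials


if __name__ == "__main__":
    r = 0.9
    simplex, voted = simulate(r, trials=20000)
    analytic = 3 * r**2 - 2 * r**3  # closed form for two-out-of-three
    print(f"simplex={simplex:.3f} voted={voted:.3f} analytic={analytic:.3f}")
```

Under these assumptions the estimate for the voted arrangement should approach the closed form 3r² − 2r³, which gives a quick sanity check on the simulation before it is applied to patterns without a simple analytic solution.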
Article
The loss of hardware fault tolerance which often arises when design diversity is used to improve the fault tolerance of computer software is considered analytically, and a unified design approach is proposed to avoid the problem. The fundamental theory of fault-tolerant (FT) architectures is reviewed; the current status of design-diversity software development is surveyed; and the FT-processor/attached-processor (FTP/AP) architecture developed by Lala et al. (1986) is described in detail and illustrated with diagrams. FTP/AP is shown to permit efficient implementation of N-version FT software while still tolerating random hardware failures with very high coverage; the reliability is found to be significantly higher than that of conventional majority-vote N-version software.
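The conventional majority-vote N-version scheme that this abstract uses as its baseline can be sketched as below. This is a generic illustration, not the FTP/AP architecture: it assumes that version outputs are hashable and compared for exact equality, and `majority_vote` is a hypothetical name.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Hashable:
    """Return the value produced by a strict majority of versions.

    Raises RuntimeError when no value is produced by more than half of the
    versions, so the caller can fall back to a safe state.
    """
    if not outputs:
        raise ValueError("no outputs to vote on")
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no majority among version outputs")
    return value
```

With three versions, one faulty output is masked (`majority_vote([4, 4, 5])` returns `4`), while three mutually different outputs raise an error instead of silently picking one.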
Conference Paper
The flight control system for the Boeing 777 airplane is a Fly-By-Wire (FBW) system. The FBW system must meet extremely high levels of functional integrity and availability. The heart of the FBW concept is the use of triple redundancy for all hardware resources: computing system, airplane electrical power, hydraulic power and communication path. The Primary Flight Computer (PFC) is the central computation element of the FBW system. The triple modular redundancy (TMR) concept also applies to the PFC architectural design. Further, the N-version dissimilarity issue is integrated to the TMR concept. The PFCs consist of three similar channels (of the same part number), and each channel contains three dissimilar computation lanes. The 777 program design is to select the ARINC 629 bus as the communication media for the FBW
Vision Solutions, "Assessing the Financial Impact of Downtime," Aug. 2010
M. Steiger, "Fault-Tolerant Turbine Controller," ETH Zurich, 2008
International Electrotechnical Commission (IEC), "IEC/EN 61508: Funktionale Sicherheit sicherheitsbezogener elektrischer/elektronischer/programmierbar elektronischer Systeme, Edition 2.0," Apr. 2010
Y. C. Yeh, "Triple-triple redundant 777 primary flight computer," in IEEE, 1996, vol. 1, pp. 293-307
I. Thomm, M. Stilkerich, R. Kapitza, W. Schröder-Preikschat, and D. Lohmann, "Automated application of fault tolerance mechanisms in a component-based system," Proc. 9th Int. Work. Java Technol. Real-Time Embed. Syst., pp. 87-95, Sep. 2011
G. Bronevetsky and D. Marques, "Compiler-enhanced incremental checkpointing," in Languages and Compilers for Parallel Computing, Springer, 2008, pp. 1-15