... The 2PC protocol is the most widely used ACP [2]. We recall the basics of 2PC because a precise understanding of its major steps is required for this paper. ...
... Otherwise, the decision is Abort. When a participant receives the final decision, it sends back an acknowledgment [2,9,6]. ...
... If a crash prevents a participant from conforming to the decision, the coordinator forward-recovers the corresponding branch. The idea of One-Phase Commit (1PC) was first suggested in [2]. 1PC eliminates the need for verification during protocol execution by assuming the following properties are already guaranteed at commit time at every participant [6,7]: 1. ...
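To make the recalled steps concrete, the following is a minimal sketch of the two phases of 2PC in Python. The Participant class and the coordinator-side two_phase_commit function are illustrative assumptions that stand in for the protocol's prepare/vote and decision/acknowledgment messages, not for any particular implementation.

# Minimal sketch of the two phases of 2PC (hypothetical names, not a real system).
from enum import Enum

class Vote(Enum):
    YES = "yes"      # participant is prepared and can commit
    NO = "no"        # participant must abort

class Decision(Enum):
    COMMIT = "commit"
    ABORT = "abort"

class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.decision = None

    def prepare(self) -> Vote:
        # Phase 1: vote YES only if this participant can guarantee to commit.
        return Vote.YES if self.can_commit else Vote.NO

    def decide(self, decision: Decision) -> str:
        # Phase 2: apply the coordinator's decision and acknowledge it.
        self.decision = decision
        return "ack"

def two_phase_commit(participants) -> Decision:
    # Phase 1 (voting): collect votes; any NO vote forces the Abort decision.
    votes = [p.prepare() for p in participants]
    decision = Decision.COMMIT if all(v == Vote.YES for v in votes) else Decision.ABORT
    # Phase 2 (decision): broadcast the outcome and collect acknowledgments.
    acks = [p.decide(decision) for p in participants]
    assert all(a == "ack" for a in acks)
    return decision

if __name__ == "__main__":
    print(two_phase_commit([Participant("A"), Participant("B")]))          # Decision.COMMIT
    print(two_phase_commit([Participant("A"), Participant("B", False)]))   # Decision.ABORT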
Databases are increasingly hosted on lightweight devices such as PDAs, laptops, and cellular phones. In a mobile environment, transactions may be initiated by mobile units (MU) and distributed among mobile or fixed units. Mobile transactions face two main problems: an MU may move during a transaction, and it may become disconnected. In this paper we analyze the atomic transaction commit protocols Two-Phase Commit (2PC), Transaction Commit on Timeout (TCOT), Mobile 2PC (M-2PC), and Unilateral Commit for Mobile Protocol (UCM) used in mobile environments and provide a comparison between them.
... Another challenge arises from the varying mining rates across different shards when processing cross-shard transactions (cross-txs), which may lead to inconsistent cross-shard asset transfers. The standard atomic commitment protocol Two-Phase Commit (2PC) [20], employed by existing sharding protocols such as OmniLedger [6], is unsuitable for Manifoldchain. For instance, 2PC requires the verification process of a cross-tx to pause until confirmed vote messages are received from all involved shards, introducing prohibitive latency for confirming cross-txs. ...
... Another challenge is to develop an asynchronous atomic commitment mechanism to ensure atomic multi-input multi-output UTXO transfers across shards under various mining rates. The standard atomic commitment protocol 2PC [20] incorporates a Voting Phase and a Decision Phase to handle this problem. Specifically, a coordinator node prompts participants to vote and broadcasts a commit message if all vote to commit; otherwise, it broadcasts an abort message if any votes to abort. ...
Bandwidth limitation is the major bottleneck that hinders scaling throughput of proof-of-work blockchains. To guarantee security, the mining rate of the blockchain is determined by the miners with the lowest bandwidth, resulting in an inefficient bandwidth utilization among fast miners. We propose Manifoldchain, an innovative blockchain sharding protocol that alleviates the impact of slow miners to maximize blockchain throughput. Manifoldchain utilizes a bandwidth-clustered shard formation mechanism that groups miners with similar bandwidths into the same shard. Consequently, this approach enables us to set an optimal mining rate for each shard based on its bandwidth, effectively reducing the waiting time caused by slow miners. Nevertheless, the adversary could corrupt miners with similar bandwidths, thereby concentrating hashing power and potentially creating an adversarial majority within a single shard. To counter this adversarial strategy, we introduce sharing mining, allowing the honest mining power of the entire network to participate in the secure ledger formation of each shard, thereby achieving the same level of security as an unsharded blockchain. Additionally, we introduce an asynchronous atomic commitment mechanism to ensure transaction atomicity across shards with various mining rates. Our theoretical analysis demonstrates that Manifoldchain scales linearly in throughput with the increase in shard numbers and inversely with network delay in each shard. We implement a full system prototype of Manifoldchain, comprehensively evaluated on both simulated and real-world testbeds. These experiments validate its vertical scalability with network bandwidth and horizontal scalability with network size, achieving a substantial improvement of 186% in throughput over baseline sharding protocols, for scenarios where bandwidths of miners range from 5Mbps to 60Mbps.
... When multiple nodes are working together, many complexities arise due to communication uncertainties and the possibility of machine failures. This is the case in fundamental data management problems such as distributed atomic commitment and database replication [21], [56], [108], [129], [146]. Solving the intricacies of distributed coordination, network uncertainties, and failures in such complex data management problems is a daunting challenge. ...
... The section begins with an overview of the problem of atomic commitment and the significance of this problem in distributed and partitioned databases. This includes a detailed description of seminal protocols such as Two-Phase Commit (2PC) [9], [56], [108]. Then, we present more details about distributed atomic commit protocols that use consensus as a foundation. ...
The problem of distributed consensus has played a major role in the development of distributed data management systems. This includes the development of distributed atomic commit and replication protocols. In this monograph, we present the foundations of consensus protocols and the ways they have been utilized to solve distributed data management problems. We also discuss how distributed consensus contributes to the development of emerging blockchain systems. This includes an exploration of consensus protocols and their use in systems with malicious actors and arbitrary faults.
Our approach is to start with the basics of representative consensus protocols: we begin with classic consensus protocols and show how they can be extended to support better performance, additional features, and/or different system models. Then, we show how consensus can be utilized as a tool in the development of distributed data management. For each data management problem, we start by showing a basic solution to the problem and highlighting the shortcomings that invite the utilization of consensus. Then, we demonstrate the integration of consensus to overcome these shortcomings and provide the desired design features. We provide examples of each type of integration of consensus in distributed data management as well as an analysis of the integration and its implications.
... In the introduction, Fischer, Lynch, and Paterson emphasised the importance of agreement among remote processes for solving the transaction commit problem in distributed database systems. The critical importance of deterministic consensus for the atomicity of committed distributed database transactions is well known and undisputed [5] [12] [4]. Yet a few decades later, the practical significance of distributed transactions is gradually fading under the weight of their distinct conceptual vulnerabilities. ...
... A distributed transaction is a set of local transactions executed by the participating databases under third-party coordination, and it unavoidably requires binary consensus [6]. Every implementation that aims to coordinate distributed transactions relies on a choice from a family of distributed commit protocols [5] [11] [10]. All protocols of this family are susceptible to blocking and may leave the participating databases in an inconsistent state [1] [17]. ...
We demonstrate the possibility of vector consensus under the model and conditions used by Fischer, Lynch, and Paterson (FLP) to prove the impossibility of binary consensus: full asynchrony and one faulty process. Under that model, we also demonstrate that with any binary consensus protocol: i) the binary outcome is produced from a vector value; ii) elaboration of a vector value is an unavoidable necessity; and iii) binary agreement can be reached by voting on a vector value. Key finding: the FLP impossibility result is about the impossibility of producing a binary value from any allowed vector value, i.e., from any data set assembled from an allowed initial state.
... However, in these serialization approaches, the replication of objects is not considered. Physical servers may stop due to faults [1,18,19,26]. If a physical server stops due to a fault, methods being performed on objects in that physical server also stop. ...
... , T_k} (k ≥ 1) of transactions created in a system. Multiple conflicting methods issued by multiple transactions are required to be serializable [1,18,26] to keep the replicas of every object mutually consistent. The notation sch denotes a schedule of the transactions in a set T of transactions. ...
In current information systems, a huge number of IoT (Internet of Things) devices are interconnected through various kinds of networks such as WiFi and 5G networks. A large volume of data is gathered into servers from the IoT devices and is manipulated to provide application services. Gathered data is encapsulated, along with the methods to manipulate it, as an object, as in a database system. In object-based systems, each application is composed of multiple objects. In addition, each object is replicated on multiple physical servers in order to increase the availability, reliability, and performance of an application service. On the other hand, the replicas of each object are required to be mutually consistent in the presence of multiple transactions. Here, a larger amount of electric energy and computation resources is consumed in the physical servers than in non-replication approaches in order to serialize conflicting transactions on multiple replicas. Many algorithms to synchronize conflicting transactions have so far been proposed, such as 2PL (Two-Phase Locking) and TO (Timestamp Ordering). However, electric energy consumption is not considered in them. In this paper, an EEQBL-OMM (Energy-Efficient Quorum-Based Locking with Omitting Meaningless Method) protocol is newly proposed to reduce not only the average execution time of each transaction but also the total electric energy consumption of servers by omitting the execution of meaningless methods on the replicas of each object. Evaluation results show that the total electric energy consumption of servers, the average execution time of each transaction, and the number of aborted instances of transactions in the EEQBL-OMM protocol can on average be reduced to 79%, 62%, and 80%, respectively, of those in the ECLBQS (Energy Consumption Laxity-Based Quorum Selection) protocol proposed in our previous studies, for a homogeneous set of servers. In addition, the evaluation results show that they can on average be reduced to 73%, 50%, and 67% of the ECLBQS protocol, respectively, for a heterogeneous set of servers. The evaluation results also show that at most 48% and 51% of the total number of methods can be omitted as meaningless methods in a homogeneous set and a heterogeneous set of servers, respectively, in the EEQBL-OMM protocol.
... Deterministic synchronous consensus has conceptual significance for the purposes of atomicity of committed distributed database transactions [44] [25]. It is impossible with a large fraction of all communication links being faulty [27]. With all links being correct and process faults up to a fraction of all processes, it was solved with an algorithm in which the decision making takes as input a vector, with an element for each process, loaded with the values received from all processes [47] [37]. ...
... Distributed transactions across these systems aim at the same outcome, also by using local transactions, yet executed under external coordination. This involves the use of a protocol from a family of distributed commit protocols [27] [40] [39], known for susceptibility to blocking [7] [57] and/or the possibility of leaving a database system participating in a distributed transaction in an inconsistent state. ...
We present an algorithm for synchronous deterministic Byzantine consensus that is tolerant to link failures and link asynchrony. It targets a class of networks with specific needs, where both safety and liveness are essential, and timely irrevocable consensus has priority over the highest throughput. The algorithm operates with redundant delivery of messages via indirect paths of up to 3 hops, aims for all correct processes to obtain a coherent view of the system state within a bounded time, and establishes consensus with no need of a leader. Consensus involves the exchange of 2 * n³ asymmetrically authenticated messages and tolerates fewer than n/2 faulty processes.
We show that in a consensus system with known members: 1) The existing concepts for delivery over a fraction of links and gossip-based reliable multicast can be extended to also circumvent asynchronous links and thereby convert reliable delivery into reliable bounded delivery. 2) A system of synchronous processes with bounded delivery does not need a leader: all correct processes from a connected majority derive and propose the same consensus value from atomically consistent individual views of the system's state. 3) The asymmetric authentication of messages required for bounded delivery is sufficient for the safety of the consensus algorithm.
Key finding: the impossibility of safety and liveness of consensus in partial synchrony is not valid in the entire space between synchrony and asynchrony. A system of synchronized synchronous processes, which communicate with asymmetrically authenticated messages over a medium susceptible to asynchrony and faults, can operate with: 1) a defined tolerance to the number of asynchronous and/or faulty links per number of stop-failed and/or Byzantine processes; 2) a leaderless algorithm with bounded termination; and 3) conceptually ensured simultaneous safety and bounded liveness.
... Database technology is the foundation of all kinds of applications; a large number of application systems are based on a database system, so the performance of the database system directly determines the success or failure of the application system and the value of its adoption and use. Due to the expansion of database application requirements and the development of the computer hardware environment, especially network technology, distributed database systems are more important than ever before [1][2][3]. ...
... The transaction is the core concept in a database system. The concept of a transaction was proposed in [1]. It is used in relational databases to address the data integrity, security, concurrency, and reliability problems that arise as databases grow in size, become more complex in structure, and share user data. ...
... When more than two parties are involved, all have to "agree to agree" on something first: Accept messages indicate that each party agrees to a particular proposal, and then a ConfirmAccept message is used to initiate turning the agreement into a commitment. This is a variation of the two-phase commit approach (Gray, 1978) used for distributed databases to commit an action atomically. In DiNE, commitments are not actually made within the NCs; it is left to the host components to arrange a commitment separately. ...
With the growth of the Internet over recent years, the use of distributed systems has increased dramatically. Components of distributed systems require a communications infrastructure in order to interact with other components. One such method of communication is a notification service (NS), which delivers notifications of events between publishers and consumers that have subscribed to these events. A distributed NS is made up of multiple NS instances, enabling publishers and consumers to be connected to different NSs and still communicate. The NSs attempt to optimise message flow between them by sharing subscriptions between consumers with similar interests. In many cases, there is a mismatch between the dissemination of notifications from a publisher and the delivery preferences of the consumer in terms of frequency of delivery, quality, etc. Consumers wish to receive a high quality of service, while a service provider acting as a publisher wishes to make its service available to many consumers without overloading itself. Negotiation is applicable to the resolution of this mismatch. However, existing forms of negotiation are incompatible with distributed NSs, where negotiation needs to take into account the preferences of the publisher and consumer, as well as existing subscriptions held by NSs. We introduce the concept of chained negotiation, where one or more intermediaries sit between the client and supplier in a negotiation, as a solution to this problem. Automated chained negotiation can enable a publisher and consumer to find a mutually acceptable set of delivery preferences for a service to be delivered through a distributed NS, while still enabling NSs to share subscriptions between consumers with similar interests. In this thesis, we present the following contributions: first, we show that by using negotiation over quality of service conditions, a service provider can serve more clients with a lower load on itself, presenting a direct negotiation engine for this purpose. We present chained negotiation as a novel form of negotiation enabling quality of service negotiations to involve intermediaries which may be able to satisfy a client's request without involving the service provider. Finally, we present a distributed notification service with support for chained negotiation, showing the benefit gained from chained quality of service negotiation in a real application.
... • The objects which cannot be defined by a sequential specification. Examples of such objects are Rendezvous objects or Non-blocking atomic commit objects [10]. These objects require processes to wait for each other, and their correct behavior cannot be captured by sequences of operations applied to them. ...
This paper presents a simple generalization of causal consistency suited to any object defined by a sequential specification. As causality is captured by a partial order on the set of operations issued by the processes on shared objects (concurrent operations are not ordered), it follows that causal consistency allows different processes to have different views of each object history.
... As a classic algorithm for providing concurrency control, weighted voting (Weighted Voting) was proposed in [Gifford 1979]. ... and, therefore, having to be identical in each replica, consists of: N, R, W, the metadata version, the object version, the number of votes (weight), and the list of participants that hold replicas. ...
Peer-to-Peer applications are increasingly popular, although still restricted to a small number of application domains, file sharing in particular. However, it is still difficult to develop applications that publish and consume data on the network, given the absence of a set of fundamental services such as those provided by programming-language frameworks and by traditional databases. This work therefore sought to design, implement, and test conceptual models to meet these demands, and also arrived at a proposal for a decentralized concurrency-control mechanism, tolerant of silent failures, suited to the unstable environment of Peer-to-Peer networks.
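The weighted-voting excerpt above lists the per-replica metadata N, R, and W. As a hedged sketch (not Gifford's full algorithm), the Python fragment below checks the two quorum conditions that make such a scheme safe when every replica carries one vote: R + W > N guarantees that read and write quorums intersect, and 2W > N prevents two disjoint write quorums. The helper names and replica layout are illustrative assumptions.

# Sketch of quorum selection under Gifford-style voting with unit weights.
def quorums_are_valid(n: int, r: int, w: int) -> bool:
    """Read/write quorums must intersect, and two write quorums must overlap."""
    return (r + w > n) and (2 * w > n)

def read_from_quorum(replicas, r: int):
    """Return the value with the highest version among any R replicas."""
    quorum = replicas[:r]                       # any R reachable replicas suffice
    return max(quorum, key=lambda rep: rep["version"])["value"]

if __name__ == "__main__":
    n, r, w = 5, 3, 3
    assert quorums_are_valid(n, r, w)
    replicas = [{"version": v, "value": f"v{v}"} for v in (1, 4, 4, 3, 2)]
    print(read_from_quorum(replicas, r))        # value carrying the newest version in the quorum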
We introduce Galaxybase, a native distributed graph database that addresses the increasing demands for processing large volumes of graph data in diverse industries such as finance, manufacturing, and government. Designed to handle the requirements of both transactional and analytical workloads, Galaxybase stands out with its novel data storage and transaction mechanisms. At its core, Galaxybase utilizes a Log-Structured Adjacency List coupled with an Edge Page structure, optimizing read-write operations across a spectrum of tasks such as graph traversals and single-edge queries. A notable aspect of Galaxybase is its execution of custom distributed transaction modes tailored for HTAP transactions, enabling bidirectional and interactive transactions. It ensures data integrity and minimal latency while enabling simultaneous processing of OLTP and OLAP workloads without blocking. Experimental results show that Galaxybase achieves high throughput and low latency in both OLTP and OLAP workloads, across various graph query scenarios and resource conditions. Galaxybase has been deployed in leading banks and in the education, telecommunication, and energy sectors in China, consistently maintaining robust performance for HTAP workloads over the years.
Traditional definitions of common ground in terms of iterative de re attitudes do not apply to conversations where at least one conversational participant is not acquainted with the other(s). I propose and compare two potential refinements of traditional definitions based on Abelard’s distinction between generality in sensu composito and in sensu diviso.
We demonstrate a deterministic Byzantine consensus algorithm with synchronous operation in partial synchrony. It is naturally leaderless, tolerates any number of f < n/2 Byzantine processes with two rounds of exchange of originator-only signed messages, and terminates within a bounded interval of time. The algorithm is resilient to transient faults and asynchrony in a fraction of links with known size per number of faulty processes, as it circumvents asynchronous and faulty links with 3-hop epidemic dissemination. Key finding: the resilience to asynchrony of links and the leaderless consensus in partial synchrony that it enables ensure algorithm operation with simultaneous validity, safety, and bounded liveness. CCS Concepts: • Computing methodologies → Distributed computing methodologies; • Computer systems organization → Dependable and fault-tolerant systems and networks. Keywords: consensus in partial synchrony, f-resilience with two messaging rounds, time-bounded liveness in partial synchrony, synchronous consensus in partial synchrony.
This thesis studies the methods of evaluating the performance of centralized and distributed database systems by using analytic modeling, simulation, and system measurement. Numerous concurrency control and locking mechanisms for distributed database systems have been proposed and implemented in recent years, but relatively little work has been done to evaluate and compare these mechanisms. It is the purpose of this thesis to address these problems. The analytic modeling intends to provide a consistent and novel modeling method to evaluate the performance of locking algorithms and concurrency control protocols in both centralized and distributed databases. In particular, it aims to solve the problems of waiting in database locking and blocking in concurrency control protocols, which have not been solved analytically before. These models, which are based on queueing networks and stochastic analysis, are able to achieve a high degree of accuracy in comparison with published simulation results. In addition, detailed simulation models are built to validate the analytic models and to study various concurrency control protocols and distributed locking algorithms; these simulation models are able to incorporate system details at very low levels, such as the communication protocols, elementary file server operations, and the lock management mechanisms. In order to further validate the findings through measurements, an actual distributed database management system is specifically implemented which adopts the two-phase commit protocol, majority consensus update algorithm, multicast communication primitives, dynamic server configuration, and failure recovery. Various performance measurements are obtained from the system, such as the service time characteristics of communication and file servers, system utilization and throughput, response time, queue length, and lock conflict rates. The performance results reveal some interesting phenomena, such as that systems with coarse granularity outperform those with fine granularity when lock overhead is not negligible, and that the effect of the database granularity is small in comparison with the effect of the number of replicated copies. Results also suggest that the centralized two-phase commit protocol outperforms other types of two-phase commit protocol, such as the basic, majority consensus, and primary copy two-phase commit protocols, under some circumstances.
Concurrency control mechanisms including the wait, timestamp, and rollback mechanisms have been briefly discussed. The concepts of validation in the optimistic approach are summarized in a detailed view. Various algorithms have been discussed regarding the degree of concurrency and classes of serializability. Practical questions relating to the arrival rate of transactions have been presented. Performance evaluation of concurrency control algorithms, including the degree of concurrency and system behavior, has been briefly conceptualized. Finally, ideas such as multidimensional timestamps, relaxation of two-phase locking, system-defined prewrites, flexible transactions, and adaptability for increasing concurrency have been summarized.
Transactional memory is an appealing paradigm for concurrent systems. Many software implementations of the paradigm have been proposed over the past two decades for both shared-memory multi-core systems and clusters of distributed machines. Chip manufacturers have however started producing many-core architectures, with low network-on-chip communication latencies and limited support for cache coherence, rendering existing transactional-memory implementations inapplicable. This paper presents the first software transactional memory protocol for many-core systems, hence featuring transactions that are both distributed and leverage shared memory. The protocol exploits fast messages over the network-on-chip to make accesses to shared data coherent. In particular, it allows visible read accesses to detect conflicts eagerly and incorporates the first distributed contention manager that guarantees the commit of all transactions. We evaluate it on Intel, AMD, and Tilera architectures, ranging from common multi-cores to experimental many-cores. We build upon new message-passing protocols, based on both software and hardware, which are interesting in their own right. Our results on various benchmarks, including realistic banking and MapReduce applications, show that it scales well regardless of the underlying platform.
Information retrieval techniques have to face both the growing amount of data to be processed and the "natural" distribution of these data over the network. Various technologies and solutions have been developed to handle this situation, including Mobile Agent technology. A mobile agent is an executing program that can migrate during execution from machine to machine in a heterogeneous network. On each machine, the agent interacts with the local resources to accomplish its task. Mobile Agents are particularly attractive in distributed information retrieval applications for several reasons, including reduction of network load and server access time delay and achievement of higher error tolerance and security. Moreover, Mobile Agents can offer a flexible, versatile, and powerful framework for efficiently developing distributed applications on the Internet. In this research work, a mobile agent for retrieving exam results in a distributed database system is described. The University of Calabar was used as a case study. In the system, two types of sites are used: the Student Affairs site and the Department site. The system has been developed using the Java programming language, and the mobile agents have been implemented using IBM's Aglet Software Development Kit (ASDK).
This paper explores some issues of relational database management systems (RDBMS) running on the Local Area Multiprocessor (LAMP) architecture. LAMP is an approach to high-performance/low-cost parallel processing and high-performance networking. LAMP consists of a number of machines, with single or multiple processors, sharing physical memory among them. The interconnection is the Scalable Coherent Interface (SCI), which provides cache-coherent, physically shared memory for multiprocessors via its bus-like point-to-point connections with high bandwidth and low latency. The article describes both the challenges and the opportunities this architecture presents for database management systems, mostly in the area of fault tolerance. Some of the issues discussed here are also likely to apply to other subsystems which provide transactional semantics, such as TP monitors and various Resource Managers.
Object queries are essential in information seeking and decision making in vast areas of applications. However, a query may involve complex conditions on objects and sets, which can be arbitrarily nested and aliased. The objects and sets involved, as well as the demand---the given parameter values of interest---can change arbitrarily. How can object queries be implemented efficiently under all possible updates, and furthermore with complexity guarantees?
This paper describes an automatic method. The method allows powerful queries to be written completely declaratively. It transforms demand as well as all objects and sets into relations. Most importantly, it defines invariants for not only the query results, but also all auxiliary values about the objects and sets involved, including those for propagating demand, and incrementally maintains all of them. Implementation and experiments with problems from a variety of application areas, including distributed algorithms and probabilistic queries, confirm the analyzed complexities, trade-offs, and significant improvements over prior work.
The increasing dependence of work processes on computing power and electronically stored data, which results from the growing spread of data- and information-processing systems in all areas, inevitably leads to an increased risk of damage when these working tools fail. This gives rise to the demand for higher reliability in practically all computer systems. The aspect of fault tolerance is therefore being taken into account more and more strongly in the design of computer systems.
Many traditional transaction-processing applications such as banking and stock trading are write-intensive in nature: they involve a great number of concurrent, relatively short updating transactions that have stringent response-time requirements in addition to strict consistency requirements. A shift from reads to writes in modern web applications has also been observed in recent years. This trend poses challenges to the performance of a traditional transaction-processing system based on write-ahead logging and random writes of B-tree pages.
Locking is the most commonly used method for enforcing transactional isolation. Most database management systems apply some kind of locking, possibly coupled with some other mechanism (such as transient versioning). With locking-based concurrency control, transactions are required to protect their actions by acquiring appropriate locks on the parts of the database they operate on. A read action on a data item is usually protected by a shared lock on the data item, which prevents other transactions from updating the data item, and an update action is protected by an exclusive lock, which prevents other transactions from reading or updating the data item.
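A minimal sketch of the shared/exclusive rule just described, assuming simplified names and ignoring waiting queues and deadlock handling: a lock is granted only if it is compatible with every lock already held on the data item by other transactions.

# Compatibility of shared (S) and exclusive (X) locks: S conflicts only with X.
COMPATIBLE = {("S", "S"): True, ("S", "X"): False,
              ("X", "S"): False, ("X", "X"): False}

class LockTable:
    def __init__(self):
        self.held = {}          # data item -> list of (transaction id, lock mode)

    def request(self, tx: str, item: str, mode: str) -> bool:
        """Grant the lock if it is compatible with all locks held by other transactions."""
        for other_tx, other_mode in self.held.get(item, []):
            if other_tx != tx and not COMPATIBLE[(other_mode, mode)]:
                return False    # caller must wait (or abort) instead of accessing the item
        self.held.setdefault(item, []).append((tx, mode))
        return True

if __name__ == "__main__":
    lt = LockTable()
    print(lt.request("T1", "x", "S"))   # True: first reader
    print(lt.request("T2", "x", "S"))   # True: shared locks are compatible
    print(lt.request("T3", "x", "X"))   # False: the writer must wait for the readers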
For the purposes of transaction rollback and restart recovery, a transaction log is maintained during normal transaction processing. The log is shared by all transactions, and it keeps, in chronological order, a record of each update on the database. The log record for an update action makes possible the redoing of the update on the previous version of an updated page when that update has been lost due to a failure. The log record for a forward-rolling update action also makes possible the undoing of the update in a backward-rolling transaction or in a transaction that must be aborted due to a failure. The log records are buffered in main memory before they are taken onto disk, but, unlike database pages, the log records are flushed onto the log disk whenever some transaction commits, so that each committed transaction is guaranteed to have every one of its updates recorded either on the disk version of the database or on the log disk.
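A hedged sketch of the logging discipline described above: log records are appended to a main-memory buffer in chronological order and forced to the log disk no later than commit, so every committed update is recorded durably. The record fields and the flush callback are illustrative assumptions, not any real system's format.

# Minimal write-ahead-log sketch: buffered appends, forced flush at commit.
from dataclasses import dataclass, field
from typing import List, Callable

@dataclass
class LogRecord:
    lsn: int            # log sequence number (chronological order)
    tx: str             # transaction identifier
    kind: str           # "update", "commit", ...
    payload: str = ""   # enough information to redo/undo the update

@dataclass
class TransactionLog:
    flush_to_disk: Callable[[List[LogRecord]], None]
    buffer: List[LogRecord] = field(default_factory=list)
    next_lsn: int = 1

    def append(self, tx: str, kind: str, payload: str = "") -> int:
        rec = LogRecord(self.next_lsn, tx, kind, payload)
        self.next_lsn += 1
        self.buffer.append(rec)          # buffered in main memory
        if kind == "commit":             # force rule: flush before acknowledging the commit
            self.flush_to_disk(self.buffer)
            self.buffer.clear()
        return rec.lsn

if __name__ == "__main__":
    durable: List[LogRecord] = []
    log = TransactionLog(flush_to_disk=durable.extend)
    log.append("T1", "update", "page 7: x := 42")
    log.append("T1", "commit")
    print([r.kind for r in durable])     # ['update', 'commit'] -- both on the "log disk"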
Processing data in bulks of many data tuples is usually more efficient than processing each tuple individually. A bulk of tuples to be inserted into a relation or a set of keys of tuples to be read or deleted can be sorted in key order before accessing the B-tree-indexed relation. In this chapter we show that processing tuples in key order on a B-tree is far more efficient than in random order, even when using the standard algorithms presented in the previous chapters.
The tuple collections of the logical database are stored in an underlying physical database, which consists of fixed-size database pages stored in nonvolatile random-access storage, usually on magnetic disk. For reading or updating tuples of the logical database, the pages that contain the tuples must be fetched from disk to the main-memory buffer of the database, from where the updated pages are later flushed back onto disk, replacing the old versions. The buffer-management component of the database management system takes care that frequently used database pages are kept available in the buffer as long as possible so as to reduce the need for expensive random reads and writes of the disk.
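To illustrate the buffer-management idea, here is a small sketch, under assumed names, of an LRU page buffer: frequently used pages stay in main memory, and when the buffer is full the least recently used page is evicted (and written back first if it is dirty). The page-I/O callbacks stand in for real disk reads and writes.

# LRU page buffer sketch; read_page/write_page stand in for real disk I/O.
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, read_page, write_page):
        self.capacity = capacity
        self.read_page = read_page          # page id -> page contents (disk read)
        self.write_page = write_page        # (page id, contents) -> None (disk write)
        self.frames = OrderedDict()         # page id -> (contents, dirty flag)

    def fix(self, page_id):
        """Return the page, fetching it from disk and possibly evicting the LRU page."""
        if page_id in self.frames:
            self.frames.move_to_end(page_id)            # mark as most recently used
            return self.frames[page_id][0]
        if len(self.frames) >= self.capacity:
            victim, (contents, dirty) = self.frames.popitem(last=False)
            if dirty:
                self.write_page(victim, contents)        # flush the updated page back to disk
        self.frames[page_id] = (self.read_page(page_id), False)
        return self.frames[page_id][0]

    def mark_dirty(self, page_id):
        contents, _ = self.frames[page_id]
        self.frames[page_id] = (contents, True)

if __name__ == "__main__":
    disk = {i: f"page-{i}" for i in range(10)}
    pool = BufferPool(2, disk.__getitem__, disk.__setitem__)
    pool.fix(1); pool.fix(2); pool.fix(3)    # page 1 is evicted as least recently used
    print(list(pool.frames))                  # [2, 3]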
Thus far we have assumed that a database server, in a centralized as well as in a distributed environment, operates using the query-shipping (transaction-shipping) paradigm, so that client application processes send SQL queries and update statements to the server, which executes them on behalf of the client transaction on the database stored at the server and returns the results to the client application process. The queries and updates are executed on database pages fetched from the server's disk to the server's buffer. The server has exclusive access to the database pages and the buffer and is responsible for the entire task of query processing, that is, parsing, optimizing, and executing the queries and update statements.
In the preceding chapters, we have considered transaction processing in a centralized database environment: we have assumed that each transaction accesses data items of a single database only and is thus run entirely on a single database server. However, a transaction may need access to data distributed across several databases governed by a single organization. This gives rise to the concept of a distributed transaction, that is, a transaction that contains actions on several intraorganization databases connected via a computer network.
The locking protocols presented thus far assume that the lockable units of the database are single tuples. Such a choice is appropriate to transactions that access a few tuples only. If a transaction accesses many tuples, it must also acquire many locks. Each such access incurs the computational overhead of requesting and perhaps waiting for the granting of the lock, and, in the case of a huge number of commit-duration locks, the storage overhead of storing the locks in the lock table until the commit of the transaction.
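One common remedy for this lock overhead, not necessarily the technique developed in this chapter, is lock escalation: once a transaction holds more than some threshold of tuple locks on a relation, they are traded for a single coarse relation-level lock. The sketch below is a simplified illustration with assumed names, ignoring intention locks and conflicts with other transactions.

# Simplified lock-escalation sketch: trade many tuple locks for one relation lock.
ESCALATION_THRESHOLD = 3   # illustrative; real systems use much larger thresholds

class TxLocks:
    def __init__(self):
        self.tuple_locks = {}        # relation -> set of locked tuple keys
        self.relation_locks = set()  # relations locked as a whole

    def lock_tuple(self, relation, key):
        if relation in self.relation_locks:
            return                                    # already covered by a coarse lock
        locks = self.tuple_locks.setdefault(relation, set())
        locks.add(key)
        if len(locks) > ESCALATION_THRESHOLD:         # too many fine-grained locks:
            self.relation_locks.add(relation)         # escalate to one relation lock
            del self.tuple_locks[relation]            # and drop the per-tuple entries

if __name__ == "__main__":
    tx = TxLocks()
    for k in range(5):
        tx.lock_tuple("orders", k)
    print(tx.relation_locks, tx.tuple_locks)   # {'orders'} {}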
In the preceding chapters, we have assumed that tuples in a relation are accessed via a sparse B-tree index on the primary key of the relation. In this chapter we extend our database and transaction model with read, delete, and update actions based on ranges of non-primary-key attributes. To accelerate these actions, secondary indexes must be constructed. A secondary index, as considered here, is a dense B-tree whose leaf pages contain index records that point to the tuples stored in leaf pages of the sparse primary B-tree index of the relation.
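The two-step access path described here (a dense secondary-index entry leading into the primary index) can be sketched with ordinary dictionaries standing in for the two B-trees; the attribute names and data are illustrative assumptions.

# Secondary-index sketch: dense index on a non-primary-key attribute pointing into
# the primary index; plain dicts stand in for the two B-trees.
primary_index = {                       # primary key -> tuple
    101: {"id": 101, "city": "Oslo",  "name": "Ada"},
    102: {"id": 102, "city": "Turku", "name": "Bob"},
    103: {"id": 103, "city": "Oslo",  "name": "Eve"},
}
secondary_index = {                     # city -> primary keys of matching tuples
    "Oslo": [101, 103],
    "Turku": [102],
}

def read_by_city(city):
    """Equality read via the secondary index, then fetch the tuples by primary key."""
    return [primary_index[pk] for pk in secondary_index.get(city, [])]

if __name__ == "__main__":
    print([t["name"] for t in read_by_city("Oslo")])   # ['Ada', 'Eve']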
The B-tree, or, more specifically, the B+-tree, is the most widely used physical database structure for primary and secondary indexes on database relations. Because of its balance conditions that must be maintained under all circumstances, the B-tree is a highly dynamic structure in which records are often moved from one page to another in structure modifications such as page splits caused by insertions and page merges caused by deletions. In a transaction-processing environment based on fine-grained concurrency control, this means that a data page can hold uncommitted updates by several transactions at the same time and an updated tuple can migrate from one page to another while the updating transaction is still active.
All the concepts related to transactional isolation and concurrency control discussed in the previous chapters pertain to a single-version database model in which, for each data item (identified by a unique key) in the logical database, only a single version, namely, the most recent or current version of the data item, is available at any time. When a transaction is permitted, at the specified isolation level, to read or update a data item, the database management system always provides the transaction with the current version of the data item.
The latching protocol applied when accessing pages in the physical database maintains the integrity of the physical database during transaction processing, so that, for example, a B-tree index structure is kept structurally consistent and balanced. When the physical database is consistent, the logical database consisting of the tuples in the data pages is action consistent, meaning that the logical database is the result of a sequence of completely executed logical database actions.
A B-tree structure modification is an update operation that changes the tree structure of the B-tree, so that at least one index record (a parent-to-child link) is inserted, deleted, or updated. The structure modifications are encompassed by the following five types of primitive modifications: page split, page merge, records redistribute, tree-height increase, and tree-height decrease. Each of these primitive modifications modifies three B-tree pages, namely, a parent page and two child pages.
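As a toy illustration of the first primitive modification, page split, the function below splits an over-full child page into two and installs the new parent-to-child index record (a separator key), touching exactly a parent and two child pages. Latching, logging, and the balance conditions are omitted, and all names are assumptions.

# Toy page split: one full child becomes two, and the parent gains a new index record.
def page_split(parent, child):
    """parent: list of (separator key, child page); child: sorted list of keys."""
    mid = len(child) // 2
    left, right = child[:mid], child[mid:]          # the two resulting child pages
    separator = right[0]                            # key routing searches to the right page
    parent.append((separator, right))               # new parent-to-child link
    parent.sort(key=lambda entry: entry[0])
    return left, right

if __name__ == "__main__":
    parent = [(1, None)]                            # simplified parent with one existing link
    full_child = [10, 20, 30, 40, 50]
    left, right = page_split(parent, full_child)
    print(left, right)                              # [10, 20] [30, 40, 50]
    print([sep for sep, _ in parent])               # [1, 30]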
During normal processing, total or partial rollbacks occur when transactions themselves request such actions. In the event of a system crash or startup, restart recovery is performed before normal transaction processing can be resumed. This includes restoring the database state that existed at the time of the crash or shutdown from the disk version of the database and from the log records saved on the log disk, followed by the abort and rollback of all forward-rolling transactions and the running of the rollback of all backward-rolling transactions to completion.
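A heavily simplified sketch of the two log passes implied here, under assumed record formats: a redo pass reapplies every logged update to the disk version of the database, and an undo pass rolls back, in reverse order, the updates of every transaction without a commit record (the transactions that were forward- or backward-rolling at the time of the crash).

# Simplified redo/undo restart recovery over a value-based log.
# Each log record: (tx, kind, key, old value, new value); kind is "update" or "commit".
def restart_recovery(db, log):
    committed = {tx for tx, kind, *_ in log if kind == "commit"}

    # Redo pass: repeat history by reapplying every logged update.
    for tx, kind, *rest in log:
        if kind == "update":
            key, _old, new = rest
            db[key] = new

    # Undo pass: roll back losers (no commit record) in reverse chronological order.
    for tx, kind, *rest in reversed(log):
        if kind == "update" and tx not in committed:
            key, old, _new = rest
            db[key] = old
    return db

if __name__ == "__main__":
    db = {"x": 0, "y": 0}                      # disk version at the time of the crash
    log = [("T1", "update", "x", 0, 1),
           ("T2", "update", "y", 0, 7),
           ("T1", "commit")]
    print(restart_recovery(db, log))           # {'x': 1, 'y': 0}: T1 redone, T2 undone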
In the previous chapter we discussed the database architecture which provides a skeleton to support entity characteristics, user languages, storage structure and data independence. The additional facilities which are needed to build a proper DBMS around that skeleton are as follows.
Our goal in Chap. 12 is to identify the best options for implementing high-speed data replication and other tools needed for fault-tolerant, highly assured Web Services and other forms of distributed computing. Given the GMS created in Chap. 11, one option would be to plunge right in and build replicated applications using the protocol directly in the application. The approach builds on the GMS, but then uses it to create protocols that can only be operated under the assumption that if a failure occurs, the GMS will be notified and will reconfigure the system appropriately, notifying the new system configuration members of their new state, and taking steps to shut down any old members that are unreachable but later recover. We arrive at a rich collection of protocols and establish a subtle linkage to the Paxos framework.
In this and the next two chapters, we will be focused on mechanisms for replicating data and computation while guaranteeing some form of consistent behavior to the end-user. For example, we might want to require that even though information has been replicated, the system behaves as if that information was not replicated and instead resides at a single place. This is an intuitively attractive model, because developers find it natural to think in terms of non-distributed systems, and it is reasonable to expect that a distributed system should be able to mimic the behavior of a non-distributed one. At the same time, though, it is not a minor undertaking to ensure that a distributed system will behave just like a non-distributed one. The technical content of the chapter centers on the components of Lamport’s widely known Paxos protocol.
Up to now we have looked at cloud computing from a fairly high level, and used terms such as “client” and “server” in ways intended to evoke the reader’s intuition into the way that modern computing systems work: our mobile devices, laptops and desktop systems operate fairly autonomously, requesting services from servers that might run in a machine up the hall, or might be situated in a massive cloud-computing data center across the country. Here we focus on client/server computing as a model and look at the issues that arise when a client platform cooperates with a server, potentially replicating state or holding locks.
We first encountered the transactional execution model in Chap. 7, in conjunction with client/server architectures. As noted at that time, the model draws on a series of assumptions to arrive at a style of computing that is especially well matched to the needs of applications operating on databases. In this chapter we consider some of the details that Chap. 7 did not cover: notably the issues involved in implementing transactional storage mechanisms and the problems that occur when transactional architectures are extended to encompass transactional access to distributed objects in a reliable distributed system.
Database systems have emerged as a ubiquitous tool in computer applications over the past 35 years, and they offer comprehensive capabilities for storing, retrieving, querying, and processing data that allow them to interact efficiently and appropriately with the information-system landscape found in present-day federated enterprise and Web-based environments. They are standard software on virtually any computing platform, and they are increasingly used as an "embedded" component in both large and small (software) systems (e.g., workflow management systems, electronic commerce platforms, Web services, smart cards); they continue to grow in importance as more and more data needs to be stored in a way that supports efficient and application-oriented processing. As the exploitation of database technology increases, the capabilities and functionality of database systems need to keep pace. Advanced database systems try to meet the requirements of present-day database applications by offering advanced functionality in terms of data modeling, multimedia data type support, data integration capabilities, query languages, system features, and interfaces to other worlds. This article surveys the state of the art in these areas.