Providing Fault-tolerance and Improving Performance

"Gratitude is the memory of the heart." (Jean Baptiste Massieu)

I would like to thank all the people who, through their valuable advice and support, made this work possible. First and foremost, I am grateful to my advisor, Professor Jason Hickey, for his excellent mentorship, for the many insightful conversations that led to the development of the ideas in this thesis, for his constant support and his positive attitude, and for providing me the opportunity to work in the large research environment in the Mojave group at Caltech.

Background

The rest of this chapter introduces speculative primitives in an informal way, followed by a series of examples that highlight the advantages of speculative execution. We further compare our speculative execution approach for improving performance and providing fault tolerance with other existing mechanisms.

Speculative Execution

The proposed speculative execution model defines the interaction between speculations in a distributed environment as well as the actions performed when distributed speculations are started, committed, or aborted. If a speculation is aborted during speculative execution, the program is rolled back to the speculation call, and the call returns a non-zero value.
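The rollback behavior described above can be illustrated with a minimal single-process sketch. It is not the thesis implementation: the names `speculate` and `SpeculationAborted` are hypothetical, the checkpoint is a deep copy of a dictionary, and the convention that 0 means "committed" is an assumption (the text only states that a non-zero value is returned after an abort).

```python
import copy


class SpeculationAborted(Exception):
    """Hypothetical signal raised by a speculative body to request a rollback."""


def speculate(state, body):
    """Run body(state) speculatively.

    Returns 0 if the body completes (assumed to mean "committed") and a
    non-zero value if it aborts, after restoring the state saved at the call.
    """
    checkpoint = copy.deepcopy(state)      # state as of the speculation call
    try:
        body(state)
    except SpeculationAborted:
        state.clear()
        state.update(checkpoint)           # roll back to the speculation call
        return 1                           # non-zero: the speculation aborted
    return 0


def overdraw(account):
    account["balance"] -= 150
    if account["balance"] < 0:
        raise SpeculationAborted()


def withdraw_small(account):
    account["balance"] -= 30


account = {"balance": 100}
aborted_code = speculate(account, overdraw)        # rolled back
balance_after_abort = account["balance"]
committed_code = speculate(account, withdraw_small)
```

The caller branches on the return value, much as it would on the result of the speculation call in the model.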

Examples

When the airline confirms the booking, the information is sent to the customer and the speculation is committed on the flight agent's side. The speculation is only successfully committed if both the customer and the flight agent commit the speculation.
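The joint-commit requirement can be sketched as follows. This is an illustrative model only, assuming a speculation object shared by a fixed set of participants; the class name `SharedSpeculation` and its methods are hypothetical and do not come from the thesis.

```python
class SharedSpeculation:
    """Hypothetical speculation shared by several participants.

    It counts as committed only once every participant has committed,
    and any abort cancels it for everyone.
    """

    def __init__(self, participants):
        self.participants = set(participants)
        self.committed_by = set()
        self.aborted = False

    def commit(self, who):
        if who not in self.participants:
            raise ValueError(f"{who!r} is not part of this speculation")
        if not self.aborted:
            self.committed_by.add(who)
        return self.is_committed()

    def abort(self, who):
        if who not in self.participants:
            raise ValueError(f"{who!r} is not part of this speculation")
        self.aborted = True
        self.committed_by.clear()

    def is_committed(self):
        return not self.aborted and self.committed_by == self.participants


booking = SharedSpeculation(["customer", "flight_agent"])
after_agent = booking.commit("flight_agent")   # agent alone is not enough
after_customer = booking.commit("customer")    # both sides have committed
```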

Figure 1.2: Simplified speculative main loop and the 2D domain decomposition for a scientific computation whose performance is improved by using speculations.

Comparative Analysis

Contributions

Outline of the Thesis

Theory of Transactions

In their approach, Prinz and Thalheim discuss ACID transactions and do not consider any relaxation of the properties. Non-isolated transactions have been studied in the context of long-lived transactions that can hold on to database resources for long periods of time, delaying the completion of other transactions.

Transactional Systems

Unlike speculative execution, their transactional memory only considers isolated atomic blocks that can be evaluated in parallel. Another interesting topic in the field of transactional memory support in hardware is building mechanisms that can, at runtime, identify lock-protected critical sections in programs and execute them without actually acquiring the lock [48].

Checkpoint and Recovery

The system specification requires that the state of each process be saved after each send or receive operation. It regularly saves checkpoints of running applications and in case of process failure it rolls the process back to one of the previously saved checkpoints.

Speculative execution

Mathematical Model for Speculations

When the last message with an ID lower than M is received, we are charged the abort time of the current speculation plus the time to enter (Te) a new speculation that is guaranteed to succeed, so we also add the commit time (Tc). First, for the immediate delivery case, we improve only if the expected delivery time is greater than the size of the waiting window T plus the time spent entering and committing speculation.
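The improvement condition for the immediate delivery case can be written as a one-line check. This is only a direct transcription of the sentence above, assuming that "the time spent entering and committing" is the sum Te + Tc; the function and parameter names are illustrative, and the full model in the thesis contains terms that are not reproduced here.

```python
def speculation_improves(expected_delivery, window, t_enter, t_commit):
    """Return True if speculating is expected to help in the immediate
    delivery case: expected delivery time > T + Te + Tc.

    All arguments are times in the same unit (for example, microseconds).
    """
    return expected_delivery > window + t_enter + t_commit


# Example with made-up numbers: a window of 5, entering and committing cost 1 each.
helps_when_slow = speculation_improves(10.0, 5.0, 1.0, 1.0)
helps_when_fast = speculation_improves(6.0, 5.0, 1.0, 1.0)
```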

Figure 3.2: Two possible speculative executions of the speculative total order protocol.

Speculative Behavior

Semantics of Speculative Behavior

It contains the state of the process and is marked by the speculation on which it depends. The behavior of message passing primitives is presented only in the context of speculation.

Sample informal proof of a safety property

We have created a model that captures the state of a distributed system in a way that facilitates the specification of speculative actions. The third model is a refinement of the speculative distributed object system and deals with nested speculation.

Speculative Message Passing Model

Syntax

Each of the three models is introduced with a full description of the syntax of speculative primitives and the terminology used to describe the rules. The two models differ from each other in important ways, either in terms of syntax or in terms of rule representation, which warrant a full description of each rather than a comparative discussion.

Terminology and Notation

The let v = recv() in e construct returns the first message in the receive queue and assigns the data to variable v. In the example rule (Figure 4.2), we show the evolution of the system-wide state when the process pi.
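A small interpreter-style sketch may help make the receive construct concrete. It assumes that messages are dictionaries with a `data` field, that the continuation e is modeled as a Python function over an environment, and that the queue is non-empty when the construct is evaluated; none of these representation choices come from the thesis.

```python
from collections import deque


def let_recv(queue, env, name, body):
    """Evaluate `let name = recv() in body` against a receive queue.

    The first message is removed from the queue, its data is bound to `name`
    in an extended environment, and the body is evaluated in that environment.
    """
    message = queue.popleft()        # first message in the receive queue
    extended = dict(env)             # leave the caller's environment untouched
    extended[name] = message["data"]
    return body(extended)


inbox = deque([{"data": 5}, {"data": 7}])
result = let_recv(inbox, {"k": 1}, "v", lambda env: env["v"] + env["k"])
```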

Operational Semantics Rules

In this case, the receiver of the message is caught up in the speculation to which the sender of the message belongs. The checkpoint is tagged with the speculation ID of the message(s) and the flag is set to client.
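The absorption of a receiver into the sender's speculation can be sketched as below. The checkpoint tagging and the `client` flag follow the description above; the `Process` class, the message layout, and the abort handling (restore the tagged checkpoint and discard it together with any later ones) are assumptions made for illustration.

```python
import copy


class Process:
    """Hypothetical process that joins a sender's speculation on receive."""

    def __init__(self, state):
        self.state = dict(state)
        self.speculations = set()
        self.checkpoints = []

    def receive(self, message):
        new_ids = set(message["spec_ids"]) - self.speculations
        if new_ids:
            # Checkpoint tagged with the speculation ID(s) of the message;
            # the flag records that this process joined as a client.
            self.checkpoints.append({
                "state": copy.deepcopy(self.state),
                "spec_ids": frozenset(new_ids),
                "flag": "client",
            })
            self.speculations |= new_ids
        self.state["last_message"] = message["data"]

    def abort(self, spec_id):
        """Roll back to the checkpoint taken when spec_id was joined."""
        for index, checkpoint in enumerate(self.checkpoints):
            if spec_id in checkpoint["spec_ids"]:
                self.state = checkpoint["state"]
                for dropped in self.checkpoints[index:]:
                    self.speculations -= dropped["spec_ids"]
                del self.checkpoints[index:]
                return True
        return False


receiver = Process({"x": 1})
receiver.receive({"data": 42, "spec_ids": {"s1"}})
joined = "s1" in receiver.speculations
flag = receiver.checkpoints[0]["flag"]
rolled_back = receiver.abort("s1")
```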

Model for a Speculative Distributed Objects System

Overview of the Language
Terminology and Notation
Speculate
Reading from a Shared Object
Writing Data to a Shared Object
Abort a Speculation
Commit a Speculation

If it executes an abort(), the speculation is aborted and the process returns and executes e2. The local environment of the process at the time the process became part of the speculation. A process checkpoint is created to allow recovery if the speculation is subsequently aborted.

If the process is a co-owner of the speculation, it returns to where it started its original speculation.

Table 4.1: Notation for speculative processes

Model for Nested Speculations in a Distributed Shared Objects System

Syntax of the Primitives
Terminology and Notation
Speculate
Reading from a Shared Object
Writing Data to a Shared Object
Aborting a Speculation
Commit a Speculation

The process's state depends on speculative information, so the process is absorbed into the object's speculation. The most interesting case for reading the value of a shared object is when the process and the object belong to different speculations. The reduction rule is similar to the case when neither the process nor the object is part of any speculation (Rule Nested-Write-NoSpec).

The most interesting case for writing an object is when the process and the object are within different speculations.

Nonspeculative model

Nonspeculative operational semantics

When the process encounters a speculate call, it can choose non-deterministically to take either the commit or the abort branch. With the aim of proving the equivalence of this model to the speculative model presented in Section 4.2, we introduce two history variables that are local to each process's environment. The second history variable, A, in each process refers to the abort branch that is ignored by the program.

Local variable v is assigned the value of object oj and becomes part of the process's local environment.

Equivalence of the Speculative and Nonspeculative Versions of the Distributed Objects

Algebraic Representation of the Operational Semantics Rules
Definitions and Abstractions
Equivalence Theorems
MCC Overview
Speculative Support in MCC
Limitations of MCC’s Support for Speculations

This rule is equivalent to the Spec reduction rule from the operational semantics of the speculative model. This rule is equivalent to the Comm-Owner reduction rule from the operational semantics of the speculative model. This rule is equivalent to the Read-NoSpec reduction rule from the operational semantics of the speculative model.

This rule is equivalent to the Write-NoSpec reduction rule from the operational semantics of the speculative model.

Figure 5.1: Pointer table representation

Kernel-level Implementation Overview

Assumptions

Implementation Details

The new process corresponds to the abort branch of the speculation and it represents the light checkpoint. This prevents unnecessary copying and preserves the original state of the process in the abort branch (see Figure 5.9).

If the process is the sole owner of the speculation, it can fully perform the speculation.

Figure 5.7: The speculation header sent as part of the newly introduced IP option and the speculation ID data structures

Synthetic Experimental Results

The Testing Setup
Overhead of Executing Inside a Speculation
Speculation Overhead with Initial Nonspeculative Accesses
Cost of Speculative System Calls
Speculation Support in MojaveFS
The Distribution Component of MojaveFS
Support for Sequential Consistency in MojaveFS

A separate log is kept for each level of speculation, similar to the heap generations presented in Figure 5.4 in Section 5.1. If a given data object becomes a "hot spot", the virtual server serving it can dynamically recruit new machines to help carry the load. The only communication between processes is through read and write operations of shared object data.
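One way to picture a separate log per speculation level is a stack of undo logs over a key-value store. The sketch below is an assumption-laden illustration, not the MojaveFS design: it keeps one undo log per open level, undoes the innermost log on abort, and folds it into the parent level on commit. All names are hypothetical.

```python
_MISSING = object()   # marks keys that did not exist before a speculative write


class SpeculativeStore:
    """Key-value store with one undo log per open speculation level."""

    def __init__(self):
        self.data = {}
        self.logs = []                       # logs[-1] is the innermost level

    def speculate(self):
        self.logs.append({})
        return len(self.logs)                # current nesting depth

    def write(self, key, value):
        if self.logs:
            # Record the old value only the first time this level touches the key.
            self.logs[-1].setdefault(key, self.data.get(key, _MISSING))
        self.data[key] = value

    def abort(self):
        for key, old in self.logs.pop().items():
            if old is _MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old

    def commit(self):
        log = self.logs.pop()
        if self.logs:
            # The parent must still be able to undo the committed child's writes.
            for key, old in log.items():
                self.logs[-1].setdefault(key, old)


store = SpeculativeStore()
store.write("a", 1)
store.speculate()            # level 1
store.write("a", 2)
store.speculate()            # level 2
store.write("a", 3)
store.write("b", 9)
store.abort()                # undo level 2 only
after_inner_abort = dict(store.data)
store.speculate()            # level 2 again
store.write("a", 4)
store.commit()               # folded into level 1
store.abort()                # undo level 1, including the committed child
after_outer_abort = dict(store.data)
```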

Therefore, we choose O as follows: the messages in O are ordered according to the order (from the point of view of an external global clock) in which they are assigned a "group sequence number".

Figure 5.15: The overhead of randomly accessing the entries of an array of 128Mb for the first time using various mutation percentiles

Implementation Overview of MojaveFS

The Indirect I/O layer

Each node in a virtual data server has a current copy of the data corresponding to the data group it serves. It then contacts the virtual data server and retrieves the metadata object, which contains information about the identifiers of the data objects that make up the file. To conserve resources, it may be desirable for the system to consolidate the responsibilities of the virtual data servers that correspond to all files in a subtree of the file system directory structure.

In this case, the search engine would perform a binary search to determine the most specific virtual data server that is responsible for the metadata of the file to be searched.

The Direct I/O layer

This requires the adoption of some kind of recovery policy that allows one to 'rebuild' an authoritative view from non-authoritative views. This condition means that two or more non-authoritative views can merge into an authoritative view (provided that the membership of the merged view contains a majority of the nodes that were members of the last authoritative view). This is necessary to ensure that, if two or more non-authoritative views merge to create an authoritative view, a resulting authoritative view will send the most recent set of messages (reads and writes) to the view before the split.

Without this additional information, it would not be possible for the DIO layer to determine whether the membership of a view resulting from the merging of two or more non-authoritative views contains a majority of the possible nodes that could be combined to form an authoritative view (i.e., ensuring that there could be no other merge involving a disjoint set of nodes that would result in a view being considered authoritative).
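The majority condition can be expressed as a short check. The sketch assumes that views are represented as sets of node identifiers and that "majority" means strictly more than half of the members of the last authoritative view; the function name is illustrative.

```python
def is_authoritative(merged_members, last_authoritative_members):
    """Return True if the merged view contains a strict majority of the
    nodes that were members of the last authoritative view."""
    last = set(last_authoritative_members)
    present = last & set(merged_members)
    return 2 * len(present) > len(last)


last_view = {"n1", "n2", "n3", "n4", "n5"}
partition_a = {"n1", "n2"}
partition_b = {"n3"}
partition_c = {"n4", "n5"}

merged_ab = is_authoritative(partition_a | partition_b, last_view)   # 3 of 5
alone_c = is_authoritative(partition_c, last_view)                   # 2 of 5
```

Because a strict majority is required, two disjoint sets of nodes can never both satisfy the check, which is the property the parenthetical remark above is concerned with.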

Implementation of the Lower Layer of the Group Communication Protocol

It then contacts a directory service that provides hints about the current membership of the group. This resolution mechanism allows only one of the initiators to complete the view change successfully. The goal of this phase is to reduce the number of members of the view to the current set of active processes.

The view change initiator sets the epoch of the new view larger than the largest epoch involved in the view change.
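As a tiny illustration of the epoch rule, the initiator can pick any value above the maximum epoch it has seen. Choosing exactly the maximum plus one is an assumption made here for concreteness; the text only requires the new epoch to be larger.

```python
def next_epoch(epochs_in_view_change):
    """Return an epoch strictly larger than every epoch involved in the view change."""
    return max(epochs_in_view_change) + 1


new_epoch = next_epoch([3, 7, 5])
```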

Optimization

To achieve sequential consistency, we rely on the Group Common Order communication protocol, which ensures that messages sent by the same process to the same group are delivered in the order in which they were sent. When a process sends multiple messages to the same group, it is allowed to send a request for a sequence number and then continue execution until the sequence number is granted. Meanwhile, a process can send multiple sequence number requests to the same group.

However, when a process changes the group to which it is sending the message, it must wait for all of its previously requested sequence numbers to be granted before sending a request for a sequence number in the other group.
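The pipelining rule for sequence number requests can be sketched as a small bookkeeping class. It models only the sender-side constraint described above (requests to the same group may overlap, but switching groups requires all earlier requests to have been granted); the class and method names are hypothetical and no networking is involved.

```python
class SequenceRequester:
    """Tracks outstanding sequence number requests for one sending process."""

    def __init__(self):
        self.current_group = None
        self.pending = 0                 # requests sent but not yet granted

    def can_request(self, group):
        # Overlapping requests are fine within one group; a group switch
        # must wait until nothing is pending.
        return self.pending == 0 or group == self.current_group

    def request(self, group):
        if not self.can_request(group):
            raise RuntimeError("pending sequence numbers must be granted before switching groups")
        self.current_group = group
        self.pending += 1

    def granted(self):
        if self.pending == 0:
            raise RuntimeError("no outstanding request to grant")
        self.pending -= 1


sender = SequenceRequester()
sender.request("g1")
sender.request("g1")                       # pipelined request to the same group
blocked = not sender.can_request("g2")     # must wait before switching
sender.granted()
sender.granted()
sender.request("g2")                       # allowed once everything is granted
```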

Related Work

All of the approaches listed above make use of a fixed set of dedicated metadata servers with little or no replication of the metadata. The protocol implementation was presented along with mathematical proofs that testify to the correctness of the protocol. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA'05), pages 316–327, San Francisco, California, February 2005.

In CCGRID ’02: Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, page 331, Washington, DC, USA, 2002.

Simplified speculative main loop and the 2D domain decomposition for a scientific

Algorithm for total-order communication using speculations

Speculative programs for reservation system

The reservation succeeds. Send operations are shown as squares, while receives are

The reservation fails because prices are too high. Send operations are shown as squares,

Pseudo-code for the speculative algorithm used to implement a total-order communi-

Two possible speculative executions of the speculative total order protocol

Syntactically valid terms

Notation used in the operational semantics

Syntactically valid terms

Graphical and algebraic representation of speculative states

Graphical and algebraic representation of nonspeculative states

Pointer table representation

The representation of a process in a stack-based compiler

Speculation variables

Heap data with multiple speculation levels

Interaction between user level processes and the Linux kernel

The speculation header sent as part of the newly introduced IP option and the speculation

The speculate call uses the do_fork() function to create the abort branch of the specu-

The copy-on-write mechanism associated with the do_fork() function copies memory

The OSI Model and the Linux network stack

Data encapsulation between various network layers

The layout of the IP header encapsulated by IP datagrams

Skeleton of the benchmark program

The overhead of randomly accessing the entries of an array of 128Mb for the first time

Skeleton of the benchmark program

The overhead of randomly modifying various percentages of an array of 128Mb. The

Cost of other system calls and that of context switch time (in µs). The grayed-out

Each filename maps to a virtual data server group

An “out of order” interleaving of messages from P3 and P4 is allowed during the passive

Representation of a file and of a directory in MojaveFS. Only the shaded areas are

The layered architecture of MojaveFS

Object look-up mechanism

Views split and merge, changing authoritative status

The view change event

The state machine for our protocol