"Gratitude is the memory of the heart." (Jean Baptiste Massieu)
I would like to thank all the people who, through their valuable advice and support, made this work possible. First and foremost, I am grateful to my advisor, Professor Jason Hickey, for his excellent mentorship, for the many insightful conversations that led to the development of the ideas in this thesis, for his constant support and his positive attitude, and for giving me the opportunity to work in the large research environment of the Mojave group at Caltech.
Background
The rest of this chapter introduces speculative primitives in an informal way, followed by a series of examples that highlight the advantages of speculative execution. We further compare our speculative execution approach for improving performance and providing fault tolerance with other existing mechanisms.
Speculative Execution
The proposed speculative execution model defines the interaction between speculations in a distributed environment as well as the actions performed when distributed speculations are started, committed, or aborted. If a speculation is aborted during speculative execution, the program is rolled back to the speculation call and a non-zero value is returned.
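The rollback behavior described above can be illustrated with a minimal single-process sketch. It is not the thesis's implementation: the `Process` class, the `speculate`/`abort` method names, the deep-copy checkpoint, and the convention that 0 is returned on commit are all assumptions made for illustration. Only the non-zero return value on abort comes from the text.

```python
import copy


class SpeculationAborted(Exception):
    """Raised by abort() to unwind back to the speculate() call."""


class Process:
    """Toy model: `state` is everything a rollback must restore (assumption)."""

    def __init__(self, state):
        self.state = state

    def speculate(self, body):
        """Run body speculatively; return 0 on commit, 1 (non-zero) on abort."""
        checkpoint = copy.deepcopy(self.state)  # saved at the speculation call
        try:
            body(self)
        except SpeculationAborted:
            self.state = checkpoint  # roll back to the speculation call
            return 1
        return 0  # in this sketch, finishing the body counts as a commit

    def abort(self):
        raise SpeculationAborted()


def optimistic_withdrawal(p):
    # Hypothetical workload: update first, validate afterwards.
    p.state["balance"] -= 100
    if p.state["balance"] < 0:
        p.abort()


poor = Process({"balance": 50})
poor_result = poor.speculate(optimistic_withdrawal)

rich = Process({"balance": 500})
rich_result = rich.speculate(optimistic_withdrawal)
```

In the first run the speculative update is undone and the caller sees a non-zero result; in the second the update is kept.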
Examples
When the airline confirms the booking, the information is sent to the customer and the speculation is committed on the flight agent's side. The speculation is only successfully committed if both the customer and the flight agent commit the speculation.
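The commit condition in the reservation example can be stated as a small predicate. This is only a sketch: the function name `speculation_outcome` and the vote dictionary are hypothetical, and the real protocol exchanges messages rather than collecting votes in one place.

```python
def speculation_outcome(votes):
    """Return "commit" only if every participant voted to commit.

    `votes` maps a participant name to "commit" or "abort" (assumed encoding).
    An empty vote set is treated as an abort.
    """
    if votes and all(v == "commit" for v in votes.values()):
        return "commit"
    return "abort"


both_commit = speculation_outcome({"customer": "commit", "flight_agent": "commit"})
customer_aborts = speculation_outcome({"customer": "abort", "flight_agent": "commit"})
```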
Comparative Analysis
Contributions
Outline of the Thesis
Theory of Transactions
In their approach, Prinz and Thalheim discuss ACID transactions and do not consider any relaxation of the properties. Non-isolated transactions have been studied in the context of long-lived transactions that can hold on to database resources for long periods of time, delaying the completion of other transactions.
Transactional Systems
Unlike speculative execution, their transactional memory only considers isolated atomic blocks that can be evaluated in parallel. Another interesting topic in the field of hardware support for transactional memory is building mechanisms that can, at runtime, identify lock-protected critical sections in programs and execute them without actually acquiring the lock [48].
Checkpoint and Recovery
The system specification requires that the state of each process be saved after each send or receive operation. It regularly saves checkpoints of running applications and in case of process failure it rolls the process back to one of the previously saved checkpoints.
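A checkpoint-per-communication policy of the kind described above might look as follows. This is a sketch under assumptions: the class name, the dictionary used as process state, and the in-memory checkpoint list are all hypothetical, and a real system would write checkpoints to stable storage.

```python
import copy


class CheckpointedProcess:
    """Saves a checkpoint after every send or receive (in memory, for illustration)."""

    def __init__(self):
        self.state = {"sent": [], "received": []}
        self.checkpoints = [copy.deepcopy(self.state)]  # checkpoint 0: initial state

    def _save(self):
        self.checkpoints.append(copy.deepcopy(self.state))

    def send(self, msg):
        self.state["sent"].append(msg)
        self._save()

    def receive(self, msg):
        self.state["received"].append(msg)
        self._save()

    def rollback(self, index):
        """Restore checkpoint `index` (non-negative) and discard newer checkpoints."""
        self.state = copy.deepcopy(self.checkpoints[index])
        del self.checkpoints[index + 1:]


proc = CheckpointedProcess()
proc.send("a")             # checkpoint 1
proc.receive("b")          # checkpoint 2
proc.state["scratch"] = 1  # work done after the last checkpoint
proc.rollback(1)           # simulate a failure and recover to checkpoint 1
```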
Speculative execution
Mathematical Model for Speculations
When the last message with an ID lower than M is received, we are charged the abort time of the current speculation plus the time to enter (Te) a new speculation; because that speculation is guaranteed to succeed, we also add the commit time (Tc). First, for the immediate delivery case, we improve only if the expected delivery time is greater than the size of the waiting window T plus the time spent entering and committing the speculation.
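The immediate-delivery condition can be written compactly. Here $T$, $T_e$, and $T_c$ are the quantities named in the text, while $\mathbb{E}[D]$ for the expected delivery time is notation assumed for this illustration and may differ from the symbols used in the full model.

```latex
\mathbb{E}[D] \;>\; T + T_e + T_c
```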
Speculative Behavior
Semantics of Speculative Behavior
It contains the state of the process and is marked by the speculation on which it depends. The behavior of message passing primitives is presented only in the context of speculation.
Sample informal proof of a safety property
We have created a model that captures the state of a distributed system in a way that facilitates the specification of speculative actions. The third model is a refinement of the speculative distributed object system and deals with nested speculation.
Speculative Message Passing Model
Syntax
Each of the three models is introduced with a full description of the syntax of speculative primitives and the terminology used to describe the rules. The models differ from one another in important ways, either in terms of syntax or in terms of rule representation, which warrants a full description of each rather than a comparative discussion.
Terminology and Notation
The let v = recv() in e construct returns the first message in the receive queue and assigns the data to variable v. In the example rule (Figure 4.2), we show the evolution of the system-wide state when the process pi.
Operational Semantics Rules
In this case, the receiver of the message is caught up in the speculation to which the sender of the message belongs. The checkpoint is tagged with the speculation ID of the message(s) and the flag is set to client.
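The absorption of a receiver into the sender's speculation can be sketched as below. The `Message` and `Receiver` classes, their field names, and the tuple layout of a checkpoint are assumptions for illustration. The checkpoint tag (the message's speculation ID) and the `client` flag follow the description above, and the sketch covers only the case of a receiver that is not yet in any speculation.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Message:
    data: object
    spec_id: object = None  # None means the sender was not speculating


@dataclass
class Receiver:
    env: dict = field(default_factory=dict)
    spec_id: object = None
    checkpoints: list = field(default_factory=list)  # (spec_id, flag, saved env)

    def recv(self, msg, var):
        """Bind msg.data to `var`, joining the sender's speculation if needed."""
        if msg.spec_id is not None and self.spec_id is None:
            # Checkpoint first, tagged with the message's speculation ID.
            saved = copy.deepcopy(self.env)
            self.checkpoints.append((msg.spec_id, "client", saved))
            self.spec_id = msg.spec_id
        self.env[var] = msg.data

    def on_abort(self, spec_id):
        """Roll back to the checkpoint tagged with spec_id, if there is one."""
        for i, (sid, _flag, saved) in enumerate(self.checkpoints):
            if sid == spec_id:
                self.env = saved
                del self.checkpoints[i:]
                self.spec_id = None
                return True
        return False


r = Receiver(env={"x": 1})
r.recv(Message(data=42, spec_id="s1"), "v")
absorbed_into = r.spec_id
tag_and_flag = r.checkpoints[0][:2]
rolled_back = r.on_abort("s1")
```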
Model for a Speculative Distributed Objects System
- Overview of the Language
- Terminology and Notation
- Speculate
- Reading from a Shared Object
- Writing Data to a Shared Object
- Abort a Speculation
- Commit a Speculation
If it executes an abort(), the speculation is aborted and the process returns and executes2. The local environment of the process at the time the process became part of the speculation. A process checkpoint is created to allow recovery if the speculation is subsequently aborted.
If the process is a co-owner of the speculation, it returns to where it started its original speculation.
Model for Nested Speculations in a Distributed Shared Objects System
- Syntax of the Primitives
- Terminology and Notation
- Speculate
- Reading from a Shared Object
- Writing Data to a Shared Object
- Aborting a Speculation
- Commit a Speculation
The process's state depends on speculative information, so the process is absorbed into the object's speculation. The most interesting case for reading the value of a shared object is when the process and the object belong to different speculations. The reduction rule is similar to the case when neither the process nor the object is part of any speculation (Rule Nested-Write-NoSpec).
The most interesting case for writing an object is when the process and the object are within different speculations.
Nonspeculative model
Nonspeculative operational semantics
When the process encounters a speculate call, it can choose non-deterministically to take either the commit or abort branch. With the aim of proving the equivalence of this model to the speculative model presented in Section 4.2, we introduce two history variables that are local to each process's environment. The second history variable, A, in each process refers to the abort branch that is ignored by the program.
Local variable v is assigned the value of object oj and becomes part of the process's local environment.
Equivalence of the Speculative and Nonspeculative Versions of the Distributed Objects
- Algebraic Representation of the Operational Semantics Rules
- Definitions and Abstractions
- Equivalence Theorems
- MCC Overview
- Speculative Support in MCC
- Limitations of MCC’s Support for Speculations
This rule is equivalent to the Spec reduction rule from the operational semantics of the speculative model. This rule is equivalent to the Comm-Owner reduction rule from the operational semantics of the speculative model. This rule is equivalent to the Read-NoSpec reduction rule from the operational semantics of the speculative model.
This rule is equivalent to the Write-NoSpec reduction rule from the operational semantics of the speculative model.
Kernel-level Implementation Overview
Assumptions
Implementation Details
The new process corresponds to the abort branch of the speculation and it represents the light checkpoint. This prevents unnecessary copying and preserves the original state of the process in the abort branch (see Figure 5.9).
If the process is the sole owner of the speculation, it can fully perform the speculation.
Synthetic Experimental Results
- The Testing Setup
- Overhead of Executing Inside a Speculation
- Speculation Overhead with Initial Nonspeculative Accesses
- Cost of Speculative System Calls
- Speculation Support in MojaveFS
- The Distribution Component of MojaveFS
- Support for Sequential Consistency in MojaveFS
A separate log is kept for each level of speculation, similar to the heap generations presented in Figure 5.4 in Section 5.1. If a given data object becomes a "hot spot", the virtual server serving it can dynamically recruit new machines to help carry the load. The only communication between processes is through read and write operations of shared object data.
Therefore, we choose O as follows: the messages in O are ordered according to the order (from the point of view of an external global clock) in which they are assigned a "group sequence number".
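Ordering by group sequence number can be illustrated with a hold-back queue that delivers messages strictly in sequence order, whatever order they arrive in. This is a generic sketch, not the MojaveFS protocol: `GroupSequencer`, `HoldBackQueue`, and the choice to start numbering at 1 are assumptions.

```python
import heapq
import itertools


class GroupSequencer:
    """Hands out consecutive group sequence numbers, starting at 1 (assumption)."""

    def __init__(self):
        self._counter = itertools.count(1)

    def assign(self):
        return next(self._counter)


class HoldBackQueue:
    """Buffers out-of-order arrivals and delivers them in sequence-number order."""

    def __init__(self):
        self._next = 1       # next sequence number eligible for delivery
        self._pending = []   # min-heap of (seq, payload)
        self.delivered = []

    def arrive(self, seq, payload):
        heapq.heappush(self._pending, (seq, payload))
        # Deliver every message that is now contiguous with what was delivered.
        while self._pending and self._pending[0][0] == self._next:
            _, ready = heapq.heappop(self._pending)
            self.delivered.append(ready)
            self._next += 1


sequencer = GroupSequencer()
s1, s2, s3 = sequencer.assign(), sequencer.assign(), sequencer.assign()

queue = HoldBackQueue()
queue.arrive(s3, "write-3")  # arrives early, held back
queue.arrive(s1, "write-1")  # delivered immediately
queue.arrive(s2, "write-2")  # releases write-2 and then write-3
```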
Implementation Overview of MojaveFS
The Indirect I/O layer
Each node in a virtual data server has a current copy of the data corresponding to the data group it serves. It then contacts the virtual data server and retrieves the metadata object, which contains information about the identifiers of the data objects that make up the file. To conserve resources, it may be desirable for the system to consolidate the responsibilities of the virtual data servers that correspond to all files in a subtree of the file system directory structure.
In this case, the search engine would perform a binary search to determine the most specific virtual data server that is responsible for the metadata of the file to be searched.
The Direct I/O layer
This requires the adoption of some kind of recovery policy that allows one to 'rebuild' an authoritative view from non-authoritative views. This condition means that two or more non-authoritative views can merge into an authoritative view (provided that the membership of the merged view contains a majority of the nodes that were members of the last authoritative view). This is necessary to ensure that, if two or more non-authoritative views merge to create an authoritative view, a resulting authoritative view will send the most recent set of messages (reads and writes) to the view before the split.
Without this additional information, it would not be possible for the DIO layer to determine whether the membership of a view resulting from the merging of two or more non-authoritative views contains a majority of the possible nodes that could be combined to form an authoritative view (i.e., ensuring that there could be no other merge involving a disjoint set of nodes that would result in a view being considered authoritative).
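The majority condition for a merged view can be sketched as a simple set computation. The function name and the use of plain sets of node identifiers are assumptions. The sketch uses a strict majority, which is what rules out a second, disjoint group of nodes also qualifying as authoritative.

```python
def is_authoritative(merged_members, last_authoritative_members):
    """True if the merged view holds a strict majority of the members
    of the last authoritative view."""
    last = set(last_authoritative_members)
    surviving = set(merged_members) & last  # newcomers do not count toward the majority
    return 2 * len(surviving) > len(last)


last_view = {"n1", "n2", "n3", "n4", "n5"}

majority_side = is_authoritative({"n1", "n2", "n3"}, last_view)
minority_side = is_authoritative({"n4", "n5"}, last_view)
merged_fragments = is_authoritative({"n1", "n2"} | {"n3"}, last_view)
padded_with_newcomers = is_authoritative({"n1", "n2", "x", "y", "z"}, last_view)
```

With an even-sized last authoritative view, an exact half is not enough, so two disjoint halves can never both become authoritative.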
Implementation of the Lower Layer of the Group Communication Protocol
It then contacts a directory service that provides hints about the current membership of the group. This resolution mechanism allows only one of the initiators to complete the view change successfully. The goal of this phase is to reduce the number of members of the view to the current set of active processes.
The view change initiator sets the epoch of the new view larger than the largest epoch involved in the view change.
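The epoch rule can be sketched as follows. The `View` dataclass and `change_view` function are hypothetical, and incrementing the maximum by one is just one way to satisfy the "larger than the largest epoch" requirement stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class View:
    epoch: int
    members: frozenset


def change_view(involved_views, active_processes):
    """Build the new view: its epoch exceeds every epoch involved in the
    change, and its membership is reduced to the currently active processes."""
    new_epoch = max(v.epoch for v in involved_views) + 1
    return View(epoch=new_epoch, members=frozenset(active_processes))


old_views = [
    View(epoch=4, members=frozenset({"p1", "p2", "p3"})),
    View(epoch=7, members=frozenset({"p4"})),
]
new_view = change_view(old_views, active_processes={"p1", "p2", "p4"})
```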
Optimization
To achieve sequential consistency, we rely on the Group Common Order communication protocol, which ensures that messages sent by the same process to the same group are ordered in the order in which they were sent. When a process sends multiple messages to the same group, it is allowed to send a request for a sequence number and then continue execution until the sequence number is granted. Meanwhile, a process can send multiple sequence number requests to the same group.
However, when a process changes the group to which it is sending the message, it must wait for all of its previously requested sequence numbers to be granted before sending a request for a sequence number in the other group.
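The pipelining rule for sequence number requests can be modeled as a small piece of client-side bookkeeping. This is a sketch under assumptions: the class and method names are hypothetical, and grants are modeled as a simple counter rather than as messages.

```python
class SequenceRequester:
    """Tracks outstanding sequence number requests for one sending process."""

    def __init__(self):
        self.current_group = None
        self.outstanding = 0  # requests sent but not yet granted

    def can_request(self, group):
        # Same group: requests may be pipelined freely.
        # Different group: all earlier requests must have been granted first.
        return group == self.current_group or self.outstanding == 0

    def request(self, group):
        if not self.can_request(group):
            raise RuntimeError("wait for pending grants before switching groups")
        self.current_group = group
        self.outstanding += 1

    def granted(self):
        if self.outstanding == 0:
            raise RuntimeError("no outstanding request to grant")
        self.outstanding -= 1


req = SequenceRequester()
req.request("g1")
req.request("g1")                           # pipelined, no waiting
blocked_early = not req.can_request("g2")   # two grants still pending
req.granted()
blocked_mid = not req.can_request("g2")     # one grant still pending
req.granted()
allowed_late = req.can_request("g2")        # everything granted
req.request("g2")
```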
Related Work
All of the approaches listed above make use of a fixed set of dedicated metadata servers with little or no replication of the metadata. The protocol implementation was presented along with mathematical proofs that testify to the correctness of the protocol. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture (HPCA'05), pages 316–327, San Francisco, California, February 2005.
In CCGRID '02: Proceedings of the 2nd IEEE/ACM International Symposium on Cluster Computing and the Grid, page 331, Washington, DC, USA, 2002.
Simplified speculative main loop and the 2D domain decomposition for a scientific
Algorithm for total-order communication using speculations
Speculative programs for reservation system
The reservation succeeds. Send operations are shown as squares, while receives are
The reservation fails because prices are too high. Send operations are shown as squares,
Pseudo-code for the speculative algorithm used to implement a total-order communi-
Two possible speculative executions of the speculative total order protocol
Syntactically valid terms
Syntactically valid terms
Notation used in the operational semantics
Syntactically valid terms
Syntactically valid terms
Graphical and algebraic representation of speculative states
Graphical and algebraic representation of nonspeculative states
Pointer table representation
The representation of a process in a stack-based compiler
Speculation variables
Heap data with multiple speculation levels
Interaction between user level processes and the Linux kernel
The speculation header sent as part of the newly introduced IP option and the speculation
The speculate call uses the do_fork() function to create the abort branch of the specu-
The copy-on-write mechanism associated with the do_fork() function copies memory
The OSI Model and the Linux network stack
Data encapsulation between various network layers
The layout of the IP header encapsulated by IP datagrams
Skeleton of the benchmark program
Skeleton of the benchmark program
The overhead of randomly accessing the entries of an array of 128Mb for the first-time
Skeleton of the benchmark program
The overhead of randomly modifying various percentages of an array of 128Mb. The
Cost of other system calls and that of context switch time (in µs). The grayed-out
Each filename maps to a virtual data server group
An "out of order" interleaving of messages from P3 and P4 is allowed during the passive
Representation of a file and of a directory in MojaveFS. Only the shaded areas are
The layered architecture of MojaveFS
Object look-up mechanism
Views split and merge, changing authoritative status
The view change event
The state machine for our protocol