

4.1 Routing in Faulty Networks

In this section, we motivate and discuss the problems involved in performing message routing in a faulty multicomputer network. Before we proceed with the main discussion, we have to be more specific about our scope of interest. In particular, we want to make a distinction between fault-tolerant routing, our specific interest, and the much more general study of fault-tolerant communication, of which the former can be considered a subtopic. In fact, fault-tolerant communication is a vast subject consisting of many different areas of interest. For example, maintenance of reliable communication across unreliable physical channels has long been an intensively researched area, and many protocols designed to achieve reliable communication have been documented in the published literature [56,6]. Similarly, the study of the theory and practice of forward error-correcting codes [9,36] represents another active and important area in the field of fault-tolerant communication. In addition, the meaning of faults used in our current discussion also requires clarification. Roughly speaking, we can distinguish between two different classes of faults or failures in communication resources:

Soft failure - This is a transient error that occurs randomly and is mostly noise related. A typical example would be the garbling of a few bits during the transmission of data across the physical channels.

Hard failure - This is a persistent error, whose occurrence may also be random; once it occurs, the error condition will persist until the faulty unit is physically attended to. An example would be an irreversible failure such as a damaged circuit component.

For our purpose, the most important distinction between these two classes of failures is that while the impact of soft errors can be minimized through a combination of temporal and hardware (i.e., physical resources) redundancies, the impact of hard errors can only be minimized by providing hardware redundancies. An interesting discussion of some high-level techniques and strategies for handling various types of soft failures in the message-communication network of multicomputers can be found in [3]. The success of many of these techniques, however, depends on the continuous operation of the physical communication network. Unfortunately, for the oblivious wormhole networks, even a single broken channel has the effect of disconnecting many source-destination pairs.
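The effect of a single broken channel on an oblivious network can be illustrated with a small count. The sketch below is an illustration only: it assumes a 4 x 4 mesh and x-then-y dimension-order routing as a stand-in for an oblivious routing scheme, and the names xy_route and pairs_cut_by are hypothetical. Because such a scheme gives each source-destination pair exactly one route, every pair whose route crosses the broken channel is disconnected.

```python
from itertools import product


def xy_route(src, dest):
    """Return the unique x-then-y route from src to dest as a list of nodes."""
    (x, y), (dx, dy) = src, dest
    path = [(x, y)]
    while x != dx:                      # correct the x coordinate first
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:                      # then correct the y coordinate
        y += 1 if dy > y else -1
        path.append((x, y))
    return path


def pairs_cut_by(broken, k):
    """Count ordered (source, destination) pairs in a k x k mesh whose
    only route crosses the bidirectional channel `broken`."""
    broken = frozenset(broken)
    nodes = list(product(range(k), repeat=2))
    cut = 0
    for src, dest in product(nodes, repeat=2):
        if src == dest:
            continue
        path = xy_route(src, dest)
        if any(frozenset(hop) == broken for hop in zip(path, path[1:])):
            cut += 1
    return cut


# Assumed example: break the channel between nodes (1, 0) and (2, 0).
cut = pairs_cut_by(((1, 0), (2, 0)), k=4)
total = 16 * 15                         # ordered pairs of distinct nodes
```

Under these assumptions, 32 of the 240 ordered pairs lose their only route, although the mesh itself remains connected.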

Obviously, having the ability to route and deliver messages among the subset of nodes that survived, in spite of a limited number of hard failures in the message network, is a sine qua non to supporting truly fault-tolerant computation in multicomputers. The research reported in this chapter represents an attempt toward achieving this goal.

4.1.1 The Fault-Tolerant Routing Problem

In order to understand the main problem involved in performing fault-tolerant routing in multicomputer networks, we observe that the popular connection topologies of multicomputer networks such as k-ary-n-cubes or meshes are highly regular. Apart from the obvious advantages of having reasonably high bandwidths and systematic layouts, the regularity in these topologies allows for simple algorithmic-routing procedures based entirely on local information. In our adaptive-routing formulation, the existence of an algorithmic-routing procedure is essential for keeping the cost of realizing the routing relations at each node at an acceptable level. Such capability is particularly important in fine-grain multicomputers where resources at each node are scarce. Equally important, the existence of simple algorithmic routing procedures in these regular topologies allows direct hardware realization of the routing functions, which is absolutely essential in high-performance systems. The Torus Routing Chip [11] and the Mesh Routing Chip [14] are successful examples of these hardware routers.
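To make the idea of an algorithmic-routing procedure based on local information concrete, the following minimal sketch shows dimension-order routing in an n-dimensional mesh. It is not the adaptive formulation discussed in this chapter; it is a simpler, oblivious procedure chosen for illustration, and the names output_port and route are hypothetical. The routing decision uses only the address of the current node and the destination address carried by the packet, with no table lookup.

```python
def output_port(current, dest):
    """Choose the output port for a packet at `current` headed to `dest`.

    Returns (dimension, direction) with direction +1 or -1, or None when
    the packet has arrived. The lowest unresolved dimension is corrected
    first, so the decision needs no state beyond the two addresses.
    """
    for dim, (c, d) in enumerate(zip(current, dest)):
        if c != d:
            return dim, (1 if d > c else -1)
    return None


def route(src, dest):
    """Apply output_port hop by hop and return the list of visited nodes."""
    path = [tuple(src)]
    while (port := output_port(path[-1], dest)) is not None:
        dim, step = port
        node = list(path[-1])
        node[dim] += step               # move one hop along the chosen port
        path.append(tuple(node))
    return path
```

For example, route((0, 0, 0), (1, 0, 2)) visits (0, 0, 0), (1, 0, 0), (1, 0, 1), and (1, 0, 2). The procedure relies on every node along the way having the neighbor that the address arithmetic predicts, which is exactly the regularity that faults take away.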

As individual nodes and channels fail, the regularity in these networks is destroyed and the algorithmic-routing procedures are no longer applicable. Routing in irregular networks can be systematically achieved by storing and consulting routing tables at each node of the network. However, such routing tables demand excessive resources at each node and become unacceptable as the networks grow in size. Schemes such as hierarchical clustering of network nodes have been proposed [28] to achieve savings in routing-table sizes. However, such methods appear to be unsatisfactory for multicomputer networks for the following reasons:

1. The hierarchical clustering of nodes in the richly connected topologies that are commonly used in multicomputer networks typically eliminates the use of many alternative paths by confining routes to those paths that are consistent with the hierarchy.

2. As nodes and channels fail, the change in the network topology could force a global renaming of network nodes, in the course of adjusting the hierarchy.

3. The circuitry required to store, consult, and update the routing tables represents considerable hardware overhead cost paid in advance in every node, regardless of the presence or absence of failures.

Conceivably, another alternative would be to devise some systematic search technique with appropriate backtracking capabilities that would allow the packet to home in on its destination. However, such schemes tend to be difficult and complicated and, in general, are very inefficient in using the very precious remaining bandwidth. Essentially, these approaches are developed to handle routing in highly irregular networks. A different and more satisfactory approach would try to exploit the regularity of the original non-faulty network. In this chapter, we suggest and investigate such an approach based on our adaptive multipath-routing formulation.

4.1.2 A Simple Fault Model

We now describe a simple fault model to be used in our subsequent discussion of fault-tolerant routing in multicomputer networks. As before, a multicomputer network, M, is a connected, undirected graph, M = G(N, C). The vertices of the graph, N, represent the set of computing nodes; the edges of the graph, C, represent the set of bidirectional communication channels. Recall from Chapter 2 that a computing node is conceptually divided into four subsystems: processor, node memory, message interface, and network router. Since our primary subject of interest is message routing, we shall not make the fine distinction between a node and its router. We now state the additional definitions and notations needed for discussing faulty multicomputer networks:

Definition 4.1 A node (i.e., its router) n ∈ N is faulty if it is unable to perform its packet-to-channel routing-assignment functions correctly.

Definition 4.2 A channel c ∈ C is faulty if it is unable to forward packets and follow the coherent protocol correctly.

Definition 4.3 A network is faulty if it contains at least one faulty node or one faulty channel. The non-faulty nodes and channels are referred to as the survived nodes and survived channels. Given a faulty network, the set of all survived nodes and survived channels together constitutes the survived subnet.

Definition 4.4 A route Pij from node ni to node nj generated by R of the original non-faulty network is legal in its faulty descendants if, and only if, the source, ni, the destination, nj, and all intermediate nodes and channels of Pij remain non-faulty in the faulty network. Observe that whether a route remains legal depends on the particular fault pattern of the network under consideration.
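Definitions 4.1 through 4.4 can be made concrete with a small data structure. The following is a minimal sketch under stated assumptions: the class name FaultyNetwork, the helper channel, and the representation of a route as a list of nodes are all chosen for illustration. The routing relation R is not modeled; a route is simply taken as given and checked against a fault pattern.

```python
from dataclasses import dataclass, field


def channel(a, b):
    """A bidirectional channel, represented as an unordered pair of nodes."""
    return frozenset((a, b))


@dataclass
class FaultyNetwork:
    nodes: set                                   # N
    channels: set                                # C, built with channel()
    faulty_nodes: set = field(default_factory=set)
    faulty_channels: set = field(default_factory=set)

    def is_faulty(self):
        """Definition 4.3: at least one faulty node or faulty channel."""
        return bool(self.faulty_nodes or self.faulty_channels)

    def survived_subnet(self):
        """Definition 4.3: the non-faulty nodes and non-faulty channels."""
        return (self.nodes - self.faulty_nodes,
                self.channels - self.faulty_channels)

    def is_legal(self, route):
        """Definition 4.4: every node and channel on the route is non-faulty."""
        good_nodes, good_channels = self.survived_subnet()
        hops = [channel(a, b) for a, b in zip(route, route[1:])]
        return (all(n in good_nodes for n in route)
                and all(c in good_channels for c in hops))


# Assumed example: a three-node line 0 - 1 - 2 with the channel 1 - 2 broken.
net = FaultyNetwork(nodes={0, 1, 2},
                    channels={channel(0, 1), channel(1, 2)},
                    faulty_channels={channel(1, 2)})
```

In this example the route [0, 1] remains legal while [0, 1, 2] does not, which shows that legality is a property of a route together with a particular fault pattern.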

We assume the following:

1. A packet forwarded to a faulty node is consumed by that node.

2. A packet forwarded along a faulty channel is lost.

3. A survived node is able to determine the status of all of its own channels locally.

4. Node faults and channel faults occur independently and randomly.

Together, the first three assumptions capture the essence of the well-known notion of a fail-stop processor that has well-defined failure-mode operating characteristics [46].

In particular, it assumes that a faulty node simply stops executing without performing incorrect routing or generating spurious messages into the network. More specifically, we approach the problem of fault-tolerant routing by assuming the existence of a multicomputer network consisting of such fail-stop nodes and channels in order to motivate the issues, and discuss the solutions at an abstract level.

The independent and random occurrences of node faults and channel faults model an environment of random component failures such as malfunctioning integrated circuits or failed connections due to bad contacts. Fault distributions that are highly correlated and non-random in character will require a different approach from what is explored here. A very good example of such non-random faults is found in power-supply failures.

In a typical arrangement, a single power supply is used to provide power to a whole cluster of nodes, such as a contiguous submesh. Hence a single power-supply failure will disable the entire chunk of nodes supported by it. A natural solution to handle such failures is to provide redundancy at the location of the bottleneck, i.e., to provide for backup power supplies, instead of relying on the routing network.

It is interesting to observe that under our model, a non-faulty node cannot distinguish a faulty neighboring node from a faulty channel joining them. In fact, a faulty node is empirically equivalent to a survived node whose channels are all faulty. In a limited way, the notion of a faulty node models a highly correlated local concentration of channel faults.
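This equivalence can be checked on a small example. The sketch below assumes that the only thing a survived node observes locally is which of its own channels still lead to a working neighbor; the function usable_neighbors and the four-node network are hypothetical and serve only as an illustration. It compares two fault patterns: node 1 faulty, versus node 1 survived with all of its channels faulty.

```python
def channel(a, b):
    """A bidirectional channel, represented as an unordered pair of nodes."""
    return frozenset((a, b))


def usable_neighbors(node, channels, faulty_nodes, faulty_channels):
    """Local view of `node`: neighbors it can still forward packets to.

    Assumes no self-loops. A neighbor is unusable if the joining channel
    is faulty or the neighbor itself is faulty; the node sees only the
    outcome, not the cause.
    """
    usable = set()
    for c in channels:
        if node in c and c not in faulty_channels:
            (other,) = c - {node}
            if other not in faulty_nodes:
                usable.add(other)
    return usable


# Assumed example network: node 1 joined to 0, 2, and 3, plus a channel 0 - 2.
channels = {channel(1, 0), channel(1, 2), channel(1, 3), channel(0, 2)}

# Pattern A: node 1 is faulty. Pattern B: every channel of node 1 is faulty.
pattern_a = dict(faulty_nodes={1}, faulty_channels=set())
pattern_b = dict(faulty_nodes=set(),
                 faulty_channels={c for c in channels if 1 in c})

views_a = {n: usable_neighbors(n, channels, **pattern_a) for n in (0, 2, 3)}
views_b = {n: usable_neighbors(n, channels, **pattern_b) for n in (0, 2, 3)}
```

Under both patterns the local views of nodes 0, 2, and 3 are identical, so no survived neighbor of node 1 can tell the two patterns apart from local information alone.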