Reliability
4.2 Systematic Fault-Tolerant Routing
1. A packet forwarded to a faulty node is consumed by that node.
2. A packet forwarded along a faulty channel is lost.
3. A survived node is able to determine the status of all of its own channels locally.
4. Node faults and channel faults occur independently and randomly.
Together, the first three assumptions capture the essence of the well-known notion of a fail-stop processor that has well-defined failure-mode operating characteristics [46].
In particular, this notion assumes that a faulty node simply stops executing without performing incorrect routing or generating spurious messages into the network. More specifically, we approach the problem of fault-tolerant routing by assuming the existence of a multicomputer network consisting of such fail-stop nodes and channels in order to motivate the issues, and to discuss the solutions at an abstract level.
The independent and random occurrences of node faults and channel faults model an environment of random component failures such as malfunctioning integrated circuits or failed connections due to bad contacts. Fault distributions that are highly correlated and non-random in character will require a different approach from what is explored here. A very good example of such non-random faults is found in power-supply failures.
In a typical arrangement, a single power supply is used to provide power to a whole cluster of nodes, such as a contiguous submesh. Hence a single power-supply failure will disable the entire chunk of nodes supported by it. A natural solution to handle such failures is to provide redundancy at the location of the bottleneck, i.e., to provide for backup power supplies, instead of relying on the routing network.
It is interesting to observe that under our model, a non-faulty node cannot distinguish a faulty neighboring node from a faulty channel joining them. In fact, a faulty node is empirically equivalent to a survived node whose channels are all faulty. In a limited way, the notion of a faulty node models a highly correlated local concentration of channel faults.
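The fault model can be made concrete with a small sketch. The Python fragment below is illustrative only and is not part of the original formulation: the function names, the encoding of mesh nodes as (x, y) pairs, and the fault probabilities p_node and p_channel are assumptions introduced here. It samples independent node and channel faults on a 2D mesh (assumption 4) and computes what a survived node can determine locally about its own channels (assumption 3). A channel is reported as unusable whether the channel itself or the neighbor behind it has failed.

```python
import random

def neighbours(node, width, height):
    """Mesh neighbours of node inside a width x height 2D mesh."""
    x, y = node
    cand = [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]
    return [(a, b) for a, b in cand if 0 <= a < width and 0 <= b < height]

def sample_faults(width, height, p_node, p_channel, rng):
    """Draw node faults and channel faults independently (assumption 4)."""
    nodes = [(x, y) for x in range(width) for y in range(height)]
    faulty_nodes = {n for n in nodes if rng.random() < p_node}
    faulty_channels = set()
    for n in nodes:
        for m in neighbours(n, width, height):
            # n < m visits each undirected channel exactly once.
            if n < m and rng.random() < p_channel:
                faulty_channels.add(frozenset((n, m)))
    return faulty_nodes, faulty_channels

def usable_neighbours(node, width, height, faulty_nodes, faulty_channels):
    """Local view of a survived node (assumption 3): the neighbours it can
    still forward to. The node cannot tell a dead channel from a dead neighbour."""
    return {m for m in neighbours(node, width, height)
            if m not in faulty_nodes
            and frozenset((node, m)) not in faulty_channels}

# Hypothetical parameters, chosen only for illustration.
faulty_nodes, faulty_channels = sample_faults(4, 4, 0.1, 0.05, random.Random(0))
```

With this encoding, a faulty node at (1, 1) and a survived node at (1, 1) whose four channels are all faulty produce identical local views at every other node, which is the equivalence observed in the preceding paragraph.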
to use, with minimal change, the original routing hardware for the non-faulty network, so as to exploit the inherent regularity in these multicomputer networks. To proceed, observe that an immediate result of having only local information to guide routing is that pairs of survived nodes may not be able to communicate with each other even if they remain connected. In this section, we introduce and define two theoretical notions that characterize the situations under which we can continue to use the algorithmic routing relations, defined for the original non-faulty networks, to systematically direct routing in their faulty descendants.
One immediate advantage of this insistence on using only the set of original routes is that we can obtain a priori bounds on the length of routes joining pairs of sources and destinations in the survived subnet. Another advantage, as we shall see, is that it is very easy to detect messages destined to faulty nodes based only on locally available information. Once detected, such messages can then be sunk into an intermediate node, thus triggering exception processing.
Obviously, a scheme that confines the routing paths to those used in the original non-faulty networks would fail miserably if the routing functions are oblivious. The effectiveness of this scheme under our adaptive routing formulation will ultimately depend on the connection topology and on the set of routing relations defined by the algorithmic-routing procedure.
4.2.1 The Convex Subset
We shall first introduce the notion informally using the 2D rectilinear mesh network that employs the usual shortest-path routing relation as an illustration. Consider Figure 4.1, which depicts a 2D mesh network whose faulty nodes are disconnected from the bulk of the survived nodes. It is straightforward to observe that the illustrated fault pattern has the following interesting property: There exists at least one legal route between every pair of survived nodes. Recall that a route of the original network is legal in its faulty descendants if the route lies completely within the set of survived nodes and channels.
Figure 4.1: A Convex Survived Set in a 2D Mesh Network

This property is sufficient to allow all survived nodes to communicate by sending and receiving messages among one another according to the routing relations of the original non-faulty network. In particular, given a survived node, $n_i$, and its corresponding set of survived channels, $C_i^f \subseteq C_i$, let $R_{ij}^f \subseteq C_i^f$ denote the restricted routing relation of $n_i$ to another node, $n_j$; i.e., $R_{ij}^f = R_{ij} \cap C_i^f$. The above sufficiency property is equivalent to having $R_{ij}^f \neq \emptyset$ for every pair of survived nodes $n_i$ and $n_j$. Under this condition, each message packet that arrives at an intermediate node but is destined to another survived node will find from the restricted routing relation of the intermediate node at least one profitable channel that is non-faulty. The acyclicity of the original routing relations then guarantees the existence of a route. In summary, these observations motivate the following definition:

Definition 4.5 Given the set $R = \{R_{ij}\}$ of routing relations of a non-faulty network, a set of survived nodes, $S \subseteq N$, is convex under $R$ if, for every pair of nodes $n_i, n_j \in S$, there exists at least one legal route leading from $n_i$ to $n_j$ that is generated by $R$.

Notice that the original non-faulty network is convex by definition. When the entire set of survived nodes and channels of our network forms a convex set, we have the happy situation where all survived nodes can continue to communicate with each other. These communications are achieved with virtually no change in the basic routing decision mechanism at each node. The only added requirement is for a router to be able to recognize its own faulty channels. This property renders a direct hardware realization practical.
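As an illustration of Definition 4.5, the following Python sketch checks convexity for a 2D mesh under minimal (shortest-path) routing. It is a sketch under stated assumptions rather than the hardware mechanism itself: nodes are encoded as (x, y) pairs, channels as unordered pairs of nodes, and the function names are hypothetical. The routing relation is taken to be the set of neighbors that are one hop closer to the destination, and a legal route is searched for by following only restricted (profitable and non-faulty) hops. The last function tests the local condition that every restricted routing relation is nonempty, which the text states is equivalent.

```python
from itertools import permutations

def profitable(src, dst):
    """R_ij under minimal routing: neighbours of src one hop closer to dst."""
    (x, y), (tx, ty) = src, dst
    hops = set()
    if tx != x:
        hops.add((x + (1 if tx > x else -1), y))
    if ty != y:
        hops.add((x, y + (1 if ty > y else -1)))
    return hops

def restricted(src, dst, survived, faulty_channels):
    """Restricted relation: R_ij intersected with the usable channels of src."""
    return {n for n in profitable(src, dst)
            if n in survived and frozenset((src, n)) not in faulty_channels}

def has_legal_route(src, dst, survived, faulty_channels):
    """Depth-first search that only ever follows restricted hops."""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in restricted(node, dst, survived, faulty_channels):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def is_convex(survived, faulty_channels=frozenset()):
    """Definition 4.5: a legal route exists for every ordered pair."""
    return all(has_legal_route(a, b, survived, faulty_channels)
               for a, b in permutations(survived, 2))

def every_restricted_relation_nonempty(survived, faulty_channels=frozenset()):
    """The purely local form of the same condition."""
    return all(restricted(a, b, survived, faulty_channels)
               for a, b in permutations(survived, 2))

mesh = {(x, y) for x in range(3) for y in range(3)}
print(is_convex(mesh - {(0, 0)}))  # True: a faulty corner leaves the rest convex
print(is_convex(mesh - {(1, 1)}))  # False: (0, 1) has no minimal route to (2, 1)
```

Under this particular routing relation, a single faulty channel between two adjacent survived nodes already destroys convexity, because the only profitable hop between them is the faulty channel itself.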
When the set of survived nodes and channels does not form a convex set under $R$, because certain pairs of nodes do not have legal routes joining them that are consistent with $R$, we have a problem. In the interest of pursuing simplicity and generality, we suggest and investigate the following simple alternative: to selectively discard certain survived nodes that are particularly difficult to communicate with, so that the remaining subset of survived nodes constitutes a convex set. In other words, instead of asking how to route messages in an irregular network, we ask the alternative question: how to restore regularity, or in this case, convexity, back to the survived network. In essence, nodes that become difficult to reach without global information are abandoned as a result of our insistence on using only local information to guide routing. This approach immediately suggests the following problem.

Problem 4.1 (Maximum Convex Subset) Given the set $R$ of routing relations of a non-faulty network, the set of survived nodes $S \subseteq N$, and the survived channels, find the maximum-cardinality subset $S_c \subseteq S$ that is convex under $R$.

By restricting all computations and communications to within the convex subset of a survived network, routing of messages can again be carried out by the hardware router implementing the original set $R$ of routing relations, a very simple state of affairs. An illustration of this technique is depicted in Figure 4.2, where the broken nodes denote nodes that are deliberately disconnected from the bulk of the remaining network. The cardinality of the maximum convex subset of a survived network provides a useful figure of merit to gauge the effectiveness of this approach to fault-tolerant routing.

4.2.2 The Communication Kernel
Figure 4.2: A Convex Survived Subset in a 2D Mesh Network

Finding the maximum convex subset of the set of survived nodes is one simple strategy that regularizes the survived network and hence lets us continue to use a restricted version of the original routing relations to guide message routing. We now describe another useful regularization strategy, namely, restraining certain survived nodes to operate only as pure switches: a switch can forward messages but cannot itself generate or consume them. In particular, a switch can never be the destination of a message. Instead, its presence retains certain routes in order to enable communication between pairs of nodes that would otherwise be impossible. The rationale is that some survived nodes that are difficult to reach, and thus are discardable, might be located in positions that enable other pairs to communicate and, thus, should be retained. Restraining such nodes to act as pure switches represents an attempt to capture both conflicting objectives. Because of the nondeterministic nature of message trajectories in our adaptive formulation, a node is eligible to be a message destination if and only if legal routes that are generated by $R$ exist between it and every survived node. This way, a message with sufficiently high priority will always be forwarded toward its destination, regardless of where the message is. These observations motivate the following definition:
Definition 4.6 Given the set $R = \{R_{ij}\}$ of routing relations of a non-faulty network, and $S \subseteq N$, the set of survived nodes, the set $K \subseteq S$, called the communication kernel of $S$ under $R$, is the set of survived nodes $\{n_j\}$ where, for all $n_i \in S$, at least one legal route that is generated by $R$ exists that leads from $n_i$ to $n_j$.

Notice that the kernel of a non-faulty network is by definition the entire network. Given a fixed nonempty set of survived nodes, its communication kernel always exists and is unique. The word kernel is jargon borrowed from computational geometry, where the kernel of a simple polygon is defined to be the set of points inside the polygon that are visible from every point in the polygon. The cardinality of the communication kernel of a survived network provides a useful figure of merit to gauge the effectiveness of this strategy.
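Definition 4.6 translates directly into a membership test, sketched below in Python for a 2D mesh with minimal routing. The encoding (nodes as (x, y) pairs, channels as unordered node pairs) and the names profitable, has_legal_route and communication_kernel are assumptions made for this illustration only. A node belongs to the kernel if a legal route leads to it from every survived node; the remaining survived nodes are the ones that would be restrained to act as pure switches.

```python
def profitable(src, dst):
    """R_ij under minimal routing: neighbours of src one hop closer to dst."""
    (x, y), (tx, ty) = src, dst
    hops = set()
    if tx != x:
        hops.add((x + (1 if tx > x else -1), y))
    if ty != y:
        hops.add((x, y + (1 if ty > y else -1)))
    return hops

def has_legal_route(src, dst, survived, faulty_channels=frozenset()):
    """Search for a route that uses only profitable hops over survived
    nodes and non-faulty channels."""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in profitable(node, dst):
            usable = (nxt in survived
                      and frozenset((node, nxt)) not in faulty_channels)
            if usable and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def communication_kernel(survived, faulty_channels=frozenset()):
    """Definition 4.6: destinations reachable by a legal route from
    every survived node."""
    return {dst for dst in survived
            if all(has_legal_route(src, dst, survived, faulty_channels)
                   for src in survived)}

mesh = {(x, y) for x in range(3) for y in range(3)}
survived = mesh - {(1, 1)}          # 3 x 3 mesh with a faulty centre node
kernel = communication_kernel(survived)
switches = survived - kernel        # survived nodes kept only as pure switches
print(sorted(kernel))    # [(0, 0), (0, 2), (2, 0), (2, 2)]
print(sorted(switches))  # [(0, 1), (1, 0), (1, 2), (2, 1)]
```

In this small example the four edge midpoints cannot be destinations, since the midpoint on the opposite edge has no minimal route to them, yet they carry the routes that connect the four corners.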
These two regularization strategies, namely, selectively discarding nodes, and selectively restraining certain nodes to operate as pure switches, can be combined to achieve even better node-reclamation results. By deliberately discarding certain nodes that are excessively difficult to reach, it may be possible to increase the cardinality of the communication kernel of the remaining nodes. This observation suggests the following problem.
Problem 4.2 (Maximum Communication Kernel) Given the set $R$ of routing relations of a non-faulty network, the set of survived nodes $S \subseteq N$, and the survived channels, find the subset $S_K \subseteq S$ that contains a communication kernel of maximum cardinality, over all such subsets, under $R$.
Figure 4.3 shows a typical communication kernel obtained for a 2D faulty mesh network.
The blank nodes in the figure are nodes that have been restrained to operate as pure switches. The ratio of this maximum kernel cardinality to the size of the original non-faulty network serves as a natural reliability measure for our subsequent investigation.
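For very small networks, Problems 4.1 and 4.2 can be explored by exhaustive search. The Python sketch below is only an illustration under stated assumptions: it uses a 2D mesh with minimal routing, encodes nodes as (x, y) pairs, introduces hypothetical function names, and enumerates every subset of survived nodes, which is exponential in their number and is not proposed here as a practical algorithm. It relies on the observation, which follows from Definitions 4.5 and 4.6, that a set is convex exactly when it equals its own communication kernel.

```python
from itertools import combinations

def profitable(src, dst):
    """R_ij under minimal routing: neighbours of src one hop closer to dst."""
    (x, y), (tx, ty) = src, dst
    hops = set()
    if tx != x:
        hops.add((x + (1 if tx > x else -1), y))
    if ty != y:
        hops.add((x, y + (1 if ty > y else -1)))
    return hops

def has_legal_route(src, dst, nodes, faulty_channels):
    """Search using only profitable hops inside nodes over non-faulty channels."""
    stack, seen = [src], {src}
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        for nxt in profitable(node, dst):
            usable = nxt in nodes and frozenset((node, nxt)) not in faulty_channels
            if usable and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return False

def communication_kernel(nodes, faulty_channels=frozenset()):
    """Definition 4.6 applied to the node set nodes."""
    return {d for d in nodes
            if all(has_legal_route(s, d, nodes, faulty_channels) for s in nodes)}

def max_convex_subset(survived, faulty_channels=frozenset()):
    """Problem 4.1 by brute force: try subsets from largest to smallest."""
    ordered = sorted(survived)
    for size in range(len(ordered), 0, -1):
        for subset in map(set, combinations(ordered, size)):
            if communication_kernel(subset, faulty_channels) == subset:
                return subset
    return set()

def max_kernel_subset(survived, faulty_channels=frozenset()):
    """Problem 4.2 by brute force: the subset whose kernel is largest."""
    ordered = sorted(survived)
    best_subset, best_kernel = set(), set()
    for size in range(len(ordered), 0, -1):
        for subset in map(set, combinations(ordered, size)):
            kernel = communication_kernel(subset, faulty_channels)
            if len(kernel) > len(best_kernel):
                best_subset, best_kernel = subset, kernel
    return best_subset, best_kernel

mesh = {(x, y) for x in range(3) for y in range(3)}
survived = mesh - {(1, 1)}                      # faulty centre node
print(len(communication_kernel(survived)))      # 4
print(len(max_convex_subset(survived)))         # 5
print(len(max_kernel_subset(survived)[1]))      # 5
```

In this 3 x 3 example with a faulty center node, keeping all eight survived nodes yields a kernel of four nodes, whereas discarding three of them leaves an L-shaped set of five nodes that is convex and is therefore its own kernel. This is a small instance of how discarding nodes can enlarge the communication kernel.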