Reliability
4.3 Computational Considerations
Notice that the kernel of a non-faulty network is by definition the entire network. Given a fixed nonempty set of survived nodes, its communication kernel always exists and is unique. The word kernel is jargon borrowed from computational geometry, where the kernel of a simple polygon is defined to be the set of points inside the polygon that are visible from every point in the polygon. The cardinality of the communication kernel of a survived network provides a useful figure of merit to gauge the effectiveness of this strategy.
These two regularization strategies, namely, selectively discarding nodes, and selectively restraining certain nodes to operate as pure switches, can be combined to achieve even better node-reclamation results. By deliberately discarding certain nodes that are excessively difficult to reach, it may be possible to increase the cardinality of the communication kernel of the remaining nodes. This observation suggests the following problem.
Problem 4.2 (Maximum Communication Kernel) Given the set $R$ of routing relations of a non-faulty network, the set of survived nodes $S \subseteq N$, and the survived channels, find the subset $S_K \subseteq S$ that contains a communication kernel of maximum cardinality, over all such subsets, under $R$.
Figure 4.3 shows a typical communication kernel obtained for a 2D faulty mesh network.
The blank nodes in the figure are nodes that have been restrained to operate as pure switches. The ratio of this maximum kernel cardinality to the size of the original non-faulty network serves as a natural reliability measure for our subsequent investigation.
Figure 4.3: An Example Communication Kernel in the 2D Mesh

4.3.1 Computational Complexity
Before one can meaningfully derive effective computational schemes to determine the MCS and MCK for any survived network, it is natural to first investigate the complexity of the involved computations. We shall start by proving the NP-completeness of the two decision problems; this is sufficient to establish the NP-hardness of the two corresponding search problems [16].
Theorem 4.1 The maximum convex subset (MCS) decision problem is NP-complete: Given an arbitrary network, $M = (N, C)$, an arbitrary set of routing relations, an arbitrary set of survived nodes and channels, and a positive integer $J \le |N|$, determine if $M$ contains a survived convex subset of size $J$ or more.

Proof. MCS, as defined, is in the class NP; i.e., given a subset of survived nodes, there exist polynomial-time algorithms for checking the convexity of the subset. We now transform the Maximum Clique decision problem for an arbitrary graph, $G$, of $|N|$ vertices into MCS. To proceed, we first define our original non-faulty network $M$ to be the completely connected graph, $K_n$, with routing relations $R = \{R_{ij}\}$, where $R_{ij} = \{c_{ij}\}$ and $c_{ij}$ is the only channel joining node $n_i$ to node $n_j$ in $M = K_n$. The transformation from the arbitrary graph, $G$, to a survived subset of $M$ is defined as follows: vertex $v_i$ in $G$ is mapped to node $n_i$ in $M$, and the edge $e_{ij}$ in $G$ is mapped to a survived channel $c_{ij}$ in $M$. All remaining channels in $M$ are considered to be faulty. Under this mapping, a clique in $G$ is equivalent to a convex subset in $M$ under $R$. Hence, a solution to the MCS applied to the network $M$ can be trivially transformed back to give a maximum clique for $G$. This establishes the NP-completeness of MCS. ■
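The transformation used in the proof can be illustrated with a small sketch. It assumes, as the proof asserts, that under the single-channel routing relations $R_{ij} = \{c_{ij}\}$ a subset of nodes is convex exactly when every pair of its nodes is joined by a survived channel. The function names, the use of undirected channels, and the brute-force search are illustrative choices made here, not part of the original development; the search is exponential, as expected for an NP-hard problem.

```python
from itertools import combinations

def reduce_clique_to_mcs(num_vertices, edges):
    """Map a graph G onto a survived subset of M = K_n (hypothetical helper).

    Vertex v_i becomes node n_i; edge e_ij becomes survived channel c_ij.
    Every channel of K_n not listed in `edges` is treated as faulty.
    Channels are modelled as undirected (an assumption of this sketch).
    """
    nodes = set(range(num_vertices))
    survived_channels = {frozenset(e) for e in edges}
    return nodes, survived_channels

def is_convex(subset, survived_channels):
    """Convexity test under R_ij = {c_ij} (assumed equivalent to a clique test).

    The only legal route between n_i and n_j is the channel c_ij itself,
    so the subset qualifies only if every such channel survived.
    """
    return all(frozenset((i, j)) in survived_channels
               for i, j in combinations(subset, 2))

def max_convex_subset(nodes, survived_channels):
    """Exhaustive search, largest candidates first; exponential time."""
    ordered = sorted(nodes)
    for size in range(len(ordered), 0, -1):
        for candidate in combinations(ordered, size):
            if is_convex(candidate, survived_channels):
                return set(candidate)
    return set()
```

Any maximum convex subset returned for the constructed network is, read back through the vertex-to-node mapping, a maximum clique of $G$.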
Theorem 4.2 The maximum communication kernel (MCK) decision problem is NP-complete: Given an arbitrary network, $M = (N, C)$, an arbitrary set of routing relations, an arbitrary set of survived nodes and channels, and a positive integer $J \le |N|$, determine if $M$ contains a survived subset having a communication kernel of size $J$ or more.
Proof. The proof is almost identical to that in MCS. The network, $M$, the set of routing relations, $R$, and the vertex-to-node and edge-to-channel mappings are all identically defined. The proof is completed by observing that, for the defined network, $M$, if a survived subset contains a communication kernel of size $J$ or more, it also contains a convex subset of size $J$ or more, and vice versa, by definition. Hence, a solution to the MCK problem in this case can also be transformed back to give a maximum clique for $G$. This establishes the NP-completeness of MCK. ■

The above NP-completeness results imply that the corresponding search problems of actually finding the maximum convex subset and maximum communication kernel subset are NP-hard; hence, unless P = NP, exact solutions that run in polynomial time will not be found. The question as to whether the restriction of certain fixed topologies and certain fixed routing relations would reduce the problem to polynomial time remains open. In any case, in the interest of studying the performance of different network topologies and routing relations at the current stage, we shall have to rely on computations that have been done by approximating heuristics.
4.3.2 Approximating Heuristics
We now proceed to describe a simple but effective heuristic elimination procedure to find reasonable solutions to the MCK problem. Whether there exist approximation algorithms capable of providing a priori performance guarantees is yet another open question.
The heuristic elimination procedure to be described is motivated naturally by the objectives behind our introduction of the MCK problem. The objectives are:
1. Selectively discard a subset of survived nodes that are difficult to reach given the survived structure.
2. Selectively restrain a subset of survived nodes to operate as pure switches. These switches maintain the reachability among other pairs of nodes without insisting on being reachable themselves.
We observe that the second objective can be readily achieved, once we have decided which subset of survived nodes to discard. This is because every fixed subset of survived nodes has a unique communication kernel, by definition. Any survived node that is not inside the kernel will then be restrained to operate only as a switch. In contrast, achieving the first objective optimally is far more difficult, since there is an exponential number of candidate subsets to consider. Here we suggest the following heuristic approach, which delivered reasonable results for all our subsequent simulation studies.
Sequentially discard one node at a time by eliminating the one that is the most difficult to reach at that moment. During the entire elimination process, keep track of the maximum cardinality kernel ever obtained. Terminate when the number of remaining survived nodes is equal to the size of the recorded maximum kernel; at this point, we are sure no further improvement is possible under this heuristic.
In order to apply this heuristic, we need a way to quantify the vague notion of a node being difficult to reach. We suggest the following intuitive definition, which appears to give good empirical results:
Given a survived node, $n_j$, let $h(n_j)$ denote the total number of survived nodes, $n_i \ne n_j$, for which there is no legal route leading from $n_i$ to $n_j$. The node, $n^*$, with the maximum count, i.e., $h(n^*) \ge h(n_j)$, $\forall j$, is designated as the node that is the most difficult to reach. Ties can be broken arbitrarily.

Oftentimes, the computation of the cost estimate, $h(n_j)$, can be rather expensive; in such cases, a simpler estimate, $h'(n_j)$, which counts the number of other survived nodes, $n_i$, for which $R_{ij} = \emptyset$, may be used instead. The simpler estimate, $h'(n_j)$, shares with the more accurate estimate, $h(n_j)$, the property that at each stage of the computation, all nodes in the communication kernel have their respective cost estimates equal to zero. Thus, computing the suggested cost estimates after each elimination also simultaneously computes the desired new kernel configuration. Furthermore, since the elimination procedure evolves sequentially until it is suspended when the desired maximum kernel has been found, it is possible to resume and continue the elimination procedure when new faulty nodes or channels are identified. In other words, it is conceivable that we can run the elimination procedure as a background process in each node that serves as an exception handler, to be activated when additional faulty nodes or channels are detected.
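The elimination procedure and the cost estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the simulation studies: the kernel of the remaining nodes is taken to be exactly the set of nodes whose cost estimate is zero, and route legality is supplied by a hypothetical caller-provided predicate `has_route(i, j, remaining)`, which decides for itself whether discarded nodes may still forward traffic. All names are illustrative.

```python
def cost_estimates(remaining, has_route):
    """h(n_j): number of other remaining nodes with no legal route to n_j."""
    return {
        j: sum(1 for i in remaining
               if i != j and not has_route(i, j, remaining))
        for j in remaining
    }

def kernel_by_elimination(survived, has_route):
    """Greedy elimination heuristic for the MCK problem (sketch).

    Returns (best_kernel, kept), where `kept` is the set of survived nodes
    that were still present when the largest kernel was recorded. Nodes in
    `kept` but outside `best_kernel` would operate as pure switches.
    """
    remaining = set(survived)
    best_kernel, kept = set(), set()
    while remaining and len(remaining) > len(best_kernel):
        h = cost_estimates(remaining, has_route)
        # Assumption of this sketch: the kernel is the set of zero-cost nodes.
        kernel = {j for j, cost in h.items() if cost == 0}
        if len(kernel) > len(best_kernel):
            best_kernel, kept = kernel, set(remaining)
        # Stop once no smaller node set could hold a larger kernel.
        if len(remaining) <= len(best_kernel):
            break
        # Discard the node that is hardest to reach; ties broken arbitrarily.
        worst = max(h, key=h.get)
        remaining.discard(worst)
    return best_kernel, kept
```

Substituting the cheaper estimate $h'$ only requires passing a `has_route` predicate that tests whether the routing relation between the two nodes is empty; the loop itself is unchanged. Because the loop state is just the remaining node set and the best kernel recorded so far, the procedure can be resumed from that state when further faults are detected.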
An implementation of this heuristic running on the iPSC1/d7 Cubes [23] provides the principal tool for our simulation studies of the effectiveness of this approach to fault-tolerant routing in multicomputer networks.