Reliability
4.4 Simulation Experiments and Results
Oftentimes, the computation of the cost estimate, $h(n_j)$, can be rather expensive; in such cases, a simpler estimate, $h'(n_j)$, which counts the number of other survived nodes $n_i$ for which $R_{ij} = 0$, may be used instead. The simpler estimate, $h'(n_j)$, shares with the more accurate estimate, $h(n_j)$, the property that at each stage of the computation, all nodes in the communication kernel have their respective cost estimates equal to zero. Thus, computing the suggested cost estimates after each elimination also simultaneously computes the desired new kernel configuration. Furthermore, since the elimination procedure evolves sequentially until it is suspended when the desired maximum kernel has been found, it is possible to resume and continue the elimination procedure when new faulty nodes or channels are identified. In other words, it is conceivable that we can run the elimination procedure as a background process in each node that serves as an exception handler, to be activated when additional faulty nodes or channels are detected.
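The role of the simpler estimate in the elimination procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study. It assumes a fixed 0/1 matrix `R` (a hypothetical representation) in which `R[i][j] == 1` means that node i can route to node j, and it assumes a greedy rule that discards the node with the largest estimate first, which the text does not specify. The names `simple_estimate` and `eliminate` are introduced here for illustration only.

```python
def simple_estimate(R, alive, j):
    """Simpler estimate for node j: the number of other surviving
    nodes i with R[i][j] == 0."""
    return sum(1 for i in alive if i != j and R[i][j] == 0)


def eliminate(R, alive):
    """Discard nodes until every surviving estimate is zero.

    Assumption (not from the source): the node with the largest
    estimate is discarded first; ties go to the smallest index.
    """
    alive = set(alive)
    while alive:
        costs = {j: simple_estimate(R, alive, j) for j in alive}
        worst = max(sorted(alive), key=lambda j: costs[j])
        if costs[worst] == 0:
            break  # all estimates are zero: the survivors form the kernel
        alive.remove(worst)
    return alive


# Node 3 cannot be reached from nodes 0 and 1, so it is discarded.
R = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
kernel = eliminate(R, range(4))
```

Because the loop stops exactly when all estimates are zero, calling `eliminate` again on the returned set with an updated `R` continues the procedure from where it left off, which mirrors the resumption described above. The sketch ignores any change to the routing relation caused by the eliminations themselves.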
An implementation of this heuristic running on the iPSC1/d7 Cubes [23] provides the principal tool for our simulation studies of the effectiveness of this approach to fault-tolerant routing in multicomputer networks.
Notice that even in a perfect scenario, the yield would be identical to the fraction of survived nodes, which is always less than 1 in any faulty network. In a crude sense, it represents a first-order measure of the degradation in computing power due to the ensuing faults.
Specifically, our goals in performing these simulation studies are:
1. To collect the empirical statistics on the effectiveness of our approach for providing fault-tolerant routing to a popular class of networks that has practical importance.
2. To quantitatively compare the tradeoff of arity versus dimension under the random- and independent-fault assumption.
3. To assess and interpret the empirical behavior of these tradeoffs in order to provide insights into suggesting means for improvement.
To help differentiate the impact of node failures and channel failures on the effectiveness of our approach, we simulated separately the two distinct cases of having purely node faults and having purely channel faults. The faults are generated independently using identical probabilities chosen over a range from one to ten percent. Figures 4.4 to 4.9 plot the simulation and computation results obtained for three different networks: the 2-ary-10-cube, the 4-ary-5-mesh, and the 32-ary-2-mesh. Notice that all three networks were chosen to have 1024 nodes, representing the range from high- through medium- to low-dimensional networks. In these figures, we have chosen scatter plots in order to convey pictorially the statistical distributions of the collected simulation results.
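The fault model described above can be sketched in a few lines. This is an illustrative sketch, not the simulator used in the study: the function names are hypothetical, and the networks are assumed to be meshes without wraparound channels. Under that assumption the channel count $n(k-1)k^{n-1}$ reproduces the figures of 5120, 3840, and 1984 channels quoted in the captions of Figures 4.4 to 4.9.

```python
import random


def mesh_channel_count(k, n):
    """Channels in a k-ary n-dimensional mesh without wraparound."""
    return n * (k - 1) * k ** (n - 1)


def sample_faults(num_items, p, rng):
    """Mark each item (node or channel) faulty independently with
    probability p and return the indices of the faulty items."""
    return [i for i in range(num_items) if rng.random() < p]


rng = random.Random(0)  # fixed seed so that a run can be repeated
for k, n in [(2, 10), (4, 5), (32, 2)]:
    nodes = k ** n
    channels = mesh_channel_count(k, n)
    faulty_nodes = sample_faults(nodes, 0.05, rng)        # purely node faults
    faulty_channels = sample_faults(channels, 0.05, rng)  # purely channel faults
    print(k, n, nodes, channels, len(faulty_nodes), len(faulty_channels))
```

Each case would be sampled separately in practice, so that a given trial contains either node faults or channel faults but not both.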
From these plots we notice the following general trends:
1. In all three networks, the yield of the purely channel failure case is worse than the corresponding yield of the purely node failure case of the same failure percentage.
That such should be the case is not straightforward, since each node that is faulty renders all its incident channels faulty. In fact, for our chosen range of fault probability, under the same percentage of faults, the total number of channels that are rendered faulty due to purely random node faults would be approximately twice that of the corresponding number due to purely random channel faults.
Apparently, the very dispersive occurrences of pure channel faults are much more effective in destroying routes in these survived networks.
2. The yield obtained under both the random node faults and random channel faults decreases as we decrease the dimension of the networks. This fact is consistent with our intuition, since the average number of distinct paths between source and destination is a combinatorial function that increases steeply with the dimension of the networks. In fact, the differences are so great that, from the empirical figures, one is justified in concluding that the binary-n-cube is extremely robust under our adaptive scheme, whereas the 2D mesh, at least in its present form, is not.
3. The fitted curve is a much better representation of the actual statistical distribution of simulation results in the higher-dimensional networks than the lower-dimensional ones. The lower-dimensional networks have distributions that are highly dispersive. Again, this fact can be understood statistically from the combinatorial disparity in the number of paths across networks with different dimensions. A larger number of distinct paths implies a better convergence to the mean value according to the law of large numbers.
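The factor of two mentioned under the first trend can be checked with a short calculation. Assume, as in the simulations, that each node fails independently with probability $p$, and that a channel becomes unusable when either of its two endpoint nodes is faulty. The symbols $p$ and $C$ (the total number of channels) are introduced here for illustration.

```latex
\Pr[\text{channel lost under node faults}] = 1 - (1 - p)^2 = 2p - p^2,
\qquad
\frac{C\,(2p - p^2)}{C\,p} = 2 - p .
```

For $p$ between 0.01 and 0.10 the ratio lies between 1.99 and 1.90, so the expected number of channels lost to node faults is roughly twice the expected number lost to channel faults at the same fault percentage, independently of the topology.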
While the above empirical results indicate that the binary-n-cube network is much more robust than the 2D mesh connection, this advantage is offset by the excessive wire bisection required to connect this topology. A lucid explanation of the wire bisection argument can be found in [13]. Perhaps most importantly, low-dimensional networks such as 2D or 3D meshes or tori are practical in that they are readily realizable. In particular, 2D structures are very desirable since they are much cheaper to build. Furthermore, they leave the third physical dimension behind; this is extremely convenient for service and maintenance purposes that are particularly important in large networks.
However, the above empirical results for the 2D mesh connection are rather disappointing. It is therefore desirable to look for methods to improve the yield while remaining in 2D. Such a network will be presented in the next section. To proceed, let us first examine more closely the reason behind the poor performance of the 2D rectilinear mesh.
Figure 4.12 shows a typical kernel obtained in a 2D mesh network when there are only a few faults. In this figure, the solid boxes denote reclaimed nodes, the blank boxes denote switching nodes, the broken boxes denote discarded nodes, and the missing nodes and channels are faulty. Observe that, except for the possibility of discarding whole corner regions, each faulty channel has to be accommodated by eliminating either an entire row or an entire column of nodes, whereas a node fault causes both an entire row and an entire column to be eliminated. It is straightforward to see that it does not take many faulty nodes or channels before the mesh completely collapses. The problem exists because, although our adaptive routing scheme allows us to exploit multiple alternate routes if such exist, the routing relations defined by the city block metric leave many source-destination pairs with only a unique route between them. In particular, any pair of nodes that lies on the same row or column is connected via a single route defined under the city block metric. In essence, the set of routes defined by the routing relation for the 2D mesh topology simply is not rich enough in connectivity. It is clear that in order to improve the reliability, such bottlenecks must be removed.
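The lack of alternate routes along rows and columns can be made concrete by counting routes. The sketch below assumes that the routes permitted under the city block metric are the shortest rectilinear paths in a fault-free network; the function names are hypothetical. It also counts shortest paths in the binary n-cube for comparison.

```python
from math import comb, factorial


def mesh_minimal_paths(dx, dy):
    """Shortest rectilinear paths between two nodes of a fault-free
    2D mesh that are dx columns and dy rows apart."""
    return comb(dx + dy, dx)


def cube_minimal_paths(d):
    """Shortest paths between two nodes of a fault-free binary n-cube
    that differ in d address bits."""
    return factorial(d)


same_row = mesh_minimal_paths(5, 0)  # nodes on the same row: one route
diagonal = mesh_minimal_paths(3, 3)  # offsets of 3 and 3: 20 routes
cube = cube_minimal_paths(5)         # Hamming distance 5: 120 routes
```

A pair on the same row or column has exactly one such route, so a single fault on it disconnects the pair under this routing relation, whereas a pair at Hamming distance 5 in the cube has 120 shortest paths to choose from.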
Figure 4.4: Binary-10-Cube with Node Faults (1024 nodes, 5120 channels; yield versus percentage of node faults)
Figure 4.5: Binary-10-Cube with Channel Faults (1024 nodes, 5120 channels; yield versus percentage of channel faults)
Figure 4.6: 4-Ary-5-Mesh with Node Faults (1024 nodes, 3840 channels; yield versus percentage of node faults)
Figure 4.7: 4-Ary-5-Mesh with Channel Faults (1024 nodes, 3840 channels; yield versus percentage of channel faults)
Figure 4.8: 32 x 32 Rectilinear Mesh with Node Faults (1024 nodes, 1984 channels; yield versus percentage of node faults)
Figure 4.9: 32 x 32 Rectilinear Mesh with Channel Faults (1024 nodes, 1984 channels; yield versus percentage of channel faults)
Figure 4.10: A Comparison of Yield with Node Faults (1024 nodes; yield versus percentage of node faults)
Figure 4.11: A Comparison of Yield with Channel Faults (1024 nodes; yield versus percentage of channel faults)
Figure 4.12: A Typical Kernel for the 2D Mesh Network
Figure 4.13: The Octagonal Mesh Network