3.5 The Simulation Results
3.5.1 Single-Packet Messages
The relevant statistics for simulation runs of single-packet message traffic are shown in Figures 3.7 to 3.14. Figures 3.7 and 3.8 plot the latency-versus-throughput tradeoff curves for the 16 x 16 2D torus and the 8 x 8 x 8 3D torus, together with the curves predicted by our theoretical model. Starting at ~10% throughput, the latencies in both networks are very close to their theoretical lower bounds, i.e., 32 + 2(16/4) ≈ 40 cycles for the 2D torus and 32 + 3(8/4) ≈ 38 cycles for the 3D torus. Both curves stay relatively flat until the throughput increases to ~70%, at which point the latencies climb rapidly and approach values comparable to those expected under store-and-forward switching. This behavior is characteristic of the virtual cut-through switching technique: the message latency is close to that of circuit switching at low traffic density, and approaches that of store-and-forward switching at high traffic density [25]. As we shall see later, one of the main differences between the oblivious wormhole scheme and our adaptive cut-through scheme lies in the location of these transition points. Observe also that the match between the theoretical and experimental curves is closer for the 3D torus than for the corresponding 2D torus. We conjecture that this is due to the larger number of channels per node in the 3D torus, which makes the Poisson arrival assumption more accurate.
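For reference, the zero-load figures above follow from adding the packet transmission time to the mean minimal-path hop count of a k-ary n-cube; a sketch of the arithmetic, assuming 32-cycle packets, one cycle per hop, and the standard average distance of k/4 per dimension (the symbols T_0 and L_p are introduced here for illustration only):
\[
T_0 \approx L_p + n\,\frac{k}{4}, \qquad
32 + 2\cdot\frac{16}{4} = 40 \ \text{(2D torus)}, \qquad
32 + 3\cdot\frac{8}{4} = 38 \ \text{(3D torus)}.
\]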
Figures 3.9 and 3.10 plot the corresponding latency-versus-throughput tradeoff curves for the 2D and 3D mesh networks, showing results for both oblivious wormhole routing and our adaptive cut-through scheme. Oblivious wormhole routing differs from cut-through switching in that it treats the routing paths as data pipelines joining the source and destination nodes, allowing messages to ripple through. Messages trying to access channels currently in use are blocked and, in turn, block other messages behind them in the pipelines [11]. Again, the tradeoff curves exhibit the characteristic shape of virtual cut-through switching described above.
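To make the distinction concrete, the following minimal sketch (not the simulator's actual code; the names Switching, Packet, Node, and on_output_busy are illustrative) contrasts what a router does with a packet whose requested output channel is busy: wormhole routing stalls the flits in place and keeps holding the channels already acquired along the path, whereas virtual cut-through absorbs the whole packet into a local buffer and releases the upstream channels.

from dataclasses import dataclass, field
from enum import Enum, auto

class Switching(Enum):
    WORMHOLE = auto()      # oblivious wormhole routing
    CUT_THROUGH = auto()   # virtual cut-through

@dataclass
class Packet:
    held_channels: list = field(default_factory=list)  # channels acquired so far
    stalled: bool = False

@dataclass
class Node:
    buffers: list = field(default_factory=list)        # local packet buffers

def on_output_busy(mode: Switching, pkt: Packet, node: Node) -> None:
    """Illustrative handling of a packet whose requested output channel is busy."""
    if mode is Switching.WORMHOLE:
        # Flits stall in place: every channel already acquired along the path
        # stays held, so packets behind it in those pipelines are blocked too.
        pkt.stalled = True
    else:
        # Virtual cut-through: absorb the whole packet into a local buffer
        # and release the upstream channels so other traffic can use them.
        node.buffers.append(pkt)
        pkt.held_channels.clear()

# Tiny illustration: the same blocked packet keeps its channels under wormhole,
# but frees them (and occupies a buffer) under cut-through.
p, n = Packet(held_channels=["c0", "c1"]), Node()
on_output_busy(Switching.CUT_THROUGH, p, n)
assert p.held_channels == [] and len(n.buffers) == 1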
For oblivious wormhole routing, the transition points lie at ~30 to 40% of normalized throughput, with the maxima never exceeding 50% and 40% for the 2D and 3D mesh, respectively. Figures 3.11 and 3.12 plot the relationships between sustained throughput and applied load for all the relevant networks and routing schemes. Observe that for the oblivious schemes, the sustained throughputs remain stable at their respective maxima even after the applied load exceeds the capacity that can be handled; the network latencies likewise remain stable at their respective maximum values. Apparently, the blocking that occurs at those traffic densities is sufficient to throttle further congestion created by excessive injection. On the other hand, the figures indicate that our adaptive cut-through switching can sustain as much as 85% normalized throughput, given a total of 15 packet buffers per node, which is the configuration used in our simulations. Again, the transition points in the latency curves occur at about 70% normalized throughput.
While the average network latencies of oblivious wormhole routing remain stable for applied loads that exceed their sustained throughput capacities, the corresponding source queueing times increase without bound. Figures 3.13 and 3.14 plot the relationships between source queueing time and applied load for all the relevant networks and routing schemes. Observe that under adaptive routing, the torus-network curves resemble those predicted by the Pollaczek-Khinchin formula for M/D/1 queues. These facts are consistent with the memoryless message-generation distribution used in our simulation experiments.
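For reference, the Pollaczek-Khinchin result specialized to an M/D/1 queue gives the mean waiting time in terms of the channel utilization and the fixed packet service time; a sketch, with the symbols ρ and T introduced here for illustration:
\[
W_{M/D/1} \;=\; \frac{\rho\,T}{2\,(1-\rho)},
\]
which remains nearly flat at low utilization and grows without bound as ρ approaches 1, matching the shape of the adaptive-routing curves in Figures 3.13 and 3.14.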
Furthermore, we observe that the curve for the 3D torus is much steeper than that for the 2D torus. That this should be the case can be understood if we recall that the maximum steady-state injection rate of the 3D torus is 1, whereas it is 1/2 for the 2D torus; hence, for the 2D torus, the average utilization of the internal channel never exceeds one-half of its capacity. Another interesting point to observe in these two figures is that the average source queueing time for the mesh networks is lower than that for the corresponding torus networks for all normalized applied loads up to ~75%. This may seem like a contradiction at first, as we expect the congestion created at the center of the mesh to result in a much higher average source queueing time than that obtained under the balanced traffic generated over the torus. However, only those nodes at the center of the mesh experience this congestion, whereas the majority of nodes, at the periphery, operate with channel utilizations much less than 1.
Hence, there is little or no interference with the traffic carried by the internal channel at these nodes, which are injecting at a rate that is half of the corresponding normalized value for the torus. It is only after the congestion area begins to grow in response to increasing traffic density that the overall average source queueing time begins to climb rapidly and far exceeds that of the torus networks. In this regard, if a mesh is expected to operate under very heavy traffic density, fairness-guarantee schemes (such as those presented in Chapter 2) should be employed. However, in such cases the figures indicate that the torus connections are superior in terms of flow control and fair access to the network, and the torus may therefore prove to be a much better topology than the mesh. On the other hand, we might adopt the point of view that a network should never be driven to support a traffic density beyond the transition point of its latency curve, in which case the torus has no clear advantage over the mesh.
In fact, under a constant bisection-capacity assumption, the channel width of a mesh is twice that of a torus, which more than compensates for the longer average distance messages must travel.
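A rough accounting of this tradeoff, assuming the usual figures of k/3 versus k/4 for the average per-dimension distance in a mesh and a torus, respectively (values supplied here for illustration, not taken from the chapter's own derivation), runs as follows:
\[
\frac{w_{\text{mesh}}}{w_{\text{torus}}} = 2, \qquad
\frac{\bar{d}_{\text{mesh}}}{\bar{d}_{\text{torus}}} \approx \frac{k/3}{k/4} = \frac{4}{3} < 2,
\]
so the doubling of the channel width, which halves the serialization time of a message, outweighs the roughly 4/3 increase in average hop count.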
[Plot: latency (cycles) versus normalized throughput; 2D torus, 256 nodes; simulation results with model-prediction curve.]
Figure 3.7: Single-Packet Message Latency of 2D Torus
[Plot: latency (cycles) versus normalized throughput; 3D torus, 512 nodes.]
Figure 3.8: Single-Packet Message Latency of 3D Torus
[Plot: latency (cycles) versus normalized throughput; 2D mesh, 256 nodes.]
Figure 3.9: Single-Packet Message Latency of 2D Mesh
[Plot: latency (cycles) versus normalized throughput; 3D mesh, 512 nodes.]
Figure 3.10: Single-Packet Message Latency of 3D Mesh
[Plot: throughput versus normalized applied load; 2D networks, 256 nodes.]
Figure 3.11: Single-Packet Message Throughput for 2D Networks
[Plot: throughput versus normalized applied load; 3D networks, 512 nodes; curves labeled by scheme and topology (e.g., Oblivious, Mesh).]
Figure 3.12: Single-Packet Message Throughput for 3D Networks
[Plot: source queueing time versus normalized applied load; 2D networks, 256 nodes; curves labeled by scheme and topology (e.g., Oblivious, Mesh; Adaptive, Torus).]
Figure 3.13: Single-Packet Message Source-Queueing Time for 2D Networks
[Plot: source queueing time versus normalized applied load; 3D networks, 512 nodes; curves labeled by scheme and topology (e.g., Oblivious, Mesh).]
Figure 3.14: Single-Packet Message Source-Queueing Time for 3D Networks