In the last three chapters, we have examined a variety of theoretical issues that are fundamental to performing adaptive cut-through routing in multicomputer networks. In this chapter, we proceed to examine a number of realization issues concerning a practical implementation for achieving high-speed routing. There are a number of hidden design and implementation problems, as well as some partial solutions in realizing the adaptive-routing approach developed in this thesis. Our objective here is to highlight the various major architectural issues that must be addressed in any practical implementation of the proposed framework, rather than to propose a complete final architectural design. In fact, from the experience in the development and refinement of the oblivious wormhole router, it is clear that many iterations will be required before a highly competitive design of the adaptive router will emerge. Rather, what we are going to describe here summarizes what has been learned, and is intended to serve as a starting point for any serious attempt to implement the adaptive router. This chapter differs from the previous ones in that it raises many more questions than it gives answers; the hope is that these will provoke new ideas.
Specifically, in section 5.1 of this chapter, we shall take a second look at the packet delivery guarantee issue, this time with the emphasis on implementation rather than on theoretical feasibility. We shall argue for a more practical approach based on congestion control, a feedback mechanism that is an extension of the injection-fairness-guarantee protocol described in section 2.5. In section 5.2, we shall examine the header-encoding issues in detail, and shall propose possible encoding formats for both the rectilinear and the octagonal meshes; these have a number of desirable properties. This is followed in section 5.3 with a discussion of the buffering issues involved in the storage management of variable-length packets, and a possible solution is presented that can be viewed as a starting point for further investigation. In section 5.4, the various problems involved in performing fast adaptive control will be discussed. A possible pipelined control scheme is examined that provides the context for the discussion of the various decision processes. An observation that may lead to certain additional speedup opportunities is presented, with the hope that it will stimulate further thought. In section 5.5, we shall describe a handshake protocol between neighbors that maintains a set of consistent clock signals at each node in a multicomputer network. This allows the use of synchronous logic design techniques, and yet still is arbitrarily extensible. Finally, in section 5.6, we summarize the chapter.
5.1 Congestion Control
The first realization issue we shall discuss concerns the practical assurance of packet delivery. Recall from our feasibility study in Chapter 2 that it is theoretically possible to guarantee delivery of every individual packet in transit inside the network. The scheme calls for the assignment of a priority to each packet, and resolves channel-access conflicts according to the specified ordering. A priority assignment provides information to allow the network to pick consistent winners in these conflicts, so that these winners will eventually be delivered. A direct implementation of the suggested priority scheme, however, is rather unlikely as there are several difficult issues yet to overcome. For example, the entire priority field, consisting of the packet's distance from destination and its age, is likely to occupy at least a few flits; hence, a number of cycles will pass before its priority can even be examined, which is somewhat undesirable in fine-grain machines (see section 5.2 for more detail on this). In addition, the requirement to dynamically and consistently increment the age parameter is another nontrivial task.
However, perhaps the most challenging of all is the priority-discrimination task. With packets dynamically arriving and leaving, resolution of packet priorities will require the maintenance of a dynamically changing priority queue, where each comparison is performed on a multi-flit data field. Then, on top of all these are the necessary tasks of performing profitable channel assignments, and occasionally misrouting to avoid buffer overflow, both of which are quite complex even without any requirement for following the priority ordering. Clearly, a practical hardware implementation will have to employ something quite a bit simpler.
It is interesting to observe that the congestion-control mechanism employed in the experiments described in section 3.5.4 actually suggests a possible practical alternative that is readily implementable on silicon. In particular, an important property of the congestion-control protocol described there is that it helps to minimize, if not completely eliminate, misrouting. As we recall, by allowing misrouting we destroy monotonicity in packet-to-destination distance reduction; this raises the whole issue of guaranteeing delivery. In this sense, the congestion-control protocol described there employs a negative feedback approach to confine the network operating points to regions where misroutings are generally rare; hence, the monotonicity in distance reduction is approximately restored. The experimental data obtained in those simulations presented strong circumstantial evidence that this negative feedback technique is generally successful.
Another main advantage of employing the congestion-control mechanism is that it also assures approximate fairness in network access. The assurance is no longer absolute because one no longer has absolute assurance of packet delivery, which is fundamental to establishing an absolute fairness guarantee. In other words, the congestion-control protocol is one mechanism that attempts to serve two masters, i.e., guaranteeing packet delivery and packet injection, but that achieves both objectives only approximately.
Whether it is possible to further modify the protocol such that one can restore the absolute guarantee remains an open question.
Implementation of the congestion-control protocol is rather straightforward. Following the description in section 2.5.3, each channel needs one extra wire for passing the injection information required to carry out the protocol. Depending on whether or not a node has advanced its injection count, a 1 or a 0 will be passed to its neighbors in each cycle, according to the protocol specification. Whether the node will be allowed to inject or not in the next cycle depends on the signals received from its neighbors, and on whether misrouting has occurred in this cycle. It is clear that a few simple logic gates are sufficient to generate the required injection-control signals.
In our simulation experiments described in section 3.5.4, channel assignments are performed in a first-come first-served manner; this is a fair policy for a dynamically changing set of packets. Sometimes the maintenance of the FCFS ordering may itself be rather expensive. For example, as packets come and go, different buffer locations may be randomly assigned in such a way that a packet's arrival order cannot be directly inferred from the assigned buffer location. In such cases, the simpler round-robin assignment ordering, which is another fair policy, can be used to provide an excellent alternative.
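The round-robin alternative mentioned above can be illustrated with a minimal software sketch. This is not the router's actual assignment logic: the class name `RoundRobinArbiter`, the method `grant`, and the idea of indexing requests by buffer slot are assumptions introduced here purely for illustration. The point it shows is that the only state needed is a single rotating pointer, so no record of packet arrival order has to be maintained.

```python
from typing import Optional, Sequence


class RoundRobinArbiter:
    """Hypothetical round-robin arbiter over a fixed set of buffer slots.

    The only state is `next_slot`, the slot that holds the highest
    priority in the coming cycle. No arrival-order bookkeeping is kept.
    """

    def __init__(self, num_slots: int) -> None:
        if num_slots <= 0:
            raise ValueError("num_slots must be positive")
        self.num_slots = num_slots
        self.next_slot = 0

    def grant(self, requests: Sequence[bool]) -> Optional[int]:
        """Return the index of the granted slot, or None if nobody requests."""
        if len(requests) != self.num_slots:
            raise ValueError("one request flag is expected per slot")
        # Scan the slots cyclically, starting at the current priority pointer.
        for offset in range(self.num_slots):
            slot = (self.next_slot + offset) % self.num_slots
            if requests[slot]:
                # The winner becomes the lowest-priority slot next time.
                self.next_slot = (slot + 1) % self.num_slots
                return slot
        return None


# Slots 0, 2 and 3 hold packets that keep requesting the same channel.
arbiter = RoundRobinArbiter(4)
pending = [True, False, True, True]
grants = [arbiter.grant(pending) for _ in range(4)]  # [0, 2, 3, 0]
```

Because the pointer always moves just past the most recent winner, a slot that keeps requesting is served within at most `num_slots` grants, which is the sense in which the policy is fair even though it ignores arrival order.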
5.2 Header Encoding
The second problem we shall discuss concerns the format of the header of a packet.
A main advantage of wormhole routing is its extremely low message latency under relatively light network traffic conditions. This is made possible by having a node forward an incoming packet as soon as enough header information has been accumulated to determine the correct output channel. Performance considerations and the desire for hardware simplicity both suggest that we choose a header format that provides all relevant routing information in the first flit of the packet. In an oblivious scheme, where routing proceeds in dimension order so as to avoid deadlock, it is natural to pack the delta values of the corresponding dimensions in that same order. For example, for the rectilinear meshes, we require the following information: ΔX, ΔY for the 2D meshes, and also ΔZ for the 3D meshes. Suppose the oblivious scheme routes a message first along the X dimension, reducing ΔX to zero; then along the Y dimension, reducing ΔY to zero; and finally along the Z dimension, reducing ΔZ to zero; the header will be organized with ΔX in front, followed by ΔY, and then ΔZ. The foremost delta value is decremented by one every time the message header is forwarded one step toward its destination, and when a zero has been detected, the message is switched to the next dimension. Under such a scheme, the message latency, T, under light network traffic conditions, can be expressed as:
T ≈ pD + L,
where p is the delay per stage required to perform the incremental update, D is the distance the message has to travel, and L is the length of the message. To facilitate the zero-detection and the decrement-by-one operations, a coding scheme that explicitly encodes the notion of a leading-zero was developed [14] that allows a message to ripple through an intermediate node with a delay of only two clock ticks, i.e., p = 2.
In contrast, for an adaptive scheme that attempts to take advantage of possible alternate routes, it is essential for optimal routing decisions to have the direction information of every dimension simultaneously available. In particular, the presence of such direction information in the first flit of the header is extremely desirable in order to allow the assignment logic to make immediate direct cut-through routing decisions. One possible scheme is to have the delta value corresponding to each dimension be present explicitly in parallel, starting from the first flit of the header. In addition, we shall need to perform bit-serial increment- and decrement-by-one operations on the encoded numbers. It is clear that for the adaptive scheme to stay competitive with the oblivious scheme, the delay per stage, p, must be kept to a minimum. This is especially true in a fine-grain machine, where the average message distance, D, tends to be large, and the average message length, L, tends to be small. In such cases, pD will become the dominant component of T; hence, it is essential to keep p as small as possible. In the following, we describe an encoding scheme for the adaptive router with a delay per stage which matches that of the oblivious scheme, i.e., with p = 2.
5.2.1 Rectilinear Mesh
In this subsection, we shall describe a coding scheme for the rectilinear meshes, assuming the routing relations are defined according to the reduction of a message's L1-metric from destination. To help simplify the subsequent exposition, we shall first describe the encoding of the delta value for a single dimension. To proceed, we observe that having the sign of the delta displacement value is necessary and sufficient to determine the profitable direction for that dimension. In particular, for each dimension, an encoding of its sign must be able to distinguish between three possible alternatives: positive-, negative- or zero-delta displacement. This consideration naturally suggests the use of a sign-and-magnitude encoding. Furthermore, our need to perform bit-serial arithmetic on the numbers also suggests the use of a least-significant-bit-first arrangement. As in the case of the oblivious encoding, let us try to use an alphabet with 3 symbols,