Section 9.2 Overhead and Latency

Along path B, we encounter a series of multicomputers with progressively smaller nodes. Those with single-board nodes are called the medium-grain multicomputers; examples of medium-grain multicomputers are the Cosmic Cube, the iPSC/1, the iPSC/2, and the Symult 2010. Those with single-chip nodes are called the fine-grain multicomputers; an example of a fine-grain multicomputer is the Mosaic. Due to the reduced node cost when nodes become smaller and more abundant, the programming emphasis for a multicomputer shifts from one of achieving a linear speedup to one of exploiting the maximum concurrency.

Since medium-grain nodes are few and expensive, the primary goal of programming such multicomputers is to profitably utilize all available CPU cycles. Cycles can be lost to sources in the application itself: load imbalance, extra synchronization, and insufficient concurrency; these internal delays are called overheads. Cycles can also be lost to sources in the system: message handling, kernel operation, and network congestion; these external delays are called latencies. In a medium-grain multicomputer, overheads and latencies are countered by employing at least several times more concurrency in the program than there are nodes in the multicomputer. The weak law of large numbers, together with the clustering of related elements, covers most of the problems. Nodes are seldom idle because the chance that all of their elements are blocked is low. The cost of message transactions is low because clustering causes most of the interactions to take place between elements of the same node.
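To make the weak-law argument concrete, the following minimal sketch (not from the thesis; the per-element blocking probability and the element counts are illustrative assumptions) computes the chance that a node goes idle when each of its elements is independently blocked with probability p:

```python
# Illustrative sketch: a node idles only when all of its elements are
# blocked at once. With e independent elements, that probability is p**e.

def p_node_idle(p_blocked: float, elements_per_node: int) -> float:
    """Probability that every element on a node is blocked simultaneously."""
    return p_blocked ** elements_per_node

for e in (1, 2, 4, 8):
    print(f"{e:2d} elements/node -> idle with probability "
          f"{p_node_idle(0.5, e):.4f}")   # assumes p = 0.5 per element
```

Even with each element blocked half the time under these assumptions, eight elements per node leave the node idle less than half a percent of the time, which is why several-fold excess concurrency suffices on a medium-grain machine.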

To exploit more concurrency, we must use more nodes in the multicomputer and fewer program elements in each node. Although we can no longer overwhelm overheads and latencies by an abundance of concurrency, we no longer have to be obsessed with linear speedup, because nodes become cheaper as they decrease in size. Instead, programming for fine-grain multicomputers emphasizes the exploitation of all available concurrency in the program. Factors that prevent the exploitation of available concurrency are distinguished from factors that merely require the use of more nodes.

Latencies are factors that can prevent the full exploitation of concurrency. For example, when a message is delayed en route to a waiting element, the element is blocked and the program may not progress as fast as it could. Overheads, on the other hand, do not prevent the full exploitation of concurrency. When an element is blocked waiting for a message that has not been produced, it is blocked only because the program has less concurrency than there are nodes. Synchronization operations, such as the use of null events in the conservative discrete-event simulators, are also overheads: they keep more of the nodes busy without interfering with the exploitation of concurrency in the system being simulated.

An element with unconsumed normal events may still be blocked awaiting a null event. If the required null event has been produced and sent, we would attribute the blockage to message latency; if the null event has not been produced, then we would attribute the blockage to lack of concurrency.
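The attribution rule in the preceding paragraph can be stated mechanically. The sketch below is a hedged illustration, not the simulator's actual code; the Channel record and its fields are assumed names:

```python
# Hedged sketch: classifying why an element is blocked on an input channel.
# A channel here records both the events delivered so far and whether the
# producer has already emitted the event the consumer is waiting for.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class Channel:
    arrived: deque = field(default_factory=deque)  # events delivered so far
    producer_has_sent: bool = False                # produced and in transit?

def classify_blockage(ch: Channel) -> str:
    """Attribute a wait on an empty input channel, per Section 9.2."""
    if ch.arrived:
        return "not blocked"            # an event (null or normal) is consumable
    if ch.producer_has_sent:
        return "message latency"        # produced and sent, but still en route
    return "lack of concurrency"        # the producer has not yet advanced
```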

Section 9.3 Fine-Grain Multicomputer Programming

To fully exploit the concurrency of a program, we must remove all latencies and overheads.

Overheads can be mitigated by putting one program element in each node, but latencies can only be reduced by careful hardware and software design.

On the hardware side, message latency can be reduced with high-speed routers. These routers move messages in the network via a modified form of circuit switching called wormhole or cut-through routing, which moves a message one step through the network in a time comparable to one memory cycle. Since a router is able to store and fetch messages at a rate close to the bandwidth of the memory, sending a message from one node to any other node is comparable to copying the same message from one buffer to another buffer.
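A standard first-order latency model (the parameter values below are illustrative assumptions, not measurements of any machine mentioned here) shows why cut-through routing makes distance nearly irrelevant:

```python
# Store-and-forward receives and retransmits the whole message at each hop;
# cut-through (wormhole) pipelines the message behind its header, so only
# the header pays the per-hop routing delay.

def store_and_forward(length, bandwidth, hops):
    return hops * (length / bandwidth)

def cut_through(length, bandwidth, hops, hop_delay):
    return hops * hop_delay + length / bandwidth

length, bandwidth, hops, hop_delay = 64, 100e6, 10, 50e-9  # assumed values
print(f"store-and-forward: {store_and_forward(length, bandwidth, hops) * 1e6:.2f} us")
print(f"cut-through:       {cut_through(length, bandwidth, hops, hop_delay) * 1e6:.2f} us")
```

With these assumed figures, a ten-hop cut-through transfer costs little more than the single buffer-to-buffer copy described above, whereas store-and-forward pays the full copy time at every hop.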

On the software side, we must, without giving up generality, provide the thinnest cushion possible between the processes and the hardware. The Reactive Kernel and a fine-grain, light-weight programming environment, such as Reactive-C or Cantor, make an ideal combination because the program is never further than one function call away from the system.

The execution units for these programming environments, especially the more restricted ones like Cantor, are small enough that nearly all of the concurrency in the program can be exploited.

We have aimed in the direction of fine-grain multicomputers in all of our research, and our work on the discrete-event simulation is no exception. The CMB-variant simulator is ideally suited for fine-grain machines because it is written in a fine-grain notation, and is able to fully exploit the concurrency of the system it simulates. The simulator takes on a large overhead at N = 1, but this overhead does not prevent the simulation from attaining a large speedup at a large N. In many of the logic circuits we tested, near-linear speedup continues until there are only two or three elements in each node.

Since the CMB-variant simulator does not use any special techniques to reduce the overhead on a medium-grain multicomputer, the qualities that contribute to the performance characteristics of the simulator persist as the simulation becomes more distributed. The hybrid simulators were created to demonstrate the effect of those techniques. The overhead is reduced when N is small, but the effect of these techniques vanishes and the performance converges to that of the CMB-variant simulator when N is large.

Section 9.4 The Next Frontier

We have fully dispersed all available concurrency in a discrete-event simulation program when we put one element on each node. If there were more nodes in a multicomputer than elements in the simulation, we would not be able to utilize those leftover nodes. However, we can still change the program to one that contains more concurrency. In a medium-grain multicomputer, where it is necessary to use concurrency to overwhelm latencies and overheads, rollback simulators such as Time Warp seek to produce additional concurrency by computing on speculation.

The memory in each node of a fine-grain multicomputer is insufficient for storing the previous states of its element in a rollback simulator. However, when there are more nodes than elements, previous states can be stored on unused nodes. When an element has reached a synchronization point, where its future is to be decided by a message that has yet to arrive, the element picks a possible outcome and ships a copy of its old self to an unused node for storage. Alternatively, the element can make a copy of its new self, which it spawns and runs on an unused node. But rather than becoming dormant, the old self can continue to run and produce more copies until all possible outcomes have been exhausted. This is the concurrent branch-and-bound simulator; it is the next frontier to be explored when a fine-grain multicomputer becomes available.
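A minimal sketch of the branching mechanic just described follows. Everything in it (the Element class, spawn_on_free_node, the outcome labels) is an assumed illustration of the idea; the thesis proposes the simulator, but this code is not its implementation:

```python
# Hedged sketch: at a synchronization point with several possible outcomes,
# an element spawns one speculative copy per outcome on an unused node and
# keeps running until all outcomes are covered.

import copy

free_nodes = list(range(64, 72))          # assumed pool of unused node ids

def spawn_on_free_node(element):
    """Place a speculative copy on an unused node (simulated by a print)."""
    node = free_nodes.pop()
    print(f"{element.name}: speculating on '{element.assumed_outcome}' "
          f"at node {node}")

class Element:
    def __init__(self, name):
        self.name = name
        self.assumed_outcome = None       # fixed once a branch commits to it

    def branch(self, possible_outcomes):
        # The running element produces one copy per outcome; incorrect
        # branches would later be pruned when the real message arrives.
        for outcome in possible_outcomes:
            clone = copy.deepcopy(self)
            clone.assumed_outcome = outcome
            spawn_on_free_node(clone)

Element("gate-17").branch(["input=0", "input=1"])
```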
