To my friends from UC Davis, Glenn Saito and John Bakos, for their help during my college years.
Chapter 1 Introduction
[Figure: communication network]
A number of reactive-process programming languages have been developed in anticipation of fine-grained multicomputing. Reactive-process programming is useful for exposing the simplicity of programming systems with reactive processes, a level of simplicity that is necessary for any fine-grained-multicomputer programming system.
Chapter 2 Reactive-Process Programming
After receiving the message (Figure 2.4), the factorial process calculates the product of all integers within the closed interval [LO, HI]. The other unusual aspect of the Reactive-C primitives is that rc_spawn does not return the ID of the new process; therefore, the only direct way for a parent process to get the ID of the child is to receive the ID in a message sent by the child.
Chapter 3 Reactive-Process Layers
The non-blocking-receive layer (nb-layer)
When a process is interrupted by the timer, the interrupt service routine of the process calls a receive function to relinquish control. While a timer may still be required as a backup mechanism to stop runaway processes, the preferred method is to convert a non-reactive process into a reactive process by having the process periodically call a receive function during extended computations.
Chapter 4 Cosmic Environment
CE preserves the value of our work by making it easy to provide efficient implementations of its specification on many multicomputers that are otherwise incompatible with the software. The specification has been simplified wherever possible to shorten the manuals.
Section 4.2 Our Cosmic Environment Implementation
Structure of our CE implementation
In our implementation, the host system is built on top of the TCP/IP network. The node system is built on top of the multicomputer network and can include either a surrogate kernel on each node or a set of emulation routines for CE functions. The ifc process then converts the message to a TCP/IP message and sends it to the switch.
The cube daemon process passes the allocate and deallocate commands to the ifc process over the same connection. On the iPSC/1 and earlier versions of the iPSC/2, the CE node system is layered on top of their native system, the NX kernel. When we layer a CE node system on top of the native node kernel, the ifc process links to the host's native library for the multicomputer.
To the native system running underneath, the ifc process appears to be just an ordinary host process of the native system. The CE node system can operate within the bounds of the user-accessible functions of the native system. The end of a data structure must also be rounded up, by padding, to the alignment boundary of the largest data item in the structure.
Breaking a simulation into smaller slices
Since the set of equations above covers a chunk of time, let us simply refer to it as a slice. The operation of a composite simulator that moves one slice at a time can be described by the actions of its element simulators, if the state description over the interval (t, t+d) can be calculated for each arc in the system.
Since element simulators are assumed to be progressive, the simulator for element e will calculate S_e(t, t+d) from S_e(t) and S_inp(e)(t, t+d). If the delay is zero (Figure 5.12), the simulator front end contains no output state description beyond the start time of the slice (Figure 5.12(a)). The equation for S_a depends on the state of other arcs, and the simulator is unable to produce S_a(t, t+d) until it has received S_inp(e)(t, t+d) (Figure 5.12(b)).
A slice that contains no zero-delay circuits can be solved by simple substitution of variables; a slice that contains zero-delay circuits (called a mandatory slice) requires simultaneous solving of equations. A mandatory slice, however, can have three possible outcomes: no solution, one solution, or multiple solutions. All three outcomes can be found in the zero-delay cross-coupled XOR/NOR circuit in Figure 5.13.
Section 5.3 The Generic Simulator Model and Its Derivatives
The total volume of messages in the simulation is reduced without hindering the progress of the simulation. However, whether a message is needed depends on the state of the simulation, and it is often impossible to determine from local information alone. In a sequential simulator implementation, the array of tapes is replaced by a concatenated list of pending events.
The position of the read heads is kept in a single variable called the global clock. The simulator repeatedly sets the global clock to the time of the earliest event in the list, pulls that event from the list, and delivers it to the destination element. The event drawn from the top of the list never needs to be revisited, because the assumption of the element that posted it is now shown to be correct.
The sequence of events taken from the list represents the result of the simulation. [Figure: simulated time and the classified event list.] Global virtual time tries to track the minimum time over all events and elements. When it is encountered, and if the state assignment of the included elements is already self-consistent, the simulation can proceed.
Chapter 6 Logic-Circuit Simulator Experiments
Circuit topology vs. activity level
Since its simulation time is only weakly dependent on the content of these fragments, it is more strongly influenced by the static characteristics of the circuit connections, such as the degree of fan-out, than by the dynamic characteristics of the circuit operation, such as the number of events produced. [Figure: a 1376-gate multiplier simulated for 100µs on a Symult 2010, fast oscillation.] For example, if a circuit contains a cross-coupled latch, the delay of the gates in the latch determines the number and span of the fragments produced, and the number of fragments produced determines the simulation time of the CMB-variant simulator.
The number of times the latch is used determines the number of events generated in the latch, and the number of events generated determines the simulation time of a sequential simulator. We can expect the sequential simulator's performance to change to a greater extent than the CMB-variant simulator's if we run the simulation with the same multiplier circuit but a different activity level. Four times as many events are produced, and the time taken by the sequential simulator increases by a factor of 4.
However, the time taken by the CMB-variant simulators increases only by a factor of 2. Because fragments are more likely to carry transitions, the possibility of successive fragments merging into a single fragment is reduced. Because the elements in a macro element must be in the same address space, and because their operations must be interleaved, it is tempting to think that there may be a way to introduce sequential-simulator efficiency into the simulation of elements within a macro element. When N = 1, all logic gates would be in the same node, and the simulator would have the same performance as a sequential simulator.
Chapter 8 Additional Performance Results
In the FIFO shown in Figure 8.2, the output of element C is connected to the input of element C in the next unit, and the inverted output of element C is connected to the input of element C in the preceding unit. Unlike the multiplier example we used in previous chapters, the clock network shows a much larger difference between the most eager and the laziest versions.
Also unlike the multiplier example, load balancing is easy because a clock network has a stable and uniform level of activity in every part of the circuit. The performance at N = 1 and the linear speedup for most of the lazier CMB variants fit the sweep-mode prediction well. The result of the simulation can therefore vary significantly from run to run; but when N is small, the behavior is more constrained and the sweep-mode prediction prevails.
The hybrid-1 and hybrid-2 curves are similar to those of the multiplier circuit, except that these curves show higher speed because of the better load balancing of the clock network. The activity level for this multiplier is more uniform because a new wave of activity is injected into the multiplier before the old ones are finished. The hybrid-2 curves continue straight down, staying close to the CMB-variant curves. However, there are still significantly many invalid messages in the simulation, and the overall estimate of the sweep-mode simulation is off by between 2 and 3 octaves.
Chapter 9 Summary
Section 9.2 Overhead and Latency
Those with single-chip nodes are called fine-grained multicomputers; an example of a fine-grained multicomputer is the Mosaic. In a medium-grain multicomputer, overheads and latencies are mitigated by using at least several times more concurrency in the program than there are nodes in the multicomputer. To take advantage of more concurrency, we use more nodes in the multicomputer and fewer program elements in each node.
Instead, programming for fine-grained multicomputers emphasizes the utilization of all available concurrency in the program. Synchronization operations, such as the zero events used in the conservative discrete-event simulators, are also overhead: they keep several of the nodes busy without contributing to the utilization of concurrency in the system being simulated. These routers move messages through the network via a modified form of circuit switching, called wormhole or cut-through routing, which moves a message one step through the network in a time comparable to one memory cycle.
The execution units for these programming environments, especially the more restricted ones such as Cantor, are small enough that almost all the concurrency in the program can be exploited. We aimed in the direction of fine-grained multicomputers in all our research. If there were more nodes in the multicomputer than elements in the simulation, we would not be able to use the remaining nodes.
Chapter 10 Bibliography