
Chapter 2 The SpiNNaker Chip

2.2 Architecture

2.2.3 Router

Table 2.1. Memory map (continued).

Start      End        Actual size  Access  Function
F100_0000  F1FF_FFFF  96 KByte     Global  Router
F200_0000  F2FF_FFFF  516 Byte     Global  System peripherals
F300_0000  F3FF_FFFF  4 KByte      Global  Watchdog
F400_0000  F4FF_FFFF  48 KByte     Global  Ethernet
F500_0000  F5FF_FFFF  32 KByte     Global  Shared SRAM (unbuffered)
F600_0000  F6FF_FFFF  32 KByte     Global  Shared ROM
F700_0000  FEFF_FFFF  –            None    Bus error
FF00_0000  FFFF_FFFF  16 MByte     Global  Boot area
FFFF_0000  FFFF_0FFF  32 Byte      Global  Boot vectors
FFFF_F000  FFFF_FFFF  4 KByte      Local   Interrupt controller

It was also anticipated that in an expanded system, the soft error rate in the aggregate SDRAM would be non-negligible. The DMAC therefore includes a programmable Cyclic Redundancy Check (CRC) generator/checker that can append a CRC word when a transfer is written to SDRAM or verify a CRC when it is read.

Also contained within the DMAC, although not a DMA function, is a bus bridge that allows the ARM direct access to the SDRAM, although this form of access is not particularly efficient. A write buffering option is available to reduce the latency if desired.

The only peripheral of particular note is the communications controller. This provides bidirectional on-chip communication with the router. The input interconnection is blocking, so it is important to read arriving packets with low latency; the ARM’s Fast Interrupt Request (FIQ) is typically used for this. Failure to read packets will cause the appropriate network buffers to fill and, ultimately, stall the on-chip router. Similarly, the outgoing link is blocking but the back-pressure may partially rely on software checking availability.

to neighbouring chips. The links are largely identical, the only difference being that off-chip links (only) are notionally paired, so that there is a default output associated with each input, which is used in some cases if no other routeing information is found.

All router packets are short. They comprise an 8-bit header field, a 32-bit data field and an optional 32-bit payload. Much of the network is (partially) serialised, so omitting the payload when not required reduces the demand on bandwidth and saves some energy.
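The packet layout described above can be sketched as a small data structure. This is an illustrative model only: the class name `Packet` and its method are hypothetical, and the internal layout of the header field is not modelled because the text does not specify it here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Packet:
    """Hypothetical model of a router packet (names are assumptions)."""
    header: int                    # 8-bit header field
    data: int                      # 32-bit data field
    payload: Optional[int] = None  # optional 32-bit payload

    def __post_init__(self):
        # Reject values that do not fit the field widths given in the text.
        if not 0 <= self.header < (1 << 8):
            raise ValueError("header must fit in 8 bits")
        if not 0 <= self.data < (1 << 32):
            raise ValueError("data must fit in 32 bits")
        if self.payload is not None and not 0 <= self.payload < (1 << 32):
            raise ValueError("payload must fit in 32 bits")

    def size_bits(self) -> int:
        # 8 + 32 bits without a payload, 8 + 32 + 32 bits with one.
        return 40 if self.payload is None else 72


short = Packet(header=0x00, data=0x5A5A5A5A)
long_ = Packet(header=0x00, data=0x5A5A5A5A, payload=0x12345678)
```

Omitting the payload shortens the packet from 72 to 40 bits, which is where the bandwidth and energy saving on the serialised links comes from.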

There are four types of packet:

• Multicast (MC) packets are intended to support neural spike communications.

• Point-to-Point (P2P) packets are for chip-to-chip messages and are intended for machine management.

• Nearest Neighbour (NN) packets primarily support the machine boot and debugging functions.

• Fixed Route packets contain no key information and are always routed the same way: they can provide facilities such as carrying extra status data to a host.

Each of the packet types is separated and routed according to its particular rules.

The simplest are P2P packets that provide chip interconnection. A fully expanded SpiNNaker system is designed to have 2¹⁶ chips, so a 16-bit field in a P2P packet determines the destination chip. This is used as an index into a RAM table that specifies which output link to use for that packet. Each entry in the table is 3 bits long, which permits the selection of any of the six chip-to-chip links plus an internal option, used for when the packet has reached its destination chip; the routeing of all possible packets is therefore fully specified in this table.
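A minimal sketch of such a table follows. The numeric encoding of the 3-bit entries (0 to 5 for the links, 7 for the internal option) and the names `P2PTable`, `set_route` and `lookup` are assumptions made for illustration; the text only says that the six links and an internal option must be selectable.

```python
NUM_CHIPS = 1 << 16   # one entry per possible destination chip
NUM_LINKS = 6         # chip-to-chip links
INTERNAL = 7          # assumed code for "deliver on this chip"


class P2PTable:
    """Hypothetical model of the P2P routeing RAM."""

    def __init__(self):
        # Each entry is 3 bits wide; one byte per entry is used here for
        # simplicity. All entries default to link 0.
        self.entries = bytearray(NUM_CHIPS)

    def set_route(self, dest_chip: int, route: int) -> None:
        if not 0 <= dest_chip < NUM_CHIPS:
            raise ValueError("destination must be a 16-bit chip address")
        if not 0 <= route < 8:
            raise ValueError("route must fit in 3 bits")
        self.entries[dest_chip] = route

    def lookup(self, dest_chip: int) -> int:
        # The 16-bit destination field indexes the table directly.
        return self.entries[dest_chip & 0xFFFF]


table = P2PTable()
table.set_route(0x0102, 3)         # forward packets for chip 0x0102 on link 3
table.set_route(0x0000, INTERNAL)  # this chip is 0x0000: deliver internally
```

Because the table is indexed directly by the destination, no matching step is needed and every possible destination has exactly one entry.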

When the P2P packet reaches its destination chip, it has to be directed to a particular processor. All internal P2P packets are sent to a preselected processor subsystem, programmed into that router. The design intention is that this monitor subsystem will, at least primarily, manage the computer itself rather than run applications. It can forward messages to other systems if required in software, using the shared RAM on the chip.

MC packet routeing is rather more complicated. As previously mentioned, it is not feasible to store a complete routeing table for a billion neurons, so the neurons are grouped and only a subset of the groups need be recognised by any particular router. The first job is to recognise a packet (or not). This function is performed by a TCAM in which the packet key is compared with all the entries. Each table entry consists of a key and a mask. Within each entry, each bit is compared with the corresponding stored state, which can be:

Mask bit  Key bit  Function
0         0        Always match
0         1        Never match
1         0        Match if 0
1         1        Match if 1

Subsequently, all the bit matches are ANDed, and if the result is true, the entry is a ‘hit’. These combinations allow each entry to match with particular patterns of ‘0’s and ‘1’s in the key, disregarding some other bits. For example, an entry with key = 0x5a5a5a00 and mask = 0xffffff00 will match the 256 packet keys in the range [0x5a5a5a00, 0x5a5a5aff] as it ignores the 8 least-significant bits. Including a never match bit anywhere in the entry indicates that the entry is unused, as it will never produce a match.
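The per-bit rules in the table above can be written out directly. The sketch below is a software model of one entry only, not a description of the hardware; the function names are hypothetical and a 32-bit key width is assumed from the example values.

```python
KEY_BITS = 32  # assumed key width, taken from the 32-bit example keys


def bit_matches(mask_bit: int, key_bit: int, packet_bit: int) -> bool:
    """Apply one row of the mask/key table to a single bit position."""
    if mask_bit == 0:
        # (mask 0, key 0) always matches; (mask 0, key 1) never matches.
        return key_bit == 0
    # (mask 1, key 0) matches a 0; (mask 1, key 1) matches a 1.
    return packet_bit == key_bit


def entry_hits(key: int, mask: int, packet_key: int) -> bool:
    """AND together the per-bit results: the entry is a hit only if all match."""
    return all(
        bit_matches((mask >> i) & 1, (key >> i) & 1, (packet_key >> i) & 1)
        for i in range(KEY_BITS)
    )


# The example from the text: the low 8 bits are ignored.
hits = sum(
    entry_hits(0x5A5A5A00, 0xFFFFFF00, k)
    for k in range(0x5A5A5900, 0x5A5A5C00)
)
```

Here `hits` counts the keys in a window around the example range that the entry accepts. An entry such as key = 0x00000001, mask = 0x00000000 contains a never match bit and so cannot hit any packet key.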

The inclusion of don’t care fields means that it is possible to match multiple different TCAM entries quite legitimately. This is an exploitable feature since the matches are prioritised and the highest priority match is isolated for the subsequent stage. Placing more specific entries in higher priority positions can simulate having more entries than are physically present. For example, an entry with key = 0x5a5a5a5a and mask = 0xffffffff will match the single packet key 0x5a5a5a5a, which is part of the range matched by the entry listed in the previous paragraph. If the new entry is included in the table at a higher priority than the previous entry, it will make that entry only ever match the other 255 keys in the range. Matching a set of 255 packet keys would require a larger set of non-prioritised entries.
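The effect of prioritisation can be illustrated with a short sketch. It assumes that a lower table index means a higher priority, which the text does not state; the function name `first_hit` and the route labels are hypothetical.

```python
WORD = 0xFFFFFFFF


def first_hit(table, packet_key):
    """Return (index, route) of the highest-priority matching entry, or None.

    `table` is a list of (key, mask, route) tuples; earlier entries are
    assumed to have higher priority.
    """
    for index, (key, mask, route) in enumerate(table):
        if key & ~mask & WORD:
            continue  # entry contains a never match bit: unused
        if (packet_key & mask) == key:
            return index, route
    return None


table = [
    (0x5A5A5A5A, 0xFFFFFFFF, "special"),  # one specific key, higher priority
    (0x5A5A5A00, 0xFFFFFF00, "general"),  # the whole 256-key range
]

general_count = sum(
    1
    for k in range(0x5A5A5A00, 0x5A5A5B00)
    if first_hit(table, k) == (1, "general")
)
```

With both entries present, `general_count` is the number of keys in the range that still reach the general entry. Without priorities, covering a 256-key range minus one key in this example would take eight separate key/mask entries, one for each of the eight low bits that may differ from the excluded key.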

If a match has been made, the next step is to look up the output vector. This comprises a 24-bit word where each ‘1’ bit indicates that the packet should be copied onto that link. This facilitates the multicast operation.
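Decoding such a vector might look like the sketch below. The assignment of bit positions (bits 0 to 5 for the six inter-chip links, the remaining 18 bits for on-chip processors) is an assumption made for illustration and is not stated in this section.

```python
NUM_LINKS = 6      # inter-chip links (assumed to occupy bits 0-5)
VECTOR_BITS = 24   # width of the output vector given in the text


def decode_output_vector(vector: int):
    """Split a 24-bit output vector into link and processor targets."""
    if not 0 <= vector < (1 << VECTOR_BITS):
        raise ValueError("output vector must fit in 24 bits")
    set_bits = [bit for bit in range(VECTOR_BITS) if (vector >> bit) & 1]
    links = [bit for bit in set_bits if bit < NUM_LINKS]
    processors = [bit - NUM_LINKS for bit in set_bits if bit >= NUM_LINKS]
    return links, processors


# Bits 0, 1 and 6 set: copy to two links and one local processor.
links, processors = decode_output_vector(0b1000011)
```

A single packet with several bits set is therefore copied to several destinations in one step, which is how one incoming spike fans out to many targets.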

Fixed route packets are very simple to direct. Each router has a single, programmable register that says which output link(s) to use. They are really a special case of MC packets with a single, always matched key field and they require almost no additional hardware. They can be used for specific purposes, such as building network trees to funnel monitoring data back to host interfaces, but can only provide one such structure in any single configuration.

Unlike the other communication packets, NN packets can be routed before the network tables are initialised; their routeing is determined by the chip hardware and the network topology. They are provided:

• for boot purposes

• for local systems communication

• as a debugging aid.

For the first two purposes, packets are routed:

Source                       Destination
Any processor on this chip   One or all inter-chip links
Inter-chip link              Monitor processor on this chip

By convention, only the local monitor processor should originate such packets; just like the other packets, they carry a 32-bit data field with an optional 32-bit extra payload.

For debug purposes, a different type of NN packet is used. These are trapped by the router on the destination device, which becomes a master of the shared address space on that chip. This means that one chip can read and write some of the state of any neighbouring device. The convention adopted here was that only 32-bit words can be moved and the presence of a payload:

• in a request indicates a write request;

• in a response indicates a returned read value.

All the routeing units deliver packets to an output stage together with a bit vector indicating their output direction(s). All being well, copies are dispatched simultaneously on each of the indicated links. However, there can be congestion which causes back-pressure on an output; in this circumstance the router output stalls and waits for the link(s) to clear. MC packets stall if any output is blocked rather than transmitting on the unblocked links first; this facilitates some error recovery, if necessary, later.

The network is not guaranteed deadlock free! In particular, the cloning of MC packets can generate a lot more traffic than is initially injected. It is also infeasible to implement an end-to-end flow control protocol on such packets. There is therefore a risk – indeed a significant probability! – that the network could deadlock, at least unless some other protection exists. This contingency is handled by using a time-out on blocked packets. If a packet has been stuck for a pre-programmed time, it is dropped and the next is output instead. Dropped packets are caught in software and can be re-injected later. Ensuring that (multicast) transmission is all-or-nothing means that only the packet needs to be saved, the packet routeing being re-derived on re-injection.
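The all-or-nothing output rule and the time-out can be summarised in a few lines. This is a simplified sketch that leaves out emergency routeing; the function name, the string results and the use of `None` for an unlimited wait are all illustrative assumptions.

```python
def output_decision(route_vector, blocked_vector, cycles_waited, timeout):
    """Decide what the output stage does with the packet at its head.

    route_vector   -- bit vector of outputs the packet must be copied to
    blocked_vector -- bit vector of outputs currently exerting back-pressure
    cycles_waited  -- how long this packet has been stalled so far
    timeout        -- programmed limit in cycles, or None to wait forever
    """
    if (route_vector & blocked_vector) == 0:
        return "send"   # every required output is free: dispatch all copies
    if timeout is not None and cycles_waited >= timeout:
        return "drop"   # hand the whole packet to software for re-injection
    return "wait"       # any blocked output stalls the entire packet


decisions = [
    output_decision(0b011, 0b100, 0, 10),   # blocked link is not needed
    output_decision(0b011, 0b010, 3, 10),   # one needed link is blocked
    output_decision(0b011, 0b010, 10, 10),  # time-out has expired
]
```

Because no copy is sent until all outputs are free, a dropped packet has reached none of its destinations from this router, so software can re-inject the unmodified packet and let the router derive the routeing again.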

The time-out period is software programmable using a short (8-bit) floating-point value, allowing times from 0 (i.e. discard immediately if an output is blocked) to almost a million clock cycles; the ultimate value is wait forever for experimental purposes: packets are not lost but, under many circumstances, system deadlock means that functionality is!
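The text does not give the encoding of this 8-bit value. The sketch below assumes one plausible layout (a 4-bit exponent in the upper bits, a 4-bit mantissa in the lower bits, and 0xFF reserved for wait forever), chosen because it reproduces the stated range; it should be checked against the chip datasheet before being relied upon.

```python
def decode_wait(value: int):
    """Decode an 8-bit time-out value into clock cycles (assumed format).

    Returns None for the reserved 'wait forever' code.
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("time-out value must fit in 8 bits")
    if value == 0xFF:
        return None                  # assumed code for wait forever
    mantissa = value & 0xF           # assumed: low 4 bits
    exponent = value >> 4            # assumed: high 4 bits
    if exponent <= 4:
        # Small exponents are offset so that the scale starts at 0 and
        # runs without gaps or overlaps between exponent steps.
        return (mantissa + 16 - (1 << (4 - exponent))) << exponent
    return (mantissa + 16) << exponent


shortest = decode_wait(0x00)
longest = decode_wait(0xFE)
all_waits = [decode_wait(v) for v in range(0xFF)]
```

Under these assumptions the shortest time-out is 0 cycles and the longest finite one is 983,040 cycles, which agrees with the "almost a million" quoted above, and the decoded values increase strictly with the 8-bit code.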

The time-out introduces elastic buffering at a router and can alleviate any conceived deadlock problem. In a case of severe congestion, it does impose a potential performance penalty in that the time-out has to be slow enough to allow a processor to respond to an interrupt and read the packet before the next one is dropped, limiting the minimum time-out interval. In the original conception, it was believed that for neural spike packets, the vast majority of traffic in the intended applications could be dropped without re-injection because it is likely that biological systems would tolerate such noise. In hindsight, this was a questionable decision since computational neuroscientists can be more protective about their simulations!

Another level of defence against faults was provided by emergency routeing. This is a mechanism primarily to protect against a physical link failure. At time-out, instead of immediately dropping a packet, the router can attempt to route around the blockage. This assumes that the inter-chip mesh is wired in a particular (i.e., the originally intended, triangular) way and uses an alternate, two-hop path to the destination (Figure 2.10). Information is included in the header of any packets treated this way, so that the subsequent routers can allow for the diversion, including knowing the appropriate default routeing direction. The emergency routed packet is, in some sense, superimposed on the network, so that the case where it exploits a link which was already committed to a valid route is also correctly handled. Packets can also stall waiting for emergency routeing to be available; a second (independently programmable) time-out mechanism is applied which, if set to zero, can effectively disable emergency routeing.

In addition to its routeing links, the router has an AHB slave interface for programming and monitoring. This allows the tables to be set up. Although this is normally the responsibility of the local monitor ARM, it is also possible using NN packets and the router as its own bus master. It is possible, with care, to read and write the look-up tables while the system is operating since requests are arbitrated into the packet stream to reach the TCAM, RAMs, etc. and removed before the output stage when a response can be generated. This also facilitates testing of the RAMs. Testing the TCAM is somewhat more complicated since it is not directly readable.

[Figure 2.10. Emergency routeing. Diagram labels: router, router, break or blockage.]

Neither full custom design nor TCAM macros were feasible or available for the design. While the RAMs could use conventionally produced (and efficient) macrocells, the TCAM is composed of standard cells. This means that it significantly dominates the silicon area occupied by the router. This cost was alleviated using latch, rather than D-type flip-flop, cells for storage, which roughly halves the area. To meet timing constraints, writing to these latches requires two clock cycles with a resulting hiccup in the pipeline flow; however, writing is rare, so this is not a serious issue. To further reduce cost, the multiplexer trees that would be needed to read back the contents were omitted. Some means of production test is still required though, and a scan chain through the latches is a difficult (and costly) alternative.

Instead, the TCAM is tested by association. A key pattern can be written to a test register location and the presence or absence of a match can be determined, together with the internal address of the first match. The test is conducted by one of the on-chip processors during the boot process.

In a fault-free environment, all packets arriving at a router will be intact, correct and intentionally present. However, the router does some straightforward checks to increase the robustness of the system. Firstly, an arriving packet has to have a legal size, as counted by the number of symbols (‘flits’) arriving, delimited by End-of-Packet (EoP) markers. It was conceived that noise on the asynchronous links could easily introduce spurious symbols and corrupt packets. (In practice, such problems have not been observed in existing machines given that the long, cabled links which had been envisaged on the original design were avoided in the end.)

Packet corruption could still occur though, if a chip is reset (due to local problems) while sending to its neighbours. There is also a parity bit in packets where space allows, as a crude intactness check.

Finally, there is a timestamp on potentially long-lived packets, intended to guard against misprogrammed routeing allowing packets to circulate in the system-wide network indefinitely. This is a simple, slowly changing phase number known by all routers and appended to packets as they are transmitted. To use this mechanism, all the routers in the system need to be synchronised, to some resolution. Synchronisation will not be perfect and, in any case, the time phase may change while a packet is in flight. A 2-bit Gray code is therefore used for the time phase, where a router will detect a mismatch on both bits and will remove the packet before trying to route it; this is separate from the dropping due to congestion. A packet will then time out if undelivered somewhere between one and two time phases after transmission. The time phases are set in software but envisaged to be of the order of a few milliseconds; legitimate deliveries should be completed in much less time than this.
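The time phase check can be sketched as follows. The order of the Gray code sequence and the function names are assumptions; the only property taken from the text is that a packet is removed when its stamp differs from the router's current phase in both bits.

```python
# A 2-bit Gray code: successive phases differ in exactly one bit.
GRAY_PHASES = (0b00, 0b01, 0b11, 0b10)


def phase_code(phase_count: int) -> int:
    """Gray-coded time phase for a running phase counter."""
    return GRAY_PHASES[phase_count % 4]


def is_expired(packet_stamp: int, router_phase: int) -> bool:
    """A packet is removed only if both timestamp bits mismatch."""
    return (packet_stamp ^ router_phase) == 0b11


stamp = phase_code(5)  # packet launched during phase 5
checks = [is_expired(stamp, phase_code(5 + age)) for age in range(3)]
```

A packet seen in its launch phase or in the following phase differs from the router's phase in at most one bit and is kept, which tolerates imperfect synchronisation and a phase change in flight. Once the router is two phases ahead, both bits differ and the packet is removed, giving the stated lifetime of between one and two phases.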
