2.2 Architecture
2.2.1 An Overview
Imagine a large array of microprocessors where each processor simulates the biological computing of a number of neurons. In imagination, the array is almost infinitely scalable, since the neurons themselves are largely independent. There is then a choice as to how many neurons are mapped onto each processor, which is governed by speed – both of the processors themselves and the desired speed of simulation – and by the memory capacity of each processor.
Outside the world of imagination, there are other pragmatic limits. Building a customised microprocessor, specialised for neuron modelling, is impractically expensive, not (just) from the hardware development viewpoint but also from that of software support: an established architecture is much to be preferred. Then, there is the consideration of powering and cooling a machine of any size. Finally, if custom logic is to be made, the design and verification effort must not be impractically high.
To provide significant (and convenient) computing power without excessive electrical power dissipation suggests a 32-bit architecture. A 32-bit integer can provide 2³² or about four billion unique codes, which is (very) approximately a match for the number of neurons in a mammalian brain. (A human has about 86 billion neurons; a domestic cat has around three-quarters of a billion [19, 90].) As a back-of-the-envelope initial figure, ‘one billion neurons’ seemed a credible target. This could be spread over a million processors – each simulating 1,000 neurons – with the processors grouped into chips, each chip being a multicore Application-Specific Integrated Circuit (ASIC).
The chosen processor was an ARM968 [6]. This ARM9 device was already mature at the time of selection but still gave good power/performance efficiency and, crucially, was kindly licensed, on a non-commercial basis, by ARM Ltd. For manufacture, a 130 nm process was selected: again not state-of-the-art even at the time of design, but cost-effective and without too many new process issues for the (necessarily) limited design team. With this process and this processor, a target operating clock frequency of 200 MHz seemed reasonable, and static RAM macros that supported this target were available. Calculation suggested that this could support the target number of neurons in real time, with some flexibility to cope with a varying load. Energy efficiency is important not so much on an individual processor basis but when multiplied by a million processors in the system or, indeed, twenty or so in the same package; the ARM968 is a power-efficient microprocessor when executing and is able to ‘sleep’ – consuming almost no dynamic power – when there is nothing to do, which may be expected frequently in a real-time system.
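The cycle budget implied by these figures can be restated in a few lines. This is only a sketch of the sizing argument: the 1 ms simulation time step assumed below is typical of real-time neural simulation but is not a figure quoted in this section.

```python
# Back-of-the-envelope cycle budget per neuron (illustrative figures).
clock_hz        = 200e6   # target ARM968 clock frequency
timestep_s      = 1e-3    # assumed 1 ms real-time simulation step
neurons_per_cpu = 1_000   # neurons mapped onto each processor

cycles_per_step   = clock_hz * timestep_s             # 200,000 cycles
cycles_per_neuron = cycles_per_step / neurons_per_cpu

print(f"{cycles_per_neuron:.0f} cycles per neuron per time step")
# -> 200 cycles: enough for a simple neuron-model update, leaving
#    headroom to cope with a varying load.
```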
The amount of RAM needed to balance this model was also reckoned. In practice, for the intended application, the RAM required was infeasibly large to keep local to each processor; however, much of it is relatively infrequently used, so the model was subdivided into a memory hierarchy, with a fast SRAM and a much larger but slower SDRAM component. A local data space of 64 KByte plus 32 KByte of code space (small, since the processors are running dedicated, embedded application code) was allocated. This needs to be backed up by tables up to a few megabytes in size. Available (low-power) technology meant that a single cost-effective die supplied 128 MByte, but the relatively low demands expected meant that one die could reasonably be shared amongst several processors.
With area estimates for the processor subsystems – including their SRAM – and a feasible ASIC die size, it appeared that about 20 processors on each ASIC, together with a single, shared SDRAM, would provide an appropriately balanced system. This implied that 50,000 ASICs would be needed for a 1,000,000-processor machine – a number which would (attractively) fit in a 16-bit binary index.
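The system-scale arithmetic is worth restating explicitly; this is simply the text’s own sizing, written out:

```python
# System sizing from the back-of-the-envelope figures above.
target_neurons  = 1_000_000_000   # 'one billion neurons'
neurons_per_cpu = 1_000
cpus_per_chip   = 20              # early estimate; 18 was finally chosen

cpus  = target_neurons // neurons_per_cpu   # 1,000,000 processors
chips = cpus // cpus_per_chip               # 50,000 ASICs

assert chips < 2**16    # fits (attractively) in a 16-bit binary index
print(f"{cpus:,} processors on {chips:,} chips")
```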
Neurons alone do not compute; there needs to be interconnection and, indeed, there is overwhelming evidence that it is the patterns and strengths of connections which programme biological computers [115]. The problem for the system architect is that, in biology, the output from any one neuron may be routed to a unique set of hundreds, thousands and even tens of thousands of destination neurons (Figure 2.1). This far exceeds typical computer communications uses, other than with a broadcast mechanism; here, with a million possible sources, broadcast is not practical, either in terms of the communications bandwidth needed or the power required for inter-chip communication.
It is therefore the specialist communications network, designed to support the specific spiking neural network applications, that differentiates SpiNNaker from most other multiprocessor systems.
SpiNNaker communicates with short packets. In neural operation, each packet represents a particular neuron firing. A packet is identified using AER [152]; it is tagged only with its originator’s identifier. (With 1 billion neurons, this requires at least 30 bits; a 32-bit field is allocated for convenience.)
Figure 2.1. Neurons interconnecting.
Packets are then multicast to their destinations, with most of the routeing and duplication being done in (and by) the network itself.
The first important point in the design is that the aggregate bandwidth of the running system – where packets are duplicated in flight, but only as needed to reach all their destinations – is not infeasibly high. Just like the processor–neuron relationship, a single network link can carry many multiplexed spike links, as the electronic connections are much faster than the biological axons. Indeed, practically, the time to deliver a spike is typically negligible compared to biological transmission. Thus, the actual network topology is not particularly important although, since neural systems themselves (and their traffic) are fairly homogeneous, some form of mesh is suitable – and amenable to the construction of scalable systems.
The chosen topology for the SpiNNaker network is a two-dimensional mesh.
The mesh is triangular (Figure 2.2) rather than Cartesian, with each ASIC connected to six neighbours; this provides more potential bandwidth over the given links and was also intended as a provision for automatically routeing around faulty connections. (In practice, it has been observed that this latter feature was over-cautious and is little used.) The edges of the mesh can be closed to form a torus that reduces the longest paths; the maximum expected system – 2¹⁶ chips, or a 256 × 256 grid – would therefore have a longest path of 128 hops, although most would be much shorter.
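As a sketch of this distance argument, the following assumes that the six links correspond to the ±x, ±y and ±(x + y) directions – an illustrative axis convention, not a statement of the hardware’s link numbering – and shows how closing the edges into a torus shortens paths:

```python
def mesh_hops(dx: int, dy: int) -> int:
    """Minimal hops across an unwrapped triangular mesh whose six links
    are taken as the +-x, +-y and +-(x+y) directions (assumed layout)."""
    if (dx >= 0) == (dy >= 0):
        return max(abs(dx), abs(dy))   # the diagonal link can be used
    return abs(dx) + abs(dy)           # opposite signs: no shortcut

def torus_hops(src, dst, n=256):
    """Shortest path once the n x n mesh is closed into a torus:
    consider both wrap directions on each axis."""
    dx, dy = (dst[0] - src[0]) % n, (dst[1] - src[1]) % n
    return min(mesh_hops(x, y) for x in (dx, dx - n) for y in (dy, dy - n))

# Closing the edges turns a 255-hop worst case into a single hop,
# and chips half the grid apart are reachable in 128 hops:
assert torus_hops((0, 0), (255, 0)) == 1
assert torus_hops((0, 0), (128, 128)) == 128
```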
Although there are other packet delivery mechanisms, the novelty and specialisation in SpiNNaker is in handling multicast packets. These are optimised to model biological neuron interconnection, where each neuron has a single output that feeds its own set of targets. Biological destinations are not entirely random; there is some structure, and neurons tend to be clustered within populations, with an output feeding some subset of the neurons in several populations. This structure can be abstracted as a tree (Figure 2.3).
Figure 2.2. Network mesh.
For simulation, it is logical to map neurons within a population to the same processor(s). This means that a single packet delivered to a processor can be multicast to the neurons – the last branching of the tree – by software. The populations themselves need to be distributed across the mesh network. In this manner, it is likely that multicast packets can share part of their journey, effectively extending the tree structure to multiple (series) branches (Figure 2.4). This also reduces the network traffic, as a packet is often not cloned until some way towards its destination.
The routeing from chip to chip is managed by a custom router on each ASIC. Logically speaking, each router checks the (neuron) source ID – the only information in the packet – and looks up a set of outputs, potentially including both chip-to-chip links and processor systems on that chip itself. The packet is then duplicated to all the specified outputs.
Figure 2.3. Neuron tree.
With a 32-bit neuron AER, each router is potentially holding 4 billion words of routeing look-up table: this is impractical. However, the logical table can be compressed considerably in practice:
•  Not all IDs are expected at a given node.
•  A high proportion of connections – particularly over long distances – are simply routed straight through.
•  Many entries will be the same, as a result of the population-to-population connectivity, rather than a random structure.

These properties are exploited to shrink the routeing tables to a manageable size.
This makes the table sparse, so rather than a simple array it is stored as an associative structure using Content-Addressable Memory (CAM) to identify IDs of interest. If an ID is not recognised, a topological assumption is made about the interconnection mesh and the packet is simply forwarded to the opposite link from the one on which it arrived:
this is referred to as default routeing (Figure 2.5). Default routeing reduces the number of table entries to those corresponding to packets which are both expected and need some action: changing direction in the mesh, being duplicated or arriving at their destination – or any combination of these.
Lastly, providing the neurons in a given population are identified sensibly – i.e., with similar IDs – they can usually be routed with a single table entry.
Figure 2.4. A single neuron tree mapped onto a SpiNNaker chip network. The source neuron is on the shaded chip. ‘R’ indicates a router table entry; other involved routers use default routeing. Solid dots are processors, and spikes are typically duplicated to many neurons in each by software.
Figure 2.5. Examples of default routeing.
Figure 2.6. Memory hierarchy on a single processor.
This is possible because the CAM contains a binary mask that specifies which bits in each key are significant to that router. For example, if a population contains around 2,000 neurons, it can have a 21-bit ID, with the remaining 11 bits determining the particular neuron; one routeing table entry can then provide for all 2,000 neurons. For implementation, the number of table entries is arbitrary: 1,024 was chosen for SpiNNaker.
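In software terms, the look-up behaves roughly as sketched below. The entry layout, key values and link numbering here are illustrative assumptions, not the hardware’s actual formats; the point is the masked match with a default-routeing fallback:

```python
# Sketch of a masked (ternary-CAM style) multicast route look-up.
OPPOSITE = {0: 3, 1: 4, 2: 5, 3: 0, 4: 1, 5: 2}   # six mesh links, paired

# Each entry: (key, mask, outputs). A packet matches when its key
# agrees with the entry on every bit set in the mask.
POPULATION = 0xABCD0 << 11          # hypothetical 21-bit population ID
TABLE = [
    # Mask out the low 11 bits: one entry covers all ~2,000 neurons of
    # the population, copying spikes to link 2 and local CPUs 1 and 7.
    (POPULATION, 0xFFFFF800, {"link2", "cpu1", "cpu7"}),
]

def route(key: int, arrived_on: int) -> set:
    for entry_key, mask, outputs in TABLE:
        if (key & mask) == entry_key:
            return outputs                       # matched: fan out
    # Unrecognised ID: default-route straight through, out of the
    # link opposite the one on which the packet arrived.
    return {f"link{OPPOSITE[arrived_on]}"}

print(route(POPULATION | 0x123, arrived_on=0))   # {'cpu1', 'cpu7', 'link2'}
print(route(0xDEADBEEF, arrived_on=1))           # {'link4'}
```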
The final stage of neuron packet routeing takes place after delivery to a processor subsystem. Here a spike is multicast to a subset of the local neurons; however, there is now more information needed. Each connection has some associated information:
•  the strength or weight of the connection;
•  the unique delay of that connection, simulating the biological connection delays.

The details of these variables are not important here. What does matter is that there is one entry per synapse. Even with a very rough calculation – say 1,000 neurons, each with 1,000 synapses – it becomes clear that several megabytes of storage are required for each processor subsystem. These are the data that reside in the (shared) SDRAM and are fetched on demand.
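Restating that rough calculation (the 4-byte entry size is an assumption for illustration, imagining weight and delay packed into one 32-bit word per synapse):

```python
# Rough synaptic-storage estimate per processor subsystem.
neurons_per_cpu     = 1_000
synapses_per_neuron = 1_000
bytes_per_synapse   = 4       # assumed: weight + delay in one word

total_bytes = neurons_per_cpu * synapses_per_neuron * bytes_per_synapse
print(f"~{total_bytes / 2**20:.0f} MByte per processor")   # ~4 MByte
# Eighteen processors sharing one 128 MByte SDRAM leaves comfortable
# room for tables of this size.
```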
Each processor has its fast, private memory and shared access to the SDRAM (Figure 2.6). Although it can be used for communications, the main intended purpose of the SDRAM is to act as a backing store for the large, relatively infrequently accessed data tables. For this purpose, the SDRAM space is partitioned in software, with each processor allocated space according to its needs. For many applications, data are simply copied in as needed, although synaptic weights could be modified and written back if the network is adaptive.
Figure 2.7. Processor subsystem block diagram.
The act of moving data around the memory map is simple but tedious and inefficient for software. Each processor subsystem therefore contains a memory-to-memory DMA Controller (DMAC) that can download these structures in the background. The unit is also capable of uploading data if the synaptic weights change, which will occur if the neural network is learning. The impact of transfers on the processor is minimal since the local SRAM is bank-interleaved, always assuming the processor has other work to do.
The impact of DMA transfers on the processing should also be small, as the fetching of data is a background task. To decouple the process further, the DMAC has a command buffer, allowing a request to be queued while its predecessor is in progress; DMA transfers can therefore run continuously (if necessary) with considerable leeway in servicing the completion interrupts.
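In outline, software can use the command buffer for classic double buffering, overlapping each fetch with the processing of the previous one. The function names below are hypothetical stand-ins, not the real driver interface:

```python
# Double-buffered fetching of synaptic rows via a queued DMA controller.
def dma_start(row):            # stand-in: a real DMAC runs in background
    pass

def dma_wait(row):             # stand-in for the completion interrupt
    pass

def process_synaptic_row(row): # stand-in for the neural computation
    pass

def handle_incoming_spikes(rows):
    if not rows:
        return
    dma_start(rows[0])                  # kick off the first fetch
    for i, row in enumerate(rows):
        dma_wait(row)                   # row i is now in local SRAM
        if i + 1 < len(rows):
            dma_start(rows[i + 1])      # queue the next transfer now...
        process_synaptic_row(row)       # ...and compute while it runs
```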
Other than the ARM968, its RAM and the DMAC, there is very little else within a subsystem. The only peripherals are timers, a communications interface, which allows the processor to send and receive packets, and an interrupt controller (Figure 2.7).
The ASIC was planned to contain about 20 such processor systems. All the processor subsystems used identical layout for development convenience, meaning that timing closure was only necessary once; on the chosen manufacturing process, it is permissible to rotate, as well as reflect, the hardened layout macrocells.
When possible floor plans were examined and the feasible chip area was taken into consideration, it became apparent that 18 processor–memory combinations, together with the router, fitted better.
Figure 2.8. The SpiNNaker chip floor plan.
As the specific number was not critical, this was adopted (Figure 2.8). This can be post-rationalised into 16 neuron processors, a monitor processor to manage the chip as a computer component, plus a spare, but the constraint was primarily physical. The processor count does have some impact on the router since, when multicasting packets, it is necessary to specify whether each of the 24 destinations – 6 chip-to-chip connections plus 18 local processors – is used; 24 bits is a reasonably convenient size to pack into the RAM tables, so this is a bonus.
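Such a destination set packs naturally into a 24-bit vector, one bit per possible output. The particular bit assignment below (links in the low six bits, processors above them) is an assumption for illustration:

```python
# Packing a multicast destination set into a 24-bit route word.
def make_route(links=(), cpus=()):
    word = 0
    for link in links:            # links 0..5 -> bits 0..5 (assumed)
        word |= 1 << link
    for cpu in cpus:              # CPUs 0..17 -> bits 6..23 (assumed)
        word |= 1 << (6 + cpu)
    return word

route = make_route(links=[2], cpus=[1, 7])
assert route < 2**24              # fits the 24-bit field in the tables
print(f"{route:06X}")             # 002084
```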
There are also a few shared resources on each chip, to facilitate operation as a computer component. These provide features such as the clock generators; interrupt and watchdog reset control and communications; multiprocessor interlock support; a small, shared SRAM for inter-processor messaging; an Ethernet interface (as a host link) and, inevitably, some general-purpose I/O bits for operations such as run-time configuration and status indication. A boot ROM containing some preliminary self-test and configuration software completes the set of shared resources. The details of some of these components are discussed in the following sections.