Steve Furber – 8 May 2005

1.4.1 Introduction
Over the last couple of years, I have been struggling with several aspects of the proposed neural hardware system. Issues that have come to the fore are the importance of modelling axonal delays, the importance of the sparse connectivity of biological neurons, the cost issues relating to the use of very large on-chip memories, and the need to keep as many decisions open for as long as possible. I have now found a way to resolve all of these issues at once through a radical change in the architecture proposal: push the memory off chip into a standard SDRAM and implement the on-chip neural functions through parallel programmable processors of a fairly conventional nature.
This approach yields a highly programmable system of much greater power than that previously proposed and a safer (more familiar) development path.
It also points directly towards a development route that can be used to prove the proposed plan using technology already to hand.
1.4.2 Intellectual Property
The aspect of the overall design that seems to survive all of my attempts to find problems is the neural event routeing mechanism. This system is very simple, yet different from earlier inter-processor networks in that it supports one-to-many communication, whereas most inter-processor networks support one-to-one and/or broadcast communication. We could consider filing a patent on this.2

2. We did!
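To make the one-to-many idea concrete, the sketch below shows, in C, how such a router might forward events; the table format, field names and key-matching scheme are illustrative assumptions on my part rather than the actual routeing design.

    /* Hypothetical sketch of one-to-many event routeing: an event carries only
     * a source identifier; the router maps that key to a mask of output links,
     * so a single incoming event may be copied to several outputs at once.   */
    #include <stdint.h>

    typedef struct {
        uint32_t key;     /* source identifier pattern to match              */
        uint32_t mask;    /* which bits of the key are significant           */
        uint32_t route;   /* one bit per output link (and local processors)  */
    } route_entry_t;

    /* Return the set of outputs an event should be copied to; unmatched
     * events are simply dropped (route 0) in this sketch.                    */
    uint32_t route_event(uint32_t event_key,
                         const route_entry_t *table, int table_size)
    {
        for (int i = 0; i < table_size; i++)
            if ((event_key & table[i].mask) == table[i].key)
                return table[i].route;   /* many bits set => multicast        */
        return 0;
    }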
1.4.3 Market Opportunity
The more folk I talk to in the neuroinformatics area, the more I get the impression that this is a technology whose time may be about to come.
In the first instance, I see this as a system that is well suited to supporting research into complex neuro-dynamics, and I think this will be the primary market until/unless there is a breakthrough in our understanding. This system is well positioned both to expedite that breakthrough and to exploit the consequences of it.

It is also possible that there might be a market for this system as a general-purpose low-cost high-performance computer system. It has very high integer performance and could be well suited to code-cracking, database search and similar applications that do not need floating-point hardware. However, this will require further investigation.
Potential products include neural simulation software, chips, boards and full-blown systems. We could also sell time on systems.
1.4.4 System Organisation
As before, the system comprises a regular tessellation covering a square 2D area with neural processing nodes (see Figure 1.3), each of which now comprises a special-purpose System-on-Chip (SoC) and a single SDRAM chip.

Each node has eight bidirectional communication channels that will probably be used to connect to the eight neighbouring nodes, including diagonal connections.

Other configurations, such as a 4D hypercube, are possible, but the 2D arrangement with diagonals is very easy to put onto a Printed Circuit Board (PCB) and works pretty well.

The billion neuron system will require of the order of 10,000 nodes and will therefore be built in a rack system on a large number of PCBs (e.g. 40 PCBs each containing a 16 × 16 array of nodes).
Figure 1.3 System-level organisation of nodes.
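To make the geometry and the node count concrete, the sketch below assumes, purely for illustration, that the 40 PCBs are arranged 8 × 5 (giving a 128 × 80 grid, i.e. 10,240 nodes) and that the array edges wrap around toroidally; both choices are left open by the proposal.

    /* Illustrative sketch: eight-neighbour connectivity on a 2D grid.
     * Grid size assumes 40 PCBs arranged 8 x 5, each a 16 x 16 node array,
     * so 128 x 80 = 10,240 nodes (of the order of 10,000).                  */
    #define GRID_W 128
    #define GRID_H 80

    /* Fill nx[]/ny[] with the coordinates of the eight neighbours of (x, y),
     * assuming toroidal wrap-around at the edges (an open design choice).   */
    void neighbours(int x, int y, int nx[8], int ny[8])
    {
        static const int dx[8] = { -1,  0,  1, -1, 1, -1, 0, 1 };
        static const int dy[8] = { -1, -1, -1,  0, 0,  1, 1, 1 };
        for (int i = 0; i < 8; i++) {
            nx[i] = (x + dx[i] + GRID_W) % GRID_W;
            ny[i] = (y + dy[i] + GRID_H) % GRID_H;
        }
    }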
1.4.5 Node Organisation
The internal organisation of a node is illustrated in Figure 1.4.
This figure is, in fact, unchanged from the previous version. What it hides, however, is the fact that the implementation of the fascicle3 processor is now completely different. This difference is highlighted in Figures 1.5 and 1.6.

3. At this time, I had seen ‘fascicle’ used to describe a bundle of neuron fibres and thought it was widely used this way. I was wrong! We now use ‘population’ to describe a bunch of neurons with common inputs and outputs.

Figure 1.4 Node organisation.

Figure 1.5 Detail of fascicle processor memory interface.

Figure 1.6 Detail of fascicle processor organisation.

Instead of each fascicle processor incorporating its own on-chip memory and hardware to implement the neural processing function, we now have off-chip memory in the form of a standard SDRAM that is shared between the fascicle processors.

This will allow the memory at each node to be increased by an order of magnitude (at lower cost), but presents an obvious bottleneck in the interface between the chips. Modern SDRAMs have very high bandwidth interfaces, but relatively long latencies. For this to work we will have to use the SDRAM very carefully, being sure to arrange data structures that are accessed together in contiguous memory locations to make best use of streaming data performance, and so on. I believe that this can be made to work, but it is a critical design issue. In particular, it will be important to ensure that all of the instruction fetch activity of the processor is satisfied locally from the instruction RAM. I think this will be easy as the codes will be very small, but this can be checked by developing them in ARM assembly code. A few kBytes should be enough.
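As one illustration of what ‘structures that are accessed together, stored together’ might mean in practice, the sketch below packs all the synapses from a given source neuron into one contiguous row in SDRAM, so that an arriving event can be serviced with a single burst read rather than many scattered accesses. The layout and field sizes are assumptions for illustration only, not a settled design.

    #include <stdint.h>

    /* Illustrative layout (not a settled design): all synapses sharing one
     * source neuron are packed contiguously, so an event from that source
     * needs only a single streaming burst read from the SDRAM.              */
    typedef struct {
        uint16_t target;   /* index of the target neuron on this node        */
        uint8_t  delay;    /* axonal delay in timesteps                      */
        uint8_t  weight;   /* synaptic weight (fixed point)                  */
    } synapse_t;

    typedef struct {
        uint32_t  count;   /* number of synapses in this row                 */
        synapse_t syn[];   /* packed contiguously for one burst transfer     */
    } synapse_row_t;

    /* row_table[source] holds the SDRAM address of that source's row, so a
     * fascicle processor can fetch the whole row as soon as an event arrives. */
    extern synapse_row_t *row_table[];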
The fascicle processors share access to the SDRAM via a Network-on-Chip (NoC) along the lines of CHAIN, a very flexible interconnect technology proven on the smartcard chip (and currently being exploited by Silistix).
The internals of the fascicle processor node are illustrated in Figure 1.6.
Here we see the biggest change. In place of hardware neural processing logic, we have a conventional processor, for example, Amulet3 [267]. It is configured as a conventional system on chip with its separate instruction and data buses connected to a minimal set of local memories and communications devices. Each FP could have its own boot Read-Only Memory (ROM) (as illustrated), or there could be a single boot ROM shared via the NoC – this needs to be investigated. The instruction SRAM is dual-ported solely to enable its contents to be modified by the processor itself.
1.4.6 System Architecture Issues

OK, so why does all (any?) of this make sense?
We have turned the neural model into software, making it fully reconfigurable, and providing scope for much more flexible axonal delay modelling, data structures that efficiently store sparse connectivity information, sophisticated Hebbian synaptic update rules and so on.
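For example, axonal delays can be modelled entirely in software with a circular buffer of future input accumulators, one slot per timestep of delay. The sketch below uses illustrative sizes and data types; nothing here is fixed by the proposal.

    #include <stdint.h>

    #define MAX_DELAY         16     /* illustrative maximum delay, in timesteps */
    #define NEURONS_PER_PROC 256     /* illustrative neuron count per processor  */

    /* Ring buffer of future inputs: slot (now + delay) accumulates weights.  */
    static int32_t  input_ring[MAX_DELAY][NEURONS_PER_PROC];
    static uint32_t now;             /* current timestep modulo MAX_DELAY        */

    /* For each synapse activated by an arriving event, schedule its effect.  */
    void deliver(uint16_t target, uint8_t delay, int32_t weight)
    {
        input_ring[(now + delay) % MAX_DELAY][target] += weight;
    }

    /* Once per timestep: apply the current slot to the neurons, then clear it. */
    void step(void (*update_neuron)(uint16_t n, int32_t input))
    {
        for (uint16_t n = 0; n < NEURONS_PER_PROC; n++) {
            update_neuron(n, input_ring[now][n]);
            input_ring[now][n] = 0;
        }
        now = (now + 1) % MAX_DELAY;
    }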
The design of the SoC is now a much more familiar job. We need to design the pretty simple inter-chip communication interfaces and router, an individual fascicle processor (which is then replicated many times on the chip), the NoC, and the SDRAM interface (which we may be able to buy in – it’s a conventional horribly synchronous block). This is still a big job, but it’s a whole lot easier than what was proposed before.
The advantage of the SoC structure is that it is composed principally of several copies of the fascicle processor. Even the monitor processor shown in Figure 1.4 can simply be another copy of this subsystem. This gives us a significant manufacturing cost advantage – if a fault causes one of these subsystems to fail, this does not greatly impair the usefulness of the chip; the rest of the system is likely to work fine. We just lose a bit of throughput.
The full billion neuron system is an interesting beast. It will require 10,000 SoC chips and the same number of SDRAMs. It will incorporate in the order of 200,000 processors, between them delivering 64 Tera Instructions Per Second (TIPS). It will consume tens of kW. It will probably cost in the region of £0.5 million.
A single-board system comprising 256 nodes would simulate 25 million neurons and cost around £12,000.
Each node has about the performance of a Personal Computer (PC) in this application and is less than 10% of the price. In addition, PCs do not have the appropriate communication structures to enable a computing cluster to scale as well as the proposed design, so the price advantage at the high end is significantly greater.
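A back-of-envelope cross-check of the per-unit figures implied by the numbers above (no new data, just the stated totals divided out):

    #include <stdio.h>

    /* Cross-check of the figures quoted above; all inputs come from the text. */
    int main(void)
    {
        double nodes      = 10000.0;   /* SoC + SDRAM pairs                     */
        double processors = 200000.0;  /* "in the order of"                     */
        double tips       = 64e12;     /* total instructions per second         */
        double neurons    = 1e9;

        printf("processors per node : %.0f\n", processors / nodes);      /* ~20      */
        printf("MIPS per processor  : %.0f\n", tips / processors / 1e6); /* ~320     */
        printf("neurons per node    : %.0f\n", neurons / nodes);         /* ~100,000 */
        printf("neurons, 256 nodes  : %.0f\n", 256.0 * neurons / nodes); /* ~25.6 M  */
        return 0;
    }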
1.4.7 Development Plan
This set of ideas is rather new so it will take some time to check that it is really feasible. In the meantime, there are several things we can do to check it out.
It is feasible to use the Excalibur parts we obtained from Altera to prove the ideas. These chips include a 200 MHz ARM9 with caches and an SDRAM interface, and an area of programmable logic. We can prototype the neural algorithms on the ARM9 and prototype the router and inter-chip communications in the programmable logic. We have 2 development systems to get started and 10 chips that could be used to build a 100,000 neuron engine. Such a system would be an asset if we wished to attract venture capital funding to support the SoC development and/or production.

Alternatively, I could put in a large EPSRC proposal to support the SoC design.
Timescales (rough estimates):
• 2 persons for 1 year to develop algorithms and Excalibur prototype (∼£150 k).
• 4 persons/3 years to develop SoC design (∼£900 k).
• Some significant amount of time for configuration software development.

This will yield prototype silicon. Moving this into production will incur a large mask charge ($1.5 million) and, at this stage, this will require a partnership and/or investment.