Improved Bufferless Routing via Balanced Pipeline Stages

I would like to thank all members of the Security and Fault Tolerance (SAF-T) Research Group for sharing their ideas and feedback with me since I joined.

To solve this problem, the packet-switched on-chip network has been proposed [3]. In the TRIPS prototype processor [9], buffers occupy 75% of the NoC area; in the RAW microprocessor [10], the on-chip network consumes 36% of the chip power.

Ideally, a bufferless router should keep the deflection rate as low as possible. CHIPPER has a two-cycle router pipeline, and the permutation structure in its second stage allocates the output resources [15]. The critical path of the new second stage is slightly longer than that of the original second stage, but the router retains the original clock speed of CHIPPER, because the first stage of the original two-stage pipeline was significantly longer.

The FPGA's logic elements, memory blocks and I/O allow us to design the on-chip network quickly [18]. The FPGA also provides a platform on which we can build the on-chip network and measure the average packet delay.

A Network-on-Chip

The NoC takes a networking approach: each IP core communicates on chip through a router, much as a terminal does in a real-world network. Unlike buses and point-to-point links, this brings networking methods into digital system design for the first time, although on-chip design focuses on different parameters than conventional networking. For example, consider a packet whose source and destination are in neither the same row nor the same column.

Dimension-order (XY) routing specifies that a packet travels along the x-axis first after being injected into the network, until its current position is in the same column as the destination. However, since the route of each packet is fixed, it is very likely that more than one packet at a node will contend for the same direction. In adaptive routing, by contrast, the route of each packet depends heavily on real-time network traffic.
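The x-first rule described above can be sketched in a few lines of Python. This is only an illustration: the coordinate convention (x grows eastward, y grows northward) and the names `xy_next_hop` and `xy_route` are assumptions introduced here, not part of the thesis design.

```python
def xy_next_hop(cur, dst):
    """Return the next output direction under x-first routing.

    cur and dst are (x, y) node coordinates. Assumed convention:
    x grows toward the east, y grows toward the north.
    """
    cx, cy = cur
    dx, dy = dst
    if cx < dx:
        return "E"
    if cx > dx:
        return "W"
    # Same column as the destination: now move along the y-axis.
    if cy < dy:
        return "N"
    if cy > dy:
        return "S"
    return "LOCAL"  # already at the destination, eject here


def xy_route(src, dst):
    """List the hop directions a packet takes from src to dst."""
    step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    cur, hops = src, []
    while True:
        direction = xy_next_hop(cur, dst)
        if direction == "LOCAL":
            return hops
        hops.append(direction)
        cur = (cur[0] + step[direction][0], cur[1] + step[direction][1])


# Source and destination are in different rows and columns.
print(xy_route((0, 0), (2, 1)))
```

Because the route depends only on the source and destination, two packets with overlapping paths always contend for the same links, which is the drawback noted above.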

Flow control determines the basic unit of transfer in the network, the link capacity and, if necessary, the buffer capacity. A packet is typically divided into several flits, and the flit is the basic unit in the network.

Figure 1 Examples of communication structures in Systems-on-Chip: a) traditional bus-based communication, b) dedicated point-to-point links, c) a chip area network

B Virtual-Channel Router Architecture

Due to the additional VC allocation and inter-router communication, the router delay in a buffered router is considered longer than in a bufferless router. For example, [24] reports a maximum clock frequency of 72.37 MHz for a non-pipelined VC router structure implemented on an FPGA.

C Bufferless Deflection Routing

Compared with a typical virtual-channel buffered network, eliminating buffers can achieve significant area and energy savings [15]. In the BLESS and CHIPPER networks, each packet is split into several flits, and each flit moves independently through the network as the basic unit [16]. Once a flit enters the network, it must keep moving and cannot stop until it is ejected from the network.

The Unique ID field is necessary because flits move independently through the network and deflections are unpredictable, so flits from the same packet do not necessarily arrive at the destination in order. The network must therefore provide buffer space at the ejection port of each node to reassemble the packet. Ideally, if two flits arrive at a router in the same cycle and request different outputs, both leave the router through their desired outputs.

However, if two flits compete for the same output, one flit wins and the other must leave the router through a free output; that is, it is deflected.
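The win-or-deflect behaviour can be illustrated with a minimal Python sketch. It assumes a sequential allocator that serves flits in a given priority order and deflects a loser to the first free port; the function name `allocate_outputs` and this tie-breaking rule are hypothetical and are not taken from BLESS or CHIPPER.

```python
PORTS = ("N", "E", "S", "W")


def allocate_outputs(requests):
    """Assign one output port to every flit arriving in the same cycle.

    requests is a list of (flit_id, desired_port) pairs, ordered from
    highest to lowest priority (the ordering rule is an assumption).
    A flit gets its desired port if that port is still free; otherwise
    it is deflected to the first remaining free port.
    """
    free = list(PORTS)
    grants = {}
    for flit_id, desired in requests:
        port = desired if desired in free else free[0]
        free.remove(port)
        grants[flit_id] = port
    return grants


# Two flits compete for the north output: the first wins, the second
# is deflected to a free port.
print(allocate_outputs([("flit_a", "N"), ("flit_b", "N")]))
```

With at most four incoming flits and four outputs, a free port always exists, which is why a bufferless router never needs to stall a flit.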

D CHIPPER

D.1 Injection

D.2 Microarchitecture

In Table 2, Flit 1 (requires direction east) and Flit 2 (requires direction north) enter Arbiter 1 (Figure 4, top left); based on the rules, the arbiter decides that these two flits must be swapped. After Flit 1 and Flit 3 are exchanged on the wiring between the two stages, Flit 2 and Flit 3 enter Arbiter 3 (Figure 4, top right), Flit 1 and Flit 4 enter Arbiter 4 (Figure 4, bottom right), and these arbiters decide the final route assignments. Compared with the previous BLESS design, the two-stage permutation structure uses parallel arbiters in both stages, which divides the critical path efficiently [15].

Although the results show that the network performance of CHIPPER is lower than that of BLESS, the critical path of CHIPPER is 29.1% shorter than that of BLESS, which is the basis for rating CHIPPER as the better design overall [15].

Figure 4 Two-stage permutation with four arbiters [15]
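The structure in Figure 4 can be approximated by the following Python sketch. It is a simplified model under stated assumptions: the input pairing (north/east into Arbiter 1, south/west into Arbiter 2), the output pairing (north/south from Arbiter 3, east/west from Arbiter 4), and the winner rule (a golden flit first, otherwise the upper input) are illustrative choices, and the names `Flit`, `arbiter` and `permute` are hypothetical.

```python
from typing import NamedTuple, Optional


class Flit(NamedTuple):
    name: str
    want: str            # desired output port: "N", "E", "S" or "W"
    golden: bool = False


def arbiter(a: Optional[Flit], b: Optional[Flit], upper_ports):
    """2x2 arbiter: return (upper_output, lower_output).

    upper_ports is the set of router outputs reachable through the
    upper output of this arbiter. The winner is steered toward its
    desired port; the other flit takes the remaining output.
    """
    if a is None and b is None:
        return None, None
    # Assumed winner rule: a golden flit wins, otherwise the upper input.
    if b is not None and (a is None or (b.golden and not a.golden)):
        winner, loser = b, a
    else:
        winner, loser = a, b
    if winner.want in upper_ports:
        return winner, loser
    return loser, winner


def permute(n, e, s, w):
    """Two-stage permutation of the four input flits onto four outputs."""
    # Stage 1: upper outputs feed Arbiter 3, lower outputs feed Arbiter 4.
    a1_up, a1_lo = arbiter(n, e, {"N", "S"})
    a2_up, a2_lo = arbiter(s, w, {"N", "S"})
    # Stage 2: Arbiter 3 drives N/S, Arbiter 4 drives E/W.
    out_n, out_s = arbiter(a1_up, a2_up, {"N"})
    out_e, out_w = arbiter(a1_lo, a2_lo, {"E"})
    return {"N": out_n, "E": out_e, "S": out_s, "W": out_w}


# Flit 1 wants east and Flit 2 wants north, as in the text; the desired
# directions of Flit 3 and Flit 4 are hypothetical.
result = permute(Flit("flit1", "E"), Flit("flit2", "N"),
                 Flit("flit3", "S"), Flit("flit4", "W"))
print({port: f.name for port, f in result.items() if f})
```

In this run Arbiter 1 swaps Flit 1 and Flit 2, after which Flit 2 and Flit 3 meet in Arbiter 3 and Flit 1 and Flit 4 meet in Arbiter 4, matching the walk-through above. Calling `permute(Flit("a", "E"), Flit("b", "E"), None, None)` shows a deflection: the losing flit is steered into Arbiter 3 and can no longer reach the east output.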

D.3 Ejection

The result shows that two-ejection CHIPPER increases the average performance by 3.7% over the original CHIPPER [16].

D.4 Golden Packet

D.5 Field Programmable Gate Array (FPGA)

In this thesis, the Altera DE2-115 FPGA board (with a Cyclone IV EP4CE115F29 FPGA) [27] is used to investigate CHIPPER and to build a 4x4 mesh network with bufferless routers. The network was developed in VHDL; the design was simulated and synthesized with the Electronic Design Automation (EDA) tool Quartus II [28], and its function was verified at the gate level with ModelSim [29]. Power consumption and critical path information are produced by the PowerPlay and TimeQuest functions in Quartus II.

It is an easy way to observe the content that has been stored in a memory block [28]. To evaluate the average delay, SRAM is used in each node to build two memory blocks. One memory block is connected to the injection port and stores the flits to be injected; the other is connected to the two ejection ports and records the ejected flits and the time at which they were ejected at the local node.

During the investigation of CHIPPER, the critical paths of the computation stage and the permutation stage were found to be unbalanced. Since the router is a two-stage pipeline and the permutation stage is faster than the computation stage, there is an opportunity to improve the architecture. Moreover, the permutation in CHIPPER does not always achieve the ideal case; some unnecessary deflections occur in CHIPPER.

In the ideal case, the flits move through the router in their desired directions. Because the critical path of the permutation stage is shorter than that of the computation stage, there is timing slack to improve the permutation, as long as its critical path does not exceed that of the computation stage. It was decided to keep the 2x2 arbiter scheme and change the arbiter rules to improve the design.

An additional component called Final Chance increases the critical path of the permutation stage. The critical path of the new permutation stage is almost identical to that of the computation stage. The improved design is shown in Figure 7, and the critical path information is discussed further in Section IV.2.

Improved permutation steering algorithm

Improved permutation

A Methodology

The data is preloaded into a memory block connected to the injection port in each node. Each flit consists of 10 bits for the Unique ID, 4 bits for the destination address and 16 bits for the source. When a flit is ejected from the network, its Unique ID and the time at which it arrived at the destination are recorded in another memory block.
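The field widths above add up to a 30-bit word. The following Python sketch packs and unpacks such a word; the field order (Unique ID in the most significant bits, then destination, then the 16-bit source field) and the names `pack_flit` and `unpack_flit` are assumptions made for illustration, since the text gives only the widths.

```python
ID_BITS, DEST_BITS, SRC_BITS = 10, 4, 16   # widths given in the text


def pack_flit(uid, dest, src):
    """Pack the three fields into one 30-bit integer (assumed layout)."""
    for value, bits, label in ((uid, ID_BITS, "uid"),
                               (dest, DEST_BITS, "dest"),
                               (src, SRC_BITS, "src")):
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{label} does not fit in {bits} bits")
    return (uid << (DEST_BITS + SRC_BITS)) | (dest << SRC_BITS) | src


def unpack_flit(word):
    """Recover (uid, dest, src) from a word built by pack_flit."""
    src = word & ((1 << SRC_BITS) - 1)
    dest = (word >> SRC_BITS) & ((1 << DEST_BITS) - 1)
    uid = word >> (DEST_BITS + SRC_BITS)
    return uid, dest, src


word = pack_flit(uid=5, dest=9, src=0x1234)
print(unpack_flit(word))
```

A 4-bit destination field is enough to address all 16 nodes of a 4x4 mesh.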

B Router’s Operating Frequency

From the table, we can see that the critical path in the improved second stage has increased by 36.8% compared to the original second stage, but the improved second stage is more balanced with the critical path of the first stage. Since the critical path for the first stage is 3.22 ns in each router, the maximum clock frequency for both implementations is 310.6 MHz.
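The relation between the critical path and the clock frequency can be checked with a short Python calculation. The 3.22 ns first-stage delay and the 36.8% increase come from the text; the original second-stage delay of 2.30 ns is a hypothetical value chosen only to illustrate the point, and `max_clock_mhz` is a name introduced here.

```python
def max_clock_mhz(stage_delays_ns):
    """Maximum clock frequency in MHz of a pipeline.

    The slowest stage sets the clock period, so the frequency is the
    reciprocal of the largest stage delay (1 / ns = 1000 MHz).
    """
    return 1000.0 / max(stage_delays_ns)


first_stage_ns = 3.22                 # from the text
original_second_ns = 2.30             # hypothetical value
improved_second_ns = original_second_ns * 1.368   # 36.8% longer

print(round(max_clock_mhz([first_stage_ns, original_second_ns]), 1))
print(round(max_clock_mhz([first_stage_ns, improved_second_ns]), 1))
```

As long as the improved second stage stays below 3.22 ns, the first stage remains the bottleneck and both routers reach the same maximum frequency.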

C Area

Since CHIPPER achieves a significant area saving (169%) compared to a conventional buffered router (4 virtual channels (VCs), 4 flits per VC) [16], the improved router, despite its increased area, still offers significant savings compared with buffered routers.

D Power

Dynamic power increases linearly with the toggle rate, that is, with how frequently the logic elements change state.

Figure 9 Dynamic and Static power consumption for each router

E Latency

Sparso, "Virtual Channel Designs for Bandwidth Assurance in Asynchronous Network-on-Chip," in Norchip Conference, 2004.

Wolkotte, "Virtual Channel Network-on-Chip for GT and BE Traffic," in Emerging VLSI Technologies and Architectures, 2006.

Moraes, "Virtual Channels in Networks on a Chip: Implementation and Evaluation on Hermes NoC," in Proceedings of the 18th Annual Symposium on Integrated Circuits and System Design, 2005, p.

De Micheli, "A Buffer Sizing Algorithm for Networks on Chips Using TDMA and Credit-Based End-to-End Flow Control," in Hardware/Software Codesign and System Synthesis, 2006.

"MinBD: Minimally Buffered Deflection Routing for Energy-Efficient Interconnect," in Networks on Chip (NoCS), 2012 Sixth IEEE/ACM International Symposium on, 2012, pp.

Kozyrakis, "Evaluating bufferless flow control for on-chip networks," in Proceedings of the 2010 Fourth ACM/IEEE International Symposium on Networks-on-Chip, 2010, pp.

Moore, "Low-latency virtual channel routers for on-chip networks," in ACM SIGARCH Computer Architecture News, 2004, p.